idiap/coqui-ai-TTS · issue #298 · branch fix/298-vits-memory-leak-detach
RSS grew roughly linearly with every iteration of a VITS synthesis loop, even when the caller only retained the wav numpy array. The cause was structural: BaseTTS.synthesize returned a dict containing the full VITS inference output, eight tensors, none of them detached. A small _release helper that does detach().cpu() on every tensor before it crosses the function boundary stops the bleed without changing the API.
The reporter's reproducer is one tight loop. Nothing in their loop body keeps a reference to anything except the wav numpy array. Yet RSS climbs steadily for the entire 200-iteration run.
The VITS inference() method returns eight tensors whose sizes scale with the input length: model_outputs, alignments, durations, z, z_p, m_p, logs_p, y_mask. Of these, z, z_p, and model_outputs are the large ones, and they grow with T_dec (the number of decoder time steps). All eight are bundled into the dict the caller receives.
The caller (Synthesizer.tts) uses only outputs["wav"] when VITS is the model (because vocoder_model is None, the use_gl=True branch is taken). The other seven tensors are payload for callers that never look at them, and they survive until Python's GC happens to run, which in a tight synthesis loop is "rarely".
flowchart LR
  L1["loop iter N"] --> S1["BaseTTS.synthesize<br/>returns dict with 8 GPU tensors"]
  S1 --> U["Synthesizer.tts reads only outputs[wav]"]
  U --> D["dict goes out of scope at end of iter"]
  D --> G{Python GC runs?}
  G -- "not soon" --> Pin["8 tensor storages stay pinned"]
  Pin --> L2["loop iter N+1<br/>RSS keeps growing"]
  G -- "fires every few iters" --> Slow["episodic frees, still grows on average"]
  style Pin fill:#1e0a0a,stroke:#ef4444,color:#fca5a5
  style Slow fill:#1e1408,stroke:#fbbf24,color:#fde68a
A small _release helper inside synthesize handles every tensor identically. Non-tensor values pass through. Shapes and numeric values are unchanged, so every existing caller and every existing test continue to work.
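.detach() returns a tensor that is disconnected from the autograd graph. Under @torch.inference_mode() no graph is recorded, so that step is effectively a no-op there. Pairing it with .cpu() is what matters on CUDA: the data is copied to host memory, so the returned dict no longer references GPU storage. For a tensor that already lives on the CPU, .cpu() returns the same tensor without copying. The original tensors are released as soon as nothing else references their storages.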
gc.collect() / torch.cuda.empty_cache() in the loop. Those are sledgehammers. They pay a real per-call cost (CUDA allocator reset is expensive, gc.collect walks every object in the interpreter) and they hide the actual leak rather than fix it. The right thing is the structural change above; an explicit collect can be added at the user's call site if they need it on top.
tests/ continues to pass.
.cpu() themselves; the rest get the speed win.
synthesize is inference-only; training paths use train_step.
Local branch fix/298-vits-memory-leak-detach, commit 2441e66. The repository's PR target is dev; instructions in the README.
Idiap took over coqui-ai-TTS maintenance after the original Coqui company dissolved. Maintenance is real but small-team; the reviewer pool is narrow.
Realistic odds for this lead alone: ~30-50% conversation, ~5-15% contract. Idiap is a research institute and hires more slowly than a startup; a contract here is more likely to take the form of a project-based engagement than a full-time retainer.