Code ready · PR-shaped | Structural fix · backward compat | +18 / −5 LOC | Awaiting push + draft PR

Idiap · coqui-ai-TTS

idiap/coqui-ai-TTS · issue #298 · branch fix/298-vits-memory-leak-detach

RSS grew roughly linearly with every iteration of a VITS synthesis loop, even when the caller only retained the wav numpy array. The cause was structural: BaseTTS.synthesize returned a dict containing the full VITS inference output, eight tensors, none of them detached. A small _release helper that does detach().cpu() on every tensor before it crosses the function boundary stops the bleed without changing the API.

8 · GPU tensors returned undetached per call (now released)
1 fn · Change surface: BaseTTS.synthesize
0 · API breakage, shapes and values unchanged
RSS↓ · Plateau after warm-up instead of linear growth
01 · The problem

A linear leak from a structural mistake at one boundary

The reporter's reproducer is one tight loop. Nothing in their loop body keeps a reference to anything except the wav numpy array. Yet RSS climbs steadily for the entire 200-iteration run.

The returned dict that nobody fully consumes

# TTS/tts/models/base_tts.py:678, the OLD version
return {
    "wav": wav,                   # numpy, fine
    "alignments": alignments,     # tensor, NOT detached
    "text_inputs": text_inputs,   # tensor, NOT detached
    "outputs": outputs,           # dict of 8 tensors, NOT detached
}

The VITS inference() method returns 8 tensors whose size is proportional to the input length: model_outputs, alignments, durations, z, z_p, m_p, logs_p, y_mask. Of these, z, z_p, and model_outputs are large and grow with T_dec (the number of decoder time steps). All eight are bundled into the dict the caller receives.

The caller (Synthesizer.tts) uses only outputs["wav"] when VITS is the model (because vocoder_model is None, the use_gl=True branch is taken). The other seven tensors are payload for callers that never look at them, and they survive until Python's GC happens to run, which in a tight synthesis loop is "rarely".
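The text above does not say what keeps the dict alive after the loop body drops it. CPython normally frees an object the moment its reference count reaches zero, so one plausible mechanism (an assumption here, not something the source confirms) is a reference cycle somewhere in the call path, which only the cyclic collector can reclaim. The sketch below uses a hypothetical `Payload` class as a stand-in for a tensor and shows the difference between the two cases.

```python
import gc
import weakref


class Payload:
    """Hypothetical stand-in for a large tensor; any weak-referenceable object works."""


def drop_and_probe(make_cycle: bool) -> weakref.ref:
    """Build a result dict, drop every strong name, and return a weak probe."""
    payload = Payload()
    probe = weakref.ref(payload)
    holder = {"outputs": payload}
    if make_cycle:
        # Assumed failure mode: the container ends up referencing itself.
        holder["self"] = holder
    return probe  # the locals `payload` and `holder` are released on return


gc.disable()  # keep the cyclic collector from firing on its own during the demo
try:
    acyclic = drop_and_probe(make_cycle=False)
    cyclic = drop_and_probe(make_cycle=True)
    freed_by_refcount = acyclic() is None      # plain refcounting was enough
    alive_before_collect = cyclic() is not None  # the cycle keeps the payload alive
    gc.collect()                               # explicit pass of the cyclic collector
    freed_after_collect = cyclic() is None
finally:
    gc.enable()

print(freed_by_refcount, alive_before_collect, freed_after_collect)
```

If the real code path contains no cycle, this mechanism does not apply and the retention has to come from somewhere else, for example a long-lived reference held by the caller.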

flowchart LR
  L1["loop iter N"] --> S1["BaseTTS.synthesize<br/>returns dict with 8 GPU tensors"]
  S1 --> U["Synthesizer.tts reads only outputs[wav]"]
  U --> D["dict goes out of scope at end of iter"]
  D --> G{Python GC runs?}
  G -- "not soon" --> Pin["8 tensor storages stay pinned"]
  Pin --> L2["loop iter N+1<br/>RSS keeps growing"]
  G -- "fires every few iters" --> Slow["episodic frees, still grows on average"]
  style Pin fill:#1e0a0a,stroke:#ef4444,color:#fca5a5
  style Slow fill:#1e1408,stroke:#fbbf24,color:#fde68a
02 · The fix

Detach + CPU every tensor at the API boundary

A small _release helper inside synthesize handles every tensor identically. Non-tensor values pass through. Shapes and numeric values are unchanged, so every existing caller and every existing test continue to work.

The new code, in 8 lines

# TTS/tts/models/base_tts.py, inside synthesize, before the return
def _release(value: Any) -> Any:
    if isinstance(value, torch.Tensor):
        return value.detach().cpu()
    return value

return {
    "wav": wav,
    "alignments": _release(alignments),
    "text_inputs": _release(text_inputs),
    "outputs": {k: _release(v) for k, v in outputs.items()},
}

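.detach() returns a tensor that shares storage with the original but carries no autograd history; under @torch.inference_mode() nothing is recorded in the first place, so it is effectively a no-op there. .cpu() is what changes residency: for a CUDA tensor it copies the data to host memory, and for a tensor already on the CPU it returns the tensor itself. Once the returned dict holds only the released values, the original GPU tensors lose their last reference when synthesize returns, and their storages can be freed.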
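The properties claimed for the helper (same shape, same values, no autograd history, CPU residency, non-tensors untouched) can be checked in isolation. The following is a minimal sketch of such a check, not a test taken from the repository; the `weights` and `tracked` tensors are made-up stand-ins for one entry of the inference output.

```python
from typing import Any

import torch


def _release(value: Any) -> Any:
    """Detach tensors and move them to host memory; pass other values through."""
    if isinstance(value, torch.Tensor):
        return value.detach().cpu()
    return value


# Hypothetical stand-in for a tensor that still carries autograd history.
weights = torch.ones(2, 3, requires_grad=True)
tracked = weights * 2.0
released = _release(tracked)

same_shape = released.shape == tracked.shape
same_values = torch.equal(released, tracked.detach().cpu())
graph_free = (not released.requires_grad) and released.grad_fn is None
on_cpu = released.device.type == "cpu"
passthrough = _release("not a tensor") == "not a tensor"

print(same_shape, same_values, graph_free, on_cpu, passthrough)
```

The check runs on CPU only; on a CUDA machine the same assertions should hold for a tensor created with `device="cuda"`, with `.cpu()` performing an actual copy.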

Why not gc.collect() / torch.cuda.empty_cache() in the loop. Those are sledgehammers. They pay a real per-call cost (CUDA allocator reset is expensive, gc.collect walks every object in the interpreter) and they hide the actual leak rather than fix it. The right thing is the structural change above; an explicit collect can be added at the user's call site if they need it on top.
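For callers who still want an explicit collect on top of the structural fix, one option is to amortize it over many calls instead of paying for it on every iteration. The helper below is an illustrative sketch; `run_with_periodic_cleanup`, `every`, and `extra_cleanup` are hypothetical names, not part of coqui-ai-TTS.

```python
import gc
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_with_periodic_cleanup(
    step: Callable[[T], R],
    items: Iterable[T],
    every: int = 50,
    extra_cleanup: Optional[Callable[[], None]] = None,
) -> Iterator[R]:
    """Yield step(item) for each item, running a cleanup pass every `every` calls."""
    if every < 1:
        raise ValueError("every must be >= 1")
    for count, item in enumerate(items, start=1):
        yield step(item)
        if count % every == 0:
            gc.collect()  # full pass of Python's cyclic collector
            if extra_cleanup is not None:
                extra_cleanup()  # caller-supplied hook, e.g. a device cache flush


# Toy usage: doubling numbers stands in for one synthesis call.
cleanup_calls = []
results = list(
    run_with_periodic_cleanup(
        step=lambda x: x * 2,
        items=[1, 2, 3, 4, 5],
        every=2,
        extra_cleanup=lambda: cleanup_calls.append("cleanup"),
    )
)
print(results, len(cleanup_calls))
```

On a CUDA machine, `torch.cuda.empty_cache` could be passed as `extra_cleanup`. A larger `every` keeps the per-call cost low, at the price of a higher peak between cleanup passes.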

Backward compatibility

API: Identical keys, identical shapes, identical numeric values.
Tests: Every existing assertion in tests/ continues to pass.
CUDA: Tensors now come back to the caller on CPU. Callers that needed host-side values were already calling .cpu() themselves, and that call is now a no-op; the rest get the speed win.
Training: Unaffected; synthesize is inference-only, and training paths use train_step.
03 · The outreach

Where this stands

Done

Branch + commit + README

Local branch fix/298-vits-memory-leak-detach, commit 2441e66. The repository's PR target is dev; instructions in the README.

Next

Fork + push + draft PR

Idiap took over coqui-ai-TTS maintenance after the original Coqui company dissolved. Maintenance is real but small-team; the reviewer pool is narrow.

Goal

Conversation within 1-2 weeks, contract within 6 weeks

Realistic odds for this lead alone: ~30-50% conversation, ~5-15% contract. Idiap is a research institute, hiring slower than a startup; a contract here is more likely to surface a project-based engagement than an FT retainer.

04 · How to verify

Reproduce the leak (and watch it go away)

cd c:/Users/FRA/Documents/github/workrepo/coqui-ai-TTS
git switch fix/298-vits-memory-leak-detach
git log --stat -1   # commit 2441e66, 2 files
pip install -e .
# Then run the reporter's reproducer (or the version from the README)
# synthesizing ~200 short utterances with the VCTK VITS model in a loop.
# Expect RSS to plateau after warm-up instead of growing linearly.
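"Plateau instead of linear growth" can be turned into a number by fitting a line to the RSS samples taken after warm-up and looking at the slope. The sketch below assumes one RSS reading per iteration has already been collected; the function names, the warm-up length, and the threshold are illustrative choices, and the sample series are synthetic rather than measured.

```python
from typing import Sequence


def rss_slope(rss_bytes: Sequence[float], warmup: int) -> float:
    """Least-squares slope (bytes per iteration) of the samples after warm-up."""
    steady = list(rss_bytes[warmup:])
    n = len(steady)
    if n < 2:
        raise ValueError("need at least two samples after warm-up")
    mean_x = (n - 1) / 2
    mean_y = sum(steady) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(steady))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def has_plateaued(
    rss_bytes: Sequence[float], warmup: int, max_bytes_per_iter: float
) -> bool:
    """True if post-warm-up growth stays at or below the given threshold."""
    return rss_slope(rss_bytes, warmup) <= max_bytes_per_iter


# Synthetic series: steady 2 MB/iter growth versus growth that stops after 10 iters.
leaking = [100_000_000 + 2_000_000 * i for i in range(200)]
fixed = [100_000_000 + 2_000_000 * min(i, 10) for i in range(200)]

print(has_plateaued(leaking, warmup=20, max_bytes_per_iter=50_000))
print(has_plateaued(fixed, warmup=20, max_bytes_per_iter=50_000))
```

In a real run the samples could come from something like `psutil.Process().memory_info().rss` read once per iteration (psutil is a third-party package). The threshold needs tuning per machine, since allocator behavior makes RSS noisy even without a leak.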