ComfyUI · #14076 · Francesco

01 · The problem

Free RAM is sitting there, but the next prompt still hits the disk

The reporter has 24 GB VRAM and runs a multi-model Wan workflow. When a model gets evicted from VRAM, ComfyUI doesn't keep its weights in CPU RAM as a write-through cache. Instead it drops the only Python handle, the weights are reclaimed, and the next prompt loads them from disk again. On a workstation with 60+ GB of free RAM, where the weights could comfortably live in system memory, every iteration spends 5-15 seconds on disk I/O for nothing.

The exact code path that drops the handle

Two specific places, both in comfy/model_management.py:

model_unload(), at line 619, calls self.model.detach(unpatch_weights), then nulls out self.model_finalizer and self.real_model. After that, the patcher side holds no references to the model's weights.
free_memory(), at line 706, calls current_loaded_models.pop(i) and appends the popped LoadedModel to a local unloaded_models list, which the caller in load_models_gpu() discards.

The offload_device is CPU for nearly every config, so the actual weight tensors live in host RAM up until that pop happens. The problem isn't where the bytes go. It's that nobody keeps a Python reference to them, so Python frees them as soon as the last reference disappears (or at the next garbage-collection pass, if reference cycles are involved).
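The reference-dropping pattern can be reproduced in isolation. The sketch below is not ComfyUI code: FakeWeights and evict_and_discard are hypothetical stand-ins that only mimic "pop into a local list, then discard the list", and a weakref probe observes whether the object survives.

```python
import gc
import weakref


class FakeWeights:
    """Hypothetical stand-in for a model's CPU-resident weight tensors."""

    def __init__(self, name):
        self.name = name


def evict_and_discard(loaded):
    """Mimic the pattern above: pop into a local list, return nothing."""
    unloaded = []
    unloaded.append(loaded.pop(0))
    # `unloaded` goes out of scope here, taking the last strong reference with it.


current_loaded = [FakeWeights("wan-high-noise")]
probe = weakref.ref(current_loaded[0])  # observes the object without keeping it alive

evict_and_discard(current_loaded)
gc.collect()  # not required in CPython for acyclic objects; makes the result deterministic

weights_survived = probe() is not None
print(weights_survived)  # False: nothing held a reference, so the object was freed
```

Holding the popped object anywhere reachable (a module-level list, for instance) would flip the result to True, which is exactly what the proposed cache does on purpose.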

flowchart TD
  Run1[Workflow run 1] --> Load1["load_models_gpu()<br/>reads weights from disk"]
  Load1 --> Use1[VRAM model used for prompt]
  Use1 --> Pressure[VRAM pressure]
  Pressure --> Unload["model_unload()<br/>+ free_memory()<br/>pops LoadedModel"]
  Unload --> GC["GC reclaims CPU weights<br/>(only handle was the popped LoadedModel)"]
  GC --> Run2[Workflow run 2]
  Run2 --> Load2["load_models_gpu()<br/>reads weights from disk AGAIN"]
  Load2 --> Slow["5-15 s disk I/O<br/>even with 60 GB free RAM"]
  style GC fill:#1e0a0a,stroke:#ef4444,color:#fca5a5
  style Slow fill:#1e0a0a,stroke:#ef4444,color:#fca5a5

02 · Why no drive-by patch

Three ways a naïve "keep weights in RAM" patch would silently break

ComfyUI's memory management is tightly coupled to the patcher / finalizer / weakref lifecycle in comfy/model_patcher.py. The natural reflex (hold a strong reference to the model in a side cache) interacts poorly with three existing systems.

Failure mode 1 · pin-budget underflow

The ensure_pin_budget + dynamic_pins path manages CUDA pinned-memory pools. If a cache holds models with active dynamic pins, racing requests to ensure_pin_registerable can underflow the recorded budget, meaning can_unload_sorted later picks the wrong victim, and a model that should be retained gets evicted instead.

Failure mode 2 · the weakref.finalize chain

At model_management.py:610, every LoadedModel registers weakref.finalize(real_model, cleanup_models). The finalizer fires only when the real_model becomes unreachable. If a cache holds a strong reference to the model, the finalizer never fires, cleanup_models never runs, and current_loaded_models grows monotonically until the process runs out of memory entirely.
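The finalizer behavior is standard-library semantics and can be checked without ComfyUI. In this minimal sketch, FakeRealModel, cleanup_models, and side_cache are hypothetical stand-ins; only weakref.finalize is the real API.

```python
import gc
import weakref


class FakeRealModel:
    """Hypothetical stand-in for the object the finalizer is attached to."""


cleanup_calls = []


def cleanup_models():
    # Stand-in for the real cleanup; it only records that it ran.
    cleanup_calls.append("ran")


side_cache = []  # the naive "keep weights in RAM" cache

model = FakeRealModel()
finalizer = weakref.finalize(model, cleanup_models)
side_cache.append(model)  # strong reference held by the cache

del model
gc.collect()
fired_while_cached = len(cleanup_calls)  # the cache keeps the object reachable

side_cache.clear()  # dropping the cached reference lets the finalizer run
gc.collect()
fired_after_release = len(cleanup_calls)

print(fired_while_cached, fired_after_release, finalizer.alive)  # 0 1 False
```

Any cache design therefore has to decide explicitly when cleanup happens, since the finalizer will stay silent for as long as the cache holds the model.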

Failure mode 3 · divergent offload semantics on MPS / ROCm

On CUDA, the CPU offload device is host RAM and tensor lifetime is straightforward. On MPS (Apple Silicon) and ROCm (AMD), offload can pin host memory differently or involve hidden copies, and the assumption "weights live in plain RAM until I drop the reference" no longer holds. A patch that works flawlessly on a CUDA RTX 5080 can OOM an M3 Max or an RX 7900 XT.

Why this matters for the operator (you, hiring): sending a drive-by PR that "fixes" #14076 but introduces silent regressions on Mac or AMD is worse than not sending one. The proposal flow (design doc, explicit questions, agreement before code) protects both sides. It also signals to the maintainer that I read enough of the codebase to know where the patch is non-trivial, not just what the symptom is.

03 · The proposed fix shape

Opt-in flag, gated LRU cache, auto-tune on high-RAM hosts

Three small primitives. None of them ship until comfyanonymous signs off on the answers to four specific design questions.

1 · New CLI flag

--ram-model-cache-mb N. Default 0 means today's behavior, byte for byte. When set to a positive value, free_memory() stops calling current_loaded_models.pop(i) immediately and instead moves the model into a dedicated ram_cached_models deque with an LRU policy bounded by:

sum(m.model_memory() for m in ram_cached_models) <= args.ram_model_cache_mb * (1024 ** 2)
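A minimal sketch of a byte-bounded LRU that maintains this invariant, under stated assumptions: the class name ByteBoundedLRU and its methods are hypothetical, sizes are passed in by the caller rather than read from model_memory(), and it uses an OrderedDict instead of the deque named above because keyed removal on a cache hit is simpler that way.

```python
from collections import OrderedDict


class ByteBoundedLRU:
    """LRU cache bounded by total bytes, not entry count (illustrative only)."""

    def __init__(self, cap_bytes):
        self.cap_bytes = cap_bytes
        self._entries = OrderedDict()  # key -> (model, size_bytes), oldest first

    def total_bytes(self):
        return sum(size for _, size in self._entries.values())

    def put(self, key, model, size_bytes):
        """Insert a model, evicting least recently used entries to respect the cap.

        Returns the evicted keys so the caller can tear those models down.
        """
        evicted = []
        self._entries.pop(key, None)
        if size_bytes > self.cap_bytes:
            return [key]  # can never fit; the caller treats it as evicted
        while self._entries and self.total_bytes() + size_bytes > self.cap_bytes:
            old_key, _ = self._entries.popitem(last=False)
            evicted.append(old_key)
        self._entries[key] = (model, size_bytes)
        return evicted

    def take(self, key):
        """Remove and return a cached model, or None on a miss."""
        entry = self._entries.pop(key, None)
        return None if entry is None else entry[0]


MB = 1024 ** 2
cache = ByteBoundedLRU(cap_bytes=100 * MB)
cache.put("a", object(), 60 * MB)
evicted = cache.put("b", object(), 60 * MB)  # 120 MB would exceed the cap, so "a" goes
print(evicted, cache.total_bytes() // MB)  # ['a'] 60
```

With a cap of 0 every put is rejected, which matches the "default 0 means today's behavior" intent: nothing is ever retained.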

2 · Cache hit in load_models_gpu()

Before initialising a model from disk, load_models_gpu() checks ram_cached_models. On hit, the model is popped out of the cache, model_load() runs (which moves weights back to VRAM), and it gets re-appended to current_loaded_models. No disk read.
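The lookup order can be sketched as a small function. Everything here is a hypothetical stand-in: acquire_model is not a ComfyUI function, the two containers are plain dicts, and the disk loader is a fake that records how often it is called.

```python
def acquire_model(key, loaded, ram_cache, load_from_disk):
    """Return (model, source) following the lookup order described above.

    `loaded` and `ram_cache` are plain dicts; `load_from_disk` is a
    caller-supplied function. All three are illustrative stand-ins.
    """
    if key in loaded:
        return loaded[key], "vram"
    model = ram_cache.pop(key, None)  # a hit removes the entry from the cache
    if model is not None:
        source = "ram"
    else:
        model = load_from_disk(key)
        source = "disk"
    loaded[key] = model  # either way the model ends up in the loaded set
    return model, source


disk_reads = []


def fake_disk_loader(key):
    disk_reads.append(key)
    return {"name": key}


loaded, ram_cache = {}, {}
_, first = acquire_model("wan", loaded, ram_cache, fake_disk_loader)   # cold start
ram_cache["wan"] = loaded.pop("wan")                                   # simulated VRAM eviction
_, second = acquire_model("wan", loaded, ram_cache, fake_disk_loader)  # served from RAM
print(first, second, disk_reads)  # disk ram ['wan']
```

The second acquisition never touches the loader, which is the whole point of the cache: one disk read per model instead of one per prompt.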

3 · Auto-tune on high-RAM hosts

When --ram-model-cache-mb is unset and the host has more than 64 GB of free system RAM at startup, default the cap to min(0.25 * free_ram, 32 GB). Users with 128 GB RAM and 24 GB VRAM (the exact shape of the bug reporter's machine) get the speedup without touching flags. A --no-ram-model-cache switch disables the auto-tune for strict-today-behavior users.
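The auto-tune rule is simple enough to check with a few numbers. This sketch makes three assumptions that the proposal does not settle: "GB" is read as GiB (1024 ** 3 bytes), an explicit flag value always wins over the auto-tune, and --no-ram-model-cache wins over everything. The function name is hypothetical.

```python
GIB = 1024 ** 3


def auto_cache_cap_bytes(free_ram_bytes, flag_mb=None, disabled=False):
    """Pick the cache cap in bytes from the rule above.

    flag_mb  -- value of --ram-model-cache-mb, or None when the flag is unset
    disabled -- True when --no-ram-model-cache is passed (assumed to take priority)
    """
    if disabled:
        return 0
    if flag_mb is not None:
        return flag_mb * 1024 ** 2
    if free_ram_bytes > 64 * GIB:
        return int(min(0.25 * free_ram_bytes, 32 * GIB))
    return 0


print(auto_cache_cap_bytes(100 * GIB) / GIB)  # 25.0  (a quarter of free RAM)
print(auto_cache_cap_bytes(200 * GIB) / GIB)  # 32.0  (clamped by the 32 GB ceiling)
print(auto_cache_cap_bytes(48 * GIB) / GIB)   # 0.0   (below the 64 GB threshold)
```

Note that the 32 GB ceiling starts to bind once free RAM reaches 128 GB, so larger hosts gain nothing further from the auto-tune and would need the explicit flag.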

flowchart LR
  R[Run N] --> Q{"model in<br/>current_loaded_models?"}
  Q -- yes --> Use[Use it]
  Q -- no --> C{"model in<br/>ram_cached_models?"}
  C -- yes --> H["Pop from cache<br/>model_load to VRAM<br/>append to current_loaded"]
  H --> Use
  C -- no --> D[Read from disk]
  D --> Use
  Use --> Done[Run N complete]
  Done --> P{VRAM pressure?}
  P -- yes --> M["Move evicted model<br/>to ram_cached_models<br/>(bounded by cap)"]
  P -- no --> R2[Run N+1]
  M --> R2
  style H fill:#0a1e10,stroke:#10b981,color:#86efac
  style D fill:#1e1408,stroke:#fbbf24,color:#fde68a

Edge cases enumerated in the proposal

Dynamic models: model.is_dynamic() path keeps current behavior (those manage their own pinning).
Multi-GPU: Cache is host-RAM, GPU-agnostic. Works the same on 1 or 8 GPUs.
Custom nodes: No API change. Cache is transparent.
Host RAM pressure: Cache shrinks under psutil.virtual_memory().available pressure. Never causes OOM.
Finalizer order: When a cached model is evicted, model_finalizer.detach() is called once explicitly before GC.
--clear-cache-on-prompt: Open question for the maintainer: should the new cache honor it identically?
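The host-RAM-pressure case is the one most worth sketching, since it is what keeps the cache from competing with the rest of the system. The function below is hypothetical: the headroom threshold is an invented parameter, the available-memory figure is passed in (in practice it could come from psutil.virtual_memory().available), and it assumes that evicting an entry actually returns its bytes to the OS.

```python
from collections import OrderedDict


def shrink_under_pressure(cache, available_bytes, min_headroom_bytes):
    """Evict oldest entries until the host has enough headroom again.

    cache              -- OrderedDict of key -> size_bytes, oldest first
    available_bytes    -- currently available host RAM
    min_headroom_bytes -- hypothetical threshold below which the cache shrinks
    """
    evicted = []
    while cache and available_bytes < min_headroom_bytes:
        key, size = cache.popitem(last=False)
        available_bytes += size  # assumes the freed bytes become available again
        evicted.append(key)
    return evicted


GIB = 1024 ** 3
cache = OrderedDict([("a", 10 * GIB), ("b", 8 * GIB), ("c", 6 * GIB)])
evicted = shrink_under_pressure(cache, 2 * GIB, 16 * GIB)
print(evicted, list(cache))  # ['a', 'b'] ['c']
```

In a real implementation each evicted entry would also go through the explicit finalizer teardown listed above before its reference is dropped.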

04 · The four questions for the maintainer

Where coding waits for alignment

Opt-in flag vs auto-tune-on-by-default. Is the --ram-model-cache-mb opt-in the shape you want, or do you prefer auto-tune ON by default (with --no-ram-model-cache to disable)?
Finalizer teardown order. Does the weakref.finalize chain in model_patcher.py need a specific teardown order I should match? I plan to call model_finalizer.detach() once at cache-eviction time; does that cover it?
Eviction policy. Plain LRU, or LRU-with-cost-awareness (bigger models stay longer because the disk-reload penalty scales with size)?
--clear-cache-on-prompt. Should the new cache honor this switch identically to current_loaded_models? Currently it clears the GPU cache, not host RAM.
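To make the eviction-policy question concrete, here is one possible way the two options could differ. The scoring formula (size divided by idle time) is purely an assumption for illustration, not something the proposal specifies, and both function names are hypothetical.

```python
def pick_victim_lru(entries):
    """Plain LRU: evict the entry that was used longest ago."""
    return min(entries, key=lambda e: e["last_used"])["key"]


def pick_victim_cost_aware(entries, now):
    """Evict the entry with the lowest retention score.

    Score = size_bytes / idle_seconds (hypothetical): large, recently used
    models score high and are kept; small, stale ones are evicted first.
    """
    def score(e):
        idle = max(now - e["last_used"], 1e-9)  # avoid division by zero
        return e["size_bytes"] / idle

    return min(entries, key=score)["key"]


GIB = 1024 ** 3
entries = [
    {"key": "big-diffusion", "size_bytes": 14 * GIB, "last_used": 100.0},
    {"key": "small-vae", "size_bytes": 1 * GIB, "last_used": 130.0},
]
lru_victim = pick_victim_lru(entries)
cost_victim = pick_victim_cost_aware(entries, now=160.0)
print(lru_victim, cost_victim)  # big-diffusion small-vae
```

The two policies disagree on this input: plain LRU throws away the 14 GB model that is most expensive to reload, while the cost-aware variant sacrifices the small one. That trade-off is what the question asks the maintainer to choose between.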

Once these four are answered, a real PR follows within 2-3 days, with vitest-equivalent Python tests in tests-unit/comfy/test_model_management.py covering cache hit, cache miss, LRU eviction, cap enforcement, and graceful behavior under host RAM pressure. The verification plan in the proposal runs against the reporter's exact WanWorkflow.json.

05 · The outreach

Where this stands

ComfyUI is run by one core maintainer with strong opinions. Drive-by PRs to model_management.py have historically been bounced. The proposal-first flow is calibrated for that culture.

Done

Proposal document committed

Local branch proposal/14076-ram-model-cache, commit f9f5b59. The document is the deliverable; there is no executable code yet.

Reach out via Discord (preferred) or the GitHub issue thread

Discord first because comfyanonymous is more responsive there than via email. The message links to the proposal branch and asks the four questions explicitly.

If aligned

PR opens within 2-3 days with tests

Real code, tests, benchmarks against the WanWorkflow. Estimated 3-5 days of total work from alignment to merge.

Goal

Conversation within 1-2 weeks, contract within 6 weeks

Realistic odds for this lead alone: ~15-30% conversation, ~5% contract. ComfyUI is mostly a single-maintainer project; contracts here are rare. More likely outcome: referral to a custom-node shop or an inference-tooling company.
