comfyanonymous/ComfyUI · issue #14076 · branch proposal/14076-ram-model-cache
Models get evicted from VRAM and then re-read from disk on every prompt, even when the host has 60+ GB of free system RAM. The bug is real and the cause is identified, but a drive-by patch would silently break at least three other systems. This branch deliberately ships zero code: it ships an honest design proposal, a failure trace pinned to specific line numbers, and four direct questions for comfyanonymous before any code is touched.
The reporter has 24 GB VRAM and runs a multi-model Wan workflow. When a model is evicted from VRAM, ComfyUI does not keep its weights in CPU RAM as a cache. It drops the only Python handle, the weights are reclaimed, and the next prompt loads them from disk again. On a workstation with 60+ GB of free RAM and weights that could comfortably live in system memory, every iteration spends 5-15 seconds on disk I/O for nothing.
Two specific places, both in comfy/model_management.py:
model_unload() at line 619 calls self.model.detach(unpatch_weights), then nulls out self.model_finalizer and self.real_model. The model's weight references are gone from the patcher side.
free_memory() at line 706 calls current_loaded_models.pop(i) and appends the popped LoadedModel to a local unloaded_models list that the caller in load_models_gpu() discards.
The offload_device is CPU for nearly every config, so the actual weight tensors live in host RAM up until that pop happens. The problem isn't where the bytes go; it's that nobody keeps a Python reference to them, so CPython frees them as soon as the last reference is dropped (or at the next cycle collection if they are caught in a reference cycle).
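The lifetime problem described above can be reproduced in isolation. The following is a minimal sketch using stand-ins, not ComfyUI's real classes: FakeLoadedModel and free_memory_like are hypothetical names, and a bytearray plays the role of the CPU-side weight tensors.

```python
import gc
import weakref


class FakeLoadedModel:
    """Hypothetical stand-in for LoadedModel; owns a host-side buffer."""

    def __init__(self, name, size):
        self.name = name
        self.weights = bytearray(size)  # pretend these are CPU tensors


def free_memory_like(loaded):
    """Mimics the pattern: pop entries into a local list and return it."""
    unloaded = []
    while loaded:
        unloaded.append(loaded.pop())
    return unloaded


current_loaded = [FakeLoadedModel("wan", 1024)]
probe = weakref.ref(current_loaded[0])  # observe lifetime without owning it

free_memory_like(current_loaded)  # caller ignores the returned list
gc.collect()  # only matters if reference cycles were involved

remaining = len(current_loaded)
weights_survived = probe() is not None
print(remaining, weights_survived)
```

Once the returned list is discarded, nothing owns the object any more, so the weak reference goes dead and the buffer is released. Any cache design has to start by keeping one deliberate strong reference somewhere.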
flowchart TD
Run1[Workflow run 1] --> Load1["load_models_gpu()
reads weights from disk"]
Load1 --> Use1[VRAM model used for prompt]
Use1 --> Pressure[VRAM pressure]
Pressure --> Unload["model_unload()
+ free_memory()
pops LoadedModel"]
Unload --> GC["GC reclaims CPU weights
(only handle was the pop'd LoadedModel)"]
GC --> Run2[Workflow run 2]
Run2 --> Load2["load_models_gpu()
reads weights from disk AGAIN"]
Load2 --> Slow["5-15 s disk I/O
even with 60 GB free RAM"]
style GC fill:#1e0a0a,stroke:#ef4444,color:#fca5a5
style Slow fill:#1e0a0a,stroke:#ef4444,color:#fca5a5
ComfyUI's memory management is tightly coupled to the patcher / finalizer / weakref lifecycle in comfy/model_patcher.py. The natural reflex (hold a strong reference to the model in a side cache) interacts poorly with three existing systems.
The ensure_pin_budget + dynamic_pins path manages CUDA pinned-memory pools. If a cache holds models with active dynamic pins, racing requests to ensure_pin_registerable can underflow the recorded budget, meaning can_unload_sorted later picks the wrong victim, and a model that should be retained gets evicted instead.
At model_management.py:610, every LoadedModel registers weakref.finalize(real_model, cleanup_models). The finalizer fires only when the real_model becomes unreachable. If a cache holds a strong reference to the model, the finalizer never fires, cleanup_models never runs, and current_loaded_models grows monotonically until the process runs out of memory entirely.
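The finalizer interaction can be shown with the standard library alone. This is a sketch under assumptions: RealModel, side_cache, and the tagged cleanup_models callback are hypothetical stand-ins (per the text above, the real callback is registered without arguments), and only the documented behavior of weakref.finalize is relied on.

```python
import gc
import weakref


class RealModel:
    """Hypothetical stand-in for the object that weakref.finalize watches."""


calls = []


def cleanup_models(tag):
    # Stand-in callback; records which finalizer ran.
    calls.append(tag)


# Case 1: no cache. Dropping the last reference fires the finalizer.
m1 = RealModel()
weakref.finalize(m1, cleanup_models, "uncached")
del m1
gc.collect()

# Case 2: a side cache holds a strong reference, so the finalizer stays armed.
m2 = RealModel()
fin = weakref.finalize(m2, cleanup_models, "cached")
side_cache = [m2]
del m2
gc.collect()
alive_while_cached = fin.alive  # still waiting; cleanup has not run

# Case 3: detach() disarms the finalizer before the cache lets go.
fin.detach()
side_cache.clear()
gc.collect()
alive_after_detach = fin.alive

print(calls, alive_while_cached, alive_after_detach)
```

One consequence worth keeping in mind: a detached finalizer never invokes its callback, so if detach() is used at cache-eviction time, whatever cleanup_models normally does would have to be triggered explicitly by the eviction path.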
On CUDA, the CPU offload device is host RAM and tensor lifetime is straightforward. On MPS (Apple Silicon) and ROCm (AMD), offload can pin host memory differently or involve hidden copies, and the assumption "weights live in plain RAM until I drop the reference" no longer holds. A patch that works flawlessly on a CUDA RTX 5080 can OOM an M3 Max or an RX 7900 XT.
Three small primitives. None of them ship until comfyanonymous signs off on the answers to four specific design questions.
--ram-model-cache-mb N. Default 0 means today's behavior, byte for byte. When set to a positive value, free_memory() stops calling current_loaded_models.pop(i) immediately and instead moves the model into a dedicated ram_cached_models deque with an LRU policy bounded by:
load_models_gpu(). Before initialising a model from disk, load_models_gpu() checks ram_cached_models. On hit, the model is popped out of the cache, model_load() runs (which moves weights back to VRAM), and it gets re-appended to current_loaded_models. No disk read.
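The eviction and lookup paths described above could look roughly like the sketch below. It is illustrative only: RamModelCache, put, and take are hypothetical names, sizes are passed in by the caller rather than measured, and an OrderedDict is used in place of the deque mentioned above purely to keep keyed lookup short.

```python
from collections import OrderedDict


class RamModelCache:
    """Hypothetical LRU cache of evicted models, bounded by total bytes."""

    def __init__(self, cap_bytes):
        self.cap_bytes = cap_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> (model, size_bytes), oldest first

    def put(self, key, model, size_bytes):
        """Cache an evicted model; return keys that were dropped or rejected."""
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)[1]
        if size_bytes > self.cap_bytes:
            return [key]  # cannot fit at all; a cap of 0 rejects everything
        dropped = []
        while self._entries and self.used_bytes + size_bytes > self.cap_bytes:
            old_key, (_, old_size) = self._entries.popitem(last=False)
            self.used_bytes -= old_size
            dropped.append(old_key)
        self._entries[key] = (model, size_bytes)
        self.used_bytes += size_bytes
        return dropped

    def take(self, key):
        """Cache hit: remove and return the model; None signals a miss."""
        entry = self._entries.pop(key, None)
        if entry is None:
            return None
        self.used_bytes -= entry[1]
        return entry[0]


cache = RamModelCache(cap_bytes=10)
cache.put("unet", "unet-weights", 6)
cache.put("vae", "vae-weights", 3)
dropped = cache.put("clip", "clip-weights", 5)  # over the cap: oldest entry goes
hit = cache.take("vae")    # caller would now run model_load()
miss = cache.take("unet")  # caller falls back to the disk read
print(dropped, hit, miss, cache.used_bytes)
```

Taking the entry out on a hit, rather than leaving it in place, mirrors the "popped out of the cache" wording and keeps each model owned by exactly one list at a time.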
When --ram-model-cache-mb is unset and the host has more than 64 GB of free system RAM at startup, default the cap to min(0.25 * free_ram, 32 GB). Users with 128 GB RAM and 24 GB VRAM (the exact shape of the bug reporter's machine) get the speedup without touching flags. A --no-ram-model-cache switch disables the auto-tune for users who want exactly today's behavior.
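The auto-tune rule is simple enough to write down and check by hand. The sketch below assumes binary units (GiB and MiB) for the "GB" and "MB" figures and assumes the disable switch takes precedence over an explicit cap; auto_cache_cap_bytes and its parameters are hypothetical names.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def auto_cache_cap_bytes(free_ram_bytes, flag_mb=None, disabled=False):
    """Resolve the cache cap in bytes from the flags and free RAM at startup."""
    if disabled:               # assumed meaning of --no-ram-model-cache
        return 0
    if flag_mb is not None:    # explicit --ram-model-cache-mb N wins over auto-tune
        return flag_mb * MIB
    if free_ram_bytes <= 64 * GIB:
        return 0               # threshold not met: keep today's behavior
    return min(free_ram_bytes // 4, 32 * GIB)


print(auto_cache_cap_bytes(100 * GIB) // GIB)  # a quarter of free RAM
print(auto_cache_cap_bytes(200 * GIB) // GIB)  # clamped by the 32 GiB ceiling
print(auto_cache_cap_bytes(60 * GIB) // GIB)   # below the 64 GiB threshold
```

Under this rule the 32 GB ceiling only takes effect at 128 GB free or more, and a host with 60 GB free stays under the 64 GB threshold, so it would need the explicit flag to enable the cache.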
flowchart LR
R[Run N] --> Q{model in
current_loaded_models?}
Q -- yes --> Use[Use it]
Q -- no --> C{model in
ram_cached_models?}
C -- yes --> H[Pop from cache
model_load to VRAM
append to current_loaded]
H --> Use
C -- no --> D[Read from disk]
D --> Use
Use --> Done[Run N complete]
Done --> P{VRAM pressure?}
P -- yes --> M["Move evicted model
to ram_cached_models
(bounded by cap)"]
P -- no --> R2[Run N+1]
M --> R2
style H fill:#0a1e10,stroke:#10b981,color:#86efac
style D fill:#1e1408,stroke:#fbbf24,color:#fde68a
The model.is_dynamic() path keeps current behavior (those manage their own pinning).
psutil.virtual_memory().available pressure. Never causes OOM.
model_finalizer.detach() is called once explicitly before GC.
Is --ram-model-cache-mb opt-in the shape you want, or do you prefer auto-tune ON by default (with --no-ram-model-cache to disable)?
Does the weakref.finalize chain in model_patcher.py need a specific teardown order I should match? I plan to call model_finalizer.detach() once at cache-eviction time; does that cover it?
current_loaded_models? Currently it clears the GPU cache, not host RAM.
tests-unit/comfy/test_model_management.py covering cache hit, cache miss, LRU eviction, cap enforcement, and graceful behavior under host RAM pressure. The verification plan in the proposal runs against the reporter's exact WanWorkflow.json.
ComfyUI is run by one core maintainer with strong opinions. Drive-by PRs to model_management.py have historically been bounced. The proposal-first flow is calibrated for that culture.
Local branch proposal/14076-ram-model-cache, commit f9f5b59. The document is the deliverable; there is no executable code yet.
Discord first because comfyanonymous is more responsive there than via email. The message links to the proposal branch and asks the four questions explicitly.
Real code, tests, benchmarks against the WanWorkflow. Estimated 3-5 days of total work from alignment to merge.
Realistic odds for this lead alone: ~15-30% conversation, ~5% contract. ComfyUI is mostly a single-maintainer project; contracts here are rare. More likely outcome: referral to a custom-node shop or an inference-tooling company.
There are no tests to run, because there is no code yet. That is the point: alignment first, code after.