Why Ollama Runs Out of Memory
Ollama loads model weights into RAM or VRAM to run inference. If the model is too large for the available memory, one of several things happens: the model offloads layers to RAM (slower), inference fails with an out of memory error, or your system becomes unresponsive as it tries to use swap.
Understanding which layer of the memory stack is failing helps you fix the problem efficiently.
Error Messages to Look For
Common signs of an out of memory problem in Ollama:
- CUDA error: out of memory — GPU VRAM exhausted
- OOM killer messages in system logs — the Linux kernel killed the process
- exit status 1 or signal: killed mid-inference
- The model starts loading, then hangs indefinitely
- Your system becomes very slow and the disk thrashes (swap use spiking)
- Ollama returns an error response with no specific message — check the server logs
On Mac, check the memory pressure in Activity Monitor. On Linux:
dmesg | grep -i "out of memory"
journalctl -u ollama --since "10 minutes ago"
Check How Much Memory a Model Needs
As a rough rule, the memory required is approximately:
- 4-bit quantised (Q4): ~0.5 GB per billion parameters
- 8-bit quantised (Q8): ~1 GB per billion parameters
- Full precision (F16): ~2 GB per billion parameters
So a Q4 7B model needs roughly 3.5–4 GB. A Q4 14B model needs around 7–8 GB. The context window also uses memory — a 32k context adds 1–2 GB on top of the model weights.
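The rule of thumb can be wrapped in a tiny helper for quick estimates (a sketch only — the constants mirror the approximations above, and real usage varies by architecture and context length):

```shell
# Estimate weight memory in GB from parameter count (in billions)
# and bits per weight: GB ≈ params * bits / 8.
estimate_gb() {
  local params_b=$1 bits=$2
  awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 7 4    # Q4 7B  -> 3.5
estimate_gb 14 4   # Q4 14B -> 7.0
estimate_gb 8 16   # F16 8B -> 16.0
```

Remember to add 1–2 GB on top of these figures for the context window.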
Check the model's parameter count and quantisation before loading:
ollama show llama3.2
Fix 1: Use a Smaller or More Quantised Model
The most direct fix. If a 7B model is too large, try a 3B version. If the Q8 variant is failing, use Q4:
# Switch from 7B to 3B
ollama pull llama3.2:3b
# Pull a more aggressively quantised variant
ollama pull llama3.2:latest # Q4_K_M by default — good balance
ollama pull llama3.2:3b-q2_K # Smallest, lowest quality, minimum RAM
For most use cases, Q4_K_M is the sweet spot — it reduces memory significantly with minimal quality loss compared to Q8 or full precision.
Fix 2: Reduce the Context Window
The context window (how many tokens the model can process at once) consumes memory proportionally to its size. If you don’t need long contexts, shrink it:
ollama run llama3.2
>>> /set parameter num_ctx 2048
Or set it persistently in a Modelfile:
FROM llama3.2
PARAMETER num_ctx 2048
ollama create llama3.2-small-ctx -f ./Modelfile
ollama run llama3.2-small-ctx
Reducing from 8192 to 2048 can save 1–2 GB depending on the model architecture.
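The saving comes from the KV cache, which grows linearly with context length. A rough sketch for an 8B llama-style model (assumed figures: 32 layers, 8 KV heads, head dimension 128, fp16 cache at 2 bytes per value — check your model's actual architecture):

```shell
# Approximate KV-cache size in MiB for a given context length.
# Per token: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value.
kv_cache_mib() {
  local ctx=$1
  awk -v c="$ctx" 'BEGIN {
    per_token = 2 * 32 * 8 * 128 * 2
    printf "%.0f\n", per_token * c / (1024 * 1024)
  }'
}

kv_cache_mib 8192   # -> 1024 MiB
kv_cache_mib 2048   # -> 256 MiB
```

Under these assumptions, dropping from 8192 to 2048 tokens frees about 768 MiB; models with more layers or KV heads save proportionally more.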
Fix 3: Set Layer Offloading (GPU + CPU Split)
If the model is slightly too large for your VRAM, you can tell Ollama to load some layers into RAM instead of VRAM. This is slower but prevents the crash:
# Reserve ~256 MiB of VRAM as headroom (the value is in bytes)
OLLAMA_GPU_OVERHEAD=268435456 ollama serve
Alternatively, set the number of GPU layers explicitly via the API:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Hello",
"options": { "num_gpu": 20 }
}'
Reduce num_gpu until the model loads without error. Each layer uses roughly the same amount of VRAM, so halving the layers roughly halves VRAM use.
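Once you find a num_gpu value that loads reliably, you can pin it in a Modelfile so it applies to every request (num_gpu is a standard Modelfile parameter; the value 20 and the model name below are examples):

```
FROM llama3.2
PARAMETER num_gpu 20
```

ollama create llama3.2-gpu20 -f ./Modelfile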
Fix 4: Free Up VRAM Before Loading
Other processes may be consuming GPU memory. On Linux, check what’s using VRAM:
nvidia-smi
Close GPU-intensive applications (games, other ML workloads, browser hardware acceleration on some systems). On Windows, you can also toggle hardware-accelerated GPU scheduling under Settings → System → Display → Graphics to reduce OS VRAM overhead.
Fix 5: Increase Virtual Memory / Swap (CPU-only)
If you’re running CPU-only and have limited RAM, you can extend it with swap space. This is much slower than RAM but allows larger models to run. On Linux:
# Create a 16 GB swap file
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
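The swapon above lasts only until reboot. To make the swap file persistent, add a line like this to /etc/fstab (assuming the /swapfile path created above):

```
/swapfile none swap sw 0 0
```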
Performance will be poor — expect 5–20x slower inference than running fully in RAM. But it allows you to run models that wouldn’t fit otherwise.
Windows manages virtual memory automatically; you can increase the page file size under System → Advanced System Settings → Performance → Virtual Memory.
Fix 6: Unload Idle Models
By default, Ollama keeps models loaded in memory for 5 minutes after last use. If you've run several models, they may all still be sitting in VRAM — check what's loaded with ollama ps, then unload them:
# Unload a specific model immediately
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"keep_alive": 0
}'
Or reduce the global keep-alive time so models unload faster after being idle:
OLLAMA_KEEP_ALIVE=1m ollama serve
Fix 7: Lower the OLLAMA_MAX_LOADED_MODELS Limit
By default, Ollama can keep more than one model loaded concurrently. If you're switching between multiple models and one is failing to load, cap the number of concurrently loaded models:
OLLAMA_MAX_LOADED_MODELS=1 ollama serve
Setting it to 1 ensures only one model is loaded at a time, which frees memory used by any previously loaded model before the new one loads.
Fix 8: On Apple Silicon — Adjust Memory Allocation
On Mac with Apple Silicon (M1/M2/M3/M4), RAM is shared between CPU and GPU. macOS reserves some memory for the OS. If Ollama is failing to load a model that should fit, check that no other apps are consuming the shared memory pool.
You can check unified memory pressure from the terminal:
memory_pressure
Close memory-heavy apps like Chrome, Xcode, or Docker containers before loading large models. On Apple Silicon, a 16 GB MacBook Air can comfortably run a Q4 7B model, but may struggle with 13B+ models alongside other apps.
Choosing the Right Model for Your Hardware
| Available VRAM / RAM | Largest practical model | Recommended model |
|---|---|---|
| 4 GB | Q4 3B | llama3.2:3b or qwen2.5:3b |
| 6 GB | Q4 7B | llama3.2 or mistral:7b |
| 8 GB | Q4 7B (with headroom) | llama3.2 or qwen2.5:7b |
| 12 GB | Q4 13B | llama3.1:8b or mistral-nemo |
| 16 GB | Q4 14B | qwen2.5:14b or deepseek-coder-v2:16b |
| 24 GB | Q4 32B | qwen2.5:32b or llama3.3:70b-q2_K |
| 48 GB+ | Q4 70B | llama3.3:70b or qwen2.5:72b |
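If you script model selection, the table collapses into a simple tier check (a sketch — the model names mirror the recommendations above and are illustrative, not the only options):

```shell
# Pick a default model tag from available VRAM/RAM in GB.
pick_model() {
  local gb=$1
  if   [ "$gb" -ge 48 ]; then echo "llama3.3:70b"
  elif [ "$gb" -ge 24 ]; then echo "qwen2.5:32b"
  elif [ "$gb" -ge 16 ]; then echo "qwen2.5:14b"
  elif [ "$gb" -ge 6 ];  then echo "llama3.2"
  else                        echo "llama3.2:3b"
  fi
}

pick_model 8    # -> llama3.2
```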
Diagnosing Memory Issues with Ollama Logs
Enable verbose logging to see exactly what Ollama is doing when a model fails to load:
OLLAMA_DEBUG=1 ollama serve 2>&1 | tee ollama.log
Then attempt to load the model in another terminal. The log will show layer allocation, VRAM usage per layer, and where the failure occurs.


