
Ollama Out of Memory Errors: How to Fix Them


Why Ollama Runs Out of Memory

Ollama loads model weights into RAM or VRAM to run inference. If the model is too large for the available memory, one of three things happens: Ollama offloads some layers from VRAM to system RAM (slower), inference fails with an out-of-memory error, or the system becomes unresponsive as it falls back to swap.

Understanding which layer of the memory stack is failing helps you fix the problem efficiently.

Error Messages to Look For

Common signs of an out of memory problem in Ollama:

  • CUDA error: out of memory — GPU VRAM exhausted
  • OOM killer messages in system logs — Linux kernel killed the process
  • exit status 1 or signal: killed mid-inference
  • The model starts loading then hangs indefinitely
  • Your system becomes very slow, disk thrashes (swap use spiking)
  • Ollama returns an error response but no specific message — check ollama logs

On Mac, check the memory pressure in Activity Monitor. On Linux:

dmesg | grep -i "out of memory"
journalctl -u ollama --since "10 minutes ago"

Check How Much Memory a Model Needs

As a rough rule, the memory required is approximately:

  • 4-bit quantised (Q4): ~0.5 GB per billion parameters
  • 8-bit quantised (Q8): ~1 GB per billion parameters
  • Full precision (F16): ~2 GB per billion parameters

So a Q4 7B model needs roughly 3.5–4 GB. A Q4 14B model needs around 7–8 GB. The context window also uses memory — a 32k context adds 1–2 GB on top of the model weights.
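These rules of thumb can be turned into a quick estimator. A minimal sketch in shell, assuming the per-parameter rates above plus roughly 1.5 GB for a mid-sized context window (both figures are approximations, not Ollama internals):

```shell
# Estimate memory needed from parameter count (billions) and quantisation,
# using the rough GB-per-billion-parameter rates above.
estimate_gb() {
  params_b=$1    # parameters, in billions
  quant=$2       # q4, q8, or f16
  case "$quant" in
    q4)  rate=0.5 ;;
    q8)  rate=1 ;;
    f16) rate=2 ;;
  esac
  # add ~1.5 GB for a mid-sized context window (assumption)
  awk -v p="$params_b" -v r="$rate" 'BEGIN { printf "%.1f\n", p * r + 1.5 }'
}

estimate_gb 7 q4    # prints 5.0 (GB) for a Q4 7B model plus context
estimate_gb 14 q4   # prints 8.5 (GB) for a Q4 14B model
```

If the estimate is within a gigabyte or two of your free memory, expect offloading or failures and pick a smaller variant.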

Check the model’s actual size before loading. ollama show prints the parameter count, quantisation level, and context length, and ollama list shows each model’s size on disk:

ollama show llama3.2
ollama list

Fix 1: Use a Smaller or More Quantised Model

The most direct fix. If a 7B model is too large, try a 3B version. If the Q8 variant is failing, use Q4:

# Switch from 7B to 3B
ollama pull llama3.2:3b

# Pull a more aggressively quantised variant
ollama pull llama3.2:latest   # Q4_K_M by default (good balance)
ollama pull llama3.2:3b-q2_K  # smallest, lowest quality; exact tag names vary per model, check ollama.com

For most use cases, Q4_K_M is the sweet spot — it reduces memory significantly with minimal quality loss compared to Q8 or full precision.

Fix 2: Reduce the Context Window

The context window (how many tokens the model can process at once) consumes memory proportionally to its size. If you don’t need long contexts, shrink it:

ollama run llama3.2
>>> /set parameter num_ctx 2048

Or set it persistently in a Modelfile:

FROM llama3.2
PARAMETER num_ctx 2048

Then create and run the variant:

ollama create llama3.2-small-ctx -f ./Modelfile
ollama run llama3.2-small-ctx

Reducing from 8192 to 2048 can save 1–2 GB depending on the model architecture.
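The saving comes from the KV cache, which grows linearly with num_ctx. A back-of-the-envelope sketch, assuming a dense 7B-class model (32 layers, 32 KV heads, head dimension 128, fp16 cache; these are illustrative figures, and models with grouped-query attention use proportionally less):

```shell
# KV cache bytes ≈ 2 (K and V) × layers × ctx × kv_heads × head_dim × bytes/elem
kv_cache_gib() {
  ctx=$1
  layers=32      # assumed layer count for a ~7B model
  kv_heads=32    # assumed; GQA models use fewer (e.g. 8)
  head_dim=128
  bytes=2        # fp16 cache entries
  awk -v c="$ctx" -v l="$layers" -v h="$kv_heads" -v d="$head_dim" -v b="$bytes" \
    'BEGIN { printf "%.2f\n", 2 * l * c * h * d * b / (1024^3) }'
}

kv_cache_gib 2048   # prints 1.00 (GiB)
kv_cache_gib 8192   # prints 4.00 (GiB), so shrinking the context saves ~3 GiB here
```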

Fix 3: Set Layer Offloading (GPU + CPU Split)

If the model is slightly too large for your VRAM, Ollama can place some layers in system RAM instead of VRAM. This is slower but prevents the crash. One lever is OLLAMA_GPU_OVERHEAD, which reserves a margin of VRAM (the value is in bytes) so the scheduler offloads more layers to the CPU:

OLLAMA_GPU_OVERHEAD=268435456 ollama serve   # reserve 256 MiB

Alternatively, set the number of GPU layers explicitly via the API:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello",
  "options": { "num_gpu": 20 }
}'

Reduce num_gpu until the model loads without error. Each layer uses roughly the same amount of VRAM, so halving the layers roughly halves VRAM use.
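Picking a starting num_gpu can be reduced to simple arithmetic. A sketch, assuming equally sized layers; the model size and layer count below are illustrative, and the OLLAMA_DEBUG logs show the real per-layer allocation:

```shell
# Estimate how many layers fit in free VRAM, assuming roughly equal layer sizes.
layers_that_fit() {
  model_gb=$1      # total model size in GB
  total_layers=$2  # e.g. 32 for a 7B Llama-style model (assumption)
  free_vram_gb=$3  # VRAM you can spare
  awk -v m="$model_gb" -v t="$total_layers" -v f="$free_vram_gb" \
    'BEGIN { n = int(f / (m / t)); if (n > t) n = t; print n }'
}

layers_that_fit 4 32 3   # prints 24: 24 of 32 layers fit in 3 GB
```

Use the result as a first guess for num_gpu, then nudge it down if loading still fails.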

Fix 4: Free Up VRAM Before Loading

Other processes may be consuming GPU memory. On Linux, check what’s using VRAM:

nvidia-smi

Close GPU-intensive applications (games, other ML workloads, browsers with hardware acceleration enabled). On Windows, the Details tab in Task Manager can show each process’s dedicated GPU memory, which helps identify what to close.

Fix 5: Increase Virtual Memory / Swap (CPU-only)

If you’re running CPU-only and have limited RAM, you can extend it with swap space. This is much slower than RAM but allows larger models to run. On Linux:

# Create a 16 GB swap file
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Performance will be poor — expect 5–20x slower inference than running fully in RAM. But it allows you to run models that wouldn’t fit otherwise.

On Windows, the page file is managed automatically, but you can raise it manually under System → Advanced system settings → Performance → Settings → Advanced → Virtual memory.

Fix 6: Unload Idle Models

By default, Ollama keeps models loaded in memory for 5 minutes after last use. If you’ve run several models, they may all still be sitting in VRAM. Unload them:

# Unload a specific model immediately
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "keep_alive": 0
}'

Or reduce the global keep-alive time so models unload faster after being idle:

OLLAMA_KEEP_ALIVE=1m ollama serve

Fix 7: Cap Concurrent Models with OLLAMA_MAX_LOADED_MODELS

By default, Ollama can keep more than one model loaded at the same time. If you’re loading multiple models and one is failing to fit, cap the limit:

OLLAMA_MAX_LOADED_MODELS=1 ollama serve

Setting it to 1 ensures only one model is loaded at a time, which frees memory used by any previously loaded model before the new one loads.
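If Ollama runs as a systemd service (the default Linux install), variables set in an interactive shell won’t reach the server; a drop-in override is the usual route. A sketch, assuming the standard service name ollama:

```shell
# Open a drop-in override for the service
sudo systemctl edit ollama
# and add in the editor:
#   [Service]
#   Environment="OLLAMA_MAX_LOADED_MODELS=1"

# Restart so the setting takes effect
sudo systemctl restart ollama
```

The same drop-in works for the other environment variables in this article (OLLAMA_KEEP_ALIVE, OLLAMA_GPU_OVERHEAD, OLLAMA_DEBUG).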

Fix 8: On Apple Silicon — Adjust Memory Allocation

On Mac with Apple Silicon (M1/M2/M3/M4), RAM is shared between CPU and GPU. macOS reserves some memory for the OS. If Ollama is failing to load a model that should fit, check that no other apps are consuming the shared memory pool.

You can check memory pressure from the terminal:

memory_pressure

This prints the system-wide free-memory percentage; sustained warn or critical pressure means other apps are squeezing the shared pool.

Close memory-heavy apps like Chrome, Xcode, or Docker containers before loading large models. On Apple Silicon, a 16 GB MacBook Air can comfortably run a Q4 7B model, but may struggle with 13B+ models alongside other apps.

Choosing the Right Model for Your Hardware

| Available VRAM / RAM | Largest practical model | Recommended models |
| --- | --- | --- |
| 4 GB | Q4 3B | llama3.2:3b or qwen2.5:3b |
| 6 GB | Q4 7B | llama3.2 or mistral:7b |
| 8 GB | Q4 7B (with headroom) | llama3.2 or qwen2.5:7b |
| 12 GB | Q4 13B | llama3.1:8b or codestral:latest |
| 16 GB | Q4 14B | qwen2.5:14b or deepseek-coder-v2:16b |
| 24 GB | Q4 32B | qwen2.5:32b or llama3.3:70b-q2_K |
| 48 GB+ | Q4 70B | llama3.3:70b or qwen2.5:72b |
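The table can be collapsed into a small helper for scripts. A sketch; the tier boundaries and model picks mirror the table above and are judgment calls, not Ollama behaviour:

```shell
# Suggest a model tag for a given amount of free VRAM/RAM (whole GB),
# following the hardware table above.
pick_model() {
  gb=$1
  if   [ "$gb" -ge 48 ]; then echo "llama3.3:70b"
  elif [ "$gb" -ge 24 ]; then echo "qwen2.5:32b"
  elif [ "$gb" -ge 16 ]; then echo "qwen2.5:14b"
  elif [ "$gb" -ge 12 ]; then echo "llama3.1:8b"
  elif [ "$gb" -ge 8 ];  then echo "qwen2.5:7b"
  elif [ "$gb" -ge 6 ];  then echo "mistral:7b"
  elif [ "$gb" -ge 4 ];  then echo "llama3.2:3b"
  else echo "none"
  fi
}

pick_model 16   # prints qwen2.5:14b
```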

Diagnosing Memory Issues with Ollama Logs

Enable verbose logging to see exactly what Ollama is doing when a model fails to load:

OLLAMA_DEBUG=1 ollama serve 2>&1 | tee ollama.log

Then attempt to load the model in another terminal. The log will show layer allocation, VRAM usage per layer, and where the failure occurs.
