Why Ollama Runs Out of Memory
Ollama loads model weights into RAM or VRAM to run inference. If the model is too large for the available memory, one of several things happens: the model offloads layers to RAM (slower), inference fails with an out of memory error, or your system becomes unresponsive as it tries to use swap.
Understanding which layer of the memory stack is failing helps you fix the problem efficiently.
Error Messages to Look For
Common signs of an out of memory problem in Ollama:
- CUDA error: out of memory — GPU VRAM exhausted
- OOM killer messages in system logs — the Linux kernel killed the process
- exit status 1 or signal: killed mid-inference
- The model starts loading, then hangs indefinitely
- Your system becomes very slow and the disk thrashes (swap use spiking)
- Ollama returns an error response with no specific message — check the server logs
On Mac, check the memory pressure in Activity Monitor. On Linux:
dmesg | grep -i "out of memory"
journalctl -u ollama --since "10 minutes ago"
Check How Much Memory a Model Needs
As a rough rule, the memory required is approximately:
- 4-bit quantised (Q4): ~0.5 GB per billion parameters
- 8-bit quantised (Q8): ~1 GB per billion parameters
- Full precision (F16): ~2 GB per billion parameters
So a Q4 7B model needs roughly 3.5–4 GB. A Q4 14B model needs around 7–8 GB. The context window also uses memory — a 32k context adds 1–2 GB on top of the model weights.
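The rule of thumb can be wrapped in a tiny helper for quick estimates (a sketch only — the constants mirror the approximations above, and real usage varies by architecture and context length):

```shell
# Estimate weight memory in GB from parameter count (in billions)
# and bits per weight: GB ≈ params * bits / 8.
estimate_gb() {
  local params_b=$1 bits=$2
  awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 7 4    # Q4 7B  -> 3.5
estimate_gb 14 4   # Q4 14B -> 7.0
estimate_gb 8 16   # F16 8B -> 16.0
```

Remember to add 1–2 GB on top of these figures for the context window.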
Check the model's parameter count and quantisation before loading:
ollama show llama3.2
Fix 1: Use a Smaller or More Quantised Model
The most direct fix. If a 7B model is too large, try a 3B version. If the Q8 variant is failing, use Q4:
# Switch from 7B to 3B
ollama pull llama3.2:3b
# Pull a more aggressively quantised variant
ollama pull llama3.2:latest # Q4_K_M by default — good balance
ollama pull llama3.2:3b-q2_K # Smallest, lowest quality, minimum RAM
For most use cases, Q4_K_M is the sweet spot — it reduces memory significantly with minimal quality loss compared to Q8 or full precision.
Fix 2: Reduce the Context Window
The context window (how many tokens the model can process at once) consumes memory proportionally to its size. If you don’t need long contexts, shrink it:
ollama run llama3.2
>>> /set parameter num_ctx 2048
Or set it persistently in a Modelfile:
FROM llama3.2
PARAMETER num_ctx 2048
ollama create llama3.2-small-ctx -f ./Modelfile
ollama run llama3.2-small-ctx
Reducing from 8192 to 2048 can save 1–2 GB depending on the model architecture.
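The saving comes from the KV cache, which grows linearly with context length. A rough sketch for an 8B llama-style model (assumed figures: 32 layers, 8 KV heads, head dimension 128, fp16 cache at 2 bytes per value — check your model's actual architecture):

```shell
# Approximate KV-cache size in MiB for a given context length.
# Per token: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value.
kv_cache_mib() {
  local ctx=$1
  awk -v c="$ctx" 'BEGIN {
    per_token = 2 * 32 * 8 * 128 * 2
    printf "%.0f\n", per_token * c / (1024 * 1024)
  }'
}

kv_cache_mib 8192   # -> 1024 MiB
kv_cache_mib 2048   # -> 256 MiB
```

Under these assumptions, dropping from 8192 to 2048 tokens frees about 768 MiB; models with more layers or KV heads save proportionally more.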
Fix 3: Set Layer Offloading (GPU + CPU Split)
If the model is slightly too large for your VRAM, you can tell Ollama to load some layers into RAM instead of VRAM. This is slower but prevents the crash:
# Reserve ~256 MiB of VRAM as headroom (the value is in bytes)
OLLAMA_GPU_OVERHEAD=268435456 ollama serve
Alternatively, set the number of GPU layers explicitly via the API:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Hello",
"options": { "num_gpu": 20 }
}'
Reduce num_gpu until the model loads without error. Each layer uses roughly the same amount of VRAM, so halving the layers roughly halves VRAM use.
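Once you find a num_gpu value that loads reliably, you can pin it in a Modelfile so it applies to every request (num_gpu is a standard Modelfile parameter; the value 20 and the model name below are examples):

```
FROM llama3.2
PARAMETER num_gpu 20
```

ollama create llama3.2-gpu20 -f ./Modelfile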
Fix 4: Free Up VRAM Before Loading
Other processes may be consuming GPU memory. On Linux, check what’s using VRAM:
nvidia-smi
Close GPU-intensive applications (games, other ML workloads, browser hardware acceleration on some systems). On Windows, you can also toggle hardware-accelerated GPU scheduling under Settings → System → Display → Graphics to reduce OS VRAM overhead.
Fix 5: Increase Virtual Memory / Swap (CPU-only)
If you’re running CPU-only and have limited RAM, you can extend it with swap space. This is much slower than RAM but allows larger models to run. On Linux:
# Create a 16 GB swap file
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
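The swapon above lasts only until reboot. To make the swap file persistent, add a line like this to /etc/fstab (assuming the /swapfile path created above):

```
/swapfile none swap sw 0 0
```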
Performance will be poor — expect 5–20x slower inference than running fully in RAM. But it allows you to run models that wouldn’t fit otherwise.
Windows manages virtual memory automatically; you can increase the page file size under System → Advanced System Settings → Performance → Virtual Memory.
Fix 6: Unload Idle Models
By default, Ollama keeps models loaded in memory for 5 minutes after last use. If you've run several models, they may all still be sitting in VRAM — check what's loaded with ollama ps, then unload them:
# Unload a specific model immediately
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"keep_alive": 0
}'
Or reduce the global keep-alive time so models unload faster after being idle:
OLLAMA_KEEP_ALIVE=1m ollama serve
Fix 7: Lower the OLLAMA_MAX_LOADED_MODELS Limit
By default, Ollama can keep more than one model loaded concurrently. If you're switching between multiple models and one is failing to load, cap the number of concurrently loaded models:
OLLAMA_MAX_LOADED_MODELS=1 ollama serve
Setting it to 1 ensures only one model is loaded at a time, which frees memory used by any previously loaded model before the new one loads.
Fix 8: On Apple Silicon — Adjust Memory Allocation
On Mac with Apple Silicon (M1/M2/M3/M4), RAM is shared between CPU and GPU. macOS reserves some memory for the OS. If Ollama is failing to load a model that should fit, check that no other apps are consuming the shared memory pool.
You can check unified memory pressure from the terminal:
memory_pressure
Close memory-heavy apps like Chrome, Xcode, or Docker containers before loading large models. On Apple Silicon, a 16 GB MacBook Air can comfortably run a Q4 7B model, but may struggle with 13B+ models alongside other apps.
Choosing the Right Model for Your Hardware
| Available VRAM / RAM | Largest practical model | Recommended model |
|---|---|---|
| 4 GB | Q4 3B | llama3.2:3b or qwen2.5:3b |
| 6 GB | Q4 7B | llama3.2 or mistral:7b |
| 8 GB | Q4 7B (with headroom) | llama3.2 or qwen2.5:7b |
| 12 GB | Q4 13B | llama3.1:8b or mistral-nemo |
| 16 GB | Q4 14B | qwen2.5:14b or deepseek-coder-v2:16b |
| 24 GB | Q4 32B | qwen2.5:32b or llama3.3:70b-q2_K |
| 48 GB+ | Q4 70B | llama3.3:70b or qwen2.5:72b |
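If you script model selection, the table collapses into a simple tier check (a sketch — the model names mirror the recommendations above and are illustrative, not the only options):

```shell
# Pick a default model tag from available VRAM/RAM in GB.
pick_model() {
  local gb=$1
  if   [ "$gb" -ge 48 ]; then echo "llama3.3:70b"
  elif [ "$gb" -ge 24 ]; then echo "qwen2.5:32b"
  elif [ "$gb" -ge 16 ]; then echo "qwen2.5:14b"
  elif [ "$gb" -ge 6 ];  then echo "llama3.2"
  else                        echo "llama3.2:3b"
  fi
}

pick_model 8    # -> llama3.2
```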
Diagnosing Memory Issues with Ollama Logs
Enable verbose logging to see exactly what Ollama is doing when a model fails to load:
OLLAMA_DEBUG=1 ollama serve 2>&1 | tee ollama.log
Then attempt to load the model in another terminal. The log will show layer allocation, VRAM usage per layer, and where the failure occurs.


