Ollama Context Window: How to Set num_ctx

Ollama’s default context window is 4,096 tokens — roughly 3,000 words. When a conversation, document, or agent loop exceeds that limit, Ollama silently truncates from the beginning with no warning, no error, and no indication that anything went wrong. This is the root cause of one of the most common Ollama complaints: “the model suddenly forgot what we were talking about.” This guide explains how context windows work, how to check and change the limit using three different methods, what the hardware cost is, and what context size you actually need for different workloads.

What Is the Ollama Context Window?

The context window — controlled by the num_ctx parameter — defines the maximum number of tokens Ollama will hold in memory at once across a conversation. Tokens are not words: roughly 1,000 tokens equals 750 words in English, or about 1–1.5 pages of text. Everything the model can “see” at any moment — the system prompt, the full conversation history, and the space for the next response — must fit within this limit.
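The words-to-tokens rule of thumb above can be turned into a quick pre-flight check. This is a minimal sketch: `estimate_tokens` and `fits_in_context` are hypothetical helper names, the 750-words-per-1,000-tokens ratio is the approximation quoted in the text, and a real tokenizer will give different counts, especially for code or non-English text.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer).

    Uses the rule of thumb that 1,000 tokens is about 750 English words.
    """
    words = len(text.split())
    # Ceiling division: words * 1000 / 750, rounded up.
    return -(-words * 1000 // 750)


def fits_in_context(parts: list[str], num_ctx: int, reserve_for_reply: int = 512) -> bool:
    """True if system prompt + history (parts) plus a reply reserve fit in num_ctx."""
    used = sum(estimate_tokens(p) for p in parts)
    return used + reserve_for_reply <= num_ctx
```

A 3,000-word document estimates to about 4,000 tokens, so with any space reserved for the reply it no longer fits in a 4,096-token window.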

When the cumulative token count exceeds num_ctx, Ollama does not throw an error. It quietly drops the oldest messages from the context to make room. The model continues responding, but without access to the earlier conversation. From the user’s perspective, the model just forgot. Understanding this behaviour is the first step to fixing it.
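The effect of oldest-first truncation can be illustrated with a short simulation. This is a sketch of the general technique, not Ollama's actual code: `truncate_oldest` and the `count_tokens` callback are hypothetical, and keeping system messages while dropping older turns is an assumption of this example.

```python
from typing import Callable

Message = dict[str, str]


def truncate_oldest(
    messages: list[Message],
    num_ctx: int,
    count_tokens: Callable[[Message], int],
    reserve_for_reply: int = 256,
) -> list[Message]:
    """Keep system messages plus the newest turns that fit within num_ctx."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = num_ctx - reserve_for_reply - sum(count_tokens(m) for m in system)

    kept: list[Message] = []
    used = 0
    # Walk from newest to oldest and stop at the first message that no longer fits.
    for message in reversed(rest):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost

    kept.reverse()  # restore chronological order
    return system + kept
```

Comparing `len(messages)` with the length of the returned list shows how many early turns the model can no longer see, which is exactly the information the runtime does not surface.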

What Is Ollama’s Default Context Window?

There are two answers, depending on how you look at it. The Ollama Modelfile specification defines a baseline default of 2,048 tokens. In practice, the runtime sets the default dynamically based on how much VRAM is available:

| Available VRAM | Runtime default num_ctx |
| --- | --- |
| Less than 24 GiB | 4,096 tokens |
| 24–48 GiB | 32,768 tokens |
| 48 GiB or more | 262,144 tokens (256k) |

Most desktop and laptop users land in the first tier — 4,096 tokens — which is comfortably enough for short chats but will truncate anything longer. The conservative default exists to prevent Ollama from running the model out of VRAM without warning.
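The tiers in the table can be written as a small lookup. This is only a sketch that mirrors the table, not Ollama's source: `default_num_ctx` is a hypothetical name, and treating exactly 24 GiB as the middle tier and exactly 48 GiB as the top tier is an assumption based on the table's wording.

```python
GIB = 1024 ** 3


def default_num_ctx(vram_bytes: int) -> int:
    """Return the runtime default context size for a given amount of VRAM."""
    vram_gib = vram_bytes / GIB
    if vram_gib < 24:
        return 4096
    if vram_gib < 48:
        return 32768
    return 262144


# Example: a 16 GiB GPU lands in the first tier.
print(default_num_ctx(16 * GIB))
```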

How to Check the Context Window Your Model Is Using

Before changing anything, confirm what is currently set.

Check the Modelfile configuration

ollama show --modelfile llama3.1:8b

Look for a PARAMETER num_ctx line. If it is absent, the model is using the runtime default for your hardware.

Check running models and active allocation

ollama ps

The CONTEXT column shows the number of tokens actually allocated for models currently loaded in memory. This is the most reliable way to confirm what is in effect at runtime.

Inside the interactive REPL

/show info

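This displays the model's details and any parameters set on it. Note that the context length field listed under the model details is the model's trained maximum, not the active num_ctx; an explicit num_ctx setting appears in the parameters section, and ollama ps remains the authoritative view of what is allocated.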

Three Ways to Increase the Ollama Context Window

There are three methods, each suited to a different scenario. Knowing the precedence order matters: API parameter > Modelfile PARAMETER > environment variable > built-in default. A per-request setting always wins, and a num_ctx baked into a Modelfile overrides the server-wide environment variable.

Method 1: OLLAMA_CONTEXT_LENGTH — server-wide default

Set this environment variable when starting Ollama to change the default for every model on the server:

OLLAMA_CONTEXT_LENGTH=32768 ollama serve

To make it permanent via systemd:

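sudo systemctl edit ollama.service

Add these lines to the override file that opens in your editor:

[Service]
Environment="OLLAMA_CONTEXT_LENGTH=32768"

Then reload systemd and restart the service:

sudo systemctl daemon-reload && sudo systemctl restart ollama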

When to use this: you want a consistent default across all models without modifying each one individually. Useful on a dedicated Ollama server where every workload benefits from a larger context.

Note on naming: older guides and community posts use OLLAMA_NUM_CTX — this was the variable name in earlier Ollama versions. OLLAMA_CONTEXT_LENGTH is the current official name. Both may work depending on your version; use OLLAMA_CONTEXT_LENGTH for new setups.

Method 2: Modelfile PARAMETER num_ctx — permanent per-model

Create a Modelfile that bakes the context size into a named model variant:

FROM llama3.1:8b
PARAMETER num_ctx 32768

Save that as a file named Modelfile, then build and run the variant from the same directory:

ollama create llama3.1-32k -f Modelfile
ollama run llama3.1-32k

This survives server restarts and process kills. Any tool that references this model name (Open WebUI, Cline, Continue.dev, your own scripts) automatically gets the extended context.

When to use this: a specific model variant always needs a specific context size, and you want a clean named model you can reference consistently.

Method 3: API options — per-request override

Pass num_ctx in the options object to override the context window for a single request:

curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Summarise this document: ..."}],
    "options": {"num_ctx": 16384}
  }'

Using the Python library:

import ollama

response = ollama.chat(
    model='llama3.1:8b',
    messages=[{'role': 'user', 'content': 'Summarise this document...'}],
    options={'num_ctx': 16384}
)

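The OpenAI-compatible endpoint is different. The OpenAI request schema has no field for context size, and Ollama's OpenAI compatibility documentation says to set it through a Modelfile instead, so a num_ctx field in the request body should not be relied on. Point the request at a model variant that has the context size baked in, such as the llama3.1-32k model created in Method 2:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1-32k",
    "messages": [{"role": "user", "content": "..."}]
  }'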

When to use this: different tasks in the same application need different context sizes — a quick Q&A needs 4k, a document summarisation needs 32k. Setting per-request lets you use the smallest context that works, preserving VRAM and keeping responses fast. For a full reference on API parameters, see the Ollama REST API complete developer guide.
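One way to apply the "smallest context that works" idea is to round the estimated need up to the next standard size before each request. The sketch below is illustrative: `pick_num_ctx`, the step ladder, and the default reply reserve are assumptions rather than anything defined by Ollama.

```python
CONTEXT_STEPS = (4096, 8192, 16384, 32768, 65536, 131072)


def pick_num_ctx(prompt_tokens: int, reply_tokens: int = 1024, model_max: int = 131072) -> int:
    """Return the smallest step that holds the prompt plus the expected reply."""
    needed = prompt_tokens + reply_tokens
    for size in CONTEXT_STEPS:
        if size > model_max:
            break  # never go past what the model was trained for
        if needed <= size:
            return size
    raise ValueError(
        f"{needed} tokens do not fit within the model maximum of {model_max}"
    )
```

The result can be passed as `options={'num_ctx': pick_num_ctx(n)}` in the `ollama.chat` call shown earlier, where `n` is your token estimate for the prompt and history. Keep in mind that a request with a different num_ctx from the one currently loaded may cause the model to be reloaded, so a small number of fixed steps is preferable to arbitrary values.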

Temporary change in the REPL

Inside an interactive Ollama session, you can set context for the current session only:

/set parameter num_ctx 32768

Add /save mymodel-32k immediately after to persist it as a named model.

How Much Context Do You Actually Need?

More context is not always better — it consumes VRAM and slows inference. Use the smallest context that comfortably fits your workload:

| Use Case | Recommended num_ctx | Reason |
| --- | --- | --- |
| General chat / Q&A | 4,096–8,192 | Fits typical back-and-forth; fast responses |
| Document summarisation | 16,384–32,768 | Fits most PDFs and long articles in one pass |
| RAG pipelines | 8,192–16,384 | Retrieved chunks + query + history + response |
| Coding assistants (Cline, Roo Code) | 32,768+ | Long file contexts require space for the full file |
| AI agents with tool calling | 64,000+ | Ollama’s own docs explicitly recommend ≥64k for agents |
| Long-context models at full capacity | Up to model max | Only set this high if your VRAM supports it |

The 64k recommendation for agents is important: if you use OpenCode, Cline, or any tool-calling agent with Ollama, the default 4,096 context can cause tool calls to fail partway through a task, because tool definitions and accumulated tool output push the earlier instructions out of the window. Setting 64k is often the most impactful single fix for agent reliability. See the Ollama + OpenCode setup guide for a walkthrough of the Modelfile approach applied to a coding agent.

VRAM and Performance: The Trade-off

Context window size has a direct, linear cost in VRAM. The model’s weights are fixed, but the KV (key-value) cache — where the context is stored — grows with every token you add to the window. Doubling the context roughly doubles the KV cache size.

For a concrete example with a 7–8B model at Q4_K_M quantisation:

  • Model weights: ~4–5 GiB (fixed)
  • KV cache at 4,096 tokens: ~0.25 GiB
  • KV cache at 32,768 tokens: ~2 GiB
  • KV cache at 131,072 tokens: ~8 GiB
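The linear growth in the list above follows from the shape of the cache: one key and one value vector per token, per layer. The sketch below assumes a Llama-3.1-8B-like shape (32 layers, 8 KV heads, head dimension 128) and roughly one byte per cached value (an 8-bit cache); those numbers are assumptions chosen for illustration, and `kv_cache_bytes` is a hypothetical helper.

```python
GIB = 1024 ** 3


def kv_cache_bytes(
    num_ctx: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int,
) -> int:
    """Size of the KV cache for a full window of num_ctx tokens."""
    # 2 = one key tensor plus one value tensor in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_ctx


for num_ctx in (4096, 32768, 131072):
    size = kv_cache_bytes(num_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=1)
    print(f"{num_ctx:>7} tokens: {size / GIB:.2f} GiB")
```

Under these assumptions the loop prints 0.25, 2.00 and 8.00 GiB. With a 16-bit cache (`bytes_per_value=2`) every figure doubles, so treat any single set of numbers as model- and configuration-specific.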

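Speed also degrades as context grows. Each generated token has to attend to every token already in the KV cache, so per-token work rises with the amount of context in use, and processing a long prompt up front scales roughly quadratically with its length. Example throughput on the same hardware: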

  • 4,096 context: ~86 tokens/sec
  • 32,768 context: ~43 tokens/sec
  • 131,072 context: ~9 tokens/sec

The critical threshold to watch: if the KV cache exceeds available VRAM, Ollama offloads it to CPU RAM. This causes inference speed to drop from 50–100 tokens/sec to 2–5 tokens/sec — a 20–50× slowdown. If responses suddenly become very slow after increasing context, your VRAM budget has been exceeded.

Maximum Context Windows by Model

Setting num_ctx above a model’s trained maximum degrades output quality — the model was never trained to handle positions beyond that length. Here are the practical limits for common models:

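| Model | Max Trained Context |
| --- | --- |
| Llama 3.1 (8B, 70B, 405B) | 128,000 tokens |
| Llama 3.2 (1B, 3B) | 128,000 tokens |
| Qwen2.5 (7B and larger) | 128,000 tokens |
| Qwen2.5 (0.5B, 1.5B, 3B) | 32,768 tokens |
| Qwen3 (8B) | ~40,960 tokens |
| Mistral 7B v0.3 | 32,000 tokens |
| Mistral Small / Large | 128,000 tokens |
| Gemma 3 (4B, 12B, 27B) | 131,072 tokens |
| Gemma 3 (1B) | 32,768 tokens |
| DeepSeek-R1 | 128,000 tokens |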

Stay at or below these figures. Setting a 70B Llama model to num_ctx 200000 does not give you 200k working context — it just wastes VRAM on positions the model cannot reason about reliably.
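A simple guard can enforce this before a value ever reaches Ollama. This is a sketch: `clamp_num_ctx` is a hypothetical helper, and the dictionary holds only two illustrative entries whose tags are assumptions; in practice you would fill it from the model card of whatever you run.

```python
# Illustrative entries only; check the model card for the models you use.
TRAINED_MAX = {
    "llama3.1:8b": 128_000,
    "mistral:7b": 32_000,
}


def clamp_num_ctx(model: str, requested: int) -> int:
    """Cap the requested context at the model's trained maximum, if known."""
    limit = TRAINED_MAX.get(model)
    if limit is None:
        return requested  # unknown model: pass the request through unchanged
    return min(requested, limit)


print(clamp_num_ctx("llama3.1:8b", 200_000))
```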

Keeping Your System Prompt Lean

Every token in your system prompt counts against num_ctx before the conversation even begins. A verbose system prompt that runs 2,000 tokens eats nearly half your context on a 4,096 default, leaving only ~2,000 tokens for actual conversation — enough for three or four exchanges before truncation starts.

Practical guidelines for system prompt size:

  • General chat models: keep the system prompt under 200 tokens (~150 words). Include only essential persona or behavioural instructions.
  • Coding assistants: the system prompt can be longer (500–1,000 tokens) if it includes language-specific rules, but increase num_ctx proportionally.
  • Agent loops: minimise the system prompt and use tool descriptions efficiently — each tool definition adds tokens. With 10 tools at 200 tokens each, you have already used 2,000 tokens before the first user message.
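The arithmetic behind these guidelines is worth making explicit, since the fixed overhead is paid on every request. In this minimal sketch, `conversation_budget` is a hypothetical helper and the figures are the ones used in the text.

```python
def conversation_budget(
    num_ctx: int,
    system_prompt_tokens: int,
    tool_count: int = 0,
    tokens_per_tool: int = 0,
    reply_reserve: int = 0,
) -> int:
    """Tokens left for conversation history after fixed overhead.

    A negative result means the overhead alone already exceeds num_ctx.
    """
    overhead = system_prompt_tokens + tool_count * tokens_per_tool + reply_reserve
    return num_ctx - overhead


# 2,000-token system prompt on the 4,096 default.
print(conversation_budget(4096, 2000))
# Lean 200-token prompt, but 10 tools at 200 tokens each.
print(conversation_budget(4096, 200, tool_count=10, tokens_per_tool=200))
```

The two calls leave 2,096 and 1,896 tokens respectively, which shows that tool definitions can cost as much as a verbose system prompt.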

If context length remains a constraint even after increasing num_ctx, consider Ollama Cloud — cloud models include options with up to 1 million token context windows, removing hardware limitations entirely. The :cloud suffix is the only change needed in your code.

Troubleshooting: Context Setting Not Taking Effect

Environment variable set but model still uses 4,096

The Modelfile PARAMETER num_ctx takes precedence over the environment variable. If the model you are running has PARAMETER num_ctx 4096 baked in, the env var will be overridden. Check with ollama show --modelfile modelname and either remove the PARAMETER from a custom Modelfile or use the API per-request override instead.

Also confirm the env var is exported in the same shell process where ollama serve runs. If Ollama is managed by systemd, the variable must be set in the service override file — not in your interactive shell. Verify what the service sees with sudo systemctl show ollama --property=Environment.
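The override behaviour described in this subsection can be modelled as a first-match lookup: the per-request option, then the Modelfile PARAMETER, then the environment variable, then the built-in default. The function below is a hypothetical model of that ordering for reasoning about your own setup, not code from Ollama, and the 4,096 fallback is the low-VRAM default discussed earlier.

```python
from typing import Optional


def effective_num_ctx(
    request_option: Optional[int],
    modelfile_param: Optional[int],
    env_value: Optional[int],
    builtin_default: int = 4096,
) -> int:
    """Return the first explicitly set value, highest precedence first."""
    for value in (request_option, modelfile_param, env_value):
        if value is not None:
            return value
    return builtin_default


# Env var says 32768, but the Modelfile pins 4096: the Modelfile wins.
print(effective_num_ctx(None, 4096, 32768))
```

This is the situation behind the symptom in the heading: the environment variable is set correctly, yet a PARAMETER line in the model overrides it.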

Context increased but responses are extremely slow

The KV cache has exceeded your VRAM and spilled to CPU RAM. Reduce num_ctx, switch to a more aggressively quantised model (Q4_K_M instead of Q8_0), or add more VRAM. Run ollama ps to see current context allocation alongside model size.

OOM crash when loading model with large context

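VRAM cannot hold both the model weights and the KV cache at the requested context size. Either reduce num_ctx or use a smaller/more quantised model variant. A rough VRAM estimate in GiB: model size + (2 × layers × kv_heads × head_dim × dtype_bytes × num_ctx) / 1024^3, where the 2 covers keys and values and kv_heads × head_dim is the per-layer width of the cache. Leave some headroom on top for compute buffers.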
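A rough estimate of this kind needs the per-token cache width (KV heads × head dimension) as well as the layer count, and it can be inverted to find the largest context a given card can hold. The sketch below assumes a 16-bit cache by default and a flat 1 GiB of headroom for compute buffers; both are assumptions, and `estimate_vram_gib` and `max_num_ctx_that_fits` are hypothetical names.

```python
GIB = 1024 ** 3


def estimate_vram_gib(model_gib, num_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Model weights plus a full KV cache, in GiB."""
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_ctx
    return model_gib + kv_bytes / GIB


def max_num_ctx_that_fits(vram_gib, model_gib, n_layers, n_kv_heads, head_dim,
                          bytes_per_value=2, headroom_gib=1.0):
    """Largest num_ctx whose KV cache fits in the VRAM left after the weights."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    free_bytes = (vram_gib - model_gib - headroom_gib) * GIB
    if free_bytes <= 0:
        return 0  # the weights alone do not fit with the requested headroom
    return int(free_bytes // per_token)


# 12 GiB card, 4.5 GiB of weights, assumed 32 layers x 8 KV heads x 128 dims.
print(max_num_ctx_that_fits(12, 4.5, n_layers=32, n_kv_heads=8, head_dim=128))
```

With these assumed numbers the result is 53,248 tokens, so a request for 65,536 on that card would be expected to spill or fail while 32,768 would fit.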

OLLAMA_NUM_CTX not working on newer Ollama

Rename it to OLLAMA_CONTEXT_LENGTH. The old variable name may not be recognised on Ollama 0.5+ — make sure you are using the current env var name and restart the Ollama service after the change.

If you are unsure which version you are running, check with ollama --version. If you are on an older version, updating to the latest resolves most env var issues and also brings security patches. See how to update Ollama for the full process.
