Home / AI / Ollama / Ollama Thinking Mode Explained: How to Enable, Disable, and Control Reasoning

Ollama Thinking Mode Explained: How to Enable, Disable, and Control Reasoning

Ollama Thinking Mode Explained: How to Enable, Disable, and Control Reasoning

When Qwen3 arrived in April 2026, a lot of Ollama users were immediately confused. Responses were suddenly much longer, with elaborate reasoning traces appearing before the actual answer. Nothing was broken. Qwen3 ships with Ollama thinking mode enabled by default, and it was doing exactly what it was designed to do. This guide explains what thinking mode actually is, which models support it, how to control it across every interface Ollama provides, and when you should turn it off entirely.

What Is Thinking Mode in Ollama?

Thinking mode is Ollama’s implementation of chain-of-thought reasoning. When enabled, a model works through a problem step by step before producing a final answer. The internal reasoning process, sometimes called the reasoning trace or scratchpad, is visible in the output and separated from the final response.

This is not a new concept. DeepSeek R1’s chain-of-thought reasoning was always active with no way to turn it off. Qwen3 changed that. It introduced a proper toggle so you can enable thinking for complex tasks and disable it when you just want a fast, direct answer. That flexibility is the key difference between the two approaches.

In practice, a thinking-mode response outputs a <think> block first, working through the model’s reasoning, then delivers the final answer. With thinking disabled, you get the answer directly with no preamble.

Which Ollama Models Support Thinking Mode?

Thinking mode is currently part of the Qwen3 model family. All Qwen3 sizes support the feature, and thinking is enabled by default across all of them. Here is a reference table covering the available sizes and what to expect on typical hardware:

ModelMin RAM (thinking on)Thinking qualityBest for
qwen3:0.6b2 GBBasicTesting only
qwen3:1.7b3 GBLimitedLow-power devices
qwen3:4b5 GBModerateEveryday reasoning on 8 GB machines
qwen3:8b8 GBGoodMost users, best value
qwen3:14b12 GBStrongComplex coding and maths
qwen3:32b24 GBVery strongDemanding reasoning tasks
qwen3:235b128 GB+ExcellentServer-grade hardware only

If you have been using Qwen2.5 on Ollama, Qwen3 is a significant step up in reasoning capability. For most users on consumer hardware, qwen3:8b gives the best balance of quality and speed. The 0.6B and 1.7B models can think, but the reasoning depth at those sizes is limited. They are useful for testing the feature rather than relying on it for serious work.

Models outside the Qwen3 family, such as Llama 3, Mistral, and Phi-4, do not support thinking mode. Passing think: true to one of these models has no effect. The think parameter is silently ignored rather than throwing an error, so always verify your model is in the Qwen3 family before relying on it.

How to Enable or Disable Ollama Thinking Mode

Ollama gives you three ways to control thinking mode: a command-line flag when starting a session, a command within an interactive session, and a parameter in API calls. All three are straightforward once you know where to look.

CLI: the –think and –nothink flags

The simplest approach is to pass a flag when you run the model:

ollama run qwen3:8b --think

This enables thinking mode explicitly. Since thinking is on by default for Qwen3, this flag is mainly useful for clarity or for future models where the default might differ.

To disable thinking and get direct answers without reasoning traces:

ollama run qwen3:8b --nothink

With --nothink, the model skips the reasoning trace entirely and responds like a standard language model. For quick questions, summarisation, or any task where speed matters more than depth, this is the flag to use.

Interactive session: /set think and inline switching

If you are already inside an Ollama interactive session, you can toggle thinking mode without restarting:

/set think on
/set think off

You can also switch on a per-message basis using inline commands at the start of your prompt. Place /think at the beginning of a message to enable thinking for that response only, or /nothink to disable it for that message:

/think Explain the algorithmic complexity of quicksort versus mergesort
/nothink What is the capital of France?

This is particularly useful in multi-turn conversations where you want deep reasoning for some questions and fast answers for others, all within the same session without restarting.

API: the think parameter

When calling Ollama via the REST API, add a think field to your request body:

curl http://localhost:11434/api/generate 
  -d '{
    "model": "qwen3:8b",
    "prompt": "A train travels at 60 mph for 2.5 hours. How far does it travel?",
    "think": true
  }'

To disable thinking in an API call:

curl http://localhost:11434/api/generate 
  -d '{
    "model": "qwen3:8b",
    "prompt": "Summarise this paragraph in one sentence.",
    "think": false
  }'

In Python using the Ollama library:

import ollama

response = ollama.generate(
    model='qwen3:8b',
    prompt='Walk me through solving this calculus problem step by step.',
    think=True
)

print(response['thinking'])   # the reasoning trace
print(response['response'])   # the final answer

The response object includes a separate thinking field containing the reasoning trace and a response field with the final answer, so you can use each independently in your application. If you are generating structured JSON output from Ollama, pass think: false explicitly, since thinking mode can interfere with strict schema adherence.

The –hidethinking Flag

There is a third flag that most guides skip over entirely: --hidethinking. It works alongside --think and does something subtly different from simply disabling reasoning.

With --hidethinking, the model still performs its full chain-of-thought reasoning internally. You still get all the depth and accuracy benefits of thinking mode. The reasoning trace is simply not included in the output. Only the final answer is returned.

ollama run qwen3:8b --think --hidethinking

This is particularly valuable when building production applications or local APIs where end users should only see clean answers. The model reasons just as thoroughly, but the thinking stays private.

Think of it this way: --think enables the reasoning process and --hidethinking controls whether that process is visible in the output. If you want reasoning quality without reasoning traces appearing in your application’s response, these two flags work together to achieve exactly that.

When to Use Thinking Mode and When to Turn It Off

Thinking mode is not always the right choice. The core tradeoff is quality versus speed. When thinking is enabled, the model generates a reasoning trace before answering. That can add hundreds or thousands of tokens to every response, and on slower hardware the delay is noticeable. On a machine with 8 GB of VRAM running qwen3:8b, a simple question might take 2 seconds without thinking and 5 to 8 seconds with it enabled.

Task typeThinking modeReason
Maths and problem solvingEnableStep-by-step reasoning dramatically improves accuracy
Complex coding tasksEnableModel works through logic before writing code
Multi-step analysis or planningEnableStructured reasoning catches errors early
Debugging and code reviewEnableReasoning traces reveal how the model reads your code
Simple factual questionsDisableNo reasoning benefit, adds unnecessary latency
Summarisation or classificationDisableThe task does not benefit from chain-of-thought
Chat and conversational useDisableResponses feel unnatural with visible reasoning traces
Production APIDisable or use –hidethinkingLatency is the priority; use –hidethinking to keep quality

Ollama’s best models for reasoning and maths tasks all benefit from thinking mode being enabled for those specific use cases. For everything else, the overhead is rarely worth it. The quality improvement that thinking mode provides is genuinely meaningful for tasks that involve multi-step logic. For a casual question, you will not notice any difference in output quality and you will certainly notice the wait.

Disabling Thinking Mode by Default in a Modelfile

If you find yourself always running Qwen3 with --nothink, you can make that the default by creating a Modelfile. This saves you from passing the flag every time you start a session, and it works cleanly in UI tools like Open WebUI where you cannot pass raw API parameters per request.

FROM qwen3:8b

PARAMETER nothink true

Save that as a file called Modelfile, then build a custom model from it:

ollama create qwen3-fast -f Modelfile
ollama run qwen3-fast

Thinking mode will now be off by default for that model. You can still override it per-session with --think if you need reasoning on a particular task. Create the model once, then select it from any Ollama-compatible interface and it will behave consistently without any extra configuration.

Thinking mode is one of the most significant additions to Ollama in recent months, and Qwen3 is the model that made it genuinely practical. Whether you leave it on for deep analysis, turn it off for speed, or hide the trace with --hidethinking for production use, knowing how to control it puts you in a much stronger position than relying on the defaults. For a broader overview of how Ollama works and the full range of models it supports, the complete Ollama guide covers everything you need.