How to Run Qwen3 on Ollama: All Sizes, Thinking Mode

Alibaba released Qwen3 on April 28, 2025, and it has become one of the most-pulled model families on Ollama. The headline feature is a built-in thinking mode: the kind of step-by-step reasoning you get from DeepSeek R1, but switchable on or off mid-conversation without loading a different model. This guide covers every size, the hardware you need, and how to get it running locally in minutes.

What Is Qwen3?

Qwen3 is the third generation of Alibaba’s open-weight large language model series. It ships as eight separate models — six dense architectures and two Mixture-of-Experts (MoE) models — all available under the Apache 2.0 licence, meaning free for personal and commercial use.

The standout feature is a unified thinking framework. Previous reasoning models required a separate dedicated model (DeepSeek R1, QwQ). Qwen3 integrates both fast response mode and slow deep-reasoning mode into every model in the family. You switch between them with a single command.

Qwen3 Model Sizes and Hardware Requirements

Ollama hosts the full Qwen3 family. The default tag pulls the 8B model, which suits most setups with a mid-range GPU or 16 GB of RAM. Approximate minimum memory for each tag:

Model	Type	Min VRAM / RAM	Best For
qwen3:0.6b	Dense	4 GB	Very low-end hardware, quick tests
qwen3:1.7b	Dense	4 GB	Raspberry Pi, older PCs
qwen3:4b	Dense	4–6 GB	Budget GPUs, fast responses
qwen3:8b	Dense	6–8 GB	Default — best balance for most users
qwen3:14b	Dense	10–12 GB	RTX 3060/4060 12 GB, M2/M3 Mac
qwen3:32b	Dense	20–24 GB	High-end GPU, Mac Studio
qwen3:30b-a3b	MoE	20 GB	Efficient — 30B quality, 3B active cost
qwen3:235b-a22b	MoE	128 GB+	Server-grade, flagship quality

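The 30B-A3B MoE model is worth highlighting. Mixture-of-Experts means only about 3 billion of its 30 billion parameters are active per token during inference, so it runs at roughly 8B speed while producing quality closer to a 30B dense model. All 30 billion parameters still have to be held in memory, which is why the requirement is about 20 GB rather than the 6–8 GB of the 8B. If you have 20 GB of VRAM or unified memory, this is worth trying over the 8B.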
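The memory-versus-speed trade-off can be sketched with some rough arithmetic. This is only an approximation: the figure of about 5 bits per weight is an assumption inferred from this guide's "around 5 GB" download size for the 8B model, and the hypothetical `estimate_weights_gb` helper ignores the KV cache and runtime overhead, so real usage will be higher.

```shell
#!/usr/bin/env bash
# Rough weight-memory estimate: parameters (billions) x bits per weight / 8 = GB.
# Ignores KV cache and runtime overhead; treat the result as a lower bound.
estimate_weights_gb() {
  # $1 = parameters in billions, $2 = assumed bits per weight after quantization
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Dense 8B vs. the 30B-A3B MoE, both at an assumed 5 bits per weight.
echo "qwen3:8b weights resident:        $(estimate_weights_gb 8 5) GB"
echo "qwen3:30b-a3b weights resident:   $(estimate_weights_gb 30 5) GB"
echo "qwen3:30b-a3b weights used/token: $(estimate_weights_gb 3 5) GB"
```

The second and third lines show why the MoE model needs the memory of a 30B model but does the per-token work of a much smaller one.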

How to Run Qwen3 on Ollama

First, make sure Ollama is installed and running. Then pull and run your chosen size:

# Default (8B model)
ollama run qwen3

# Specific sizes
ollama run qwen3:4b
ollama run qwen3:14b
ollama run qwen3:32b

# MoE model — 30B quality, efficient inference
ollama run qwen3:30b-a3b

Ollama will download the model on first run. The 8B model is around 5 GB; the 14B is around 9 GB.
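If you would rather download ahead of time than wait on first run, a small wrapper around `ollama list` and `ollama pull` can help. This is a minimal sketch: `has_model` and `ensure_model` are hypothetical helper names, and the parsing assumes the tag appears in the first column of the `ollama list` output under a single header row. Note that a model pulled with the bare name is listed as `qwen3:latest`, so pass the tag exactly as it is listed.

```shell
#!/usr/bin/env bash
# Succeeds if the given tag appears in the first column of `ollama list`
# output supplied on stdin (the header row is skipped).
has_model() {
  awk -v tag="$1" 'NR > 1 && $1 == tag { found = 1 } END { exit found ? 0 : 1 }'
}

# Hypothetical helper: pull the tag only when it is not already present.
ensure_model() {
  if ollama list | has_model "$1"; then
    echo "$1 already downloaded"
  else
    ollama pull "$1"
  fi
}

# Usage (requires a running Ollama install):
#   ensure_model qwen3:14b
```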

Thinking Mode — How It Works

Thinking mode is Qwen3’s most important feature. When enabled, the model works through a problem step by step before giving its final answer — the same approach that makes DeepSeek R1 strong at reasoning, maths, and code. When disabled, it responds instantly like a standard chat model.

All Qwen3 models run with thinking mode on by default in Ollama. You can control it in three ways:

From the command line at launch:

# Force thinking mode on
ollama run qwen3 --think

# Force thinking mode off (faster responses)
ollama run qwen3 --think=false

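During a chat session, with Ollama's session commands:

/set think
/set nothink

Qwen3 also recognises the soft switches /think and /no_think when they are added to the end of a message.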

Via the API:

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "think": false,
  "messages": [{"role": "user", "content": "Summarise this in one sentence."}]
}'
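When the prompt comes from a script, hand-writing the JSON gets fragile. The sketch below builds the same request body from shell variables. The helper names `json_escape` and `build_chat_payload` are hypothetical, the escaping only covers backslashes and double quotes in a single-line prompt, and `"stream": false` is added so the server returns one JSON object instead of a stream of chunks.

```shell
#!/usr/bin/env bash
# Escape backslashes first, then double quotes. Assumes single-line input;
# newlines and control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# $1 = model tag, $2 = true|false for thinking mode, $3 = user prompt
build_chat_payload() {
  printf '{"model":"%s","think":%s,"stream":false,"messages":[{"role":"user","content":"%s"}]}' \
    "$1" "$2" "$(json_escape "$3")"
}

# Usage (requires a running Ollama server on the default port):
#   build_chat_payload qwen3 false "Summarise this in one sentence." \
#     | curl -s http://localhost:11434/api/chat -d @-
```

For prompts with newlines or other special characters, a proper JSON tool is a safer choice than this hand-rolled escaping.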

A practical approach: turn thinking off for quick questions, summaries, and drafting. Switch it back on for code debugging, maths problems, logic tasks, or anything where you want the model to reason carefully before answering.

Thinking Budget Control

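Qwen3 also supports a thinking budget: a cap on how many tokens the model spends reasoning before it gives its answer. This is useful when you want some reasoning depth but do not want to wait for an exhaustive chain of thought. Ollama's documented chat API controls thinking through the think field, so treat the thinking.budget_tokens field in the request below as unverified and check it against the API reference for your Ollama version: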

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "thinking": {"budget_tokens": 1024},
  "messages": [{"role": "user", "content": "Debug this Python function."}]
}'

Higher budget = more thorough reasoning, slower response. Lower budget = quicker but shallower. The default (uncapped) is fine for most use cases.

Which Qwen3 Size Should You Use?

A practical guide based on your hardware:

8 GB RAM / integrated graphics — qwen3:4b without thinking mode. Functional for chat and drafting.
16 GB RAM / no discrete GPU — qwen3:8b, CPU inference. Slow but capable. A Mac with 16 GB unified memory will run this well.
RTX 3060/4060 (12 GB VRAM) — qwen3:14b is the sweet spot. Excellent reasoning at fast speed.
RTX 4090 / Mac M2 Pro 32 GB — qwen3:32b or qwen3:30b-a3b. Near-frontier quality locally.
Home server with 32+ GB RAM — qwen3:30b-a3b via CPU offloading, or qwen3:32b if RAM allows.
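The recommendations above can be sketched as a small lookup. This is an illustration rather than a rule: `pick_qwen3_tag` is a hypothetical helper, its thresholds are the upper "Min VRAM / RAM" figures from the hardware table earlier in this guide, and the input is the memory actually available to the model (VRAM or unified memory) in whole gigabytes, not total system RAM.

```shell
#!/usr/bin/env bash
# Map available model memory (whole GB) to the largest Qwen3 tag whose
# table requirement fits. Thresholds are taken from this guide's table.
pick_qwen3_tag() {
  local mem_gb="$1"
  if   [ "$mem_gb" -ge 128 ]; then echo "qwen3:235b-a22b"
  elif [ "$mem_gb" -ge 24 ];  then echo "qwen3:32b"
  elif [ "$mem_gb" -ge 20 ];  then echo "qwen3:30b-a3b"
  elif [ "$mem_gb" -ge 12 ];  then echo "qwen3:14b"
  elif [ "$mem_gb" -ge 8 ];   then echo "qwen3:8b"
  elif [ "$mem_gb" -ge 6 ];   then echo "qwen3:4b"
  elif [ "$mem_gb" -ge 4 ];   then echo "qwen3:1.7b"
  else                             echo "qwen3:0.6b"
  fi
}

# Usage (requires Ollama):
#   ollama run "$(pick_qwen3_tag 12)"
```

On a CPU-only machine the same tag will load but run far slower, so stepping down one size is a reasonable adjustment.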

Qwen3 vs DeepSeek R1 vs Llama 4 Scout

	Qwen3 8B	DeepSeek R1 8B	Llama 4 Scout
Thinking mode	Yes — switchable	Always on	No
Licence	Apache 2.0	MIT	Llama 4 Community
Coding	Strong	Strong	Good
Multilingual	Excellent (119 languages)	Good	Good
Best for	General + reasoning	Reasoning tasks	Long context, vision

The key advantage of Qwen3 over DeepSeek R1 is flexibility — one model handles both quick responses and deep reasoning. DeepSeek R1 is always in reasoning mode, which is powerful but slow for simple tasks.

Qwen3 for Everyday Use

Beyond the technical benchmarks, Qwen3 8B with thinking mode off is a very capable everyday model — fast, accurate, and good at following instructions. Thinking mode makes it a genuine competitor to much larger models for technical tasks. The combination in one download is what makes it worth switching to if you are currently running Llama 3.3 or Mistral as your daily driver.
