How to Run Qwen3 on Ollama: All Sizes, Thinking Mode and Hardware Guide

Alibaba released Qwen3 on 28 April 2025, and it has become one of the most-pulled model families on Ollama. The headline feature is a built-in thinking mode: the same kind of step-by-step reasoning you get from DeepSeek R1, but switchable on or off mid-conversation without loading a different model. This guide covers every size, the hardware you need, and how to get it running locally in minutes.

What Is Qwen3?

Qwen3 is the third generation of Alibaba’s open-weight large language model series. It ships as eight separate models — six dense architectures and two Mixture-of-Experts (MoE) models — all available under the Apache 2.0 licence, meaning free for personal and commercial use.

The standout feature is a unified thinking framework. Previously, step-by-step reasoning meant running a separate dedicated model such as DeepSeek R1 or QwQ. Qwen3 integrates both a fast response mode and a slow deep-reasoning mode into every model in the family. You switch between them with a single command.

Qwen3 Model Sizes and Hardware Requirements

Ollama hosts the full Qwen3 family. The default tag pulls the 8B model, which suits most setups with a mid-range GPU or 16 GB of RAM:

Model           | Type  | Min VRAM / RAM | Best For
----------------|-------|----------------|---------------------------------------------
qwen3:0.6b      | Dense | 4 GB           | Very low-end hardware, quick tests
qwen3:1.7b      | Dense | 4 GB           | Raspberry Pi, older PCs
qwen3:4b        | Dense | 4–6 GB         | Budget GPUs, fast responses
qwen3:8b        | Dense | 6–8 GB         | Default; best balance for most users
qwen3:14b       | Dense | 10–12 GB       | RTX 3060/4060 12 GB, M2/M3 Mac
qwen3:32b       | Dense | 20–24 GB       | High-end GPU, Mac Studio
qwen3:30b-a3b   | MoE   | 20 GB          | Efficient: 30B quality, 3B active cost
qwen3:235b-a22b | MoE   | 128 GB+        | Server-grade, flagship quality

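The 30B-A3B MoE model is worth highlighting. Mixture-of-Experts means only about 3 billion parameters are active per token during inference, so the compute per token is comparable to a small dense model while quality is closer to a 30B dense model. The catch is memory: every expert still has to be loaded, which is why it needs around 20 GB rather than the few gigabytes a 3B model would. If you have 20 GB of VRAM or unified memory, this is worth trying over the 8B.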
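
The memory side of that trade-off can be checked with rough arithmetic. This is a sketch under stated assumptions: the parameter counts (about 30.5B total for the 30B-A3B model and 8.2B for the 8B) and the figure of roughly 4.8 bits per weight for a 4-bit quantisation are approximations, and the estimate ignores the KV cache and runtime overhead.

```shell
# Estimate weight memory in GB: parameters (billions) x bits per weight / 8.
# Ignores KV cache and runtime overhead, so real usage is somewhat higher.
weights_gb() {
  LC_ALL=C awk -v params_b="$1" -v bits="$2" \
    'BEGIN { printf "%.1f\n", params_b * bits / 8 }'
}

# Assumed figure: ~4.8 bits per weight for a 4-bit quantisation.
weights_gb 30.5 4.8   # 30B-A3B, all experts resident: prints 18.3
weights_gb 8.2 4.8    # 8B dense, for comparison: prints 4.9
```

Both results line up with the sizes quoted in this guide: roughly 20 GB for the MoE model and roughly 5 GB for the 8B.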

How to Run Qwen3 on Ollama

First, make sure Ollama is installed and running. Then pull and run your chosen size:

# Default (8B model)
ollama run qwen3

# Specific sizes
ollama run qwen3:4b
ollama run qwen3:14b
ollama run qwen3:32b

# MoE model — 30B quality, efficient inference
ollama run qwen3:30b-a3b

Ollama will download the model on first run. The 8B model is around 5 GB; the 14B is around 9 GB.
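
If you would rather download ahead of time than wait on first run, `ollama pull` fetches a model without opening a chat, and `ollama list` shows what is already installed. The helper below is a small sketch: `has_model` is a hypothetical name, and it assumes the tag is the first column of each `ollama list` row.

```shell
# Hypothetical helper: succeed if the given tag appears in `ollama list`.
# Assumes the first output line is a header and the tag is column one.
has_model() {
  ollama list | awk -v tag="$1" \
    'NR > 1 && $1 == tag { found = 1 } END { exit !found }'
}

# Usage (requires Ollama to be installed and running):
# has_model qwen3:14b || ollama pull qwen3:14b
```

Note that a model pulled through the default tag is listed as qwen3:latest, so check for that name rather than qwen3:8b in that case.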

Thinking Mode — How It Works

Thinking mode is Qwen3’s most important feature. When enabled, the model works through a problem step by step before giving its final answer — the same approach that makes DeepSeek R1 strong at reasoning, maths, and code. When disabled, it responds instantly like a standard chat model.

All Qwen3 models run with thinking mode on by default in Ollama. You can control it in three ways:

From the command line at launch:

# Force thinking mode on
ollama run qwen3 --think

# Force thinking mode off (faster responses)
ollama run qwen3 --think=false

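During a chat session, use Ollama's session commands. Qwen3 itself also recognises /think and /no_think as soft switches when they are written inside a prompt, but typed alone on a line they are not Ollama commands:

/set think
/set nothink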

Via the API:

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "think": false,
  "messages": [{"role": "user", "content": "Summarise this in one sentence."}]
}'
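
If you call the API often, the request can be wrapped in a pair of shell functions so the think flag is just an argument. This is a minimal sketch: `qwen3_body` and `qwen3_chat` are hypothetical names, `"stream": false` asks for one JSON response instead of a stream, and the prompt is inserted verbatim, so this simplified form only works for prompts without double quotes or backslashes. With thinking enabled, recent Ollama versions should return the reasoning in a separate `message.thinking` field next to `message.content`, which is worth confirming on your install.

```shell
# Build a /api/chat request body (hypothetical helper, simplified).
# $1 = model tag, $2 = think flag (true or false), $3 = prompt.
# The prompt is not escaped: avoid double quotes and backslashes.
qwen3_body() {
  printf '{"model": "%s", "think": %s, "stream": false, "messages": [{"role": "user", "content": "%s"}]}\n' "$1" "$2" "$3"
}

# Send the body to a local Ollama server on the default port.
qwen3_chat() {
  qwen3_body "$@" | curl -s http://localhost:11434/api/chat -d @-
}

# Usage (requires a running Ollama server):
# qwen3_chat qwen3 false "Summarise this in one sentence."
# qwen3_chat qwen3 true "Is 221 a prime number?"
```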

A practical approach: turn thinking off for quick questions, summaries, and drafting. Switch it back on for code debugging, maths problems, logic tasks, or anything where you want the model to reason carefully before answering.
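
That rule of thumb is easy to encode if you script your prompts. The sketch below maps a task label to a value for the think setting. The label names and the `think_for_task` function are illustrative, not Ollama keywords, and unknown tasks fall back to thinking on, which matches the default described above.

```shell
# Map a task category to a value for the "think" setting.
# The category names are illustrative labels, not Ollama keywords.
think_for_task() {
  case "$1" in
    question|summary|draft) echo false ;;  # quick work: answer immediately
    debug|maths|logic)      echo true ;;   # careful work: reason first
    *)                      echo true ;;   # unknown task: keep the default (on)
  esac
}

think_for_task summary   # prints false
think_for_task debug     # prints true
```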

Thinking Budget Control

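Qwen3 also supports a thinking budget: a cap on how many tokens the model spends reasoning before it gives its answer. This is useful when you want some reasoning depth but do not want to wait for an exhaustive chain of thought. Ollama's documented chat API controls thinking through the think field, and the thinking.budget_tokens shape shown here is not part of that documented set, so verify the request below against your Ollama version before relying on it. An unsupported field may be silently ignored, leaving thinking uncapped: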

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3",
  "thinking": {"budget_tokens": 1024},
  "messages": [{"role": "user", "content": "Debug this Python function."}]
}'

Higher budget = more thorough reasoning, slower response. Lower budget = quicker but shallower. The default (uncapped) is fine for most use cases.

Which Qwen3 Size Should You Use?

A practical guide based on your hardware:

  • 8 GB RAM / integrated graphics — qwen3:4b without thinking mode. Functional for chat and drafting.
  • 16 GB RAM / no discrete GPU — qwen3:8b, CPU inference. Slow but capable. A Mac with 16 GB unified memory will run this well.
  • RTX 3060/4060 (12 GB VRAM) — qwen3:14b is the sweet spot. Strong reasoning at interactive speeds, with the model fitting entirely in VRAM.
  • RTX 4090 / Mac M2 Pro 32 GB — qwen3:32b or qwen3:30b-a3b. Near-frontier quality locally.
  • Home server with 32+ GB RAM — qwen3:30b-a3b via CPU offloading, or qwen3:32b if RAM allows.
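
The list above can be turned into a small picker for setup scripts. This is a sketch: `suggest_qwen3` is a hypothetical helper, and the thresholds are a rough reading of the recommendations in this guide, not measured limits.

```shell
# Suggest a Qwen3 tag. $1 = memory kind (vram or ram), $2 = whole gigabytes.
# Thresholds are a rough reading of this guide, not measured limits.
suggest_qwen3() {
  kind="$1"; gb="$2"
  if [ "$kind" = "vram" ]; then
    if   [ "$gb" -ge 24 ]; then echo "qwen3:32b"
    elif [ "$gb" -ge 20 ]; then echo "qwen3:30b-a3b"
    elif [ "$gb" -ge 12 ]; then echo "qwen3:14b"
    elif [ "$gb" -ge 8 ];  then echo "qwen3:8b"
    else                        echo "qwen3:4b"
    fi
  else
    # System RAM only: CPU inference, so prefer smaller or MoE models.
    if   [ "$gb" -ge 32 ]; then echo "qwen3:30b-a3b"
    elif [ "$gb" -ge 16 ]; then echo "qwen3:8b"
    else                        echo "qwen3:4b"
    fi
  fi
}

suggest_qwen3 vram 12   # prints qwen3:14b
suggest_qwen3 ram 16    # prints qwen3:8b
```

Treat unified memory on a Mac as vram for this purpose, since the GPU can address it directly.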

Qwen3 vs DeepSeek R1 vs Llama 4 Scout

Feature       | Qwen3 8B                  | DeepSeek R1 8B  | Llama 4 Scout
--------------|---------------------------|-----------------|---------------------
Thinking mode | Yes (switchable)          | Always on       | No
Licence       | Apache 2.0                | MIT             | Llama 4 Community
Coding        | Strong                    | Strong          | Good
Multilingual  | Excellent (119 languages) | Good            | Good
Best for      | General + reasoning       | Reasoning tasks | Long context, vision

The key advantage of Qwen3 over DeepSeek R1 is flexibility — one model handles both quick responses and deep reasoning. DeepSeek R1 is always in reasoning mode, which is powerful but slow for simple tasks.

Qwen3 for Everyday Use

Benchmarks aside, Qwen3 8B with thinking mode off is a very capable everyday model: quick to respond and good at following instructions. With thinking mode on, it becomes a genuine competitor to much larger models for technical tasks. Having both in one download is what makes it worth switching to if you are currently running Llama 3.3 or Mistral as your daily driver.