Alibaba released Qwen3 on April 28th 2025, and it has become one of the most-pulled model families on Ollama. The headline feature is a built-in thinking mode: the same kind of step-by-step reasoning you get from DeepSeek R1, but switchable on or off mid-conversation without loading a different model. This guide covers every size, the hardware you need, and how to get it running locally in minutes.
What Is Qwen3?
Qwen3 is the third generation of Alibaba’s open-weight large language model series. It ships as eight separate models — six dense architectures and two Mixture-of-Experts (MoE) models — all available under the Apache 2.0 licence, meaning free for personal and commercial use.
The standout feature is a unified thinking framework. Earlier reasoning-focused releases shipped as separate dedicated models (DeepSeek R1, QwQ). Qwen3 integrates both a fast response mode and a slow deep-reasoning mode into every model in the family, and you switch between them with a single command.
Qwen3 Model Sizes and Hardware Requirements
Ollama hosts the full Qwen3 family. The default tag pulls the 8B model, which suits most setups with a mid-range GPU or 16 GB of RAM:
| Model | Type | Min VRAM / RAM | Best For |
|---|---|---|---|
| qwen3:0.6b | Dense | 4 GB | Very low-end hardware, quick tests |
| qwen3:1.7b | Dense | 4 GB | Raspberry Pi, older PCs |
| qwen3:4b | Dense | 4–6 GB | Budget GPUs, fast responses |
| qwen3:8b | Dense | 6–8 GB | Default — best balance for most users |
| qwen3:14b | Dense | 10–12 GB | 12 GB GPUs (e.g. RTX 3060), M2/M3 Mac |
| qwen3:32b | Dense | 20–24 GB | High-end GPU, Mac Studio |
| qwen3:30b-a3b | MoE | 20 GB | Efficient — 30B quality, 3B active cost |
| qwen3:235b-a22b | MoE | 128 GB+ | Server-grade, flagship quality |
The 30B-A3B MoE model is worth highlighting. Mixture-of-Experts means only 3 billion of its 30 billion parameters are active per token during inference, so it runs at roughly 8B speed while producing quality closer to a 30B dense model. All 30 billion parameters still have to be loaded into memory, which is why it needs about 20 GB. If you have 20 GB of VRAM or unified memory, it is worth trying over the 8B.
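A rough back-of-the-envelope estimate shows why the MoE model's memory footprint follows its total parameter count rather than its active count. The sketch below assumes weight memory is simply parameters times bits per weight; the 4.5 bits-per-weight default is an illustrative assumption for a 4-bit quantised model, not a figure published by Ollama, and the estimate ignores the KV cache and runtime overhead.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Rough size of the model weights in GB (10^9 bytes).

    bits_per_weight=4.5 is an assumed average for a 4-bit quantised
    model; real quantisation formats vary.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Memory is driven by TOTAL parameters: every expert has to be loaded.
moe_weights = weight_memory_gb(30)   # qwen3:30b-a3b, all 30B parameters
dense_weights = weight_memory_gb(8)  # qwen3:8b

# Per-token compute is driven by ACTIVE parameters (3B of the 30B).
active_fraction = 3 / 30

print(f"30B-A3B weights: ~{moe_weights:.1f} GB")
print(f"8B weights:      ~{dense_weights:.1f} GB")
print(f"Active per token: {active_fraction:.0%} of the MoE weights")
```

Under these assumptions the MoE weights come to roughly 17 GB, which is in line with the 20 GB figure in the table once cache and overhead are added.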
How to Run Qwen3 on Ollama
First, make sure Ollama is installed and running. Then pull and run your chosen size:
# Default (8B model)
ollama run qwen3
# Specific sizes
ollama run qwen3:4b
ollama run qwen3:14b
ollama run qwen3:32b
# MoE model — 30B quality, efficient inference
ollama run qwen3:30b-a3b
Ollama will download the model on first run. The 8B model is around 5 GB; the 14B is around 9 GB.
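If you want to know in advance whether a run will trigger a download, one option is to ask the local Ollama server which models it already has. The sketch below assumes the server is listening on the default address used elsewhere in this guide and that the model list endpoint (/api/tags) returns a JSON object with a "models" array whose entries carry a "name" field; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address used in this guide


def normalise_tag(tag: str) -> str:
    """A bare name such as 'qwen3' is stored by Ollama as 'qwen3:latest'."""
    return tag if ":" in tag else f"{tag}:latest"


def is_pulled(tags_response: dict, tag: str) -> bool:
    """Check a parsed /api/tags response for the given model tag."""
    wanted = normalise_tag(tag)
    return any(m.get("name") == wanted for m in tags_response.get("models", []))


def fetch_local_models(base_url: str = OLLAMA_URL) -> dict:
    """Fetch the list of locally stored models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


def main() -> None:
    """Report which of a few Qwen3 tags are already on disk (needs Ollama running)."""
    tags = fetch_local_models()
    for tag in ("qwen3", "qwen3:14b"):
        status = "already downloaded" if is_pulled(tags, tag) else "will download on first run"
        print(f"{tag}: {status}")
```

Calling main() with Ollama running prints one status line per tag; is_pulled can also be used on its own against a response you have already fetched.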
Thinking Mode — How It Works
Thinking mode is Qwen3’s most important feature. When enabled, the model works through a problem step by step before giving its final answer — the same approach that makes DeepSeek R1 strong at reasoning, maths, and code. When disabled, it responds instantly like a standard chat model.
All Qwen3 models run with thinking mode on by default in Ollama. You can control it in three ways:
From the command line at launch:
# Force thinking mode on
ollama run qwen3 --think
# Force thinking mode off (faster responses)
ollama run qwen3 --think=false
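During a chat session:
/set think
/set nothink
Qwen3 itself also recognises the soft switches /think and /no_think when they are added to the end of a prompt.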
Via the API:
curl http://localhost:11434/api/chat -d '{
"model": "qwen3",
"think": false,
"messages": [{"role": "user", "content": "Summarise this in one sentence."}]
}'
A practical approach: turn thinking off for quick questions, summaries, and drafting. Switch it back on for code debugging, maths problems, logic tasks, or anything where you want the model to reason carefully before answering.
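The same rule of thumb can be applied programmatically when calling the API from a script. This is a minimal sketch, assuming a local Ollama server on the default port and the "think" field shown in the curl example; the task labels in REASONING_TASKS and the function names are hypothetical and only there for illustration. It sets "stream" to false so the server returns a single JSON object.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

# Hypothetical task labels for illustration; adjust to your own workflow.
REASONING_TASKS = {"debug", "maths", "logic"}


def build_chat_payload(prompt: str, task: str, model: str = "qwen3") -> dict:
    """Build an /api/chat request body, enabling thinking only for reasoning-heavy tasks."""
    return {
        "model": model,
        "think": task in REASONING_TASKS,
        "stream": False,  # ask for one JSON object instead of a stream of chunks
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt: str, task: str, model: str = "qwen3") -> str:
    """Send the request to a local Ollama server and return the final answer text."""
    body = json.dumps(build_chat_payload(prompt, task, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        reply = json.load(resp)
    return reply["message"]["content"]
```

For example, chat("Summarise this in one sentence.", task="summary") sends the request with thinking off, while task="debug" turns it on.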
Thinking Budget Control
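Qwen3 also supports a thinking budget, a cap on how many tokens the model spends reasoning before it gives its answer. This is useful when you want some reasoning depth but do not want to wait for an exhaustive chain of thought. The request below shows the idea; the budget field is separate from the think parameter used earlier, so check the Ollama API reference for your version to confirm it is honoured rather than silently ignored: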
curl http://localhost:11434/api/chat -d '{
"model": "qwen3",
"thinking": {"budget_tokens": 1024},
"messages": [{"role": "user", "content": "Debug this Python function."}]
}'
Higher budget = more thorough reasoning, slower response. Lower budget = quicker but shallower. The default (uncapped) is fine for most use cases.
Which Qwen3 Size Should You Use?
A practical guide based on your hardware:
- 8 GB RAM / integrated graphics: qwen3:4b without thinking mode. Functional for chat and drafting.
- 16 GB RAM / no discrete GPU: qwen3:8b with CPU inference. Slow but capable. A Mac with 16 GB unified memory will run this well.
- RTX 3060 or another 12 GB GPU: qwen3:14b is the sweet spot. Strong reasoning at a good speed.
- RTX 4090 / Mac M2 Pro 32 GB: qwen3:32b or qwen3:30b-a3b. Near-frontier quality locally.
- Home server with 32+ GB RAM: qwen3:30b-a3b with CPU inference, or qwen3:32b if RAM allows.
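The hardware guidance above can be folded into a small lookup. This sketch takes the upper end of each memory range from the requirements table and returns the most demanding tag that fits; the function name is hypothetical, and it assumes available_gb is memory actually free for the model (VRAM or unified memory), so on a machine with 8 GB of total system RAM you would pass a lower figure to leave room for the operating system. It considers memory fit only, not speed.

```python
from typing import Optional

# (tag, memory needed in GB), using the upper end of each range in the table above.
# The list is ordered by memory requirement, smallest first.
QWEN3_REQUIREMENTS = [
    ("qwen3:0.6b", 4),
    ("qwen3:1.7b", 4),
    ("qwen3:4b", 6),
    ("qwen3:8b", 8),
    ("qwen3:14b", 12),
    ("qwen3:30b-a3b", 20),
    ("qwen3:32b", 24),
    ("qwen3:235b-a22b", 128),
]


def largest_model_that_fits(available_gb: float) -> Optional[str]:
    """Return the most demanding tag whose requirement fits, or None if nothing fits."""
    best = None
    for tag, required_gb in QWEN3_REQUIREMENTS:
        if required_gb <= available_gb:
            best = tag  # later entries need at least as much memory, so keep the last fit
    return best


if __name__ == "__main__":
    for gb in (6, 12, 20, 24):
        print(f"{gb} GB free -> {largest_model_that_fits(gb)}")
```

With 20 GB free the lookup lands on the MoE model, and with 24 GB it moves up to the 32B dense model; whether the dense or MoE model is the better choice at that point depends on how much you value speed.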
Qwen3 vs DeepSeek R1 vs Llama 4 Scout
| | Qwen3 8B | DeepSeek R1 8B | Llama 4 Scout |
|---|---|---|---|
| Thinking mode | Yes — switchable | Always on | No |
| Licence | Apache 2.0 | MIT | Llama 4 Community |
| Coding | Strong | Strong | Good |
| Multilingual | Excellent (119 languages) | Good | Good |
| Best for | General + reasoning | Reasoning tasks | Long context, vision |
The key advantage of Qwen3 over DeepSeek R1 is flexibility — one model handles both quick responses and deep reasoning. DeepSeek R1 is always in reasoning mode, which is powerful but slow for simple tasks.
Qwen3 for Everyday Use
Beyond the technical benchmarks, Qwen3 8B with thinking mode off is a very capable everyday model — fast, accurate, and good at following instructions. Thinking mode makes it a genuine competitor to much larger models for technical tasks. The combination in one download is what makes it worth switching to if you are currently running Llama 3.3 or Mistral as your daily driver.