Llama 4 is Meta’s most capable open-weight model family to date, released in April 2025. It introduces a mixture-of-experts (MoE) architecture and native multimodal support — meaning it handles both text and images out of the box. Two variants are available to run locally via Ollama: Scout and Maverick. This guide covers what each model offers, what hardware you need, and how to get started.
## Llama 4 Scout vs Maverick: Which Should You Run?
Llama 4 comes in two locally-runnable variants:
- Llama 4 Scout — 17B active parameters (109B total across experts), supports up to a 10 million token context window. Requires around 20–24GB of VRAM or unified memory. This is the version most home users will run.
- Llama 4 Maverick — 17B active parameters (400B total), significantly more capable but requiring substantially more memory. Best suited for multi-GPU setups or high-memory workstations.
For most users: run Scout. It delivers strong performance on a single high-end consumer GPU or an Apple Silicon Mac with 32GB+ unified memory.
## Hardware Requirements
- Llama 4 Scout: 24GB VRAM (NVIDIA RTX 3090/4090), or 32GB+ Apple Silicon unified memory, or 32GB RAM for CPU-only (slow)
- Llama 4 Maverick: Multi-GPU setup or high-memory server — not practical for most home users
Llama 4 Scout at Q4 quantisation runs in around 20GB, making it accessible on a single RTX 4090 or Mac Studio with 32GB.
## How to Install Llama 4 Scout on Ollama
Make sure you have Ollama installed and up to date before pulling Llama 4. Run `ollama --version` and update if needed.
```shell
# Pull Llama 4 Scout
ollama pull llama4

# Or pull Maverick (requires significantly more memory)
ollama pull llama4:maverick
```
The Scout model is the default when you pull `llama4`. Depending on your connection, the download can take a while.
## Running Llama 4 Scout
```shell
# Start an interactive chat session
ollama run llama4

# Run a single prompt
ollama run llama4 "Explain the difference between MoE and dense transformer models"
```
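If you want to script one-shot prompts, the CLI above can also be driven from Python with nothing but the standard library. A minimal sketch (the model name and prompt are just examples, and the `ollama` binary must be on your PATH):

```python
# Sketch: calling `ollama run` non-interactively from Python.
import subprocess

def ollama_run_cmd(model: str, prompt: str) -> list[str]:
    """Build the argv for a one-shot `ollama run` invocation."""
    return ['ollama', 'run', model, prompt]

def ask(model: str, prompt: str) -> str:
    """Run the prompt and return the model's reply from stdout."""
    result = subprocess.run(ollama_run_cmd(model, prompt),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == '__main__':
    print(ask('llama4', 'Summarise MoE in one sentence.'))
```

For anything beyond quick scripts, the REST API or Python library shown below is the better interface.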
## Using Llama 4’s Vision Capabilities
Llama 4 is natively multimodal — you can pass images directly without any additional setup. Using the Ollama Python library:
```python
import ollama

response = ollama.chat(
    model='llama4',
    messages=[{
        'role': 'user',
        'content': 'What is in this image?',
        'images': ['path/to/your/image.jpg']
    }]
)
print(response['message']['content'])
```
This works with PNG, JPEG, and WebP images. Unlike earlier vision models that used a separate vision encoder, Llama 4’s multimodal capability is baked into the base model.
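The same thing works over plain REST: the chat endpoint accepts images as base64-encoded strings in a message's `images` field. A minimal sketch using only the standard library, assuming a default local install at `localhost:11434` (the helper names here are illustrative, not part of any API):

```python
# Sketch: sending an image to Ollama's /api/chat endpoint over REST.
# Images travel as base64 strings inside the message's "images" list.
import base64
import json
import urllib.request

def build_vision_payload(prompt: str, image_bytes: bytes,
                         model: str = 'llama4') -> dict:
    """Encode raw image bytes and build a non-streaming chat request."""
    return {
        'model': model,
        'messages': [{
            'role': 'user',
            'content': prompt,
            'images': [base64.b64encode(image_bytes).decode('ascii')],
        }],
        'stream': False,
    }

def describe_image(path: str) -> str:
    """Send a local image file to the model and return its description."""
    with open(path, 'rb') as f:
        payload = build_vision_payload('What is in this image?', f.read())
    req = urllib.request.Request(
        'http://localhost:11434/api/chat',
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())['message']['content']
```

Setting `"stream": false` returns one JSON object instead of a stream, which is usually what you want when scripting.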
## Llama 4 via the REST API
```shell
curl http://localhost:11434/api/chat -d '{
  "model": "llama4",
  "messages": [
    {"role": "user", "content": "What are the key features of Llama 4?"}
  ]
}'
```
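By default the chat endpoint streams its response as newline-delimited JSON, one chunk per line, with the reply split across `message.content` fields. A minimal sketch of consuming that stream from Python, assuming the default local endpoint (the function names are illustrative):

```python
# Sketch: consuming the default streaming response of /api/chat.
# Each response line is a JSON object; the full reply is the
# concatenation of every chunk's message.content.
import json
import urllib.request

def assemble_stream(lines) -> str:
    """Join the content fields of newline-delimited chat chunks."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        if not chunk.get('done'):
            parts.append(chunk['message']['content'])
    return ''.join(parts)

def chat(prompt: str, model: str = 'llama4') -> str:
    """Send one prompt and return the assembled streamed reply."""
    payload = {'model': model,
               'messages': [{'role': 'user', 'content': prompt}]}
    req = urllib.request.Request('http://localhost:11434/api/chat',
                                 data=json.dumps(payload).encode('utf-8'))
    with urllib.request.urlopen(req) as resp:
        return assemble_stream(resp)  # iterating yields one line per chunk
```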
## What Llama 4 Scout Is Good At
- Long-context reasoning — the 10M token context window lets you analyse entire codebases or large document collections in a single prompt
- Coding — strong performance on standard coding benchmarks
- Multimodal tasks — image description, chart analysis, visual QA
- General reasoning — improved significantly over Llama 3.3
## Llama 4 vs Llama 3.3 on Ollama
If you are currently running Llama 3.3 70B, Scout offers comparable or better performance with lower memory requirements due to the MoE architecture — only 17B parameters are active per inference pass. The native multimodal support is an additional capability Llama 3.3 does not have.
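The saving comes from the router: each token is scored against all experts but dispatched to only a few, so just those experts' weights participate in that token's forward pass. A toy top-k router makes the idea concrete (purely illustrative, not Llama 4's actual routing scheme):

```python
# Toy illustration of MoE top-k routing: every expert is scored,
# but only the k best-scoring experts run for a given token, so the
# "active" parameter count is far below the total.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(scores, k=2):
    """Pick the k highest-scoring experts and renormalise their weights."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return list(zip(top, softmax([scores[i] for i in top])))

def moe_forward(x, experts, scores, k=2):
    """Weighted sum over only the selected experts' outputs."""
    return sum(w * experts[i](x) for i, w in route(scores, k))

# Example: 4 stand-in "experts", only 2 of which run per token.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
```

In a real model each expert is a full feed-forward block and the scores come from a learned router, but the principle is the same: compute cost tracks active parameters, while memory must still hold the total.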
## Troubleshooting
- Out of memory: Try a lower quantisation — `ollama pull llama4:scout-q4_K_M` uses less VRAM
- Slow inference: Ensure your GPU is being used — run `ollama ps` to see if the model is loaded on GPU
- Model not found: Update Ollama to the latest version — Llama 4 requires a recent build
