How to Run Llama 4 on Ollama (Scout and Maverick Guide)

Llama 4 is Meta’s most capable open-weight model family to date, released in April 2025. It introduces a mixture-of-experts (MoE) architecture and native multimodal support, meaning it handles both text and images out of the box. Two variants are available to run locally via Ollama: Scout and Maverick. This guide covers what each model offers, what hardware you need, and how to get started.

Llama 4 Scout vs Maverick: Which Should You Run?

Llama 4 comes in two locally-runnable variants:

  • Llama 4 Scout — 17B active parameters (109B total across experts), supports up to a 10 million token context window. Because all experts must be resident in memory, expect roughly 60GB+ of VRAM or unified memory at 4-bit quantisation. This is the smaller variant and the one most local users will run.
  • Llama 4 Maverick — 17B active parameters (400B total), significantly more capable but requiring far more memory. Best suited for multi-GPU servers or very high-memory workstations.

For most users: run Scout. It is the smaller of the two and runs well on a multi-GPU workstation or an Apple Silicon Mac with 64GB or more of unified memory; partial CPU offload is possible on smaller configurations, at reduced speed.

Hardware Requirements

  • Llama 4 Scout: roughly 60GB+ of VRAM or unified memory at 4-bit quantisation (multi-GPU NVIDIA setups, or a 64GB+ Apple Silicon Mac); CPU offload works but is slow
  • Llama 4 Maverick: multi-GPU server or very high-memory workstation (roughly 230GB+ at 4-bit) — not practical for most home users

Llama 4 Scout at Q4 quantisation weighs in at roughly 60–70GB, so a single 24GB consumer GPU cannot hold it on its own; an Apple Silicon Mac with 64GB+ unified memory, or a multi-GPU setup, is a more realistic target.
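As a back-of-envelope check, weight memory scales with the total parameter count, not the active count, because every expert must be resident even though only 17B parameters fire per token. A rough sketch (real footprints also include the KV cache and runtime overhead, and vary by quantisation format):

```python
def approx_weight_gb(total_params_billions: float, bits_per_param: float) -> float:
    """Rough memory footprint of the model weights alone, in decimal GB."""
    total_bytes = total_params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

# Scout: 109B total parameters at ~4.5 bits/param (typical Q4_K_M average)
print(round(approx_weight_gb(109, 4.5)))  # prints 61

# Maverick: 400B total parameters at the same quantisation
print(round(approx_weight_gb(400, 4.5)))  # prints 225
```

This is why Scout needs workstation-class memory despite running only 17B parameters per inference pass.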

How to Install Llama 4 Scout on Ollama

Make sure you have Ollama installed and up to date before pulling Llama 4. Run ollama --version and update if needed.

# Pull Llama 4 Scout
ollama pull llama4

# Or pull Maverick (requires significantly more memory)
ollama pull llama4:maverick

The Scout model is the default when you pull llama4. The download is large (tens of gigabytes), so expect it to take a while on most connections.

Running Llama 4 Scout

# Start an interactive chat session
ollama run llama4

# Run a single prompt
ollama run llama4 "Explain the difference between MoE and dense transformer models"
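The one-shot prompt above can also be sent from Python. A minimal sketch using only the standard library against Ollama's /api/generate endpoint, assuming a local server on the default port 11434:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> bytes:
    """JSON body for a non-streaming /api/generate call."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    """Send one prompt and return the full completion text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(generate("llama4", "Explain the difference between MoE and dense transformer models"))
```

Setting "stream": False returns the whole answer in one JSON object rather than token-by-token chunks.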

Using Llama 4’s Vision Capabilities

Llama 4 is natively multimodal — you can pass images directly without any additional setup. Using the Ollama Python library:

import ollama

response = ollama.chat(
    model='llama4',
    messages=[{
        'role': 'user',
        'content': 'What is in this image?',
        'images': ['path/to/your/image.jpg']
    }]
)
print(response['message']['content'])

This works with PNG, JPEG, and WebP images. Unlike earlier vision models that used a separate vision encoder, Llama 4’s multimodal capability is baked into the base model.
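The same image request can be made over the REST API, where the user message carries a base64-encoded image in an images array. A stdlib-only sketch (the file path is a placeholder, as in the example above):

```python
import base64
import json
import urllib.request

def build_vision_payload(model: str, text: str, image_bytes: bytes) -> bytes:
    """Chat payload with one base64-encoded image attached to the user message."""
    return json.dumps({
        "model": model,
        "stream": False,
        "messages": [{
            "role": "user",
            "content": text,
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
    }).encode()

if __name__ == "__main__":
    with open("path/to/your/image.jpg", "rb") as f:  # placeholder path
        payload = build_vision_payload("llama4", "What is in this image?", f.read())
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["message"]["content"])
```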

Llama 4 via the REST API

curl http://localhost:11434/api/chat -d '{
  "model": "llama4",
  "messages": [
    {"role": "user", "content": "What are the key features of Llama 4?"}
  ]
}'
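By default /api/chat streams its reply as newline-delimited JSON, so the curl call above returns one object per chunk. A stdlib-only sketch of consuming that stream from Python (assumes the same local server):

```python
import json
import urllib.request

def parse_stream_line(line: bytes) -> str:
    """Extract the content fragment from one NDJSON chunk of /api/chat."""
    chunk = json.loads(line)
    return chunk.get("message", {}).get("content", "")

if __name__ == "__main__":
    body = json.dumps({
        "model": "llama4",
        "messages": [{"role": "user", "content": "What are the key features of Llama 4?"}],
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # one JSON object per line until "done": true
            print(parse_stream_line(line), end="", flush=True)
    print()
```

Streaming lets you display tokens as they arrive instead of waiting for the full response.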

What Llama 4 Scout Is Good At

  • Long context reasoning — the 10M token context window is transformative for document analysis (note that Ollama uses a much smaller context by default; raise the num_ctx parameter, and expect memory use to grow with context length)
  • Coding — strong performance on standard coding benchmarks
  • Multimodal tasks — image description, chart analysis, visual QA
  • General reasoning — improved significantly over Llama 3.3

Llama 4 vs Llama 3.3 on Ollama

If you are currently running Llama 3.3 70B, Scout offers comparable or better quality with faster inference thanks to the MoE architecture — only 17B parameters are active per inference pass. Total memory use is higher, though: all 109B parameters must be loaded, versus 70B for Llama 3.3. The native multimodal support is an additional capability Llama 3.3 does not have.

Troubleshooting

  • Out of memory: Try a lower quantisation — check the llama4 page on the Ollama model library for the quantised tags currently published — or accept partial CPU offload at reduced speed
  • Slow inference: Ensure your GPU is being used — run ollama ps to see if the model is loaded on GPU
  • Model not found: Update Ollama to the latest version — Llama 4 requires a recent build
