
Ollama MLX: How to Enable Faster Inference on Apple Silicon

Ollama 0.19, released in March 2026, introduced an MLX backend for Apple Silicon Macs. MLX is Apple’s machine learning framework optimised specifically for the M-series chip architecture. Enabling it gives approximately 2x faster inference compared to the previous Metal backend — a significant improvement for anyone running local AI on a Mac.

What Is MLX and Why Does It Matter?

MLX is Apple’s open-source machine learning framework, designed from the ground up for Apple Silicon’s unified memory architecture. Unlike Metal (GPU compute) or CPU inference, MLX is specifically optimised for the way M1/M2/M3/M4 chips share memory between CPU and GPU cores. The result is noticeably faster token generation with lower power consumption.

Before Ollama 0.19, Macs used the Metal backend. MLX delivers around 2x the tokens per second for supported models on the same hardware.

Requirements

  • Apple Silicon Mac (M1, M2, M3, or M4 — any variant)
  • 32GB or more unified memory — the MLX backend currently requires 32GB minimum
  • Ollama 0.19 or later
  • macOS Sequoia or later recommended

If your Mac has 8GB or 16GB of unified memory, the MLX backend is not yet available for your configuration. Support will likely expand to lower-memory configurations in future releases.
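To check whether a given Mac clears the 32GB bar before flipping anything on, you can query total unified memory with sysctl. This is a small sketch; `hw.memsize` is the macOS key for total physical memory in bytes, and the fallback to 0 just keeps the script from erroring on non-macOS systems:

```shell
# Query total unified memory (macOS: hw.memsize is total physical RAM in bytes)
mem_bytes=$(sysctl -n hw.memsize 2>/dev/null || echo 0)
mem_gb=$(( mem_bytes / 1073741824 ))  # bytes -> GiB

echo "Unified memory: ${mem_gb} GB"
if [ "$mem_gb" -ge 32 ]; then
  echo "Meets the MLX backend minimum (32 GB)"
else
  echo "Below the MLX minimum; Ollama will stay on the Metal backend"
fi
```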

How to Enable the MLX Backend

First, update Ollama to the latest version:

# Check your current version
ollama --version

# Update via Homebrew
brew upgrade ollama

# Or download the latest from ollama.com

Once on 0.19+, enable MLX by setting an environment variable before starting Ollama:

# Enable MLX backend
export OLLAMA_USE_MLX=1

# Start Ollama
ollama serve

To make this permanent, add the environment variable to your shell profile (~/.zshrc for zsh, the default shell on modern macOS):

echo 'export OLLAMA_USE_MLX=1' >> ~/.zshrc
source ~/.zshrc

If you are running Ollama as a macOS app (from the menu bar), set the environment variable in a launchd plist or via the Ollama app settings if available in your version.
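For the launchd route, a minimal LaunchAgent sketch follows. The label, the file name (~/Library/LaunchAgents/com.example.ollama.plist), and the Homebrew binary path /opt/homebrew/bin/ollama are illustrative assumptions; adjust them to your install (run `which ollama` to confirm the path):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- Example label; any reverse-DNS name works -->
  <key>Label</key>
  <string>com.example.ollama</string>
  <!-- Assumed Homebrew path on Apple Silicon -->
  <key>ProgramArguments</key>
  <array>
    <string>/opt/homebrew/bin/ollama</string>
    <string>serve</string>
  </array>
  <!-- The variable this article enables -->
  <key>EnvironmentVariables</key>
  <dict>
    <key>OLLAMA_USE_MLX</key>
    <string>1</string>
  </dict>
  <key>RunAtLoad</key>
  <true/>
</dict>
</plist>
```

Load it with `launchctl load ~/Library/LaunchAgents/com.example.ollama.plist`. Note that this runs the server directly rather than the menu bar app; alternatively, `launchctl setenv OLLAMA_USE_MLX 1` sets the variable for GUI apps launched afterwards, though it does not persist across reboots.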

Verifying MLX Is Active

# Pull a supported model
ollama pull qwen2.5:7b

# Run and check the logs
ollama run qwen2.5:7b "Hello"

In the Ollama logs (~/.ollama/logs/server.log), you should see references to MLX during model loading if it is active. You will also notice significantly faster first-token latency and generation speed.
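Rather than reading the whole log, you can grep it for backend mentions. A small helper, assuming the log path quoted above; the function name is just an example:

```shell
# check_mlx_log: print recent MLX-related lines from an Ollama server log.
check_mlx_log() {
  grep -i "mlx" "$1" | tail -n 5
}

# Usage with the default log location mentioned in this article:
# check_mlx_log ~/.ollama/logs/server.log
```

If the command prints nothing, the MLX backend was not used for the last model load.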

Supported Models

At launch, MLX support in Ollama 0.19 covers:

  • Qwen2.5 and Qwen3 family
  • Llama 3.x family
  • Gemma 3 and Gemma 4
  • Mistral 7B and variants

Coverage is expanding with each release. Check the Ollama changelog for the current supported model list.

Performance Benchmarks (Mac Studio M2 Ultra, 64GB)

Model           Metal Backend   MLX Backend   Improvement
Qwen2.5 7B      28 tok/s        54 tok/s      ~2x
Llama 3.3 70B   12 tok/s        23 tok/s      ~2x
Gemma 3 27B     18 tok/s        35 tok/s      ~2x

Results vary by model and Mac configuration, but the roughly 2x improvement held across all tested models.
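As a sanity check on the "~2x" column, the speedups can be recomputed from the raw tokens-per-second figures in the table:

```shell
# Recompute the Improvement column: MLX tok/s divided by Metal tok/s
awk 'BEGIN {
  printf "Qwen2.5 7B:    %.2fx\n", 54/28
  printf "Llama 3.3 70B: %.2fx\n", 23/12
  printf "Gemma 3 27B:   %.2fx\n", 35/18
}'
```

All three land between 1.9x and 2x, matching the rounded column.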

What About Macs with 8GB or 16GB?

The current MLX backend in Ollama requires 32GB unified memory. Apple Silicon Macs with 8GB or 16GB can still run Ollama using the Metal backend, which remains the default. Metal performance on M-series chips is already excellent — the MLX improvement is on top of an already fast baseline.

Future Ollama releases are expected to extend MLX support to lower memory configurations.
