
How to Run Ollama on a Raspberry Pi

Ollama — the tool that makes running large language models locally as simple as pulling a Docker image — is not just for powerful desktop machines. With the right Raspberry Pi model and a few sensible configuration choices, you can run a fully self-hosted LLM server on a device that sips power and fits in the palm of your hand. This guide covers which hardware to choose, how to install Ollama, which models actually perform well, and how to squeeze the best performance out of your setup.

Which Raspberry Pi Models Can Run Ollama?

  • Raspberry Pi 5 (4GB or 8GB): The best choice. The Pi 5’s faster CPU cores and significantly improved memory bandwidth make it noticeably quicker for inference than any previous Pi. The 8GB model is strongly recommended.
  • Raspberry Pi 4 (4GB or 8GB): Fully supported. Inference runs at roughly a third to half the speed of a Pi 5, but small models run reliably. 4GB is workable for 1B–3B models; 8GB gives more headroom.
  • Raspberry Pi 4 or 5 (2GB): Not recommended. After the OS claims its share, very little RAM is left for model weights.
  • Raspberry Pi 3 and earlier: Not supported — 32-bit or insufficient RAM.

The single most impactful upgrade is choosing a Pi 5 over a Pi 4. The difference is roughly 2–3x faster inference for the same model, thanks to the Pi 5’s Arm Cortex-A76 cores and faster LPDDR4X memory.

Prerequisites: 64-bit Raspberry Pi OS

Ollama’s Linux builds support ARM64 (aarch64), so your Pi must be running a 64-bit operating system. The recommended choice is Raspberry Pi OS Bookworm (64-bit), available from the Raspberry Pi Imager.

Verify your architecture:

uname -m

The output must be aarch64. If you see armv7l, you are running a 32-bit OS and need to re-flash with a 64-bit image.
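The check can be wrapped in a small script that also explains the result. The `check_arch` helper below is hypothetical (not part of Ollama or Raspberry Pi OS), but the architecture strings it matches are the ones `uname -m` actually reports:

```shell
# Hypothetical helper: map the output of `uname -m` to Ollama support status.
check_arch() {
  case "$1" in
    aarch64)        echo "supported: 64-bit ARM" ;;
    armv7l|armv6l)  echo "unsupported: 32-bit OS, re-flash with a 64-bit image" ;;
    *)              echo "unknown architecture: $1" ;;
  esac
}

check_arch "$(uname -m)"
```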

For best performance, use the Lite variant (no desktop environment). A full desktop consumes RAM and CPU cycles that are better given to inference. You can still access the Pi over SSH or through a browser-based UI.

Installing Ollama

Install Ollama with the official install script:

curl -fsSL https://ollama.com/install.sh | sh

The script detects your ARM64 architecture automatically, installs the Ollama binary to /usr/local/bin/ollama, creates a system user, and configures a systemd service. The whole process takes under a minute on a good connection.

Confirm the installation:

ollama --version

Managing Ollama as a System Service

# Check status
sudo systemctl status ollama

# Start / stop
sudo systemctl start ollama
sudo systemctl stop ollama

# Enable to start on boot (set by installer already)
sudo systemctl enable ollama

To allow other devices on your network to use the Ollama API, bind it to all interfaces:

sudo systemctl edit ollama

Add in the editor:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

Save and exit, then restart the service:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Choosing the Right Models

The Pi has no CUDA or ROCm GPU — all inference runs on the ARM CPU cores. This limits both speed and the practical model size ceiling.

  • Llama 3.2:1b — Most responsive option on a Pi. Fast enough for interactive use even on Pi 4. ollama pull llama3.2:1b
  • Qwen2.5:0.5b — Extremely small and fast. Useful when near-instant responses matter more than depth.
  • Qwen2.5:1.5b — Good step up from 0.5b with noticeably better reasoning.
  • Llama 3.2:3b — Sweet spot for Pi 5 users. Capable output at 2–5 tokens/second on Pi 5 8GB. ollama pull llama3.2:3b
  • Phi3:mini (3.8b) — Microsoft’s efficient small model. Good quality on Pi 5.

Models to Avoid

7B models are marginal on a Pi 5 8GB — they will run at around 0.5–1 token/second, which makes interactive conversation uncomfortable. On Pi 4, 7B is not worth attempting for general use. Models larger than 7B are not realistic on this hardware.
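A quick size estimate shows why 7B is the ceiling. A Q4-quantised model needs roughly 4.5 bits per parameter once quantisation overheads are included (a rule of thumb, not an exact figure):

```shell
# Approximate in-RAM size of a Q4-quantised model:
# parameters (in billions) x ~4.5 bits per weight / 8 bits per byte = GB.
q4_size_gb() {
  awk -v p="$1" 'BEGIN { printf "%.1f\n", p * 4.5 / 8 }'
}

q4_size_gb 1   # 1B model: ~0.6 GB
q4_size_gb 3   # 3B model: ~1.7 GB
q4_size_gb 7   # 7B model: ~3.9 GB, before context memory, on an 8GB Pi
```

A 7B model therefore fits in 8GB, but only just; once the OS and context buffers take their share, there is little left, and CPU inference at that size is slow regardless.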

Performance Expectations

Approximate generation speeds (Q4 quantisation, CPU inference):

  • Raspberry Pi 5 8GB, 1B model: ~8–12 tokens/second
  • Raspberry Pi 5 8GB, 3B model: ~2–5 tokens/second
  • Raspberry Pi 5 8GB, 7B model: ~0.5–1 token/second
  • Raspberry Pi 4 8GB, 3B model: ~1–2 tokens/second
  • Raspberry Pi 4 8GB, 7B model: ~0.3–0.5 tokens/second
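To translate these rates into felt latency, consider a typical 200-token reply. The helper below is plain arithmetic over the figures above:

```shell
# Seconds to generate a 200-token reply at a given speed (tokens/second).
reply_seconds() {
  awk -v tps="$1" 'BEGIN { printf "%.0f\n", 200 / tps }'
}

reply_seconds 10    # Pi 5, 1B model: ~20 s, conversational
reply_seconds 3     # Pi 5, 3B model: ~67 s, usable with patience
reply_seconds 0.5   # 7B model: ~400 s, not interactive
```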

Storage Considerations

  • MicroSD card: Default option. Use an A2-rated card at minimum — these are rated for random read/write performance that model loading demands.
  • USB SSD (via USB 3.0): Significantly faster for model loading. A cheap USB SSD outperforms even a good microSD for sequential reads.
  • NVMe via PCIe (Pi 5 only): The Pi 5 exposes a single-lane PCIe interface on an FPC connector, used by the official M.2 HAT+. This is the fastest storage option and the best setup for a dedicated LLM server.

To store models on a different path, set OLLAMA_MODELS in your service environment (via sudo systemctl edit ollama, as above), and make sure the target directory exists and is owned by the ollama user:

[Service]
Environment="OLLAMA_MODELS=/mnt/ssd/ollama-models"

Memory Management Tips

  • Limit parallel requests: Environment="OLLAMA_NUM_PARALLEL=1" — keeps requests sequential, so context memory is not allocated per concurrent request; add OLLAMA_MAX_LOADED_MODELS=1 to ensure only one model stays in RAM
  • Use Lite OS: No desktop environment means no Wayland/X11 overhead — saves a few hundred MB of RAM
  • Consider disabling swap: On an 8GB Pi running only Ollama, swap is often unnecessary; swapping to microSD is extremely slow
  • Ollama pulls Q4 quantised variants by default — avoid full-precision models unless you have a specific reason
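Before pulling a larger model, it can help to check available memory. The `can_fit_mb` helper below is hypothetical; it compares MemAvailable from /proc/meminfo against a model size given in MB:

```shell
# Hypothetical helper: check whether a model of a given size (in MB) is likely
# to fit in currently available RAM, using MemAvailable from /proc/meminfo.
can_fit_mb() {
  avail_kb=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
  if [ "$avail_kb" -ge $(( $1 * 1024 )) ]; then
    echo "fits"
  else
    echo "too big"
  fi
}

can_fit_mb 1700   # a ~1.7 GB 3B model at Q4
```

This is only a sanity check: Ollama also needs working memory for the context window on top of the weights.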

Running Open WebUI on the Same Pi

Open WebUI gives you a browser-based chat interface accessible from any device on your network. On Pi 5 this is feasible; on Pi 4 it works but adds memory pressure.

# Install Docker
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER   # log out and back in for the group change to apply

# Run Open WebUI
docker run -d \
  --network=host \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Access it at http://<your-pi-ip>:8080 from any browser on your local network (with --network=host, Open WebUI serves on its default port, 8080).

Practical Use Cases for a Pi LLM Server

  • Home automation assistant — connect a small model to Home Assistant or Node-RED to process natural language commands locally, with no cloud dependency
  • Local API for scripts — query Ollama’s REST API from Python or bash to add LLM capabilities to automation scripts
  • Self-hosted chat UI — a private, family-accessible interface on your home network with no accounts or data leaving the house
  • Batch summarisation — run overnight jobs processing documents without time pressure on inference speed
  • Offline LLM access — in environments without reliable internet, a Pi provides a usable local model with no external dependencies
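The "local API for scripts" case looks like this in practice — a sketch using curl against Ollama's /api/generate endpoint (the URL and model name assume the defaults from earlier sections):

```shell
# Query the local Ollama REST API; assumes llama3.2:1b has already been pulled.
OLLAMA_URL="http://localhost:11434"
PAYLOAD='{"model": "llama3.2:1b", "prompt": "One-line summary: why run an LLM on a Pi?", "stream": false}'

curl -s "$OLLAMA_URL/api/generate" -d "$PAYLOAD" || echo "Ollama server not reachable"
```

With "stream": false the server returns a single JSON object whose "response" field holds the generated text, which is easy to extract in a script.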

Pi 5 vs Pi 4: Which Should You Buy?

If you are buying new hardware specifically to run Ollama, the Raspberry Pi 5 8GB is the clear recommendation. The performance difference is not marginal — the Pi 5 is roughly 2–3x faster for inference, which is the difference between a model feeling interactive and feeling sluggish. For a 3B model, the Pi 5 will generate responses at a pace where short exchanges feel natural.

If you already own a Pi 4 8GB and want to experiment without new hardware, it is absolutely worth trying. Stick to 1B–3B models and keep your expectations calibrated, and you will have a working local LLM server for very little ongoing cost.
