Best Ollama Models for 8GB RAM and Low VRAM Hardware

Running Ollama on hardware with 8GB of RAM or VRAM is entirely possible — you just need to pick the right models. The key is choosing quantised versions of smaller models that fit within your memory budget while still delivering useful results. This guide covers the best options at each tier: 4GB, 6GB, and 8GB.

How Memory Limits Work with Ollama

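When you run a model in Ollama, its weights need to fit in your GPU's VRAM (or in system RAM if you have no GPU). If a model only partly fits in VRAM, Ollama can keep the remaining layers in system RAM and run them on the CPU, which is much slower. The memory required depends mainly on two things: the number of parameters and the quantisation level, plus some extra for the context. A 7B model at Q4 quantisation uses roughly 4–5GB; the same model at Q8 uses around 7–8GB.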

Ollama downloads quantised versions by default (for most models the default tag is a 4-bit quantisation such as Q4_K_M), so most models are optimised for memory efficiency out of the box.
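The rule of thumb above can be sketched as a small calculation. This is a rough estimate rather than Ollama's own accounting: the bits-per-weight figures (about 4.5 for Q4, about 8.5 for Q8) and the 0.5GB overhead for context and runtime buffers are assumptions chosen for illustration, and `estimate_model_gb` is a hypothetical helper name.

```python
def estimate_model_gb(params_billions, bits_per_weight, overhead_gb=0.5):
    """Rough memory estimate in GB (1 GB = 1e9 bytes).

    weights = parameters x bits per weight / 8 bits per byte,
    plus a flat allowance for context and runtime buffers (assumed).
    """
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb


# Assumed effective bits per weight; real quantised files vary a little.
Q4_BITS = 4.5
Q8_BITS = 8.5

q4 = estimate_model_gb(7, Q4_BITS)
q8 = estimate_model_gb(7, Q8_BITS)

print(f"7B at Q4: ~{q4:.1f} GB")
print(f"7B at Q8: ~{q8:.1f} GB")
```

With these assumptions, both results land inside the 4–5GB and 7–8GB ranges quoted above. Longer context windows increase the overhead, so treat the numbers as a lower bound.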

Best Models for 4GB VRAM

If you have a GPU with 4GB VRAM (or 4–6GB system RAM for CPU-only inference), these models work well:

  • Gemma 4 E2B — Google’s newest 2B model, natively multimodal. ollama pull gemma4:e2b
  • Phi-4 Mini — Microsoft’s compact model, excellent reasoning for its size. ollama pull phi4-mini
  • Qwen2.5 3B — Strong all-rounder at 3B. ollama pull qwen2.5:3b
  • Llama 3.2 3B — Meta’s compact model, good for quick tasks. ollama pull llama3.2:3b

Best Models for 6GB VRAM

  • Gemma 4 E4B — Best 4B model available. Strong coding and reasoning. ollama pull gemma4:e4b
  • Qwen2.5-Coder 3B — Strong coding performance at 3B. (Qwen3-Coder has 3B active parameters but 30B in total, so it does not fit in 6GB.) ollama pull qwen2.5-coder:3b
  • Mistral 7B Q4 — Fast and capable general model. ollama pull mistral
  • Llama 3.1 8B Q4 — Reliable and well-tested, though a tight fit at 6GB. ollama pull llama3.1:8b

Best Models for 8GB VRAM

8GB of VRAM is a very common tier for gaming GPUs (RTX 3070, the 8GB RTX 4060 Ti) and covers a wide range of capable models:

  • Llama 3.1 8B Q4 — Meta’s 8B model (Llama 3.3 is only available as 70B), excellent instruction following. ollama pull llama3.1:8b
  • Qwen2.5 7B — One of the strongest 7B models available. ollama pull qwen2.5:7b
  • Gemma 4 E4B (Q8) — Higher quality quantisation with room to spare. ollama pull gemma4:e4b-q8_0
  • Phi-4 — Microsoft’s 14B model. At Q4 it is about 9GB, so on an 8GB card it runs with some layers offloaded to the CPU. ollama pull phi4
  • DeepSeek R1 8B — Strong reasoning model, good for maths and logic. ollama pull deepseek-r1:8b

Models to Avoid on 8GB or Less

These models require more memory than 8GB and will either fail to load or run very slowly on CPU only:

  • Llama 3.3 70B (requires 40GB+)
  • Llama 4 Scout (109B total parameters; requires 60GB+ at Q4)
  • Qwen2.5 32B or larger
  • DeepSeek R1 32B or larger
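To see why the models above are out of reach, the sketch below sorts a few models named in this guide by whether their estimated Q4 footprint fits a memory budget. The 4.5 bits per weight and the 1GB overhead are assumptions, and `split_by_budget` is a hypothetical helper, so the cut-off is approximate.

```python
def estimate_model_gb(params_billions, bits_per_weight=4.5, overhead_gb=1.0):
    """Rough Q4 footprint in GB: weights plus an assumed flat overhead."""
    return params_billions * bits_per_weight / 8 + overhead_gb


def split_by_budget(models, budget_gb):
    """Split (name, params_in_billions) pairs into fits / too_big lists."""
    fits, too_big = [], []
    for name, params_b in models:
        if estimate_model_gb(params_b) <= budget_gb:
            fits.append(name)
        else:
            too_big.append(name)
    return fits, too_big


# Parameter counts are taken from the model names used in this guide.
MODELS = [
    ("Llama 3.2 3B", 3),
    ("Qwen2.5 7B", 7),
    ("Qwen2.5 32B", 32),
    ("DeepSeek R1 32B", 32),
    ("Llama 3.3 70B", 70),
]

fits, too_big = split_by_budget(MODELS, budget_gb=8)
print("Fits in 8GB:", fits)
print("Too big:", too_big)
```

Under these assumptions a 32B model needs around 19GB and a 70B model a little over 40GB, which matches the figures in the list above.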

CPU-Only: What Works Without a GPU

If you have no GPU and are running purely on CPU with 8GB system RAM, stick to models under 4B parameters:

  • Gemma 4 E2B, Phi-4 Mini, Llama 3.2 1B or 3B
  • Expect 2–8 tokens per second on a modern CPU — usable but slow
  • Close all other applications to free up RAM before running
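To put the 2–8 tokens per second range in perspective, here is a small worked calculation. The 300-token answer length is a hypothetical figure picked for illustration, and the function ignores prompt processing time.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Time to generate num_tokens at a steady rate (prompt time ignored)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


ANSWER_TOKENS = 300  # hypothetical length of a few paragraphs

slow = generation_seconds(ANSWER_TOKENS, 2)  # low end of the range
fast = generation_seconds(ANSWER_TOKENS, 8)  # high end of the range

print(f"{ANSWER_TOKENS} tokens takes {fast:.1f} to {slow:.1f} seconds")
```

An answer of that length would take somewhere between about 38 seconds and two and a half minutes, which is why short prompts and short answers suit CPU-only setups best.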

Tips for Running Models on Limited Hardware

  • Use Q4_K_M quantisation — the best balance of quality and memory use
  • Set the environment variable OLLAMA_MAX_LOADED_MODELS=1 for the Ollama server to prevent it keeping multiple models loaded at once
  • Close your browser and other memory-hungry apps before running larger models
  • Use ollama ps to check whether your model loaded onto the GPU or fell back to the CPU (see the PROCESSOR column)
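The same GPU-or-CPU check can be scripted against a running Ollama server. The sketch below assumes the server listens on the default http://localhost:11434 and that its /api/ps response lists each loaded model with `size` and `size_vram` fields in bytes; `loaded_models` and `placement` are hypothetical helper names, and the sample entries are made up for illustration.

```python
import json
import urllib.request


def loaded_models(base_url="http://localhost:11434"):
    """Fetch the currently loaded models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])


def placement(entry):
    """Describe how much of a loaded model sits in VRAM.

    Assumes `size` is the total bytes loaded and `size_vram` the bytes on GPU.
    """
    size = entry.get("size", 0)
    vram = entry.get("size_vram", 0)
    if size <= 0:
        return "unknown"
    gpu_pct = round(100 * vram / size)
    if gpu_pct >= 100:
        return "100% GPU"
    if gpu_pct <= 0:
        return "100% CPU"
    return f"{gpu_pct}% GPU / {100 - gpu_pct}% CPU"


# Illustrative entries only, not real server output.
on_gpu = {"name": "example-a", "size": 5_000_000_000, "size_vram": 5_000_000_000}
split = {"name": "example-b", "size": 8_000_000_000, "size_vram": 6_000_000_000}

print(on_gpu["name"], placement(on_gpu))
print(split["name"], placement(split))
```

With the server running, `for m in loaded_models(): print(m["name"], placement(m))` would report each loaded model. Anything below 100% GPU means part of the model is running on the CPU, and a smaller model or lower quantisation may be faster.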

For a full reference of every Ollama command and flag, see the Ollama CLI Cheat Sheet.