Gemma 3 is Google’s latest open model family and one of the most versatile models you can run in Ollama. What sets it apart is multimodal support: it is one of the few models at this size that can handle both text and images locally. Here’s how to get started.
What is Gemma 3?
Gemma 3 was released by Google DeepMind in March 2025 and represents a significant step up from Gemma 2. The key improvements are:
- Multimodal support — Gemma 3 can understand and describe images, not just text
- Longer context — a 128K token context window on the 4B and larger models (up from 8K in Gemma 2; the 1B variant has 32K)
- Better instruction following — more reliable at complex, multi-step prompts
- Multilingual — supports over 140 languages
Gemma 3 Model Sizes in Ollama
| Model | RAM needed | Notes |
|---|---|---|
| gemma3:1b | ~2 GB | Ultra-lightweight, basic tasks |
| gemma3:4b | ~4 GB | Good for older hardware, solid quality |
| gemma3:12b | ~9 GB | Recommended — best balance of size and capability |
| gemma3:27b | ~18 GB | High quality, needs 32 GB RAM |
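As a rough rule of thumb, the table above can be folded into a small helper that picks a tag for the memory you have. The thresholds below are illustrative, not official requirements, and leave headroom for the OS and other applications:

```python
def pick_gemma3_tag(ram_gb):
    """Map available RAM in GB to a Gemma 3 tag.

    Thresholds are rough, derived from the table above;
    they deliberately leave headroom beyond the model's footprint.
    """
    if ram_gb >= 32:
        return "gemma3:27b"
    if ram_gb >= 16:
        return "gemma3:12b"
    if ram_gb >= 8:
        return "gemma3:4b"
    return "gemma3:1b"
```

For example, a 16 GB machine lands on the recommended `gemma3:12b`.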
How to Install Gemma 3 in Ollama
ollama pull gemma3
This pulls the default 4B model. For the recommended 12B:
ollama pull gemma3:12b
To run it:
ollama run gemma3:12b
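Beyond the interactive CLI, Ollama also exposes an HTTP API on port 11434, and by default `/api/generate` streams its answer as newline-delimited JSON fragments, each carrying a piece of the response. A minimal sketch of assembling such a stream (the input lines here are illustrative, not captured output):

```python
import json

def collect_stream(lines):
    """Join a streamed /api/generate response into one string.

    Ollama streams newline-delimited JSON objects, each with a
    "response" fragment; the final object has "done": true.
    """
    text = []
    for line in lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)
```

In practice you would feed this the response lines from an HTTP client; passing `"stream": false` in the request body skips streaming entirely and returns one complete object.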
How to Use Gemma 3 with Images
Gemma 3 supports image inputs from the command line. There is no separate image flag; instead, include the file path in the prompt and Ollama attaches the image automatically:
ollama run gemma3:12b "What is in this image? /path/to/image.jpg"
Or via the API:
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:12b",
  "prompt": "Describe what you see in this image",
  "images": ["base64_encoded_image_here"]
}'
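The `images` field expects base64-encoded image data. A small sketch of building that request body in Python using only the standard library (the file name is a placeholder):

```python
import base64
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama address

def build_image_payload(model, prompt, image_bytes):
    """Build the JSON body for /api/generate with an attached image.

    Ollama expects images as base64-encoded strings in the "images" array.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one complete response, not a stream
        "images": [base64.b64encode(image_bytes).decode("ascii")],
    }

# Sending the request (requires a running Ollama server):
# import urllib.request
# with open("photo.jpg", "rb") as f:
#     body = build_image_payload("gemma3:12b", "Describe this image", f.read())
# req = urllib.request.Request(
#     OLLAMA_URL,
#     data=json.dumps(body).encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```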
For a more user-friendly way to use image inputs, Open WebUI supports image uploads with Gemma 3 out of the box.
What is Gemma 3 Good At?
- Image understanding — describe photos, read text in images, analyse diagrams
- Long document processing — the 128K context makes it well suited for summarising lengthy texts
- Multilingual content — strong performance across many languages
- General knowledge Q&A — well-rounded responses across most topics
- Creative writing — better than most small models at creative tasks
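The long-context point is worth quantifying: at roughly four characters per token for English text, 128K tokens is on the order of half a million characters. A hedged sketch of that back-of-envelope check (the ratio is a common rule of thumb, not an exact tokeniser count):

```python
def fits_context(text, context_tokens=128_000, chars_per_token=4):
    """Rough check that a document fits Gemma 3's 128K window.

    Four characters per token is a rule of thumb for English;
    real token counts vary by language and content.
    """
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens <= context_tokens
```

A few thousand words easily fits; a document over ~500K characters likely needs chunking or summarising in passes.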
Gemma 3 vs Llama 3.2 Vision
Both support images locally. Gemma 3 tends to give more detailed image descriptions and handles longer context better. Llama 3.2 Vision has a larger community and more third-party integrations. For most users, Gemma 3 12B is the better choice if multimodal capability is a priority.
Gemma 3 vs Phi-4
Phi-4 is stronger on pure reasoning and maths. Gemma 3 is more versatile — better at creative tasks, images, and multilingual content. If you want one model that handles a wide range of tasks, Gemma 3 12B is a strong pick.
