
How to Run Gemma 3 on Ollama (Google’s Multimodal Model)

Gemma 3 is Google’s latest open-weight model family and one of the most versatile models you can run in Ollama. What makes it stand out is its support for both text and images, making it one of the few multimodal models you can run locally at this size. Here’s how to get started.

What is Gemma 3?

Gemma 3 was released by Google DeepMind in March 2025 and represents a significant step up from Gemma 2. The key improvements are:

  • Multimodal support — Gemma 3 can understand and describe images, not just text
  • Longer context — 128K token context window (up from 8K in Gemma 2)
  • Better instruction following — more reliable at complex, multi-step prompts
  • Multilingual — supports over 140 languages

Gemma 3 Model Sizes in Ollama

Model        RAM needed   Notes
gemma3:1b    ~2 GB        Ultra-lightweight, basic tasks
gemma3:4b    ~4 GB        Good for older hardware, solid quality
gemma3:12b   ~9 GB        Recommended: best balance of size and capability
gemma3:27b   ~18 GB       High quality, needs 32 GB RAM

How to Install Gemma 3 in Ollama

ollama pull gemma3

This pulls the default 4B model. For the recommended 12B:

ollama pull gemma3:12b

To run it:

ollama run gemma3:12b

How to Use Gemma 3 with Images

Gemma 3 supports image inputs from the command line. Ollama has no --image flag; instead, you include the image's file path directly in the prompt and Ollama attaches it automatically:

ollama run gemma3:12b "What is in this image? /path/to/image.jpg"

Or via the API:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:12b",
  "prompt": "Describe what you see in this image",
  "images": ["base64_encoded_image_here"],
  "stream": false
}'
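If you are scripting this, the base64 step is the part that usually trips people up: Ollama expects raw base64 strings with no data: URI prefix. A minimal Python sketch that builds the same request body (the function name is made up; the endpoint is Ollama's default):

```python
import base64
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_image_request(image_bytes: bytes, prompt: str,
                        model: str = "gemma3:12b") -> str:
    """Return a JSON body for /api/generate with one base64-encoded image."""
    payload = {
        "model": model,
        "prompt": prompt,
        # Ollama expects plain base64 strings, no "data:image/..." prefix.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # one JSON response instead of a stream
    }
    return json.dumps(payload)

# To actually send it (only works with Ollama running locally):
# from urllib.request import Request, urlopen
# body = build_image_request(open("photo.jpg", "rb").read(), "Describe this image")
# resp = urlopen(Request(OLLAMA_URL, body.encode(),
#                        {"Content-Type": "application/json"}))
# print(json.loads(resp.read())["response"])
```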

For a more user-friendly way to use image inputs, Open WebUI supports image uploads with Gemma 3 out of the box.

What is Gemma 3 Good At?

  • Image understanding — describe photos, read text in images, analyse diagrams
  • Long document processing — the 128K context makes it well suited for summarising lengthy texts
  • Multilingual content — strong performance across many languages
  • General knowledge Q&A — well-rounded responses across most topics
  • Creative writing — better than most small models at creative tasks

Gemma 3 vs Llama 3.2 Vision

Both support images locally. Gemma 3 tends to give more detailed image descriptions and handles longer context better. Llama 3.2 Vision has a larger community and more third-party integrations. For most users, Gemma 3 12B is the better choice if multimodal capability is a priority.

Gemma 3 vs Phi-4

Phi-4 is stronger on pure reasoning and maths. Gemma 3 is more versatile — better at creative tasks, images, and multilingual content. If you want one model that handles a wide range of tasks, Gemma 3 12B is a strong pick.
