
How to Use Multimodal Models with Ollama (Vision and Images)

Multimodal AI models can process both text and images, opening up a wide range of practical applications — from describing screenshots to extracting text from documents and analysing diagrams. Ollama supports several vision models that you can run entirely on your own hardware, with no cloud services required.

This guide covers the available vision models, how to use them from the command line and via the API, and practical use cases to get you started.

Supported Vision Models

Ollama supports several capable multimodal models. Here are the main options as of 2026:

  • llava:7b — the most widely used vision model on Ollama. Good general-purpose image understanding on consumer hardware. Requires around 8 GB RAM.
  • llava:13b — more capable than the 7B variant, with better detail recognition and more accurate descriptions. Requires 16 GB RAM or a capable GPU.
  • moondream — a compact 1.8B parameter model. Very fast, low memory footprint (~2 GB), good for basic image descriptions and quick tasks.
  • llama3.2-vision — Meta’s vision-capable Llama 3.2 model. Strong at following instructions about images and combining image and text reasoning.
  • gemma3 — Google’s Gemma 3 model includes vision capabilities in its larger variants. Well suited to document understanding and technical diagrams.

Pull a model before use. For example:

ollama pull llava:7b
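If you're scripting against Ollama, you can check whether a model is already pulled by querying the /api/tags endpoint, which lists locally available models. A minimal sketch in Python — the helper names are our own, and the host URL is the default local endpoint:

```python
import json
import urllib.request

def has_model(tags_json, name):
    """Check a parsed /api/tags response body for a locally pulled model."""
    return any(m["name"] == name for m in tags_json.get("models", []))

def model_available(name, host="http://localhost:11434"):
    """Query the local Ollama server's model list (requires Ollama running)."""
    with urllib.request.urlopen(host + "/api/tags") as resp:
        return has_model(json.load(resp), name)

# With the server running:
# model_available("llava:7b")  -> True once the model has been pulled
```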

If you want guidance on choosing between models for specific tasks, see our guide to the best Ollama vision models.

Using Vision Models from the Command Line

You can pass images to a vision model directly in the CLI using the image path as part of your prompt. Start an interactive session and include an image file path:

ollama run llava:7b

Once in the interactive session, type your prompt followed by the image path:

Describe this image: /path/to/your/image.jpg

Alternatively, you can run a one-shot prompt without starting an interactive session by quoting the prompt, image path included, directly on the command line:

ollama run llava:7b "Describe this image: /path/to/your/image.jpg"

Ollama detects the image path inside the prompt and attaches the file automatically. Both forms work reliably across recent versions.

Using Vision Models via the API

The Ollama REST API accepts images as base64-encoded strings in the request body. Here is how to do that in Python:

import requests
import base64

def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_b64 = encode_image("/path/to/screenshot.png")

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava:7b",
        "prompt": "Describe what you see in this image.",
        "images": [image_b64],
        "stream": False
    }
)

print(response.json()["response"])

The images field takes a list of base64-encoded strings, so you can pass multiple images in a single request if the model supports it.
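Wrapping the request assembly in a small helper keeps multi-image calls tidy. A sketch — the file paths and comparison prompt are placeholders, and how well a model reasons over more than one image varies by model:

```python
import base64

def build_payload(model, prompt, image_paths):
    """Assemble a /api/generate body with one or more base64-encoded images."""
    images = []
    for path in image_paths:
        with open(path, "rb") as f:
            images.append(base64.b64encode(f.read()).decode("utf-8"))
    return {"model": model, "prompt": prompt, "images": images, "stream": False}

# With the server running (paths are placeholders):
# requests.post("http://localhost:11434/api/generate",
#               json=build_payload("llava:7b",
#                                  "Compare these two screenshots and describe what changed.",
#                                  ["/path/to/before.png", "/path/to/after.png"]))
```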

Multi-Turn Chat with Images

The /api/chat endpoint also supports images, which is useful when you want to ask follow-up questions about an image:

import requests
import base64

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_b64 = encode_image("/path/to/diagram.png")

messages = [
    {
        "role": "user",
        "content": "What does this diagram show?",
        "images": [image_b64]
    }
]

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llava:7b",
        "messages": messages,
        "stream": False
    }
)

reply = response.json()["message"]["content"]
print(reply)

# Follow-up question (no image needed for subsequent turns)
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "What would you improve about this design?"})

response2 = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llava:7b",
        "messages": messages,
        "stream": False
    }
)

print(response2.json()["message"]["content"])

Note that in follow-up messages, you generally don’t need to re-send the image — the model retains context from the conversation history.
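For long replies you may prefer streaming: with "stream": true, the chat endpoint returns one JSON object per line, each carrying a chunk of the reply, with "done": true on the final one. A sketch of collecting those chunks (the helper name is our own):

```python
import json

def collect_stream(lines):
    """Join content chunks from a streamed /api/chat response, where each
    line is a standalone JSON object and the last one has "done": true."""
    parts = []
    for line in lines:
        if not line:
            continue  # requests' iter_lines() can yield keep-alive blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# With the server running, set stream=True in both the JSON body and the
# requests call, then feed response.iter_lines() to collect_stream:
# r = requests.post("http://localhost:11434/api/chat",
#                   json={"model": "llava:7b", "messages": messages, "stream": True},
#                   stream=True)
# print(collect_stream(r.iter_lines()))
```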

Practical Use Cases

Analysing Error Screenshots

Paste a screenshot of an error message or stack trace into a vision model and ask it to explain the error and suggest a fix. This is faster than typing out the error manually, especially for long exception traces.

image_b64 = encode_image("/tmp/error_screenshot.png")

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava:7b",
        "prompt": "This is a screenshot of an error. Explain what is wrong and suggest how to fix it.",
        "images": [image_b64],
        "stream": False
    }
)
print(response.json()["response"])

Understanding Technical Diagrams

Architecture diagrams, network maps, and flowcharts can be uploaded to a vision model for explanation. This is useful when reviewing documentation or trying to understand a system you haven’t worked with before.
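One way to make diagram analysis machine-usable is to ask the model for structured output and then pull the JSON out of the reply, since models often wrap it in prose or markdown code fences. A sketch — the prompt wording and JSON keys are illustrative:

```python
import json
import re

def parse_json_reply(text):
    """Pull the first JSON object out of a model reply, tolerating the
    surrounding prose or markdown code fences models often add."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in reply")
    return json.loads(match.group(0))

# Example prompt to pair with a diagram image (wording is illustrative):
# "List the components in this architecture diagram and the connections
#  between them, as JSON with keys 'components' and 'connections'."
```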

Document Scanning and OCR

Vision models can read text from images, making them a basic OCR tool. For scanned documents, invoices, or forms, you can ask the model to extract the text or specific fields. Note that dedicated OCR tools will generally be more accurate for high-volume structured extraction, but for occasional use the vision models perform well.
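A sketch of field extraction along these lines — the helper names, field names, and default model choice are our own, and the endpoint usage mirrors the earlier examples:

```python
import base64

def build_ocr_prompt(fields=None):
    """Free-form transcription by default; pass field names for JSON output."""
    if fields:
        return ("Extract the following fields from this document and "
                "return them as JSON: " + ", ".join(fields))
    return "Transcribe all text visible in this image."

def read_document(image_path, fields=None, model="llava:13b"):
    """Send a scanned document to a local vision model (requires Ollama running)."""
    import requests
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": build_ocr_prompt(fields),
        "images": [b64],
        "stream": False,
    })
    return resp.json()["response"]

# read_document("/path/to/invoice.png", fields=["invoice_number", "total", "due_date"])
```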

Automated Image Tagging

If you have a folder of images you want to organise or tag, a Python script using the vision API can automatically generate descriptions or tags:

import os
import requests
import base64

def tag_image(image_path):
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "moondream",
            "prompt": "List 5 descriptive tags for this image, separated by commas.",
            "images": [b64],
            "stream": False
        }
    )
    return response.json()["response"]

image_dir = "/path/to/images"
for filename in os.listdir(image_dir):
    if filename.lower().endswith((".jpg", ".jpeg", ".png")):
        path = os.path.join(image_dir, filename)
        tags = tag_image(path)
        print(f"{filename}: {tags}")

Using moondream for this task is a good choice because it’s much faster than llava, and bulk tagging doesn’t require deep image understanding — quick, accurate labels are sufficient.

Performance Considerations

Model Size vs Speed Trade-off

Vision models are larger than their text-only equivalents because they include a vision encoder component. llava:7b takes up around 4.5 GB on disk. On machines without a GPU, inference can be slow — 30–60 seconds per image on a typical laptop CPU for the 7B model.

For faster results on CPU-only machines, start with moondream — it’s 1.8B parameters and returns descriptions in a few seconds.

Image Resolution

Most vision models resize images internally to a fixed resolution (e.g. 336×336 or 448×448 pixels). Very high-resolution images therefore won't give proportionally better results, and they take longer to encode and transfer. For most use cases, images at around 1024×768 or lower are fine.
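Given that, downscaling large images before sending them can save encoding and transfer time. A sketch using the Pillow imaging library (an assumption — install it with pip install pillow):

```python
import base64
import io

from PIL import Image  # assumption: Pillow is installed (pip install pillow)

def encode_image_resized(image_path, max_side=1024):
    """Downscale so the longest side is at most max_side, then base64-encode.
    thumbnail() preserves aspect ratio and never upscales small images."""
    img = Image.open(image_path)
    img.thumbnail((max_side, max_side))
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")
```

The returned string drops straight into the images field of the API examples above.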

GPU Acceleration

If you have a compatible GPU, Ollama will use it automatically. Vision model inference is significantly faster with a GPU — typically 5–10x faster than CPU-only. You can confirm GPU usage by running ollama ps while a generation is in progress.

Using Vision Models in Open WebUI

If you use Open WebUI as your interface to Ollama, vision models are supported directly in the chat interface. Simply switch to a vision-capable model (e.g. llava:7b) in the model selector, then click the image upload button (paperclip icon) in the chat input to attach an image to your message.

This gives you a fully visual workflow without writing any code — ideal for occasional use cases like document review or screenshot analysis.

For more details on building applications that use images and text together with retrieval, see our guide on Ollama embeddings and RAG. To explore the full API that powers these requests, see the Ollama REST API developer guide.

Summary

Ollama’s vision model support makes it straightforward to add image understanding to your local AI setup. Whether you need to analyse screenshots, read documents, understand diagrams, or automate image tagging, the available models cover a wide range of capabilities. Start with moondream for speed or llava:7b for general accuracy, and use the API examples in this guide to integrate vision into your own scripts and applications.
