
How to Use Multimodal Models with Ollama (Vision and Images)

Multimodal AI models can process both text and images, opening up a wide range of practical applications — from describing screenshots to extracting text from documents and analysing diagrams. Ollama supports several vision models that you can run entirely on your own hardware, with no cloud services required.

This guide covers the available vision models, how to use them from the command line and via the API, and practical use cases to get you started.

Supported Vision Models

Ollama supports several capable multimodal models. Here are the main options as of 2026:

  • llava:7b — the most widely used vision model on Ollama. Good general-purpose image understanding on consumer hardware. Requires around 8 GB RAM.
  • llava:13b — more capable than the 7B variant, with better detail recognition and more accurate descriptions. Requires 16 GB RAM or a capable GPU.
  • moondream — a compact 1.8B parameter model. Very fast, low memory footprint (~2 GB), good for basic image descriptions and quick tasks.
  • llama3.2-vision — Meta’s vision-capable Llama 3.2 model. Strong at following instructions about images and combining image and text reasoning.
  • gemma3 — Google’s Gemma 3 model includes vision capabilities in its larger variants. Well suited to document understanding and technical diagrams.

Pull a model before use. For example:

ollama pull llava:7b
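If you're scripting against Ollama, you can check whether a model is already pulled by querying the /api/tags endpoint, which lists locally available models. A minimal sketch in Python — the helper names are our own, and the host URL is the default local endpoint:

```python
import json
import urllib.request

def has_model(tags_json, name):
    """Check a parsed /api/tags response body for a locally pulled model."""
    return any(m["name"] == name for m in tags_json.get("models", []))

def model_available(name, host="http://localhost:11434"):
    """Query the local Ollama server's model list (requires Ollama running)."""
    with urllib.request.urlopen(host + "/api/tags") as resp:
        return has_model(json.load(resp), name)

# With the server running:
# model_available("llava:7b")  -> True once the model has been pulled
```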

If you want guidance on choosing between models for specific tasks, see our guide to the best Ollama vision models.

Using Vision Models from the Command Line

You can pass images to a vision model directly in the CLI using the image path as part of your prompt. Start an interactive session and include an image file path:

ollama run llava:7b

Once in the interactive session, type your prompt followed by the image path:

Describe this image: /path/to/your/image.jpg

Alternatively, you can run a one-shot prompt without starting an interactive session by quoting the prompt, image path included, directly on the command line:

ollama run llava:7b "Describe this image: /path/to/your/image.jpg"

Ollama detects the image path inside the prompt and attaches the file automatically. Both forms work reliably across recent versions.

Using Vision Models via the API

The Ollama REST API accepts images as base64-encoded strings in the request body. Here is how to do that in Python:

import requests
import base64

def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_b64 = encode_image("/path/to/screenshot.png")

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava:7b",
        "prompt": "Describe what you see in this image.",
        "images": [image_b64],
        "stream": False
    }
)

print(response.json()["response"])

The images field takes a list of base64-encoded strings, so you can pass multiple images in a single request if the model supports it.
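Wrapping the request assembly in a small helper keeps multi-image calls tidy. A sketch — the file paths and comparison prompt are placeholders, and how well a model reasons over more than one image varies by model:

```python
import base64

def build_payload(model, prompt, image_paths):
    """Assemble a /api/generate body with one or more base64-encoded images."""
    images = []
    for path in image_paths:
        with open(path, "rb") as f:
            images.append(base64.b64encode(f.read()).decode("utf-8"))
    return {"model": model, "prompt": prompt, "images": images, "stream": False}

# With the server running (paths are placeholders):
# requests.post("http://localhost:11434/api/generate",
#               json=build_payload("llava:7b",
#                                  "Compare these two screenshots and describe what changed.",
#                                  ["/path/to/before.png", "/path/to/after.png"]))
```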

Multi-Turn Chat with Images

The /api/chat endpoint also supports images, which is useful when you want to ask follow-up questions about an image:

import requests
import base64

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_b64 = encode_image("/path/to/diagram.png")

messages = [
    {
        "role": "user",
        "content": "What does this diagram show?",
        "images": [image_b64]
    }
]

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llava:7b",
        "messages": messages,
        "stream": False
    }
)

reply = response.json()["message"]["content"]
print(reply)

# Follow-up question (no image needed for subsequent turns)
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "What would you improve about this design?"})

response2 = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llava:7b",
        "messages": messages,
        "stream": False
    }
)

print(response2.json()["message"]["content"])

Note that in follow-up messages, you generally don’t need to re-send the image — the model retains context from the conversation history.
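For long replies you may prefer streaming: with "stream": true, the chat endpoint returns one JSON object per line, each carrying a chunk of the reply, with "done": true on the final one. A sketch of collecting those chunks (the helper name is our own):

```python
import json

def collect_stream(lines):
    """Join content chunks from a streamed /api/chat response, where each
    line is a standalone JSON object and the last one has "done": true."""
    parts = []
    for line in lines:
        if not line:
            continue  # requests' iter_lines() can yield keep-alive blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# With the server running, set stream=True in both the JSON body and the
# requests call, then feed response.iter_lines() to collect_stream:
# r = requests.post("http://localhost:11434/api/chat",
#                   json={"model": "llava:7b", "messages": messages, "stream": True},
#                   stream=True)
# print(collect_stream(r.iter_lines()))
```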

Practical Use Cases

Analysing Error Screenshots

Paste a screenshot of an error message or stack trace into a vision model and ask it to explain the error and suggest a fix. This is faster than typing out the error manually, especially for long exception traces.

image_b64 = encode_image("/tmp/error_screenshot.png")

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava:7b",
        "prompt": "This is a screenshot of an error. Explain what is wrong and suggest how to fix it.",
        "images": [image_b64],
        "stream": False
    }
)
print(response.json()["response"])

Understanding Technical Diagrams

Architecture diagrams, network maps, and flowcharts can be uploaded to a vision model for explanation. This is useful when reviewing documentation or trying to understand a system you haven’t worked with before.
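One way to make diagram analysis machine-usable is to ask the model for structured output and then pull the JSON out of the reply, since models often wrap it in prose or markdown code fences. A sketch — the prompt wording and JSON keys are illustrative:

```python
import json
import re

def parse_json_reply(text):
    """Pull the first JSON object out of a model reply, tolerating the
    surrounding prose or markdown code fences models often add."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in reply")
    return json.loads(match.group(0))

# Example prompt to pair with a diagram image (wording is illustrative):
# "List the components in this architecture diagram and the connections
#  between them, as JSON with keys 'components' and 'connections'."
```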

Document Scanning and OCR

Vision models can read text from images, making them a basic OCR tool. For scanned documents, invoices, or forms, you can ask the model to extract the text or specific fields. Note that dedicated OCR tools will generally be more accurate for high-volume structured extraction, but for occasional use the vision models perform well.
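A sketch of field extraction along these lines — the helper names, field names, and default model choice are our own, and the endpoint usage mirrors the earlier examples:

```python
import base64

def build_ocr_prompt(fields=None):
    """Free-form transcription by default; pass field names for JSON output."""
    if fields:
        return ("Extract the following fields from this document and "
                "return them as JSON: " + ", ".join(fields))
    return "Transcribe all text visible in this image."

def read_document(image_path, fields=None, model="llava:13b"):
    """Send a scanned document to a local vision model (requires Ollama running)."""
    import requests
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": build_ocr_prompt(fields),
        "images": [b64],
        "stream": False,
    })
    return resp.json()["response"]

# read_document("/path/to/invoice.png", fields=["invoice_number", "total", "due_date"])
```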

Automated Image Tagging

If you have a folder of images you want to organise or tag, a Python script using the vision API can automatically generate descriptions or tags:

import os
import requests
import base64

def tag_image(image_path):
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "moondream",
            "prompt": "List 5 descriptive tags for this image, separated by commas.",
            "images": [b64],
            "stream": False
        }
    )
    return response.json()["response"]

image_dir = "/path/to/images"
for filename in os.listdir(image_dir):
    if filename.lower().endswith((".jpg", ".jpeg", ".png")):
        path = os.path.join(image_dir, filename)
        tags = tag_image(path)
        print(f"{filename}: {tags}")

Using moondream for this task is a good choice because it’s much faster than llava, and bulk tagging doesn’t require deep image understanding — quick, accurate labels are sufficient.

Performance Considerations

Model Size vs Speed Trade-off

Vision models are larger than their text-only equivalents because they include a vision encoder component. llava:7b takes up around 4.5 GB on disk. On machines without a GPU, inference can be slow — 30–60 seconds per image on a typical laptop CPU for the 7B model.

For faster results on CPU-only machines, start with moondream — it’s 1.8B parameters and returns descriptions in a few seconds.

Image Resolution

Most vision models resize images internally to a fixed resolution (e.g. 336×336 or 448×448 pixels). Very high-resolution images therefore won't give proportionally better results, and they take longer to encode and transfer. For most use cases, images at around 1024×768 or lower are fine.
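Given that, downscaling large images before sending them can save encoding and transfer time. A sketch using the Pillow imaging library (an assumption — install it with pip install pillow):

```python
import base64
import io

from PIL import Image  # assumption: Pillow is installed (pip install pillow)

def encode_image_resized(image_path, max_side=1024):
    """Downscale so the longest side is at most max_side, then base64-encode.
    thumbnail() preserves aspect ratio and never upscales small images."""
    img = Image.open(image_path)
    img.thumbnail((max_side, max_side))
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")
```

The returned string drops straight into the images field of the API examples above.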

GPU Acceleration

If you have a compatible GPU, Ollama will use it automatically. Vision model inference is significantly faster with a GPU — typically 5–10x faster than CPU-only. You can confirm GPU usage by running ollama ps while a generation is in progress.

Using Vision Models in Open WebUI

If you use Open WebUI as your interface to Ollama, vision models are supported directly in the chat interface. Simply switch to a vision-capable model (e.g. llava:7b) in the model selector, then click the image upload button (paperclip icon) in the chat input to attach an image to your message.

This gives you a fully visual workflow without writing any code — ideal for occasional use cases like document review or screenshot analysis.

For more details on building applications that use images and text together with retrieval, see our guide on Ollama embeddings and RAG. To explore the full API that powers these requests, see the Ollama REST API developer guide.

Summary

Ollama’s vision model support makes it straightforward to add image understanding to your local AI setup. Whether you need to analyse screenshots, read documents, understand diagrams, or automate image tagging, the available models cover a wide range of capabilities. Start with moondream for speed or llava:7b for general accuracy, and use the API examples in this guide to integrate vision into your own scripts and applications.
