
How to Use the Ollama Python Library: A Complete Guide

The Ollama Python library is the official client for interacting with Ollama from Python code. It gives you a clean, straightforward API to run large language models locally — no cloud API keys, no usage costs, and full control over your data. This guide covers everything from installation through to building practical applications with streaming, async support, and embeddings.

Prerequisites

Before you can use the Python library, Ollama itself must be installed and running on your machine. If you have not already installed Ollama, download it from ollama.com and follow the setup instructions for your platform.

Once installed, start the Ollama server:

ollama serve

On macOS, Ollama runs automatically in the background after installation. On Linux, you may need to start it manually or configure it as a system service. The server listens on http://localhost:11434 by default.

You will also need at least one model pulled:

ollama pull llama3.2
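
Before going further, it can help to confirm the server actually answers. A minimal sanity check using only the standard library (no Ollama Python package needed yet), assuming the default port — a running server responds to GET / with HTTP 200 and the text "Ollama is running":

```python
import urllib.request

def server_up(host='http://localhost:11434', timeout=2.0):
    """Return True if an Ollama server answers on the given host.

    Assumption: GET / on a running server returns HTTP 200.
    """
    try:
        with urllib.request.urlopen(host, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

print('Ollama server reachable:', server_up())
```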

Installation

pip install ollama

This installs the official library published by the Ollama team. It requires Python 3.8 or later and has minimal dependencies. Import it with:

import ollama

Generating Text with generate()

The ollama.generate() function is the simplest entry point. It takes a model name and a prompt, sends a single-turn request to the local Ollama server, and returns a dict-like response object.

import ollama

response = ollama.generate(model='llama3.2', prompt='Why is the sky blue?')
print(response['response'])

The generated text is under the 'response' key. You can pass additional parameters using the options argument:

response = ollama.generate(
    model='llama3.2',
    prompt='Explain quantum entanglement simply.',
    options={
        'temperature': 0.7,
        'num_ctx': 4096
    }
)
print(response['response'])

Common options:

  • temperature — Controls randomness. Lower values (e.g. 0.2) produce more deterministic output; higher values (e.g. 1.0) increase creativity.
  • num_ctx — Context window size in tokens. Increase this for longer inputs.
  • top_p — Nucleus sampling threshold.
  • num_predict — Maximum tokens to generate.
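
These options can be combined for reproducible output. A hedged sketch (the variable name and prompt are ours; temperature, seed, and num_predict are standard Ollama options):

```python
# A reusable options dict for near-deterministic generation.
DETERMINISTIC = {
    'temperature': 0,   # greedy decoding: always take the most likely token
    'seed': 42,         # fix the sampling seed
    'num_predict': 64,  # cap the number of generated tokens
}

# With the server running:
# response = ollama.generate(model='llama3.2',
#                            prompt='Name the three primary colors.',
#                            options=DETERMINISTIC)
# print(response['response'])
```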

Multi-Turn Conversations with chat()

For conversational applications where the model needs to remember prior messages, use ollama.chat(). Pass a list of message dictionaries with role and content keys. Roles are 'user', 'assistant', and 'system'.

import ollama

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant who speaks concisely.'},
    {'role': 'user', 'content': 'What is the capital of France?'},
]

response = ollama.chat(model='llama3.2', messages=messages)
print(response['message']['content'])

To continue the conversation, append the assistant’s reply and the next user message, then call chat() again:

messages.append(response['message'])
messages.append({'role': 'user', 'content': 'And what is its population?'})

response = ollama.chat(model='llama3.2', messages=messages)
print(response['message']['content'])

Building a Simple Chatbot Loop

import ollama

messages = []

print("Chat with llama3.2. Type 'quit' to exit.\n")

while True:
    user_input = input("You: ").strip()
    if user_input.lower() == 'quit':
        break

    messages.append({'role': 'user', 'content': user_input})

    response = ollama.chat(model='llama3.2', messages=messages)
    reply = response['message']['content']

    messages.append({'role': 'assistant', 'content': reply})
    print(f"Assistant: {reply}\n")

Streaming Responses

By default, the library waits for the model to finish generating before returning anything. Passing stream=True returns a generator, allowing you to print tokens as they arrive:

import ollama

stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Tell me a short story about a robot.'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

print()  # newline after stream ends

Streaming works with generate() too — each chunk has a 'response' key rather than 'message':

stream = ollama.generate(model='llama3.2', prompt='Count to ten.', stream=True)

for chunk in stream:
    print(chunk['response'], end='', flush=True)
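
When streaming a chat reply, you usually also want the full text afterwards so it can be appended back to the conversation history. A small helper sketch (collect_stream is our name, not part of the library; it works on the chunk shapes shown above):

```python
def collect_stream(chunks, key='message'):
    """Print streamed chunks as they arrive and return the joined text.

    Works for chat() chunks (key='message': text under ['message']['content'])
    and for generate() chunks (key='response': text directly under ['response']).
    """
    parts = []
    for chunk in chunks:
        text = chunk[key]['content'] if key == 'message' else chunk[key]
        print(text, end='', flush=True)
        parts.append(text)
    print()  # newline after the stream ends
    return ''.join(parts)

# With a live stream you would write, for example:
# reply = collect_stream(ollama.chat(model='llama3.2', messages=messages, stream=True))
# messages.append({'role': 'assistant', 'content': reply})
```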

Async Support

For async applications — FastAPI, asyncio — the library provides an AsyncClient that mirrors the synchronous API exactly:

import asyncio
from ollama import AsyncClient

async def main():
    client = AsyncClient()
    response = await client.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'What is asyncio?'}]
    )
    print(response['message']['content'])

asyncio.run(main())

Async streaming uses async for:

async def stream_response():
    client = AsyncClient()
    stream = await client.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Explain async programming.'}],
        stream=True
    )
    async for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)

asyncio.run(stream_response())

Managing Models

Listing Available Models

result = ollama.list()
for model in result['models']:
    print(model['name'])  # in library versions 0.4 and later, the key is 'model'

Pulling a Model

ollama.pull('llama3.2')

Stream pull progress:

for progress in ollama.pull('llama3.2', stream=True):
    status = progress.get('status', '')
    completed = progress.get('completed', 0)
    total = progress.get('total', 0)
    if total:
        print(f"{status}: {completed}/{total}", end='\r')
    else:
        print(status)

Showing Model Information

info = ollama.show('llama3.2')
print(info)

Generating Embeddings

Embeddings convert text into a numerical vector that captures its semantic meaning — used for similarity search, retrieval-augmented generation (RAG), and clustering. You need a model designed for embeddings such as nomic-embed-text:

ollama pull nomic-embed-text

Then, in Python:

result = ollama.embeddings(model='nomic-embed-text', prompt='The quick brown fox')
vector = result['embedding']
print(f"Embedding dimension: {len(vector)}")

A minimal similarity example using cosine distance:

def get_embedding(text):
    result = ollama.embeddings(model='nomic-embed-text', prompt=text)
    return result['embedding']

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = sum(x ** 2 for x in a) ** 0.5
    mag_b = sum(x ** 2 for x in b) ** 0.5
    return dot / (mag_a * mag_b)

query = "What is machine learning?"
candidates = [
    "Machine learning is a branch of artificial intelligence.",
    "The weather in London is often cloudy.",
]

query_vec = get_embedding(query)
for candidate in candidates:
    score = cosine_similarity(query_vec, get_embedding(candidate))
    print(f"Score {score:.4f}: {candidate}")
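
For retrieval you usually want the top-k most similar candidates rather than a printed score per item. A small pure helper along those lines (top_k is our name; cosine_similarity is repeated here so the sketch is self-contained):

```python
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = sum(x * x for x in a) ** 0.5
    mag_b = sum(x * x for x in b) ** 0.5
    return dot / (mag_a * mag_b)

def top_k(query_vec, docs, k=3):
    """Return the k document texts most similar to the query vector.

    docs is a list of (text, vector) pairs, e.g. built with get_embedding().
    """
    scored = sorted(docs,
                    key=lambda d: cosine_similarity(query_vec, d[1]),
                    reverse=True)
    return [text for text, _ in scored[:k]]
```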

Connecting to a Remote Ollama Instance

If Ollama runs on a different machine, point the client at a custom host using ollama.Client:

client = ollama.Client(host='http://192.168.1.50:11434')

response = client.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Hello from a remote client!'}]
)
print(response['message']['content'])

The Client class supports all the same methods as the module-level functions. The async equivalent is AsyncClient(host='...').
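
To keep scripts host-agnostic, the host can be read from configuration rather than hard-coded. The module-level functions also honor the OLLAMA_HOST environment variable (an assumption worth checking against your installed library version); a minimal sketch:

```python
import os

# Fall back to the default local server when OLLAMA_HOST is not set.
host = os.environ.get('OLLAMA_HOST', 'http://localhost:11434')
# client = ollama.Client(host=host)
print(f"Using Ollama at {host}")
```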

Error Handling

from ollama import ResponseError

try:
    response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Hello'}]
    )
    print(response['message']['content'])
except ResponseError as e:
    print(f"Ollama error: {e.status_code} — {e.error}")
except Exception as e:
    print(f"Unexpected error: {e}")

The ResponseError object exposes a status_code (HTTP status) and an error string describing the problem. Common causes include referencing a model that has not been pulled, or attempting to connect when the Ollama server is not running.
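
One practical pattern builds on this: if the error indicates a missing model, pull it and retry once. A hedged sketch (is_missing_model is our helper; the assumption is that the server reports an unknown model with HTTP status 404):

```python
def is_missing_model(status_code):
    """True when a ResponseError indicates the model has not been pulled.

    Assumption: the server reports an unknown model with HTTP 404.
    """
    return status_code == 404

# Recovery pattern (requires the ollama package and a running server):
#
# try:
#     response = ollama.chat(model='llama3.2', messages=messages)
# except ResponseError as e:
#     if is_missing_model(e.status_code):
#         ollama.pull('llama3.2')  # fetch the model, then retry once
#         response = ollama.chat(model='llama3.2', messages=messages)
#     else:
#         raise
```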

Quick Reference

  • Install with pip install ollama and ensure ollama serve is running first
  • Use ollama.generate() for single-turn completions, ollama.chat() for multi-turn conversations
  • Pass stream=True to either function to receive tokens incrementally
  • Use AsyncClient in async codebases — the API is identical but uses await
  • Manage models with ollama.list(), ollama.pull(), and ollama.show()
  • Generate semantic vectors with ollama.embeddings() using a dedicated embedding model
  • Point to a remote instance using ollama.Client(host='http://...')
  • Catch ResponseError for structured error handling
