
How to Use the Ollama Python Library: A Complete Guide

The Ollama Python library is the official client for interacting with Ollama from Python code. It gives you a clean, straightforward API to run large language models locally — no cloud API keys, no usage costs, and full control over your data. This guide covers everything from installation through to building practical applications with streaming, async support, and embeddings.

Prerequisites

Before you can use the Python library, Ollama itself must be installed and running on your machine. If you have not already installed Ollama, download it from ollama.com and follow the setup instructions for your platform.

Once installed, start the Ollama server:

ollama serve

On macOS, Ollama runs automatically in the background after installation. On Linux, you may need to start it manually or configure it as a system service. The server listens on http://localhost:11434 by default.

You will also need at least one model pulled:

ollama pull llama3.2
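
Before going further, it can help to confirm the server actually answers. A minimal sanity check using only the standard library (no Ollama Python package needed yet), assuming the default port — a running server responds to GET / with HTTP 200 and the text "Ollama is running":

```python
import urllib.request

def server_up(host='http://localhost:11434', timeout=2.0):
    """Return True if an Ollama server answers on the given host.

    Assumption: GET / on a running server returns HTTP 200.
    """
    try:
        with urllib.request.urlopen(host, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

print('Ollama server reachable:', server_up())
```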

Installation

pip install ollama

This installs the official library published by the Ollama team. It requires Python 3.8 or later and has minimal dependencies. Import it with:

import ollama

Generating Text with generate()

The ollama.generate() function is the simplest entry point. It takes a model name and a prompt, sends a single-turn request to the local Ollama server, and returns a dict-like response object.

import ollama

response = ollama.generate(model='llama3.2', prompt='Why is the sky blue?')
print(response['response'])

The generated text is under the 'response' key. You can pass additional parameters using the options argument:

response = ollama.generate(
    model='llama3.2',
    prompt='Explain quantum entanglement simply.',
    options={
        'temperature': 0.7,
        'num_ctx': 4096
    }
)
print(response['response'])

Common options:

  • temperature — Controls randomness. Lower values (e.g. 0.2) produce more deterministic output; higher values (e.g. 1.0) increase creativity.
  • num_ctx — Context window size in tokens. Increase this for longer inputs.
  • top_p — Nucleus sampling threshold.
  • num_predict — Maximum tokens to generate.
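
These options can be combined for reproducible output. A hedged sketch (the variable name and prompt are ours; temperature, seed, and num_predict are standard Ollama options):

```python
# A reusable options dict for near-deterministic generation.
DETERMINISTIC = {
    'temperature': 0,   # greedy decoding: always take the most likely token
    'seed': 42,         # fix the sampling seed
    'num_predict': 64,  # cap the number of generated tokens
}

# With the server running:
# response = ollama.generate(model='llama3.2',
#                            prompt='Name the three primary colors.',
#                            options=DETERMINISTIC)
# print(response['response'])
```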

Multi-Turn Conversations with chat()

For conversational applications where the model needs to remember prior messages, use ollama.chat(). Pass a list of message dictionaries with role and content keys. Roles are 'user', 'assistant', and 'system'.

import ollama

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant who speaks concisely.'},
    {'role': 'user', 'content': 'What is the capital of France?'},
]

response = ollama.chat(model='llama3.2', messages=messages)
print(response['message']['content'])

To continue the conversation, append the assistant’s reply and the next user message, then call chat() again:

messages.append(response['message'])
messages.append({'role': 'user', 'content': 'And what is its population?'})

response = ollama.chat(model='llama3.2', messages=messages)
print(response['message']['content'])

Building a Simple Chatbot Loop

import ollama

messages = []

print("Chat with llama3.2. Type 'quit' to exit.\n")

while True:
    user_input = input("You: ").strip()
    if user_input.lower() == 'quit':
        break

    messages.append({'role': 'user', 'content': user_input})

    response = ollama.chat(model='llama3.2', messages=messages)
    reply = response['message']['content']

    messages.append({'role': 'assistant', 'content': reply})
    print(f"Assistant: {reply}\n")

Streaming Responses

By default, the library waits for the model to finish generating before returning anything. Passing stream=True returns a generator, allowing you to print tokens as they arrive:

import ollama

stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Tell me a short story about a robot.'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

print()  # newline after stream ends

Streaming works with generate() too — each chunk has a 'response' key rather than 'message':

stream = ollama.generate(model='llama3.2', prompt='Count to ten.', stream=True)

for chunk in stream:
    print(chunk['response'], end='', flush=True)
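
When streaming a chat reply, you usually also want the full text afterwards so it can be appended back to the conversation history. A small helper sketch (collect_stream is our name, not part of the library; it works on the chunk shapes shown above):

```python
def collect_stream(chunks, key='message'):
    """Print streamed chunks as they arrive and return the joined text.

    Works for chat() chunks (key='message': text under ['message']['content'])
    and for generate() chunks (key='response': text directly under ['response']).
    """
    parts = []
    for chunk in chunks:
        text = chunk[key]['content'] if key == 'message' else chunk[key]
        print(text, end='', flush=True)
        parts.append(text)
    print()  # newline after the stream ends
    return ''.join(parts)

# With a live stream you would write, for example:
# reply = collect_stream(ollama.chat(model='llama3.2', messages=messages, stream=True))
# messages.append({'role': 'assistant', 'content': reply})
```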

Async Support

For async applications — FastAPI, asyncio — the library provides an AsyncClient that mirrors the synchronous API exactly:

import asyncio
from ollama import AsyncClient

async def main():
    client = AsyncClient()
    response = await client.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'What is asyncio?'}]
    )
    print(response['message']['content'])

asyncio.run(main())

Async streaming uses async for:

async def stream_response():
    client = AsyncClient()
    stream = await client.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Explain async programming.'}],
        stream=True
    )
    async for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)

asyncio.run(stream_response())

Managing Models

Listing Available Models

result = ollama.list()
for model in result['models']:
    print(model['name'])  # in library versions 0.4 and later, the key is 'model'

Pulling a Model

ollama.pull('llama3.2')

Stream pull progress:

for progress in ollama.pull('llama3.2', stream=True):
    status = progress.get('status', '')
    completed = progress.get('completed', 0)
    total = progress.get('total', 0)
    if total:
        print(f"{status}: {completed}/{total}", end='\r')
    else:
        print(status)

Showing Model Information

info = ollama.show('llama3.2')
print(info)

Generating Embeddings

Embeddings convert text into a numerical vector that captures its semantic meaning — used for similarity search, retrieval-augmented generation (RAG), and clustering. You need a model designed for embeddings such as nomic-embed-text:

ollama pull nomic-embed-text

Then, in Python:

result = ollama.embeddings(model='nomic-embed-text', prompt='The quick brown fox')
vector = result['embedding']
print(f"Embedding dimension: {len(vector)}")

A minimal similarity example using cosine distance:

def get_embedding(text):
    result = ollama.embeddings(model='nomic-embed-text', prompt=text)
    return result['embedding']

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = sum(x ** 2 for x in a) ** 0.5
    mag_b = sum(x ** 2 for x in b) ** 0.5
    return dot / (mag_a * mag_b)

query = "What is machine learning?"
candidates = [
    "Machine learning is a branch of artificial intelligence.",
    "The weather in London is often cloudy.",
]

query_vec = get_embedding(query)
for candidate in candidates:
    score = cosine_similarity(query_vec, get_embedding(candidate))
    print(f"Score {score:.4f}: {candidate}")
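
For retrieval you usually want the top-k most similar candidates rather than a printed score per item. A small pure helper along those lines (top_k is our name; cosine_similarity is repeated here so the sketch is self-contained):

```python
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = sum(x * x for x in a) ** 0.5
    mag_b = sum(x * x for x in b) ** 0.5
    return dot / (mag_a * mag_b)

def top_k(query_vec, docs, k=3):
    """Return the k document texts most similar to the query vector.

    docs is a list of (text, vector) pairs, e.g. built with get_embedding().
    """
    scored = sorted(docs,
                    key=lambda d: cosine_similarity(query_vec, d[1]),
                    reverse=True)
    return [text for text, _ in scored[:k]]
```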

Connecting to a Remote Ollama Instance

If Ollama runs on a different machine, point the client at a custom host using ollama.Client:

client = ollama.Client(host='http://192.168.1.50:11434')

response = client.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Hello from a remote client!'}]
)
print(response['message']['content'])

The Client class supports all the same methods as the module-level functions. The async equivalent is AsyncClient(host='...').
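
To keep scripts host-agnostic, the host can be read from configuration rather than hard-coded. The module-level functions also honor the OLLAMA_HOST environment variable (an assumption worth checking against your installed library version); a minimal sketch:

```python
import os

# Fall back to the default local server when OLLAMA_HOST is not set.
host = os.environ.get('OLLAMA_HOST', 'http://localhost:11434')
# client = ollama.Client(host=host)
print(f"Using Ollama at {host}")
```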

Error Handling

from ollama import ResponseError

try:
    response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Hello'}]
    )
    print(response['message']['content'])
except ResponseError as e:
    print(f"Ollama error: {e.status_code} — {e.error}")
except Exception as e:
    print(f"Unexpected error: {e}")

The ResponseError object exposes a status_code (HTTP status) and an error string describing the problem. Common causes include referencing a model that has not been pulled, or attempting to connect when the Ollama server is not running.
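
One practical pattern builds on this: if the error indicates a missing model, pull it and retry once. A hedged sketch (is_missing_model is our helper; the assumption is that the server reports an unknown model with HTTP status 404):

```python
def is_missing_model(status_code):
    """True when a ResponseError indicates the model has not been pulled.

    Assumption: the server reports an unknown model with HTTP 404.
    """
    return status_code == 404

# Recovery pattern (requires the ollama package and a running server):
#
# try:
#     response = ollama.chat(model='llama3.2', messages=messages)
# except ResponseError as e:
#     if is_missing_model(e.status_code):
#         ollama.pull('llama3.2')  # fetch the model, then retry once
#         response = ollama.chat(model='llama3.2', messages=messages)
#     else:
#         raise
```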

Quick Reference

  • Install with pip install ollama and ensure ollama serve is running first
  • Use ollama.generate() for single-turn completions, ollama.chat() for multi-turn conversations
  • Pass stream=True to either function to receive tokens incrementally
  • Use AsyncClient in async codebases — the API is identical but uses await
  • Manage models with ollama.list(), ollama.pull(), and ollama.show()
  • Generate semantic vectors with ollama.embeddings() using a dedicated embedding model
  • Point to a remote instance using ollama.Client(host='http://...')
  • Catch ResponseError for structured error handling
