The Ollama Python library is the official client for interacting with Ollama from Python code. It gives you a clean, straightforward API to run large language models locally — no cloud API keys, no usage costs, and full control over your data. This guide covers everything from installation through to building practical applications with streaming, async support, and embeddings.
Prerequisites
Before you can use the Python library, Ollama itself must be installed and running on your machine. If you have not already installed Ollama, download it from ollama.com and follow the setup instructions for your platform.
Once installed, start the Ollama server:
ollama serve
On macOS, Ollama runs automatically in the background after installation. On Linux, you may need to start it manually or configure it as a system service. The server listens on http://localhost:11434 by default.
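Before making requests from Python, it can be useful to confirm the server is actually reachable. This small helper is my own sketch, not part of the ollama library; it relies only on the documented behavior that the root endpoint at http://localhost:11434 answers HTTP 200 when the server is up:

```python
import urllib.request
import urllib.error

def ollama_is_running(host="http://localhost:11434", timeout=2):
    """Return True if an Ollama server responds at the given host."""
    try:
        with urllib.request.urlopen(host, timeout=timeout) as resp:
            # The root endpoint replies "Ollama is running" with status 200
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Call it at startup and print a friendly hint ("did you run ollama serve?") instead of letting the first chat request fail with a connection error.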
You will also need at least one model pulled:
ollama pull llama3.2
Installation
pip install ollama
This installs the official library published by the Ollama team. It requires Python 3.8 or later and has minimal dependencies. Import it with:
import ollama
Generating Text with generate()
The ollama.generate() function is the simplest entry point. It takes a model name and a prompt, sends a single-turn request to the local Ollama server, and returns a response dictionary.
import ollama
response = ollama.generate(model='llama3.2', prompt='Why is the sky blue?')
print(response['response'])
The generated text is under the 'response' key. You can pass additional parameters using the options argument:
response = ollama.generate(
    model='llama3.2',
    prompt='Explain quantum entanglement simply.',
    options={
        'temperature': 0.7,
        'num_ctx': 4096
    }
)
print(response['response'])
Common options:
- temperature — Controls randomness. Lower values (e.g. 0.2) produce more deterministic output; higher values (e.g. 1.0) increase creativity.
- num_ctx — Context window size in tokens. Increase this for longer inputs.
- top_p — Nucleus sampling threshold.
- num_predict — Maximum tokens to generate.
Multi-Turn Conversations with chat()
For conversational applications where the model needs to remember prior messages, use ollama.chat(). Pass a list of message dictionaries with role and content keys. Roles are 'user', 'assistant', and 'system'.
import ollama
messages = [
    {'role': 'system', 'content': 'You are a helpful assistant who speaks concisely.'},
    {'role': 'user', 'content': 'What is the capital of France?'},
]
response = ollama.chat(model='llama3.2', messages=messages)
print(response['message']['content'])
To continue the conversation, append the assistant’s reply and the next user message, then call chat() again:
messages.append(response['message'])
messages.append({'role': 'user', 'content': 'And what is its population?'})
response = ollama.chat(model='llama3.2', messages=messages)
print(response['message']['content'])
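Because the full message list is resent on every call, long conversations eventually exceed the model's context window. One common refinement is to trim the history before each request. The helper below is my own sketch (the name trim_history is not from the library); it keeps any system prompt plus only the most recent messages:

```python
def trim_history(messages, max_turns=10):
    """Keep the system prompt (if any) plus the most recent messages.

    Prevents the message list from growing without bound in long chats.
    """
    system = [m for m in messages if m['role'] == 'system']
    rest = [m for m in messages if m['role'] != 'system']
    return system + rest[-max_turns:]
```

Call trim_history(messages) right before each ollama.chat() call; tune max_turns to the model's num_ctx and your typical message length.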
Building a Simple Chatbot Loop
import ollama
messages = []
print("Chat with llama3.2. Type 'quit' to exit.\n")

while True:
    user_input = input("You: ").strip()
    if user_input.lower() == 'quit':
        break
    messages.append({'role': 'user', 'content': user_input})
    response = ollama.chat(model='llama3.2', messages=messages)
    reply = response['message']['content']
    messages.append({'role': 'assistant', 'content': reply})
    print(f"Assistant: {reply}\n")
Streaming Responses
By default, the library waits for the model to finish generating before returning anything. Passing stream=True returns a generator, allowing you to print tokens as they arrive:
import ollama
stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Tell me a short story about a robot.'}],
    stream=True
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()  # newline after stream ends
Streaming works with generate() too — each chunk has a 'response' key rather than 'message':
stream = ollama.generate(model='llama3.2', prompt='Count to ten.', stream=True)
for chunk in stream:
    print(chunk['response'], end='', flush=True)
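If you stream for responsiveness but still need the complete text afterwards (to append to the chat history, for example), you have to accumulate the chunks yourself. A minimal helper, my own rather than part of the library, written here for generate()-style chunks that carry a 'response' key:

```python
def collect_stream(chunks, key='response'):
    """Join streamed chunk texts into the full response string.

    Works on any iterable of dicts; pass the chunks from a
    stream=True generate() call (each carries a 'response' key).
    """
    return ''.join(chunk[key] for chunk in chunks)
```

For chat() streams the text lives under chunk['message']['content'] instead, so you would join that field in the loop rather than using this helper directly.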
Async Support
For async applications — FastAPI, asyncio — the library provides an AsyncClient that mirrors the synchronous API exactly:
import asyncio
from ollama import AsyncClient
async def main():
    client = AsyncClient()
    response = await client.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'What is asyncio?'}]
    )
    print(response['message']['content'])

asyncio.run(main())
Async streaming uses async for:
async def stream_response():
    client = AsyncClient()
    stream = await client.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Explain async programming.'}],
        stream=True
    )
    async for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)
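The main payoff of AsyncClient is concurrency: several prompts can be in flight at once via asyncio.gather. The sketch below is my own pattern, not a library function; it takes the client as a parameter, so any object exposing an async chat() with the same signature works:

```python
import asyncio

async def ask_all(client, prompts, model='llama3.2'):
    """Send several prompts concurrently via one client, preserving order.

    `client` is expected to expose an async chat() like ollama.AsyncClient.
    """
    async def ask(prompt):
        response = await client.chat(
            model=model,
            messages=[{'role': 'user', 'content': prompt}],
        )
        return response['message']['content']

    # gather runs all requests concurrently and returns results in input order
    return await asyncio.gather(*(ask(p) for p in prompts))
```

Usage against a real server would look like asyncio.run(ask_all(AsyncClient(), ['Question 1', 'Question 2'])). Note that throughput still depends on how many requests the Ollama server is configured to process in parallel.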
Managing Models
Listing Available Models
result = ollama.list()
for model in result['models']:
    print(model['name'])
Pulling a Model
ollama.pull('llama3.2')
Stream pull progress:
for progress in ollama.pull('llama3.2', stream=True):
    status = progress.get('status', '')
    completed = progress.get('completed', 0)
    total = progress.get('total', 0)
    if total:
        print(f"{status}: {completed}/{total}", end='\r')
    else:
        print(status)
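Raw byte counts are hard to read for multi-gigabyte downloads, so it can help to convert each update into a percentage. The formatting helper below is my own, built on the same 'status'/'completed'/'total' fields the loop above reads:

```python
def progress_line(status, completed, total):
    """Format one pull-progress update as 'status: NN%'.

    Falls back to the bare status for updates that carry no byte totals
    (e.g. manifest and verification steps).
    """
    if not total:
        return status
    return f"{status}: {completed * 100 // total}%"
```

Swap it into the streaming loop: print(progress_line(status, completed, total), end='\r').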
Showing Model Information
info = ollama.show('llama3.2')
print(info)
Generating Embeddings
Embeddings convert text into a numerical vector that captures its semantic meaning — used for similarity search, retrieval-augmented generation (RAG), and clustering. You need a model designed for embeddings such as nomic-embed-text:
ollama pull nomic-embed-text
result = ollama.embeddings(model='nomic-embed-text', prompt='The quick brown fox')
vector = result['embedding']
print(f"Embedding dimension: {len(vector)}")
A minimal similarity example using cosine distance:
def get_embedding(text):
    result = ollama.embeddings(model='nomic-embed-text', prompt=text)
    return result['embedding']

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = sum(x ** 2 for x in a) ** 0.5
    mag_b = sum(x ** 2 for x in b) ** 0.5
    return dot / (mag_a * mag_b)
query = "What is machine learning?"
candidates = [
    "Machine learning is a branch of artificial intelligence.",
    "The weather in London is often cloudy.",
]
query_vec = get_embedding(query)
for candidate in candidates:
    score = cosine_similarity(query_vec, get_embedding(candidate))
    print(f"Score {score:.4f}: {candidate}")
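In retrieval use cases you usually want the best matches rather than a printed score per candidate. This ranking helper is my own sketch; it inlines the same cosine formula as above so it stands alone and works on pre-computed vectors, which avoids re-embedding candidates on every query:

```python
def rank_by_similarity(query_vec, candidate_vecs):
    """Return candidate indices sorted by cosine similarity, best first."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        mag = lambda v: sum(x * x for x in v) ** 0.5
        return dot / (mag(a) * mag(b))

    scores = [cos(query_vec, v) for v in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Embed the candidate texts once at startup, keep the vectors in a list, and use the returned indices to look the texts back up; for large corpora you would switch to NumPy or a vector database, but the logic is the same.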
Connecting to a Remote Ollama Instance
If Ollama runs on a different machine, point the client at a custom host using ollama.Client:
client = ollama.Client(host='http://192.168.1.50:11434')
response = client.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Hello from a remote client!'}]
)
print(response['message']['content'])
The Client class supports all the same methods as the module-level functions. The async equivalent is AsyncClient(host='...').
Error Handling
from ollama import ResponseError
try:
    response = ollama.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Hello'}]
    )
    print(response['message']['content'])
except ResponseError as e:
    print(f"Ollama error: {e.status_code} - {e.error}")
except Exception as e:
    print(f"Unexpected error: {e}")
The ResponseError object exposes a status_code (HTTP status) and an error string describing the problem. Common causes include referencing a model that has not been pulled, or attempting to connect when the Ollama server is not running.
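Some failures are transient, such as the server restarting or briefly timing out under load, and are worth retrying rather than surfacing immediately. The generic wrapper below is my own sketch, not part of the library; in practice you would pass ollama's ResponseError (or a connection-error type) as retry_on:

```python
import time

def with_retries(fn, attempts=3, delay=1.0, retry_on=(Exception,)):
    """Call fn(), retrying on the given exceptions with a fixed delay.

    Re-raises the last exception once all attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```

Usage might look like with_retries(lambda: ollama.chat(model='llama3.2', messages=messages), retry_on=(ResponseError,)). Keep attempts small: a missing model will fail identically every time, so only retry errors that can plausibly resolve themselves.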
Quick Reference
- Install with pip install ollama and ensure ollama serve is running first
- Use ollama.generate() for single-turn completions, ollama.chat() for multi-turn conversations
- Pass stream=True to either function to receive tokens incrementally
- Use AsyncClient in async codebases; the API is identical but uses await
- Manage models with ollama.list(), ollama.pull(), and ollama.show()
- Generate semantic vectors with ollama.embeddings() using a dedicated embedding model
- Point to a remote instance using ollama.Client(host='http://...')
- Catch ResponseError for structured error handling