Ollama’s local REST API makes it straightforward to call local LLMs from Python — either directly with the requests library, via the official Ollama Python package, or through the OpenAI SDK. This guide covers all three approaches so you can pick the one that fits your project.
Prerequisites
Make sure Ollama is installed and running, and that you’ve pulled at least one model:
ollama pull llama3.1
ollama serve # starts the API server on http://localhost:11434 (skip this if the Ollama desktop app is already running)
Method 1: The Official Ollama Python Library
The simplest option. Install the package and start calling models immediately:
pip install ollama
Basic chat completion
import ollama

response = ollama.chat(
    model='llama3.1',
    messages=[
        {'role': 'user', 'content': 'Explain what a REST API is in simple terms.'}
    ]
)
print(response['message']['content'])
Streaming responses
import ollama

stream = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Write a short poem about Python.'}],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
Generate (single prompt, no chat history)
import ollama
response = ollama.generate(model='llama3.1', prompt='What is the capital of France?')
print(response['response'])
Method 2: Using the OpenAI Python SDK
If your project already uses the OpenAI SDK, you can point it at your local Ollama instance with one change — the base_url:
pip install openai
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required by the SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model='llama3.1',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Summarise the key features of Python 3.12.'}
    ]
)
print(response.choices[0].message.content)
This approach is ideal for projects where you want to swap between local Ollama models and OpenAI’s API without changing your application code.
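One way to make that swap concrete is to read the endpoint, key, and model from environment variables and fall back to local Ollama defaults. The variable names below (LLM_BASE_URL, LLM_API_KEY, LLM_MODEL) are illustrative choices, not anything the SDK or Ollama defines:

```python
import os

def llm_config() -> dict:
    """Build client settings from the environment, defaulting to local Ollama.

    The environment variable names here are arbitrary examples."""
    return {
        'base_url': os.environ.get('LLM_BASE_URL', 'http://localhost:11434/v1'),
        'api_key': os.environ.get('LLM_API_KEY', 'ollama'),
        'model': os.environ.get('LLM_MODEL', 'llama3.1'),
    }

cfg = llm_config()
# Then construct the client from the config:
# client = OpenAI(base_url=cfg['base_url'], api_key=cfg['api_key'])
print(cfg['base_url'])
```

Setting `LLM_BASE_URL=https://api.openai.com/v1` and a real API key would then point the same code at OpenAI with no source changes.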
Method 3: Direct HTTP Requests
No SDK required — just the requests library (pip install requests):
import requests

response = requests.post(
    'http://localhost:11434/api/chat',
    json={
        'model': 'llama3.1',
        'messages': [{'role': 'user', 'content': 'What is machine learning?'}],
        'stream': False
    }
)
data = response.json()
print(data['message']['content'])
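If you omit 'stream': False, the /api/chat endpoint streams its reply as newline-delimited JSON, one chunk per line, with a final chunk marked "done": true. Rather than make a live request, the sketch below reassembles the content from canned lines in that shape (the text in them is invented):

```python
import json

def join_stream(lines):
    """Concatenate the message content from newline-delimited JSON chat chunks."""
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk['message']['content'])
        if chunk.get('done'):
            break
    return ''.join(parts)

# Canned chunks in the streaming response shape (content invented here)
sample = [
    '{"message": {"role": "assistant", "content": "Machine "}, "done": false}',
    '{"message": {"role": "assistant", "content": "learning..."}, "done": true}',
]
print(join_stream(sample))  # Machine learning...
```

In a real call you would iterate over `response.iter_lines()` from a streaming requests.post and feed each line through the same parsing logic.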
Working with Embeddings
Ollama also supports embedding generation, which is useful for RAG pipelines and semantic search:
import ollama

# Pull an embedding model first:
# ollama pull nomic-embed-text
response = ollama.embeddings(
    model='nomic-embed-text',
    prompt='The quick brown fox jumps over the lazy dog'
)
embedding = response['embedding']
print(f"Embedding dimensions: {len(embedding)}")
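Once you have embeddings, comparing them usually means cosine similarity. A dependency-free version looks like this; the vectors below are toy values, but two vectors from the embeddings call above would be compared the same way:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

In a RAG pipeline you would embed the query, compute this score against each stored document embedding, and take the highest-scoring documents as context.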
Multi-turn Conversation
Build a simple chat loop that maintains conversation history:
import ollama

messages = []
print("Chat with Llama 3.1 (type 'quit' to exit)")

while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    messages.append({'role': 'user', 'content': user_input})
    response = ollama.chat(model='llama3.1', messages=messages)
    assistant_message = response['message']['content']
    messages.append({'role': 'assistant', 'content': assistant_message})
    print(f"Assistant: {assistant_message}\n")
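Note that the messages list above grows without bound, and every turn resends the whole history to the model. A simple mitigation is to keep only the most recent turns; trim_history below is a sketch of that idea, not part of the ollama library:

```python
def trim_history(messages, max_turns=10):
    """Keep at most the last max_turns user/assistant pairs."""
    return messages[-(max_turns * 2):]

# Fabricated 30-message history: user at even indices, assistant at odd
history = [
    {'role': 'user', 'content': f'q{i}'} if i % 2 == 0
    else {'role': 'assistant', 'content': f'a{i}'}
    for i in range(30)
]
trimmed = trim_history(history, max_turns=10)
print(len(trimmed))  # 20
```

If your history starts with a system message, hold it out and re-prepend it after trimming so it always survives; fancier approaches summarise the dropped turns instead of discarding them.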
Listing and Managing Models
import ollama

# List available models
models = ollama.list()
for model in models['models']:
    print(model['model'])

# Pull a new model
ollama.pull('mistral')

# Check model details
info = ollama.show('llama3.1')
print(info['modelfile'])
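Since pulls can be multi-gigabyte downloads, you may want to check whether a model is already installed before calling ollama.pull. Given the names returned by the listing above (which carry a tag such as :latest), a small membership check could look like this; is_pulled is a hypothetical helper, not an ollama function:

```python
def is_pulled(name, installed):
    """True if name matches an installed model, treating a bare name as :latest."""
    if ':' not in name:
        name += ':latest'
    return name in installed

# Example installed-model names as returned by the list call
installed = ['llama3.1:latest', 'nomic-embed-text:latest']
print(is_pulled('llama3.1', installed))  # True
print(is_pulled('mistral', installed))   # False
```

You would build `installed` from the listing loop shown earlier and pull only when the check fails.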
Choosing the Right Model for Your Python Project
The best model depends on your use case:
- Coding assistance: See best Ollama models for coding
- Summarisation: See best Ollama models for summarisation
- RAG and embeddings: See best Ollama models for RAG
Next Steps
Once you’re comfortable calling Ollama from Python, the natural next step is building a RAG pipeline with LangChain or integrating Ollama into a web app using FastAPI. You can also explore the Docker deployment guide for running Ollama in a containerised environment.