Ollama exposes a clean REST API on localhost:11434 that lets you integrate locally-running large language models into your applications with minimal setup. Whether you want to hit raw endpoints with curl, build Python scripts using the requests library, or simply swap out your OpenAI base URL and keep everything else unchanged, Ollama’s API is designed to get out of your way. This guide walks through every major endpoint, practical code examples, streaming, embeddings, LangChain integration, and environment variables for remote access.
## API Overview
When Ollama is running, it starts an HTTP server on http://localhost:11434. No SDK required — all interactions happen over standard HTTP. Ollama also ships an OpenAI-compatible endpoint layer at http://localhost:11434/v1, meaning any code that already talks to the OpenAI API can be redirected to your local Ollama instance with a one-line change. There is no authentication by default.
### Key Endpoints
- `POST /api/generate` — Single-turn text generation
- `POST /api/chat` — Multi-turn chat with message history
- `POST /api/embeddings` — Generate vector embeddings from text
- `GET /api/tags` — List all locally available models
- `POST /api/pull` — Download a model from the Ollama library
- `DELETE /api/delete` — Remove a local model
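As a quick smoke test, `GET /api/tags` returns the installed models as JSON. A minimal sketch in Python (the `model_names` helper is my own, and this assumes Ollama is running on the default port):

```python
import requests

def model_names(payload: dict) -> list[str]:
    """Extract model names from a GET /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]

if __name__ == "__main__":
    try:
        resp = requests.get("http://localhost:11434/api/tags", timeout=5)
        print(model_names(resp.json()))
    except requests.ConnectionError:
        print("Ollama is not running on localhost:11434")
```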
## The /api/generate Endpoint

The simplest endpoint: send a prompt, get back a completion.

```bash
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3",
    "prompt": "Explain what a REST API is in two sentences.",
    "stream": false
  }'
```
The same call in Python:
```python
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Explain what a REST API is in two sentences.",
        "stream": False,
    },
)
print(response.json()["response"])
```
## The /api/chat Endpoint
For multi-turn conversations, use /api/chat with a messages array:
```bash
curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3",
    "stream": false,
    "messages": [
      { "role": "system", "content": "You are a concise technical assistant." },
      { "role": "user", "content": "What is the difference between TCP and UDP?" }
    ]
  }'
```
Managing a conversation history in Python:
```python
import requests

messages = [{"role": "system", "content": "You are a helpful coding assistant."}]

def chat(user_input):
    messages.append({"role": "user", "content": user_input})
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": "llama3", "messages": messages, "stream": False},
    )
    reply = response.json()["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
    return reply

print(chat("Write a Python function to reverse a string."))
print(chat("Now add a docstring to that function."))
```
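Because the full history is resent on every call, long conversations eventually exceed the model's context window. One simple mitigation, sketched here with a helper of my own (`trim_history` is not part of Ollama), is to keep the system message plus only the most recent turns:

```python
def trim_history(messages: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep the leading system message (if any) plus the most recent turns."""
    if messages and messages[0]["role"] == "system":
        system, rest = messages[:1], messages[1:]
    else:
        system, rest = [], messages
    return system + rest[-max_messages:]
```

Call `trim_history(messages)` before each POST to `/api/chat`. More sophisticated schemes summarise the dropped turns instead of discarding them outright.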
## Setting System Prompts via API

With `/api/generate`, a top-level `system` field sets the system prompt:

```bash
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3",
    "system": "You are a senior Python developer. Always include type hints.",
    "prompt": "Write a function that calculates the factorial of a number.",
    "stream": false
  }'
```
## Streaming Responses
Ollama streams by default — each chunk is a newline-delimited JSON object. In Python:
```python
import json
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "Tell me three facts about black holes."}],
    },
    stream=True,
)
for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        if not chunk.get("done"):
            print(chunk["message"]["content"], end="", flush=True)
print()
```
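Streaming from `/api/generate` works the same way, except each NDJSON chunk carries its text under `response` rather than `message`. A sketch with a small parsing helper (the `chunk_text` name is my own):

```python
import json
import requests

def chunk_text(line: bytes) -> str:
    """Pull the generated text out of one NDJSON chunk from /api/generate."""
    chunk = json.loads(line)
    return chunk.get("response", "")

if __name__ == "__main__":
    try:
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3", "prompt": "Name three prime numbers."},
            stream=True,
            timeout=60,
        )
        for line in resp.iter_lines():
            if line:
                print(chunk_text(line), end="", flush=True)
        print()
    except requests.ConnectionError:
        print("Ollama is not running on localhost:11434")
```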
## Context Window and Model Options
Pass model parameters via the options field:
- `num_ctx` — Context window size in tokens (increase for longer documents)
- `temperature` — 0 = deterministic, 1 = more creative (default: 0.8)
- `top_p` — Nucleus sampling threshold (default: 0.9)
- `seed` — Fixed seed for reproducible outputs
- `num_predict` — Maximum tokens to generate (-1 for unlimited)
```python
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarise REST API design principles.",
        "stream": False,
        "options": {"temperature": 0.2, "num_ctx": 8192, "seed": 42},
    },
)
```
## The /api/embeddings Endpoint
Embeddings convert text into vectors for RAG and semantic search:
```bash
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "How does vector search work?"}'
```
Good embedding models: `nomic-embed-text`, `mxbai-embed-large`. Pull one with `ollama pull nomic-embed-text`.
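To use the vectors for semantic search, you compare them — most commonly with cosine similarity. A minimal pure-Python sketch (in practice you would reach for numpy or a vector database; the `embed` helper is my own and assumes `nomic-embed-text` is pulled and Ollama is running):

```python
import math
import requests

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def embed(text: str) -> list[float]:
    """Fetch an embedding vector from the /api/embeddings endpoint."""
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=30,
    )
    return resp.json()["embedding"]

if __name__ == "__main__":
    try:
        v1 = embed("How does vector search work?")
        v2 = embed("Explain semantic search over embeddings.")
        print(f"similarity: {cosine(v1, v2):.3f}")
    except requests.ConnectionError:
        print("Ollama is not running on localhost:11434")
```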
## Using the OpenAI-Compatible API
Any code using the OpenAI SDK can be redirected to Ollama by changing just the `base_url`:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the SOLID principles?"},
    ],
)
print(response.choices[0].message.content)
```
Streaming works identically to standard OpenAI SDK usage. Embeddings also work through the /v1 layer.
## Using Ollama with LangChain
```bash
pip install langchain-ollama langchain-core
```
```python
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage

llm = ChatOllama(model="llama3", temperature=0.3)
messages = [
    SystemMessage(content="You are an expert at explaining technical concepts simply."),
    HumanMessage(content="What is a transformer model?"),
]
response = llm.invoke(messages)
print(response.content)
```
## Environment Variables for Remote Access
By default, Ollama only listens on 127.0.0.1. To expose the API on your network:
```bash
OLLAMA_HOST=0.0.0.0 ollama serve
```
On Linux with systemd, make it permanent:
```bash
sudo systemctl edit ollama
```
Add:
```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
```
Other useful variables:
- `OLLAMA_ORIGINS` — Allowed CORS origins (use `*` to allow all)
- `OLLAMA_MAX_LOADED_MODELS` — Number of models kept loaded simultaneously
- `OLLAMA_NUM_PARALLEL` — Parallel inference requests per model
- `OLLAMA_KEEP_ALIVE` — How long a model stays loaded (e.g. `5m`, `-1` for forever)
For internet-facing deployments, place Ollama behind a reverse proxy with authentication — the API has no built-in auth.


