
Ollama REST API: Complete Developer Guide (2026)

Ollama exposes a straightforward HTTP REST API that lets you generate text, hold multi-turn conversations, create embeddings, and manage models programmatically. Whether you’re building an application that uses local AI or just want to query a model from a script, the Ollama API is well-designed and easy to work with.

This guide covers every major endpoint with practical curl and Python examples. If you need information on OpenAI compatibility, see our guide to Ollama’s OpenAI API compatibility layer.

Getting Started: Testing the API

When Ollama is running, it listens on http://localhost:11434 by default. You can confirm it’s working with a simple request:

curl http://localhost:11434/api/version

You should get a JSON response like {"version":"0.6.0"}. If you get a connection error, make sure Ollama is running — start it with ollama serve in a terminal if needed.
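In Python, the same check can be wrapped in a small helper. A sketch, assuming the default localhost address and the requests library (the function name is our own):

```python
import requests

def ollama_is_running(base_url="http://localhost:11434"):
    """Return True if an Ollama server answers at base_url."""
    try:
        return requests.get(f"{base_url}/api/version", timeout=2).ok
    except requests.RequestException:
        # Covers connection refused, DNS failure, and timeouts alike.
        return False
```

This is handy as a startup check in an application, so you can show a clear "start Ollama first" message instead of a raw connection error.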

The /api/generate Endpoint

The generate endpoint is the simplest way to get a response from a model. It takes a model name and a prompt, and returns generated text.

Basic curl Example

curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":"What is the capital of France?","stream":false}'

Setting "stream":false tells Ollama to finish generating before replying, so you get everything back in a single JSON object. That object includes a response field containing the generated text, plus metadata such as token counts and generation time.

Python Example

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Explain what a REST API is in simple terms.",
        "stream": False
    }
)

data = response.json()
print(data["response"])
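The non-streaming response also carries timing metadata: eval_count is the number of tokens generated and eval_duration is the generation time in nanoseconds. Those two fields are enough to compute throughput — a sketch (the helper name is ours):

```python
def tokens_per_second(eval_count, eval_duration_ns):
    """Generation speed from Ollama's response metadata (duration is in nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)

# Applied to a non-streaming /api/generate response:
# data = response.json()
# print(f"{tokens_per_second(data['eval_count'], data['eval_duration']):.1f} tokens/s")
```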

System Prompts

You can include a system prompt to set the context or persona for the model:

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "system": "You are a concise technical writer. Keep answers under 100 words.",
        "prompt": "What is Docker?",
        "stream": False
    }
)

print(response.json()["response"])

The /api/chat Endpoint (Multi-Turn Conversations)

The chat endpoint uses a messages array in the same format as OpenAI’s chat completions API. Each message has a role (system, user, or assistant) and content.

import requests

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": messages,
        "stream": False
    }
)

reply = response.json()["message"]["content"]
print(reply)

# Add the reply to messages to continue the conversation
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "What is it used for?"})

response2 = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": messages,
        "stream": False
    }
)
print(response2.json()["message"]["content"])

This pattern — appending the assistant reply and the next user message to the messages list — is how you maintain conversation context across multiple turns.
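That bookkeeping can be factored into two small helpers, which keeps a longer chat loop readable. A sketch, where add_turn and chat are names of our own choosing:

```python
import requests

def add_turn(messages, assistant_reply, user_followup):
    """Return a new history with the assistant's reply and the next user message appended."""
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": user_followup},
    ]

def chat(messages, model="llama3.2"):
    """Send the full history to /api/chat and return the assistant's reply."""
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "messages": messages, "stream": False},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["message"]["content"]
```

With these, each turn of a conversation is just `reply = chat(messages)` followed by `messages = add_turn(messages, reply, next_question)`.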

Streaming Responses

For real-time output that prints tokens as they are generated (useful in applications with a live typing effect), set "stream":true and handle the response line by line:

import requests
import json

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Write a short poem about programming.",
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break

print()  # newline at end

Each streamed line is a JSON object with a response field containing the next token or tokens, and a done field that is true on the final chunk.
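The same line-by-line handling works for a streamed /api/chat request, except each chunk nests its text under message → content rather than in a top-level response field. The parsing can be isolated into one function that handles either endpoint — a sketch, with a helper name of our own:

```python
import json

def collect_stream(lines, endpoint="generate"):
    """Join the text from Ollama's newline-delimited JSON stream chunks.

    /api/generate puts token text in "response"; /api/chat nests it
    under "message" -> "content".
    """
    parts = []
    for line in lines:
        if not line:
            continue
        chunk = json.loads(line)
        if endpoint == "chat":
            parts.append(chunk.get("message", {}).get("content", ""))
        else:
            parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Usage with a streaming request:
# text = collect_stream(response.iter_lines(), endpoint="chat")
```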

The /api/embeddings Endpoint

Embeddings convert text into numerical vectors, which are useful for semantic search, clustering, and retrieval-augmented generation (RAG). Ollama supports embedding models such as nomic-embed-text.

First, pull the model: ollama pull nomic-embed-text

import requests

response = requests.post(
    "http://localhost:11434/api/embeddings",
    json={
        "model": "nomic-embed-text",
        "prompt": "The quick brown fox jumps over the lazy dog"
    }
)

embedding = response.json()["embedding"]
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
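To actually compare two embeddings — say a query against a set of documents — the standard measure is cosine similarity. A minimal standard-library implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Embed each document once, embed the query at search time, score every pair with this function, and rank by score. In practice you would use numpy (or a vector database) for speed, but the maths is exactly this.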

For a full guide on using embeddings for RAG, see our article on Ollama embeddings and RAG.

Model Management Endpoints

List Available Models

curl http://localhost:11434/api/tags

Returns a JSON object with a models array listing every model you have pulled, including their names, sizes, and modification dates.
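From Python, the model names can be pulled straight out of that models array — a sketch (the helper names are ours):

```python
import requests

def extract_names(payload):
    """Pull model names out of the /api/tags JSON payload."""
    return [m["name"] for m in payload.get("models", [])]

def list_model_names(base_url="http://localhost:11434"):
    """Return the names of all locally pulled models."""
    response = requests.get(f"{base_url}/api/tags", timeout=10)
    response.raise_for_status()
    return extract_names(response.json())
```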

Pull a Model

curl http://localhost:11434/api/pull -d '{"name":"phi3","stream":false}'

This downloads a model in the same way as ollama pull but via the API. With "stream":true you get progress updates as each layer is downloaded.
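When streaming, each progress line is a JSON object with a status field and, while layers download, completed and total byte counts. A sketch of rendering those as percentages (the helper names are our own):

```python
import json
import requests

def format_progress(chunk):
    """Render one pull-progress chunk as a human-readable line."""
    status = chunk.get("status", "")
    total, completed = chunk.get("total"), chunk.get("completed")
    if total and completed is not None:
        return f"{status}: {completed / total * 100:.0f}%"
    return status

def pull_model(name, base_url="http://localhost:11434"):
    """Pull a model via the API, printing progress as layers download."""
    with requests.post(f"{base_url}/api/pull",
                       json={"name": name, "stream": True},
                       stream=True, timeout=None) as response:
        for line in response.iter_lines():
            if line:
                print(format_progress(json.loads(line)))
```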

Delete a Model

curl -X DELETE http://localhost:11434/api/delete -d '{"name":"phi3"}'

Show Model Information

curl http://localhost:11434/api/show -d '{"name":"llama3.2"}'

Returns metadata about a model including its template, system prompt, parameters, and the Modelfile it was created from.

Model Parameters

You can pass an options object to the generate or chat endpoints to control the model’s behaviour:

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Tell me a creative story opening.",
        "stream": False,
        "options": {
            "temperature": 0.9,
            "top_p": 0.95,
            "num_ctx": 4096,
            "num_predict": 200
        }
    }
)

print(response.json()["response"])

Key parameters:

  • temperature — controls randomness. Lower values (e.g. 0.1) give more deterministic, factual responses. Higher values (e.g. 0.9) give more creative, varied output.
  • top_p — nucleus sampling threshold. Values around 0.9–0.95 are typical.
  • num_ctx — context window size in tokens. The default varies by model but is often 2048. Larger values allow longer inputs and conversations but use more memory.
  • num_predict — maximum number of tokens to generate. Set to -1 for unlimited (up to the context window).
  • seed — set a fixed seed (integer) for reproducible outputs.

OpenAI Compatibility Layer

Ollama provides an OpenAI-compatible API at /v1/. This means any application or library that uses the OpenAI Python SDK can point at Ollama instead with a two-line change:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "What is a neural network?"}]
)

print(response.choices[0].message.content)

This is particularly useful for integrating Ollama with tools and frameworks that were built for OpenAI — such as LangChain, LlamaIndex, and many others. Install the SDK with pip install openai if needed.

Running Ollama in a Server Environment

By default Ollama only accepts connections from localhost. To allow connections from other machines (for example if you’re building an application on a different server), set the environment variable OLLAMA_HOST=0.0.0.0 before starting Ollama. See our guide on running Ollama in Docker for how to do this in a containerised setup.
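For a quick manual test, this looks like the following (the IP address is a placeholder for your server's own):

```shell
# Bind to all interfaces for this session, then start the server
OLLAMA_HOST=0.0.0.0 ollama serve

# From another machine, replace the IP with your server's address:
curl http://192.168.0.10:11434/api/version
```

Remember that this exposes the API to anyone who can reach the port, so restrict access with a firewall or reverse proxy on anything other than a trusted network.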

Error Handling in Python

Production code should handle errors gracefully:

import requests
from requests.exceptions import ConnectionError, Timeout

def query_ollama(prompt, model="llama3.2"):
    try:
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=120
        )
        response.raise_for_status()
        return response.json()["response"]
    except ConnectionError:
        return "Error: Could not connect to Ollama. Is it running?"
    except Timeout:
        return "Error: Request timed out."
    except Exception as e:
        return f"Error: {str(e)}"

result = query_ollama("What is machine learning?")
print(result)

Summary

The Ollama REST API covers all the use cases you’ll need: single-turn generation, multi-turn chat, streaming output, embeddings, and model management. The OpenAI compatibility layer at /v1/ makes it easy to drop Ollama into existing AI-powered applications with minimal code changes.

For a practical application of the embeddings endpoint, our guide on building a RAG system with Ollama walks through a complete working example.
