Ollama REST API: Complete Developer Guide

Ollama exposes a clean REST API on localhost:11434 that lets you integrate locally running large language models into your applications with minimal setup. Whether you want to hit raw endpoints with curl, build Python scripts with the requests library, or simply swap out your OpenAI base URL and keep everything else unchanged, Ollama's API is designed to get out of your way. This guide walks through every major endpoint, with practical code examples covering streaming, embeddings, LangChain integration, and the environment variables for remote access.

API Overview

When Ollama is running, it starts an HTTP server on http://localhost:11434. No SDK required — all interactions happen over standard HTTP. Ollama also ships an OpenAI-compatible endpoint layer at http://localhost:11434/v1, meaning any code that already talks to the OpenAI API can be redirected to your local Ollama instance with a one-line change. There is no authentication by default.

Key Endpoints

  • POST /api/generate — Single-turn text generation
  • POST /api/chat — Multi-turn chat with message history
  • POST /api/embeddings — Generate vector embeddings from text
  • GET /api/tags — List all locally available models
  • POST /api/pull — Download a model from the Ollama library
  • DELETE /api/delete — Remove a local model
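As a quick sanity check that the server is up, /api/tags can be queried from Python. The model_names and list_local_models helpers below are our own convenience functions, not part of Ollama; they simply walk the models array that /api/tags returns:

```python
import requests

def model_names(payload: dict) -> list[str]:
    # /api/tags responds with {"models": [{"name": "llama3:latest", ...}, ...]}
    return [m["name"] for m in payload.get("models", [])]

def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    resp = requests.get(f"{base_url}/api/tags")
    resp.raise_for_status()
    return model_names(resp.json())

# Usage (with the server running):
# print(list_local_models())
```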

The /api/generate Endpoint

The generate endpoint handles single-turn completion: send a model name and a prompt, and you get back one response. With "stream": false, the reply arrives as a single JSON object whose response field holds the generated text:

curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3",
    "prompt": "Explain what a REST API is in two sentences.",
    "stream": false
  }'

The same call in Python:

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Explain what a REST API is in two sentences.",
        "stream": False,
    }
)

print(response.json()["response"])

The /api/chat Endpoint

For multi-turn conversations, use /api/chat with a messages array:

curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3",
    "stream": false,
    "messages": [
      { "role": "system", "content": "You are a concise technical assistant." },
      { "role": "user", "content": "What is the difference between TCP and UDP?" }
    ]
  }'

Managing a conversation history in Python:

import requests

messages = [{"role": "system", "content": "You are a helpful coding assistant."}]

def chat(user_input):
    messages.append({"role": "user", "content": user_input})
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": "llama3", "messages": messages, "stream": False}
    )
    reply = response.json()["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
    return reply

print(chat("Write a Python function to reverse a string."))
print(chat("Now add a docstring to that function."))

Setting System Prompts via API

/api/generate also accepts a system field, which overrides the model's default system prompt for that request:

curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3",
    "system": "You are a senior Python developer. Always include type hints.",
    "prompt": "Write a function that calculates the factorial of a number.",
    "stream": false
  }'

Streaming Responses

Ollama streams by default — each chunk is a newline-delimited JSON object. In Python:

import requests, json

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "Tell me three facts about black holes."}],
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        if not chunk.get("done"):
            print(chunk["message"]["content"], end="", flush=True)
print()
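When you want the assembled text rather than live printing, the same newline-delimited chunks can be collected with a small helper (join_stream is our own name, not an Ollama API):

```python
import json

def join_stream(ndjson_lines) -> str:
    # Each line is one JSON chunk; text lives under message.content
    # until the final chunk, which has "done": true and no new content.
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        if not chunk.get("done"):
            parts.append(chunk["message"]["content"])
    return "".join(parts)

# Usage with the streaming response above:
# full_text = join_stream(line for line in response.iter_lines() if line)
```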

Context Window and Model Options

Pass model parameters via the options field:

  • num_ctx — Context window size in tokens (increase for longer documents)
  • temperature — 0 = deterministic, 1 = more creative (default: 0.8)
  • top_p — Nucleus sampling threshold (default: 0.9)
  • seed — Fixed seed for reproducible outputs
  • num_predict — Maximum tokens to generate (-1 for unlimited)

In Python:

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarise REST API design principles.",
        "stream": False,
        "options": {"temperature": 0.2, "num_ctx": 8192, "seed": 42}
    }
)

The /api/embeddings Endpoint

Embeddings convert text into vectors for RAG and semantic search:

curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "How does vector search work?"}'

Good embedding models: nomic-embed-text, mxbai-embed-large. Pull with ollama pull nomic-embed-text.
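In Python, the returned embedding vector can be compared with cosine similarity for a minimal semantic-search sketch. The embed and cosine helpers here are our own, not part of Ollama:

```python
import math
import requests

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product over the product of the magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Usage (requires the embedding model pulled and the server running):
# score = cosine(embed("How does vector search work?"),
#                embed("Vector indexes enable nearest-neighbour lookup."))
```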

Using the OpenAI-Compatible API

Any code using the OpenAI SDK can be redirected to Ollama by changing just the base_url:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by SDK, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the SOLID principles?"},
    ]
)

print(response.choices[0].message.content)

Streaming works identically to standard OpenAI SDK usage. Embeddings also work through the /v1 layer.

Using Ollama with LangChain

pip install langchain-ollama langchain-core

from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage

llm = ChatOllama(model="llama3", temperature=0.3)

messages = [
    SystemMessage(content="You are an expert at explaining technical concepts simply."),
    HumanMessage(content="What is a transformer model?"),
]

response = llm.invoke(messages)
print(response.content)

Environment Variables for Remote Access

By default, Ollama only listens on 127.0.0.1. To expose the API on your network:

OLLAMA_HOST=0.0.0.0 ollama serve

On Linux with systemd, make it permanent:

sudo systemctl edit ollama

Add:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"

Other useful variables:

  • OLLAMA_ORIGINS — Allowed CORS origins (use * to allow all)
  • OLLAMA_MAX_LOADED_MODELS — Models to keep loaded simultaneously
  • OLLAMA_NUM_PARALLEL — Parallel inference requests per model
  • OLLAMA_KEEP_ALIVE — How long a model stays loaded (e.g. 5m, -1 for forever)

For internet-facing deployments, place Ollama behind a reverse proxy with authentication — the API has no built-in auth.
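As a sketch, an nginx reverse proxy with HTTP basic auth in front of Ollama might look like the following. The hostname, certificate paths, and htpasswd file are placeholders, not Ollama defaults:

```nginx
server {
    listen 443 ssl;
    server_name ollama.example.com;

    ssl_certificate     /etc/ssl/certs/ollama.pem;
    ssl_certificate_key /etc/ssl/private/ollama.key;

    location / {
        auth_basic           "Ollama API";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:11434;
        proxy_set_header     Host $host;
        # Streaming responses: disable buffering so chunks flush immediately
        proxy_buffering      off;
        proxy_read_timeout   300s;
    }
}
```

Disabling proxy_buffering matters here: without it, nginx would hold the newline-delimited chunks until its buffer fills, defeating streaming.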
