Ollama REST API: Complete Developer Guide

Ollama exposes a clean REST API on localhost:11434 that lets you integrate locally running large language models into your applications with minimal setup. Whether you want to hit raw endpoints with curl, build Python scripts with the requests library, or simply swap out your OpenAI base URL and keep everything else unchanged, Ollama's API is designed to get out of your way. This guide walks through every major endpoint, with practical code examples covering streaming, embeddings, LangChain integration, and the environment variables for remote access.

API Overview

When Ollama is running, it starts an HTTP server on http://localhost:11434. No SDK required — all interactions happen over standard HTTP. Ollama also ships an OpenAI-compatible endpoint layer at http://localhost:11434/v1, meaning any code that already talks to the OpenAI API can be redirected to your local Ollama instance with a one-line change. There is no authentication by default.

Key Endpoints

  • POST /api/generate — Single-turn text generation
  • POST /api/chat — Multi-turn chat with message history
  • POST /api/embeddings — Generate vector embeddings from text
  • GET /api/tags — List all locally available models
  • POST /api/pull — Download a model from the Ollama library
  • DELETE /api/delete — Remove a local model
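As a quick sanity check that the server is up, /api/tags can be queried from Python. The model_names and list_local_models helpers below are our own convenience functions, not part of Ollama; they simply walk the models array that /api/tags returns:

```python
import requests

def model_names(payload: dict) -> list[str]:
    # /api/tags responds with {"models": [{"name": "llama3:latest", ...}, ...]}
    return [m["name"] for m in payload.get("models", [])]

def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    resp = requests.get(f"{base_url}/api/tags")
    resp.raise_for_status()
    return model_names(resp.json())

# Usage (with the server running):
# print(list_local_models())
```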

The /api/generate Endpoint

The generate endpoint handles single-turn completion: send a model name and a prompt, and you get back one response. With "stream": false, the reply arrives as a single JSON object whose response field holds the generated text:

curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3",
    "prompt": "Explain what a REST API is in two sentences.",
    "stream": false
  }'

The same call in Python:

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Explain what a REST API is in two sentences.",
        "stream": False,
    }
)

print(response.json()["response"])

The /api/chat Endpoint

For multi-turn conversations, use /api/chat with a messages array:

curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3",
    "stream": false,
    "messages": [
      { "role": "system", "content": "You are a concise technical assistant." },
      { "role": "user", "content": "What is the difference between TCP and UDP?" }
    ]
  }'

Managing a conversation history in Python:

import requests

messages = [{"role": "system", "content": "You are a helpful coding assistant."}]

def chat(user_input):
    messages.append({"role": "user", "content": user_input})
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": "llama3", "messages": messages, "stream": False}
    )
    reply = response.json()["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
    return reply

print(chat("Write a Python function to reverse a string."))
print(chat("Now add a docstring to that function."))

Setting System Prompts via API

/api/generate also accepts a system field, which overrides the model's default system prompt for that request:

curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3",
    "system": "You are a senior Python developer. Always include type hints.",
    "prompt": "Write a function that calculates the factorial of a number.",
    "stream": false
  }'

Streaming Responses

Ollama streams by default — each chunk is a newline-delimited JSON object. In Python:

import requests, json

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "Tell me three facts about black holes."}],
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        if not chunk.get("done"):
            print(chunk["message"]["content"], end="", flush=True)
print()
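When you want the assembled text rather than live printing, the same newline-delimited chunks can be collected with a small helper (join_stream is our own name, not an Ollama API):

```python
import json

def join_stream(ndjson_lines) -> str:
    # Each line is one JSON chunk; text lives under message.content
    # until the final chunk, which has "done": true and no new content.
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        if not chunk.get("done"):
            parts.append(chunk["message"]["content"])
    return "".join(parts)

# Usage with the streaming response above:
# full_text = join_stream(line for line in response.iter_lines() if line)
```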

Context Window and Model Options

Pass model parameters via the options field:

  • num_ctx — Context window size in tokens (increase for longer documents)
  • temperature — 0 = deterministic, 1 = more creative (default: 0.8)
  • top_p — Nucleus sampling threshold (default: 0.9)
  • seed — Fixed seed for reproducible outputs
  • num_predict — Maximum tokens to generate (-1 for unlimited)

In Python:

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarise REST API design principles.",
        "stream": False,
        "options": {"temperature": 0.2, "num_ctx": 8192, "seed": 42}
    }
)

The /api/embeddings Endpoint

Embeddings convert text into vectors for RAG and semantic search:

curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "How does vector search work?"}'

Good embedding models: nomic-embed-text, mxbai-embed-large. Pull with ollama pull nomic-embed-text.
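In Python, the returned embedding vector can be compared with cosine similarity for a minimal semantic-search sketch. The embed and cosine helpers here are our own, not part of Ollama:

```python
import math
import requests

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product over the product of the magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Usage (requires the embedding model pulled and the server running):
# score = cosine(embed("How does vector search work?"),
#                embed("Vector indexes enable nearest-neighbour lookup."))
```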

Using the OpenAI-Compatible API

Any code using the OpenAI SDK can be redirected to Ollama by changing just the base_url:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by SDK, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the SOLID principles?"},
    ]
)

print(response.choices[0].message.content)

Streaming works identically to standard OpenAI SDK usage. Embeddings also work through the /v1 layer.

Using Ollama with LangChain

pip install langchain-ollama langchain-core

from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage

llm = ChatOllama(model="llama3", temperature=0.3)

messages = [
    SystemMessage(content="You are an expert at explaining technical concepts simply."),
    HumanMessage(content="What is a transformer model?"),
]

response = llm.invoke(messages)
print(response.content)

Environment Variables for Remote Access

By default, Ollama only listens on 127.0.0.1. To expose the API on your network:

OLLAMA_HOST=0.0.0.0 ollama serve

On Linux with systemd, make it permanent:

sudo systemctl edit ollama

Add:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"

Other useful variables:

  • OLLAMA_ORIGINS — Allowed CORS origins (use * to allow all)
  • OLLAMA_MAX_LOADED_MODELS — Models to keep loaded simultaneously
  • OLLAMA_NUM_PARALLEL — Parallel inference requests per model
  • OLLAMA_KEEP_ALIVE — How long a model stays loaded (e.g. 5m, -1 for forever)

For internet-facing deployments, place Ollama behind a reverse proxy with authentication — the API has no built-in auth.
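As a sketch, an nginx reverse proxy with HTTP basic auth in front of Ollama might look like the following. The hostname, certificate paths, and htpasswd file are placeholders, not Ollama defaults:

```nginx
server {
    listen 443 ssl;
    server_name ollama.example.com;

    ssl_certificate     /etc/ssl/certs/ollama.pem;
    ssl_certificate_key /etc/ssl/private/ollama.key;

    location / {
        auth_basic           "Ollama API";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:11434;
        proxy_set_header     Host $host;
        # Streaming responses: disable buffering so chunks flush immediately
        proxy_buffering      off;
        proxy_read_timeout   300s;
    }
}
```

Disabling proxy_buffering matters here: without it, nginx would hold the newline-delimited chunks until its buffer fills, defeating streaming.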
