Ollama is best known for running large language models locally, but it also exposes a first-class embeddings API. Combined with a vector store, this lets you build semantic search engines, document question-answering tools, and full RAG pipelines entirely on your own hardware — no data ever leaves your machine. This guide covers everything from generating your first embedding to wiring up ChromaDB for a practical document Q&A system.
What Are Embeddings?
An embedding is a numerical vector — an array of floating-point numbers — that represents a piece of text in a high-dimensional space. The key property is that semantically similar texts produce vectors that are close together in that space, while unrelated texts produce vectors that are far apart.
For example, the sentences “The dog chased the ball” and “A puppy ran after the toy” will produce vectors that sit very close to each other, even though they share almost no exact words. This is fundamentally different from keyword search, which relies on literal string matching.
This makes embeddings useful for:
- Semantic search — find documents that mean what you searched for, not just those containing the same words
- Document clustering — automatically group related content
- RAG pipelines — retrieve the most relevant context before generating an answer
- Recommendation systems — find items similar to one a user has already engaged with
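"Close together" in that space is usually measured with cosine similarity. Here is a toy sketch using made-up 3-dimensional vectors — real embeddings have hundreds of dimensions, and the values below are invented purely for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: near 1.0 means "same direction",
    # near 0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-d "embeddings" for three sentences.
dog_sentence   = np.array([0.9, 0.8, 0.1])   # "The dog chased the ball"
puppy_sentence = np.array([0.8, 0.9, 0.2])   # "A puppy ran after the toy"
tax_form       = np.array([0.1, 0.0, 0.95])  # "Attach Schedule C to your return"

print(cosine_similarity(dog_sentence, puppy_sentence))  # high, ~0.99
print(cosine_similarity(dog_sentence, tax_form))        # low, ~0.16
```

The two dog sentences score far higher than the unrelated pair, which is exactly the property semantic search exploits.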
Choosing an Embedding Model
Ollama hosts several dedicated embedding models. These are not chat models — they are optimised purely for producing high-quality vector representations quickly.
- nomic-embed-text — 768-dimensional vectors, very fast, excellent quality-to-speed ratio. The best default choice for most projects.
- mxbai-embed-large — 1024-dimensional vectors, higher accuracy on retrieval benchmarks. Good when retrieval quality is critical.
- snowflake-arctic-embed — Available in multiple sizes. Strong performance on domain-specific retrieval tasks.
For most local RAG projects, nomic-embed-text is the right starting point. It runs efficiently even on CPU.
ollama pull nomic-embed-text
The Embeddings API
Send a POST request to /api/embeddings with a JSON body containing the model name and the text to embed.
curl http://localhost:11434/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "prompt": "Ollama makes local AI accessible to everyone"
  }'
The response contains an embedding key with the vector array — 768 floats for nomic-embed-text.
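For reference, the response body has this shape (the values here are illustrative, and the vector is truncated):

```json
{
  "embedding": [0.5670403, 0.0092601, -0.2213118, ...]
}
```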
Using the Python Library
pip install ollama
import ollama
response = ollama.embeddings(
    model='nomic-embed-text',
    prompt='Ollama makes local AI accessible to everyone'
)
vector = response['embedding']
print(f"Vector dimensions: {len(vector)}")
# Vector dimensions: 768
What Is RAG?
RAG stands for Retrieval-Augmented Generation. Instead of relying solely on what a language model learned during training, you give it relevant information at query time by injecting it directly into the prompt.
A RAG pipeline has two phases:
- Indexing (done once) — split your documents into chunks, embed each chunk, and store the vectors alongside the original text.
- Query (done per question) — embed the user’s question, find the stored chunks whose vectors are closest to the question vector, and include those chunks in the prompt as context before asking the model to answer.
This approach dramatically reduces hallucination on knowledge-intensive tasks, because the model is working from retrieved facts rather than recalled training data.
Building a Simple In-Memory RAG Pipeline
pip install ollama numpy
import ollama
import numpy as np
documents = [
"Ollama runs large language models locally on your own hardware.",
"The nomic-embed-text model produces 768-dimensional embeddings.",
"RAG stands for Retrieval-Augmented Generation.",
"ChromaDB is an open-source vector database that works well with Ollama.",
"Cosine similarity measures the angle between two vectors.",
"Embedding models convert text into numerical vectors.",
]
EMBED_MODEL = "nomic-embed-text"
CHAT_MODEL = "llama3.2"
def embed(text):
    return np.array(ollama.embeddings(model=EMBED_MODEL, prompt=text)["embedding"])
index = [{"text": doc, "vector": embed(doc)} for doc in documents]
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query, k=3):
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, e["vector"]), e["text"]) for e in index]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]
def ask(question):
    chunks = retrieve(question)
    context = "\n".join(f"- {c}" for c in chunks)
    prompt = (
        f"Answer using only the context provided.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )
    response = ollama.chat(
        model=CHAT_MODEL,
        messages=[{"role": "user", "content": prompt}]
    )
    return response["message"]["content"]
print(ask("What is RAG and how does it relate to embeddings?"))
Using ChromaDB as a Vector Store
For larger document sets, keeping vectors in memory becomes impractical. ChromaDB is the easiest persistent vector database to run alongside Ollama — it stores vectors on disk and requires no separate server process.
pip install chromadb
import ollama
import chromadb
client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection(name="my_documents")
documents = [
"Ollama runs large language models locally on your own hardware.",
"The nomic-embed-text model produces 768-dimensional embeddings.",
"RAG stands for Retrieval-Augmented Generation.",
"ChromaDB is an open-source vector database that works well with Ollama.",
]
def embed(text):
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]
collection.add(
    ids=[f"doc_{i}" for i in range(len(documents))],
    embeddings=[embed(doc) for doc in documents],
    documents=documents,
)
# Query
query = "How does retrieval-augmented generation work?"
results = collection.query(
    query_embeddings=[embed(query)],
    n_results=2,
)
for doc in results["documents"][0]:
    print(doc)
Other Compatible Vector Databases
- Qdrant — available as a local binary or Docker container, strong filtering capabilities
- pgvector — a PostgreSQL extension; store embeddings alongside your relational data
- FAISS — Meta’s in-memory library, very fast for large datasets with fine-grained index control
Practical Use Case: Q&A Over a Folder of Documents
The pattern is the same regardless of file type:
- Read each file and split into overlapping chunks of 300–500 words (overlap by 50–100 words to avoid cutting context across boundaries)
- Embed each chunk and store the vector with the source filename as metadata
- At query time, embed the question and retrieve the top 3–5 chunks
- Inject chunks as context and generate an answer — instruct the model not to speculate beyond the provided context
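Step 1 above can be sketched as a simple word-based chunker. The sizes are the ones suggested in the list; tune them for your documents:

```python
def chunk_words(text, chunk_size=400, overlap=75):
    """Split text into overlapping chunks of roughly chunk_size words."""
    words = text.split()
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the document
    return chunks

# A synthetic 1000-word document for demonstration.
doc = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_words(doc)
print(len(chunks))             # 3 chunks
print(len(chunks[0].split()))  # 400 words in the first chunk
```

Because consecutive chunks share 75 words, a sentence falling on a boundary still appears whole in at least one chunk.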
For PDF ingestion:
pip install pypdf
from pypdf import PdfReader
def extract_pdf_text(path):
    reader = PdfReader(path)
    # extract_text() can return None for image-only pages, so fall back to ""
    return "\n".join(page.extract_text() or "" for page in reader.pages)
Performance Tips
- Cache embeddings to disk — compute them once and persist; don’t re-index unchanged documents
- Keep chunks around 300–500 words — very short chunks lose context; very long ones dilute the semantic signal
- Use mxbai-embed-large selectively — better accuracy, but slower; use it when retrieval quality matters most
- Embedding is fast even on CPU — a few hundred chunks per minute is typical on modern hardware
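The first tip can be sketched as a small disk cache keyed by a hash of the chunk text. The stub `embed` below stands in for a real Ollama call so the sketch is self-contained:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("./embedding_cache")
CACHE_DIR.mkdir(exist_ok=True)

def embed(text):
    # Stand-in for ollama.embeddings(model=..., prompt=text)["embedding"].
    return [float(len(text)), 0.0]

def cached_embed(text):
    # Key the cache on a content hash so unchanged chunks are never re-embedded.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    vector = embed(text)
    path.write_text(json.dumps(vector))
    return vector

v1 = cached_embed("hello world")  # computed and written to disk
v2 = cached_embed("hello world")  # read straight back from the cache
```

On re-indexing, only chunks whose text has changed produce a new hash and hence a new embedding call.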