Ollama is best known for running large language models locally, but it also exposes a first-class embeddings API. Combined with a vector store, this lets you build semantic search engines, document question-answering tools, and full RAG pipelines entirely on your own hardware — no data ever leaves your machine. This guide covers everything from generating your first embedding to wiring up ChromaDB for a practical document Q&A system.
What Are Embeddings?
An embedding is a numerical vector — an array of floating-point numbers — that represents a piece of text in a high-dimensional space. The key property is that semantically similar texts produce vectors that are close together in that space, while unrelated texts produce vectors that are far apart.
For example, the sentences “The dog chased the ball” and “A puppy ran after the toy” will produce vectors that sit very close to each other, even though they share almost no exact words. This is fundamentally different from keyword search, which relies on literal string matching.
This makes embeddings useful for:
- Semantic search — find documents that mean what you searched for, not just those containing the same words
- Document clustering — automatically group related content
- RAG pipelines — retrieve the most relevant context before generating an answer
- Recommendation systems — find items similar to one a user has already engaged with
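"Close together" in that space is usually measured with cosine similarity. Here is a toy sketch using made-up 3-dimensional vectors — real embeddings have hundreds of dimensions, and the values below are invented purely for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: near 1.0 means "same direction",
    # near 0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-d "embeddings" for three sentences.
dog_sentence   = np.array([0.9, 0.8, 0.1])   # "The dog chased the ball"
puppy_sentence = np.array([0.8, 0.9, 0.2])   # "A puppy ran after the toy"
tax_form       = np.array([0.1, 0.0, 0.95])  # "Attach Schedule C to your return"

print(cosine_similarity(dog_sentence, puppy_sentence))  # high, ~0.99
print(cosine_similarity(dog_sentence, tax_form))        # low, ~0.16
```

The two dog sentences score far higher than the unrelated pair, which is exactly the property semantic search exploits.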
Choosing an Embedding Model
Ollama hosts several dedicated embedding models. These are not chat models — they are optimised purely for producing high-quality vector representations quickly.
- nomic-embed-text — 768-dimensional vectors, very fast, excellent quality-to-speed ratio. The best default choice for most projects.
- mxbai-embed-large — 1024-dimensional vectors, higher accuracy on retrieval benchmarks. Good when retrieval quality is critical.
- snowflake-arctic-embed — Available in multiple sizes. Strong performance on domain-specific retrieval tasks.
For most local RAG projects, nomic-embed-text is the right starting point. It runs efficiently even on CPU.
ollama pull nomic-embed-text
The Embeddings API
Send a POST request to /api/embeddings with a JSON body containing the model name and the text to embed.
curl http://localhost:11434/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "prompt": "Ollama makes local AI accessible to everyone"
  }'
The response contains an embedding key with the vector array — 768 floats for nomic-embed-text.
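For reference, the response body has this shape (the values here are illustrative, and the vector is truncated):

```json
{
  "embedding": [0.5670403, 0.0092601, -0.2213118, ...]
}
```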
Using the Python Library
pip install ollama
import ollama
response = ollama.embeddings(
    model='nomic-embed-text',
    prompt='Ollama makes local AI accessible to everyone'
)
vector = response['embedding']
print(f"Vector dimensions: {len(vector)}")
# Vector dimensions: 768
What Is RAG?
RAG stands for Retrieval-Augmented Generation. Instead of relying solely on what a language model learned during training, you give it relevant information at query time by injecting it directly into the prompt.
A RAG pipeline has two phases:
- Indexing (done once) — split your documents into chunks, embed each chunk, and store the vectors alongside the original text.
- Query (done per question) — embed the user’s question, find the stored chunks whose vectors are closest to the question vector, and include those chunks in the prompt as context before asking the model to answer.
This approach dramatically reduces hallucination on knowledge-intensive tasks, because the model is working from retrieved facts rather than recalled training data.
Building a Simple In-Memory RAG Pipeline
pip install ollama numpy
import ollama
import numpy as np
documents = [
"Ollama runs large language models locally on your own hardware.",
"The nomic-embed-text model produces 768-dimensional embeddings.",
"RAG stands for Retrieval-Augmented Generation.",
"ChromaDB is an open-source vector database that works well with Ollama.",
"Cosine similarity measures the angle between two vectors.",
"Embedding models convert text into numerical vectors.",
]
EMBED_MODEL = "nomic-embed-text"
CHAT_MODEL = "llama3.2"
def embed(text):
    return np.array(ollama.embeddings(model=EMBED_MODEL, prompt=text)["embedding"])
index = [{"text": doc, "vector": embed(doc)} for doc in documents]
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query, k=3):
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, e["vector"]), e["text"]) for e in index]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]
def ask(question):
    chunks = retrieve(question)
    context = "\n".join(f"- {c}" for c in chunks)
    prompt = (
        f"Answer using only the context provided.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )
    response = ollama.chat(
        model=CHAT_MODEL,
        messages=[{"role": "user", "content": prompt}]
    )
    return response["message"]["content"]
print(ask("What is RAG and how does it relate to embeddings?"))
Using ChromaDB as a Vector Store
For larger document sets, keeping vectors in memory becomes impractical. ChromaDB is the easiest persistent vector database to run alongside Ollama — it stores vectors on disk and requires no separate server process.
pip install chromadb
import ollama
import chromadb
client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection(name="my_documents")
documents = [
"Ollama runs large language models locally on your own hardware.",
"The nomic-embed-text model produces 768-dimensional embeddings.",
"RAG stands for Retrieval-Augmented Generation.",
"ChromaDB is an open-source vector database that works well with Ollama.",
]
def embed(text):
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]
collection.add(
    ids=[f"doc_{i}" for i in range(len(documents))],
    embeddings=[embed(doc) for doc in documents],
    documents=documents,
)
# Query
query = "How does retrieval-augmented generation work?"
results = collection.query(
    query_embeddings=[embed(query)],
    n_results=2,
)
for doc in results["documents"][0]:
    print(doc)
Other Compatible Vector Databases
- Qdrant — available as a local binary or Docker container, strong filtering capabilities
- pgvector — a PostgreSQL extension; store embeddings alongside your relational data
- FAISS — Meta’s in-memory library, very fast for large datasets with fine-grained index control
Practical Use Case: Q&A Over a Folder of Documents
The pattern is the same regardless of file type:
- Read each file and split into overlapping chunks of 300–500 words (overlap by 50–100 words to avoid cutting context across boundaries)
- Embed each chunk and store the vector with the source filename as metadata
- At query time, embed the question and retrieve the top 3–5 chunks
- Inject chunks as context and generate an answer — instruct the model not to speculate beyond the provided context
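Step 1 above can be sketched as a simple word-based chunker. The sizes are the ones suggested in the list; tune them for your documents:

```python
def chunk_words(text, chunk_size=400, overlap=75):
    """Split text into overlapping chunks of roughly chunk_size words."""
    words = text.split()
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the document
    return chunks

# A synthetic 1000-word document for demonstration.
doc = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_words(doc)
print(len(chunks))             # 3 chunks
print(len(chunks[0].split()))  # 400 words in the first chunk
```

Because consecutive chunks share 75 words, a sentence falling on a boundary still appears whole in at least one chunk.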
For PDF ingestion:
pip install pypdf
from pypdf import PdfReader
def extract_pdf_text(path):
    reader = PdfReader(path)
    # extract_text() can return None for image-only pages, so fall back to ""
    return "\n".join(page.extract_text() or "" for page in reader.pages)
Performance Tips
- Cache embeddings to disk — compute them once and persist; don’t re-index unchanged documents
- Keep chunks around 300–500 words — very short chunks lose context; very long ones dilute the semantic signal
- Use mxbai-embed-large selectively — better accuracy, but slower; use it when retrieval quality matters most
- Embedding is fast even on CPU — a few hundred chunks per minute is typical on modern hardware
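The first tip can be sketched as a small disk cache keyed by a hash of the chunk text. The stub `embed` below stands in for a real Ollama call so the sketch is self-contained:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("./embedding_cache")
CACHE_DIR.mkdir(exist_ok=True)

def embed(text):
    # Stand-in for ollama.embeddings(model=..., prompt=text)["embedding"].
    return [float(len(text)), 0.0]

def cached_embed(text):
    # Key the cache on a content hash so unchanged chunks are never re-embedded.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    vector = embed(text)
    path.write_text(json.dumps(vector))
    return vector

v1 = cached_embed("hello world")  # computed and written to disk
v2 = cached_embed("hello world")  # read straight back from the cache
```

On re-indexing, only chunks whose text has changed produce a new hash and hence a new embedding call.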