Why LangChain and Ollama Work So Well Together
LangChain has established itself as the go-to framework for building LLM-powered applications in Python. It handles prompt management, chaining, retrieval-augmented generation (RAG), and agentic workflows in a composable way. Ollama, meanwhile, lets you run large language models entirely on your own hardware — no API keys, no usage costs, no data leaving your machine.
The combination is compelling for several reasons. When building RAG pipelines or autonomous agents, you often make dozens or hundreds of LLM calls for a single user request. With a cloud API, those calls cost money and introduce latency. With Ollama running locally, calls are essentially free and limited only by your hardware. This makes experimentation fast and cheap, and it keeps sensitive documents entirely private — a genuine advantage when working with internal business data or client documents.
LangChain’s langchain-ollama integration package gives you first-class support for both chat completions and embeddings through Ollama, using the same interfaces as any other LangChain-compatible model provider.
Prerequisites
Before writing any code, make sure you have the following in place:
- Python 3.9 or later. Python 3.11 is a good choice as of 2025.
- Ollama installed and running. Download it from ollama.com and start it with ollama serve. By default it listens on http://localhost:11434.
- At least one model pulled. For example: ollama pull llama3.2 for a capable general-purpose model, and ollama pull nomic-embed-text for embeddings.
- LangChain and the Ollama integration package installed.
pip install langchain langchain-ollama
If you plan to build a RAG pipeline (covered later), also install FAISS and the document loaders:
pip install faiss-cpu langchain-community
You can verify Ollama is responding before writing any Python by hitting its health endpoint in a terminal:
curl http://localhost:11434/api/tags
If you see a JSON list of your pulled models, you are ready to go.
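The same check can be done from Python at application startup, using only the standard library. The endpoint and port below are Ollama's defaults; the helper name is just illustrative:

```python
import json
import urllib.request
import urllib.error

def ollama_is_up(base_url: str = "http://localhost:11434") -> bool:
    """Return True if the Ollama server responds on its /api/tags endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            data = json.load(resp)
            # A healthy server returns a JSON object with a "models" list.
            return isinstance(data.get("models"), list)
    except (urllib.error.URLError, OSError, ValueError):
        return False

print("Ollama running:", ollama_is_up())
```

Calling this once before building your chains gives a clearer error message than a timeout buried inside a LangChain call.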
Using ChatOllama for Basic Invocation
The ChatOllama class lives in langchain_ollama and follows the standard LangChain chat model interface. You instantiate it with a model name and, optionally, generation parameters like temperature.
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.2", temperature=0.3)

response = llm.invoke("Explain the difference between supervised and unsupervised learning in two sentences.")
print(response.content)
invoke() accepts either a plain string or a list of HumanMessage / SystemMessage objects from langchain_core.messages. For anything beyond a single turn, you will want to pass a message list so you can include a system prompt or conversation history:
from langchain_core.messages import SystemMessage, HumanMessage

messages = [
    SystemMessage(content="You are a concise technical assistant."),
    HumanMessage(content="What is a vector database?"),
]

response = llm.invoke(messages)
print(response.content)
Building a Chain with Prompt Templates (LCEL)
LangChain’s Expression Language (LCEL) lets you compose prompts, models, and output parsers into a pipeline using the | operator. This is the idiomatic way to build chains in modern LangChain code.
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOllama(model="llama3.2", temperature=0.2)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful Python tutor. Keep answers brief and include a short code example where relevant."),
    ("human", "{question}"),
])

chain = prompt | llm | StrOutputParser()

answer = chain.invoke({"question": "How do I read a CSV file with pandas?"})
print(answer)
The chain works left to right: the prompt template formats the input dictionary into a list of messages, ChatOllama generates a response, and StrOutputParser extracts the text content from the AIMessage object. You can extend this chain further — for example, passing the output through a custom function or another prompt — simply by appending more steps with |.
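If the `|` operator looks magical, it helps to know it is ordinary Python operator overloading: each LCEL component implements `__or__`, so piping builds up a composed pipeline. Here is a deliberately tiny sketch of the idea — this is not LangChain's actual implementation, just the concept:

```python
class Step:
    """Toy stand-in for a LangChain Runnable: wraps a function, supports |."""
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # Compose: run self first, feed its output into the next step.
        return Step(lambda x: other.fn(self.fn(x)))

    def invoke(self, x):
        return self.fn(x)

format_prompt = Step(lambda d: f"Q: {d['question']}")
fake_model    = Step(lambda p: p.upper())
parse_output  = Step(lambda s: s.strip())

pipeline = format_prompt | fake_model | parse_output
print(pipeline.invoke({"question": "hi"}))  # → Q: HI
```

Real Runnables add batching, streaming, and async support on top, but the composition model is the same left-to-right data flow.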
Streaming Responses
For interactive applications, waiting for the full response before displaying anything feels sluggish. Both ChatOllama and LCEL chains support streaming natively.
for chunk in chain.stream({"question": "Explain Python generators in plain English."}):
    print(chunk, end="", flush=True)
print()  # newline after stream ends
stream() yields string chunks as they arrive from the model. This works at every level of a chain — if any step in the chain does not support streaming, LangChain will buffer that step and continue streaming from the next one that does. For most chains ending in StrOutputParser, full token-level streaming works out of the box.
Using OllamaEmbeddings
Embeddings are vector representations of text, and they are the foundation of any RAG pipeline. Ollama provides several embedding models, with nomic-embed-text being a widely used open-source option that balances quality and speed well.
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Embed a single string
vector = embeddings.embed_query("What is retrieval-augmented generation?")
print(f"Vector dimensions: {len(vector)}")

# Embed a list of documents
vectors = embeddings.embed_documents([
    "LangChain is a framework for LLM applications.",
    "Ollama runs large language models locally.",
])
nomic-embed-text produces 768-dimensional vectors. Pull it once with ollama pull nomic-embed-text and it is available locally indefinitely.
Building a Local RAG Pipeline
A RAG pipeline has three phases: ingestion (load and index documents), retrieval (find relevant chunks for a query), and generation (answer the question using retrieved context). Here is a complete working example using FAISS as the vector store.
Step 1: Load and Split Documents
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = TextLoader("my_document.txt", encoding="utf-8")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
RecursiveCharacterTextSplitter tries to split on paragraph boundaries first, falling back to sentence and word boundaries. A chunk_size of 500 characters works well for most prose documents; adjust upward for technical content with longer paragraphs.
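To make the chunk_size and chunk_overlap semantics concrete, here is a deliberately simplified character-window splitter. The real RecursiveCharacterTextSplitter is smarter about paragraph and sentence boundaries; this sketch only shows how size and overlap interact:

```python
def naive_split(text: str, chunk_size: int = 500, chunk_overlap: int = 50):
    """Simplified sliding-window splitter: fixed-size chunks, fixed overlap."""
    step = chunk_size - chunk_overlap  # each chunk starts 450 chars after the last
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

demo_text = "x" * 1200
demo_chunks = naive_split(demo_text, chunk_size=500, chunk_overlap=50)
print([len(c) for c in demo_chunks])  # → [500, 500, 300]
```

The overlap means the last 50 characters of each chunk reappear at the start of the next one, so a sentence that straddles a chunk boundary is still seen whole in at least one chunk.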
Step 2: Embed and Store in FAISS
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = FAISS.from_documents(chunks, embeddings)
This embeds every chunk and stores the vectors in an in-memory FAISS index. You can persist it to disk with vectorstore.save_local("faiss_index") and reload it later with FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True).
Step 3: Retrieve and Generate
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

llm = ChatOllama(model="llama3.2", temperature=0.1)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the context provided. "
               "If the answer is not in the context, say so.\n\nContext:\n{context}"),
    ("human", "{question}"),
])

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What are the main topics covered in this document?")
print(answer)
The retriever fetches the four most semantically similar chunks to the query. format_docs joins them into a single string, which is inserted into the system prompt as context. The model then answers grounded in that context rather than relying on its training data alone.
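Under the hood, "most semantically similar" typically means highest cosine similarity between the query vector and each chunk vector. A minimal illustration with toy 3-dimensional vectors (real embeddings from nomic-embed-text have 768 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.0]
chunk_vectors = {
    "chunk about LangChain": [0.8, 0.2, 0.1],
    "chunk about cooking":   [0.0, 0.1, 0.9],
}

# Rank chunks by similarity to the query, best first.
ranked = sorted(chunk_vectors,
                key=lambda name: cosine_similarity(query, chunk_vectors[name]),
                reverse=True)
print(ranked[0])  # → chunk about LangChain
```

FAISS performs essentially this ranking, but over thousands of high-dimensional vectors with an index structure that avoids comparing against every chunk.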
Using Different Models for Different Tasks
One practical advantage of running models locally is that you can run multiple models simultaneously without any per-token cost. A common pattern in RAG pipelines is to use a fast, small model for intermediate steps and a larger model only for final generation.
from langchain_ollama import ChatOllama, OllamaEmbeddings

# Fast model for query expansion or classification
fast_llm = ChatOllama(model="llama3.2:1b", temperature=0.0)

# Stronger model for final answer generation
strong_llm = ChatOllama(model="llama3.1:8b", temperature=0.2)

# Dedicated embedding model
embeddings = OllamaEmbeddings(model="nomic-embed-text")
For example, you might use llama3.2:1b to rewrite a user’s question into a more precise search query before embedding it, then use llama3.1:8b to synthesise the final answer. This keeps the pipeline fast without sacrificing quality on the step that matters most.
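The control flow of that pattern can be sketched independently of any live model. Below, stub lambdas stand in for the two LLM calls and the retriever, so the example shows only how the pieces hand off to each other; the function names are illustrative, not LangChain APIs:

```python
def rewrite_query(question, fast_llm):
    # Small model turns a conversational question into a search-friendly query.
    return fast_llm(f"Rewrite as a search query: {question}")

def answer_question(question, context, strong_llm):
    # Large model generates the final answer, grounded in retrieved context.
    return strong_llm(f"Context: {context}\nQuestion: {question}")

def two_model_rag(question, fast_llm, strong_llm, retrieve):
    search_query = rewrite_query(question, fast_llm)   # cheap step
    context = retrieve(search_query)                   # embedding + search
    return answer_question(question, context, strong_llm)  # expensive step

# Stubs in place of real models, to demonstrate the flow:
fast_stub   = lambda prompt: "vector database definition"
strong_stub = lambda prompt: f"Answer based on: {prompt.splitlines()[0]}"
retrieve    = lambda query: f"[chunks matching '{query}']"

print(two_model_rag("so what's a vector DB anyway?", fast_stub, strong_stub, retrieve))
```

To use it for real, replace each stub with a call like `lambda p: fast_llm.invoke(p).content` around the ChatOllama instances defined above.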
Common Gotchas
Model Not Loaded Yet
Ollama loads models into memory on first use, which can take several seconds for large models. If your application sends a request the moment Ollama starts, you may receive a timeout or an error before the model is ready. A simple mitigation is to send a short warm-up request at startup, or to increase the timeout on your ChatOllama instance:
llm = ChatOllama(model="llama3.2", timeout=120) # seconds
Timeout on Long Generations
The default HTTP timeout in LangChain’s Ollama integration is relatively short. If you are generating long responses with a large model on modest hardware, increase the timeout as shown above, or use streaming so the connection stays alive throughout generation.
CORS When Calling from a Browser
Ollama’s API does not set permissive CORS headers by default. If you are building a web front end that calls Ollama directly from JavaScript in the browser, you will hit CORS errors. The correct fix is to set the OLLAMA_ORIGINS environment variable before starting Ollama:
OLLAMA_ORIGINS="http://localhost:3000" ollama serve
In production, route browser requests through your own backend rather than exposing Ollama directly. LangChain runs server-side, so this is rarely an issue when using langchain-ollama in Python.
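A stdlib-only sketch of such a backend proxy follows: the browser talks to your server, which forwards chat requests to Ollama and sets the CORS header itself. The paths and ports are assumptions, and a production service would add authentication, input validation, error handling, and streaming:

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default chat endpoint

class ProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the browser's request body and forward it to Ollama verbatim.
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        req = urllib.request.Request(OLLAMA_URL, data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req, timeout=120) as upstream:
            payload = upstream.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        # The CORS header comes from *your* server, not from Ollama.
        self.send_header("Access-Control-Allow-Origin", "http://localhost:3000")
        self.end_headers()
        self.wfile.write(payload)

# To run: HTTPServer(("localhost", 8000), ProxyHandler).serve_forever()
```

This keeps Ollama unreachable from the outside while giving the front end a single origin to talk to.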
Model Name Mismatches
Model names in Ollama are case-sensitive and must match exactly what ollama list shows. llama3.2 and llama3.2:latest are equivalent, but Llama3.2 will fail. If you get a model-not-found error, run ollama list to confirm the exact name.
Practical Example: A Local Document Q&A Chatbot
Putting it all together, here is the skeleton of a command-line document Q&A chatbot that accepts questions in a loop:
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# --- Ingestion ---
loader = TextLoader("knowledge_base.txt", encoding="utf-8")
docs = loader.load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# --- Chain ---
llm = ChatOllama(model="llama3.2", temperature=0.1)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer using only the context below.\n\nContext:\n{context}"),
    ("human", "{question}"),
])

def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# --- Interactive loop ---
print("Document Q&A ready. Type 'quit' to exit.\n")
while True:
    question = input("You: ").strip()
    if question.lower() in ("quit", "exit"):
        break
    for chunk in chain.stream(question):
        print(chunk, end="", flush=True)
    print("\n")
This chatbot loads a text file, indexes it locally, and answers questions about it entirely on your own machine. There are no API costs, no rate limits, and no data sent externally. Swap TextLoader for PyPDFLoader or UnstructuredWordDocumentLoader (both in langchain-community) to handle PDFs or Word documents instead.
Next Steps
Once you are comfortable with basic chains and RAG, the natural progression is to add memory so the chatbot can refer back to earlier messages in a session, or to build agents that can choose between multiple tools — for example, deciding whether to search a local knowledge base or perform a web search depending on the question. LangChain’s create_react_agent function works with ChatOllama in exactly the same way it works with any other chat model, making the transition straightforward. The key insight is that once you understand the LCEL composition model, the same patterns apply regardless of whether the underlying LLM is running locally via Ollama or in the cloud via an API.