
How to Use Ollama with LangChain: Complete Guide

Why LangChain and Ollama Work So Well Together

LangChain has established itself as the go-to framework for building LLM-powered applications in Python. It handles prompt management, chaining, retrieval-augmented generation (RAG), and agentic workflows in a composable way. Ollama, meanwhile, lets you run large language models entirely on your own hardware — no API keys, no usage costs, no data leaving your machine.

The combination is compelling for several reasons. When building RAG pipelines or autonomous agents, you often make dozens or hundreds of LLM calls for a single user request. With a cloud API, those calls cost money and introduce latency. With Ollama running locally, calls are essentially free and limited only by your hardware. This makes experimentation fast and cheap, and it keeps sensitive documents entirely private — a genuine advantage when working with internal business data or client documents.

LangChain’s langchain-ollama integration package gives you first-class support for both chat completions and embeddings through Ollama, using the same interfaces as any other LangChain-compatible model provider.

Prerequisites

Before writing any code, make sure you have the following in place:

  • Python 3.9 or later. Python 3.11 is a good choice as of 2025.
  • Ollama installed and running. Download it from ollama.com and start it with ollama serve. By default it listens on http://localhost:11434.
  • At least one model pulled. For example: ollama pull llama3.2 for a capable general-purpose model, and ollama pull nomic-embed-text for embeddings.
  • LangChain and the Ollama integration package installed.
pip install langchain langchain-ollama

If you plan to build a RAG pipeline (covered later), also install FAISS and the document loaders:

pip install faiss-cpu langchain-community

You can verify Ollama is responding before writing any Python by listing its pulled models from a terminal:

curl http://localhost:11434/api/tags

If you see a JSON list of your pulled models, you are ready to go.
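If you prefer to do the same check from Python, the response from /api/tags contains a "models" list whose entries carry a "name" field. A small stdlib-only helper to pull out the names; the sample payload below is abbreviated, not a full real response:

```python
import json

def model_names(tags_json: str) -> list[str]:
    """Extract model names from the JSON returned by Ollama's /api/tags."""
    payload = json.loads(tags_json)
    return [m["name"] for m in payload.get("models", [])]

# Abbreviated stand-in for a real /api/tags response
sample = '{"models": [{"name": "llama3.2:latest"}, {"name": "nomic-embed-text:latest"}]}'
print(model_names(sample))  # ['llama3.2:latest', 'nomic-embed-text:latest']
```

Pair this with urllib.request or requests against http://localhost:11434/api/tags to confirm at startup that the models your application expects are actually pulled.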

Using ChatOllama for Basic Invocation

The ChatOllama class lives in langchain_ollama and follows the standard LangChain chat model interface. You instantiate it with a model name and, optionally, generation parameters like temperature.

from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.2", temperature=0.3)

response = llm.invoke("Explain the difference between supervised and unsupervised learning in two sentences.")
print(response.content)

invoke() accepts either a plain string or a list of HumanMessage / SystemMessage objects from langchain_core.messages. For anything beyond a single turn, you will want to pass a message list so you can include a system prompt or conversation history:

from langchain_core.messages import SystemMessage, HumanMessage

messages = [
    SystemMessage(content="You are a concise technical assistant."),
    HumanMessage(content="What is a vector database?"),
]

response = llm.invoke(messages)
print(response.content)

Building a Chain with Prompt Templates (LCEL)

LangChain’s Expression Language (LCEL) lets you compose prompts, models, and output parsers into a pipeline using the | operator. This is the idiomatic way to build chains in modern LangChain code.

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOllama(model="llama3.2", temperature=0.2)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful Python tutor. Keep answers brief and include a short code example where relevant."),
    ("human", "{question}"),
])

chain = prompt | llm | StrOutputParser()

answer = chain.invoke({"question": "How do I read a CSV file with pandas?"})
print(answer)

The chain works left to right: the prompt template formats the input dictionary into a list of messages, ChatOllama generates a response, and StrOutputParser extracts the text content from the AIMessage object. You can extend this chain further — for example, passing the output through a custom function or another prompt — simply by appending more steps with |.
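The mechanics behind | are easy to internalise: each step's output becomes the next step's input. Here is a toy stdlib-only illustration of that composition idea — a deliberately minimal stand-in, not LangChain's actual Runnable implementation:

```python
class Step:
    """Minimal stand-in for a LangChain runnable: wraps a function, supports |."""
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # Compose left to right: run self first, feed the result to `other`
        other = other if isinstance(other, Step) else Step(other)
        return Step(lambda x: other.fn(self.fn(x)))

    def invoke(self, x):
        return self.fn(x)

# A miniature "prompt | llm | parser" pipeline with a fake model
fmt = Step(lambda d: f"Q: {d['question']}")
fake_llm = Step(lambda p: p + " -> A: use pd.read_csv")
parse = Step(lambda s: s.split("A: ")[-1])

pipeline = fmt | fake_llm | parse
print(pipeline.invoke({"question": "How do I read a CSV?"}))  # use pd.read_csv
```

In real LCEL, plain Python functions piped into a chain are coerced into runnables automatically, which is why appending a post-processing function with | just works.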

Streaming Responses

For interactive applications, waiting for the full response before displaying anything feels sluggish. Both ChatOllama and LCEL chains support streaming natively.

for chunk in chain.stream({"question": "Explain Python generators in plain English."}):
    print(chunk, end="", flush=True)
print()  # newline after stream ends

stream() yields string chunks as they arrive from the model. This works at every level of a chain — if any step in the chain does not support streaming, LangChain will buffer that step and continue streaming from the next one that does. For most chains ending in StrOutputParser, full token-level streaming works out of the box.

Using OllamaEmbeddings

Embeddings are vector representations of text, and they are the foundation of any RAG pipeline. Ollama provides several embedding models, with nomic-embed-text being a widely used open-source option that balances quality and speed well.

from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Embed a single string
vector = embeddings.embed_query("What is retrieval-augmented generation?")
print(f"Vector dimensions: {len(vector)}")

# Embed a list of documents
vectors = embeddings.embed_documents([
    "LangChain is a framework for LLM applications.",
    "Ollama runs large language models locally.",
])

nomic-embed-text produces 768-dimensional vectors. Pull it once with ollama pull nomic-embed-text and it is available locally indefinitely.
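What makes these vectors useful is that semantically similar texts produce vectors pointing in similar directions, usually measured with cosine similarity. A stdlib-only sketch — the three-dimensional vectors here are tiny hypothetical stand-ins for real 768-dimensional embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Tiny stand-in vectors; real nomic-embed-text output has 768 dimensions
v1 = [0.1, 0.9, 0.2]
v2 = [0.2, 0.8, 0.3]  # points in a similar direction to v1
v3 = [0.9, 0.1, 0.0]  # points in a different direction

print(cosine_similarity(v1, v2) > cosine_similarity(v1, v3))  # True
```

Vector stores like FAISS are essentially doing this comparison at scale, with indexing structures that avoid comparing the query against every stored vector.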

Building a Local RAG Pipeline

A RAG pipeline has three phases: ingestion (load and index documents), retrieval (find relevant chunks for a query), and generation (answer the question using retrieved context). Here is a complete working example using FAISS as the vector store.

Step 1: Load and Split Documents

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = TextLoader("my_document.txt", encoding="utf-8")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

RecursiveCharacterTextSplitter tries to split on paragraph boundaries first, falling back to sentence and word boundaries. A chunk_size of 500 characters works well for most prose documents; adjust upward for technical content with longer paragraphs.
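The effect of chunk_size and chunk_overlap can be seen with a simplified fixed-window version of the idea — note this hard-cut sketch is not the library's actual algorithm, which prefers paragraph and sentence boundaries before falling back to a hard cut:

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Simplified chunking: fixed windows that each repeat the last
    `chunk_overlap` characters of the previous window."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "a" * 1200
chunks = sliding_chunks(text, chunk_size=500, chunk_overlap=50)
print([len(c) for c in chunks])  # [500, 500, 300]
```

The overlap means a sentence falling on a chunk boundary still appears whole in at least one chunk, which keeps retrieval from missing answers that straddle a cut.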

Step 2: Embed and Store in FAISS

from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = FAISS.from_documents(chunks, embeddings)

This embeds every chunk and stores the vectors in an in-memory FAISS index. You can persist it to disk with vectorstore.save_local("faiss_index") and reload it later with FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True).

Step 3: Retrieve and Generate

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

llm = ChatOllama(model="llama3.2", temperature=0.1)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the context provided. "
               "If the answer is not in the context, say so.\n\nContext:\n{context}"),
    ("human", "{question}"),
])

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What are the main topics covered in this document?")
print(answer)

The retriever fetches the four most semantically similar chunks to the query. format_docs joins them into a single string, which is inserted into the system prompt as context. The model then answers grounded in that context rather than relying on its training data alone.

Using Different Models for Different Tasks

One practical advantage of running models locally is that you can run multiple models simultaneously without any per-token cost. A common pattern in RAG pipelines is to use a fast, small model for intermediate steps and a larger model only for final generation.

from langchain_ollama import ChatOllama, OllamaEmbeddings

# Fast model for query expansion or classification
fast_llm = ChatOllama(model="llama3.2:1b", temperature=0.0)

# Stronger model for final answer generation
strong_llm = ChatOllama(model="llama3.1:8b", temperature=0.2)

# Dedicated embedding model
embeddings = OllamaEmbeddings(model="nomic-embed-text")

For example, you might use llama3.2:1b to rewrite a user’s question into a more precise search query before embedding it, then use llama3.1:8b to synthesise the final answer. This keeps the pipeline fast without sacrificing quality on the step that matters most.

Common Gotchas

Model Not Loaded Yet

Ollama loads models into memory on first use, which can take several seconds for large models. If your application sends a request the moment Ollama starts, you may receive a timeout or an error before the model is ready. A simple mitigation is to send a short warm-up request at startup, or to increase the timeout on your ChatOllama instance:

llm = ChatOllama(model="llama3.2", timeout=120)  # seconds

Timeout on Long Generations

The default HTTP timeout in LangChain’s Ollama integration is relatively short. If you are generating long responses with a large model on modest hardware, increase the timeout as shown above, or use streaming so the connection stays alive throughout generation.

CORS When Calling from a Browser

Ollama’s API does not set permissive CORS headers by default. If you are building a web front end that calls Ollama directly from JavaScript in the browser, you will hit CORS errors. The correct fix is to set the OLLAMA_ORIGINS environment variable before starting Ollama:

OLLAMA_ORIGINS="http://localhost:3000" ollama serve

In production, route browser requests through your own backend rather than exposing Ollama directly. LangChain runs server-side, so this is rarely an issue when using langchain-ollama in Python.

Model Name Mismatches

Model names in Ollama are case-sensitive and must match exactly what ollama list shows. llama3.2 and llama3.2:latest are equivalent, but Llama3.2 will fail. If you get a model-not-found error, run ollama list to confirm the exact name.
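A small defensive check at startup can surface this mismatch before the first real request fails. The helpers below are illustrative, not part of the Ollama API; the pulled list stands in for what you would parse out of ollama list or /api/tags:

```python
def normalize(name: str) -> str:
    """Treat 'llama3.2' and 'llama3.2:latest' as the same model."""
    return name if ":" in name else name + ":latest"

def is_pulled(requested: str, pulled: list[str]) -> bool:
    # Exact, case-sensitive comparison after normalizing the tag
    return normalize(requested) in pulled

pulled = ["llama3.2:latest", "nomic-embed-text:latest"]  # as shown by `ollama list`
print(is_pulled("llama3.2", pulled))   # True: bare name implies :latest
print(is_pulled("Llama3.2", pulled))   # False: case matters
```

Failing fast with a clear "model not pulled" message beats letting a RAG pipeline die halfway through ingestion.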

Practical Example: A Local Document Q&A Chatbot

Putting it all together, here is the skeleton of a command-line document Q&A chatbot that accepts questions in a loop:

from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# --- Ingestion ---
loader = TextLoader("knowledge_base.txt", encoding="utf-8")
docs = loader.load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# --- Chain ---
llm = ChatOllama(model="llama3.2", temperature=0.1)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer using only the context below.\n\nContext:\n{context}"),
    ("human", "{question}"),
])

def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# --- Interactive loop ---
print("Document Q&A ready. Type 'quit' to exit.\n")
while True:
    question = input("You: ").strip()
    if question.lower() in ("quit", "exit"):
        break
    for chunk in chain.stream(question):
        print(chunk, end="", flush=True)
    print("\n")

This chatbot loads a text file, indexes it locally, and answers questions about it entirely on your own machine. There are no API costs, no rate limits, and no data sent externally. Swap TextLoader for PyPDFLoader or UnstructuredWordDocumentLoader (both in langchain-community) to handle PDFs or Word documents instead.

Next Steps

Once you are comfortable with basic chains and RAG, the natural progression is to add memory so the chatbot can refer back to earlier messages in a session, or to build agents that can choose between multiple tools — for example, deciding whether to search a local knowledge base or perform a web search depending on the question. LangChain’s create_react_agent function works with ChatOllama in exactly the same way it works with any other chat model, making the transition straightforward. The key insight is that once you understand the LCEL composition model, the same patterns apply regardless of whether the underlying LLM is running locally via Ollama or in the cloud via an API.
