Large language models are impressive, but they hallucinate and they cannot answer questions about your private data. Retrieval-Augmented Generation (RAG) solves both problems by grounding the model's responses in real documents that you control. Instead of relying on the LLM's training data alone, a RAG system retrieves relevant passages from your own knowledge base and feeds them into the prompt so the model can generate accurate, cited answers.
In this tutorial, you will build a fully working RAG chatbot from scratch using LangChain for orchestration, Pinecone as the vector store, and OpenAI for embeddings and chat completion. By the end, you will have a chatbot that ingests PDF documents, stores them as vector embeddings, and answers natural language questions with context drawn directly from those documents. The chatbot will also maintain conversation history so follow-up questions work naturally.
Architecture Overview
Before writing any code, it helps to understand the moving parts. A RAG system has two phases: an ingestion phase and a query phase.
During ingestion, your documents are loaded, split into smaller chunks, converted into numerical vectors (embeddings) by an embedding model, and stored in a vector database like Pinecone. During the query phase, the user's question is also converted into an embedding, the vector database finds the most similar document chunks, and those chunks are injected into a prompt alongside the question. The LLM then generates an answer grounded in those retrieved passages.
Here is the complete pipeline at a glance:
- Load documents - Read PDFs, text files, or web pages into LangChain Document objects.
- Split into chunks - Break documents into overlapping chunks small enough for the embedding model.
- Embed chunks - Convert each chunk into a vector using OpenAI's embedding model.
- Store in Pinecone - Persist vectors in Pinecone for fast similarity search.
- Retrieve and generate - At query time, find the top-k relevant chunks and pass them to the LLM to produce a grounded answer.
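If you would like to see the whole pipeline compressed into a handful of lines before building it out properly, here is a rough sketch using the same libraries the tutorial installs below. Treat it as a preview rather than a drop-in script: it assumes your API keys are already set in the environment, the Pinecone index already exists, and docs/handbook.pdf is a placeholder filename.

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore

# Ingestion phase: load, split, embed, store
docs = PyPDFLoader("docs/handbook.pdf").load()  # placeholder PDF path
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)
store = PineconeVectorStore.from_documents(
    chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    index_name="rag-chatbot",
)

# Query phase: retrieve the most similar chunks and ask the LLM to answer from them
question = "What is the return policy?"
retrieved = store.as_retriever(search_kwargs={"k": 4}).invoke(question)
context = "\n\n".join(d.page_content for d in retrieved)
answer = ChatOpenAI(model="gpt-4o").invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer.content)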

Prerequisites
Before you start, make sure you have the following ready:
- Python 3.11+ installed on your machine.
- An OpenAI API key with access to the embeddings and chat completions endpoints.
- A free Pinecone account and API key. The free tier is sufficient for this tutorial.
- One or more PDF documents you want the chatbot to answer questions about.
- Basic familiarity with Python and working with APIs.
Step 1: Install Dependencies
Create a new project directory and set up a virtual environment. Then install all the packages you need. LangChain provides the orchestration layer, langchain-pinecone handles the vector store integration, and pypdf lets you load PDF files.
mkdir rag-chatbot && cd rag-chatbot
python -m venv .venv
source .venv/bin/activate
pip install langchain langchain-openai langchain-pinecone \
    langchain-community pinecone-client pypdf python-dotenv

Create a .env file in the project root to store your API keys securely:
OPENAI_API_KEY=sk-your-openai-api-key
PINECONE_API_KEY=your-pinecone-api-key
PINECONE_INDEX_NAME=rag-chatbot
Step 2: Load and Split Documents
The first step in any RAG pipeline is getting your documents into a format LangChain can work with. LangChain has document loaders for dozens of formats. For this tutorial, we will use PyPDFLoader to read PDF files, then RecursiveCharacterTextSplitter to break them into chunks.
Why split documents? Embedding models have token limits, and smaller chunks produce more precise retrieval. A chunk size of 1000 characters with a 200-character overlap is a solid starting point. The overlap ensures that context is not lost at chunk boundaries.
import os
from pathlib import Path
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
load_dotenv()
def load_and_split_documents(pdf_directory: str) -> list:
    """Load all PDFs from a directory and split them into chunks."""
    documents = []
    pdf_dir = Path(pdf_directory)
    for pdf_file in pdf_dir.glob("*.pdf"):
        print(f"Loading: {pdf_file.name}")
        loader = PyPDFLoader(str(pdf_file))
        documents.extend(loader.load())
    print(f"Loaded {len(documents)} pages from {len(list(pdf_dir.glob('*.pdf')))} PDFs")

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks")
    return chunks

if __name__ == "__main__":
    chunks = load_and_split_documents("./docs")
    # Preview the first chunk
    print(f"\nFirst chunk ({len(chunks[0].page_content)} chars):")
    print(chunks[0].page_content[:300])
    print(f"\nMetadata: {chunks[0].metadata}")

Place your PDF files in a docs/ folder at the project root. The loader automatically preserves metadata like page numbers and source filenames, which you can display alongside chatbot answers later.
Step 3: Create Embeddings and Store in Pinecone
Now you need to turn each text chunk into a vector and store it in Pinecone. OpenAI's text-embedding-3-small model produces 1536-dimensional vectors and works well for RAG applications. You can learn more in the OpenAI embeddings guide.
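If you want to double-check that dimension before creating the index, a quick sanity check is to embed a short string and print the vector length (this assumes your OPENAI_API_KEY is already loaded from .env):

from langchain_openai import OpenAIEmbeddings

emb = OpenAIEmbeddings(model="text-embedding-3-small")
vector = emb.embed_query("What is the refund policy?")
print(len(vector))  # prints 1536, which must match the Pinecone index dimension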
First, create a Pinecone index. Go to the Pinecone dashboard and create an index named rag-chatbot with a dimension of 1536 and the cosine metric. Alternatively, you can create it programmatically:
from pinecone import Pinecone, ServerlessSpec
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from dotenv import load_dotenv
import os
import time
load_dotenv()
# Initialize Pinecone
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index_name = os.getenv("PINECONE_INDEX_NAME", "rag-chatbot")
# Create the index if it doesn't exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    # Wait for index to be ready
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)
    print(f"Created index: {index_name}")
else:
    print(f"Index '{index_name}' already exists")

# Initialize the embedding model
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)

# Create the vector store and add documents
def ingest_documents(chunks: list) -> PineconeVectorStore:
    """Embed document chunks and store them in Pinecone."""
    print(f"Embedding and storing {len(chunks)} chunks...")
    vector_store = PineconeVectorStore.from_documents(
        documents=chunks,
        embedding=embeddings,
        index_name=index_name,
    )
    print("Ingestion complete!")
    return vector_store

Add the ingestion call to the bottom of your ingest.py script's __main__ block so the full pipeline runs with a single command:

if __name__ == "__main__":
    chunks = load_and_split_documents("./docs")
    vector_store = ingest_documents(chunks)
    print(f"\nStored {len(chunks)} chunks in Pinecone index '{index_name}'")

Run the ingestion script with python ingest.py. Depending on the number and size of your PDFs, this may take a minute or two. You only need to run ingestion once per document set. If you add new documents later, embed and add just the new files rather than re-running the whole script on the docs/ folder, otherwise the chunks that are already stored will be duplicated in the index.
Step 4: Build the RAG Chain
This is the core of the chatbot. You will connect the Pinecone vector store to a LangChain chain that retrieves the most relevant chunks and sends them, along with the user's question, to GPT-4o for answer generation. The prompt template is critical here: it instructs the model to answer only from the provided context and to say it does not have enough information when the context lacks the answer.
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
load_dotenv()
# Initialize components
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)

vector_store = PineconeVectorStore(
    index_name=os.getenv("PINECONE_INDEX_NAME", "rag-chatbot"),
    embedding=embeddings,
)

retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},  # Return top 4 most relevant chunks
)

llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    openai_api_key=os.getenv("OPENAI_API_KEY"),
)

# Define the RAG prompt
RAG_PROMPT = ChatPromptTemplate.from_template("""\
You are a helpful assistant that answers questions based on the provided context.
Use ONLY the context below to answer the question. If the context does not contain
enough information to answer the question, say "I don't have enough information
to answer that question based on the available documents."
When possible, cite which document or section your answer comes from.
Context:
{context}
Question: {question}
Answer:
""")

def format_docs(docs):
    """Format retrieved documents into a single string for the prompt."""
    formatted = []
    for i, doc in enumerate(docs, 1):
        source = doc.metadata.get("source", "Unknown")
        page = doc.metadata.get("page", "N/A")
        formatted.append(
            f"[Document {i} | Source: {source}, Page: {page}]\n{doc.page_content}"
        )
    return "\n\n".join(formatted)

# Build the RAG chain using LCEL (LangChain Expression Language)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | RAG_PROMPT
    | llm
    | StrOutputParser()
)

Let us break down what is happening here. The retriever wraps the Pinecone vector store with a search configuration that returns the top 4 most similar chunks. The format_docs function takes the retrieved documents and formats them into a readable string with source information. The LCEL pipe syntax chains everything together: the question goes to the retriever, the retrieved docs get formatted, everything gets inserted into the prompt template, the prompt goes to the LLM, and the output is parsed into a plain string.
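At this point you can already query the chain directly, which is a handy sanity check before adding memory (the question is just an example; use one your documents can actually answer):

answer = rag_chain.invoke("What is the company's remote work policy?")
print(answer)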
Step 5: Add Conversation Memory
A plain RAG chain treats every question independently. That means if the user asks "What is the return policy?" and then follows up with "How long does it take?", the chatbot will not know that "it" refers to returns. To fix this, you need to add conversation memory that reformulates follow-up questions into standalone queries.
The approach is to use a second LLM call that takes the chat history and the latest user message, then rewrites the message into a self-contained question. This standalone question is then used for retrieval.
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage
# Prompt to condense follow-up questions into standalone questions
CONDENSE_PROMPT = ChatPromptTemplate.from_messages([
    ("system", """Given the following conversation and a follow-up question,
rephrase the follow-up question to be a standalone question that captures
all necessary context from the conversation history.
If the question is already standalone, return it as-is."""),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{question}"),
])

condense_chain = CONDENSE_PROMPT | llm | StrOutputParser()

# Full RAG prompt with history awareness
RAG_PROMPT_WITH_HISTORY = ChatPromptTemplate.from_template("""\
You are a helpful assistant that answers questions based on the provided context.
Use ONLY the context below to answer the question. If the context does not contain
enough information to answer the question, say "I don't have enough information
to answer that question based on the available documents."
When possible, cite which document or section your answer comes from.
Context:
{context}
Question: {standalone_question}
Answer:
""")

class RAGChatbot:
    """RAG chatbot with conversation memory."""

    def __init__(self):
        self.chat_history: list = []

    def ask(self, question: str) -> str:
        """Ask a question with conversation context."""
        # Step 1: Condense the question using chat history
        if self.chat_history:
            standalone_question = condense_chain.invoke({
                "chat_history": self.chat_history,
                "question": question,
            })
        else:
            standalone_question = question
        # Step 2: Retrieve relevant documents
        docs = retriever.invoke(standalone_question)
        context = format_docs(docs)
        # Step 3: Generate answer
        prompt = RAG_PROMPT_WITH_HISTORY.format(
            context=context,
            standalone_question=standalone_question,
        )
        response = llm.invoke(prompt)
        answer = response.content
        # Step 4: Update chat history
        self.chat_history.append(HumanMessage(content=question))
        self.chat_history.append(AIMessage(content=answer))
        # Keep only the last 10 exchanges to avoid token limits
        if len(self.chat_history) > 20:
            self.chat_history = self.chat_history[-20:]
        return answer

    def reset(self):
        """Clear conversation history."""
        self.chat_history = []
        print("Conversation history cleared.")

The RAGChatbot class maintains a list of messages. When the user asks a follow-up question, the condense chain rewrites it using the history. For example, if the first question was "What is the refund policy?" and the follow-up is "How long does it take?", the condense chain would rewrite it as "How long does the refund process take?" before searching the vector store. The history is capped at the last 10 exchanges (20 messages) to stay within the LLM's context window.
Step 6: Test Your Chatbot
Now put it all together with an interactive loop. Add this to the bottom of chat.py:
def main():
    """Run the chatbot in interactive mode."""
    print("RAG Chatbot Ready!")
    print("Type your questions below. Commands: 'quit' to exit, 'reset' to clear history.")
    print("-" * 60)
    chatbot = RAGChatbot()
    while True:
        question = input("\nYou: ").strip()
        if not question:
            continue
        if question.lower() in ("quit", "exit", "q"):
            print("Goodbye!")
            break
        if question.lower() == "reset":
            chatbot.reset()
            continue
        print("\nAssistant: ", end="")
        answer = chatbot.ask(question)
        print(answer)

if __name__ == "__main__":
    main()

Run it with python chat.py and start asking questions. Here is an example session assuming you ingested a company handbook:
# Example conversation output:
#
# RAG Chatbot Ready!
# Type your questions below. Commands: 'quit' to exit, 'reset' to clear history.
# ------------------------------------------------------------
#
# You: What is the company's remote work policy?
#
# Assistant: According to the Employee Handbook (Page 12), the company offers
# a hybrid remote work policy. Employees can work remotely up to 3 days per
# week after completing their 90-day probationary period. Remote work
# arrangements must be approved by the direct manager.
# [Document 1 | Source: employee_handbook.pdf, Page: 12]
#
# You: Does that apply to contractors too?
#
# Assistant: Based on the available documents, the remote work policy described
# on Page 12 applies specifically to full-time employees. The Contractor
# Guidelines (Page 3) state that contractors follow the work arrangement
# specified in their individual contracts.
# [Document 2 | Source: contractor_guidelines.pdf, Page: 3]
#
# You: quit
# Goodbye!
Tuning Chunk Size for Better Results
The chunk_size and chunk_overlap parameters in the text splitter have a significant impact on retrieval quality. Smaller chunks (500 characters) give more precise retrieval but may lose context. Larger chunks (1500 characters) preserve more context but may dilute relevance. Start with 1000/200 and experiment based on your documents. For technical documentation with short sections, try 500/100. For narrative text like reports, try 1500/300. You can also experiment with the 'k' parameter in the retriever — increasing it from 4 to 6 retrieves more context but uses more tokens.
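As a concrete example, here is roughly what those adjustments look like in code, reusing the text_splitter and retriever definitions from earlier steps (the values are starting points to experiment with, not prescriptions):

# Tighter chunks for technical docs with short sections
# (changing chunk settings only affects new ingestion, so re-run ingest.py afterwards)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
)

# Retrieve more chunks per question: more context for the LLM, but a larger prompt
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 6},
)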
How Retrieval Actually Works Under the Hood
When a user submits a question, the embedding model converts that question into a 1536-dimensional vector, exactly the same way your document chunks were embedded. Pinecone then performs an approximate nearest neighbor (ANN) search across all stored vectors and returns the ones with the highest cosine similarity scores. These are the chunks most semantically related to the question, even if they do not share exact keywords. This is what makes vector search so powerful compared to traditional keyword search: the question "How do I get my money back?" will match chunks about "refund policy" because they are semantically close in the embedding space.
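If you want to see those similarity scores for yourself, the vector store can return them alongside the documents. A short sketch using the vector_store from Step 4:

# Inspect what retrieval returns for a query, with cosine similarity scores
results = vector_store.similarity_search_with_score("How do I get my money back?", k=4)
for doc, score in results:
    source = doc.metadata.get("source", "Unknown")
    page = doc.metadata.get("page", "N/A")
    print(f"{score:.3f}  {source} (page {page})")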
The retrieved chunks are then formatted and injected into the prompt as context. The LLM is instructed to generate its response from this context alone and to say it does not have enough information when the context is insufficient, which sharply reduces the hallucinations that would otherwise undermine trust in your chatbot.
Going Further
You now have a working RAG chatbot, but there are many ways to extend it for production use:
- Streaming responses - Replace llm.invoke() with llm.stream() to display answers token by token, giving the user immediate feedback instead of waiting for the full response (see the sketch after this list).
- Web UI with Streamlit or Gradio - Wrap the chatbot in a simple web interface. Streamlit's st.chat_input and st.chat_message components make it straightforward to build a ChatGPT-style interface in under 50 lines of code.
- Alternative vector databases - LangChain supports Chroma (local, no account needed), Weaviate, Qdrant, pgvector, and many others. Swapping vector stores requires changing just a few lines since LangChain abstracts the interface.
- Hybrid search - Combine vector similarity with keyword search (BM25) for better recall. LangChain's EnsembleRetriever makes this easy to set up.
- Source citations with links - Extend the output to include clickable links to the source documents or specific page numbers, making it easy for users to verify answers.
- Evaluation and monitoring - Use tools like RAGAS or LangSmith to evaluate retrieval quality and track metrics like answer relevance, faithfulness, and context recall over time.
- Multi-format ingestion - Add support for Word documents, Markdown files, HTML pages, or Notion exports using LangChain's extensive library of document loaders.
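For the streaming idea in particular, LCEL chains expose a .stream() method, so the change is small. A sketch against the rag_chain from Step 4 (the question is illustrative):

# Print the answer token by token as it is generated
for token in rag_chain.stream("What is the remote work policy?"):
    print(token, end="", flush=True)
print()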
Related Reading
Continue learning with these related articles:
- When to choose RAG over fine-tuning
- Deploy your chatbot to production with Docker
- More essential AI tools for developers
Key Takeaways
- RAG grounds LLM responses in your own data, dramatically reducing hallucinations and enabling answers about private or domain-specific information.
- The pipeline has two phases: ingestion (load, split, embed, store) and query (embed question, retrieve, generate). Once you understand this separation, the architecture becomes intuitive.
- LangChain's LCEL syntax makes it simple to compose retrieval, prompt formatting, LLM calls, and output parsing into a single declarative chain.
- Conversation memory requires a question-condensation step so follow-up questions are properly understood and matched against the vector store.
- Chunk size, overlap, and the number of retrieved documents (k) are the most important tuning parameters. Start with sensible defaults and adjust based on your specific use case and documents.
- Pinecone's managed vector database removes the operational burden of running your own search infrastructure, but LangChain makes it easy to swap in alternatives like Chroma for local development.
RAG is one of the most practical patterns in applied AI today. With the code from this tutorial, you have a solid foundation that can be adapted for customer support, internal knowledge bases, legal document analysis, research assistants, and countless other applications. The key to a great RAG system is not just the code but the quality of your chunking strategy and prompt engineering. Invest time in tuning those, and your chatbot will deliver genuinely useful answers from your documents.



