From PDFs to Intelligent Answers: Building a RAG System with Llama Index and OpenSearch

Picture from ChatGPT

Llama Index is a powerful framework that enables you to create applications leveraging large language models (LLMs) for efficient data processing and retrieval. When combined with OpenSearch and Ollama, you can build a sophisticated question answering system for PDF documents without relying on costly cloud services or APIs.

In this article, we’ll demonstrate how to use Llama Index in conjunction with OpenSearch and Ollama to create a PDF question answering system utilizing Retrieval-augmented generation (RAG) techniques.

The technologies we’ll be using include:

  1. Llama Index: A framework for building AI applications with LLMs
  2. OpenSearch: An open-source distributed search and analytics suite
  3. HuggingFace Transformers: A library for working with Transformer-based models
  4. Vector Embeddings: A technique for converting text into numerical vectors
  5. Hybrid Search: A search technique combining keyword-based and semantic search
  6. Ollama: An open-source tool for running LLMs locally

By leveraging these technologies, we’ll create a robust system capable of understanding and answering questions based on the content of PDF documents. This approach offers a cost-effective and privacy-preserving solution for organizations looking to implement advanced document analysis and retrieval systems.

What is a large language model (LLM)?

LLMs are sophisticated AI models trained on vast amounts of data, forming the foundation for advanced AI systems. These models, based on complex neural networks, use learned patterns and relationships to predict the next item in a sequence, whether it’s text or other types of data.

For text processing, an LLM predicts the next word in a sequence based on the context of the previous words, by modeling a probability distribution over its vocabulary. Examples of such models include GPT from OpenAI and BERT and T5 from Google. In our example, we'll use OpenThaiGPT (served locally through Ollama) as the generative model and the Transformer-based BAAI/bge-m3 model for generating embeddings.

What is Retrieval-augmented generation (RAG)?

RAG is a technique that enhances the accuracy and reliability of generative AI models by incorporating facts from trusted external sources. By giving your LLM access to specific, known factual information, you can improve its output by reducing hallucinations and increasing accuracy. In our case, we’re using RAG to allow the LLM to answer questions based on the content of PDF documents.

What is Llama Index?

Llama Index is a data framework for building LLM applications. It provides tools for ingesting, structuring, and accessing data in LLM applications. Llama Index helps in creating a bridge between external data and LLMs, making it easier to build applications that can query and analyze large amounts of custom data.

What is OpenSearch?

OpenSearch is an open-source, distributed search and analytics suite. In our context, we’re using it as a vector store to efficiently index and search through vector embeddings of our document chunks. It provides fast and scalable full-text search capabilities, which we leverage for hybrid search combining keyword-based and semantic search.

What is HuggingFace Transformers?

HuggingFace Transformers is a popular library that provides thousands of Transformer-based models for various NLP tasks. In our project, we’re using it to load and utilize the BAAI/bge-m3 model for generating text embeddings.

What is Ollama?

Ollama is an open-source tool that simplifies the process of running large language models (LLMs) locally on your machine. It allows users to easily download, run, and manage various open-source LLMs without the need for complex setups or cloud services.

Ollama is designed to work efficiently on consumer-grade hardware, making it possible to leverage the power of LLMs on personal computers. In our project, we’ll be using Ollama to download and run the LLM that we’ll use for our question-answering system. This allows us to have a powerful, locally-run AI model without relying on external API services.

Let’s get started. Here’s what we’ll do:

  1. Set up our environment and install necessary libraries
  2. Configure a hybrid search pipeline
  3. Load and process PDF documents
  4. Generate embeddings using the BAAI/bge-m3 model
  5. Store embeddings in OpenSearch
  6. Create a VectorStoreIndex using Llama Index
  7. Use the index to answer questions about the PDF content
  8. Implement Few-Shot Learning with OpenThaiGPT on Ollama

This approach allows us to create a powerful, flexible, and efficient system that not only retrieves relevant information from PDF documents but also generates intelligent, context-aware answers to natural language questions.

1. Set up our environment and install necessary libraries

1.1 Ollama

Before we begin, install Ollama. The one-line install script below is intended for Linux (and for Windows via Windows Subsystem for Linux, WSL); on macOS, download the installer from https://ollama.com instead.

curl -fsSL https://ollama.com/install.sh | sh

1.2 Docker Desktop

Install Docker Desktop: Download and install Docker Desktop from the official website (https://www.docker.com/products/docker-desktop/). This will be used to run OpenSearch.

Next, create a custom Docker network named “opensearch-net”. This establishes an isolated network environment within Docker, allowing containers to communicate with each other securely.

docker network create opensearch-net

1.3 OpenSearch

Run OpenSearch using Docker: Once Docker Desktop is installed and running, you can start OpenSearch with the following command:

docker run -e OPENSEARCH_JAVA_OPTS="-Xms512m -Xmx512m" -e discovery.type="single-node" \
-e DISABLE_SECURITY_PLUGIN="true" -e bootstrap.memory_lock="true" \
-e cluster.name="opensearch-cluster" -e node.name="os01" \
-e plugins.neural_search.hybrid_search_disabled="true" \
-e DISABLE_INSTALL_DEMO_CONFIG="true" \
--ulimit nofile="65536:65536" --ulimit memlock="-1:-1" \
--net opensearch-net --restart=no \
-v opensearch-data:/usr/share/opensearch/data \
-p 9200:9200 \
--name=opensearch-single-node \
opensearchproject/opensearch:latest

This command runs OpenSearch in a Docker container, making it accessible on port 9200.
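
Before moving on, it's worth confirming that OpenSearch is actually reachable. Here's a minimal check using Python's requests library (assumed to be installed); it simply calls the root endpoint exposed on port 9200:

import requests

# Query the root endpoint of the local OpenSearch container started above.
# Security is disabled in this setup, so plain HTTP without credentials works.
response = requests.get("http://localhost:9200")
response.raise_for_status()

info = response.json()
print("Cluster name:", info["cluster_name"])
print("OpenSearch version:", info["version"]["number"])

If this prints the cluster name and version, the container is up and ready for the next steps.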

1.4 Python Libraries

Set up Python environment: Now, let’s set up our Python environment. If you’re using a Jupyter notebook, you can run these commands directly in a cell:

%pip install llama-index
%pip install llama-index-readers-elasticsearch
%pip install llama-index-vector-stores-opensearch
%pip install llama-index-embeddings-ollama
%pip install ollama
%pip install nest-asyncio
%pip install llama-index-embeddings-huggingface

Import necessary modules and set up configuration:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.vector_stores.opensearch import OpensearchVectorStore, OpensearchVectorClient
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
import torch
import nest_asyncio
from os import getenv

# Apply nest_asyncio to avoid runtime errors in Jupyter notebooks
nest_asyncio.apply()

# Check if CUDA is available for GPU acceleration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

With this setup in place, we have:

  1. Installed Ollama for running LLMs locally
  2. Set up OpenSearch using Docker
  3. Installed the necessary Python libraries
  4. Imported required modules
  5. Set up CUDA for GPU acceleration if available

Now that our environment is fully set up with Ollama and OpenSearch running, we’re ready to start processing PDF documents and building our question answering system. In the next section, we’ll cover how to load and process PDF documents using Llama Index.

2. Configure a hybrid search pipeline

In this step, we’ll configure a hybrid search pipeline in OpenSearch. This pipeline combines traditional keyword-based search with semantic vector search, allowing for more accurate and contextually relevant results.

! curl -XPUT "http://localhost:9200/_search/pipeline/hybrid-search-pipeline" -H 'Content-Type: application/json' -d' \
{ \
  "description": "Pipeline for hybrid search", \
  "phase_results_processors": [ \
    { \
      "normalization-processor": { \
        "normalization": { \
          "technique": "min_max" \
        }, \
        "combination": { \
          "technique": "harmonic_mean", \
          "parameters": { \
            "weights": [ \
              0.3, \
              0.7 \
            ] \
          } \
        } \
      } \
    } \
  ] \
}'

Let’s break down this configuration:

2.1 API Endpoint

We’re using a PUT request to the OpenSearch API endpoint /_search/pipeline/hybrid-search-pipeline to create or update a search pipeline named "hybrid-search-pipeline".

2.2 Pipeline Configuration

  • description: Provides a brief description of the pipeline's purpose.
  • phase_results_processors: Defines the processors that will be applied to the search results.

2.3 Normalization Processor

  • normalization: Uses the "min_max" technique to normalize scores from different search methods (keyword and semantic) to a common scale.
  • combination: Specifies how to combine the normalized scores:
      • technique: "harmonic_mean" is used to combine the scores.
      • weights: [0.3, 0.7] assigns different weights to the two search methods. In this case, the semantic search (vector search) is given more weight (0.7) than the keyword search (0.3).

This hybrid approach allows us to leverage both the speed of keyword search and the contextual understanding of semantic search. By assigning a higher weight to the semantic search, we prioritize results that are conceptually more relevant to the query, while still considering exact keyword matches.

The use of curl in this step assumes you’re running this in a Unix-like environment or using WSL on Windows. If you’re in a different environment, you may need to adjust this command or use an alternative method to send the HTTP request to OpenSearch.
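
If curl isn't available in your environment, the same request can be sent from Python. Here's a sketch using the requests library (assumed installed) with the identical pipeline body:

import requests

pipeline_body = {
    "description": "Pipeline for hybrid search",
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "harmonic_mean",
                    "parameters": {"weights": [0.3, 0.7]},
                },
            }
        }
    ],
}

# Create or update the search pipeline that our hybrid queries will use later.
resp = requests.put(
    "http://localhost:9200/_search/pipeline/hybrid-search-pipeline",
    json=pipeline_body,
)
print(resp.status_code, resp.json())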

3. Load and process PDF documents

In this step, we’ll load PDF documents and convert them into a format suitable for further processing, using Llama Index to help manage the data.

3.1 Load PDF documents:

from llama_index.core import SimpleDirectoryReader
reader = SimpleDirectoryReader(input_dir="pdf_corpus", recursive=True)
documents = reader.load_data()

Explanation: We use SimpleDirectoryReader to load PDF files from a specified directory. This reader is part of Llama Index and handles various document formats, including PDFs. It automatically extracts the text content from each PDF, making it easier to process the information in subsequent steps.
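
As a quick, optional sanity check, you can inspect how many documents were loaded and what metadata the reader extracted; a small sketch based on the documents variable above:

# Each loaded document carries the extracted text plus metadata such as the file name and page label.
print(f"Loaded {len(documents)} document objects")
print(documents[0].metadata)      # e.g. file_name, page_label
print(documents[0].text[:300])    # first few hundred characters of extracted text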

3.2 Split documents into chunks:

from llama_index.core.node_parser import TokenTextSplitter
splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=128,
    separator=" ",
)
token_nodes = splitter.get_nodes_from_documents(
    documents, show_progress=True
)

Explanation: We use TokenTextSplitter to divide the documents into smaller, manageable pieces called “token_nodes”.

  • chunk_size=512: This determines the maximum number of tokens in each chunk (each node). We choose 512 as it’s a common size that balances between having enough context and not overwhelming the model.
  • chunk_overlap=128: This allows for some overlap between chunks, which helps maintain context across chunk boundaries. The overlap of 128 tokens ensures that we don’t lose important information that might be split between two chunks.
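
To verify that the chunking behaves as expected, you can inspect the resulting nodes; a short sketch based on the splitter output above:

# Inspect the chunking result: number of nodes and a preview of the first chunk.
print(f"Created {len(token_nodes)} nodes from {len(documents)} documents")
print(token_nodes[0].metadata)             # metadata is inherited from the source document
print(token_nodes[0].get_content()[:300])  # preview of the first chunk's text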

4. Generate embeddings using the BAAI/bge-m3 model

In this step, we’ll use the BAAI/bge-m3 model to generate embeddings for our text chunks. These embeddings are crucial for semantic search and understanding the content of our documents.

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embedding_model_name = 'BAAI/bge-m3'
embedding_model = HuggingFaceEmbedding(
    model_name=embedding_model_name, max_length=512, device=device
)
embeddings = embedding_model.get_text_embedding("box")
dim = len(embeddings)
print("embedding dimension of example text ===>", dim)

In this code:

  1. We set up the BAAI/bge-m3 model using the HuggingFaceEmbedding class, specifying a maximum length of 512 tokens and using the device (CPU or GPU) that we determined earlier.
  2. We generate an example embedding for the word “box” with the get_text_embedding method.
  3. We take the length of that vector to determine the embedding dimension (dim) and print it; we’ll need this dimension when configuring the OpenSearch index in the next step.

The BAAI/bge-m3 model is specifically designed for generating high-quality embeddings for semantic search tasks. By using this model, we ensure that our embeddings capture the semantic meaning of our text chunks effectively.

These embeddings will be used in subsequent steps for indexing and searching our document content, allowing us to find the most relevant information when answering questions about our PDFs.
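
To build some intuition for what these embeddings capture, here is a small illustrative sketch (not part of the pipeline) that compares the cosine similarity of a few sentences using the same embedding model; the sentences and helper function are made up for this example:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_wifi   = embedding_model.get_text_embedding("The WiFi connection keeps dropping")
emb_router = embedding_model.get_text_embedding("My wireless router disconnects frequently")
emb_coffee = embedding_model.get_text_embedding("I enjoy drinking coffee in the morning")

# Semantically related sentences should score noticeably higher than unrelated ones.
print("wifi vs router:", cosine_similarity(emb_wifi, emb_router))
print("wifi vs coffee:", cosine_similarity(emb_wifi, emb_coffee))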

5. Setup and Initialize OpenSearch Vector Client

In this step, we’ll set up and initialize the OpenSearch Vector Client, preparing our system to use OpenSearch as a vector database for efficient similarity searches in the future.

5.1 Set up OpenSearch connection details

from os import getenv
from llama_index.vector_stores.opensearch import (
    OpensearchVectorStore,
    OpensearchVectorClient,
)

# http endpoint for your cluster (opensearch required for vector index usage)
endpoint = getenv("OPENSEARCH_ENDPOINT", "http://localhost:9200")
# index to demonstrate the VectorStore impl
idx = getenv("OPENSEARCH_INDEX", "test_pdf_index")

We use getenv to retrieve the OpenSearch endpoint and index name from environment variables, with default values provided. This approach allows for easy configuration changes across different environments without modifying the code.

5.2 Configure OpenSearchVectorClient

# OpensearchVectorClient stores text in this field by default
text_field = "content_text"
# OpensearchVectorClient stores embeddings in this field by default
embedding_field = "embedding"
# OpensearchVectorClient encapsulates logic for a
# single opensearch index with vector search enabled with hybrid search pipeline
client = OpensearchVectorClient(
    endpoint=endpoint,
    index=idx,
    dim=dim,
    embedding_field=embedding_field,
    text_field=text_field,
    search_pipeline="hybrid-search-pipeline",
)

We create an OpensearchVectorClient with specific configurations:

  • endpoint and index: Specify where to connect to OpenSearch and which index to use.
  • dim: Sets the dimension of the embedding vectors, which must match the dimension of our generated embeddings.
  • embedding_field and text_field: Define the field names for storing embeddings and text content, respectively.
  • search_pipeline: Specifies the use of a "hybrid-search-pipeline", combining keyword and semantic search for improved results.

5.3 Initialize OpensearchVectorStore

# initialize vector store
vector_store = OpensearchVectorStore(client)

We create an instance of OpensearchVectorStore using the configured client. This provides an interface for managing the vector store within the Llama Index framework, facilitating easy integration with other Llama Index functionalities.

This structured approach prepares our system for future storage and retrieval of embeddings using OpenSearch’s vector search capabilities. The hybrid search pipeline enables more sophisticated search operations, potentially improving the accuracy and relevance of our document retrieval process. The actual storage of embeddings will occur in subsequent steps.

In a production environment, it’s recommended to implement error handling and connection verification to ensure robust operation. Additionally, you may want to test the connection and perform a simple operation to confirm that everything is set up correctly before proceeding to the next steps.
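
For example, a minimal connection check might look like this, using the requests library against the configured endpoint (error handling kept deliberately simple):

import requests

try:
    # Confirm the OpenSearch endpoint is reachable before indexing anything.
    health = requests.get(f"{endpoint}/_cluster/health", timeout=5)
    health.raise_for_status()
    print("OpenSearch status:", health.json().get("status"))
except requests.RequestException as exc:
    raise RuntimeError(f"Cannot reach OpenSearch at {endpoint}: {exc}")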

6. Create VectorStoreIndex and Store Embeddings

In this step, we’ll create a VectorStoreIndex using Llama Index and explicitly store our document embeddings in OpenSearch. This process enables efficient semantic search and retrieval of information from our processed documents.

6.1 Create a StorageContext

from llama_index.core import VectorStoreIndex, StorageContext

storage_context = StorageContext.from_defaults(vector_store=vector_store)

Here, we create a StorageContext using the vector_store (OpenSearch) we initialized in the previous step. This StorageContext provides a standardized interface for managing storage in Llama Index and connects our index to the OpenSearch vector store.

6.2 Create VectorStoreIndex, Generate and Store Embeddings

index = VectorStoreIndex(
    token_nodes, storage_context=storage_context, embed_model=embedding_model
)

In this crucial step, we create the VectorStoreIndex, generate embeddings for our documents, and store them in OpenSearch:

  • token_nodes: The processed document nodes created earlier.
  • storage_context: The storage context we just created, linking to our OpenSearch vector store.
  • embedding_model: The embedding model we're using (BAAI/bge-m3 in this case).

This operation performs several important tasks:

  1. It generates embeddings for each document node using the specified embedding_model.
  2. It stores these embeddings in the OpenSearch vector store defined in our storage_context, which handles the efficient storage and indexing of the vectors.
  3. Finally, Llama Index constructs its own internal searchable structure: a local index that holds document metadata and references to the embeddings in OpenSearch, with data structures that enable fast similarity searches and retrieval of relevant documents.

Note: You can verify that embeddings are stored in OpenSearch by checking the OpenSearch index after this operation. The embeddings will be stored in the field specified when setting up the OpenSearchVectorClient (typically “embedding”).
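
One way to perform that check is to ask OpenSearch for the document count and the index mapping; a small sketch using the requests library and the variables defined in step 5:

import requests

# The number of indexed documents should roughly match len(token_nodes).
count = requests.get(f"{endpoint}/{idx}/_count").json()
print("Documents in index:", count["count"])

# The mapping should show the embedding field as a knn_vector with the expected dimension.
mapping = requests.get(f"{endpoint}/{idx}/_mapping").json()
print(mapping[idx]["mappings"]["properties"][embedding_field])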

The VectorStoreIndex acts as a bridge between our processed documents, their embeddings stored in OpenSearch, and the Llama Index framework. It enables us to perform fast semantic similarity searches when querying our documents in later steps, leveraging the vector search capabilities of OpenSearch.

By creating this index and explicitly storing the embeddings in OpenSearch, we’re establishing a system that can quickly find and retrieve relevant information from our processed PDF documents based on semantic similarity. This will be crucial for implementing effective question-answering capabilities in the subsequent steps.

In a production environment, consider adding error handling, progress reporting, and verification steps to ensure embeddings are correctly stored in OpenSearch, especially when dealing with a large number of documents.

7. Use the index to answer questions about the PDF content

In this section, we’ll use our created index to answer questions about the content of our PDF documents.

7.1 Set up the retriever

from llama_index.core.vector_stores.types import VectorStoreQueryMode
retriever = index.as_retriever(
    similarity_top_k=3,
    vector_store_query_mode=VectorStoreQueryMode.HYBRID,
)

We create a retriever from our index with the following parameters:

  • similarity_top_k=3: This specifies that we want to retrieve the top 3 most similar results.
  • vector_store_query_mode=VectorStoreQueryMode.HYBRID: This enables hybrid search, combining both keyword and semantic search for better results.

7.2 Define the query

question = "ไม่สามารถใช้งานไวไฟได้ต้องแก้ไขอย่างไร"

This is the question we want to ask about our PDF content. It’s in Thai, asking “How to fix when WiFi is not working?”

7.3 Retrieve relevant information

retrieved_nodes = retriever.retrieve(question)

We use the retriever to find the most relevant information from our indexed documents based on the query.

7.4 Display the results

for r in retrieved_nodes:
    print(r.metadata)
    print(r)

We iterate through the retrieved results, printing both the metadata and the content of each retrieved chunk.

This step demonstrates how we can use our indexed documents to retrieve relevant information based on a natural language query. The hybrid search mode allows us to leverage both keyword matching and semantic similarity, potentially providing more accurate and contextually relevant results.
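
Each retrieved result also carries a relevance score from the hybrid pipeline, which you can use to filter out weak matches before passing context to the LLM. A small sketch (the 0.5 cutoff is an arbitrary example value, not a recommendation):

# Keep only results whose combined hybrid score exceeds an example threshold.
MIN_SCORE = 0.5
strong_matches = [n for n in retrieved_nodes if n.score is not None and n.score >= MIN_SCORE]

for n in strong_matches:
    print(f"score={n.score:.3f} | {n.get_content()[:120]}...")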

Picture from author and document [https://www.tenda.cz/sites/Upload/HG1/HG1_UG.pdf]

By using this approach, we can efficiently answer questions about the content of our PDF documents, even if the exact wording doesn’t match what’s in the documents. This is particularly useful for creating intelligent document querying systems or chatbots that can understand and respond to questions about specific document contents.

8. Implement Few-Shot Learning with OpenThaiGPT on Ollama

In this final step, we’ll use few-shot learning with OpenThaiGPT to improve the quality of answers. Few-shot learning is a technique that allows models to quickly learn and adapt to new tasks using only a few examples. In this case, we provide the model with examples of answering questions about WiFi troubleshooting. This helps the model understand the format and approach we want for answering questions, which will result in more relevant and useful responses for users.

Few-shot learning is particularly effective when we want the model to follow a specific pattern or style in its responses. By providing carefully crafted examples, we can guide the model to generate answers that are more aligned with our expectations and more helpful for end users.

The examples used here relate directly to common WiFi and FTTX connectivity problems, which helps the model produce relevant, practical troubleshooting steps, and the prompt spells out how the retrieved context should be used and how the answer should be structured.

8.1 Set up Ollama with OpenThaiGPT

First, ensure that you have Ollama installed and the OpenThaiGPT model is available:

! ollama list

You should see openthaigpt:latest in the list of available models.
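
If the model is not listed yet, you can pull it first; a sketch using the ollama Python client, assuming the model is published under the name openthaigpt in the registry you use:

import ollama

# Download the model if it is not available locally yet.
# This assumes "openthaigpt" is available under this name in your Ollama registry.
ollama.pull("openthaigpt")

# List locally available models to confirm the download.
print(ollama.list())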

8.2 Create a function to query OpenThaiGPT

Let’s create a function that will send our query and retrieved context to OpenThaiGPT:

import ollama

def query_openthaigpt(question, context):
    formatted_prompt = f'''# Few-shot examples
Example 1:
Context: An ONT (Optical Network Terminal) device is unable to connect to the network.
Question: What are the steps to troubleshoot an ONT that cannot connect?
Answer: 1. Check the fiber optic cable connection.
2. Restart the ONT device.
3. Verify the status lights on the device.
4. Test the connection with a spare ONT.
5. Contact technical support if the issue persists.

Example 2:
Context: A customer complains about slow internet speed in their FTTX system.
Question: What is the procedure for diagnosing and fixing slow internet speed in an FTTX system?
Answer: 1. Run an internet speed test using a speed testing tool.
2. Check the Wi-Fi router settings.
3. Test the connection directly to the ONT using an Ethernet cable.
4. Verify the optical signal strength at the ONT.
5. Check the customer's bandwidth usage.
6. Adjust device settings or replace faulty equipment if necessary.

# RAG component
Retrieved information:
{context}

# Actual question
Question: {question}

Please answer the question in Thai using the information from the examples and the retrieved information above. Focus on troubleshooting end-user devices in FTTX projects. Provide a clear, step-by-step answer, prioritized in order of importance. Include a brief explanation for each step. If the provided information is insufficient to answer the question completely, state "The available information is not sufficient to fully answer this question" and suggest general troubleshooting steps.

Answer:'''

    # print(formatted_prompt)  # uncomment to inspect the full prompt
    # Send the formatted prompt to the locally running OpenThaiGPT model via Ollama
    response = ollama.generate(model='openthaigpt:latest', prompt=formatted_prompt)
    return response['response']

This function takes the user’s question and the retrieved context, formats them into a prompt, and sends it to the OpenThaiGPT model running on Ollama.

8.3 Integrate OpenThaiGPT with our retriever

Now, let’s modify our question-answering process to use OpenThaiGPT:

def answer_question(question):
    # Retrieve the most relevant chunks for this question
    retrieved = retriever.retrieve(question)
    # Combine the retrieved text into a single context string
    context = "\n\n".join(node.get_content() for node in retrieved)
    # Query OpenThaiGPT with the question and the retrieved context
    return query_openthaigpt(question, context)

# Example usage
answer = answer_question(question)
print(f"Question: {question}")
print(f"Answer: {answer}")

This function retrieves relevant information using our previously set up retriever, combines the retrieved text into a context, and then uses OpenThaiGPT to generate an answer based on this context.

Your RAG Journey Concludes

Congratulations on building a powerful Retrieval-Augmented Generation (RAG) system! Let’s recap what we’ve achieved:

  1. Open-Source Power: We’ve created a cost-effective, privacy-preserving solution using Llama Index, OpenSearch, and OpenThaiGPT.
  2. Dynamic Knowledge Base: You’ve transformed static PDFs into an intelligent, queryable resource.
  3. Enhanced Responses: With few-shot learning, we’ve tailored the system for specific use cases like WiFi troubleshooting.
  4. Real-World Impact: This system can revolutionize document interaction in customer service, research, and beyond.

The potential applications are vast — from chatbots accessing entire product manuals to research assistants synthesizing thousands of papers. Your new skills open doors to creating more intelligent, efficient information systems.

What’s your next step? Will you adapt this for a specific need or dive deeper into one of the technologies?

Remember, you’re at the forefront of reshaping how we interact with information. Share your ideas in the comments — your insights could spark the next big innovation!

Happy coding, and here’s to turning information into actionable insights! 🚀📚✨

👏 If you found this article helpful, don’t forget to clap and share your thoughts in the comments below!
