Searching your own files in natural language

How to search and chat with your own files using LangChain, vector databases and LLMs. Useful, but with limitations.

Reinaldo Bergamaschi
21 min read · Sep 6, 2023
This image was created with the assistance of DALL·E 2

It is impossible today not to be overloaded with technical information. If you are like me, you probably receive dozens of links to interesting Medium articles, journal and conference papers, technical reports, etc., many of which are worth reading. When you do not have time to read them immediately (who does?), you save them locally on your computer, only to never find them again.

I have a constant problem of not finding “interesting” files that I saved. I usually remember what they were about, but the file-content search available on Windows or Linux has always left me disappointed.

In the past, my company developed a full search system using Solr which could be deployed on a laptop and used to search for keywords in any type of file (pdf, csv, html, docx, xlsx, etc.). It would regularly scan all the files on your computer, selected using a variety of filters, and index their contents in a search index. This worked very well, but it did not have the ability to pull together data from various sources and present a summarized version of the information available. For example, a typical search engine would return the lines and files containing the word “LLM”, but it would not understand a query like “Summarize the information about LLMs in my files and also return the sources”. Enter LangChain, vector databases and OpenAI/GPT.

In order to create such a system we need to implement two flows: one responsible for collecting the information from our files and making it available in a searchable format, and the other responsible for understanding the user's question and, using the content of our files, extracting the relevant information and displaying it back to the user in natural language. These flows are illustrated below and explained in detail, with corresponding Python implementations, in the following sections.

Search System Indexing and User Flows

Frameworks and Tools

Before we go into details, let’s review the software environments that we will be using.

OpenAI
The LLMs (large language models) that I will be using are provided by OpenAI. These models are not free, and there are many free models from other providers that work just as well, but for our purposes the OpenAI models are relatively cheap and easy to use.

For the generation of the embeddings I will use the model text-embedding-ada-002, which costs $0.0001 per 1K tokens. For the user-query LLM I will use GPT-3.5 Turbo, which costs between $0.0015 and $0.002 per 1K tokens. According to OpenAI, as a rough rule of thumb, 1 token is approximately 4 characters or 0.75 words of English text.
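
To get a rough feel for these numbers, the back-of-the-envelope helper below (my own sketch, not an OpenAI utility) turns a word count into an approximate embedding cost using that rule of thumb:

def estimate_embedding_cost(word_count: int, price_per_1k_tokens: float = 0.0001) -> float:
    # OpenAI's rule of thumb: ~0.75 words per token, so tokens = words / 0.75
    tokens = word_count / 0.75
    return tokens / 1000 * price_per_1k_tokens

# e.g., a 50,000-word document is roughly 66,700 tokens, or about $0.0067 to embed
print(estimate_embedding_cost(50_000))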

LangChain
LangChain is a software framework with higher-level classes and methods for accessing and using language models and vector databases. It contains common classes that provide more uniform interfaces to different language models, making it easier to develop new applications that are not necessarily tied to one language model. LangChain also has higher-level methods to interact with different vector databases. In addition, LangChain has chains, which are combinations of methods that can be applied to specific use cases, e.g., chatbots.

Milvus (Vector DB)
Milvus is a vector database, where the data stored are primarily vector embeddings and data retrieval is done through vector similarity search algorithms, such as ANN (approximate nearest neighbors). Milvus is open-source software with an active and helpful community.

Indexing Flow

This flow is responsible for crawling a directory structure, collecting the files that we may be interested in searching, and storing the information. It can be run once a day, or more often depending on how fast the files change. The specific steps are shown in the figure above and described below.

Step 1: List files

This step, implemented by the list_files function below, accepts an initial directory and a list of file extensions and returns a list of all files under the initial directory and all its subdirectories, with extensions in the list.

import os

def list_files(initdir: str, file_extensions: list):
    '''
    Returns a list of files under initdir and all its subdirectories
    that have a file extension contained in file_extensions.
    '''
    file_list = []
    file_count = {key: 0 for key in file_extensions}  # for reporting only

    # Traverse through directories to find files with specified extensions
    for root, _, files in os.walk(initdir):
        for file in files:
            ext = file.split('.')[-1].lower()
            if ext in file_extensions:
                file_path = os.path.join(root, file)
                file_list.append(file_path)
                # increment count for this type of file
                file_count[ext] += 1

    total = len(file_list)
    print(f'There are {total} files under dir {initdir}.')
    for k, n in file_count.items():
        print(f'  {n} : ".{k}" files')
    return file_list

This function could be further enhanced by storing the name and timestamp or hash of each file and testing if the file has changed since the last visit, in which case it would need to be indexed again.
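
A minimal sketch of such an incremental check is shown below; the state file name (index_state.json) and its JSON format are arbitrary choices of mine, not part of the system described so far.

import json
import os

def files_changed_since_last_run(file_list: list, state_path: str = 'index_state.json') -> list:
    '''
    Return only the files whose modification time changed since the last run,
    and update the stored state for the next run.
    '''
    try:
        with open(state_path) as f:
            old_state = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        old_state = {}

    changed = []
    new_state = {}
    for path in file_list:
        mtime = os.path.getmtime(path)
        new_state[path] = mtime
        if old_state.get(path) != mtime:
            changed.append(path)

    with open(state_path, 'w') as f:
        json.dump(new_state, f)
    return changed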

Step 2: Convert file to text + metadata and split it

In order to process non-plain-text files (e.g., docx, xlsx, pdf, etc.) we must first convert them to a textual representation. The text content may be broken up into chunks in a process called splitting. Splitting is needed because the embedding models are limited in the size of text (more specifically, tokens) they can process. Moreover, having a long text represented by several embedding vectors, instead of just one, allows a finer-grain representation of the full text, which can improve search results.

LangChain contains several classes for converting different types of files into textual representations. These classes, called document loaders, use file converters to generate text, add associated metadata and create a list of objects of type Document.

The following function takes a text splitter method and a file path, and depending on the type of file, it calls the appropriate text converter, followed by the text splitter, which creates a list of Documents. If there are no errors, this list of Documents is returned.

def convert_and_split(text_splitter, file_path: str) -> list:
    '''
    Converts the file to a textual representation, adds metadata, and
    calls the text_splitter to split long sequences.
    Returns a list of Documents, or None if there is any error in the
    file conversion.
    '''
    texts = None
    try:
        if file_path.endswith('.pdf'):
            loader = PyMuPDFLoader(file_path)
            pages = loader.load()
            texts = text_splitter.split_documents(pages)

        elif file_path.endswith('.docx') or file_path.endswith('.doc'):
            loader = UnstructuredWordDocumentLoader(file_path)
            docs = loader.load()
            texts = text_splitter.split_documents(docs)

        elif file_path.endswith('.txt'):
            loader = TextLoader(file_path)
            docs = loader.load()
            texts = text_splitter.split_documents(docs)

        elif file_path.endswith('.xlsx') or file_path.endswith('.xls'):
            loader = UnstructuredExcelLoader(file_path)
            docs = loader.load()
            texts = text_splitter.split_documents(docs)

        elif file_path.endswith('.csv'):
            loader = CSVLoader(file_path)
            docs = loader.load()
            texts = text_splitter.split_documents(docs)

        elif file_path.endswith('.pptx') or file_path.endswith('.ppt'):
            loader = UnstructuredPowerPointLoader(file_path)
            docs = loader.load()
            texts = text_splitter.split_documents(docs)

        else:
            print(f"Error: invalid file type: {file_path}")

    except Exception as e:
        print(f"Error processing {file_path}: {e}")

    return texts

There are several file converters, also called Document Loaders in LangChain. You can find descriptions and code examples of document loaders integrated with other systems and document loaders for different file formats.

There are several types of text splitters supported by LangChain. A good discussion of different methods is presented here. After trying a few different splitters and checking the search quality, I chose to use the NLTK-based splitter with default parameters, but I recommend you experiment with your own data and see what works best. The definition of the text splitter I used is:

from langchain.text_splitter import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=1000)
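
One practical note: the NLTK-based splitter relies on NLTK's sentence tokenizer, so the first time you use it you may need a one-time download of the punkt data:

import nltk
nltk.download('punkt')  # one-time download of the sentence tokenizer data used by NLTKTextSplitter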

Steps 3+4: Generate embeddings and store data in Vector DB

Embeddings are numerical representations of data, as dense vectors of real numbers. These vectors capture the relationships between different parts of entities within the data. Embeddings have been successfully used to represent images, videos, graphs, and text. For a good overview of embeddings, check out articles here and here.

Semantic search and many other natural language applications rely on embeddings to capture relationships between words and sentences. The actual embeddings are generated using neural network models. Here I will use text-embedding-ada-002 provided by OpenAI. LangChain has classes that integrate with several model providers, both paid and open-source.

Our embedding function is defined as follows. You will need to create an account with OpenAI and get an API key. By default, the embedding vectors created using model text-embedding-ada-002 have 1536 dimensions. For security reasons, it is better not to pass the API key directly in the code; use an environment variable instead.

from langchain.embeddings.openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=<insert-your-key-here>)
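
For example, assuming the key has been exported as the environment variable OPENAI_API_KEY, the same definition can be written without hard-coding the key:

import os
from langchain.embeddings.openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    openai_api_key=os.environ["OPENAI_API_KEY"],  # read the key from the environment
)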

Once the embedding function is defined, we can take the list of Documents returned in Step 2 and pass it to the method that creates the embedding vectors for the Documents and stores them into the vector database.

Here I used the Milvus Database. I tried using Chroma DB as well, and although it works, I found Milvus more intuitive, easier to install (using docker) and with a visual interface (Attu) that allowed me to inspect the stored entities and their vectors easily. Check this link for details on how to install Milvus and Attu using docker.

The function add_docs_to_milvus (below) calls the method Milvus.from_documents(…) to actually insert the documents into the database. If a schema is not pre-defined, this method uses the metadata present in the first document to define the schema fields and, in addition, adds a primary key, a text field and a vector field to the schema. Once created, this schema is fixed.

This, however, causes an error if we try to insert a second document with fields that do not match the first one, which is exactly what happens when we convert files of different types. PDF files, when converted to LangChain Documents, have certain metadata fields which do not match the fields of a converted MS Excel file.

In order to avoid this error, and make sure all types of files processed by the system end up stored with the same fields in Milvus (important for search), we need to define a schema with precisely the desired fields and make sure that all file types, when converted, contain these desired fields.

For the purposes of doing semantic search on documents, we basically need the embedding vector, the text that generated the embedding, and the full path of the file. Every Document produced by the LangChain file loaders ends up stored with a primary key, an embedding vector, a text field, and a source field (containing the file path). Hence we need a schema that defines these four fields.

def create_collection_schema(collection_name: str):
    '''
    Defines a schema for a collection, with 4 fields.
    '''
    # define the necessary fields and their types
    fields = [
        FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name="source", dtype=DataType.VARCHAR, max_length=512),
        FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
        FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=1536)
    ]

    # create a collection schema with fields
    schema = CollectionSchema(
        fields,
        description="Schema for different file types",
    )

    collection = Collection(collection_name, schema)
    return collection

With the schema defined for a collection, we can add the documents to Milvus with the following function. Note that we only need to pass the collection name, which must be the same one the schema was associated with.

def add_docs_to_milvus(docs, embedding, collection_name, file_path):
    '''
    Store embedding vectors for docs into the Milvus DB, under collection_name.
    Return vector_db if successful, otherwise return None.
    '''
    vector_db = None

    # now generate embeddings and store texts and vectors into vector database
    try:
        vector_db = Milvus.from_documents(
            documents=docs,
            embedding=embedding,
            collection_name=collection_name,
            connection_args={"host": "localhost", "port": "19530"},
        )

    except Exception as e:
        print(f"Error storing {file_path} into Milvus: {e}")
        vector_db = None

    finally:
        # flush new entities added to the collection.
        # Not really needed; just a way to keep track of the
        # collection size as files are processed.
        collection = Collection(collection_name)
        collection.flush()
        print(f'Collection {collection_name} has {collection.num_entities} entities after processing file {file_path}')

    return vector_db

Putting it all together

Using the functions defined above, we can define a function that visits all files under a certain directory and indexes them all into Milvus.

def index_files_milvus(filelst: list, collection_name: str, initdir: str):
    '''
    Given a list of files and a collection name, visit all files, convert them
    to text and metadata, split them into chunks and store their embeddings
    and metadata into Milvus for future semantic search.
    '''
    # define the text_splitter and chunk size
    text_splitter = NLTKTextSplitter(chunk_size=1000)

    # define the language model to be used for generating the vector embeddings.
    # For security reasons, it is better not to pass the API key directly in
    # the code; use an environment variable.
    embedding = OpenAIEmbeddings(
        model="text-embedding-ada-002",
        openai_api_key='insert-your-OpenAI-api-key-here'
    )

    # create a schema and a collection using it
    create_collection_schema(collection_name)

    # bookkeeping variables
    error_files = []  # list of files that failed
    cnt = 0           # number of files successfully stored

    # Process each file by calling the appropriate file loader, depending on the file type
    for file_path in filelst:
        cnt += 1
        print(f'=> processing file {cnt} / {len(filelst)}: {file_path}')

        # convert the file into a list of Documents to be stored
        texts = convert_and_split(text_splitter, file_path)

        if not texts:
            error_files.append(file_path)  # keep track of files that failed
            cnt -= 1                       # count only successful conversions
            continue

        # now generate embeddings and store texts and vectors into vector database
        vector_db = add_docs_to_milvus(texts, embedding, collection_name, file_path)

        if not vector_db:
            error_files.append(file_path)  # keep track of files that failed
            cnt -= 1                       # count only successful conversions
            continue

    # print summary
    print(f'[index_files] Generated embeddings for {cnt} / {len(filelst)} under dir {initdir}')
    if len(error_files) > 0:
        print(f'  Files with problems:')
        for f in error_files:
            print(f'    {f}')
    else:
        print(f'  All files successfully processed')

The code below runs everything, given a list of file extensions and an initial directory.

import os
import openai

from langchain.vectorstores import Milvus
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import NLTKTextSplitter
from langchain.document_loaders import TextLoader
from langchain.document_loaders import UnstructuredExcelLoader
from langchain.document_loaders import UnstructuredWordDocumentLoader
from langchain.document_loaders import UnstructuredPowerPointLoader
from langchain.document_loaders import CSVLoader
from langchain.document_loaders import PyMuPDFLoader

from pymilvus import Collection, FieldSchema, CollectionSchema, DataType, connections

if __name__ == "__main__":

    file_extensions = ['pdf', 'doc', 'docx', 'xlsx', 'xls', 'ppt', 'pptx', 'txt', 'csv']
    initdir = 'add_your_initial_directory'

    # Connect to the Milvus server
    conn = connections.connect(host="127.0.0.1", port=19530)

    # define a collection name (it can be any name)
    collection_name = 'testfiles_repo'

    # visit directory structure and create list of files with given extensions
    filelst = list_files(initdir, file_extensions)

    # visit all files, convert them to text and index them into Milvus
    index_files_milvus(filelst, collection_name, initdir)

Test data and entities stored in Milvus

I created a simple directory structure with subdirectories containing files of all the allowed types. I chose files freely available on the web, and you can download them from my github if you want to duplicate my environment exactly. These files are relatively small and I encourage you to give them a quick look; it will help you understand the queries I will make about them later on.

Description of files used for indexing and searching

CSV
All Premier League football games in the 2022-2023 season.
./testfiles/csv/premier_league_all_matches_2022-2023-season.csv
Downloaded from https://www.kaggle.com/datasets/sinansaglam/premier-league-last-5-seasons-match-scores-and-xgs
and removed some columns to decrease the amount of data.

EXCEL
Results of the 2020 Presidential election by state and candidate
./testfiles/excel/2020_presidential_elections.xlsx
Downloaded from https://www.kaggle.com/datasets/tunguz/us-elections-dataset
as csv, removed data from other elections, removed non-relevant columns to
decrease the amount of data, and saved as .xlsx format.

PDF
Two articles about Generative Adversarial Networks, available on Arxiv
./testfiles/pdf/InsetGAN.pdf (https://arxiv.org/pdf/2203.07293.pdf)
./testfiles/pdf/garmentGAN.pdf (https://arxiv.org/pdf/2003.01894.pdf)

PPT
Presentation by Washington State Department of Social and Health Services on Generations in the workplace.
./testfiles/powerpoint/Generations in the Workplace PPT (Final).pptx
Downloaded from https://www.dshs.wa.gov/sites/default/files/DDA/dda/documents/training/Generations%20in%20the%20Workplace%20PPT%20%28Final%29.pptx

TEXT
Blog post about Characteristics of Generation X Professionals
./testfiles/text/genx_characteristics.txt
Downloaded from https://www.indeed.com/career-advice/career-development/generation-x-professional-characteristics
and saved as text file.

WORD
Made up reviews received from a hypothetical journal about a hypothetical
manuscript submitted to the journal.
./testfiles/word/anon_reviews.docx

By running the main code above on this directory structure (./testfiles), we get the following log. The only reason for including this log is to draw your attention to the number of entities added to the vector database for each file.

There are 7 files under dir ./testfiles.
2 : ".pdf" files
0 : ".doc" files
1 : ".docx" files
1 : ".xlsx" files
0 : ".xls" files
0 : ".ppt" files
1 : ".pptx" files
1 : ".txt" files
1 : ".csv" files
=> processing file 1 / 7: ./testfiles/csv/premier_league_all_matches_2022-2023-season.csv
File ./testfiles/csv/premier_league_all_matches.csv added 380 entities
to collection testfiles_repo, to a total of 380 entities.

=> processing file 2 / 7: ./testfiles/excel/2020_presidential_elections.xlsx
Created a chunk of size 6016, which is longer than the specified 1000
Created a chunk of size 21291, which is longer than the specified 1000
File ./testfiles/excel/2020_presidential_elections.xlsx added 3 entities
to collection testfiles_repo, to a total of 383 entities.

=> processing file 3 / 7: ./testfiles/pdf/garmentGAN.pdf
File ./testfiles/pdf/garmentGAN.pdf added 46 entities
to collection testfiles_repo, to a total of 429 entities.

=> processing file 4 / 7: ./testfiles/pdf/InsetGAN.pdf
File ./testfiles/pdf/InsetGAN.pdf added 88 entities
to collection testfiles_repo, to a total of 517 entities.

=> processing file 5 / 7: ./testfiles/powerpoint/Generations in the Workplace PPT (Final).pptx
Created a chunk of size 1875, which is longer than the specified 1000
File ./testfiles/powerpoint/Generations in the Workplace PPT (Final).pptx added 19 entities
to collection testfiles_repo, to a total of 536 entities.

=> processing file 6 / 7: ./testfiles/text/genx_characteristics.txt
File ./testfiles/text/genx_characteristics.txt added 7 entities
to collection testfiles_repo, to a total of 543 entities.

=> processing file 7 / 7: ./testfiles/word/anon_reviews.docx
File ./testfiles/word/anon_reviews.docx added 8 entities
to collection testfiles_repo, to a total of 551 entities.

[index_files] Generated embeddings for 7 / 7 under dir ./testfiles
All files successfully processed

The CSV file premier_league_all_matches_2022-2023-season.csv has 380 rows (excluding the header) and got mapped by CSVLoader into 380 entities, one entity per csv row; each stored entity, with its text and vector, can be inspected in Attu.

This line-by-line splitting and storing is not usually the case for other file types. PDF and text files, for example, get split into chunks of tokens or sentences. Excel files, when using the UnstructuredExcelLoader, get split in such a way that several rows (or even partial rows) are placed into one Document, and consequently into one Milvus entity. For example, the Excel file ./testfiles/excel/2020_presidential_elections.xlsx, which has 547 rows (excluding the header), got split into 3 Documents, stored in Milvus as 3 entities. The first entity included 81 rows, the second 312 rows and the third 154 rows. Considering that all the data in each entity gets mapped to a single vector, the information is stored with widely different granularities, depending on the file type, the file loader and the text splitter used. This has a direct impact on the quality of the search, as we will see in the next section.

User Query Flow

In order to make a user query over the context defined by the files that were indexed, we need to perform the following steps.

Step A: Load vector database

In order to search, we first need to load the vector database that was indexed and stored by the Indexing Flow. This is done by creating a Milvus object pointing to the previously created collection. The embedding_function parameter is the same embedding object defined in the previous section.

from langchain.vectorstores import Milvus

vector_db = Milvus(embedding_function=embedding,
                   collection_name='testfiles_repo',
                   connection_args={"host": "localhost", "port": "19530"})

Step B: Create a Retriever from the vector DB

The retriever encodes questions into the same vector space as the stored contexts, so that a question can be compared against the context vectors in the vector database and the most relevant contexts retrieved. Using vector_db we can define the retriever as:

'''
This defines a retriever from the Milvus database, specifying that 5 documents
should be returned. Other parameters are left as defaults.
'''
retriever = vector_db.as_retriever(search_kwargs={"k": 5})
print(retriever)
---- Output ----
tags=['Milvus', 'OpenAIEmbeddings']
metadata=None vectorstore=<langchain.vectorstores.milvus.Milvus object at 0x7fcea5e24c10>
search_type='similarity'
search_kwargs={'k': 5}

Step C: Configure a Retrieval Chain

LangChain defines a chain as a way to combine multiple components together to create a single, coherent application.

The chain that we need in our application takes user input, formats it with a PromptTemplate, passes the prompt augmented with the context retrieved from a vector database to an LLM and returns the LLM’s response.

Prompt Template

The prompt combines instructions to the language model with the question (query). If we want the model to respond to queries using exclusively data from the context (i.e., the contents of the vector database), then we need to tell it so. Failing to do so may result in the model returning data from its own training (not from the vector DB) when the vector DB search does not find anything relevant.

from langchain.prompts import PromptTemplate

prompt_template = \
"""
Use the following pieces of context to answer the question at the end.
If you don't know the answer based on the context only, say you do not
know the answer.

{context}

Question: {question}
"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

chain_type_kwargs = {"prompt": PROMPT}

Retrieval Chain

We use LangChain’s RetrievalQA chain, which is designed for Q&A-type applications. Its parameters allow us to specify the language model to be used (for submitting the query and generating answers), the chain type, the retriever function and the prompt.

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0.2, model_name="gpt-3.5-turbo", openai_api_key=<insert-your-key-here>),
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs=chain_type_kwargs,
    return_source_documents=True
)

The llm parameter points to the API that interfaces with the language model. In this case we are using ChatOpenAI and the gpt-3.5-turbo model with a low temperature value, in order to make the replies more deterministic and closer to the content provided.

The chain_type="stuff" parameter performs the basic function of taking the list of retrieved documents, inserting them all into the prompt and passing the prompt to the LLM. The documentation indicates that this chain type is better suited for applications where documents are small and only a few are passed in for most calls.

We also need to pass the prompt through the chain_type_kwargs parameter. Finally, we also set return_source_documents to True to get a list of the source documents whose parts were used in the query.
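
If you expect many or larger documents to be retrieved, RetrievalQA also supports other chain types, such as "map_reduce", which queries the LLM on each retrieved document separately and then combines the partial answers. It costs more LLM calls and may or may not improve results on your data; a minimal variant (using the default prompts rather than our custom one) would look like this:

qa_chain_mr = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0.2, model_name="gpt-3.5-turbo", openai_api_key=<insert-your-key-here>),
    chain_type="map_reduce",   # answer per retrieved document, then combine the answers
    retriever=retriever,
    return_source_documents=True
)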

Step D: Submitting the User Query

We can simply use the retrieval chain (qa_chain) defined above to query the model, as follows. For illustrative purposes, we use the function process_llm_response to print the results and the source files used, if any.

def process_llm_response(llm_response):
    print(llm_response['result'])
    print(f'{len(llm_response["source_documents"])} Source documents:')
    # print(llm_response['source_documents'])
    srcset = set()
    for source in llm_response["source_documents"]:
        srcset.add(source.metadata['source'])
    if srcset:
        print('\n\nSources:')
        for s in srcset:
            print(s)

Query 1
Content from specific lines of the .csv file.

query1 = "What were the scores of the games between Arsenal and Aston Villa?"
llm_response = qa_chain(query1)
process_llm_response(llm_response)
---- Output ----
Based on the given context, the scores of the games between Arsenal and Aston Villa were:

- On 8/31/2022, Arsenal won with a score of 2 x 1.
- On 2/18/2023, Aston Villa lost with a score of 2 x 4.

Sources:
./testfiles/csv/premier_league_all_matches_2022-2023-season.csv

Query 2
Content from specific lines of the .csv file.

query2 = "Which team has its home at 'The American Express Community Stadium'?"
llm_response = qa_chain(query2)
process_llm_response(llm_response)
---- Output ----
Brighton

Sources:
./testfiles/csv/premier_league_all_matches_2022-2023-season.csv

Query 3
Tricky question — to make sure only the indexed content is used. According to the prompt, it should say it does not know the answer.

query3 = "When was the Penicillin discovered?"
llm_response = qa_chain(query3)
process_llm_response(llm_response)

---- Output ----
Based on the given context, there is no information provided about the
discovery of Penicillin. Therefore, I do not know the answer to the question.

Query 4
Summarizing specific info about GANs from .pdf files.

query4 = "What can you tell me about Generative adversarial networks?"
llm_response = qa_chain(query4)
process_llm_response(llm_response)
---- Output ----
Based on the given context, generative adversarial networks (GANs) are a
successful image generation paradigm. They consist of two neural networks,
a generator and a discriminator. The generator learns the distribution of
real images by generating images that are indistinguishable from real images,
while the discriminator learns to classify the images into real and fake.
GANs have impressive capabilities for synthesizing realistic content and are
commonly used in various synthesis tasks, such as text-to-image and
image-to-image tasks. GANs have also been used for person image generation,
which is a challenging task due to the high variability of human pose, shape,
and appearance.

Sources:
./testfiles/pdf/InsetGAN.pdf
./testfiles/pdf/garmentGAN.pdf

Query 5
Asking about specific data present in .pdf files — note that the output includes data from the paper references.

query5 = "Can you give me several examples of AI systems using GANs?"
llm_response = qa_chain(query5)
process_llm_response(llm_response)
---- Output ----
Based on the given context, several examples of AI systems using GANs are:

1. StyleGAN: It is a method used for creating near photorealistic images for multiple classes such as human faces, cars, and landscapes.
2. BigGAN: This architecture is often used for class-conditional image generation on the ImageNet dataset.
3. TileGAN: It is a method used for the synthesis of large-scale non-homogeneous textures.
Please note that these examples are mentioned in the context, and there may be other AI systems using GANs that are not mentioned here.

Sources:
./testfiles/pdf/garmentGAN.pdf
./testfiles/pdf/InsetGAN.pdf

Query 6
Asking for a summary of one of the .pdf files.

query = "Can you give me a brief summary of garmentGAN?"
llm_response = qa_chain(query)
process_llm_response(llm_response)
---- Output ----
GarmentGAN is a new algorithm that uses generative adversarial methods to
perform image-based garment transfer. It allows users to virtually try on
clothing items before purchase and can handle complex body poses, hand
gestures, and occlusions. The algorithm requires two input images: a picture
of the target fashion item and an image of the customer. The output is a
synthetic image where the customer is wearing the target apparel. GarmentGAN
improves on existing methods in terms of the realism of generated imagery and
solves problems related to self-occlusions. It incorporates additional
information during training, such as segmentation maps and body key-point
information, to synthesize photographs at high levels of realism.
The algorithm comprises two separate GANs: a shape transfer network and an
appearance transfer network. It also uses a geometric alignment module and a
method of masking semantic segmentation maps to handle complex body poses and
occlusions. GarmentGAN preserves the identity of the reference person and
clothing items that should remain unchanged during the try-on task.
The algorithm has been evaluated and shown to outperform other state-of-the-art
models in terms of realism and quality of generated images.

Sources:
./testfiles/pdf/garmentGAN.pdf

Query 7
Querying the data present in files .txt and .pptx.

query7 = "How is a generation defined?"
llm_response = qa_chain(query7)
process_llm_response(llm_response)
---- Output ----
A generation is defined as a group of individuals born and living
contemporaneously who share common knowledge and experiences that affect
their thoughts, attitudes, values, beliefs, and behaviors.

Sources:
./testfiles/powerpoint/Generations in the Workplace PPT (Final).pptx
./testfiles/text/genx_characteristics.txt

Query 8
Querying specific information from the .txt and .pptx files. This info is present in 3 places with different values: pages 6 and 13 of the .pptx file, and in the .txt file. The model chose to return the data from page 13 of the .pptx file.

query8 = "When were Baby Boomers born?"
llm_response = qa_chain(query8)
process_llm_response(llm_response)
---- Output ----
Baby Boomers were born between 1946 and 1968.

Sources:
./testfiles/powerpoint/Generations in the Workplace PPT (Final).pptx
./testfiles/text/genx_characteristics.txt

Query 9
Querying information with semantic understanding, present in files .pptx and .txt.

query9 = "who are the groups that preceeded and succeeded Generation X?"
llm_response = qa_chain(query9)
process_llm_response(llm_response)
---- Output ----
The groups that preceded Generation X are the baby boomers, born between 1943
and 1964. The group that succeeded Generation X is the millennials, born
between 1981 and 2000.

Sources:
./testfiles/powerpoint/Generations in the Workplace PPT (Final).pptx
./testfiles/text/genx_characteristics.txt

Query 10
Asking for a summary of separate pieces of information contained in .docx file.

query10 = "Summarize the reviews of paper entitled 'The Best Search System' and tell me if the paper was accepted or not."
llm_response = qa_chain(query10)
process_llm_response(llm_response)
---- Output ----
The reviews of the paper entitled 'The Best Search System' are mixed.
Review 4 gave a strong reject, Review 1 gave an accept, Review 2 gave a reject,
Review 3 gave a weak accept, and the metareview recommended rejection.
Therefore, the paper was not accepted.

Sources:
./testfiles/word/anon_reviews.docx

Queries that did not work well and why

Queries that required aggregating data from a larger number of documents did not work. We defined our vector DB retriever in Step B with k = 5, that is, the search will return at most 5 documents (those most similar to the query), where a document is likely to be a fragment of an original file. Hence, depending on which 5 documents are retrieved, the retrieval chain may not be able to compose a complete answer, or any answer at all. Let us look at some examples.
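
One partial mitigation is simply to retrieve more documents, at the cost of a larger prompt and more tokens per call; this does not turn the system into a true aggregation engine, but it widens the context the LLM sees. For example (the value 20 is arbitrary and the retrieved text must still fit within the model's context window):

retriever_wide = vector_db.as_retriever(search_kwargs={"k": 20})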

Query 11
This query requires all rows of the .csv file to be considered (since each row was mapped into a vector DB entity), but since we only retrieve 5 documents, the answer is wrong.

query11 = "What match had the highest attendance among all Premier League matches?"
llm_response = qa_chain(query11)
process_llm_response(llm_response)
---- Output ----
The match with the highest attendance among all Premier League matches is the
one between Manchester City and Everton on 12/31/2022 at the Etihad Stadium,
with an attendance of 53444.

Sources:
./testfiles/csv/premier_league_all_matches_2022-2023-season.csv

The answer above is clearly wrong. By looking at the .csv file, one can see that the highest attendance was 75546 people, at the Manchester United 2 x 1 Manchester City game at Old Trafford on 01/14/2023. If you ask directly about this game, the system does return the correct attendance.

Query 12
Asking for specific information from the .xlsx file. As we discussed in the section Test data and entities stored in Milvus, the .xlsx file is split and stored as only 3 entities, each with a high number of tokens (much higher than the chunk_size=1000 specified in the text splitter). As a result, the context plus the query end up containing too many tokens for the OpenAI model to process (under my plan) and we get an error about exceeding the number of tokens.

query12 = "How many votes did Biden get in Alabama in 2020?"
llm_response = qa_chain(query12)
process_llm_response(llm_response)
---- Output ----
...
...
InvalidRequestError: This model's maximum context length is 4097 tokens.
However, your messages resulted in 16202 tokens. Please reduce the length of
the messages.

This file would have to be split into smaller chunks for this query to work; however, the UnstructuredExcelLoader used with the NLTKTextSplitter is not able to do that, even if a smaller chunk_size is specified. If you need to search on row-level information, a better option is to convert the .xlsx to .csv and use the CSVLoader, as sketched below.
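
A quick way to do that conversion, sketched here with pandas (reading .xlsx files requires the openpyxl package; the output path is just an example), is:

import pandas as pd

def xlsx_to_csv(xlsx_path: str, csv_path: str):
    # rewrite the spreadsheet as CSV so that CSVLoader can index it row by row
    df = pd.read_excel(xlsx_path)
    df.to_csv(csv_path, index=False)

xlsx_to_csv('./testfiles/excel/2020_presidential_elections.xlsx',
            './testfiles/csv/2020_presidential_elections.csv')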

Conclusion

In this article I explained how to create a search engine for your own files using natural language. The availability of LangChain and of large language models from different companies makes it possible for complex applications to be developed quickly. However, the crux is in the details, and, as I tried to show, querying a language model, even over limited content, does not always return satisfying results, although it does offer significantly more powerful ways of searching than keyword-based search.

Many of the methods that we used in this article accept parameters which can directly affect the quality of the results. I did not go into detail on how to tweak these parameters, but I included enough links to the documentation for the interested reader to do so. I used OpenAI models, which are not free; however, for all the calls (several hundred) that I made to the models while developing this article, the cost did not reach $2.

All code and test files presented here may be downloaded from this github repository.
