import sys
from pathlib import Path
sys.path.insert(1, str(Path.cwd().parent))
31 Retrieval
Retrieval is the centerpiece of our retrieval augmented generation (RAG) flow.
Let’s get our vectorDB from before.
31.1 Vectorstore retrieval
#!pip install lark
31.1.1 Similarity Search
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'
embedding = OpenAIEmbeddings()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)
print(vectordb._collection.count())
151
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]
smalldb = Chroma.from_texts(texts, embedding=embedding)
= "Tell me about all-white mushrooms with large fruiting bodies" question
=2) smalldb.similarity_search(question, k
=2, fetch_k=3) smalldb.max_marginal_relevance_search(question,k
31.1.2 Addressing Diversity: Maximum marginal relevance
Last class we introduced one problem: how to enforce diversity in the search results. Maximum marginal relevance (MMR) strives to achieve both relevance to the query and diversity among the results.
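To make the idea concrete, here is a minimal sketch of the greedy MMR selection rule: each step picks the candidate that is most relevant to the query but least similar to what has already been chosen. This is only an illustration of the algorithm, not LangChain's implementation; the lambda_mult name mirrors the trade-off parameter that LangChain's MMR search exposes.

import numpy as np

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cos(query_vec, doc_vecs[i])                       # similarity to the query
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)  # similarity to picks so far
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

With lambda_mult = 1 this reduces to plain similarity search; lower values push the results toward diversity.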
= "what did they say about matlab?"
question = vectordb.similarity_search(question,k=3) docs_ss
0].page_content[:100] docs_ss[
1].page_content[:100] docs_ss[
Note the difference in results with MMR.
docs_mmr = vectordb.max_marginal_relevance_search(question, k=3)
docs_mmr[0].page_content[:100]
docs_mmr[1].page_content[:100]
31.1.3 Addressing Specificity: working with metadata
In the last lecture, we showed that a question about the third lecture can include results from other lectures as well.
To address this, many vectorstores support operations on metadata. Metadata provides context for each embedded chunk.
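If you want to see what metadata is stored alongside each chunk, you can peek at the underlying collection. This is a quick sketch using the same _collection attribute the count cell above uses; the exact fields depend on how the documents were loaded.

# Peek at the metadata stored with the first embedded chunk.
sample = vectordb._collection.get(limit=1, include=["metadatas"])
print(sample["metadatas"])  # e.g. [{'source': 'docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}]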
= "what did they say about regression in the third lecture?" question
= vectordb.similarity_search(
docs
question,=3,
kfilter={"source":"docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)
for d in docs:
print(d.metadata)
31.1.4 Addressing Specificity: working with metadata using self-query retriever
But we have an interesting challenge: we often want to infer the metadata from the query itself.
To address this, we can use SelfQueryRetriever, which uses an LLM to extract:
- The query string to use for vector search
- A metadata filter to pass in as well
Most vector databases support metadata filters, so this doesn’t require any new databases or indexes.
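Conceptually, the retriever decomposes a natural-language question into a structured query before hitting the vectorstore. For the question we ask below, the extracted pieces might look roughly like this (an illustrative sketch only, not LangChain's exact output format):

# Hypothetical decomposition of "what did they say about regression in the third lecture?"
structured_query = {
    "query": "regression",  # the string that is actually embedded and searched
    "filter": {"source": "docs/cs229_lectures/MachineLearning-Lecture03.pdf"},  # metadata filter inferred from "third lecture"
}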
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]
Note: The default model for OpenAI ("from langchain.llms import OpenAI") is text-davinci-003. Due to the deprecation of OpenAI's model text-davinci-003 on 4 January 2024, you'll be using OpenAI's recommended replacement model gpt-3.5-turbo-instruct instead.
= "Lecture notes"
document_content_description = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0)
llm = SelfQueryRetriever.from_llm(
retriever
llm,
vectordb,
document_content_description,
metadata_field_info,=True
verbose )
= "what did they say about regression in the third lecture?" question
You will receive a warning about predict_and_parse being deprecated the first time you execute the next line. This can be safely ignored.
docs = retriever.get_relevant_documents(question)
docs
[]
for d in docs:
    print(d.metadata)
31.1.5 Additional tricks: compression
Another approach for improving the quality of retrieved docs is compression.
Information most relevant to a query may be buried in a document with a lot of irrelevant text.
Passing that full document through your application can lead to more expensive LLM calls and poorer responses.
Contextual compression is meant to fix this.
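Under the hood, the extractor we use below asks an LLM to pull out only the passages of each retrieved document that bear on the question. A rough sketch of that idea (not LangChain's actual prompt or implementation, and assuming an LLM wrapper that can be called directly on a prompt string, as the older langchain.llms.OpenAI wrapper allows):

def compress_doc(llm, question, doc_text):
    """Ask the LLM to keep only the parts of a document relevant to the question (conceptual sketch)."""
    prompt = (
        "Extract only the parts of the following context that are relevant to the question.\n\n"
        f"Question: {question}\n\nContext:\n{doc_text}\n\n"
        "Relevant parts (or NO_OUTPUT if nothing is relevant):"
    )
    return llm(prompt)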
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))
# Wrap our vectorstore
llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)
= "what did they say about matlab?"
question = compression_retriever.get_relevant_documents(question)
compressed_docs pretty_print_docs(compressed_docs)
31.2 Combining various techniques
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)
31.3 Other types of retrieval
It’s worth noting that a vector database is not the only kind of tool for retrieving documents.
The LangChain retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
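To make the TF-IDF idea concrete, here is a minimal sketch of keyword-based retrieval with scikit-learn: documents are ranked by cosine similarity of their term-frequency/inverse-document-frequency vectors against the query. This assumes scikit-learn is available and is only an illustration of the technique, not the TFIDFRetriever implementation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_top_k(texts, query, k=3):
    """Rank texts against the query by TF-IDF cosine similarity."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(texts)   # one sparse row per document
    query_vec = vectorizer.transform([query])      # project the query into the same vocabulary space
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    return [texts[i] for i in scores.argsort()[::-1][:k]]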
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load PDF
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits, embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)
question = "What are major topics for this class?"
docs_svm = svm_retriever.get_relevant_documents(question)
docs_svm[0]
question = "what did they say about matlab?"
docs_tfidf = tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]