30  Vectorstores and Embeddings

Recall the overall workflow for retrieval augmented generation (RAG):

[Figure: overview of the RAG workflow (overview.jpeg)]
import sys
from pathlib import Path
sys.path.insert(1, str(Path.cwd().parent)) 
ls = []
ls.extend(["a"])
ls.extend(["b"])
ls
['a', 'b']

We just discussed Document Loading and Splitting.

from langchain.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # The lesson duplicates a document on purpose to simulate messy data;
    # as listed here, each lecture is loaded once.
    PyPDFLoader("docs/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)
splits = text_splitter.split_documents(docs)
len(splits)
151
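
To see what chunk_size and chunk_overlap control, here is a simplified sketch that cuts text into fixed-size character windows. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraphs and newlines). The function name sliding_window_chunks and the sample text are made up for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


# Each chunk repeats the last character of the previous one (overlap of 1).
print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some repeated text in the index.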

30.1 Embeddings

Let’s take our splits and embed them.

# This import path is deprecated. Newer LangChain releases ship the class in the
# langchain-openai package: `from langchain_openai import OpenAIEmbeddings`.
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)
[-0.005669887070811415,
 -0.007792916467724385,
 0.00620382220901969,
 ...]
import numpy as np
np.dot(embedding1, embedding2)
0.9631510802407719
np.dot(embedding1, embedding3)
0.7702031204123156
np.dot(embedding2, embedding3)
0.7590539714454778
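
The dot product works as a similarity score here because OpenAI embeddings are normalized to unit length, so the dot product equals the cosine similarity. Below is a minimal sketch of that relationship using made-up 3-dimensional vectors. The helper names and the vector values are illustrative assumptions, not real embeddings.

```python
import math


def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    return dot(a, b) / (norm(a) * norm(b))


# Toy stand-ins for the three sentence embeddings (hypothetical values).
dogs = [0.9, 0.1, 0.0]
canines = [0.8, 0.2, 0.1]
weather = [0.0, 0.2, 0.9]

print(cosine_similarity(dogs, canines))  # close to 1: similar meaning
print(cosine_similarity(dogs, weather))  # close to 0: unrelated meaning

# After scaling to unit length, the plain dot product gives the same number.
unit_dogs = [x / norm(dogs) for x in dogs]
unit_canines = [x / norm(canines) for x in canines]
print(math.isclose(dot(unit_dogs, unit_canines), cosine_similarity(dogs, canines)))
```

If you switch to an embedding model that does not return unit-length vectors, dividing by the norms as in cosine_similarity keeps the scores comparable.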

30.2 Vectorstores

# ! pip install chromadb
from langchain.vectorstores import Chroma
persist_directory = 'docs/chroma/'
!rm -rf ./docs/chroma  # remove old database files if any
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)
print(vectordb._collection.count())
151
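
Conceptually, a vector store keeps each chunk next to its embedding and answers a query by returning the chunks whose embeddings are closest to the query embedding. The sketch below is a brute-force, in-memory version of that idea. TinyVectorStore and its methods are hypothetical names for illustration and say nothing about how Chroma is implemented.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class TinyVectorStore:
    """Hypothetical in-memory store that ranks stored texts by cosine similarity."""

    def __init__(self):
        self._items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self._items.append((list(vector), text))

    def count(self):
        return len(self._items)

    def similarity_search(self, query_vector, k=3):
        # Score every stored vector against the query, then keep the top k.
        scored = [(_cosine(query_vector, vec), text) for vec, text in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


store = TinyVectorStore()
store.add([0.9, 0.1, 0.0], "i like dogs")
store.add([0.8, 0.2, 0.1], "i like canines")
store.add([0.0, 0.2, 0.9], "the weather is ugly outside")

print(store.count())
print(store.similarity_search([1.0, 0.0, 0.0], k=2))
```

A real vector store adds persistence and an approximate nearest-neighbor index so that the search does not have to scan every vector.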

30.3 Failure modes

Basic similarity search gets you about 80% of the way there with very little effort.

But some failure modes can creep in.

Here are two edge cases that can arise; we’ll fix them in the next lecture.

question = "what did they say about matlab?"
docs = vectordb.similarity_search(question,k=5)

If the index contains the same document twice (for example, MachineLearning-Lecture01.pdf loaded a second time to simulate messy data), similarity search returns duplicate chunks.

Semantic search fetches the k most similar chunks, but does not enforce diversity among them.

docs[0] and docs[1] are identical.

docs[0]
docs[1]
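
One plain way to check for this is to drop results whose text has already been seen. The sketch below is a simple post-processing step, not the approach covered in the next lecture. Doc is a hypothetical stand-in for LangChain's document object (only page_content and metadata are modeled), and the sample results are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_exact_duplicates(docs):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


# Made-up search results where the first two hits carry the same text.
results = [
    Doc("MATLAB is easy to learn."),
    Doc("MATLAB is easy to learn."),
    Doc("Octave is a free alternative."),
]
unique_results = drop_exact_duplicates(results)
print(len(results), len(unique_results))
```

This only removes exact copies. Near-duplicates, such as chunks that differ by a few characters, would pass through unchanged.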

There is a second failure mode.

The question below asks about the third lecture, but the results include chunks from other lectures as well.

question = "what did they say about regression in the third lecture?"
docs = vectordb.similarity_search(question,k=5)
for doc in docs:
    print(doc.metadata)
print(docs[4].page_content)
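
The embedding captures the meaning of the whole question, so the phrase "third lecture" is not treated as a hard constraint. A minimal sketch of enforcing it afterwards is to filter the results on their metadata. This assumes each document's metadata has a source key holding the file path, as the PDF loader typically sets. Doc, filter_by_source, and the sample results are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_source(docs, source):
    """Keep only documents whose metadata 'source' matches the given path."""
    return [doc for doc in docs if doc.metadata.get("source") == source]


# Made-up results mixing chunks from the second and third lectures.
results = [
    Doc("regression notes", {"source": "docs/MachineLearning-Lecture03.pdf", "page": 0}),
    Doc("regression intro", {"source": "docs/MachineLearning-Lecture02.pdf", "page": 3}),
    Doc("more on regression", {"source": "docs/MachineLearning-Lecture03.pdf", "page": 5}),
]
third_lecture_only = filter_by_source(results, "docs/MachineLearning-Lecture03.pdf")
for doc in third_lecture_only:
    print(doc.metadata)
```

Filtering after the search can leave you with fewer than k results. LangChain's Chroma wrapper also accepts a filter argument on similarity_search that applies the constraint inside the store; check the documentation for your installed version.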

Approaches discussed in the next lecture can be used to address both!