Following my previous post, where I successfully embedded text from PDF documents and made queries based on that data, I’m now going to save that embedded and vectorised data into a database.
I’ll be using Azure Cosmos DB for MongoDB, but a regular MongoDB instance, such as one hosted on Atlas, works too, since under the hood it’s essentially the same MongoDB.
PS: I LOVE Azure for this. They didn’t try to reinvent MongoDB the way AWS did with DocumentDB, which is unfortunately missing some significant features.
Refactoring
I’m going to split my script into two parts, ingestion and querying, so that we don’t re-ingest the data into our database on every run. I’ll also move the shared setup into a separate file: setting the LLM and embedding models, loading the .env file, and initialising the vector store. Both parts need all of that.
As a result, we’ll have a setup file that looks like this:
Our previous main.py will be renamed to ingest.py, and we’ll keep only the embedding functionality:
That’s it for the ingestion part.
Querying
Previously, querying took three steps:

1. an index built from the documents,
2. a query engine, and
3. the response returned by our query.
Since we’ve already ingested the data and it now lives in our database, we no longer need to build a VectorStoreIndex from documents. Instead, we’ll build the index directly from our Cosmos DB vector store:
The QueryEngine initialisation remains the same.
And that’s it. When we query now, we get exactly the same result, except we go straight to the database instead of processing our PDF documents again. I’ve also put the query text into a text file and read it from there, so the querying script ends up looking something like this: