Embedchain #
Overview #
Embedchain combines several of the steps needed to build RAG and Agentic RAG apps. It helps with:
- Data extraction from various file formats (text, HTML with BeautifulSoup4, PDF).
- Chunking - splitting the cleaned data into chunks.
- Embedding - it can use different AI providers (or offline functions) to calculate the embeddings.
- Storing Embeddings in a vector database, like ChromaDB (default).
- Querying and passing top-k matches to an LLM to generate the answer.
Because Embedchain combines many of these steps without requiring extensive configuration, it is a good library for quickly testing a RAG app concept. Furthermore, we can customize different steps separately (see the config sketch after the example below).
A simple example:
from embedchain import App
# Initialize the Embedchain app
app = App()
# Add sources
app.add("path/to/document.pdf")
app.add("https://example.com/about")
# Query the app
response = app.query("What kind of services does Example Inc. provide?")
print(response)
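To customize the steps individually, an Embedchain app can also be created from a YAML config that selects the LLM, embedder, and vector database. A minimal sketch; the config file path is a placeholder and the exact schema is defined by Embedchain:
from embedchain import App
# Load app settings (LLM, embedder, vector DB, chunker) from a YAML file
app = App.from_config(config_path="config.yaml")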
Data Extraction #
Cache Fetched Data #
If we want to embed a sequence of mostly unchanging pages from a website, it can be better to download all of the pages once and proxy them from a local server. For example, instead of
app.add("https://example.com/category1/page1.html", data_type="web_page")
app.add("https://example.com/category1/page2.html", data_type="web_page")
app.add("https://example.com/category1/page3.html", data_type="web_page")
We could serve them at localhost:
app.add("http://localhost/category1/page1.html", data_type="web_page")
app.add("http://localhost/category1/page2.html", data_type="web_page")
app.add("http://localhost/category1/page3.html", data_type="web_page")
This lets us re-run experiments that regenerate the embeddings without repeatedly downloading the same pages from the remote source.
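For example, Python's built-in http.server module can serve a directory of downloaded pages. A minimal sketch, assuming the pages were mirrored into a local mirror/ directory (adjust the port in the URLs accordingly):
import functools
from http.server import HTTPServer, SimpleHTTPRequestHandler
# Serve previously downloaded pages from ./mirror at http://localhost:8000
handler = functools.partial(SimpleHTTPRequestHandler, directory="mirror")
HTTPServer(("localhost", 8000), handler).serve_forever()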
Caveat: the metadata stored in the vector database will then reference the local URLs instead of the real ones. This can be adjusted either after uploading to the vector DB or when serving the results back to the user.
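One simple adjustment is rewriting the local URLs before showing sources to the user. A hypothetical helper with hard-coded hosts:
# Hypothetical helper: map cached localhost URLs back to the original host
def restore_source_url(url: str) -> str:
    return url.replace("http://localhost", "https://example.com", 1)
print(restore_source_url("http://localhost/category1/page1.html"))
# -> https://example.com/category1/page1.html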
Manually Preprocess HTML Files #
If we want to manually extract some metadata from HTML pages, we can use the BeautifulSoup4 package to parse, extract metadata, and clean the files.
The script below parses a page and prints a prettified version of its HTML:
from bs4 import BeautifulSoup
with open("file1.html", "r") as f:
    text = f.read()
soup = BeautifulSoup(text, "html.parser")
print(soup.prettify())
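From the same parsed tree we can also pull out metadata. A minimal sketch using standard HTML tags:
# Extract simple metadata from the parsed page
title = soup.title.string if soup.title else None
desc_tag = soup.find("meta", attrs={"name": "description"})
description = desc_tag["content"] if desc_tag else None
print(title, description)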
By default, BeautifulSoup4 uses the default HTML parser of the active Python environment. However, that parser fails on some documents. In such cases, we can substitute a different parser.
To use html5lib, first install it:
pip install html5lib
Then, in a Python script:
soup = BeautifulSoup(text, "html5lib")
To use lxml, also install it with:
pip install lxml
and use the parser with:
soup = BeautifulSoup(text, "lxml")
Manual Embedding #
When not configured to use an external embedding function, Embedchain falls back to ChromaDB's default embedding function. The big LLM providers typically offer better embedding models than this simple default. ChromaDB integrates with popular LLM providers, so we can configure it to use an external embedding function when populating the database. Note that the same embedding function must be used both for generating the embeddings and for querying the database.
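As a sketch, ChromaDB can be pointed at OpenAI's embedding API as shown below; the collection name, DB path, and model name are placeholders, not Embedchain's defaults:
import chromadb
from chromadb.utils import embedding_functions
# Use OpenAI's embedding API instead of ChromaDB's default embedding function
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",  # your OpenAI API key
    model_name="text-embedding-3-small",
)
client = chromadb.PersistentClient(path="db")
collection = client.get_or_create_collection(
    name="embedchain_store",  # placeholder; match the collection your app uses
    embedding_function=openai_ef,
)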
If we want to use custom embedding functions or call embedding APIs directly, we can manually write the embeddings into the vector database and let the Embedchain app use this populated DB.
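A minimal sketch of writing precomputed vectors into the collection from the previous example; the chunk text, ID, and metadata are made up for illustration:
# Compute an embedding with an external API and store it directly in ChromaDB
chunk = "Example Inc. provides consulting services."
vector = openai_ef([chunk])[0]  # any call returning a list of floats works here
collection.add(
    ids=["chunk-0001"],
    embeddings=[vector],
    documents=[chunk],
    metadatas=[{"url": "https://example.com/about"}],
)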