THE LAB #78: Building a Web Scraping Knowledge Assistant with RAG - Part 2
Optimizing the content storage and creating a CLI for our assistant
In our previous article, we saw how to scrape this newsletter with Firecrawl and transform the posts into markdown files that can be loaded into a VectorDB in Pinecone.
After releasing the first part of the article, I kept querying the VectorDB with different queries. I was unhappy with the results, so I wanted to try to optimize the data ingestion on Pinecone a bit.
Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to 1 TB of web unblocker for free.
Improving the data quality
First of all, I cleaned the markdown by removing image links, extra newlines, separators, and other noise, so that the files passed to Pinecone are more readable.
So, I created a small function with regular expressions (thanks, ChatGPT!) to preprocess the markdown extracted by Firecrawl before passing it to Pinecone.
import re

def clean_markdown(md_text):
    """Cleans Markdown text by removing images and dividers."""
    md_text = re.sub(r"!\[.*?\]\(.*?\)", "", md_text)  # Remove markdown images
    md_text = re.sub(r"<img[^>]*>", "", md_text)  # Remove HTML images
    md_text = re.sub(r"(\*{3,}|-{3,})\n(.*?)\n(\*{3,}|-{3,})", "", md_text, flags=re.DOTALL)  # Remove dividers
    md_text = re.sub(r"\n\s*\n", "\n", md_text).strip()  # Remove extra newlines
    return md_text
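For instance, running it on a made-up snippet of scraped markdown (the sample text below is purely illustrative):

sample = """![logo](https://example.com/logo.png)
Some intro text.

***
Sponsored section
***

More content here."""

print(clean_markdown(sample))
# Output:
# Some intro text.
# More content here.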
Splitting into fixed-length chunks
Another technique that can improve the relevance of the retrieved data is to split articles into chunks. Instead of ingesting the whole article as a single entry of the index, it is split into chunks that are inserted as separate entries.
This way, a single entry should contain a single concept instead of an entire article, which makes it easier to compute its relevance to the user’s query. You can find this approach in the file firecrawl_get_data_with_chunks.py in the GitHub repository of The Lab.
I’m well aware that this is far from perfect: I’m simply splitting the content into chunks of fixed length, ignoring what the chunks contain. The same paragraph could end up split across different chunks, which is a rough approximation.
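I’m not reproducing the whole script here, but a minimal sketch of this kind of fixed-length chunking could look like the snippet below (the chunk size and overlap are illustrative values, not necessarily the ones used in firecrawl_get_data_with_chunks.py):

def split_into_chunks(text, chunk_size=1000, overlap=100):
    """Split the text into fixed-length chunks with a small overlap,
    so a sentence cut at a boundary is at least partially repeated."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Each chunk is then embedded and upserted as its own entry in the index,
# keeping the article's metadata (title, author, URL).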
A smarter approach could be to have ChatGPT read the article, summarize its different paragraphs, and then load each summary as a different chunk. In this way, we get clean data and chunks that each contain an entire paragraph.
Thanks to the gold partners of the month: Smartproxy, Oxylabs, Massive, and Scrapeless. They’re offering great deals to the community. Have a look for yourself.
Splitting into chunks with GPT-4o
That’s exactly what I did in my last attempt: I fed the markdown files of the articles to GPT-4o and asked it to rewrite them, splitting each post into distinct paragraphs.
Every paragraph then became a chunk in Pinecone. In this case, each chunk has a clear beginning and end and contains a fully developed concept, instead of being an arbitrary string of X tokens.
You can find this chunking method in the repository file firecrawl_get_data_with_chunks_openai.py.
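Again, I’m not copying the full script here, but the core idea can be sketched as follows (the prompt wording is illustrative and not necessarily the exact one used in the repository file):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk_article_with_gpt(md_text):
    """Ask GPT-4o to rewrite the article as self-contained paragraphs,
    then use each paragraph as a separate chunk."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the following article as a series of self-contained paragraphs, "
                "one concept per paragraph, separated by blank lines. "
                "Do not add information that is not in the article.\n\n" + md_text
            )
        }]
    )
    rewritten = response.choices[0].message.content
    # Every paragraph becomes its own chunk/record in Pinecone
    return [p.strip() for p in rewritten.split("\n\n") if p.strip()]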
It took me several hours to develop and test these ideas, which were not part of the initial plan for the article; that’s why this episode is published on Friday instead of the usual Thursday.
Querying the Pinecone database
At the end of the chunking tests, we have three different Pinecone indexes that use the same input data but split it in different ways:
article-index, with one entry per article
article-index-with-chunks, with the articles split into different chunks based on the number of tokens used
article-index-with-chunks-openai, with the articles split into paragraphs rewritten by GPT-4o
All three indexes, however, share the same structure (sketched in the example after this list):
a values field, containing the vector representation of the text we passed. This will be used to find the most relevant text for the input query using a proximity search.
a chunk_text field, where we store the text that will be used as the output of the query (the full article or the chunk of the article selected)
three metadata fields (author, title, and URL) that we’ll use to cite the articles used to answer the query
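For reference, the records upserted into each index look roughly like this (the ID scheme and variable names are illustrative; the embedding call mirrors the one we’ll use at query time):

# Embed the chunk text (for documents, input_type is "passage"; for queries, "query")
embedding = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=[chunk_text],
    parameters={"input_type": "passage"}
)[0]["values"]

# Upsert one record into the index
index.upsert(
    vectors=[{
        "id": f"{article_slug}-{chunk_number}",  # illustrative ID scheme
        "values": embedding,                      # vector representation of the text
        "metadata": {
            "chunk_text": chunk_text,             # text returned as the query output
            "title": title,
            "author": author,
            "url": url
        }
    }],
    namespace="articles"
)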
But how can we query these indexes to retrieve the results?
The theory is quite simple, at least on a surface level. When we write a prompt, we’re basically writing a query in natural language. This query is then embedded with the same algorithm used to embed the articles inserted in Pinecone. The query itself becomes a series of numbers, so Pinecone (or any other vector database) can perform a proximity search between the query vector and the article vectors. The nearest results are then returned, each with a proximity score, and we can filter the records by keeping only the nearest ones.
def retrieve_articles(query, top_k=3, confidence_threshold=0.3):
    """Retrieve the most relevant articles from Pinecone for a user query."""
    # Generate query embedding
    query_embedding = pc.inference.embed(
        model="llama-text-embed-v2",
        inputs=[query],
        parameters={"input_type": "query"}
    )[0]["values"]

    # Query Pinecone
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        namespace="articles",
        include_metadata=True
    )

    # Extract relevant articles
    retrieved_docs = []
    for match in results["matches"]:
        score = match["score"]
        metadata = match["metadata"]
        # Add article details
        retrieved_docs.append({
            "title": metadata["title"],
            "url": metadata["url"],
            "author": metadata["author"],
            "content": metadata["chunk_text"],
            "score": score
        })

    # Compute highest confidence score
    max_score = max([doc["score"] for doc in retrieved_docs], default=0)

    # Decide whether to use Pinecone or fall back to GPT-4o
    use_pinecone = max_score >= confidence_threshold
    return retrieved_docs if use_pinecone else None, use_pinecone
Once the records are returned, it’s just a matter of prompt engineering. We need to append the values contained in the chunk_text field of the records to the context window of the prompt and find the best way to describe the desired output.
def generate_answer(query):
    """Generates a long-form instructional answer using retrieved articles."""
    retrieved_docs, use_pinecone = retrieve_articles(query)

    if use_pinecone:
        # Extract full text from relevant articles
        context_text = "\n\n".join([
            f"Title: {doc['title']}\nAuthor: {doc['author']}\nContent:\n{doc['content']}..."
            for doc in retrieved_docs
        ])

        # Construct the GPT prompt
        prompt = (
            "Using the following extracted content from expert-written articles, "
            "provide a long-form, step-by-step, detailed answer with practical instructions. "
            "Make sure to extract key information and structure the answer properly.\n\n"
            f"{context_text}\n\n"
            f"📌 **User's Question**: {query}\n\n"
            f"💡 **Detailed Answer**:"
        )
    else:
        # No relevant articles, fall back to GPT-4o general knowledge
        prompt = (
            "Provide a long-form, detailed answer with step-by-step instructions "
            "based on your general knowledge:\n\n"
            f"📌 **User's Question**: {query}\n\n"
            f"💡 **Detailed Answer**:"
        )

    # Query GPT-4o
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )

    # Return the answer text along with the retrieval info
    return response.choices[0].message.content.strip(), use_pinecone, retrieved_docs
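To turn this into the CLI mentioned at the beginning, a minimal loop around generate_answer is enough. This is just a sketch; the actual script in the repository may look different:

if __name__ == "__main__":
    print("The Web Scraping Club assistant - type 'quit' to exit")
    while True:
        query = input("\nYour question: ").strip()
        if query.lower() in ("quit", "exit"):
            break
        answer, used_pinecone, sources = generate_answer(query)
        print("\n" + answer)
        if used_pinecone and sources:
            print("\nSources:")
            for doc in sources:
                print(f"- {doc['title']} by {doc['author']} ({doc['url']})")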
The scripts mentioned in this article are in the GitHub repository's folder 78.ASSISTANT, which is available only to paying readers of The Web Scraping Club.
If you’re one of them and cannot access it, please use the following form to request access.
This prompt can probably be improved, but the results are quite good with all three indexes, even though I think I’m getting the best answers from the index that used GPT-4o to chunk the articles.
Tests