Using NLP for Entity Extraction From Scraped Data
From theory to practice: how to extract entities from textual scraped data using NLP
Help Shape “The State of Web Scraping 2026” (and get rewarded for it)
Apify and The Web Scraping Club are teaming up to produce a detailed, data-driven report on the web scraping industry.
To make it truly valuable, we need insights from the people who build, break, and run scrapers every day, just like you.
As a thank-you, everyone who completes the survey will receive two free months of The Web Scraping Club membership.
Your answers stay anonymous, but your impact won’t.
When it comes to web scraping, extracting textual data is probably one of the most common daily tasks. And when you deal with textual data, you often find yourself aggregating text from several sources. In such cases, there are occasions when you would like (or need) to know more about the content you have retrieved.
This is where Natural Language Processing (NLP) comes to your aid. Without NLP, textual data would be impossible to classify and analyze (and if you are thinking of LLMs, well… NLP is at the very base of LLMs!).
In this article, I will walk you through entity extraction: an NLP process that is very helpful to label unstructured or semi-structured textual data scraped from web pages.
Let’s get into it!
Before proceeding, let me thank Decodo, the platinum partner of the month, and their Scraping API.
Decodo just launched a new promotion for the Advanced Scraping API: you can now use code SCRAPE30 to get 30% off your first purchase.
What is Entity Extraction?
In the field of Natural Language Processing (NLP), entity extraction is the process of identifying and classifying key information in a body of text. Formally, these key pieces of information are known as “entities”, and the process is called Named Entity Recognition (NER).
You can think of NER as teaching a computer to read a sentence and pick out the important nouns, much like you would highlight keywords when studying a book for an exam.
To give an example, consider the following sentence:
On Monday, Sarah from Acme Corporation flew to Paris to sign a deal worth $1.5 million.
If an entity recognition model were to process this sentence, it could identify the following entities:
“Monday”: Date.
“Sarah”: Person.
“Acme Corporation”: Organization.
“Paris”: Location.
“$1.5 million”: Monetary Value.
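To see this in practice, here is a minimal sketch using spaCy (covered in more detail below). It assumes you have installed the library and its small English model; note that spaCy's label names differ slightly from the list above (for example, it tags locations as GPE):

import spacy

# Assumes: pip install spacy
#          python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp(
    "On Monday, Sarah from Acme Corporation flew to Paris "
    "to sign a deal worth $1.5 million."
)
for ent in doc.ents:
    print(ent.text, "->", ent.label_)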
Good! But why is this important? Well, entity extraction is a fundamental step for many applications that rely on data because it helps transform unstructured text into structured data. For example, you could use NER to extract entities from scraped data. Then, use these entities to feed a machine learning model to detect patterns in the data you scraped.
In essence, you can use entity recognition as a first step in making sense of vast amounts of text data. And you know…in web scraping, you retrieve tons of data every day!
The list below describes models and libraries you can use for entity extraction. It begins with those most suitable for research use cases and ends with those best suited for production environments:
NLTK: The Natural Language Toolkit (NLTK) is a foundational library for building Python programs to work with human language data. While modern libraries like spaCy and Hugging Face Transformers have surpassed it for production-level applications in terms of performance, NLTK remains an invaluable tool for education, research, and prototyping use cases.
Hugging Face transformers (BERT, RoBERTa, DistilBERT): Hugging Face provides easy access to some of the most powerful and accurate deep learning models available. While Hugging Face itself is a hub for models and libraries, its transformers library is the primary way developers use transformer-based models like BERT, which, among other NLP tasks, can be used for NER (see the sketch after this list).
spaCy: It is the industry standard for fast, efficient, and production-ready NLP in Python. Its core philosophy is “opinionated”, meaning it does not aim to be a research toolkit with a multitude of algorithms for every task. Instead, spaCy provides a single state-of-the-art implementation for each NLP capability. This design choice makes it the best tool for developers who need to integrate NLP into their products and services without the overhead of academic experimentation.
Cloud-Based APIs (Google Cloud NLP, Amazon Comprehend, Azure Language): These platforms offer pre-trained models as a fully managed service. This is the choice for businesses that want state-of-the-art results without the overhead of managing models or infrastructure for NLP tasks.
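To give a taste of the transformers route, here is a minimal sketch with the token-classification pipeline. The checkpoint dslim/bert-base-NER is just one publicly available BERT model fine-tuned for NER on the Hugging Face Hub; any token-classification checkpoint works the same way:

from transformers import pipeline

# Assumes: pip install transformers torch
# aggregation_strategy="simple" merges word pieces into whole entities
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
)

for ent in ner("Sarah from Acme Corporation flew to Paris."):
    print(ent["word"], ent["entity_group"], round(float(ent["score"]), 2))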
This episode is brought to you by our Gold Partners. Be sure to have a look at the Club Deals page to discover their generous offers available for the TWSC readers.
💰 - 1 TB of web unblocker for free at this link
💰 - 50% Off Residential Proxies
💰 - Use TWSC for 15% OFF | $1.75/GB Residential Bandwidth | ISP Proxies in 15+ Countries
Entity Extraction: Use Cases in The Scraping Industry
Having described the high-level process, let’s now consider some cases where NER can be useful on scraped data.
Finance and Investment
Consider an investment firm that wants to automate the analysis of quarterly earnings reports and financial news to make trading decisions. They scrape thousands of news articles from sources like Bloomberg and Reuters, as well as the full text of publicly filed earnings reports.
If you apply an NER model to text in this industry, it could identify entities like:
ORGANIZATION: “Apple Inc.”, “NVIDIA”, “JPMorgan Chase”.
MONEY: “$1.2 trillion”, “€50 million loss”, “revenue of $22 billion”.
PERSON: “Tim Cook” (CEO), “Jerome Powell” (Fed Chair).
DATE: “Q4 2024”, “fiscal year ending September 30th”.
PRODUCT: “iPhone 16”, “Vision Pro”.
This process helps you transform unstructured news into a structured database of labeled data. Once the process is complete, analysts can run queries like the following (a small code sketch after the list shows the idea):
“Show me all tech companies that missed revenue expectations in the last quarter.”
“Alert me when a CEO of a company in our portfolio is mentioned in the same article as the word ‘investigation’.”
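As a sketch of how such queries become possible, the snippet below (with made-up one-line articles, purely for illustration) collects the extracted entities into a pandas DataFrame and checks which articles mention both an organization and a money figure:

import pandas as pd
import spacy

nlp = spacy.load("en_core_web_lg")

# Hypothetical scraped snippets, for illustration only
articles = [
    "Apple Inc. reported revenue of $22 billion for Q4 2024.",
    "Jerome Powell spoke about interest rates on Monday.",
]

rows = [
    {"article_id": i, "entity": ent.text, "label": ent.label_}
    for i, text in enumerate(articles)
    for ent in nlp(text).ents
]
df = pd.DataFrame(rows)

# Which articles mention both an ORG and a MONEY entity?
labels_per_article = df.groupby("article_id")["label"].agg(set)
print(labels_per_article[labels_per_article.apply(lambda s: {"ORG", "MONEY"} <= s)])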
E-commerce and Competitive Intelligence
Let’s consider the case of a company that sells headphones and wants to understand the market landscape and customer feedback by analyzing competitor product reviews. To do so, they scrape 50,000 customer reviews for the top 10 best-selling headphones from Amazon.
An NER model that processes the review texts could classify mentions of entities like the following:
PRODUCT: “Sony WH-1000XM5”, “Bose QuietComfort Ultra”.
ORGANIZATION: “Bose”, “Sennheiser” (competitors mentioned by customers).
FEATURE: “noise cancellation”, “battery life”, “comfort”, “Bluetooth connectivity”.
DEFECT: “plastic hinge cracked”, “app won’t connect”, “poor mic quality”.
So, instead of manually reading one review after another, the company can build a dashboard on top of the entities classified by the NER model. Analysts can then see, for example, that customers love the “noise cancellation” on a competitor’s product but frequently complain about its “poor mic quality”. This provides direct insights for product development and marketing, highlighting competitor weaknesses to exploit and features to prioritize.
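One caveat: labels like FEATURE and DEFECT are not part of spaCy's stock models, so in practice they would come from a custom-trained model. A lightweight rule-based alternative is spaCy's EntityRuler with hand-picked patterns; a minimal sketch (the patterns here are examples, not a complete list):

import spacy

# A blank English pipeline plus an EntityRuler needs no model download;
# with a trained model, add the ruler via nlp.add_pipe("entity_ruler", before="ner")
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "FEATURE", "pattern": "noise cancellation"},
    {"label": "FEATURE", "pattern": "battery life"},
    {"label": "DEFECT", "pattern": [{"LOWER": "poor"}, {"LOWER": "mic"}, {"LOWER": "quality"}]},
])

doc = nlp("Great noise cancellation, but poor mic quality on calls.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)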
Human Resources and Talent Acquisition
In this case, you can consider a large tech company that needs to analyze the job market to ensure its job descriptions and salary offers are competitive. To do so, they scrape thousands of job postings for “Software Engineer” roles from different websites like LinkedIn, Indeed, and their competitors’ career pages.
In this case, an NER model extracts details from each job description, such as:
SKILL: “Python”, “React”, “AWS”, “Kubernetes”.
DEGREE/CERTIFICATION: “Master’s degree”, “PhD”, “AWS Certified Developer”.
SENIORITY: “5+ years”, “minimum of 3 years”.
SALARY: “$150,000”, “$90,000/year”.
LOCATION: “On-site: Austin, TX”, “Remote (US)”, “London, UK”.
Thanks to this entity extraction, analysts in the HR department can analyze trends in the specific job searches they are working on. They can answer questions like:
“What are the top 5 most requested skills for Data Scientists in the USA?”
“What is the average salary range companies are offering for a senior engineer with 7 years of experience?”
This approach can help them attract top talent by writing more competitive and attractive offers in their job descriptions.
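Assuming you already have the extracted (entity, label) pairs, answering the first question reduces to a simple aggregation. A sketch with made-up data, where SKILL is a custom label (for example, from a fine-tuned model or an EntityRuler like the one sketched earlier):

from collections import Counter

# Hypothetical (text, label) pairs from a model that knows a SKILL label
extracted = [
    ("Python", "SKILL"), ("AWS", "SKILL"), ("Python", "SKILL"),
    ("React", "SKILL"), ("Kubernetes", "SKILL"), ("PhD", "DEGREE"),
]

skill_counts = Counter(text for text, label in extracted if label == "SKILL")
print(skill_counts.most_common(5))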
Marketing and Brand Management
Imagine a global sneaker brand that wants to monitor its brand perception and identify the sentiment around it. They scrape posts mentioning their brand name from Instagram, TikTok, Reddit, and popular fashion blogs.
The scraped text and post captions are analyzed by an NER model that can identify entities like the following:
PERSON: “Michael Jordan”, “Tom Hanks”.
PRODUCT: “Air Jordan 1”, “Nike Dunk Low”, “Pegasus 41”.
EVENT: “Paris Fashion Week”, “NBA Finals”, “Coachella”.
COMPETITOR: “Adidas”, “New Balance”, “Hoka”.
Once the entities are extracted from the text, you can proceed with subsequent analyses, such as performing sentiment analysis on the data you scraped.
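As a pointer, the sentiment-scoring step can itself be a few lines with an off-the-shelf model. A sketch using the Hugging Face pipeline (with no model argument, it falls back to a default English sentiment checkpoint; the post text is made up):

from transformers import pipeline

# Assumes: pip install transformers torch
sentiment = pipeline("sentiment-analysis")

# Hypothetical scraped post mentioning an extracted PRODUCT entity
print(sentiment("The Air Jordan 1 drop at Coachella was amazing!"))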
Before continuing with the article, I wanted to let you know that I've started my community in Circle. It’s a place where we can share our experiences and knowledge, and it’s included in your subscription. Enter the TWSC community at this link.
Using NLP for Entity Extraction From Scraped Data: Step-by-step Tutorial
This section presents two use cases where you can apply NER to scraped data:
E-commerce data.
News data.
You will learn how to use the spaCy library in both cases. You will also learn why the model at the core of spaCy is more suitable for one case (news data), and what to do in the other (e-commerce data).
Follow along with the next subsections to learn more.
Requirements and Dependencies
To reproduce the following tutorials, you need to have at least Python 3.10.1 installed on your machine.
Call the main folder of your project ner-scraping/. At the end of this step, it will have the following structure:

ner-scraping/
├── ner-products.py
├── ner-news.py
└── venv/

Where:

ner-products.py contains the logic for solving the e-commerce case using NER.
ner-news.py contains the logic for solving the news case using NER.
venv/ contains the virtual environment.

You can create the venv/ virtual environment directory like so:

python -m venv venv

To activate it, on Windows, run:

venv\Scripts\activate

Equivalently, on macOS and Linux, execute:

source venv/bin/activate

In the activated virtual environment, install the spaCy library:

pip install spacy

Finally, install en_core_web_lg, the pre-trained model used in both tutorials:

python -m spacy download en_core_web_lg

Perfect! Everything is set up for proceeding with the tutorials.
Extracting Entities from Scraped Data: An E-commerce Case
Let’s consider a common case in the e-commerce industry. You often scrape data from several e-commerce websites and want to understand whether, among all that data, some records refer to the same products. This is useful, for example, for price comparison.
In other words, you want to consolidate products. With e-commerce data, this need arises, for example, when the same product is sold at prices in different currencies, or simply when you want to check whether some store has overpriced a product.
Consider the following data retrieved from different e-commerce websites:
This data refers to headphones and reports, among other fields, the following information:
The source: the e-commerce website the record was scraped from.
The description: the description of the headphones, including their features, color, product name, etc.
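The scraped rows themselves are not reproduced here. Purely as a stand-in, a hypothetical products.csv consistent with the outputs shown in Step #2 could look like this:

source,description
SoundSavvy.com,"Sony WH-1000XM5 Wireless Headphones, Industry Leading Noise Canceling, Integrated Processor V1"
TechGiant.de,"Premium wireless over-ear headphones with up to 30 hours of battery life"
AudioBargains.net,"Refurbished Sony wireless headphones, black"
HiFiHaven.co.uk,"Bose QuietComfort Ultra Headphones - Immersive Audio - Sandstone, model 8849-C-789"
GlobalTech.com,"Bose QC Ultra Headphones (Black) with all-day comfort"
iUniverse.com,"AirPods Max - Silver, AirPods with Active Noise Cancellation"
SmartGadgets.ca,"AirPods Max over-ear headphones by Apple"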
The next steps will walk you through implementing an NER model with spaCy.
Step #1: Write the Code for Entity Extraction
Say the data you scraped is stored in the products.csv file. To extract the entities, write the following code in the ner-products.py file:
import csv

import spacy

# Load the pre-trained model
nlp = spacy.load("en_core_web_lg")

# Open and read the CSV file
with open("products.csv", "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        source = row["source"]
        text = row["description"]
        print(f"--- Entities for {source} ---")
        # Process the text with the model
        doc = nlp(text)
        # Print the identified entities and their labels
        for ent in doc.ents:
            print(f"- Entity: '{ent.text}', Label: '{ent.label_}'")
        print("\n")
This code does the following:
Loads the pre-trained model that will intercept the entities in the textual data for NER.
Opens the CSV file with the data and reads the “source” and “description” columns. In other words, it processes the data that reports the e-commerce website and the product descriptions.
Processes the text and prints the results.
Very well. You are ready to run the code!
Step #2: Run The Code and Analyze The Results
After running the code, the results you will obtain are as follows:
--- Entities for SoundSavvy.com ---
- Entity: 'Sony', Label: 'ORG'
- Entity: 'Industry Leading Noise Canceling', Label: 'ORG'
- Entity: 'Integrated Processor V1', Label: 'ORG'

--- Entities for TechGiant.de ---
- Entity: 'up to 30 hours', Label: 'TIME'

--- Entities for AudioBargains.net ---
- Entity: 'Sony', Label: 'ORG'

--- Entities for HiFiHaven.co.uk ---
- Entity: 'Bose QuietComfort Ultra Headphones - Immersive Audio - Sandstone', Label: 'ORG'
- Entity: 'Bose Immersive Audio', Label: 'PRODUCT'
- Entity: '8849-C-789', Label: 'DATE'

--- Entities for GlobalTech.com ---
- Entity: 'Bose QC Ultra Headphones (Black', Label: 'ORG'
- Entity: 'all-day', Label: 'DATE'

--- Entities for iUniverse.com ---
- Entity: 'AirPods Max - Silver', Label: 'PRODUCT'
- Entity: 'AirPods', Label: 'PRODUCT'
- Entity: 'Active Noise Cancellation', Label: 'ORG'

--- Entities for SmartGadgets.ca ---
- Entity: 'AirPods Max', Label: 'PRODUCT'
- Entity: 'Apple', Label: 'ORG'
As you can see, the model has not recognized the entities as you might have expected. Some data is labeled as “ORG” (organization) when it should not be, and other data is not labeled at all. This happened because the model is not the right one for recognizing entities in the e-commerce domain. At this point, you have two options:
You can try another model, perhaps a pre-trained BERT.
You can fine-tune a model on e-commerce data so it learns the entities of that domain.
Choosing one or the other mainly depends on you. Based on my experience, pre-trained models may not produce the results you expect in specific industries, as they are generalist models. So, if your major scraping commitment is to e-commerce, it is worth fine-tuning an LLM so you can reuse it!
Extracting Entities from Scraped Data: A News Case
The news industry is a better fit for the en_core_web_lg model: it was trained on written web text that includes plenty of news, so it is good at extracting entities such as names, companies, monetary values, and the like.
Consider a situation where you have scraped the summaries of several news articles from different websites, like the following:
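The article summaries themselves are not reproduced here. Purely as a stand-in, a hypothetical news.csv consistent with the entities extracted in Step #2 could look like this:

article_text
"Global Tech Inc. CEO Jane Doe announced yesterday in Berlin an investment of over $1.2 billion in the new Orion chip line, to be produced in Germany."
"During the first week of December, Alex Schmidt will meet French delegates at the Paris Climate Summit, with both France and the United States expected to attend."
"Analyst John Carter noted that Quantum Dynamics Corp. shares rose more than 15% in the third quarter, closing at 250.50 on the New York Stock Exchange this morning."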
Again, as in the previous example, you want to extract the entities from these news items. The reasons for doing so can vary: maybe you want to catch news regarding the CEO of a specific company, maybe you want to track all the news about a given company, or maybe you have other goals entirely.
In the next step, you will apply spaCy to extract entities in such a case.
Step #1: Write the Code for Entity Extraction
Suppose the data you scraped is stored in the news.csv file. To extract the entities, write the following code in the ner-news.py file:
import csv

import spacy

# Load the pre-trained English model
nlp = spacy.load("en_core_web_lg")

# Open and read the CSV file
with open("news.csv", "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    article_num = 1
    for row in reader:
        text = row["article_text"]
        print(f"--- Entities for Article #{article_num} ---")
        # Process the text with the model
        doc = nlp(text)
        # Print the identified entities and their labels
        for ent in doc.ents:
            print(f"- Entity: '{ent.text}', Label: '{ent.label_}'")
        print("\n")
        article_num += 1
Similarly to before, this code:
Loads the pre-trained model that will intercept the entities in the textual data for NER.
Opens the CSV file with the data and reads the “article_text” column.
Processes the text and prints the results.
Terrific! You are ready to run the code.
Step #2: Run The Code and Analyze The Results
After running the code, you will obtain results as follows:
--- Entities for Article #1 ---
- Entity: 'Global Tech Inc.', Label: 'ORG'
- Entity: 'Jane Doe', Label: 'PERSON'
- Entity: 'yesterday', Label: 'DATE'
- Entity: 'Berlin', Label: 'GPE'
- Entity: 'over $1.2 billion', Label: 'MONEY'
- Entity: 'Orion', Label: 'PRODUCT'
- Entity: 'Germany', Label: 'GPE'

--- Entities for Article #2 ---
- Entity: 'the first week of December', Label: 'DATE'
- Entity: 'Alex Schmidt', Label: 'PERSON'
- Entity: 'French', Label: 'NORP'
- Entity: 'the Paris Climate Summit', Label: 'EVENT'
- Entity: 'France', Label: 'GPE'
- Entity: 'the United States', Label: 'GPE'

--- Entities for Article #3 ---
- Entity: 'John Carter', Label: 'PERSON'
- Entity: 'Quantum Dynamics Corp.', Label: 'ORG'
- Entity: 'more than 15%', Label: 'PERCENT'
- Entity: 'the third quarter', Label: 'DATE'
- Entity: '250.50', Label: 'MONEY'
- Entity: 'the New York Stock Exchange', Label: 'ORG'
- Entity: 'this morning', Label: 'TIME'
Good job! As you can see, the pre-trained model at the core of spaCy provides much better results when applied to news than to e-commerce data. This shows the importance of using the right model for the right use case.
Now that the model has extracted the entities, you can create labels for your scraped data and analyze it with pandas and Matplotlib.
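For instance, a quick sketch (with made-up rows standing in for the entities you just extracted) that counts entities per label and plots the distribution:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical rows; in practice, collect (ent.text, ent.label_)
# pairs inside the loop from Step #1 instead of printing them
df = pd.DataFrame(
    [
        ("Jane Doe", "PERSON"), ("Berlin", "GPE"), ("Germany", "GPE"),
        ("Orion", "PRODUCT"), ("over $1.2 billion", "MONEY"),
    ],
    columns=["entity", "label"],
)

df["label"].value_counts().plot(kind="bar", title="Entities per label")
plt.tight_layout()
plt.show()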
Conclusion
In this article, you have gone through what entity extraction in NLP is and how it can be useful when applied to scraped data. The tutorials you have implemented have also shown the importance of using the right model for extracting the entities. And if you don’t find the right model, you can fine-tune an LLM of your choice.
To conclude, let’s discuss: do you use such methods for extracting entities from your scraped data?