Machine learning models for detecting bot detection triggers
Learn how to identify key scraping indicators and choose the right machine learning strategy to protect your online resources.
Web scraping has become very common because, in our digital era, data is a currency, and websites are its vast repositories. The popularity of web scraping creates several problems for website owners:
Their servers slow down from extra traffic.
They pay more for bandwidth.
Sometimes, their important content gets taken quickly by scrapers.
As scraping tools get smarter, websites face more risks to their performance and data. The main challenge, however, is telling automated scraping tools apart from normal visitors. Old methods like blocking certain IP addresses or limiting the number of requests just don't work well anymore.
Modern scraping tools use different IP addresses, change how they appear online, and even try to act like human users. This is where Machine Learning (ML) can become very helpful. ML models can spot the small clues and patterns (the *triggers*) that even clever bots can't hide completely. While basic defenses only catch obvious scraping, ML can detect the more subtle signs.
In this article, you’ll learn how machine learning models can help you detect scraping activities on your website.
Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to 1 TB of web unblocker for free.
Understanding Web Scraping Bot Detection Triggers
To effectively apply ML as an anti-bot technique, you first need to understand the signals—the triggers—that can differentiate automated scraping activity from human-like web traffic.
This section provides an overview of how triggers can be categorized.
Request Pattern Triggers
These triggers relate to how a client requests resources from the server:
High request volume and velocity: A common indicator of a scraping bot is an abnormally high number of requests originating from a single IP address or within a single session in a short period. Similarly, pages might be loaded or navigated at speeds difficult for humans to achieve (see the sketch after this list).
Sequential and deep traversal: Automated scraping tools often systematically crawl through website structures. For example, a bot might iterate through all product IDs in a predictable sequence (for instance, product_1, product_2, and so on) or navigate deeply into paginated category listings without deviation. This contrasts with typical human browsing, which is often varied and randomized.
Targeted content access: Automated tools are usually programmed for a specific data extraction goal, such as retrieving prices, product descriptions, or user reviews. They might repeatedly access only these data-rich sections of websites, ignoring navigational elements, images, or JavaScript files that a human browser would typically download and render.
Lack of engagement with dynamic content: Many modern websites rely heavily on JavaScript to render content or provide interactive features. Some scraping tools, especially simpler ones, might not execute JavaScript, or they might ignore interactive elements like ads, videos, or user interface components that humans typically engage with.
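To make the first trigger above more concrete, here is a minimal sketch of how per-IP request rates could be computed from parsed access-log records. The record format and the 100-requests-per-minute threshold are illustrative assumptions, not values taken from a real anti-bot system.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical parsed access-log records: (ip, timestamp) pairs.
requests = [
    ("203.0.113.7", datetime(2024, 1, 1, 12, 0, 1)),
    ("203.0.113.7", datetime(2024, 1, 1, 12, 0, 2)),
    ("198.51.100.4", datetime(2024, 1, 1, 12, 0, 5)),
    # ... thousands more in a real log
]

REQUESTS_PER_MINUTE_THRESHOLD = 100  # assumed cut-off, tune per site

def requests_per_minute(records):
    """Count requests per IP, bucketed by minute."""
    counts = defaultdict(int)
    for ip, ts in records:
        bucket = ts.replace(second=0, microsecond=0)
        counts[(ip, bucket)] += 1
    return counts

def flag_high_volume(records):
    """Return IPs whose request rate exceeds the threshold in any minute."""
    return {
        ip
        for (ip, _minute), count in requests_per_minute(records).items()
        if count > REQUESTS_PER_MINUTE_THRESHOLD
    }

suspicious_ips = flag_high_volume(requests)
```

In a real pipeline, counts like these would become features for the ML models discussed later rather than hard blocking rules on their own.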
This episode is brought to you by our Gold Partners. Be sure to have a look at the Club Deals page to discover their generous offers available for the TWSC readers.
💰 - 50% off promo on residential proxies using the code RESI50
💰 - 50% Off Residential Proxies
🧞 - Scrapeless is a one-stop shop for your anti-bot bypassing needs.
Session and Behavioral Triggers
These triggers focus on the characteristics of a user's session and their interaction with the website:
Short session durations with high activity: An automated session might make hundreds of requests in a very short time before the session ends or the IP changes. Human sessions with comparable activity levels are generally long-lasting.
Absence of human-like interactions: Humans exhibit natural, even if sometimes irregular, mouse movements, scrolling patterns, and keystroke dynamics when filling out forms. Automated sessions often have no mouse movement, programmatic (perfectly linear or instant) scrolling, or instantaneous form fills.
Consistent timings: Automated tools can operate with high precision. The time intervals between their requests or actions might be precisely consistent (for example, exactly 2 seconds between each page load), a regularity rarely seen in human behavior, as the sketch below illustrates.
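One simple way to quantify this timing regularity is the variance of the gaps between consecutive requests in a session: near-zero variance suggests scripted, metronome-like behavior. The snippet below is a minimal sketch with made-up timestamps.

```python
import statistics

def interval_variance(timestamps):
    """Variance of the gaps (in seconds) between consecutive requests in a session."""
    gaps = [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
    return statistics.pvariance(gaps) if len(gaps) > 1 else None

# Hypothetical sessions: request times in seconds since session start.
bot_like = [0, 2.0, 4.0, 6.0, 8.0, 10.0]      # exactly 2 seconds apart
human_like = [0, 3.1, 9.8, 12.4, 25.0, 31.7]  # irregular gaps

print(interval_variance(bot_like))    # ~0.0 -> suspiciously regular
print(interval_variance(human_like))  # noticeably larger
```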
Header and Fingerprint Triggers
These triggers derive from the technical information sent by the client's browser or device:
Non-standard or generic User-Agents: The User-Agent string identifies the browser and operating system. Automated tools may use outdated user-agents, known bot-associated user-agents, generic ones, or rapidly cycle through many different user-agents from a single IP address.
Missing or anomalous HTTP headers: Browsers send various HTTP headers with each request. Automated tools might send a minimal set of headers, miss standard ones, or present unusual combinations that don't match any known legitimate browser profile.
IP reputation and origin: Requests originating from data center IP ranges, public proxies, or certain VPN services are often highly correlated with automated activity. Geographically inconsistent requests can also be an indicator.
Headless browser signatures: Some scraping tools use headless browsers, which can execute JavaScript and mimic real browsers. However, these tools sometimes leave fingerprints that can be detected by anti-bot systems.
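As a rough sketch of how the header and fingerprint triggers above can be checked, the function below flags requests that are missing headers a mainstream browser would normally send, or whose User-Agent contains a well-known automation marker. The header list and markers are illustrative assumptions, not an exhaustive rule set.

```python
# Headers a mainstream browser would normally send (illustrative, not exhaustive).
EXPECTED_HEADERS = {"user-agent", "accept", "accept-language", "accept-encoding"}

# Substrings commonly associated with automation tooling (illustrative).
HEADLESS_MARKERS = ("headlesschrome", "phantomjs", "python-requests", "curl/")

def header_trigger_signals(headers: dict) -> dict:
    """Return simple header/fingerprint signals for one request."""
    lowered = {k.lower(): v for k, v in headers.items()}
    user_agent = lowered.get("user-agent", "").lower()
    return {
        "missing_headers": sorted(EXPECTED_HEADERS - lowered.keys()),
        "automation_user_agent": any(m in user_agent for m in HEADLESS_MARKERS),
    }

# Example: a minimal client that only sends a generic User-Agent.
print(header_trigger_signals({"User-Agent": "python-requests/2.31"}))
# {'missing_headers': ['accept', 'accept-encoding', 'accept-language'],
#  'automation_user_agent': True}
```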
Machine Learning Paradigms for Scraping Trigger Identification
With an understanding of triggers, you can explore how different machine learning paradigms can be applied to identify scraping behaviors.
Supervised Learning
In supervised learning, data scientists train models on datasets where they explicitly label each web request instance as either 'scraper'—automated traffic—or 'non-scraper'—legitimate human traffic. They derive features for these instances from web scraping triggers such as request rate patterns, suspicious user-agent strings, abnormally short session durations, and absence of natural mouse movement or JavaScript execution.
Below is how a typical training dataset can appear:
The dataset includes 10 key behavioral features that distinguish scrapers from legitimate users:
request_rate_per_minute: Scrapers typically make more frequent requests.
session_duration_seconds: Scrapers have shorter, more focused sessions.
user_agent_suspicious: Binary flag for suspicious user-agent strings. 1 represents a suspicious user agent, 0 represents a legitimate one.
mouse_movement_present: Binary flag for whether human-like mouse movement was registered during the session; legitimate users show natural mouse interactions.
javascript_enabled: Many scrapers disable JavaScript execution.
unique_pages_visited: Scrapers often target a specific set of pages rather than exhibiting varied browsing behavior.
referer_header_present: Missing referer headers can indicate automated requests.
cookie_acceptance: Human users are more likely to accept cookies.
request_interval_variance: Scrapers often have regular, predictable timing patterns.
http_error_rate: Scrapers may trigger more errors due to aggressive behavior.
Each set of features is associated with a label: scraper or non-scraper. In other words, engineers analyzed those behaviors and determined whether each request came from a legitimate user or not.
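Since the full dataset isn't reproduced here, the snippet below sketches what a couple of labeled rows with these ten features might look like, assuming pandas; all values are made up for illustration.

```python
import pandas as pd

# Two hypothetical labeled sessions using the ten features above (values are invented).
training_df = pd.DataFrame([
    {"request_rate_per_minute": 180, "session_duration_seconds": 45,
     "user_agent_suspicious": 1, "mouse_movement_present": 0,
     "javascript_enabled": 0, "unique_pages_visited": 150,
     "referer_header_present": 0, "cookie_acceptance": 0,
     "request_interval_variance": 0.01, "http_error_rate": 0.12,
     "label": "scraper"},
    {"request_rate_per_minute": 6, "session_duration_seconds": 420,
     "user_agent_suspicious": 0, "mouse_movement_present": 1,
     "javascript_enabled": 1, "unique_pages_visited": 8,
     "referer_header_present": 1, "cookie_acceptance": 1,
     "request_interval_variance": 14.7, "http_error_rate": 0.01,
     "label": "non-scraper"},
])
print(training_df.head())
```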
When you train a supervised learning algorithm on such data, it learns a mapping function that connects these behavioral input features to the scraping classification labels. During training, the algorithm adjusts its parameters to minimize prediction errors between its classifications and the actual labels. Once trained, the model can be used to predict whether new, incoming web traffic exhibits the characteristics of human behavior or of a bot.
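Continuing the sketch above, one reasonable (though not the only) choice is to fit a random forest from scikit-learn on such a frame and use it to score unseen sessions. The hypothetical new_session values are made up; in practice you would train on far more rows than the toy frame above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Split the toy frame from the previous sketch into features and labels.
features = training_df.drop(columns=["label"])
labels = training_df["label"]

# Fit the classifier on the labeled behavioral features.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(features, labels)

# Score an unseen session (hypothetical values).
new_session = pd.DataFrame([{
    "request_rate_per_minute": 95, "session_duration_seconds": 60,
    "user_agent_suspicious": 1, "mouse_movement_present": 0,
    "javascript_enabled": 0, "unique_pages_visited": 70,
    "referer_header_present": 0, "cookie_acceptance": 0,
    "request_interval_variance": 0.05, "http_error_rate": 0.08,
}])
print(model.predict(new_session))        # e.g. ['scraper']
print(model.predict_proba(new_session))  # class probabilities
```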
Before continuing with the article, I wanted to let you know that I've started my community in Circle. It’s a place where we can share our experiences and knowledge, and it’s included in your subscription. Enter the TWSC community at this link.
Unsupervised Learning
Data scientists use unsupervised learning when they work with unlabeled web traffic data. In other words, the data can look exactly like the example you saw in the previous section, but without the scraper or non-scraper label.
In this case, the goal of the algorithms is to autonomously identify inherent structures, distinct patterns, or anomalies within the raw web traffic data. A common way of doing so is called Clustering. In the context of web scraping, clustering algorithms group similar web sessions or IP addresses based on their behavioral features (e.g., request frequency, navigation paths, interaction timings). Below is an example of cluster data in the context of web scraping:
Analysts then examine clusters that appear distinctly separate, or those exhibiting highly homogeneous and systematic behavior, as these often represent specific types of automated activity, and validate whether the behaviors are legitimate or not.
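As a minimal sketch of the clustering idea, assuming scikit-learn and session-level features like those described earlier (the values and the number of clusters are arbitrary), sessions could be grouped like this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical unlabeled sessions: [requests_per_minute, session_seconds, interval_variance]
sessions = np.array([
    [5,   380, 12.1],
    [7,   290, 18.4],
    [200, 40,  0.02],
    [180, 35,  0.01],
    [6,   510, 9.7],
    [220, 30,  0.03],
])

# Scale features so no single feature dominates the distance metric.
scaled = StandardScaler().fit_transform(sessions)

# Group sessions into two clusters; the tight, high-rate / low-variance
# cluster is the one analysts would inspect as likely automation.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(scaled)
print(cluster_ids)  # e.g. [0 0 1 1 0 1]
```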
Semi-Supervised Learning
Semi-supervised learning offers a middle ground between the previous two approaches. Data scientists combine a small amount of labeled web traffic data with a large volume of unlabeled data, and the few labeled instances help guide the learning process.
Simultaneously, the abundant unlabeled data helps the model better understand the structure and overall distribution of web traffic patterns. This approach is useful in web scraping detection because obtaining a vast, perfectly labeled dataset is often impractical. Yet, security teams might have access to some confirmed instances of scraping or legitimate non-scraping behavior.
To help you understand better, let's imagine a scenario where we have a lot of web traffic data, but we've only managed to label a few specific instances as "scraper" or "legitimate user." A semi-supervised model would use these few labels to help classify the rest. The analysis performed by a semi-supervised learning model can look like the following image:
In this case, the model looks at the features of the blue "legitimate" points and the red "scraper" points. It then uses this information, along with the patterns it observes in all the gray "unlabeled" points, to make predictions about how to classify the rest of the gray dots.
Essentially, the labeled data provides "anchors" or "seeds" of truth. The model then tries to draw boundaries or understand the distributions in a way that respects these known labels while also making sense of the overall structure of the unlabeled data. For example, unlabeled points that are very close (in terms of their features) to the red 'X's are more likely to be classified as scrapers, and those close to the blue circles are more likely to be legitimate.
This approach is powerful because you get some of the accuracy benefits of supervised learning without needing to label every single piece of data, which is often too costly and time-consuming.
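A minimal sketch of this idea, using scikit-learn's LabelSpreading as one common semi-supervised method: unlabeled rows are marked with -1, and all feature values are made up for illustration.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Feature rows: [requests_per_minute, interval_variance, mouse_movement_present]
X = np.array([
    [210, 0.02, 0],   # labeled scraper
    [5,   15.3, 1],   # labeled legitimate
    [190, 0.04, 0],   # unlabeled
    [8,   11.9, 1],   # unlabeled
    [230, 0.01, 0],   # unlabeled
])

# 1 = scraper, 0 = legitimate, -1 = unlabeled (the convention LabelSpreading expects).
y = np.array([1, 0, -1, -1, -1])

model = LabelSpreading(kernel="knn", n_neighbors=2)
model.fit(X, y)

# The model propagates the two known labels to the unlabeled rows.
print(model.transduction_)  # e.g. [1 0 1 0 1]
```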
The Big Challenge: Getting Features from Many Places
Machine learning models—even the best ones—need good and relevant information to work well on unseen data. This means you need to extract, select, and transform the most useful pieces of information from the raw data.
However, the major problem is that organizations generally have the raw data spread out across different data sources. To use this data well, engineers need a solid plan to pull out and combine features. This generally requires extracting this data from:
Web server logs: These logs give details for every request, like the visitor's IP address, what page they asked for, what browser they were using, and where they came from. While useful, server logs alone often don't give enough detail to catch bots that act like humans on the surface.
Content Delivery Network (CDN) logs: CDNs help websites load faster by storing content closer to users. They also create their own logs. These can provide information about traffic before it gets to the main website servers, like where visitors are from, how content is being used, and sometimes early warnings about possible threats from the CDN company. So, adding CDN logs to server logs helps build a wider story of each request.
Web Application Firewall (WAF) logs: WAFs check website traffic for known bad patterns and can block or log “strange” requests. These logs can point out requests that set off specific security alarms, giving clear signs of possibly harmful automated activity.
Client-side information: Maybe the most detailed—but also the trickiest—information comes from collecting data directly from the user's browser using JavaScript. This can include things like mouse movement patterns, how fast and far someone scrolls, how they type into forms, details about their browser and device (like screen size, plugins, fonts), and whether their browser can run JavaScript. This information is very helpful for finding headless browsers or bots that don't show real human-like actions.
Understandably, the main difficulty isn't just accessing these data sources, but putting them together correctly and quickly to create a training dataset like the one you saw above. Each data source often has its own way of storing information, its own logging rules, and its own way of recording time. In this scenario, engineers must create systems that can:
Collect data: Reliably gather huge amounts of data from these different sources. This must also be repeated at defined intervals, as new data comes in continuously.
Make data uniform: Change the different data formats into one standard format that’s good for pulling out features that describe the behaviours (scraper and non-scraper), even if you don’t label them.
Link related events: Connect related actions across different logs for a single user visit or even a single request. For example, matching a CDN log entry to the right web server log entry and then linking that to the browser behavior data for the same action needs smart ways to track visits and connect events. Differences in timing or missing common IDs can make this very tricky.
Create features: Make useful features from the combined data. This might mean calculating things over time (like how many requests per minute from one IP, using server logs), giving scores for “strange things” in headers (from server/CDN logs), or measuring how complex mouse movements are.
How well you do this multi-source feature work affects how well the ML model can "see" everything a bot (or a human) is doing. If a system only uses server logs, it might easily miss a clever scraper that uses a headless browser and changes IPs but doesn't show any real mouse movement—a detail only browser information could show. So, even though it's complicated, building the ability to pull out and combine data from different parts of the website system to create the features that will feed the ML model is fundamental. Without this step, you can’t use ML for scraping detection.
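As a simplified sketch of that correlation step, assuming pandas and that both sources already share a session_id (which, as noted above, is often the hard part in practice), server-log and client-side signals might be joined and turned into features like this:

```python
import pandas as pd

# Aggregated per-session counts derived from web server logs (hypothetical).
server_features = pd.DataFrame([
    {"session_id": "a1", "requests": 340, "duration_s": 50},
    {"session_id": "b2", "requests": 12,  "duration_s": 410},
])

# Per-session signals collected client-side via JavaScript (hypothetical).
client_features = pd.DataFrame([
    {"session_id": "a1", "mouse_events": 0,   "js_executed": 0},
    {"session_id": "b2", "mouse_events": 184, "js_executed": 1},
])

# Link the two sources on session_id and derive combined features.
combined = server_features.merge(client_features, on="session_id", how="left")
combined["requests_per_minute"] = combined["requests"] / (combined["duration_s"] / 60)
combined["no_mouse_but_high_rate"] = (
    (combined["mouse_events"].fillna(0) == 0) & (combined["requests_per_minute"] > 100)
)
print(combined)
```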
Making a Smart Choice: Picking the Right ML Approach
Once engineers have decided on a strategy for collecting and creating good features, they have another big decision to make: choosing the machine learning approach—supervised, unsupervised, or semi-supervised. This choice is very important because it affects what the detection system can do, how much work it takes to run, and how well it can keep up with web scrapers that are always changing.
The Supervised Learning Way: Accurate, But Needs Work
Supervised learning, as discussed, trains models on data that’s marked as scraper or non-scraper (but you can call them what you want, like bot or human). Here are the pros and cons of this approach:
Why it's good: When you have lots of good, accurate labels, supervised models give the best results for catching known scraping patterns. They learn direct links between certain combinations of triggers and how to classify them.
The problems: The challenges to solve using this paradigm are multiple:
The labeling headache: The biggest issue is the constant need for correct, freshly labeled data. Manually labeling website traffic takes a lot of time and money, and you need experts to correctly spot tricky scraping automation. If your labeled data is old or wrong, your model won't work well.
Things change over time: Web scrapers continuously change how they work. A supervised model trained on old scraping patterns might not catch new ways scrapers try to avoid detection. This "concept drift" means you have to retrain your model often with new labeled data, which keeps the labeling problem going.
Starting from scratch: When you set up a detection system on a new website or see completely new kinds of traffic, how do you get the first set of labels to train your first supervised model? This often means doing manual checking or using simpler rules before you can build a good supervised model.
The Unsupervised Learning Way: Finding New Things
Here are the pros and cons of this approach:
Why it's good: Its main advantage is that it can find new or "zero-day" scraping bots that no one has labeled before. By spotting traffic that’s different from "normal" behavior, it can flag suspicious actions that supervised models might miss.
The problems:
What is "normal"?: What counts as "normal" website traffic can change a lot and be very different for different parts of a website or at different times of the day. An unsupervised model might flag real but unusual user behavior as strange, leading to mistakes—which are called false positives.
Understanding and acting: When an unsupervised model flags something as “strange” or a group of actions as “suspicious”, it can be hard to know exactly why it was flagged. Explaining these findings to others and deciding what to do (like block, challenge, or just watch) can be harder than with supervised models, where the learned rules are often clearer.
The "is it interesting?" problem: Unsupervised methods might find many patterns or strange things, but not all of them will be unwanted scraping. Looking through these findings to pick out truly bad activity takes extra work.
The Semi-Supervised Learning Way: A Mix of Both
Semi-supervised learning tries to use the best parts of both by using a small amount of labeled data to help the learning process on a much larger amount of unlabeled data:
Why it's good: It can reduce the amount of labeling needed compared to fully supervised learning. It still gets some guidance from the labels, which can lead to better results than purely unsupervised methods.
The problems:
Depends on the first labels: The quality of the first small set of labels is very important. If these first labels are biased or wrong, they can lead the model down the wrong path.
Model assumptions: Many semi-supervised methods assume certain things about how the data is structured; for example, that data points close to each other are likely to have the same label. If these assumptions don't hold for the website traffic data, the model might not work well.
It's complicated: Setting up and adjusting semi-supervised models can be more complex than purely supervised or unsupervised ones.
Conclusion
If you are building anti-bot solutions, you know that navigating the complexities of web scraping requires sophisticated solutions. As explored in this article, machine learning provides a powerful pathway for organizations looking to enhance their detection capabilities.
What’s your experience building anti-bot systems? Are you using ML? Tell us in the comments!