New year, new resolutions, but the following sentence is always true: web scraping is getting harder and harder. Since we started our consultancy firm to help companies with their scraping needs, we have experienced firsthand the variety of challenges (and their solutions) that web scraping brings. For this reason, I wanted to start the year by publishing a reminder of the difficulties in the web scraping space and of which tools should be in our toolbelt for this year.
This article will describe the key components of the modern web scraping tech stack, highlighting their features, best uses, and how they work together to handle the challenges of today’s web.
I deliberately omitted the “Scraping with AI” theme, which deserves its own series of posts in the coming months. For this one, I will stick to the good old dirty web scraping school.
Are you ready?
The Basics: Browserless Frameworks
Browserless frameworks like Scrapy have been the cornerstone of many web scraping projects for years, and for a good reason—they are fast, efficient, and highly scalable. Browserless frameworks don’t rely on launching a browser to collect data. Instead, they make direct HTTP requests to websites, fetching the raw HTML, which is then parsed to extract the desired information.
Still, in my experience, if we want to scrape one hundred e-commerce websites today, probably 80% of them can be scraped with a browserless framework, since they don’t have complex anti-bot protections.
Scrapy is one of the most popular and versatile browserless frameworks. Written in Python, it is lightweight yet powerful, handling everything from small projects to large-scale operations. Its asynchronous architecture allows it to send and receive multiple requests simultaneously, making it incredibly fast. This concurrency is ideal for scraping large websites and can be easily set up and configured.
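To give an idea of how little code is needed, here is a minimal sketch of a Scrapy spider; the target URL and CSS selectors are hypothetical, just to show the shape of a typical spider:

```python
import scrapy


class ProductsSpider(scrapy.Spider):
    """Minimal spider: fetches listing pages and yields one item per product card."""
    name = "products"
    start_urls = ["https://example.com/products"]  # placeholder target

    def parse(self, response):
        # CSS selectors below are hypothetical and depend on the site's markup
        for card in response.css("div.product-card"):
            yield {
                "title": card.css("h2::text").get(),
                "price": card.css("span.price::text").get(),
            }
        # Follow pagination links; Scrapy schedules these requests concurrently
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```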
In fact, one of Scrapy’s strengths is its modular design. Developers can extend its functionality through custom middleware and pipelines, enabling it to handle unique challenges. For instance, middleware can be added to manage cookies, rotate user agents, or handle retries after failed requests. Over the years, as scraping became more difficult, several extensions have appeared that improve Scrapy’s chances of bypassing anti-bot systems.
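As an illustrative example, a tiny downloader middleware that rotates user agents could look like the sketch below (class and project names are made up for the example):

```python
import random

# settings.py (hypothetical project): enable the middleware
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RotateUserAgentMiddleware": 400}


class RotateUserAgentMiddleware:
    """Downloader middleware that picks a random User-Agent for each request."""

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",  # truncated examples
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # let Scrapy continue processing the request
```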
One of the most useful is certainly scrapy-impersonate, a package that modifies the TLS fingerprint of our scrapers. It seems like a small change, but as we saw in this 2023 article, it can bring real results. While today it’s probably not enough to bypass Cloudflare, I’ve used it successfully in several cases to handle Akamai bot protection.
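Based on my reading of scrapy-impersonate’s documentation, enabling it is mostly a matter of project settings plus a per-request meta key; treat the exact handler path and browser labels as assumptions to verify against the current README:

```python
# settings.py: route HTTP(S) traffic through scrapy-impersonate's download handler
DOWNLOAD_HANDLERS = {
    "http": "scrapy_impersonate.ImpersonateDownloadHandler",
    "https": "scrapy_impersonate.ImpersonateDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In the spider: pick the browser whose TLS fingerprint should be imitated
# yield scrapy.Request(url, meta={"impersonate": "chrome110"})
```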
Of course, you can also easily add proxies to your scraper to handle the rate limits imposed by the target websites, and use scraping APIs or Web Unblockers to bypass more demanding protections. If it fits my scraping budget, I still prefer this solution over building custom ones from the ground up with browser automation tools, given the great advantages in execution speed and the simplicity of the infrastructure needed.
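For reference, the standard way to attach a proxy in Scrapy is the proxy meta key handled by the built-in HttpProxyMiddleware; the gateway URL and credentials below are placeholders:

```python
import scrapy


class ProxiedSpider(scrapy.Spider):
    name = "proxied"

    def start_requests(self):
        # Replace the placeholder gateway with your provider's endpoint
        yield scrapy.Request(
            "https://example.com/products",
            meta={"proxy": "http://USERNAME:PASSWORD@proxy-gateway.example:8000"},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Fetched %s through the proxy", response.url)
```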
Residential proxies use IP addresses assigned to real devices, making them harder to detect. Datacenter proxies are faster but more easily flagged. Mobile proxies, which are associated with legitimate mobile networks, are particularly useful for bypassing geo-restricted content and offer high success rates.
Commercial scraping APIs or unblockers take proxy usage to the next level for more advanced needs. Tools like Zyte’s API or Oxylabs’ Web Scraper API handle proxy rotation, solve CAPTCHAs, manage JavaScript challenges, and mimic human behavior. By delegating these tasks to the API, you can focus on extracting data without worrying about the complexities of anti-bot defenses. Of course, this delegation comes with a price, and you need to understand the trade-off between the cost of these tools and how much you would spend developing and running a more complex homemade solution.
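The request shape differs from vendor to vendor; the sketch below approximates the pattern of Zyte’s extraction API as I recall it, so treat the endpoint, authentication scheme, and field names as assumptions to check against the official docs:

```python
import requests

# Approximate call shape; verify endpoint, auth, and parameter names against
# the provider's documentation before relying on this.
resp = requests.post(
    "https://api.zyte.com/v1/extract",          # assumed endpoint
    auth=("YOUR_API_KEY", ""),                  # assumed: API key as basic-auth user
    json={"url": "https://example.com/products", "browserHtml": True},
    timeout=120,
)
resp.raise_for_status()
html = resp.json().get("browserHtml", "")       # rendered page HTML
print(len(html))
```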
However, using Scrapy’s request-response model is not always possible. When websites use advanced anti-bot software that monitors user behavior or builds a detailed browser fingerprint, you have to use a browser automation tool to simulate an interaction with the target website that looks human.
When Scrapy Isn’t Enough: Browser Automation Tools
While Scrapy and other browserless frameworks are excellent for straightforward tasks, some websites are designed to frustrate simple scrapers. Dynamic content, heavy use of JavaScript, and sophisticated anti-bot systems often require tools that simulate user interactions. This is where browser automation tools like Playwright come into play.
Playwright, developed by Microsoft, is a robust framework that automates browser actions. It supports multiple browsers, including Chromium, Firefox, and WebKit, and allows developers to script interactions such as clicking buttons, filling forms, and scrolling through pages. Unlike Scrapy, which works with raw HTML, Playwright loads entire web pages in a real browser, enabling it to handle dynamic content.
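A minimal Playwright sketch looks like this; the selectors and the scrolling step are illustrative, not tied to a specific site:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products", wait_until="networkidle")
    # Scroll like a user would, to trigger lazy-loaded content
    page.mouse.wheel(0, 2000)
    page.wait_for_timeout(1000)
    # The selector is hypothetical; adapt it to the target site's markup
    titles = page.locator("h2.product-title").all_inner_texts()
    print(titles)
    browser.close()
```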
However, Playwright’s use of browser instances introduces a new family of challenges. Websites equipped with advanced anti-bot systems can detect browser automation frameworks through subtle clues known as fingerprints. These include discrepancies in JavaScript execution, missing browser features, or unusual interaction patterns. In 2023 and early 2024, it was enough to set the proper arguments when launching Playwright to make it less detectable, but today, this is no longer true. Detection via the CDP protocol is just one of the techniques anti-bot solutions use to recognize a Playwright-driven scraper, together with browser fingerprinting that reveals the script’s running environment. For this reason, we need to complement Playwright and other browser automation tools with new solutions.
Stealth and Fingerprint Masking
One solution to making Playwright more stealthy is to use tools like Patchright. This open-source library adjusts Playwright’s settings to mimic a real user’s browser environment. It modifies attributes like user agents, screen resolutions, and WebGL parameters, making it harder for anti-bot systems to identify automated sessions.
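Since Patchright aims to be a drop-in replacement for Playwright, switching usually means changing the import and, ideally, running a persistent profile with the real Chrome channel; the options below are illustrative assumptions rather than required settings:

```python
# Patchright mirrors Playwright's API (assumption: same sync_api entry point),
# so it can be swapped in by changing the import.
from patchright.sync_api import sync_playwright

with sync_playwright() as p:
    # A persistent profile with the real Chrome channel tends to look more like
    # an ordinary user session; these options are illustrative, not mandatory.
    context = p.chromium.launch_persistent_context(
        user_data_dir="./chrome-profile",
        channel="chrome",
        headless=False,
    )
    page = context.new_page()
    page.goto("https://example.com")
    print(page.title())
    context.close()
```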
Another option is anti-detect browsers, which go beyond basic fingerprint adjustments. Tools like Camoufox (open-source) or commercial solutions like GoLogin and Multilogin create virtual browser environments with highly customizable fingerprints. These browsers simulate human-like behavior, from mouse movements to keypresses, noticeably increasing the probability of staying undetected.
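Camoufox also ships a Python API that hands you a Playwright-compatible browser object; here is a rough sketch, with constructor options that are assumptions to verify against its README:

```python
# Assumption: Camoufox's Python package exposes a context manager returning a
# Playwright-compatible browser object; check its README for the exact options.
from camoufox.sync_api import Camoufox

with Camoufox(headless=True) as browser:
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
```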
Developers engaged in projects requiring account logins or intricate interactions can benefit from anti-detect browsers, which provide session management features. These features help keep each session distinct, utilizing separate cookies, local storage, and cache data. This functionality is particularly valuable for activities such as scraping social media sites or booking platforms, where it is essential to maintain consistent sessions.
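Anti-detect browsers bundle this session handling for you, but the underlying idea can be approximated with plain Playwright contexts and storage_state; here is a rough sketch of that approximation (file names are placeholders):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    # One isolated context per account: its own cookies, localStorage and cache.
    # "account_a.json" is a placeholder for a previously saved session file.
    context = browser.new_context(storage_state="account_a.json")
    page = context.new_page()
    page.goto("https://example.com/dashboard")
    # Persist the (possibly refreshed) session for the next run
    context.storage_state(path="account_a.json")
    browser.close()
```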
CAPTCHA Challenges
Even with stealth tools, many websites deploy CAPTCHAs as a final line of defense. CAPTCHAs are designed to distinguish humans from bots, and while they can be bypassed manually, this isn’t practical for large-scale scraping projects.
I’ve recently seen an increasing number of websites where certain actions, like clicking a button, are protected by CAPTCHAs and require a valid CAPTCHA token to continue the process.
In some cases, integrating automation tools with CAPTCHA-solving services like NopeCHA, 2Captcha, or Anti-Captcha can be effective.
These services send CAPTCHA challenges to human solvers or AI models, which return the correct answers. For example, a Playwright scraper can detect a CAPTCHA on a webpage, send the challenge to the solving service, and insert the solution programmatically, allowing the scraping process to continue seamlessly.
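As a hedged example, here is roughly how a Playwright scraper could hand a reCAPTCHA v2 challenge to 2Captcha via its Python client and inject the returned token; the injection target depends on the site, so it is an assumption:

```python
from twocaptcha import TwoCaptcha  # pip install 2captcha-python


def solve_recaptcha_v2(page, sitekey, page_url, api_key):
    """Send a reCAPTCHA v2 challenge to 2Captcha and inject the token back."""
    solver = TwoCaptcha(api_key)
    result = solver.recaptcha(sitekey=sitekey, url=page_url)  # blocks until solved
    token = result["code"]
    # Many sites read the token from the hidden 'g-recaptcha-response' textarea;
    # the exact injection target is an assumption and varies per site.
    page.evaluate(
        "t => document.querySelector('#g-recaptcha-response').value = t",
        token,
    )
    return token
```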
Some CAPTCHA solvers also provide browser extensions, particularly useful for scrapers relying on persistent browser sessions. By automatically solving CAPTCHAs in the background, these extensions reduce interruptions and improve the efficiency of scraping workflows.
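For Chromium-based setups, extensions can only be loaded into a persistent, headed context via command-line flags; here is a sketch of that pattern, with a placeholder extension path:

```python
from playwright.sync_api import sync_playwright

EXTENSION_PATH = "/path/to/unpacked-captcha-solver-extension"  # placeholder

with sync_playwright() as p:
    # Chromium loads unpacked extensions only in a persistent, headed context
    context = p.chromium.launch_persistent_context(
        user_data_dir="./profile-with-extension",
        headless=False,
        args=[
            f"--disable-extensions-except={EXTENSION_PATH}",
            f"--load-extension={EXTENSION_PATH}",
        ],
    )
    page = context.new_page()
    page.goto("https://example.com/login")
    context.close()
```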
With some specific CAPTCHA providers, there’s no way to bypass them other than acting like a human would, which slows down the scraping process.
Looking Ahead
The web scraping tech stack for 2025 illustrates the industry's changing needs. Tools such as Scrapy, Playwright, Patchright, and anti-detect browsers provide a wide range of options for addressing the challenges of today's complex websites.
However, success in web scraping isn’t just about technology: it’s about strategy, understanding the best entry point for the target website, and creating the most efficient plan for requesting the data needed. The more requests we make, the more expensive our operations become, both for us and for the target website.
Last but not least, success is not getting into legal trouble: not everything on the web can be scraped.