If you have followed these pages for some months, you probably remember the article I previously wrote about Botasaurus, an open-source scraping framework.
For me it was a pleasant surprise to discover this library and, thanks to the reach of The Web Scraping Club, I was able to get in touch with the developers, who recently shared with me the news of the new release.
What is Botasaurus?
Botasaurus, as its website mentions, is a Swiss Army knife for web scraping and browser automation that helps you create bots fast.
Basically, it is a framework that lets you create scrapers using either HTTP requests or a browser, and also automate tasks, such as running third-party libraries like Playwright and Selenium for scraping, or even generic Python scripts.
Key Features
Botasaurus is packed with great features, as you can see from the detailed Readme on its repository.
They can be grouped into five macro areas:
Decorators for Easy Configuration: Botasaurus offers three primary decorators: @browser, @request, and @task. These decorators allow you to configure your scrapers with various settings such as proxy usage, parallel scraping, and more (a minimal usage sketch follows this list).
Browser and Request Scraping: The @browser decorator enables you to scrape web pages using a humane browser, while the @request decorator allows for scraping using lightweight HTTP requests.
Task-Based Scraping: The @task decorator supports scraping using third-party libraries like Playwright and Selenium, as well as non-web scraping tasks such as data processing.
Utilities for Debugging and Development: Botasaurus includes several utilities for debugging and development, including bt for writing temporary files, Sitemap for accessing website links, and Cache for managing cache data.
Scale with Kubernetes: you can scale your scraper to multiple machines with Kubernetes and increase the speed of your operations.
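To get a quick feel for the decorators, here is a minimal sketch of the @request and @task styles, following the pattern shown in the project’s README; the target URL and the post-processing step are placeholders of mine, and configuration options such as proxies or parallel scraping would be passed as arguments to the decorators.

from botasaurus.request import request, Request
from botasaurus.soupify import soupify
from botasaurus.task import task

# Scrape with lightweight HTTP requests instead of a full browser.
# Options like proxies or parallelism would go into the decorator call.
@request
def scrape_title(req: Request, data):
    response = req.get("https://example.com/")  # placeholder URL
    soup = soupify(response)
    return soup.title.text

# A non-scraping task, e.g. post-processing the scraped data
@task
def process_title(data):
    return {"title": data, "length": len(data)}

# Run the request-based scraper, then post-process its result
title = scrape_title()
print(process_title(title))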
Personally, I’m curious to test Botasaurus’s capabilities in the field against some anti-bot solutions.
I’m creating a small scraper, which you can find in the GitHub repository available to everyone, under the Botasaurus4 folder.
I’m using the browser decorator, which opens a browser window where all the pages will be loaded.
Botasaurus VS Cloudflare
In the following example, we’re testing Botasaurus against Harrods.com, a website protected with Cloudflare.
We’re using the browser decorator, specifying that the same browser session should be reused for the whole scraper with the option reuse_driver=True.
from botasaurus.browser import browser, Driver
from botasaurus.soupify import soupify
import time

@browser(
    reuse_driver=True
)
def scrape_heading_task(driver: Driver, data):
    # Cloudflare Protected Website
    driver.get("https://www.harrods.com/")
    driver.get("https://www.harrods.com/en-it/shopping/women")
    time.sleep(10)
    driver.scroll_to_bottom()
    page_soup = soupify(driver)
    products = page_soup.select('div[data-test="productCard-lazy-load-wrapper"]')
    for product in products:
        try:
            product_name = product.select_one('a > article > h3[data-test="productCard-productName"]').text
            print(f"Product: {product_name}")
        except:
            pass

# Initiate the web scraping task
scrape_heading_task()
After loading the website’s homepage, we browse to a product list page and scroll to the bottom so that all the items get loaded.
Then, we parse the HTML into a BeautifulSoup object using soupify() and extract the product names using CSS selectors.
Once the Cloudflare challenge is passed, we can load the website and scrape the data from it.
Botasaurus VS Datadome
We’re using the same scraper, with different URLs and selectors, this time against a website protected by Datadome, footlocker.it.
from botasaurus.browser import browser, Driver
from botasaurus.soupify import soupify
import time

@browser(
    reuse_driver=True
)
def scrape_heading_task(driver: Driver, data):
    # Datadome Protected Website
    driver.get("https://www.footlocker.it/")
    driver.get("https://www.footlocker.it/it/category/uomo/scarpe.html")
    time.sleep(10)
    driver.scroll_to_bottom()
    page_soup = soupify(driver)
    products = page_soup.select('div[class="ProductCard ProductCard--flexDirection"]')
    for product in products:
        try:
            product_name = product.select_one('a > span[class="ProductName"] > span').text
            print(f"Product: {product_name}")
        except:
            pass

# Initiate the web scraping task
scrape_heading_task()
Here too, we could load the homepage, get the product list, and scrape the product names from there.
Botasaurus VS Kasada
Again the same scraper, this time against a Kasada-protected website, canadagoose.com.
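For reference, the adapted skeleton looks like the sketch below; the product list page and the CSS selector are placeholders of mine, while the full version with the real ones is in the repository mentioned above.

from botasaurus.browser import browser, Driver
from botasaurus.soupify import soupify
import time

@browser(
    reuse_driver=True
)
def scrape_heading_task(driver: Driver, data):
    # Kasada Protected Website
    driver.get("https://www.canadagoose.com/")
    # Then navigate to a product list page (URL omitted here), wait and scroll as before
    time.sleep(10)
    driver.scroll_to_bottom()
    page_soup = soupify(driver)
    # Placeholder selector: adapt it to the real product card markup
    for product in page_soup.select("div[class*='product']"):
        print(product.get_text(strip=True))

# Initiate the web scraping task
scrape_heading_task()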
And even in this case, the protection is bypassed!
So, when executed locally, all three main website protections are bypassed.
Have we found the final solution for web scraping?
Well, almost.
For websites that don’t need a headful browser, it has everything needed to work, including setting the Google referrer when creating the requests.
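As an illustration of the Google-referrer trick, the browser driver also exposes a google_get helper that loads a page as if the visitor arrived from a Google search; here is a minimal sketch, with a placeholder URL.

from botasaurus.browser import browser, Driver
from botasaurus.soupify import soupify

@browser(
    reuse_driver=True
)
def visit_with_google_referrer(driver: Driver, data):
    # google_get loads the page with Google as the referrer,
    # which looks more natural than a direct visit
    driver.google_get("https://example.com/")  # placeholder URL
    page_soup = soupify(driver)
    print(page_soup.title)

visit_with_google_referrer()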
If you need to bypass the most famous anti-bot solutions, it works perfectly as long as the scraper runs on a consumer-grade device like your desktop or laptop.
If the scraper runs on a server, instead, since there’s no “camouflage” of the browser fingerprint, all the canonical red flags these anti-bots look for are exposed and used to block the scraper.
In fact, even using residential proxies, the three scrapers got blocked: both Datadome and Cloudflare threw a CAPTCHA, while Kasada rendered neither the home page nor the product list page.
By loading the fingerprint viewer from the Antoine Vastel website, we can easily understand why:
Our machine reports no speakers, mic, or webcam and uses the Google SwiftShader “video card”, a graphics API that renders on the CPU, typically used by servers with no real graphics card.
Apart from this, Botasaurus has a bunch of other great features, like creating a UI for your scrapers almost instantly, in case you need to share them with non-technical people. It’s difficult to cover them all in a single article, so you’ll probably see some of them in future posts.