Open source Python libraries for your web scraping projects

Save money and headaches with these Python tools

Sep 01, 2024

"A penny saved is a penny earned." - Benjamin Franklin

The open-source world is always vibrant, especially now that AI is everywhere and needs more and more data for its models. This means more web scraping but, as we’ve seen over the past five years in particular, also more anti-bot systems.

So let me share some of the coolest open-source Python libraries for applying AI to web scraping and for bypassing anti-bots.

ScrapeGraphAI

I’ve already written about this package some time ago, but the ScrapeGraphAI team is moving fast, shipping one release after another.

Previously on The Web Scraping Club: "The Lab #52: Scraping with LLMs and ScrapeGraphAi - part 1"

By connecting your preferred LLM, local or hosted, this library allows you to:

Extract data from one or more pages, defining a target data schema
Extract data from the results returned by a search engine query
Generate an audio file from the data extracted from a website
Generate the Python code for a scraper in your preferred library, like BeautifulSoup
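
As a rough illustration of the first use case, here is a minimal sketch of a single-page extraction. It assumes the scrapegraphai package and a local Ollama server are available. The model name, the example URL, and the two helper functions are placeholders of mine, and the configuration keys may change between ScrapeGraphAI releases, so treat it as a starting point and check the project's documentation.

```python
from typing import Optional


def build_graph_config(model: str = "ollama/llama3",
                       base_url: str = "http://localhost:11434") -> dict:
    """Return a ScrapeGraphAI-style config for a locally served LLM.

    The model name and base URL are assumptions: swap in whatever you run.
    """
    return {
        "llm": {
            "model": model,
            "temperature": 0,   # keep the extraction as repeatable as possible
            "format": "json",   # ask Ollama for JSON output
            "base_url": base_url,
        },
        "verbose": False,
    }


def scrape_page(url: str, prompt: str, config: Optional[dict] = None):
    """Run a single-page extraction and return the structured answer."""
    # Imported here so this module still loads where scrapegraphai is missing.
    from scrapegraphai.graphs import SmartScraperGraph

    graph = SmartScraperGraph(
        prompt=prompt,
        source=url,
        config=config or build_graph_config(),
    )
    return graph.run()


# Example call (hypothetical URL; needs scrapegraphai and a running Ollama):
# data = scrape_page("https://example.com/products",
#                    "List every product with its name and price")
```

The prompt plays the role of the extraction rules: you describe the fields you want in plain language instead of writing selectors.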

LLMs are becoming more affordable and accurate, but their response times are still too slow for the speed a production web scraping project requires.

If I had to choose the best application of this technology for web scraping, it would be writing and automatically fixing scraper code, leaving the execution to existing frameworks.
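
To make the speed argument concrete, here is a back-of-the-envelope sketch. The page count and the per-page timings are made-up numbers for illustration only, not measurements of any model or framework.

```python
def hours_to_scrape(pages: int, seconds_per_page: float, workers: int = 1) -> float:
    """Estimate the wall-clock hours needed to process `pages` pages."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return pages * seconds_per_page / workers / 3600


# Hypothetical workload: 100,000 product pages.
PAGES = 100_000

# Assumed timings: 5 s when every page goes through an LLM call,
# 0.2 s when a classic scraper parses the HTML directly.
llm_hours = hours_to_scrape(PAGES, seconds_per_page=5.0)
classic_hours = hours_to_scrape(PAGES, seconds_per_page=0.2)

print(f"LLM on every page: {llm_hours:.1f} hours")
print(f"Classic scraper:   {classic_hours:.1f} hours")
```

With these assumed numbers, the gap is large enough that asking the LLM to write the scraper once, and letting a traditional framework run it, looks like the more practical split.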

I had the pleasure of meeting part of the team some months ago and I was super impressed. I know they’re also working on extracting data from local documents, so I can’t wait for their next steps. You can track their progress by joining their Discord server.

I’m seeing a lot of competition in this space, with similar projects like CyberScraper-2077, which I haven’t tested yet.

Scrapoxy

You have probably seen Fabien Vauchelles, the creator of Scrapoxy, at web scraping events or webinars, since his talks about bots and anti-bots are always interesting and full of value. In fact, you will find an interview with him on the new YouTube channel of The Web Scraping Club in the coming weeks.

If you don’t want to miss it, consider subscribing to the channel.

Check the TWSC YouTube Channel

Scrapoxy is a super proxy aggregator that lets you manage proxies from different providers and of different kinds, from free to commercial.

Previously on The Web Scraping Club: "The Lab #41: Scrapoxy, the super proxy aggregator"

The most interesting aspect of this tool is how it manages datacenter proxies. By creating and rotating virtual machines on different cloud providers, you get an almost infinite pool of IPs with unlimited bandwidth.

But Scrapoxy is not limited to them: with a single endpoint in your scrapers, you can mix different providers and different proxy types.
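
From the scraper's side, the single-endpoint idea can be sketched with nothing more than the Python standard library. I'm assuming here that a Scrapoxy instance listens on localhost:8888 and that the project has a username and password. The host, port, credentials, and helper names are placeholders, so adapt them to your own setup.

```python
import urllib.request
from urllib.parse import quote


def scrapoxy_proxy_url(username: str, password: str,
                       host: str = "localhost", port: int = 8888) -> str:
    """Build the proxy URL for one aggregator endpoint.

    Credentials are percent-encoded so special characters survive in the URL.
    """
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"http://{user}:{pwd}@{host}:{port}"


def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS traffic through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Example call (hypothetical credentials; needs a running Scrapoxy instance):
# opener = build_proxy_opener(scrapoxy_proxy_url("my-project", "my-secret"))
# with opener.open("http://example.com", timeout=30) as response:
#     print(response.status)
```

The scraper only ever knows this one address. Which provider or proxy type serves each request is decided on the Scrapoxy side.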

Botasaurus

Botasaurus is a powerful framework for creating scrapers, both headful and browserless.

Previously on The Web Scraping Club: "Testing the new Botasaurus 4"

During the tests for the article I wrote some months ago, I found that this is a powerful library, capable of going undetected by Cloudflare and other anti-bots, but it has its limits.

When you create a headful scraper and run it from a datacenter, it lacks options to mask your browser fingerprint, so your scraper gets blocked.

It’s certainly a tool to keep an eye on, and I’ll be glad to test it in more depth in the future.

Nodriver

Nodriver is the successor of Undetected-Chromedriver, and it works without Selenium or webdrivers.

It’s fully asynchronous, which makes it a fast tool for scraping, and it’s natively optimized to stay undetected by most anti-bot solutions with just a few lines of code.

You can also manage different browser profiles, so you have basically everything you need for your scrapers.
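
To give an idea of what "a few lines of code" looks like, here is a minimal sketch based on my reading of the Nodriver documentation. I haven't run it against a protected website. The fetch_html helper, the profile path, and the URL are placeholders of mine.

```python
from typing import Optional


async def fetch_html(url: str, profile_dir: Optional[str] = None) -> str:
    """Open `url` in a Nodriver-controlled browser and return the page HTML."""
    # Imported here so this module still loads where nodriver is missing.
    import nodriver as uc

    # user_data_dir lets you reuse the same browser profile between runs;
    # with None, Nodriver creates a fresh temporary profile.
    browser = await uc.start(user_data_dir=profile_dir)
    try:
        page = await browser.get(url)
        return await page.get_content()
    finally:
        browser.stop()


# Example call (hypothetical URL and profile path; needs nodriver and Chrome):
# import nodriver as uc
# html = uc.loop().run_until_complete(
#     fetch_html("https://example.com", profile_dir="./profiles/first")
# )
```

Passing a different profile_dir for each identity is one way to keep cookies and local storage separated between scrapers.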

Unfortunately, I haven’t been able to test it yet, but I plan to write something about it in the coming weeks.

Undetected Playwright

Undetected Playwright is a patch you apply to your Playwright scrapers to make them harder for anti-bots to detect.

We’ve seen the patch applied in this article about CDP detection techniques, where it helped our scrapers pass the tests for this new approach, which some anti-bot software is starting to adopt.

Previously on The Web Scraping Club: "The Lab #57: Improving your Playwright scraper and avoid CDP detection"

Speaking of which, I’ve found a great article on the Rebrowser Blog that discusses it in a bit more detail.

Camoufox

Camoufox is a browser under development that its author recently shared in our Discord server, and it seems very promising.

Starting from Firefox, the author stripped out all the unnecessary features and added TLS masking, Browserforge to change the browser’s fingerprint, and a bunch of other features. The results of the tests conducted on well-known websites like Browserscan seem promising, and I can’t wait to test it.

This is my shortlist of open-source Python tools for web scraping. Do you think I missed something? (I surely did.) Write your preferred package in the comments section or on our Discord server.

Oxycon Updates

As mentioned in the previous post, Oxycon 2024 is scheduled for the 25th of September, 12 PM BST (British Summer Time).

Now we have the full agenda (GMT +3 Time) available, so secure your spot by subscribing to the official page of the event.

Job Opportunity

On the community Discord server, George of Emailchaser published a great job opportunity. Here’s the full text.

Emailchaser is looking for an engineer to build a web scraper.
Hi everyone, I'm the founder of Emailchaser, and I'm looking to hire/partner with someone who has experience building scrapers.
This is the feature we want to build - https://www.emailchaser.com/lead-finder
This will be a long term partnership.
If you are interested in helping us build this feature, then please send me an email at george@emailchaser.com

Like this article? Share it with your friends who might have missed it or just leave feedback for me about it. It’s important to understand how to improve this newsletter.

Rate This Article

You can also invite your friends to subscribe to the newsletter. The more you bring, the bigger the prize you get.

Invite another web scraping person
