Open source Python libraries for your web scraping projects
Save money and headaches with these Python tools
"A penny saved is a penny earned." - Benjamin Franklin
The open-source world is always vibrant, especially now that AI is everywhere and needs more and more data for its models. This means more web scraping but, as we’ve seen especially over the past five years, also more anti-bot measures.
So let me share with you some of the coolest Python libraries for leveraging AI for web scraping and for bypassing anti-bots.
ScrapeGraphAI
I’ve already written about this package some time ago, but the ScrapeGraphAI team is literally flying, with one release after another.
By connecting your preferred LLM, locally or online, this library allows you to:
Extract data from a single or multiple pages, defining a target data schema
Extract data from the answers of a search on a search engine
Generate an audio file with the data extracted from a website
Write the Python code for your scraper in your preferred library, like BeautifulSoup
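To make the idea of a target data schema more concrete, here is a minimal sketch in plain Python. It does not use ScrapeGraphAI's own API: the Product class, the parse_products helper, and the sample answer are all hypothetical, and the model's reply is hard-coded so the example runs offline. The point is only that a schema tells you which fields to expect and lets you reject answers that don't match.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Product:
    """Hypothetical target schema: the fields we want for each product."""
    name: str
    price: float
    currency: str


def parse_products(llm_json: str) -> list:
    """Check a JSON answer against the Product schema and build typed records."""
    expected = {f.name for f in fields(Product)}
    products = []
    for item in json.loads(llm_json):
        missing = expected - set(item)
        if missing:
            # Reject answers that do not contain every field of the schema
            raise ValueError(f"missing fields: {sorted(missing)}")
        products.append(
            Product(
                name=str(item["name"]),
                price=float(item["price"]),
                currency=str(item["currency"]),
            )
        )
    return products


# Assumption: a hard-coded stand-in for what a model might return.
sample_answer = '[{"name": "Blue mug", "price": "12.50", "currency": "EUR"}]'
products = parse_products(sample_answer)
```

In a real project the JSON string would come from the LLM call, and the same validation step would catch incomplete or malformed answers before they reach your database.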
LLMs are becoming more and more affordable and accurate, but their response times are still not compatible with the speed required for a web scraping project in production.
If I had to choose the best application of this technology for web scraping, I think it’s automatically writing and fixing the code of the scrapers, leaving the execution to the existing frameworks.
I had the pleasure of meeting part of the team some months ago and I was super impressed. I know they’re also working on extracting data from local documents, so I can’t wait for their next steps. You can track their progress by joining their Discord server.
I’m seeing a lot of competition in this landscape, with similar projects like CyberScraper-2077, which I haven’t tested yet.
Scrapoxy
You have probably seen Fabien Vauchelles, the creator of Scrapoxy, at some web scraping events or webinars, since his talks about bots and anti-bots are always interesting and full of value. In fact, you will find an interview with him on the new YouTube channel of The Web Scraping Club in the coming weeks.
In case you don’t want to miss it, consider subscribing to the channel.
Scrapoxy is a super proxy aggregator that lets you manage proxies of different kinds and from different providers, from free to commercial.
The most interesting aspect of this library is how it manages datacenter proxies. Basically, by creating and rotating virtual machines on different cloud providers, you get an almost infinite pool of IPs with unlimited bandwidth.
But Scrapoxy is not limited to them: using a single endpoint in your scrapers, you can mix different providers and different proxy types.
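The single-endpoint idea can be sketched with the standard library alone. The host, port, and credentials below are assumptions (placeholders for whatever your own Scrapoxy project exposes), and build_proxy_url and make_opener are hypothetical helpers, not part of Scrapoxy. The scraper only ever knows one proxy address, while the aggregator decides which provider and IP actually serve each request.

```python
import urllib.request
from urllib.parse import quote

# Assumed values: replace them with the endpoint and credentials of your setup.
PROXY_HOST = "localhost"
PROXY_PORT = 8888
PROXY_USER = "USERNAME"
PROXY_PASSWORD = "PASSWORD"


def build_proxy_url(user: str, password: str, host: str, port: int) -> str:
    """Build the proxy URL, escaping special characters in the credentials."""
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


def make_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS traffic through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


proxy_url = build_proxy_url(PROXY_USER, PROXY_PASSWORD, PROXY_HOST, PROXY_PORT)
opener = make_opener(proxy_url)

# Uncomment to send a real request through the aggregator:
# with opener.open("https://example.com", timeout=30) as response:
#     print(response.status)
```

Depending on how the aggregator handles HTTPS traffic, you may also need extra TLS settings on the client side, so check the project's documentation before using this in a real scraper.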
Botasaurus
Botasaurus is a powerful framework for creating scrapers, both headful and browserless.
In the tests for the article I wrote some months ago, I noticed that this library is capable of going undetected by Cloudflare and other anti-bots, but it has its limits.
When creating a headful scraper and running it from a datacenter, it lacks options to mask your browser fingerprint, causing your scraper to be blocked.
But it’s surely a tool to keep an eye on, and I’ll be glad to test it more in depth in the future.
Nodriver
Nodriver is the successor of Undetected-Chromedriver, and it works without Selenium or webdrivers.
It’s fully asynchronous, providing a fast tool for scraping that is natively optimized for staying undetected by most anti-bot solutions, with just a few lines of code.
You can also manage different profiles and basically you have everything needed for your scrapers.
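To show why a fully asynchronous design matters for speed, here is a small sketch using only Python's asyncio. It is not Nodriver's API: fetch_page is a hypothetical stand-in that just sleeps to simulate waiting for a page, and the URLs are placeholders. With a real async browser library, the waiting would happen on network and rendering instead.

```python
import asyncio
import time


async def fetch_page(url: str, delay: float = 0.2) -> str:
    """Hypothetical stand-in for an async browser call; it only simulates waiting."""
    await asyncio.sleep(delay)
    return f"<html>content of {url}</html>"


async def scrape_all(urls: list) -> list:
    # All coroutines wait at the same time, so the total time stays close to
    # the slowest single page instead of the sum of all pages.
    return await asyncio.gather(*(fetch_page(url) for url in urls))


urls = [f"https://example.com/page/{i}" for i in range(5)]
start = time.perf_counter()
pages = asyncio.run(scrape_all(urls))
elapsed = time.perf_counter() - start
print(f"Fetched {len(pages)} pages in {elapsed:.2f}s")
```

Run one after another, the five simulated pages would take about one second in total. Run concurrently, they finish in roughly the time of a single page.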
Unfortunately, I haven’t been able to test it yet, but I’ll try to write something about it in the coming weeks.
Undetected Playwright
Undetected Playwright is a patch you apply to your Playwright scrapers to make them harder for anti-bots to detect.
We’ve seen the patch applied in this article about CDP detection techniques, where it helped our scrapers pass the tests for this new approach, which some anti-bot software is starting to use.
Speaking of which, I’ve found a great article on the Rebrowser Blog where it’s discussed in a bit more detail.
Camoufox
Camoufox is a very promising browser under development, recently shared in our Discord server by its author.
Starting from Firefox, the author stripped out all the unnecessary features and added TLS masking, Browserforge to change the browser’s fingerprint, and a bunch of other features. The results of the tests conducted on well-known websites like Browserscan seem promising, and I can’t wait to test it.
This is my shortlist of Python open-source tools for web scraping. Do you think I missed something (yes, I surely did)? Write your preferred package in the comments section or on our Discord server.
Oxycon Updates
As mentioned in the previous post, Oxycon 2024 is scheduled for the 25th of September, 12 PM BST (British Summer Time).
Now we have the full agenda (GMT +3 Time) available, so secure your spot by subscribing to the official page of the event.
Job Opportunity
On the community Discord server, a great job opportunity was published by George of Emailchaser. Here’s the full text.
Emailchaser is looking for an engineer to build a web scraper.
Hi everyone, I'm the founder of Emailchaser, and I'm looking to hire/partner with someone who has experience building scrapers.
This is the feature we want to build - https://www.emailchaser.com/lead-finder
This will be a long term partnership.
If you are interested in helping us build this feature, then please send me an email at george@emailchaser.com
Like this article? Share it with your friends who might have missed it or just leave feedback for me about it. It’s important to understand how to improve this newsletter.
You can also invite your friends to subscribe to the newsletter. The more you bring, the bigger the prize you get.