The most interesting GitHub Repositories about web scraping (2023)
An incomplete but still useful list of interesting resources on web scraping
This post is sponsored by Oxylabs, your premium proxy provider. Sponsorships help keep The Web Scraping Club free, and they are a way to give some value back to readers.
In this case, all The Web Scraping Club readers can save 25% on residential proxy purchases by using the discount code WSC25.
The open-source community has significantly contributed to the web scraping industry, giving the public access to a wide range of tools and resources. In this article, we’ll look at some of the most important GitHub repositories.
Tools for web scraping with Python
It’s the most popular repository GitHub returns when searching for Python projects with the “web scraping” keyword. It’s used to download non-HTML content, such as videos from YouTube or images from websites.
It’s a sort of scraper that, given some input examples, learns the rules for extracting the correct data and can apply them to new URLs with the same structure. This is the first time I’ve seen this project, but I’ll certainly test it in the coming weeks.
LinkedIn scraper using Scrapy, Selenium, and Chromium.
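To make the "learn from examples" idea concrete, here is a toy sketch in Python. It is not how the project above works internally, which I have not verified; it only assumes that a "rule" can be approximated by the chain of tags leading to the example text. The names `PathCollector`, `learn_rule`, and `apply_rule` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathCollector(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the most recent matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append((tuple(self.stack), text))


def collect(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def learn_rule(html, wanted_text):
    """Return the tag path of the first text node equal to the example."""
    for path, text in collect(html):
        if text == wanted_text:
            return path
    raise ValueError("example text not found in the page")


def apply_rule(html, rule):
    """Return every text node that sits at the learned tag path."""
    return [text for path, text in collect(html) if path == rule]


sample_page = (
    "<html><body><h1>Shop</h1>"
    "<ul><li><span>Red mug</span></li></ul></body></html>"
)
new_page = (
    "<html><body><h1>Shop</h1>"
    "<ul><li><span>Blue cup</span></li>"
    "<li><span>Green bowl</span></li></ul></body></html>"
)

rule = learn_rule(sample_page, "Red mug")
results = apply_rule(new_page, rule)
```

A real tool would also need to consider attributes, sibling positions, and fuzzy matching; a bare tag path breaks as soon as two different fields share the same nesting.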
This package uses the AWS API Gateway service pool of IPs as a pool of proxies. Smart move for saving some bucks when in need of data center proxies!
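The package's own API is not shown here. As a generic illustration of the underlying idea, spreading requests across a pool of exit addresses, here is a minimal round-robin sketch in Python. The `ProxyPool` class and the endpoint addresses are hypothetical; the returned dictionary follows the shape that the `requests` library accepts in its `proxies` argument.

```python
from itertools import cycle


class ProxyPool:
    """Hand out proxy URLs in round-robin order."""

    def __init__(self, proxy_urls):
        if not proxy_urls:
            raise ValueError("proxy_urls must not be empty")
        self._proxies = cycle(proxy_urls)

    def next_proxies(self):
        """Return a scheme-to-proxy mapping for the next proxy in the pool."""
        url = next(self._proxies)
        return {"http": url, "https": url}


# Hypothetical endpoints, for illustration only.
pool = ProxyPool(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
first = pool.next_proxies()
second = pool.next_proxies()
third = pool.next_proxies()  # wraps around to the first proxy again
```

Each call to `next_proxies()` could then be passed along with an outgoing request, so consecutive requests leave from different addresses.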
This package enables different types of proxy usage in Scrapy. An alternative could be the advanced scrapy proxies package, which I wrote and to which I added several options, like downloading a list of proxies from an external URL at every request and using hidden users and passwords.
Other useful repos
As the repository describes, these are 101 web scraping and research tasks for data journalists, from the Stanford Computational Journalism Lab. The Python scripts are useful for extracting data, typically from US government and administration websites.
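As a rough sketch of what proxy handling in a Scrapy downloader middleware can look like: Scrapy's built-in proxy support reads the proxy for a request from `request.meta["proxy"]`, so a middleware only has to set that key. This is not the code of either package mentioned above. The class name, the `load_proxies` callable, and the `FakeRequest` stand-in are assumptions made for illustration, and a real middleware would normally be built through Scrapy's `from_crawler` hook and enabled in the project settings.

```python
import random


class RandomProxyMiddleware:
    """Scrapy-style downloader middleware assigning a random proxy per request."""

    def __init__(self, load_proxies):
        # load_proxies: zero-argument callable returning a list of proxy URLs.
        # In practice it could fetch the list from an external URL.
        self.load_proxies = load_proxies

    def process_request(self, request, spider):
        proxies = self.load_proxies()  # refreshed on every request
        if proxies:
            request.meta["proxy"] = random.choice(proxies)
        return None  # None tells Scrapy to keep processing the request


class FakeRequest:
    """Stand-in for scrapy.Request, so the sketch runs without Scrapy."""

    def __init__(self):
        self.meta = {}


request = FakeRequest()
middleware = RandomProxyMiddleware(lambda: ["http://proxy.example:8000"])
middleware.process_request(request, spider=None)
```

Reloading the list on every request keeps the pool fresh at the cost of extra work per request; caching the list for a short time is a common compromise.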
The most famous Python bypass for Cloudflare.
Brought to my attention in the comments on a post on my LinkedIn profile (feel free to add me), this package allows sending HTTP requests with custom TLS fingerprints.
Maintained by Apify, it’s the one-stop solution for web scraping in JavaScript. It uses Playwright, Puppeteer, and Cheerio, adding some anti-blocking features.
The most famous browser automation tool is also used in web scraping. With the package puppeteer-extra, it gains superpowers against anti-bot solutions.
Released in 2020 by Microsoft, this browser automation tool immediately gained traction in the web scraping scene. With the package playwright-extra, you have more options against anti-bot solutions. It is also available in Python.
A new concept of web scraping tool that uses SQL-like language for extracting data from the DOM.
Tools in other languages
Geziyor is a web scraping framework for the Go language, with JS rendering, proxy management, and some other common features.
Framework for easy web scraping in Ruby.
Knowledge bases and documentation
It’s my first attempt to gather all the info and links about web scraping from several sources, so we can consider it the condensed version of this Substack.
A great collection of tests and considerations about the anti-bot industry and its techniques.
A list of interesting repositories on GitHub that is much more complete than this post.
I’m sure I’ve left out some amazing repositories, so please comment here if you want to share something I’ve missed with the other readers.