The most interesting GitHub Repositories about web scraping (2023)
An incomplete but still useful list of interesting resources on web scraping
This post is sponsored by Oxylabs, your premium proxy provider. Sponsorships help keep The Web Scraping Club free, and they’re a way to give some value back to the readers.
In this case, all The Web Scraping Club readers can save 25% on residential proxies by using the discount code WSC25.
The open-source community has contributed significantly to the web scraping industry, giving the public access to a wide range of tools and resources. In this article, we’ll go through some of the most important GitHub repositories.
Tools for web scraping with Python
Scraping tools
Scrapy
The de facto standard for web scraping in Python. The repository has 45k stars and is maintained by Zyte. On top of it, the Splash headless browser lets Scrapy render JavaScript, and Spidermon adds monitoring for your fleet of scrapers.
You-get
It’s the most popular repository GitHub returns when searching for Python projects with the “web scraping” keyword. It’s used to download non-HTML content, like videos from YouTube or images from websites.
Autoscraper
It’s a scraper that, given some input examples, learns the rules for extracting the correct data and can apply them to new URLs with the same structure. This is the first time I’ve seen this project, but I’ll definitely test it in the coming weeks.
LinkedIn Scraper
A LinkedIn scraper built with Scrapy, Selenium, and Chromium.
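To make the "learn from an example, apply elsewhere" idea concrete, here is a minimal sketch using only the Python standard library. It is not how Autoscraper is implemented and does not use its API: the `PathCollector`, `learn_rule`, and `apply_rule` names and the sample pages are hypothetical, and the "rule" is simply the chain of open tags (with class names) leading to the example value.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects (path, text) pairs, where path is the chain of open tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, e.g. ["div.item", "span.price"]
        self.found = []   # list of (path tuple, stripped text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((tuple(self.stack), text))


def collect(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.found


def learn_rule(html, example_text):
    """Return the tag path of the first text node equal to the example."""
    for path, text in collect(html):
        if text == example_text:
            return path
    raise ValueError(f"example {example_text!r} not found in page")


def apply_rule(html, rule):
    """Return every text node in the page that sits at the learned path."""
    return [text for path, text in collect(html) if path == rule]


# Hypothetical pages sharing the same structure.
TRAIN_PAGE = '<div class="item"><h2>Red mug</h2><span class="price">9.90</span></div>'
NEW_PAGE = (
    '<div class="item"><h2>Blue mug</h2><span class="price">12.50</span></div>'
    '<div class="item"><h2>Green mug</h2><span class="price">7.00</span></div>'
)

rule = learn_rule(TRAIN_PAGE, "9.90")
prices = apply_rule(NEW_PAGE, rule)
```

Here the rule learned from one price on the training page is enough to pull every price from a new page with the same layout. A real tool has to cope with much messier markup than this sketch does.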
Proxy management
Requests IP Rotator
This package uses the IP pool of the AWS API Gateway service as a pool of proxies. A smart move for saving a few bucks when you need data center proxies!
Scrapy Rotating Proxies
This package enables different types of proxy usage in Scrapy. An alternative is the advanced scrapy proxies package, which I wrote, adding several options such as downloading a list of proxies from an external URL at every request and using hidden usernames and passwords.
Other useful repos
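The core idea behind proxy rotation can be sketched in a few lines of standard-library Python. This is an illustration only, not the implementation of either package above: the `ProxyRotator` class, the `build_opener_for` helper, and the proxy URLs are all hypothetical.

```python
import urllib.request


class ProxyRotator:
    """Round-robin over a proxy list, skipping proxies marked as banned."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._banned = set()
        self._next = 0

    def get(self):
        # Try each proxy at most once per call, starting where we left off.
        for _ in range(len(self._proxies)):
            proxy = self._proxies[self._next]
            self._next = (self._next + 1) % len(self._proxies)
            if proxy not in self._banned:
                return proxy
        raise RuntimeError("no usable proxies left")

    def mark_banned(self, proxy):
        # Call this when a response looks like a block (e.g. a 403 or a captcha).
        self._banned.add(proxy)


def build_opener_for(proxy):
    """Build a urllib opener that sends HTTP and HTTPS traffic via the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


# Placeholder addresses; replace with real proxy endpoints.
rotator = ProxyRotator([
    "http://proxy-a.example:8000",
    "http://proxy-b.example:8000",
    "http://proxy-c.example:8000",
])
```

Each request would call `rotator.get()`, build an opener for that proxy, and call `rotator.mark_banned(proxy)` if the target site responds with a block page.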
Search-Script-Scrape
As the repository says, it contains 101 web scraping and research tasks for the data journalist, from the Stanford Computational Journalism Lab. The Python scripts are useful for extracting data, typically from US government and administration websites.
Cloudscraper
The most famous Python bypass for Cloudflare.
TLS-client
Brought to my attention in the comments on a post on my LinkedIn profile (feel free to add me), it allows sending HTTP requests with custom TLS fingerprints.
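To give a feel for what a TLS fingerprint is made of, the sketch below changes one of its inputs, the list of cipher suites offered by the client, using only Python's standard `ssl` module. It does not use TLS-client's API. The `make_context` and `cipher_names` helpers are hypothetical, and the standard library cannot control other parts of the handshake (such as extension order), which is where dedicated libraries go further.

```python
import ssl


def make_context(cipher_string=None):
    """Create a client TLS context, optionally restricting the cipher suites."""
    context = ssl.create_default_context()
    if cipher_string is not None:
        # OpenSSL cipher-string syntax; affects TLS 1.2 and older suites.
        context.set_ciphers(cipher_string)
    return context


def cipher_names(context):
    """List the cipher suites this context would offer, in order."""
    return [cipher["name"] for cipher in context.get_ciphers()]


default_ciphers = cipher_names(make_context())
custom_ciphers = cipher_names(make_context("ECDHE+AESGCM"))
```

Because fingerprinting methods such as JA3 include the offered cipher suites, two clients built from these two contexts would generally not present the same fingerprint to a server.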
Tools with Javascript
Scraping tools
Crawlee
Maintained by Apify, it’s the one-stop solution for web scraping in JavaScript. It uses Playwright, Puppeteer, and Cheerio, and adds some anti-blocking features.
Puppeteer
The most famous browser automation tool is also used in web scraping. With the puppeteer-extra package, it gains superpowers against anti-bot solutions.
Playwright
Released in 2020 by Microsoft, this browser automation tool immediately gained traction in the web scraping scene. With the playwright-extra package, you have more options against anti-bot solutions. Also available in Python.
Ayakashi
A new concept of web scraping tool that uses an SQL-like language for extracting data from the DOM.
Tools with other languages
Scraping tools
Geziyor
Geziyor is a web scraping framework for the Go language, with JS rendering, proxy management, and some other common features.
Upton
A framework for easy web scraping in Ruby.
Knowledge bases and documentation
The Web Scraping Open Knowledge Platform
It’s my first attempt at gathering all the info and links about web scraping from several sources, so we can consider it the condensed version of this Substack.
Browser fingerprinting
A great collection of tests and considerations about the anti-bot industry and its techniques.
Awesome Web Scraping
A list of interesting repositories on GitHub that is much more complete than this post.
I’m sure I’ve left out some amazing repositories. Please comment here if you want to share something I’ve missed with the other readers.