The most interesting GitHub Repositories about web scraping (2023)

An incomplete but still useful list of interesting resources on web scraping

Jan 22, 2023

This post is sponsored by Oxylabs, your premium proxy provider. Sponsorships help keep The Web Scraping Club free, and they’re a way to give some value back to readers.

In this case, all The Web Scraping Club readers can use the discount code WSC25 to save 25% on residential proxies.

The open-source community has contributed significantly to the web scraping industry, giving the public access to a wide range of tools and resources. In this article, we’ll look at some of the most important GitHub repositories.

Photo by Roman Synkevych on Unsplash

Tools for web scraping with Python

Scraping tools

Scrapy

The de facto standard for web scraping in Python: the repository has 45k stars and is maintained by Zyte. On top of it, the Splash headless browser lets Scrapy render JavaScript, and Spidermon adds monitoring for your fleet of scrapers.

You-get

It’s the most popular repository GitHub returns when you search for Python projects with the “web scraping” keyword. It’s used to download non-HTML content, like videos from YouTube or images from websites.

Autoscraper

It’s a kind of scraper that, given some input examples, learns the rules for extracting the correct data and can apply them to new URLs with the same structure. This is the first time I’ve seen this project, but I’ll certainly test it in the coming weeks.
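
To make the "learn from an example, then reuse the rule" idea concrete, here is a minimal sketch using only the Python standard library. It is not Autoscraper's API or its actual algorithm: the function names `learn_rule` and `apply_rule`, the sample pages, and the choice of "tag path plus class attribute" as the rule are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _TextCollector(HTMLParser):
    """Record every non-empty text node with the stack of tags enclosing it."""

    def __init__(self):
        super().__init__()
        self._stack = []  # currently open tags, as (tag, class) pairs
        self.nodes = []   # (path, text) for each text node seen

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self._stack), text))


def _text_nodes(html):
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_rule(html, wanted_text):
    """Return the tag path of the first text node equal to wanted_text."""
    for path, text in _text_nodes(html):
        if text == wanted_text:
            return path
    raise ValueError(f"{wanted_text!r} not found in the example page")


def apply_rule(html, rule):
    """Return the text of every node whose tag path matches the learned rule."""
    return [text for path, text in _text_nodes(html) if path == rule]


# Hypothetical pages sharing the same structure.
TRAINING_PAGE = """
<html><body>
  <div class="product"><h2 class="title">Red mug</h2><span class="price">9.90</span></div>
  <div class="product"><h2 class="title">Blue mug</h2><span class="price">11.50</span></div>
</body></html>
"""
NEW_PAGE = """
<html><body>
  <div class="product"><h2 class="title">Green plate</h2><span class="price">4.20</span></div>
</body></html>
"""

rule = learn_rule(TRAINING_PAGE, "Red mug")  # one example value is enough here
print(apply_rule(NEW_PAGE, rule))            # ['Green plate']
```

A single example value is enough to recover every title on a page with the same layout. A real tool has to cope with messier markup, so treat this only as an intuition for the approach.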

LinkedIn Scraper

A LinkedIn scraper built with Scrapy, Selenium, and Chromium.

Proxy management

Requests IP Rotator

This package uses the pool of IPs behind the AWS API Gateway service as a pool of proxies. A smart move for saving some money when you need data center proxies!

Scrapy Rotating Proxies

This package enables different types of proxy usage in Scrapy. An alternative is the advanced-scrapy-proxies package, which I wrote myself, adding several options such as downloading a list of proxies from an external URL at every request and using hidden users and passwords.
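
As a rough idea of what enabling Scrapy Rotating Proxies looks like, here is a sketch of the relevant part of a Scrapy `settings.py`, based on my recollection of the package's README. The proxy addresses are placeholders, and the setting names and middleware priorities should be checked against the current documentation before use.

```python
# settings.py (fragment) -- assumes the scrapy-rotating-proxies package is installed.

# Placeholder proxies in host:port form; replace them with your own.
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",
    "proxy2.example.com:8031",
]

DOWNLOADER_MIDDLEWARES = {
    # Picks a proxy for each request and retires the ones that stop working.
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    # Flags responses that look like bans so the proxy can be rotated out.
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```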

Other useful repos

Search-Script-Scrape

As the repository says, these are 101 web scraping and research tasks for the data journalist, from the Stanford Computational Journalism Lab. The Python scripts are useful for extracting data, typically from US government and administration websites.

Cloudscraper

The most famous Python bypass for Cloudflare.

TLS-client

Brought to my attention in the comments on one of my LinkedIn posts (feel free to add me), it allows sending HTTP requests with custom TLS fingerprints.
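
To see why a custom TLS fingerprint matters, it helps to look at how one is computed. The sketch below uses JA3, one widely used fingerprinting scheme, as an example; it does not show TLS-client's API, and I am not claiming JA3 is what any particular anti-bot vendor relies on. A JA3 string joins five ClientHello fields (TLS version, cipher suites, extensions, elliptic curves, point formats) and hashes the result with MD5. The numeric values below are illustrative, not captured from a real browser.

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string: five comma-separated fields, values joined by '-'."""
    fields = [[tls_version], ciphers, extensions, curves, point_formats]
    return ",".join("-".join(str(v) for v in field) for field in fields)


def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Return the MD5 hex digest of the JA3 string (fingerprinting, not security)."""
    raw = ja3_string(tls_version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii"), usedforsecurity=False).hexdigest()


# Illustrative ClientHello parameters (771 is TLS 1.2, i.e. 0x0303).
client_a = dict(
    tls_version=771,
    ciphers=[4865, 4866, 4867, 49195, 49199],
    extensions=[0, 23, 65281, 10, 11],
    curves=[29, 23, 24],
    point_formats=[0],
)
# Same cipher suites offered in a different order, as another HTTP client might do.
client_b = dict(client_a, ciphers=[49199, 49195, 4865, 4866, 4867])

print(ja3_string(**client_a))
print(ja3_hash(**client_a))
print(ja3_hash(**client_b))  # differs from client_a: ordering alone changes the hash
```

Because the hash depends on the order and content of what the client offers, an HTTP library with its own TLS defaults produces a different fingerprint from a browser even when the headers are identical. That is the gap a tool with configurable TLS fingerprints tries to close.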

Tools with JavaScript

Scraping tools

Crawlee

Maintained by Apify, it’s the one-stop solution for web scraping in JavaScript. It uses Playwright, Puppeteer, and Cheerio, adding some anti-blocking features.

Puppeteer

The most famous browser automation tool is also used in web scraping. With the puppeteer-extra package, it gains superpowers against anti-bot solutions.

Playwright

Released in 2020 by Microsoft, this browser automation tool immediately gained traction in the web scraping scene. With the playwright-extra package, you have more options against anti-bot solutions. It’s also available in Python.

Ayakashi

A new concept of web scraping tool that uses an SQL-like language for extracting data from the DOM.

Tools with other languages

Scraping tools

Geziyor

Geziyor is a web scraping framework for the Go language, with JS rendering, proxy management, and some other common features.

Upton

A framework for easy web scraping in Ruby.
