Three web scraping tools just discovered on GitHub
IP reputation, HTTP fingerprint spoofing and User Agents
At least once a month I take some time to browse GitHub at random, looking for repositories that could be useful for my web scraping activities.
In this post, I’ll share with you some of my latest findings.
ASN Lookup Tool and Traceroute Server
ASN is an OSINT command-line tool that checks an IP address and returns information about it, such as its reputation, geolocation, and fingerprint.
It bundles a series of API calls to IP-enrichment services, so that with a single command you get an audit of the submitted IP.
When it comes to web scraping, this can help evaluate the IPs provided by online proxy lists or by proxy providers themselves, checking, for example, whether the location chosen when making the request matches the one seen by the target website.
You might need this double-check when the website doesn’t work as expected, or when the spider gets blocked if executed on a different device.
In fact, two useful details we can get are the type of IP (for example, whether it belongs to a subnet owned by a data center) and whether the IP has a bad reputation and is listed in registries like IPQualityScore. If you want the details from this service, you need an API key, which you can get by registering for free, and you can check up to 5,000 addresses per month.
Installing and using ASN is extremely simple, and you can find all the information you need on the official GitHub repository page.
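Just to give an idea, here’s a minimal sketch (assuming the asn script is installed and available on your PATH, and using only the basic asn <target> invocation) that wraps the lookup in Python to batch-check a handful of addresses, for example taken from a proxy list:
import subprocess

# Assumption: the `asn` script from the repository is installed and on PATH.
# Placeholder list: replace with the addresses you want to audit.
proxy_ips = ["8.8.8.8", "1.1.1.1"]

for ip in proxy_ips:
    # `asn <target>` prints the aggregated report (type, geolocation, reputation) to stdout
    result = subprocess.run(["asn", ip], capture_output=True, text=True)
    print(result.stdout)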
Here are some tests I ran:
I queried the Google DNS IP, and we can see it’s a data center IP (under TYP) and that its reputation is good (under REP).
This IP, instead, is one I found in an online “free proxy” list. It’s detected as a proxy running in a data center (so it will probably be banned by every anti-bot), and its geolocation doesn’t match the one declared in the list. At least it doesn’t have a bad reputation.
curl_cffi
curl_cffi is a Python binding for curl_impersonate, which is a special version of the well-known curl command.
The curl command downloads the content of a web page; I even used it for my first web scraping project around 2007, when no other framework was available. It’s very powerful, but it’s also easily detectable by anti-bot solutions, which makes it useless for web scraping.
curl_impersonate, instead, is a special build of curl whose TLS and HTTP handshakes are identical to those of a real browser, producing a more legitimate fingerprint when the target website is protected by an anti-bot.
On top of it, curl_cffi has been built so that you can use curl_impersonate frictionlessly in your Python scripts.
I was a bit skeptical about the real effectiveness of this solution, since there’s no JavaScript handling, which I assumed was needed to bypass any anti-bot, but I wanted to give it a try anyway.
Well, I was more than surprised to see that Cloudflare, which of course blocked the standard curl call on the target website, let curl_cffi read the HTML of the target page.
from curl_cffi import requests
# Notice the impersonate parameter
r = requests.get("https://www.harrods.com/en-it/shopping/women-clothing-dresses?icid=megamenu_shop_women_dresses_all-dresses", impersonate="chrome110")
print(r.text)
I will definitely put it to work in the next few days, since I’m having some issues with a few websites.
fake-useragent
Changing the User Agent of your scraper is a technique as old as web scraping itself, but it’s still a thing. So why not use a package that makes it effortless?
fake-useragent does exactly that, helping our scrapers get a completely random User Agent or one from the browsers we choose to impersonate.
It’s a well-designed package, less trivial than it might seem at first sight.
First, we can choose to get a random User Agent from one or more browser families:
from fake_useragent import UserAgent
ua = UserAgent(browsers=['edge', 'chrome'])
ua.random
We can also discard User Agents that are not common enough, since every string comes with its usage percentage. In this example, we’re selecting only browsers with an adoption rate of at least 1.3%.
from fake_useragent import UserAgent
ua = UserAgent(min_percentage=1.3)
ua.random
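To show where this fits in a scraper, here’s a minimal sketch (the requests call and the target URL are just illustrative) that plugs the random User Agent into the request headers:
import requests
from fake_useragent import UserAgent

# Pick a random but reasonably common User Agent string
ua = UserAgent(min_percentage=1.3)
headers = {"User-Agent": ua.random}

# Illustrative request: any target URL would do
r = requests.get("https://httpbin.org/headers", headers=headers)
print(r.status_code)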
It’s not rocket science, but it simplifies a tedious task.
Do you have a favorite package not mentioned here that you’d like to share with our community? Are you working on an open-source project related to web scraping? Please write to me at pier@thewebscraping.club
Bonus: wafw00f (https://github.com/EnableSecurity/wafw00f.git) is another handy tool, useful for detecting which WAF is protecting a target website.
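I haven’t tested it in depth yet, but as a rough sketch (assuming wafw00f is installed, for example via pip, so the wafw00f command is on your PATH), you can call it from Python before writing any scraping code:
import subprocess

# Assumption: `pip install wafw00f` puts the wafw00f CLI on PATH.
target = "https://www.example.com"  # placeholder target website

# `wafw00f <url>` prints the WAF it detects in front of the target, if any
result = subprocess.run(["wafw00f", target], capture_output=True, text=True)
print(result.stdout)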