The latest papers about browser fingerprinting
Let's dive into the latest studies on browser fingerprinting
I’ve recently read this article on the Datadome website about the latest developments in their fingerprinting solutions.
In particular, the article describes how Picasso, a canvas fingerprinting tool developed at Google, is integrated into their anti-bot solution to detect traffic generated by bots. By asking the browser to render an HTML5 canvas, Picasso can infer the device and the browser's settings and compare them to the ones declared by the client. In case of a mismatch, the traffic is labeled as suspect and blocked.
This topic is super interesting to me since it has implications across different industries, not only in web scraping. That's why I've decided to write a brief walkthrough of the newest papers about browser fingerprinting published on arXiv last year.
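The consistency check described above can be sketched in a few lines of Python. This is a minimal illustration, not Picasso's actual implementation: the reference table and its byte strings are hypothetical stand-ins for the pixel output a given device class would produce on a drawing challenge.

```python
import hashlib

def fingerprint(canvas_bytes: bytes) -> str:
    """Hash the rendered canvas pixels into a compact fingerprint."""
    return hashlib.sha256(canvas_bytes).hexdigest()

# Hypothetical reference table: the hash of the pixels each device class
# is expected to produce for one drawing challenge (illustrative values;
# the real system uses many randomized challenges, not a static lookup).
REFERENCE = {
    fingerprint(b"windows-chrome-pixels"): "Windows/Chrome",
    fingerprint(b"android-chrome-pixels"): "Android/Chrome",
}

def is_suspect(declared_device: str, canvas_bytes: bytes) -> bool:
    """Traffic is suspect when the rendering does not match the
    device class the client claims to be."""
    inferred = REFERENCE.get(fingerprint(canvas_bytes))
    return inferred != declared_device
```

A client declaring itself as Android Chrome while producing a Windows Chrome rendering would be flagged by `is_suspect` and blocked.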
Browser Fingerprinting: Overview and Open Challenges
This paper provides a comprehensive overview of browser fingerprinting, discussing its evolution, methodologies, and the privacy concerns it raises. As we also mentioned in previous articles, the primary reason fingerprinting techniques were born is websites' need for an alternative to cookies. Too many browser extensions allow the blocking or forging of cookies. Additionally, thanks also to the privacy regulations around the world, users are more aware of this technology and can opt out of unnecessary cookie usage during a browsing session, if not browse in incognito mode altogether.
If you're intrigued by the topic and want a brief, not too technical overview of it, this paper is for you.
adF: A Novel System for Measuring Web Fingerprinting through Ads
In this paper, the researchers implemented a fingerprint collection tool and embedded it in online ads shown 5.4 million times to different users on different device setups, both mobile and desktop.
They then calculated each user's fingerprint using 66 browser attributes, so we're talking about passive fingerprinting, since the paper only reads the configuration of the browsers. By measuring the entropy of each attribute, they determined which device configurations are more prone to uniquely identifying a person.
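The entropy measure used here is just Shannon entropy computed per attribute across the observed population. A minimal sketch (the toy population below is invented for illustration):

```python
from collections import Counter
from math import log2

def attribute_entropy(values: list[str]) -> float:
    """Shannon entropy (in bits) of one browser attribute across a
    population: higher entropy means the attribute splits users into
    more distinct groups, so it is more useful for fingerprinting."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Toy population: one user-agent-like attribute observed on 8 visitors.
population = ["chrome-120"] * 4 + ["firefox-121"] * 2 + ["safari-17"] * 2

print(round(attribute_entropy(population), 2))  # -> 1.5 bits
```

An attribute with the same value for everyone carries 0 bits and is useless for identification; summing the bits of (independent) high-entropy attributes is what lets a handful of them single out one user among millions.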
While the paper is interesting to read in its entirety, the key concepts we can extract for our web scraping projects are the following:
- Desktop devices are more prone to unique fingerprinting than mobile ones, since they expose more attributes with higher granularity. That's understandable: the possible combinations of hardware and browser configurations on mobile phones are fewer than on desktop devices. Given that, impersonating a mobile device when scraping, even if technically more difficult, could require less effort when forging a fingerprint.
- Among desktop browsers, Chrome is the one that shares the most details about the device configuration, compared to Safari or Firefox. That's something I've already verified in my experience writing scrapers with Playwright: in many cases, using Chrome led to blocks, while using a more privacy-oriented browser like Brave gave better results on the same websites. This is why in almost every article of The Lab I use Brave instead of Chrome.
- The list of browser attributes considered useful for fingerprinting. It's probably not the complete list of attributes inspected by anti-bots, which also use active techniques like Canvas and WebGL fingerprinting, as well as behavioral analysis, but it's an interesting starting point for sure.
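Swapping Brave in for the bundled Chromium in Playwright only requires pointing `executable_path` at the Brave binary, since Brave is a Chromium fork. A small sketch; the binary path below is the Linux default and is an assumption, so adjust it for your system:

```python
# Hypothetical default location of the Brave binary on Linux --
# on macOS it usually lives under /Applications, on Windows under
# Program Files. Adjust to your installation.
BRAVE_PATH = "/usr/bin/brave-browser"

def brave_launch_kwargs(headless: bool = True) -> dict:
    """Build the keyword arguments for Playwright's
    browser_type.launch() so it starts Brave instead of the
    bundled Chromium."""
    return {
        "executable_path": BRAVE_PATH,
        "headless": headless,
    }

# Usage (requires `pip install playwright`):
# from playwright.sync_api import sync_playwright
# with sync_playwright() as p:
#     browser = p.chromium.launch(**brave_launch_kwargs(headless=False))
#     page = browser.new_page()
```

Because Brave ships with fingerprinting protections enabled by default, the pages you scrape see a noisier, less unique fingerprint than stock Chrome exposes.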
WebGPU-SPY: Finding Fingerprints in the Sandbox through GPU Cache Attacks
It's the second paper I've read in a short time about using the GPU to fingerprint a device. The previous one, from 2022, is called "DRAWNAPART: A Device Identification Technique based on Remote GPU Fingerprinting". There, the researchers used the statistical speed variations of individual execution units in the GPU to uniquely identify a complete system, even between two machines with identical hardware and software configurations.
These papers are, in my opinion, kind of scary in terms of the end user's privacy. Since they rely on the WebGPU web standard, something will probably have to be implemented in browsers to prevent these kinds of attacks. From the web scraping perspective, using these techniques to detect what kind of GPU, if any, is installed on a machine could open up dark scenarios, since most scrapers run on servers.
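The core idea behind these timing attacks can be sketched without a GPU at all. The following is a loose CPU-side analogy, not the paper's WebGPU method: time a few fixed workloads repeatedly and keep per-workload timing statistics as a feature vector. On real hardware, tiny manufacturing differences shift these numbers enough to tell two nominally identical machines apart.

```python
import statistics
import time

def timing_fingerprint(workloads, samples: int = 20) -> list[float]:
    """Run each small fixed workload `samples` times and keep the
    median and spread of its execution times. The resulting vector
    is the (noisy) hardware fingerprint."""
    features = []
    for work in workloads:
        times = []
        for _ in range(samples):
            start = time.perf_counter()
            work()
            times.append(time.perf_counter() - start)
        features.extend([statistics.median(times), statistics.stdev(times)])
    return features

# Two toy CPU workloads standing in for the GPU execution units
# that DRAWNAPART actually measures.
workloads = [
    lambda: sum(i * i for i in range(1000)),
    lambda: [x ** 0.5 for x in range(1000)],
]
vector = timing_fingerprint(workloads)
```

A server farm of identical cloud instances would cluster tightly in this feature space, which is exactly why such measurements could also be turned against scrapers running on datacenter hardware.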
Thanks for reading this episode of The Web Scraping Club. You're more and more every day, and this pushes me to raise the bar.
I've played with the custom GPTs provided by OpenAI and spent some time training a small one, feeding it some articles from The Web Scraping Club. Its replies to my questions were quite good: of course not very detailed, but much better than standard ChatGPT. The issue is that I cannot upload more than 20 articles, because of the limit on the number of attachments.
So I'm asking the community: is any of you willing to create a sort of The Web Scraping Club GPT? It should read all the articles from the blog and be able to answer prompts using the data contained in them.
If so, please write me at email@example.com