This article is sponsored by Serply, the solution to scrape search engine results easily.
Web Scraping Club readers can save 25% on all SERP scraping plans by using the code TWSC25.
Welcome to our monthly interview. This time it’s the turn of Veritas, a NYC hacker and blogger. You can read his great posts here.
1. First of all, tell us a bit about yourself (whatever you want to share with us, your experiences, and so on).
2. In your latest post, which went pretty viral, you studied how TikTok gathers data from users via obfuscated scripts that hide their Canvas and device fingerprinting techniques. What does it mean for the end user?
For the end user, this means TikTok can create a precise profile of you, beyond your IP address or account information. The outcome could be fewer bots on the platform, improved ad personalization, or monitoring of users across multiple accounts. It's worth mentioning that this practice is not unique to TikTok and occurs to some extent on every major platform. I will say that this is the first time I've seen the data collection code so heavily obfuscated.
3. TikTok is now being banned from government devices in the US. Do you think it’s a political move, or is it a real threat to the privacy of people in the public eye?
I'd say it's important to consider the data collection practices of *all* companies, regardless of their country of origin, and make informed decisions about the use of their products. I don't believe that TikTok poses any more of a threat to privacy than the data collection practices of US-based tech companies.
4. How do modern fingerprinting techniques represent an issue for internet users’ privacy? Is it really possible to track a single user, with a certain degree of precision, even without using cookies?
While individual attributes, such as browser type or device model, may not provide enough information to uniquely identify a user, the combination of multiple attributes can lead to a high level of accuracy in user fingerprinting. For example, a service known as FingerprintJS Pro claims to provide 99.5% accuracy and returns a unique `visitorId` token used to track a user. It claims to be even better than cookies since this technique isn't hindered by incognito mode or users who delete cookies.
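To make the "combination of attributes" idea concrete, here is a minimal sketch that hashes several weak signals into one identifier. It is an illustration only: the attribute names and values are hypothetical, and it does not reflect how FingerprintJS Pro or TikTok actually compute their identifiers.

```python
import hashlib
import json


def fingerprint(attributes: dict[str, object]) -> str:
    """Combine many weak signals into one stable identifier."""
    # Serialize with sorted keys so the same attributes always hash the same way.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical attribute values, for illustration only.
visitor_a = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "screen": [1920, 1080],
    "timezone": "America/New_York",
    "canvas_hash": "9f2c1e",
}
# Same browser, screen, and timezone, but a different canvas rendering result.
visitor_b = {**visitor_a, "canvas_hash": "41b7aa"}

id_a = fingerprint(visitor_a)
id_b = fingerprint(visitor_b)
```

Each attribute on its own is shared by many users, but a single differing signal (here the canvas hash) is enough to produce a different identifier, and none of it depends on cookies.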
5. What is your deobfuscation stack of tools and processes?
6. You have a great passion for deobfuscation. Where does it come from? How did you become such an expert in it? Do you have any suggestions for someone who wants to follow your path?
I find magic in unraveling the purpose of code that's deliberately made super complex to understand. For me, it's analogous to solving a jigsaw puzzle. Once you gain the understanding, you can begin bending the rules and have the system perform in unintended ways. In many ways, you become a magician! My interest in reverse engineering stemmed from wanting to create game hacks to troll my friends at a young age, such as writing code to fly around while everyone else was stuck traveling on foot. My suggestion for anyone looking to do the same is to not spend too much time thinking. I often see people getting stuck in the trap of convincing themselves they don't know where to start. Just pick a project and break your ultimate goal into smaller questions. From there, google away.
7. How is deobfuscation linked to web scraping?
Some companies' entire business is held up by the data they possess and thus they try to keep the data away from competitors. This may lead to some pretty clever obfuscation techniques to make scraping and interpreting the data as annoying as possible. I've seen businesses use completely custom binary formats for their web app's data and only deserialize the data using WebAssembly. It can get pretty nasty, but it also makes for a fun deobfuscation challenge.
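As a rough illustration of what decoding a custom binary format involves once its layout has been worked out, the sketch below packs and unpacks a made-up record format with Python's `struct` module. The layout, field names, and helper functions are all hypothetical and are not taken from any real site.

```python
import struct

# Hypothetical record layout (little-endian), invented for this example:
#   uint32 product id, float32 price, uint8 name length, then UTF-8 name bytes.
HEADER = struct.Struct("<IfB")


def encode_record(product_id: int, price: float, name: str) -> bytes:
    """Build one record, standing in for what a server might send."""
    raw_name = name.encode("utf-8")
    return HEADER.pack(product_id, price, len(raw_name)) + raw_name


def decode_records(blob: bytes) -> list[dict]:
    """Walk the blob record by record and return plain dictionaries."""
    records = []
    offset = 0
    while offset < len(blob):
        product_id, price, name_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        name = blob[offset:offset + name_len].decode("utf-8")
        offset += name_len
        records.append({"id": product_id, "price": round(price, 2), "name": name})
    return records


blob = encode_record(1, 9.5, "mug") + encode_record(2, 24.0, "lamp")
records = decode_records(blob)
```

In practice the hard part is recovering the layout itself, for example by reading the WebAssembly deserializer; once the layout is known, the decoder tends to be short.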
8. Do you have any web scraping experience? Which tool do you use the most?
When scraping, I tend to look for any exposed APIs that may provide me the data I need. For a storefront, this may be as simple as a "/api/products.json" endpoint. Some sites may provide official APIs, but if they don't, then you may have to snoop around and find the private endpoints. If no such endpoints exist at all, libraries such as cheerio.js, Playwright, jsdom, etc. make it easy to grab the data from the source of the web page itself.
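The "API first, HTML as a fallback" approach can be sketched with the Python standard library alone. The `{"products": [...]}` payload shape and the `product-title` class name are assumptions made for this example; a real site will use its own structure, and the libraries named above are the JavaScript equivalents of the fallback step.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


def parse_products(payload: str) -> list[dict]:
    # Assumes a {"products": [...]} shape; real endpoints will differ.
    return json.loads(payload).get("products", [])


def fetch_products_json(base_url: str) -> list[dict]:
    """Try the JSON endpoint first (this performs a network request)."""
    with urlopen(urljoin(base_url, "/api/products.json"), timeout=10) as resp:
        return parse_products(resp.read().decode("utf-8"))


class TitleParser(HTMLParser):
    """Fallback: collect text inside elements with class "product-title"."""

    def __init__(self) -> None:
        super().__init__()
        self.titles: list[str] = []
        self._capture = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "product-title" in classes:
            self._capture = True

    def handle_data(self, data):
        if self._capture and data.strip():
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        # Simplification: stop at the first closing tag (no nested markup).
        self._capture = False


def titles_from_html(html: str) -> list[str]:
    parser = TitleParser()
    parser.feed(html)
    return parser.titles
```

The JSON route is usually preferable when it exists, since the data arrives already structured and does not break when the page layout changes.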
9. How do you see the web scraping tools industry in the future?
The web scraping industry is a cat-and-mouse game. As fingerprinting and other bot prevention techniques become more prevalent, tools need to become more resilient to these detections. We already see this with some passive fingerprinting techniques, such as TLS and HTTP/2 fingerprinting, which require scraping requests to appear as if they came from a legitimate browser. I feel web scraping tools will become more and more reliant on a browser's engine versus super lightweight HTTP clients or calls to cURL.
10. You mentioned a fun story about recompiling a browser to tackle an anti-bot system. Would you mind sharing it?
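One publicly documented form of passive TLS fingerprinting is JA3, which hashes fields from the TLS ClientHello. The interview does not name a specific method, so the sketch below is only an example of the general idea, and the numeric values are hypothetical stand-ins rather than captures from real clients.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats) -> str:
    """Build a JA3-style string: five fields joined by commas,
    with the values inside each field joined by dashes."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return ",".join(fields)


def ja3_hash(ja3: str) -> str:
    # JA3 uses MD5 purely as a compact identifier, not for security.
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Hypothetical ClientHello values for two clients, for illustration only.
browser_like = ja3_string(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
script_like = ja3_string(771, [4866, 4865], [0, 23], [29], [0])

browser_hash = ja3_hash(browser_like)
script_hash = ja3_hash(script_like)
```

Because the cipher and extension lists come from the TLS library rather than from request headers, a lightweight HTTP client produces a different hash than a browser even when its User-Agent header is copied exactly.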
The Lab - premium content with real-world cases
THE LAB #8: Using Bezier curves for human-like mouse movements
THE LAB #6: Changing Ciphers in Scrapy to avoid bans by TLS Fingerprinting
THE LAB #4: Scrapyd - how to manage and schedule a fleet of scrapers
THE LAB #2: scraping data from a website with Datadome and xsrf tokens
I love passionate people, and discover how they dive into their passion! You rock!
I'm interested also if you have trick about rebuilding a modified firefox 😉
Very insightful. Thanks for this