This article is sponsored by Serply, the solution to scrape search engine results easily.
Web Scraping Club readers can save 25% on all SERP scraping plans by using the code TWSC25.
Welcome to our monthly interview. This time it’s the turn of Veritas, a NYC hacker and blogger. You can read his great posts here.
1. First of all, tell us a bit about yourself (whatever you want to share with us, your experiences, and so on).
Hi, I'm Veritas. I'm a software engineer / hacker based out of New York. I started my programming journey at the age of eight. Club Penguin-related forums were fairly popular in those days and, as a child, I enjoyed creating my own. Although these sites were built on fairly simple forum hosts like Forumotion, they still allowed a great level of customizability if you knew at least introductory-level JavaScript. I now spend my free time reverse-engineering obfuscated JS and the like.
2. In your latest post, which went pretty viral, you studied how TikTok gathers data from users via obfuscated scripts that hide their Canvas and device fingerprinting techniques. What does it mean for the end user?
For the end user, this means TikTok can create a precise profile of you, beyond your IP address or account information. The outcome could be fewer bots on the platform, improved ad personalization, or monitoring of users across multiple accounts. It's worth mentioning that this practice is not unique to TikTok and occurs to some extent on every major platform. I will say that this is the first time I've seen data collection code so heavily obfuscated.
3. Nowadays in the US, TikTok is being banned from government devices. Do you think it’s a political move, or is it a real threat to the privacy of people in the public eye?
I'd say it's important to consider the data collection practices of *all* companies, regardless of their country of origin, and make informed decisions about the use of their products. I don't believe that TikTok poses any more of a threat to privacy than the data collection practices of US-based tech companies.
4. How do modern fingerprinting techniques represent an issue for internet users’ privacy? Is it really possible to track a single user, with a certain degree of precision, even without using cookies?
While individual attributes, such as browser type or device model, may not provide enough information to uniquely identify a user, the combination of multiple attributes can lead to a high level of accuracy in user fingerprinting. For example, a service known as FingerprintJS Pro claims to provide 99.5% accuracy and will provide a unique `visitorId` token used to track a user. It claims to be even better than cookies since this technique isn't hindered by incognito mode or users who delete cookies.
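To make the idea of combining attributes concrete, here is a minimal sketch. It is not FingerprintJS's algorithm: the `BrowserAttributes` fields, the sample values, and the simple djb2-style hash are all assumptions chosen to keep the example short and dependency-free.

```typescript
// Hypothetical attribute set; real fingerprinting scripts collect far more signals.
interface BrowserAttributes {
  userAgent: string;
  language: string;
  timezone: string;
  screen: string;
  canvasHash: string;
}

// djb2-style 32-bit hash, used here only because it is short; not a real product's hash.
function hashString(input: string): string {
  let hash = 5381;
  for (let i = 0; i < input.length; i++) {
    hash = ((hash << 5) + hash + input.charCodeAt(i)) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

// Combine every attribute into one identifier.
function fingerprint(attrs: BrowserAttributes): string {
  // Sort keys so the same attributes always serialize the same way.
  const canonical = Object.keys(attrs)
    .sort()
    .map((key) => `${key}=${attrs[key as keyof BrowserAttributes]}`)
    .join("|");
  return hashString(canonical);
}

// Sample values are made up for illustration.
const visitor: BrowserAttributes = {
  userAgent: "Mozilla/5.0 (X11; Linux x86_64)",
  language: "en-US",
  timezone: "America/New_York",
  screen: "1920x1080",
  canvasHash: "a1",
};
const visitorId = fingerprint(visitor);
```

None of the five values is unique by itself, but changing any one of them changes the resulting identifier, and no cookie is involved at any point.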
5. What is your de-obfuscation tech stack of tools and processes?
I tend to always start with a JavaScript formatter of some kind. beautifier.io is an online JavaScript beautifier that always seems to give me great results. For the actual deobfuscation, I use the Babel suite (@babel/parser, @babel/traverse, etc.) and write my own transformations. I actually have a boilerplate repository on GitHub that I always start with (https://github.com/voidstar0/ast-playground). AST Explorer (https://astexplorer.net/) is another gem that makes visualizing a script's syntax tree super easy. Other than that, it's mainly just patience, testing, and taking notes as I go. I have some articles written where I describe certain techniques more in-depth, but that's usually the basis.
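As a rough illustration of what a syntax-tree transformation does, the sketch below folds split string literals back together, a common first deobfuscation pass. It deliberately uses a hand-rolled toy node type instead of the real Babel API so that it runs without dependencies; the `Expr` type and `foldStrings` function are hypothetical names, not part of Babel or of the repository mentioned above.

```typescript
// Toy node types loosely modeled on ESTree; NOT the real Babel API.
type Expr =
  | { type: "StringLiteral"; value: string }
  | { type: "Identifier"; name: string }
  | { type: "BinaryExpression"; operator: "+"; left: Expr; right: Expr };

// Recursively fold "a" + "b" into "ab", leaving everything else untouched.
function foldStrings(node: Expr): Expr {
  if (node.type !== "BinaryExpression") return node;
  const left = foldStrings(node.left);
  const right = foldStrings(node.right);
  if (left.type === "StringLiteral" && right.type === "StringLiteral") {
    return { type: "StringLiteral", value: left.value + right.value };
  }
  return { type: "BinaryExpression", operator: "+", left, right };
}

// Represents the obfuscated expression: "deb" + "ug" + "ger"
const obfuscated: Expr = {
  type: "BinaryExpression",
  operator: "+",
  left: {
    type: "BinaryExpression",
    operator: "+",
    left: { type: "StringLiteral", value: "deb" },
    right: { type: "StringLiteral", value: "ug" },
  },
  right: { type: "StringLiteral", value: "ger" },
};

const folded = foldStrings(obfuscated);
```

With Babel, the same logic would typically live in a visitor passed to `@babel/traverse`, with the parser and generator handling the conversion between source text and tree.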
6. You have a great passion for deobfuscation. Where does it come from? How did you become such an expert in it? Do you have any suggestions for someone who wants to follow your path?
I find magic in unraveling the purpose of code that's deliberately made super complex to understand. For me, it's analogous to solving a jigsaw puzzle. Once you gain the understanding, you can begin bending the rules and have the system perform in unintended ways. In many ways, you become a magician! My interest in reverse engineering stemmed from wanting to create game hacks to troll my friends at a young age: being able to write code to fly around while everyone else is stuck traveling on foot. My suggestion for anyone looking to do the same is to not spend too much time thinking. I often see people getting stuck in the trap of convincing themselves they don't know where to start. Just pick a project and break your ultimate goal into smaller questions. From there, google away.
7. How is deobfuscation linked to web scraping?
Some companies' entire business is held up by the data they possess and thus they try to keep the data away from competitors. This may lead to some pretty clever obfuscation techniques to make scraping and interpreting the data as annoying as possible. I've seen businesses use completely custom binary formats for their web app's data and only deserialize the data using WebAssembly. It can get pretty nasty, but it also makes for a fun deobfuscation challenge.
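To give a feel for what reading such a format involves once its layout has been worked out, here is a small decoder. The wire format is invented purely for illustration (it is not taken from any real site), and the `Item` type and `decodeItems` function are hypothetical names.

```typescript
// Hypothetical wire format, invented for this example:
//   [uint16 count], then per record:
//   [uint32 priceCents][uint8 nameLength][nameLength ASCII bytes]
interface Item {
  name: string;
  priceCents: number;
}

function decodeItems(bytes: Uint8Array): Item[] {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let offset = 0;
  const count = view.getUint16(offset); // DataView reads big-endian by default
  offset += 2;
  const items: Item[] = [];
  for (let i = 0; i < count; i++) {
    const priceCents = view.getUint32(offset);
    offset += 4;
    const nameLength = view.getUint8(offset);
    offset += 1;
    let name = "";
    for (let j = 0; j < nameLength; j++) {
      name += String.fromCharCode(view.getUint8(offset + j));
    }
    offset += nameLength;
    items.push({ name, priceCents });
  }
  return items;
}

// One record: price 4800 cents, name "Tee".
const payload = new Uint8Array([0x00, 0x01, 0x00, 0x00, 0x12, 0xc0, 0x03, 0x54, 0x65, 0x65]);
const items = decodeItems(payload);
```

The decoder itself is short. The hard part in practice is recovering the field layout in the first place, especially when the deserializer is hidden inside WebAssembly.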
8. Do you have any web scraping experience? Which tool do you use the most?
When scraping, I tend to look for any exposed APIs that may provide me the data I need. For a storefront, this may be as simple as a "/api/products.json" endpoint. Some sites may provide official APIs, but if they don't, then you may have to snoop around and find the private endpoints. If they don't exist at all, libraries such as cheerio.js, Playwright, jsdom, etc. make it easy to grab the data from the source of the web page itself.
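The "API first, HTML as a fallback" approach can be sketched as a small parsing step. The response shape (`{ "products": [{ "title", "price" }] }`) is an assumption about what a storefront endpoint might return, and `parseProducts` is a hypothetical helper. Fetching the body is left out so the example runs offline.

```typescript
interface Product {
  title: string;
  price: string;
}

// Returns the products found in a JSON body, or null when the body is not
// the JSON shape assumed here (the cue to fall back to parsing the HTML).
function parseProducts(body: string): Product[] | null {
  let data: unknown;
  try {
    data = JSON.parse(body);
  } catch {
    return null; // not JSON at all, probably an HTML page
  }
  if (typeof data !== "object" || data === null) return null;
  const products = (data as { products?: unknown }).products;
  if (!Array.isArray(products)) return null;
  // Keep only entries that have the fields we rely on.
  return products.filter(
    (p): p is Product =>
      typeof p === "object" &&
      p !== null &&
      typeof p.title === "string" &&
      typeof p.price === "string"
  );
}

// Sample bodies are made up for illustration.
const fromApi = parseProducts('{"products":[{"title":"Box Logo Tee","price":"48.00"},{"bad":1}]}');
const fromHtml = parseProducts("<html><body>Shop</body></html>");
```

A `null` result is the signal to switch to an HTML parser such as cheerio or a browser-driven tool such as Playwright.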
9. How do you see the web scraping tools industry in the future?
The web scraping industry is a cat-and-mouse game. As fingerprinting and other bot prevention techniques become more prevalent, tools need to become more resilient to these detections. We already see this with some passive fingerprinting techniques, such as TLS and HTTP/2 fingerprinting, which require scraping requests to appear as if they came from a legitimate browser. I feel web scraping tools will start to become more and more reliant on a browser's engine versus super lightweight HTTP clients or calls to cURL.
10. You mentioned a fun story about recompiling a browser to tackle an anti-bot. Would you mind sharing it?
As mentioned before, this industry tends to be a game of cat and mouse. Companies tend to deploy clever and annoying obfuscation techniques to protect their source code. This is what I experienced while reverse-engineering online retailer Supreme's anti-bot. Supreme's website used to host a script known as "ticket.js" whose primary purpose was to prevent bots from purchasing items during their product launches. The script would collect various attributes from a user's browser and encrypt them so one couldn't easily tamper with the data before it was sent off to their servers through a cookie. Their obfuscator, JScrambler, allowed them to trap anyone who was looking at the source in an infinite debugger loop using the JavaScript debugger keyword. This made it impossible to use the DevTools debugger for code analysis. One solution to this problem is to deactivate all breakpoints inside DevTools, but then you can't analyze their code using breakpoints. My solution to this? Build Firefox from source and rename JavaScript's "debugger" keyword to "banana". Their trap is now completely void because my browser's JavaScript engine now thinks "debugger" is not a valid keyword. I could've used any name, but banana was the first to come to mind. Some problems require unique solutions, and this is one fun example of such.
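For readers who have not met this kind of trap, the sketch below shows the general shape of a `debugger`-based loop. It is illustrative only and is not JScrambler's actual output; `debuggerTrap` is a hypothetical name, and the loop is bounded here so the example terminates.

```typescript
// With no debugger attached, the `debugger` statement is a no-op and the
// loop finishes immediately. With DevTools open, execution pauses on every
// iteration, which makes stepping through the surrounding code impractical.
function debuggerTrap(iterations: number): number {
  let hits = 0;
  for (let i = 0; i < iterations; i++) {
    debugger; // pauses here whenever a debugger is attached
    hits++;
  }
  return hits;
}

// A real trap would keep re-arming itself (for example on a timer);
// three iterations are enough to show the idea.
const hits = debuggerTrap(3);
```

Because the pause is triggered by the engine recognizing the `debugger` statement, an engine that no longer treats that word as the keyword never stops there, while ordinary breakpoints set by the analyst keep working.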