THE LAB #64: JWT Tokens and API scraping
How to create scrapers that use token authentication for API data retrieval
One of the best things that can happen to a web scraper is finding a web API underlying the target website or app. When you find one, you know you’ve found a reliable data source that is less prone to changes than HTML, since it is probably used by more clients than the website alone. This is also why, in most cases, the API response contains more information than the website shows.
While we can find unauthenticated APIs on most websites, this is often not true for apps, especially if these use endpoints that the website doesn’t. This is the case with the Tractor Supply Company website: if you use the network inspector from your browser, you don’t see any API call to retrieve the data, but if you monitor the network while using their app, after having unpinned the SSL certificate, you will discover them.
Finding the most efficient way to scrape a website is one of the services we offer in our consulting tasks, in addition to projects aimed at boosting the cost efficiency and scalability of your scraping operations. Want to know more? Let’s get in touch.
Before discussing the details of this particular scraper, let’s first understand what JWT tokens are and how they work.
What is a JWT token
Let’s start with the official definition from the jwt.io website:
JSON Web Token (JWT) is an open standard (RFC 7519) that defines a compact and self-contained way for securely transmitting information between parties as a JSON object.
We can see it as a protocol for exchanging trusted information between parties, which is why it’s commonly used for authentication in apps and websites after the user’s initial login.
It’s more or less what happens when we need to fly to another city: we buy a ticket on the airline’s website and receive a boarding pass as proof of our purchase, with the flight data and our personal information inside its QR code. At the gate, the airline crew scans the boarding pass, checks that the information in the QR code is valid, and, if it is, lets us board the plane.
The same happens with JWT tokens used as an authentication method: we first log in to a website. If the login succeeds, an issuer (the airline in the previous example) gives us a signed token with some information embedded. The next time we call an API on that website, for example because we need to browse a product catalog, we pass the token in the request’s headers. The website receives the token, checks its content and signature, and, if everything is fine, returns the API response.
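As a minimal sketch of the "pass the token in the request’s headers" step, the snippet below builds a request with Python’s standard library without sending it. The URL and the token are placeholders, and the `Authorization: Bearer` scheme is an assumption: it is the most common convention, but some APIs expect the token in a different header.

```python
import urllib.request

# Hypothetical values for illustration only: neither the URL nor the token is real.
API_URL = "https://api.example.com/catalog/products"
TOKEN = "XXXXX.YYYYY.ZZZZZ"


def build_authenticated_request(url: str, token: str) -> urllib.request.Request:
    """Attach the token to the request headers (Bearer scheme assumed)."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Bearer {token}")
    request.add_header("Accept", "application/json")
    return request


# The request is only built here, not sent.
request = build_authenticated_request(API_URL, TOKEN)
print(request.get_header("Authorization"))
```

When replicating an app’s traffic, it is safer to copy the exact header name and format seen in the intercepted calls rather than rely on this convention.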
What is the JSON Web Token structure?
Every JWT token has a structure that follows this pattern: XXXXX.YYYYY.ZZZZZ
The three sections separated by dots correspond to the three parts of the JWT: the header, the payload, and the signature.
The header specifies the token type—JWT—and the signing algorithm used, such as HS256 or RS256. It is a JSON object that, once base64Url encoded, forms the first part of the token. For example:
{
"alg": "HS256",
"typ": "JWT"
}
The payload contains the claims: statements about an entity (typically the user) and additional metadata. Claims fall into three categories: registered, public, and private. Registered claims are predefined and recommended by the JWT specification, like iss (issuer), exp (expiration time), sub (subject), and aud (audience). Public claims are custom claims defined by those using JWTs; their names should be collision-resistant, which is often achieved with namespaces. Private claims are application-specific and agreed upon between parties. An example payload might look like:
{
"sub": "1234567890",
"name": "John Doe",
"admin": true,
"iat": 1516239022
}
The signature is created by taking the base64Url-encoded header and payload, concatenating them with a period, and signing the resulting string with the algorithm named in the header, using a secret key (for symmetric algorithms such as HS256) or a private key (for asymmetric algorithms such as RS256). The final JWT is a compact string formed by concatenating the encoded header, payload, and signature with periods, like
eyJhbGciOiJSUzI1NiIsImtpZCI6IjhkNzU2OWQyODJkNWM1Mzk5MmNiYWZjZWI2NjBlYmQ0Y2E1OTMxM2EiLCJ0eXAiOiJKV1QifQ.eyJwcm92aWRlcl9pZCI6ImFub255bW91cyIsImlzcyI6Imh0dHBzOi8vc2VjdXJldG9rZW4uZ29vZ2xlLmNvbS9ldmdvLWZhbGNvbi1wcm9kIiwiYXVkIjoiZXZnby1mYWxjb24tcHJvZCIsImF1dGhfdGltZSI6MTcyODQ3OTMzOSwidXNlcl9pZCI6ImpRUmpaZGkxMkNabkZaUExwVVJENHdKODZnSTMiLCJzdWIiOiJqUVJqWmRpMTJDWm5GWlBMcFVSRDR3Sjg2Z0kzIiwiaWF0IjoxNzI4NDc5MzM5LCJleHAiOjE3Mjg0ODI5MzksImZpcmViYXNlIjp7ImlkZW50aXRpZXMiOnt9LCJzaWduX2luX3Byb3ZpZGVyIjoiYW5vbnltb3VzIn19.EN4JeeSLI-fMpyERsG4ebbHIcc7G3GbDXIA6JjN33kbNIHxNFU9jeC6KLz4WX_T0cC43Qz5s7dQflBmMonPkj4UcwC_blJPJn7bMa1IepBqNb_RyB2uAgLweJtpv3g2GzNLwxBWXz1R84gUkYXQGwUabJBsUIH9mBAklH-2khi_jS2dIlbVYVl3LFNhKR19kflcXlgby5GdLUcaazM8az6zRFC2ZtjsvC4opj1TMHfgCBbrIcxpESlsrU3wu1JKhKbAL-h1nv5I0eaX5HSuzdDzQiOQ1K2hPO5shkXs1re6nqeRZoVhGb5mzVKS49Gcl2dGB8ZXmk5WQUtuGbphbDQ
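To make the construction concrete, here is a minimal sketch of the symmetric HS256 case using only Python’s standard library. The secret is a made-up value, and the sample token above uses RS256 instead, which needs an asymmetric key pair and is not covered here. In real projects, a maintained JWT library is a better choice than hand-rolled code.

```python
import base64
import hashlib
import hmac
import json


def b64url_encode(data: bytes) -> str:
    """Base64url-encode and drop the '=' padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(header: dict, payload: dict, secret: bytes) -> str:
    """Build header.payload.signature using HMAC-SHA256."""
    segments = [
        b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    ]
    signing_input = ".".join(segments).encode("ascii")
    signature = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return ".".join(segments + [b64url_encode(signature)])


def verify_hs256(token: str, secret: bytes) -> bool:
    """Recompute the signature over header.payload and compare it."""
    try:
        signing_input, signature = token.rsplit(".", 1)
    except ValueError:
        return False
    expected = hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()
    return hmac.compare_digest(
        b64url_encode(expected).encode("ascii"), signature.encode("utf-8")
    )


# Hypothetical secret, for illustration only.
SECRET = b"not-a-real-secret"
header = {"alg": "HS256", "typ": "JWT"}
payload = {"sub": "1234567890", "name": "John Doe", "admin": True, "iat": 1516239022}

token = sign_hs256(header, payload, SECRET)
print(verify_hs256(token, SECRET))

# Swapping the payload while keeping the old signature breaks verification.
head_b64, _, sig_b64 = token.split(".")
forged_payload = b64url_encode(json.dumps({**payload, "admin": False}).encode("utf-8"))
forged = ".".join([head_b64, forged_payload, sig_b64])
print(verify_hs256(forged, SECRET))
```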
Because only the issuer holds the secret or private key, a third party cannot alter the payload, for example to extend the token’s expiry date, without invalidating the signature.
On the website JWT.io, you can see the content of the encoded token to get an idea of what’s inside the payload.
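The same inspection can be done locally, since the header and payload are only base64Url-encoded, not encrypted. The sketch below decodes them without checking the signature, so it is suitable for reading a token but not for trusting it. The sample token is built in the snippet from the example header and payload shown earlier, with a placeholder signature.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Restore the stripped '=' padding, then base64url-decode."""
    padded = segment + "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(padded)


def decode_unverified(token: str) -> dict:
    """Return the decoded header and payload; the signature is NOT verified."""
    header_b64, payload_b64, _signature = token.split(".")
    return {
        "header": json.loads(b64url_decode(header_b64)),
        "payload": json.loads(b64url_decode(payload_b64)),
    }


def _encode(obj: dict) -> str:
    """Helper used only to build the sample token below."""
    raw = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# Sample token with a placeholder signature, for illustration only.
sample_token = ".".join([
    _encode({"alg": "HS256", "typ": "JWT"}),
    _encode({"sub": "1234567890", "name": "John Doe", "admin": True, "iat": 1516239022}),
    "placeholder-signature",
])

decoded = decode_unverified(sample_token)
print(json.dumps(decoded, indent=2))
```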
For scraping, the most interesting part of the payload is the token’s expiry date. It varies from website to website, from a few minutes to several hours, and it tells you how often you need to refresh the token.
A real-world example: the Tractor Supply Co. app
Tractor Supply Co. is one of the leading retail chains in the home improvement industry, with 14.6 billion USD in revenue in 2023. It is listed on the Nasdaq stock exchange.
As we saw earlier, the company's website exposes no API for browsing the product assortment, but we can work around this by using the company's mobile app.
Before intercepting the app's traffic, a preliminary step is to create a virtual Android device, root it, and install the Frida server, as explained in one of the previous The Lab articles. This will allow us to unpin the SSL certificate from the app.
After our testing environment is set up, we can start our virtual device, the Frida server on it, and the JS script to unpin the certificate. We can keep Fiddler Everywhere in the background on our computer, ready to listen to every network call.
Once the app has fully started, Fiddler will show the first API calls flowing from the app to the TSC backend servers. One of them is a request for a JWT token that, even if we’re not logged in to the app, we’ll use to authenticate to the other API endpoints the app uses.
As always, if you want to have a look at the code, you can access the GitHub repository, available for paying readers. For this article, I’ve created different files under folder 64.JWT that are split by the task they perform.
If you’re one of them but don’t have access to it, please write me at pier@thewebscraping.club to get it.