When starting a web scraping project, one of the first things to check is whether the website we want to get data from exposes an internal API.
If there is one, that's the preferred choice for extracting data: it's more lightweight on both the server and the scraper side, since you're requesting less data. It's also more cost-efficient, in case you're using proxies billed per GB transmitted. Last but not least, it's more reliable, since APIs are less prone to changes than the HTML code.
In some cases, you may encounter APIs that require an authentication method, like Bearer tokens.
What is a Bearer Token?
Let’s use the definition given on the Swagger Website:
Bearer authentication (also called token authentication) is an HTTP authentication scheme that involves security tokens called bearer tokens. The name “Bearer authentication” can be understood as “give access to the bearer of this token.” The bearer token is a cryptic string, usually generated by the server in response to a login request. The client must send this token in the Authorization header when making requests to protected resources: Authorization: Bearer <token>
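In practice, sending the token in the Authorization header looks like this. Here is a minimal Python sketch using only the standard library; the URL and token value are placeholders:

```python
import urllib.request

def bearer_headers(token):
    # Build the Authorization header exactly as the scheme requires:
    # "Authorization: Bearer <token>"
    return {"Authorization": f"Bearer {token}"}

# Attach the header to a request object (no network call is made here)
req = urllib.request.Request(
    "https://api.example.com/protected",
    headers=bearer_headers("eyJhbGciOi..."),
)
print(req.get_header("Authorization"))  # → Bearer eyJhbGciOi...
```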
We can easily detect these API endpoints from the network inspector in the browser:
In these cases, we cannot access the API endpoint with a simple GET request (for example, by loading the URL in the browser); we need to understand how this token is generated and try to replicate the mechanism.
But before starting with a real-world example, it’s important to understand that the Bearer authentication method is only one of the many you can encounter.
Differences with Castle antibot tokens
Castle.io is an anti-bot solution that is also used to protect API endpoints. In this case too, when we make a request to the endpoint, we need to pass a token inside the headers, called x-castle-request-token.
Unlike Bearer tokens, where we can observe the website's behavior and replicate it in our scraper, Castle tokens are generated using information from our browser, so we cannot create a new one unless we reverse-engineer the whole anti-bot solution.
The folks at Takion seem to have done it (I still haven't tried their solution), but this can be overkill if the data you need can be read elsewhere.
How to handle Bearer Tokens in scraping
As we just mentioned, if we encounter an API endpoint requiring a Bearer token, we don't need to reverse-engineer anything; we just need to inspect the network traffic carefully in order to understand how the authentication works.
We could divide this process into three steps:
The token request, where the website calls a first endpoint in order to generate a token
The token parsing, where the website receives the JSON containing the token and reads it
The final API call, where the website adds the token to the headers of the target API call and gets the data it needs to load on the page.
All these steps can be seen in the network tab of the browser's developer tools, as the following example will show.
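As a rough Python sketch, the three steps map to code like this. The endpoint URLs and the token request payload are hypothetical placeholders, and the access_token field name is an assumption based on the standard OAuth2 response shape:

```python
import json
import urllib.request

# Hypothetical endpoints for illustration: the real URLs come from the
# network tab of the site you're inspecting.
TOKEN_URL = "https://www.example.com/api/oauth2/token"
DATA_URL = "https://www.example.com/api/product-search"

def parse_token(body):
    # Step 2: read the token out of the JSON response.
    # "access_token" is the standard OAuth2 field name (an assumption here).
    return body["access_token"]

def fetch_json(url, data=None, headers=None):
    req = urllib.request.Request(url, data=data, headers=headers or {})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode())

def get_data():
    # Step 1: call the token endpoint (POST, with site-specific parameters).
    token = parse_token(fetch_json(TOKEN_URL, data=b"grant_type=..."))
    # Step 3: call the target API with the token in the Authorization header.
    return fetch_json(DATA_URL, headers={"Authorization": f"Bearer {token}"})
```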
We’ll create a scraper that uses the internal API of an e-commerce website to scrape data efficiently by implementing the previous three steps.
As always, if you want to have a look at the code, you can access the GitHub repository available for paying readers. You can find this example in the folder named 51.BEARER
If you’re one of them but don’t have access to it, please write me at pier@thewebscraping.club to obtain it.
Finding the API containing the data we need
In this example, we’ll use the Loewe e-commerce website as a case study for this technique.
When browsing a product category, we can find the following call:
https://www.loewe.com/mobify/proxy/api/search/shopper-search/v1/organizations/f_ecom_bbpc_prd/product-search?siteId=LOE_USA&refine=htype%3Dset%7Cvariation_group&refine=price%3D%280.01..1370000000%29&refine=cgid%3Dwomen&refine=c_LW_custom_level%3Dwomen&currency=USD&locale=en-US&offset=32&limit=32&c_isSaUserType=false&c_countryCode=US
which returns, at least in the browser, a JSON document containing a page of the products shown for that category in the US.
Even though this is a GET call, so there's no payload to pass, if we enter this URL in the browser's address bar we get this error:
{"title":"Unauthorized","type":"https://api.commercecloud.salesforce.com/documentation/error/v1/errors/unauthorized","detail":"Unauthorized request"}
The reason is quite simple: we need to pass the Bearer token in the headers as the website does.
How to generate the token
The first step is to figure out how to generate this token and the easiest way to do it is to look for the token string using the search function in the network tab (Control-F).
We should find the first call where the token is not used in the headers but can be found in the response, like in this case:
This is the result of a POST call to the following endpoint:
https://www.loewe.com/mobify/proxy/api/shopper/auth/v1/organizations/f_ecom_bbpc_prd/oauth2/token
which needs some parameters to work:
client_id: which seems to be a fixed string, at least when making calls from my laptop, so we’ll hardcode it
channel_id: which depends on the country of the website we’re scraping
grant_type: refresh_token, which is the action we want to take
refresh_token: a hashed string that we need to figure out how to obtain.
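As a small sketch, the request body for this call can be built like this. The parameter names are the ones observed above; the values in the usage example are placeholders:

```python
from urllib.parse import urlencode

def refresh_token_payload(client_id, channel_id, refresh_token):
    # Parameter names as observed in the network tab; grant_type is fixed
    # to "refresh_token" for this flow.
    return urlencode({
        "client_id": client_id,
        "channel_id": channel_id,
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
    })

# e.g. refresh_token_payload("my-client-id", "LOE_USA", "abc123")
# → "client_id=my-client-id&channel_id=LOE_USA&grant_type=refresh_token&refresh_token=abc123"
```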
Please note that this refresh happens around 30 minutes after you've loaded the website's page in the browser, once the first token has expired. The same endpoint is used to generate the first token as soon as you enter the website, by changing the grant_type parameter to authorization_code_pkce and adding other values we don't care about. Once we understand how the token refresh works, we can use this method as soon as we enter the website.
So here again we have a string, in the refresh_token field, that we need to figure out how to obtain.
Just as before, we inspect the network tab to find it, and we quickly realize it's a sort of session ID stored in our cookies as soon as we enter the website, under the key cc-nx-g.
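As a framework-independent sketch of that cookie read, here's how the cc-nx-g value could be parsed out of a Set-Cookie header with the standard library (the header value below is a made-up example):

```python
from http.cookies import SimpleCookie

def extract_refresh_token(set_cookie_header):
    # Parse a Set-Cookie header and pull out the cc-nx-g value, if present.
    jar = SimpleCookie()
    jar.load(set_cookie_header)
    morsel = jar.get("cc-nx-g")
    return morsel.value if morsel else None

# e.g. extract_refresh_token("cc-nx-g=abc123; Path=/; Secure") → "abc123"
```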
So the logical steps our scraper needs to take are the following:
enter the home page and store all the cookies
read the cookies and store in a variable the string in the key cc-nx-g
use this string in the parameters for calling the refresh token API
read its response to store the Bearer token
use the Bearer token to finally call the product list API
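The five steps above can be sketched as follows. The HTTP session is passed in (for example requests.Session(), which stores cookies automatically); the access_token field name is an assumption based on the standard OAuth2 response shape, and client_id/channel_id are the parameters discussed earlier:

```python
HOME_URL = "https://www.loewe.com/"
TOKEN_URL = ("https://www.loewe.com/mobify/proxy/api/shopper/auth/v1/"
             "organizations/f_ecom_bbpc_prd/oauth2/token")

def scrape_products(session, product_search_url, client_id, channel_id):
    # Steps 1-2: load the home page so the session stores all cookies,
    # then read the refresh token from the cc-nx-g cookie.
    session.get(HOME_URL)
    refresh_token = session.cookies.get("cc-nx-g")
    # Step 3: call the refresh-token endpoint with the observed parameters.
    resp = session.post(TOKEN_URL, data={
        "client_id": client_id,
        "channel_id": channel_id,
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
    })
    # Step 4: read the Bearer token from the JSON response
    # ("access_token" is the standard OAuth2 field name, assumed here).
    token = resp.json()["access_token"]
    # Step 5: call the product list API with the token in the headers.
    return session.get(
        product_search_url,
        headers={"Authorization": f"Bearer {token}"},
    ).json()

# With requests:
# products = scrape_products(requests.Session(), product_search_url,
#                            client_id, channel_id)
```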
Let’s do the code step by step.
Reading cookies from the Scrapy spider