Here’s another post of “THE LAB”: in this series, we cover real-world use cases, with code and an explanation of the methodology used.
Being a paying user gives you:
Access to Paid Content, like the post series called “The LAB”, where we deep dive into real-world cases with code (view here as an example).
Access to the GitHub repository with the code seen on “The LAB”
Access to private channels on our Discord server
But in case you want to read this newsletter for free, you will always get a post per week about:
News about web scraping
Anti-bot software and techniques insights
Interviews with key people in the industry
And you can always join the Web Scraping Club Discord server
Enough housekeeping for now, let’s start.
Want to travel?
The travel industry was one of the first to be impacted by digitalization. Booking.com, one of the largest websites for booking hotels around the globe, started its operations in 1997. Edreams.com, an air travel fares aggregator, went online in 2000. Airbnb is a fifteen-year-old online marketplace.
All these websites have in common a high traffic volume and a huge database of data points shown to their visitors. This means that every request made by users should be answered as efficiently as possible, to save bandwidth and time. And it’s no coincidence that these three websites share one thing: they all use GraphQL to serve data to the front end.
What is GraphQL
We can think of GraphQL as an API query language, with its own syntax and grammar. But it is also the runtime engine for interpreting this language and responding to these requests.
In other words, it’s a “query language” that provides a consistent query layer for APIs, with a single endpoint for developers to use when making requests.
This allows you to not only query the data but also control the structure of how each GraphQL API responds.
GraphQL was developed by Facebook in 2012 and later open-sourced in 2015.
Some more details
But how does GraphQL help websites expose data more efficiently?
Modern websites have dozens, if not hundreds, of APIs, each exposing a single object with all its attributes. With a single call made via GraphQL, you can gather data from all the APIs you need, including only the requested fields in the output.
In this example from the TestProject blog, which simulates the functioning of a blog, we have 3 different APIs on our website.
The first one lists the authors, with all their details: name, address, and birthday.
The second one is the list of the posts per author and their details: title, content, and comment list.
The last one lists the followers per author and their attributes: again name, address, and birthday.
Instead of calling the three APIs separately, and also receiving unwanted fields, the user makes a single request to a single endpoint, specifying in the payload the fields they need, and the GraphQL engine provides them.
This is possible because each object and the relationships between them are defined in its schema definition language.
For those of you who have worked with relational databases, this is quite similar to designing a database entity-relationship diagram. Mapping the entities in the GraphQL schema allows the engine to understand where all the information lives, so that when it receives a request from a user, it knows which API to call to extract the fields needed to fulfill it.
The response is a JSON document containing the result of the query, with only the selected fields.
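The blog example above can be sketched as a single GraphQL request sent over plain HTTP. The query syntax below is standard GraphQL, but the endpoint URL and field names are illustrative assumptions, not TestProject’s actual API:

```python
import json

# One GraphQL query that fetches authors together with only the fields we
# care about from their posts and comments -- no separate API calls, and no
# unwanted attributes (e.g. the authors' addresses are simply not requested).
# Field names are illustrative; a real schema defines its own.
query = """
{
  authors {
    name
    posts {
      title
      comments {
        text
      }
    }
  }
}
"""

# The request is an HTTP POST with the query wrapped in a JSON body.
payload = json.dumps({"query": query})

# With the `requests` library, it would be sent like this (endpoint assumed):
# response = requests.post(
#     "https://example.com/graphql",
#     data=payload,
#     headers={"Content-Type": "application/json"},
# )
# print(response.json())
```

The server answers with a JSON document mirroring the shape of the query: an `authors` list, each entry containing only `name` and the nested `posts`.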
Web Scraping Implications
Because of these features, using GraphQL, when it’s publicly available, is the preferred way to scrape a website. We don’t overload the target website with requests for HTML pages; instead, we get exactly the data we need, in a JSON format maintained by the website itself for its own internal functioning.
Let's see how Airbnb implemented it, simulating research for a place where to stay in Manhattan from 2022-11-10 to 2022-11-17, for two adults.
The payload sent to the GraphQL engine will look like the following.
You have surely noticed that the rawParams list contains the filters we set in the search bar of the website, while the result contains all the data shown on the page and much more.
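As a rough sketch of how those search filters map into the request body: the exact keys, operation name, and endpoint are Airbnb internals that change over time, so everything below except the `rawParams` name mentioned above is an assumption for illustration.

```python
import json

# Simplified sketch of a GraphQL search payload: the filters typed into the
# search bar end up as entries of a rawParams list. Keys and values here are
# assumptions for illustration, not Airbnb's exact current schema.
payload = {
    "operationName": "StaysSearch",  # assumed operation name
    "variables": {
        "staysSearchRequest": {
            "rawParams": [
                {"filterName": "query", "filterValues": ["Manhattan, New York"]},
                {"filterName": "checkin", "filterValues": ["2022-11-10"]},
                {"filterName": "checkout", "filterValues": ["2022-11-17"]},
                {"filterName": "adults", "filterValues": ["2"]},
            ],
        }
    },
}

print(json.dumps(payload, indent=2))
```

The pattern to notice is that each UI filter becomes one name/values pair, so adding a filter in the browser simply appends another entry to the list.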
It seems we’re ready to implement our scraper for Airbnb.
It’s Scrapy time
The first step is to create a Scrapy project and then define the data model for the output.