The Lab #47: Scraping real-time data with Python
Using WebSocket to scrape data from Bitstamp and Sofascore
In past articles, we’ve looked at scraping techniques for websites whose data may be updated frequently, but not every second.
But how can we scrape websites where data is updated at a very high frequency, like the trade view of Bitstamp or live sports bets?
Well, first we should understand how these websites work, which in most cases means understanding what a WebSocket is and how it works.
What a WebSocket is and how it works
A WebSocket is a communication protocol that provides a full-duplex communication channel over a single, long-lasting connection between a client and a server on the web. It is designed to be implemented in web browsers and web servers but can be used by any client or server application. The WebSocket protocol facilitates real-time data transfer and interaction, making it an essential technology for modern web applications that require live content updates without the need to reload the web page, such as chat applications, live sports updates, and interactive games.
The operation of WebSockets is initiated through a handshake mechanism, which is performed over the HTTP protocol. This handshake starts when the client sends a WebSocket handshake request to the server, expressing its desire to establish a WebSocket connection. The request includes a specific upgrade header that signals the server to switch from the HTTP protocol to the WebSocket protocol. If the server supports WebSockets and accepts the connection request, it responds with a handshake response, confirming the protocol switch. Once this handshake is successfully completed, the initial HTTP connection is upgraded to a WebSocket connection, allowing for full-duplex communication.
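To make the handshake described above more concrete, here is a minimal sketch using only the Python standard library. It builds the text of a client upgrade request and computes the value a server is expected to return in its `Sec-WebSocket-Accept` header, following RFC 6455. The helper names, the host `example.com`, and the path `/stream` are hypothetical and only serve the illustration; the sample key is the one used in the RFC itself.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455; the server appends it to the client's key.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def build_handshake_request(host: str, path: str, key: str) -> str:
    """Return the raw HTTP upgrade request a client would send (hypothetical helper)."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",       # asks the server to switch protocols
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
    ]
    return "\r\n".join(lines) + "\r\n\r\n"


def expected_accept(key: str) -> str:
    """Compute the Sec-WebSocket-Accept value the server should answer with."""
    digest = hashlib.sha1((key + WEBSOCKET_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    sample_key = "dGhlIHNhbXBsZSBub25jZQ=="  # sample key taken from RFC 6455
    print(build_handshake_request("example.com", "/stream", sample_key))
    print("Expected accept header:", expected_accept(sample_key))
```

If the server replies with status `101 Switching Protocols` and a matching accept value, the connection is upgraded. In practice a WebSocket client library handles this exchange for us.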
Unlike the traditional HTTP request-response model, where the client has to send a new request every time it wants fresh data and the server cannot push anything on its own, the WebSocket protocol maintains an open connection, enabling both the client and the server to send data independently and at any time. This persistent connection is maintained until explicitly closed by either the client or the server. The ability to send data in both directions simultaneously without the overhead of multiple HTTP requests significantly reduces latency and increases the efficiency of data transfer, making WebSockets particularly suitable for real-time applications.
Once the handshake is done, WebSocket traffic no longer follows HTTP semantics: it is a lightweight framing layer running directly on top of the same TCP connection, so messages don't carry HTTP headers and aren't tied to a request-response cycle. Furthermore, WebSockets support message-based data transfer, enabling the transmission of discrete messages, which can be text or binary.
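To illustrate what "message-based" means on the wire, here is a small sketch of how a single, unfragmented WebSocket frame is laid out according to RFC 6455. The function name `encode_frame` is hypothetical; real clients rely on a library for this. Note that frames sent by a client must be masked with a 4-byte key, while frames sent by a server are not.

```python
import struct
from typing import Optional

OPCODE_TEXT = 0x1    # payload is UTF-8 text
OPCODE_BINARY = 0x2  # payload is arbitrary bytes


def encode_frame(payload: bytes, opcode: int, mask_key: Optional[bytes] = None) -> bytes:
    """Encode one complete (FIN=1) WebSocket frame. Hypothetical helper for illustration."""
    first_byte = 0x80 | opcode  # FIN bit set, no extensions
    mask_bit = 0x80 if mask_key is not None else 0x00
    length = len(payload)

    # The payload length is stored in 7 bits, or in 16/64 extra bits for larger payloads.
    if length < 126:
        header = bytes([first_byte, mask_bit | length])
    elif length < 65536:
        header = bytes([first_byte, mask_bit | 126]) + struct.pack("!H", length)
    else:
        header = bytes([first_byte, mask_bit | 127]) + struct.pack("!Q", length)

    if mask_key is None:
        return header + payload  # server-to-client style frame

    if len(mask_key) != 4:
        raise ValueError("mask_key must be exactly 4 bytes")
    # Client-to-server frames XOR every payload byte with the rotating mask key.
    masked = bytes(b ^ mask_key[i % 4] for i, b in enumerate(payload))
    return header + mask_key + masked


if __name__ == "__main__":
    print(encode_frame(b"Hello", OPCODE_TEXT).hex())
    print(encode_frame(b"Hello", OPCODE_TEXT, bytes.fromhex("37fa213d")).hex())
```

A short text message costs only two bytes of header (six when masked), which is part of why pushing many small updates per second is cheap compared to repeated HTTP requests.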
WebSocket in action: Bitstamp
Now that we understand how WebSockets work, let’s see them in action and try to use them to get a continuous stream of data.
All the code of the tests can be found in The Lab GitHub repository, available for paying users, under folder 47.REAL-TIME-SCRAPING.
If you’ve already subscribed but don’t have access to the repository, please write to me at pier@thewebscraping.club since I need to add you manually.
Step one: let’s find a WebSocket
In this step, we’re looking for a website that uses WebSockets and the first one that came to my mind as a potential target is Bitstamp, with its tradeview section.
We can see very dynamic content, with the current price of Bitcoin (or other cryptos), the order book, and the trades, all of which get updated several times per second.
In fact, when loading the page, we can see a WebSocket connection in the Network tab of the browser’s Developer Tools.
It’s easier to find it if you click on the WS filter, as shown in the above image.
Step two: what’s going on?
Just like with any other request, we can click on the WebSocket entry to see how the connection between the client and the server is established and which messages are exchanged between the two.
In this case, we can see that the first five messages that the client (our browser) sends to the server are five subscriptions to different channels, which correspond to the dynamic content of the page.
We have channels for the live trades between BTC and USD, for the order book of the pair, and for the movements of the BTC ticker.
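As a rough sketch of what such subscription messages can look like, the snippet below builds them as JSON strings. The endpoint URL, the `bts:subscribe` event name, and the channel names are assumptions based on Bitstamp's public WebSocket API documentation, not values copied from the browser session described here, so check the messages in your own Network tab before relying on them. The function name is hypothetical.

```python
import json

# Assumption: public endpoint from Bitstamp's WebSocket API docs (not used in this snippet).
BITSTAMP_WS_URL = "wss://ws.bitstamp.net"

# Assumption: illustrative channel names; the page may subscribe to different ones.
EXAMPLE_CHANNELS = ["live_trades_btcusd", "order_book_btcusd"]


def subscription_message(channel: str) -> str:
    """Build the JSON text a client would send to subscribe to one channel."""
    return json.dumps({
        "event": "bts:subscribe",      # assumed event name for subscriptions
        "data": {"channel": channel},
    })


if __name__ == "__main__":
    for name in EXAMPLE_CHANNELS:
        # Each of these strings would be sent as one text message over the socket.
        print(subscription_message(name))
```

Each string would be sent as a separate text message right after the connection is opened, one per channel.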
Once we subscribe to these channels, we get flooded by incoming messages for every trade and price change that happens.
Let’s see how to collect them in the next step. I’m not an expert in real-time scraping, so the examples you’ll see in the next chapter will be very basic, but if you have more experience than me, please reach out since I’d like to know more about this fascinating world.
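To get a feeling for that flood, a first step can be simply counting how many messages arrive per channel. The sketch below does that on a list of raw JSON strings. It assumes each incoming message is a JSON object with a `channel` field, which is how Bitstamp's public documentation describes them; the sample messages are made up for illustration and are not real market data.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_channel(raw_messages: Iterable[str]) -> Counter:
    """Count incoming messages per channel, skipping anything that is not a JSON object."""
    counts: Counter = Counter()
    for raw in raw_messages:
        try:
            message = json.loads(raw)
        except json.JSONDecodeError:
            continue  # ignore malformed payloads instead of crashing the stream
        if not isinstance(message, dict):
            continue
        # Assumption: the channel name sits in a top-level "channel" field.
        counts[message.get("channel", "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample messages, shaped like the assumed format.
    sample = [
        '{"event": "trade", "channel": "live_trades_btcusd", "data": {"price": 1.0}}',
        '{"event": "data", "channel": "order_book_btcusd", "data": {}}',
        '{"event": "trade", "channel": "live_trades_btcusd", "data": {"price": 2.0}}',
        "not json at all",
    ]
    print(count_by_channel(sample))
```

The same function could be fed from a live connection by passing it the messages as they are received.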