THE LAB #6: Changing Ciphers in Scrapy to avoid bans by TLS Fingerprinting
In other words: fake it until you scrape it
Here’s another post of “THE LAB”: in this series, we'll cover real-world use cases, with code and an explanation of the methodology used.
As you surely know, the most advanced anti-bot solutions act on different levels:
at a behavioral level, they check how the scraper acts and try to distinguish a bot from a human.
at a browser level, they try to distinguish a genuine browser from an automated one, looking for inconsistencies in the setup.
at an HTTP level, they try to identify the device configuration to detect suspicious setups.
On our Discord server the focus was on this last case, so today we'll try to explain how this can be achieved via TLS fingerprinting and what we can do as a countermeasure in our scrapers.
Understanding TLS Fingerprinting
TLS fingerprinting is a passive (server-side) fingerprinting technique that servers use to identify the configuration of the clients connecting to them.
The fingerprints are created from the cipher suites the client offers when the connection between client and server is established.
To better understand how this technique works, let's borrow the image from this Cloudflare blog post.
When a client connects to a server, the first interaction happens at the TCP level. It's called the three-way handshake, in which the client and server confirm their willingness and availability to connect.
The client sends a SYN packet to ask the server whether it is available for a new connection.
If the server is available, it replies with a SYN/ACK packet to the client.
The client then replies with an ACK packet and the connection is established. From now on, the two can exchange data.
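To see the three steps above in practice, here is a minimal sketch using Python's standard `socket` module on the loopback interface. The handshake itself is performed by the operating system, so the code only shows where it happens; the local server and the port chosen by the OS are assumptions made for the example.

```python
import socket

# A tiny local server: the OS picks a free port (port 0) for us.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
host, port = server.getsockname()

# create_connection() triggers the handshake: the kernel sends the SYN,
# waits for the SYN/ACK and answers with the final ACK before returning.
client = socket.create_connection((host, port), timeout=5)

# accept() returns the connection that the handshake already established.
conn, peer = server.accept()

# From now on, the two sides can exchange data.
client.sendall(b"hello")
received = conn.recv(5)
print(peer, received)

for s in (client, conn, server):
    s.close()
```

Capturing this exchange with a tool such as Wireshark or tcpdump would show the SYN, SYN/ACK and ACK packets before the first data byte.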
Without going into too much detail about the full TLS protocol, we'll focus now on what happens after the TCP connection is established.
The Client Hello message, the first one sent by the client after the TCP handshake, is where the data used for fingerprinting is sent. The message includes the TLS versions the client supports, the cipher suites it supports, and a string of random bytes known as the "client random."
The key point is that the list of cipher suites differs from client to client: a Chrome connection offers a different list than a Safari one or a Scrapy one, even when all of them are sent from the same machine.
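As a rough illustration of where such a list comes from, the sketch below prints the cipher suites enabled in a default Python `ssl` context. This is the list of the local Python/OpenSSL build, not of Scrapy or of any browser, and what actually goes on the wire in the Client Hello may differ slightly, so take it only as an approximation.

```python
import ssl

# Default client-side context of the local Python/OpenSSL build.
context = ssl.create_default_context()

# get_ciphers() returns one dict per enabled cipher suite, in priority order.
offered = context.get_ciphers()

for cipher in offered:
    # The low 16 bits of "id" are the two-byte code used in the lists below.
    code = cipher["id"] & 0xFFFF
    print(f"[{code:04X}] {cipher['name']} ({cipher['protocol']})")
```

Note that OpenSSL uses its own naming for TLS 1.2 suites (for example `ECDHE-RSA-AES128-GCM-SHA256`), so the names will not match the IANA-style names in the lists one to one, while the hexadecimal codes will.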
Here are the ciphers of a connection made to google.com with Chrome from a Mac laptop.
[8A8A] Unrecognized cipher - See https://www.iana.org/assignments/tls-parameters/
[1301] TLS_AES_128_GCM_SHA256
[1302] TLS_AES_256_GCM_SHA384
[1303] TLS_CHACHA20_POLY1305_SHA256
[C02B] TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
[C02F] TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
[C02C] TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
[C030] TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
[CCA9] TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
[CCA8] TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
[C013] TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
[C014] TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
[009C] TLS_RSA_WITH_AES_128_GCM_SHA256
[009D] TLS_RSA_WITH_AES_256_GCM_SHA384
[002F] TLS_RSA_WITH_AES_128_CBC_SHA
[0035] TLS_RSA_WITH_AES_256_CBC_SHA
Safari:
[2A2A] Unrecognized cipher - See https://www.iana.org/assignments/tls-parameters/
[1301] TLS_AES_128_GCM_SHA256
[1302] TLS_AES_256_GCM_SHA384
[1303] TLS_CHACHA20_POLY1305_SHA256
[C02C] TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
[C02B] TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
[CCA9] TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
[C030] TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
[C02F] TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
[CCA8] TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
[C00A] TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA
[C009] TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA
[C014] TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
[C013] TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
[009D] TLS_RSA_WITH_AES_256_GCM_SHA384
[009C] TLS_RSA_WITH_AES_128_GCM_SHA256
[0035] TLS_RSA_WITH_AES_256_CBC_SHA
[002F] TLS_RSA_WITH_AES_128_CBC_SHA
[C008] TLS_ECDHE_ECDSA_WITH_3DES_EDE_CBC_SHA
[C012] TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA
[000A] SSL_RSA_WITH_3DES_EDE_SHA
Scrapy:
[1302] TLS_AES_256_GCM_SHA384
[1303] TLS_CHACHA20_POLY1305_SHA256
[1301] TLS_AES_128_GCM_SHA256
[C02C] TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
[C030] TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
[009F] TLS_DHE_RSA_WITH_AES_256_GCM_SHA384
[CCA9] TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
[CCA8] TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
[CCAA] TLS_DHE_RSA_WITH_CHACHA20_POLY1305_SHA256
[C02B] TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
[C02F] TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
[009E] TLS_DHE_RSA_WITH_AES_128_GCM_SHA256
[C024] TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384
[C028] TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
[006B] TLS_DHE_RSA_WITH_AES_256_CBC_SHA256
[C023] TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256
[C027] TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
[0067] TLS_DHE_RSA_WITH_AES_128_CBC_SHA256
[C00A] TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA
[C014] TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
[0039] TLS_DHE_RSA_WITH_AES_256_CBC_SHA
[C009] TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA
[C013] TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA
[0033] TLS_DHE_RSA_WITH_AES_128_CBC_SHA
[009D] TLS_RSA_WITH_AES_256_GCM_SHA384
[009C] TLS_RSA_WITH_AES_128_GCM_SHA256
[003D] TLS_RSA_WITH_AES_256_CBC_SHA256
[003C] TLS_RSA_WITH_AES_128_CBC_SHA256
[0035] TLS_RSA_WITH_AES_256_CBC_SHA
[002F] TLS_RSA_WITH_AES_128_CBC_SHA
[00FF] TLS_EMPTY_RENEGOTIATION_INFO_SCSV
They all differ in the order and number of ciphers. This means that the server, using these ciphers and some other parameters sent in the same message, gets an idea of my client's setup as soon as I try to connect, and can use this data to create fingerprints and block suspicious ones.
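To make the difference concrete, here is a small sketch that takes the three lists above (only their hexadecimal codes) and derives a naive signature from each. The signature function is a simplification invented for this example, not a real fingerprinting algorithm. The two "Unrecognized cipher" entries (8A8A and 2A2A) are GREASE placeholder values defined in RFC 8701, which browsers randomize on purpose, so the sketch ignores them.

```python
import hashlib

# Cipher codes copied from the three lists above, in the same order.
CHROME = ("8A8A 1301 1302 1303 C02B C02F C02C C030 CCA9 CCA8 "
          "C013 C014 009C 009D 002F 0035").split()
SAFARI = ("2A2A 1301 1302 1303 C02C C02B CCA9 C030 C02F CCA8 C00A C009 "
          "C014 C013 009D 009C 0035 002F C008 C012 000A").split()
SCRAPY = ("1302 1303 1301 C02C C030 009F CCA9 CCA8 CCAA C02B C02F 009E "
          "C024 C028 006B C023 C027 0067 C00A C014 0039 C009 C013 0033 "
          "009D 009C 003D 003C 0035 002F 00FF").split()


def is_grease(code):
    # GREASE values look like 0A0A, 1A1A, ..., FAFA.
    return code[:2] == code[2:] and code[1] == "A"


def strip_grease(codes):
    return [c for c in codes if not is_grease(c)]


def signature(codes):
    # Naive, order-sensitive signature: hash of the GREASE-free list.
    joined = "-".join(strip_grease(codes))
    return hashlib.sha256(joined.encode("ascii")).hexdigest()[:16]


clients = {"Chrome": CHROME, "Safari": SAFARI, "Scrapy": SCRAPY}
for name, codes in clients.items():
    print(name, len(strip_grease(codes)), signature(codes))

# Ciphers that only Scrapy offers are an easy tell for a server.
browser_codes = set(CHROME) | set(SAFARI)
scrapy_only = [c for c in SCRAPY if c not in browser_codes]
print("Only offered by Scrapy:", scrapy_only)
```

Even this crude approach yields three distinct signatures and a list of ciphers that neither browser sends, which is all a server needs to flag the Scrapy connection as something other than Chrome or Safari.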
This great LWT Hiker blog post, which is the source of the lists above, digs deeper into the details and also shows two of the best-known fingerprinting algorithms in use today, JA3 and TS1.
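As a reference, here is a minimal sketch of how a JA3 fingerprint is built: five fields from the Client Hello (TLS version, cipher suites, extensions, elliptic curves, point formats) are written as decimal numbers, joined with dashes inside each field and commas between fields, and the resulting string is hashed with MD5. GREASE values are skipped. The cipher list is the Chrome one from above, while the extension, curve and point format values are illustrative placeholders, not captured from a real connection.

```python
import hashlib


def is_grease(value):
    # GREASE values (RFC 8701) have the form 0x0A0A, 0x1A1A, ..., 0xFAFA.
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(version, ciphers, extensions, curves, point_formats):
    fields = [ciphers, extensions, curves, point_formats]
    parts = [str(version)]
    for field in fields:
        parts.append("-".join(str(v) for v in field if not is_grease(v)))
    return ",".join(parts)


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()


# Chrome cipher codes from the list above.
chrome_ciphers = [0x8A8A, 0x1301, 0x1302, 0x1303, 0xC02B, 0xC02F, 0xC02C,
                  0xC030, 0xCCA9, 0xCCA8, 0xC013, 0xC014, 0x009C, 0x009D,
                  0x002F, 0x0035]

# Placeholder values, assumed only for the sake of the example.
example_extensions = [0, 10, 11, 13, 16, 43, 45, 51]
example_curves = [29, 23, 24]
example_point_formats = [0]

# 771 is 0x0303, the version number used by TLS 1.2.
print(ja3_string(771, chrome_ciphers, example_extensions,
                 example_curves, example_point_formats))
print(ja3_hash(771, chrome_ciphers, example_extensions,
               example_curves, example_point_formats))
```

Since the order of the values is part of the string, two clients offering the same ciphers in a different order end up with different JA3 hashes.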
Countermeasures