THE LAB #45: Bypassing Geo-fencing While Scraping
How to scrape websites that are banned in your country
Understanding Geo-fencing on Websites
Geo-fencing is a technology that uses the geographic location of a user to determine what content they are allowed to access on a website. This is achieved by analyzing the IP address of the user or the browser’s APIs, giving a rough estimate of their geographical location. Websites use geo-fencing for a variety of reasons, such as complying with legal restrictions, delivering localized content, or protecting against fraud.
For web scrapers, geo-fencing can be a significant barrier. It can prevent access to specific data or content that is only available to users in certain regions. For instance, a scraper designed to collect prices from an e-commerce site may find that prices vary by region, or a streaming service may offer different content libraries in different countries.
Is geo-fencing a regular practice in Europe?
The European Union (EU) has regulations that restrict certain uses of geo-fencing and geo-blocking, particularly in the context of digital markets and e-commerce, to ensure a single digital market allowing for the free movement of goods, services, and digital content.
The EU's Geo-blocking Regulation (EU) 2018/302, which came into effect on December 3, 2018, aims to counter unjustified geo-blocking and other forms of discrimination based on customers' nationality, place of residence, or place of establishment within the internal market. The regulation prohibits online sellers from applying differential treatment to customers from different EU countries unless it is justified by specific legal requirements. This means that online retailers and service providers must offer the same access to goods and services to all EU consumers, regardless of where they live.
The regulation does not cover all forms of digital content. For example, copyright-protected content such as online streaming services may still implement geo-blocking due to licensing agreements that are specific to certain countries or regions. Additionally, the regulation does not mandate that sellers deliver goods across all EU member states, but it does require that customers from any member state have the ability to purchase goods or services if the seller offers delivery within its own country.
But how do websites know from where we’re connecting?
As mentioned before, they mostly rely on information deduced from our IP address and, in some cases, from our browser.
What else can be deduced from our IP address?
Let’s use one of the most well-known services for gathering information from an IP, called IPinfo, to better understand what can be deduced from a single address.
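Such a lookup boils down to a plain HTTP request. Here’s a minimal Python sketch, assuming you have an IPinfo API token (the token placeholder below is hypothetical):

```python
import requests

IPINFO_TOKEN = "YOUR_TOKEN_HERE"  # placeholder: replace with a real IPinfo token

def lookup_ip(ip: str) -> dict:
    # IPinfo exposes a REST endpoint that returns JSON metadata for an address
    resp = requests.get(
        f"https://ipinfo.io/{ip}/json",
        params={"token": IPINFO_TOKEN},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# lookup_ip("3.120.31.122")  # not executed here; returns geolocation data
```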
We’re testing an address from the AWS eu-central-1 region; here’s the result of the API call.
{
"input": "3.120.31.122",
"data": {
"ip": "3.120.31.122",
"hostname": "ec2-3-120-31-122.eu-central-1.compute.amazonaws.com",
"city": "Frankfurt am Main",
"region": "Hesse",
"country": "DE",
"loc": "50.1155,8.6842",
"org": "AS16509 Amazon.com, Inc.",
"postal": "60306",
"timezone": "Europe/Berlin",
"asn": {
"asn": "AS16509",
"name": "Amazon.com, Inc.",
"domain": "amazon.com",
"route": "3.120.0.0/14",
"type": "hosting"
},
"company": {
"name": "A100 ROW GmbH",
"domain": "amazon.com",
"type": "hosting"
},
"privacy": {
"vpn": false,
"proxy": false,
"tor": false,
"relay": false,
"hosting": true,
"service": ""
},
"abuse": {
"address": "US, WA, Seattle, Amazon Web Services Elastic Compute Cloud, EC2, 410 Terry Avenue North, 98109-5210",
"country": "US",
"email": "abuse@amazonaws.com",
"name": "Amazon EC2 Abuse",
"network": "3.120.0.0/14",
"phone": "+1-206-555-0000"
}
}
}
We see an accurate set of geolocation attributes, information about the owner of the IP, the carrier, and details of the ASN that owns the address.
But how is a location mapped to an IP?
Well, there’s no rocket science here: there are databases where ISPs and other entities map IP ranges to locations, so this information is available to everyone.
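Conceptually, these databases are just large mappings from IP ranges to locations. Here’s a toy illustration in Python using the two ranges from the IPinfo responses in this article; real databases (MaxMind’s GeoLite2, IPinfo’s own downloads, DB-IP) hold millions of such entries:

```python
import ipaddress

# Toy GeoIP "database": CIDR ranges mapped to (country, city) tuples,
# using the two ranges seen in the IPinfo responses above
RANGES = [
    (ipaddress.ip_network("3.120.0.0/14"), ("DE", "Frankfurt am Main")),
    (ipaddress.ip_network("45.16.0.0/12"), ("US", "Oviedo")),
]

def locate(ip: str):
    # A real lookup uses an indexed structure, not a linear scan
    addr = ipaddress.ip_address(ip)
    for net, loc in RANGES:
        if addr in net:
            return loc
    return None

print(locate("3.120.31.122"))  # → ('DE', 'Frankfurt am Main')
```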
How to bypass geo-fencing?
To bypass geo-fencing while scraping, we’ll now look at several options, from the most expensive to the free ones.
Changing your IP with proxies in another location
Since most websites use information deduced from your IP address to guess your location, routing your traffic through a proxy located in the desired region is the fastest way to bypass the block. The syntax for adding a proxy depends on the framework or tool you’re using to build the scraper, but it’s usually pretty straightforward.
Let’s see how the API changes the result when using a Smartproxy residential proxy based in the USA:
{
"input": "45.18.12.144",
"data": {
"ip": "45.18.12.144",
"hostname": "45-18-12-144.lightspeed.dybhfl.sbcglobal.net",
"city": "Oviedo",
"region": "Florida",
"country": "US",
"loc": "28.6700,-81.2081",
"org": "AS7018 AT&T Services, Inc.",
"postal": "32765",
"timezone": "America/New_York",
"asn": {
"asn": "AS7018",
"name": "AT&T Services, Inc.",
"domain": "att.com",
"route": "45.16.0.0/12",
"type": "isp"
},
"company": {
"name": "AT&T Corp.",
"domain": "att.com",
"type": "isp"
},
"carrier": {
"name": "AT&T",
"mcc": "310",
"mnc": "410"
},
"privacy": {
"vpn": false,
"proxy": false,
"tor": false,
"relay": false,
"hosting": false,
"service": ""
},
"abuse": {
"address": "US, TX, Plano, 2701 W 15th ST, 75075",
"country": "US",
"email": "abuse@att.net",
"name": "abuse",
"network": "45.16.0.0/12",
"phone": "+1-919-319-8167"
}
}
}
This change is reflected in the behaviour of websites that use geo-fencing.
One example is italist.com, a fashion e-commerce website: if you browse it from Italy you get a login page,
while from the rest of the world you can buy products without any limitations. Here’s what we see when we use the Smartproxy US proxy.
So if we want to scrape product data from this website, we need to change our IP by using a proxy from another region.
But if the website doesn’t block data center proxies, we could try another solution that could save us some money.
Use Scrapoxy and create a project using data center IPs in a different region
We have seen how Scrapoxy works in a past article of The Lab. Basically, it’s an open-source super proxy aggregator: you can manage your different proxy providers from one unified dashboard, and you can also connect accounts from different cloud providers. By doing so, Scrapoxy spawns as many virtual machines as you want and uses their addresses as proxies: this way, you get datacenter proxies with unlimited bandwidth, which is particularly useful and cost-effective when you need to scrape large websites.
All you need to do is follow the documentation on the Scrapoxy website and connect the cloud providers you’re currently using.
After that, you can use the Scrapoxy endpoint to route all your requests through the proxies you just created: in my previous example I used this configuration for my Scrapy spiders.
DOWNLOADER_MIDDLEWARES = {
    "scrapoxy.ProxyDownloaderMiddleware": 100,
}
SPIDER_MIDDLEWARES = {
    "scrapoxy.ScaleSpiderMiddleware": 100,
}
SCRAPOXY_WAIT_FOR_SCALE = 120
SCRAPOXY_MASTER = "http://localhost:8888"
SCRAPOXY_API = "http://localhost:8890/api"
SCRAPOXY_USERNAME = "USER"
SCRAPOXY_PASSWORD = "PWD"
But before setting up Scrapoxy, are you really sure you need a proxy to bypass the geo-fencing?
Setting your location on a browser with Playwright
In some cases, as we saw when we tried to scrape Lowes.com, websites use the geolocation data exposed by the browser. This is particularly true when you need to select a pickup point or a store near you.
In the example of Lowe’s, we modified this information inside Playwright and, without the need for a proxy, placed ourselves near the store whose inventory data we wanted to scrape.
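A sketch of the relevant Playwright settings: the browser context is created with spoofed coordinates and the geolocation permission pre-granted, so the site’s JavaScript reads the location we choose. The coordinates below are hypothetical, reusing the Florida location from the proxy example:

```python
# Context options that spoof the location reported by the browser's
# Geolocation API (coordinates are hypothetical, from the Florida example)
GEO_CONTEXT_OPTIONS = {
    "geolocation": {"latitude": 28.6700, "longitude": -81.2081},
    "permissions": ["geolocation"],
    "locale": "en-US",
    "timezone_id": "America/New_York",
}

# Usage sketch (requires `pip install playwright` and `playwright install chromium`):
# from playwright.sync_api import sync_playwright
# with sync_playwright() as p:
#     browser = p.chromium.launch()
#     context = browser.new_context(**GEO_CONTEXT_OPTIONS)
#     page = context.new_page()
#     page.goto("https://www.lowes.com")
```

Keeping the locale and timezone consistent with the spoofed coordinates matters: a mismatch between them is a common signal used to flag spoofed sessions.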
You can get the full code of this example on the GitHub Repository reserved for paying readers, directory 28.LOEWES.
In case you’re a paying reader but don’t have access to the GitHub repository, write me at pier@thewebscraping.club since I need to add you manually.
Lowe’s was a peculiar case, but after the paywall we’ll see a real-world case where, by studying how the website works, we can circumvent its geo-fencing and scrape data from anywhere.
After that, we’ll see what happens when you use a proxy with a browser that isn’t correctly patched (spoiler: WebRTC leak).