The Lab #48: Scraping with AWS Lambda
Using Serverless and Selenium on Lambda for gathering data
Using cloud services for scraping is nothing new: running virtual machines or containers deployed in the cloud is quite common nowadays.
I had also heard of people using AWS Lambda functions for scraping and was intrigued by the idea, so I spent this week trying to understand how they can be used, along with the pros and the limits of this approach.
What is an AWS Lambda function?
AWS Lambda is a serverless computing service provided by Amazon Web Services (AWS) that allows developers to run code in response to events without the need to manage servers or runtime environments. This capability is particularly valuable for creating applications that need to respond quickly to new information or requests, without the cost of keeping a server running continuously.
Lambda functions are designed to execute code in response to specific triggers, which can originate from over 200 AWS services or direct HTTP requests via Amazon API Gateway. The triggers can include changes in data within an AWS S3 bucket, updates to a DynamoDB table, or custom events from application code or other AWS services. When triggered, a Lambda function runs the code to process the event, scaling automatically with the volume and frequency of invocations.
Developers using Lambda can write functions in several programming languages such as Python, Node.js, Java, and C#. The environment is fully managed, meaning AWS handles the underlying compute resources, including server and operating system maintenance, capacity provisioning, automatic scaling, code monitoring, and logging. All that developers need to manage is the code itself and the associated configurations for triggering events.
Lambda functions are stateless, with no affinity to the underlying infrastructure, so they can quickly start, stop, and scale. AWS charges for Lambda on a pay-per-use basis, measuring compute cost through the function's memory allocation and execution time, and this could be interesting for our web scraping purposes.
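To put that into perspective, here is a back-of-the-envelope cost estimate for a scraping job. The unit prices in the snippet are illustrative assumptions (they vary by region and change over time), so check the current AWS pricing page before relying on them.

# Back-of-the-envelope estimate of the Lambda bill for a scraping job.
# The unit prices below are illustrative assumptions, not official figures.
PRICE_PER_GB_SECOND = 0.0000166667    # assumed compute price, USD
PRICE_PER_REQUEST = 0.20 / 1_000_000  # assumed price per invocation, USD

def estimated_cost(invocations: int, memory_gb: float, avg_seconds: float) -> float:
    """Rough cost in USD, ignoring the free tier."""
    compute = invocations * memory_gb * avg_seconds * PRICE_PER_GB_SECOND
    requests = invocations * PRICE_PER_REQUEST
    return compute + requests

# e.g. 10,000 page loads with 2 GB of memory and ~10 s per run: about 3.3 USD
print(round(estimated_cost(10_000, 2.0, 10.0), 2))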
Deploying with Serverless
I had the idea for this post for some time, but since I'm not an expert with Lambda, I kept postponing it. Then I finally found this repository, which reduced the learning curve for successfully deploying a Lambda function with Selenium, so I decided to give it a try.
As far as I understand, we're using the Serverless Framework, a tool that helps us deploy container-based applications to AWS Lambda.
We’ll use a clone of the repository as a template and, after creating the image, we’ll upload it to Lambda.
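To give a rough idea of what such a deployment looks like, here is a sketch of a serverless.yml for a container-based Lambda. This is my simplified assumption, with made-up service, image, and function names, not the repository's actual file: the framework builds the Docker image, pushes it to ECR, and wires it to a Lambda function.

service: selenium-lambda-demo

provider:
  name: aws
  region: eu-central-1
  ecr:
    images:
      selenium-chrome:
        path: ./              # build the Dockerfile in this folder and push the image to ECR

functions:
  demo:
    image:
      name: selenium-chrome
    timeout: 120              # headless Chrome needs more than the 3-second default
    memorySize: 2048          # and more memory than the 128 MB default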
In the Dockerfile we set up the runtime environment by installing all the missing packages and dependencies, while the code of the function we're executing is contained in the main.py file.
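For context, an image for this kind of setup typically starts from the AWS Lambda Python base image, places Chrome and chromedriver under /opt (the paths referenced later in main.py), and registers main.handler as the entrypoint. The sketch below is my simplified assumption, not the repository's exact Dockerfile, and install-browser.sh is a hypothetical helper script.

FROM public.ecr.aws/lambda/python:3.11

# Install Chrome and a matching chromedriver under /opt, the paths main.py expects
COPY install-browser.sh /tmp/
RUN bash /tmp/install-browser.sh

# Install the Python dependencies, essentially Selenium
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the function code and declare the Lambda handler
COPY main.py ${LAMBDA_TASK_ROOT}/
CMD [ "main.handler" ]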
All the code of the tests can be found in The Lab GitHub repository, available for paying users, under folder 48.SCRAPING-WITH-LAMBDA.
If you already subscribed but don’t have access to the repository, please write me at pier@thewebscraping.club since I need to add you manually.
Just to check that I understood correctly, I made a small change to the main.py file: instead of returning the HTML of example.com, it now calls an API that returns my IP.
from selenium import webdriver
from tempfile import mkdtemp
from selenium.webdriver.common.by import By

def handler(event=None, context=None):
    # Chrome and chromedriver are baked into the container image under /opt
    options = webdriver.ChromeOptions()
    service = webdriver.ChromeService("/opt/chromedriver")
    options.binary_location = "/opt/chrome/chrome"
    # Flags needed to run headless Chrome inside the Lambda sandbox
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-gpu")
    options.add_argument("--window-size=1280x1696")
    options.add_argument("--single-process")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-dev-tools")
    options.add_argument("--no-zygote")
    # Point Chrome's writable directories to temporary folders under /tmp,
    # the only writable path on Lambda
    options.add_argument(f"--user-data-dir={mkdtemp()}")
    options.add_argument(f"--data-path={mkdtemp()}")
    options.add_argument(f"--disk-cache-dir={mkdtemp()}")
    options.add_argument("--remote-debugging-port=9222")

    chrome = webdriver.Chrome(options=options, service=service)
    # Instead of example.com, call an IP echo API and return its response
    chrome.get("https://api.ipify.org?format=json")
    return chrome.find_element(by=By.XPATH, value="//html").text
As you can imagine, the IP returned belongs to an AWS datacenter.
In fact, looking up the ASN of the returned IP, we can see it belongs to Amazon and comes from eu-central-1, the AWS region I've configured.
Executing the function a second time, I get another IP from the same region.
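For reference, with the Serverless Framework the deployed function can be invoked straight from the command line; the function name here matches the hypothetical serverless.yml sketch above.

serverless invoke --function demo
serverless invoke --function demo    # a second call comes back with a different IP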
With this little experiment, we can already understand some pros and cons of this solution.
We're using AWS data center IPs, which are easily detected and blocked by target websites, since Amazon publicly discloses the IP ranges of its subnets. So, unless we mask them with a proxy service, we won't be able to scrape some websites.
On the other hand, since the IP is changing for every request, we’ve basically built an IP rotation system, at a fraction of the cost of the already cheap datacenter proxies.
Now that we've understood the basic concepts, let's raise the difficulty: I want to pass a URL as a parameter of the function invocation and get the scraping result back.
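A natural way to do that, sketched below under my own assumptions rather than taken from the rest of the post, is to read the URL from the Lambda event payload (here under a hypothetical "url" key) and fall back to a default when it's missing.

from selenium import webdriver
from tempfile import mkdtemp

def handler(event=None, context=None):
    # The "url" key in the payload is my assumption; any key name works
    url = (event or {}).get("url", "https://example.com")

    # Same Chrome setup as in main.py above
    options = webdriver.ChromeOptions()
    options.binary_location = "/opt/chrome/chrome"
    for arg in (
        "--headless=new", "--no-sandbox", "--disable-gpu",
        "--window-size=1280x1696", "--single-process",
        "--disable-dev-shm-usage", "--disable-dev-tools", "--no-zygote",
        f"--user-data-dir={mkdtemp()}", f"--data-path={mkdtemp()}",
        f"--disk-cache-dir={mkdtemp()}", "--remote-debugging-port=9222",
    ):
        options.add_argument(arg)
    service = webdriver.ChromeService("/opt/chromedriver")

    chrome = webdriver.Chrome(options=options, service=service)
    try:
        chrome.get(url)
        return chrome.page_source    # the rendered HTML of the requested page
    finally:
        chrome.quit()

The function could then be invoked with a payload, for example serverless invoke --function demo --data '{"url": "https://example.com"}'.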