Web Scraping from 0 to hero: Our first scraper with Selenium
Let's write our first scraper with Selenium
Welcome back for a new episode of “Web Scraping from 0 to Hero”, our biweekly free web scraping course provided by The Web Scraping Club.
In the last post we introduced the browser automation tool called Selenium and today we’re writing our first scraper.
How the course works
The course is and always will be free. As always, I’m here to share, not to make you buy something. If you want to say “thank you”, consider subscribing to this Substack with a paid plan. It’s not mandatory but appreciated, and you’ll get access to the whole archive of “The LAB” articles, with 40+ practical pieces on more complex topics, plus its code repository.
We’ll focus on free-to-use packages and solutions; if some commercial ones show up, it’s because they are solutions I’ve already tested and they solve issues I couldn’t solve in any other way.
At first, I imagined this course as a monthly issue, but as I was writing down the table of contents, I realized it would take years to complete. So it will probably have a bi-weekly frequency, filling the gaps in the publishing plan without taking too much space away from more in-depth articles.
The collection of articles can be found using the tag WSF0TH, and there will be a dedicated section on the main Substack page.
A quick recap about Selenium WebDriver
As we mentioned in the previous article of this course, Selenium WebDriver is primarily a tool for automating web applications for testing purposes, but it's also incredibly effective for web scraping. It allows you to programmatically control a web browser, navigate pages, fill out forms, and extract data. This is particularly useful for scraping data from websites that use a lot of JavaScript to display their content.
Setting Up Your Environment
Before we see the code, you'll need to set up your environment. Here's what you need:
Python: Make sure you have Python installed on your computer. You can download it from python.org.
Selenium: Install Selenium by running pip install selenium in your terminal or command prompt.
WebDriver: Download the WebDriver for the browser you want to automate (e.g., ChromeDriver for Google Chrome, geckodriver for Firefox). Ensure it's placed in your PATH or specify its location in your code.
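If the driver executable is on your PATH, Selenium can start the browser with no extra arguments; otherwise you pass its location explicitly. Here’s a minimal sketch of both options (the path below is just a placeholder to adapt):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Option 1: ChromeDriver is already in your PATH
driver = webdriver.Chrome()

# Option 2: point Selenium to the executable explicitly
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))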
Your First Web Scraper
Now, let's write our first simple web scraper. We'll scrape quotes from http://quotes.toscrape.com, a website designed for practicing web scraping.
Import Selenium and Initialize WebDriver
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Specify the path to your WebDriver if it's not in your PATH
driver_path = '/path/to/your/webdriver'
driver = webdriver.Chrome(service=Service(driver_path))

# Open the website
driver.get('http://quotes.toscrape.com')
The code is quite self-explanatory: we’re importing the Selenium WebDriver package, pointing it to the driver executable, and opening a new browser window.
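As mentioned in the recap, the same driver can also fill out forms. Here’s a hedged sketch against the demo site’s login page: the field ids (username, password) are assumptions about that page’s markup, so inspect the page and adapt them if they differ.
from selenium.webdriver.common.by import By

driver.get('http://quotes.toscrape.com/login')

# The ids below are assumptions about the demo login form: check them in your browser's inspector
username_field = driver.find_element(By.ID, 'username')
password_field = driver.find_element(By.ID, 'password')
username_field.send_keys('any_user')
password_field.send_keys('any_password')
password_field.submit()  # submit() sends the form that contains the element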
Selecting Elements with XPATH
XPATH is a language for selecting nodes in XML documents, which can also be used with HTML. It's powerful for web scraping because it allows for a precise selection of elements.
Let's say we want to scrape all the quotes on the page. We can inspect the page and find that each quote is contained within a <span> element with the class text. Here's how we can select all these elements using XPATH:
from selenium.webdriver.common.by import By

# Find elements using XPATH
quotes_elements = driver.find_elements(By.XPATH, '//span[@class="text"]')

# Extract and print the quotes text
for quote in quotes_elements:
    print(quote.text)
Please notice that driver.find_elements (plural) returns a list of items we can iterate over in the following step. The function driver.find_element (singular) instead returns only the first matching element, and we’re using it for navigating the website.
# Find the next page button and click it
next_page_button = driver.find_element(By.XPATH, '//li[@class="next"]/a')
next_page_button.click()
# Now you can repeat the process of extracting data from the new page
Before going on with the article, I’d like to invite you to a webinar with Tamas, CEO of Kameleo.
We’ll talk about anti-detect browsers and their role in the web scraping industry.
Join us at this link: https://register.gotowebinar.com/register/8448545530897058397
A comparison between XPATH and CSS selectors syntax
Of course, you can also select elements on the page using CSS selectors, so let’s take a few minutes to see how the two syntaxes differ in Selenium.
Selecting Elements with Attributes
We've already seen how to select elements by class. Now, let's say you want to select an element with a specific id. Suppose there's a div with an id of "author-info". You can select this element like so:
author_info = driver.find_element(By.XPATH, '//div[@id="author-info"]')
print(author_info.text)
With a CSS selector, the syntax will instead be the following:
author_info = driver.find_element(By.CSS_SELECTOR, '#author-info')
print(author_info.text)
Selecting Elements Containing Specific Text
Sometimes, you might want to select elements that contain specific text. XPATH has a contains() function for this purpose. For example, if you want to find a link that contains the text "Next", you can do it like this:
next_link = driver.find_element(By.XPATH, '//a[contains(text(), "Next")]')
next_link.click()
CSS selectors don't directly support selecting elements based on their text content. However, you can often achieve similar results by targeting the specific elements that contain the text, if you know the structure of the HTML. While not exactly equivalent, you can use an attribute substring selector to match elements whose attribute value (here, the class) starts with a given string:
driver.find_element(By.CSS_SELECTOR, "a[class^='Nex']")
Using Logical Operators
XPATH allows you to use logical operators (and, or) to combine conditions. This can be handy when you need to select elements that meet multiple criteria. For instance, if you're looking for a button that has the class "submit" and the text "Click Me", you could write:
submit_button = driver.find_element(By.XPATH, '//button[@class="submit" and text()="Click Me"]')
submit_button.click()
CSS doesn't have logical operators in the same way XPATH does, but you can combine selectors. For a button with class "submit":
submit_button = driver.find_element(By.CSS_SELECTOR, 'button.submit')
submit_button.click()
CSS doesn't allow you to directly select based on text content or combine conditions with logical operators like and/or. You'd select based on attributes and tags, then filter further using Python if needed.
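To make that concrete, here’s a minimal sketch of the “filter with Python” approach: grab the candidates with a plain CSS selector, then keep only those whose text matches.
# Select by tag and class with CSS, then filter by text in Python
buttons = driver.find_elements(By.CSS_SELECTOR, 'button.submit')
click_me_buttons = [button for button in buttons if button.text == "Click Me"]
if click_me_buttons:
    click_me_buttons[0].click()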
Selecting Parent or Sibling Elements
Sometimes, the element you're interested in doesn't have a unique identifier, but its parent or sibling does. XPATH lets you navigate the DOM hierarchy to select elements based on their relationships. For example, to select a div that is a direct parent of a span with the class "text", you can use:
parent_div = driver.find_element(By.XPATH, '//span[@class="text"]/parent::div')
Or, if you want to select the next sibling of an element, you can use the following-sibling axis:
next_sibling = driver.find_element(By.XPATH, '//div[@id="info"]/following-sibling::div')
CSS selectors can target sibling elements but not in the backward direction (previous siblings) or select a parent directly. For the next sibling:
next_sibling = driver.find_element(By.CSS_SELECTOR, '#info + div')
This selects the div that directly follows an element with the id "info".
Using Wildcards
Wildcards can be useful when you want to select elements without specifying the tag name. For instance, if you want to select all elements that have a certain attribute, regardless of their tag, you can use the * wildcard:
all_elements_with_id = driver.find_elements(By.XPATH, '//*[@id]')
for element in all_elements_with_id:
    print(element.tag_name)
In CSS, the * wildcard is used differently than in XPATH: it selects all elements. To select elements based on the presence of an attribute, you might do:
all_elements_with_id = driver.find_elements(By.CSS_SELECTOR, '[id]')
for element in all_elements_with_id:
    print(element.tag_name)
Combining Selectors
You can also combine different XPATH expressions to refine your selection. For example, if you want to select all span elements with the class "text" that are inside a div with a specific id, you can do:
specific_spans = driver.find_elements(By.XPATH, '//div[@id="container"]/span[@class="text"]')
for span in specific_spans:
    print(span.text)
You can combine CSS selectors to refine your selection. For all span elements with the class "text" inside a div with a specific id:
specific_spans = driver.find_elements(By.CSS_SELECTOR, '#container span.text')
for span in specific_spans:
    print(span.text)
Handling Exceptions
When your scraper interacts with the web, many things can go wrong. Elements might not load in time, elements might not be present at all, or the structure of the webpage might change. These scenarios can cause your script to stop abruptly, potentially losing progress. Handling exceptions allows your script to gracefully manage these issues, whether by retrying actions, skipping over problematic parts, or safely shutting down.
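For the “safely shutting down” part, a common pattern is wrapping the scraping logic in a try/finally block so the browser is closed even if something goes wrong. A minimal sketch:
from selenium import webdriver

driver = webdriver.Chrome()  # assumes the driver executable is in your PATH
try:
    driver.get('http://quotes.toscrape.com')
    # ... scraping logic goes here ...
finally:
    driver.quit()  # the browser is closed even if an exception was raised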
Common Exceptions in Selenium
NoSuchElementException: This occurs when Selenium can't find an element on the page that your script is trying to interact with. It's common when a page's structure has changed or if there was a typo in your element selector.
TimeoutException: This happens when an operation takes longer than the allotted time. For example, you might be waiting for a page to load or an element to become visible, and it just doesn't happen within your specified timeout period.
Handling Exceptions with Try-Except Blocks
Python's try-except blocks allow you to catch exceptions and execute alternative code when they occur, rather than stopping the script. Here's how you can apply this to web scraping:
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
try:
    # Wait up to 10 seconds for the quote elements to be available
    quotes_elements = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.XPATH, '//span[@class="text"]'))
    )
    for quote in quotes_elements:
        print(quote.text)

    # Attempt to click the next page button
    next_page_button = driver.find_element(By.XPATH, '//li[@class="next"]/a')
    next_page_button.click()
except NoSuchElementException:
    print("Element not found")
except TimeoutException:
    print("Loading took too much time")
Best Practices for Exception Handling
Be Specific: Catch specific exceptions rather than using a broad except clause. This approach prevents masking other issues and makes your code easier to debug.
Log or Print the Error: When catching exceptions, it's helpful to log or print them out. This information can be invaluable for debugging and understanding where your scraper might be encountering issues (see the short sketch after this list).
Decide on a Recovery Strategy: When you catch an exception, decide what the script should do next. Should it retry the operation, skip the current task, or exit gracefully? Implementing a thoughtful recovery strategy can make your scraper more resilient.
Use Timeouts Wisely: When using WebDriverWait, consider how long you're willing to wait for elements. Setting reasonable timeouts can help avoid unnecessary delays in your script.
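To illustrate the first two points, here’s a minimal sketch that catches only the specific exception we expect and logs it with Python’s logging module instead of letting the script die silently (the XPATH below is just a placeholder):
import logging
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

logging.basicConfig(level=logging.INFO)

try:
    element = driver.find_element(By.XPATH, '//div[@id="some-id"]')
except NoSuchElementException as error:
    # Catch only what we expect and keep a trace of it for debugging
    logging.warning("Element not found: %s", error)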
Final code
Now that we’ve seen how to initialize Selenium, write XPath selectors, and manage exceptions, here’s the final version of our scraper.
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium import webdriver

# Specify the path to your WebDriver if it's not in your PATH
driver = webdriver.Chrome(service=Service('YOUR_PATH_HERE'))

# Open the website
driver.get('http://quotes.toscrape.com')

paging = 1
retry_count = 0  # To handle retries in case of TimeoutException

while paging == 1:
    try:
        # Wait for the quote elements to be available (up to 10 seconds)
        quotes_elements = WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located((By.XPATH, '//span[@class="text"]'))
        )
        for quote in quotes_elements:
            print(quote.text)

        # Attempt to click the next page button
        next_page_button = driver.find_element(By.XPATH, '//li[@class="next"]/a')
        next_page_button.click()
        retry_count = 0  # Reset retry count after a successful iteration
    except NoSuchElementException:
        # No "next" button found: we reached the last page
        paging = 0
        print("Oops! Couldn't find an element. Moving on...")
    except TimeoutException:
        print("Oops! The operation timed out. Let's try something else...")
        retry_count += 1
        if retry_count > 2:  # Give up after 3 tries
            print("Too many timeouts, giving up.")
            break

driver.quit()  # Close the browser