Web Scraping from 0 to hero: XPATH and CSS Selectors in Web Scraping

What are selectors and the difference between XPATH and CSS with 10 examples.

Apr 28, 2024

During this course, “Web Scraping from 0 to Hero”, we have seen how to create our first scrapers with Scrapy and Playwright. One of the key elements of any scraper, whatever the tool used, is the set of selectors that parse the HTML.

Selectors are expressions that “select” the interesting parts of the HTML, according to the specs we have for the output.

The two most common ways to build selectors are the XPATH language and CSS selectors.

We’ve seen the differences between the two options in a previous article of The Web Scraping Club, “XPath vs CSS selectors: a comparison”.

Today, in this episode of the course, we look at the practical differences between the two approaches, with ten examples.

Practical Examples: Using XPATH and CSS Selectors in Scrapy Spiders

Below are examples of how both XPATH and CSS selectors can be used in Scrapy to extract data from the same HTML snippets. Each example includes a piece of HTML, followed by how you would write the CSS selector and the XPATH to target specific elements within that HTML.

Example 1: Selecting All Paragraphs

This example demonstrates how to select all paragraph elements that are direct children of a specified parent element (a div). It is useful for extracting all textual content that falls under a specific section of a webpage.

HTML:

<div>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>

CSS Selector:

response.css('div > p').getall()

XPATH:

response.xpath('//div/p').getall()

Example 2: Selecting Elements with a Specific Class

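This example focuses on selecting elements based on their class attribute. It is useful when the page marks the content you need with a dedicated class, which is common on complex pages. Note that the CSS form matches 'highlight' as one of possibly several classes, while the XPATH form compares the whole attribute value, so the two agree only when 'highlight' is the element's sole class, as it is here.

HTML: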

<div>
  <p class="highlight">Highlighted paragraph.</p>
  <p>Regular paragraph.</p>
</div>

CSS Selector:

response.css('p.highlight').getall()

XPATH:

response.xpath('//p[@class="highlight"]').getall()

Example 3: Selecting the First Element of a Specific Type

This example showcases how to select the first element of a specific type within its parent. It's particularly useful for scenarios where only the initial item in a list or section is needed for analysis or data extraction.

HTML:

<ul>
  <li>First item</li>
  <li>Second item</li>
</ul>

CSS Selector:

response.css('li:first-of-type').get()

XPATH:

response.xpath('//li[1]').get()

Example 4: Selecting a Link by Its Href Attribute

This example demonstrates how to select an element based on its href attribute, a common task in web scraping for extracting links. It matters whenever the scraper has to follow a specific link to reach the next page.

HTML:

<a href="https://example.com">Visit Example</a>

CSS Selector:

response.css('a[href="https://example.com"]').get()

XPATH:

response.xpath('//a[@href="https://example.com"]').get()

Example 5: Selecting Text from Nested Elements

This example illustrates how to extract text from nested elements. It is used when text is split across multiple child elements but needs to be captured as a whole, such as in structured documents or formatted text. The selectors must reach descendant text nodes: 'p::text' and 'p/text()' return only the text directly inside the paragraph and skip the word inside the span.

HTML:

<div>
  <p>Text <span>inside</span> paragraph.</p>
</div>

CSS Selector:

response.css('div > p ::text').getall()

XPATH:

response.xpath('//div/p//text()').getall()

Example 6: Selecting Elements Containing Specific Text

This example demonstrates how to select elements based on their text content. XPATH is particularly useful for this scenario as it can directly query elements that contain specific text strings.

HTML:

<div>
  <p>Find me if you can!</p>
  <p>You can't see me.</p>
</div>

CSS Selector:

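Standard CSS selectors do not support selecting elements by their text content. Scrapy's CSS engine accepts the non-standard ':contains()' pseudo-class, but XPATH is the portable choice for this task.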

XPATH:

response.xpath('//p[contains(text(), "Find me")]').get()

Example 7: Selecting the Last Child of an Element

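This example shows how to select the last child of a specific parent element. Both CSS Selectors and XPATH provide straightforward methods to achieve this, making it simple to target the last item in a list. Strictly speaking, '//li[last()]' picks the last li among its siblings, which corresponds to CSS 'li:last-of-type'; it coincides with 'li:last-child' here because every child of the list is an li.

HTML: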

<ul>
  <li>First item</li>
  <li>Second item</li>
  <li>Last item</li>
</ul>

CSS Selector:

response.css('li:last-child').get()

XPATH:

response.xpath('//li[last()]').get()

Example 8: Selecting Attributes of an Element

This example illustrates how to extract the value of an attribute from an HTML element. Both methods are effective for retrieving attributes like 'src' from an 'img' tag.

HTML:

<img src="image.jpg" alt="An example image">

CSS Selector:

response.css('img::attr(src)').get()

XPATH:

response.xpath('//img/@src').get()

Example 9: Selecting Siblings Following a Specific Element

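This example targets the sibling elements that follow a specified element, such as the paragraphs after a header. It showcases how to use both CSS Selectors and XPATH to navigate sibling relationships within the DOM. In CSS, the general sibling combinator '~' matches every following sibling, which is what the XPATH following-sibling axis does; the adjacent combinator '+' would match only the paragraph immediately after the header.

HTML:

<div>
  <h2>Title</h2>
  <p>First paragraph after title.</p>
  <p>Second paragraph after title.</p>
</div>

CSS Selector:

response.css('h2 ~ p').getall()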

XPATH:

response.xpath('//h2/following-sibling::p').getall()

Example 10: Selecting Elements by Multiple Attributes

This example focuses on selecting elements based on a combination of attributes. It is particularly useful in forms where inputs need to be identified not just by their type but also by name or other attributes.

HTML:

<input type="text" name="username" placeholder="Enter Username">
<input type="password" name="password" placeholder="Enter Password">

CSS Selector:

response.css('input[type="password"][name="password"]').get()

XPATH:

response.xpath('//input[@type="password" and @name="password"]').get()

Final remarks

The ten examples provided demonstrate the practical implementation of both selector types in Scrapy spiders, offering insights into their syntax and showcasing how they can be used to fulfill different scraping needs.

CSS selectors, with their straightforward syntax borrowed from stylesheets, are exceptionally suited for quick and efficient selection of elements based on tag names, classes, IDs, and attributes. They are ideal when the target data sits in elements that the page's markup already labels clearly.

On the other hand, XPATH provides a more robust solution for navigating XML and HTML documents. Its ability to perform complex queries, including traversing upwards and sideways in the document hierarchy, makes it invaluable for more complex scraping scenarios where elements need to be selected based on a deeper analysis of the document structure.
