Web Scraping from 0 to Hero: XPATH and CSS Selectors in Web Scraping
What selectors are and how XPATH and CSS differ, shown through ten examples.
During this course, “Web Scraping from 0 to Hero”, we have seen how to create our first scrapers with Scrapy and Playwright. One of the key elements of any scraper, regardless of the tool used, is the set of selectors that parse the HTML.
Selectors are expressions that “select” the relevant parts of the HTML, according to the specs we have for the output.
The two most common ways to write selectors are the XPATH language and CSS selectors.
We’ve seen the differences between the two options in a previous article of The Web Scraping Club.
In this episode of the course, we look at the practical differences between the two approaches through ten examples.
Practical Examples: Using XPATH and CSS Selectors in Scrapy Spiders
Below are examples of how both XPATH and CSS selectors can be used in Scrapy to extract data from the same HTML snippets. Each example includes a piece of HTML, followed by how you would write the CSS selector and the XPATH to target specific elements within that HTML.
Example 1: Selecting All Paragraphs
HTML:
This example demonstrates how to select all paragraph elements within a specified parent element (a div). It is useful for extracting all textual content that falls under a specific section of a webpage.
<div>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</div>
CSS Selector:
response.css('div > p').getall()
XPATH:
response.xpath('//div/p').getall()
Example 2: Selecting Elements with a Specific Class
HTML:
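This example focuses on selecting elements based on their class attribute. It is useful when the class is what distinguishes the content you want from the rest of a complex page. Note that the CSS selector matches any element whose class list contains highlight, while the XPATH expression matches only when the class attribute is exactly "highlight".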
<div>
<p class="highlight">Highlighted paragraph.</p>
<p>Regular paragraph.</p>
</div>
CSS Selector:
response.css('p.highlight').getall()
XPATH:
response.xpath('//p[@class="highlight"]').getall()
Example 3: Selecting the First Element of a Specific Type
HTML:
This example showcases how to select the first element of a specific type within its parent. It's particularly useful for scenarios where only the initial item in a list or section is needed for analysis or data extraction.
<ul>
<li>First item</li>
<li>Second item</li>
</ul>
CSS Selector:
response.css('li:first-of-type').get()
XPATH:
response.xpath('//li[1]').get()
Example 4: Selecting a Link by Its Href Attribute
HTML:
This example demonstrates how to select an element based on its href attribute, a common task in web scraping for extracting links. It is crucial for navigation or when links are dynamically generated and need to be followed during scraping.
<a href="https://example.com">Visit Example</a>
CSS Selector:
response.css('a[href="https://example.com"]').get()
XPATH:
response.xpath('//a[@href="https://example.com"]').get()
Example 5: Selecting Text from Nested Elements
HTML:
This example illustrates how to extract text from nested elements. It is used when text is split across multiple child elements but needs to be captured as a whole, such as in structured documents or formatted text. Selecting only the paragraph's own text nodes (p::text in CSS, p/text() in XPATH) would skip the word inside the span, so both selectors here target every descendant text node; joining the returned pieces rebuilds the full sentence.
<div>
<p>Text <span>inside</span> paragraph.</p>
</div>
CSS Selector:
response.css('div > p ::text').getall()
XPATH:
response.xpath('//div/p//text()').getall()
Example 6: Selecting Elements Containing Specific Text
HTML:
This example demonstrates how to select elements based on their text content. XPATH is particularly useful for this scenario as it can directly query elements that contain specific text strings.
<div>
<p>Find me if you can!</p>
<p>You can't see me.</p>
</div>
CSS Selector:
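Standard CSS has no selector for matching elements by their text content. The cssselect library that Scrapy relies on accepts a non-standard :contains() pseudo-class, but XPATH is the more portable choice here.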
XPATH:
response.xpath('//p[contains(text(), "Find me")]').get()
Example 7: Selecting the Last Child of an Element
HTML:
This example shows how to select the last child of a specific parent element. Both CSS Selectors and XPATH provide straightforward methods to achieve this, making it simple to target the last item in a list.
<ul>
<li>First item</li>
<li>Second item</li>
<li>Last item</li>
</ul>
CSS Selector:
response.css('li:last-child').get()
XPATH:
response.xpath('//li[last()]').get()
Example 8: Selecting Attributes of an Element
HTML:
This example illustrates how to extract the value of an attribute from an HTML element. Both methods are effective for retrieving attributes like 'src' from an 'img' tag.
<img src="image.jpg" alt="An example image">
CSS Selector:
response.css('img::attr(src)').get()
XPATH:
response.xpath('//img/@src').get()
Example 9: Selecting Siblings Following a Specific Element
HTML:
This example targets the sibling elements that follow a specified element, such as the paragraphs after a header. The general sibling combinator (~) in CSS and the following-sibling axis in XPATH both return every matching sibling that comes later under the same parent; the adjacent combinator (h2 + p) would return only the paragraph directly after the header.
<div>
<h2>Title</h2>
<p>First paragraph after title.</p>
<p>Second paragraph after title.</p>
</div>
CSS Selector:
response.css('h2 ~ p').getall()
XPATH:
response.xpath('//h2/following-sibling::p').getall()
Example 10: Selecting Elements by Multiple Attributes
HTML:
This example focuses on selecting elements based on a combination of attributes. It is particularly useful in forms where inputs need to be identified not just by their type but also by name or other attributes.
<input type="text" name="username" placeholder="Enter Username">
<input type="password" name="password" placeholder="Enter Password">
CSS Selector:
response.css('input[type="password"][name="password"]').get()
XPATH:
response.xpath('//input[@type="password" and @name="password"]').get()
Final remarks
The ten examples provided demonstrate the practical implementation of both selector types in Scrapy spiders, offering insights into their syntax and showcasing how they can be used to fulfill different scraping needs.
CSS selectors, with their concise syntax borrowed from stylesheets, are well suited to quick selection of elements by tag, class, ID, and attribute. They are a good fit when the markup already labels the target data clearly.
On the other hand, XPATH is the more expressive option for navigating XML and HTML documents. Its ability to perform complex queries, including matching on text content and traversing upwards and sideways in the document hierarchy, makes it invaluable for more complex scraping scenarios where elements need to be selected based on a deeper analysis of the document structure.