Handy CSS Selectors for Web Scraping

Gone are the days when nerds like us could 'right-click, View Source' on any webpage and find a clean DOM structure with readable IDs and classes. With modern tooling like React or Tailwind, DOMs have become a messy jungle of auto-generated IDs and meaningless classes.

Scraping a webpage requires picking CSS selectors for the elements you want to parse and extract. And it's not always straightforward. However, CSS selectors have become more flexible and powerful, and they can save us from tricky scraping nightmares!

Below are a few of the CSS selectors we find most useful when scraping a website:

The Basics: ID, Class, Descendants

You already know these, but they remain the most useful:

css

#ID    /* Select an element with an ID */
.class /* Select an element with a class */
A B    /* Select an element B nested in an element A */

HTML attributes

Many DOM elements have HTML attributes that provide useful ways to target them.

Has an attribute

css

.product[data-price] /* has an HTML attribute called data-price */

Attribute has value

css

.product[available="true"] /* has an HTML attribute called 'available' with the value "true" */

Attribute contains, starts with, ends with

css

a[href*="foo"] /* the href contains "foo" */
a[href^="foo"] /* the href begins with "foo" */
a[href$="foo"] /* the href ends with "foo" */

Interestingly, this also works with classes, which can be particularly useful with classes generated by React Components:

css

a[class^="ButtonComponent_"] /* the class attribute begins with "ButtonComponent_" */

Attribute contains value in a list

This is an obscure one, but can sometimes save you from a headache. It applies to attributes containing a list of space-separated values, like:

html

<p countries="france spain italy">...</p>

You can use the ~ operator to select for one of the values in the list.

css

.product[countries~="france"] /* has an attribute 'countries' holding a space-separated list of words that includes "france" */
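
If the attribute operators blur together, it can help to spell them out as plain string checks. The sketch below is only an illustration in Python, not how a browser or Mantabase actually evaluates selectors; `attr_matches` and the sample attribute dictionaries are names we made up for the example.

```python
def attr_matches(attrs, name, op=None, needle=""):
    """Return True if an element's attributes satisfy [name op "needle"]."""
    if name not in attrs:
        return False                 # every form requires the attribute to exist
    value = attrs[name]
    if op is None:
        return True                  # [name]: presence is enough
    if op == "=":
        return value == needle       # exact match on the whole string
    if needle == "":
        return False                 # *=, ^=, $= and ~= never match an empty value
    if op == "*=":
        return needle in value       # substring anywhere
    if op == "^=":
        return value.startswith(needle)
    if op == "$=":
        return value.endswith(needle)
    if op == "~=":
        return needle in value.split()   # one whole word of a space-separated list
    raise ValueError(f"unknown operator: {op}")


# Hypothetical elements, reduced to their attributes.
link = {"href": "/foo/bar", "class": "nav ButtonComponent_x9f"}
para = {"countries": "france spain italy"}

starts_with_class = attr_matches(link, "class", "^=", "ButtonComponent_")
contains_class = attr_matches(link, "class", "*=", "ButtonComponent_")
word_match = attr_matches(para, "countries", "~=", "spain")
partial_word = attr_matches(para, "countries", "~=", "spa")
```

Note that `^=` compares against the whole attribute string, so `[class^="ButtonComponent_"]` only matches when the generated class comes first in the `class` attribute. When other classes may precede it, `*=` is the safer choice.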

Nth element

These selectors pick an element by its position among its siblings.

Nth child

css

.product img:nth-child(2)      /* an img inside .product that is the second child of its parent */
.product img:nth-last-child(2) /* an img inside .product that is the next-to-last child of its parent */

Nth of type

Similar to nth-child, except that it only considers elements of the same type (ul, p, img) when counting.

css

.product img:nth-of-type(2)      /* the second img among its siblings, within .product */
.product img:nth-last-of-type(2) /* the next-to-last img among its siblings, within .product */
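
The difference between the two is easy to get wrong, so here is a small sketch of what each one counts. It uses Python's standard `xml.etree.ElementTree` on a made-up product snippet; `nth_child` and `nth_of_type` are hypothetical helpers that mimic the selectors, not part of any scraping library.

```python
import xml.etree.ElementTree as ET

# A made-up product block: the images are the 2nd and 3rd children.
PRODUCT = ET.fromstring(
    '<div class="product">'
    '<h2>Lamp</h2>'
    '<img src="front.jpg"/>'
    '<img src="side.jpg"/>'
    '<p>A lamp.</p>'
    '</div>'
)


def nth_child(parent, tag, n):
    """Mimic `tag:nth-child(n)`: the n-th child overall, but only if it has this tag."""
    children = list(parent)
    if 1 <= n <= len(children) and children[n - 1].tag == tag:
        return children[n - 1]
    return None


def nth_of_type(parent, tag, n):
    """Mimic `tag:nth-of-type(n)`: the n-th child counting only children with this tag."""
    same_type = parent.findall(tag)  # direct children with that tag, in order
    return same_type[n - 1] if 1 <= n <= len(same_type) else None


second_child_img = nth_child(PRODUCT, "img", 2)     # the first <img>: it is child number 2
second_img = nth_of_type(PRODUCT, "img", 2)         # the second <img>
first_child_img = nth_child(PRODUCT, "img", 1)      # None: child number 1 is the <h2>
```

In other words, `img:nth-child(2)` silently matches nothing if the second child happens to be something other than an image, which is why `nth-of-type` is often the more robust pick on messy pages.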

Shortcuts for first and last

There are a few helpers for the first and last elements.

css

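.product img:first-child   /* an img that is the first child of its parent */
.product img:first-of-type /* the first img among its siblings */
.product img:last-child    /* an img that is the last child of its parent */
.product img:last-of-type  /* the last img among its siblings */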

Pseudo selectors

Negative selection

css

.product:not(.disabled) /* doesn't have the class 'disabled' */

There are other pseudo selectors, like ::first-line or :checked, but they are rarely needed when scraping.

Combination of Selectors

When IDs and classes don't provide enough structure, it can be useful to target elements based on their position in the DOM tree.

Children

The > operator is similar to A B, but only selects elements that are direct children.

css

.product > p /* all paragraphs that are direct children of the element with class 'product' */
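
To make the contrast with `A B` concrete, here is a rough equivalent using the limited XPath support in Python's standard `xml.etree.ElementTree`. The markup and variable names are invented for the example.

```python
import xml.etree.ElementTree as ET

# A made-up product block with one direct <p> and one nested deeper.
CARD = ET.fromstring(
    '<div class="product">'
    '<p>Direct child</p>'
    '<section><p>Nested deeper</p></section>'
    '</div>'
)

# Roughly `.product > p`: only paragraphs one level below the product.
direct = [p.text for p in CARD.findall("./p")]

# Roughly `.product p`: paragraphs at any depth below the product.
nested = [p.text for p in CARD.findall(".//p")]
```

Here `direct` holds only the first paragraph, while `nested` holds both.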

Siblings

The + operator searches for elements that are directly adjacent. It will select only the element B that is directly preceded by the element A, at the same depth in the DOM.

css

.product + p /* the paragraph that immediately follows the element with class 'product', if there is one */

The ~ operator (also called the general sibling operator) is similar to A + B, but instead of selecting only the element immediately after A, it selects every matching sibling that comes after A under the same parent.

css

.product ~ p /* all paragraphs that come after the element with class 'product' and share its parent */
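
The two sibling operators can be sketched as loops over a parent's children. This is an illustration only, written with Python's standard `xml.etree.ElementTree`; the markup and the helpers `adjacent_sibling`, `general_siblings` and `is_product` are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up markup: a product followed by a paragraph, a span, and another paragraph.
SECTION = ET.fromstring(
    '<section>'
    '<p>Intro</p>'
    '<div class="product"/>'
    '<p>First after</p>'
    '<span>Note</span>'
    '<p>Second after</p>'
    '</section>'
)


def is_product(el):
    """True if the element has the class 'product'."""
    return "product" in el.get("class", "").split()


def adjacent_sibling(parent, is_a, tag_b):
    """Mimic `A + B`: each B whose immediately preceding sibling is an A."""
    children = list(parent)
    return [b for a, b in zip(children, children[1:]) if is_a(a) and b.tag == tag_b]


def general_siblings(parent, is_a, tag_b):
    """Mimic `A ~ B`: each B that comes after some A under the same parent."""
    seen_a = False
    matches = []
    for child in parent:
        if seen_a and child.tag == tag_b:
            matches.append(child)
        if is_a(child):
            seen_a = True
    return matches


plus = [p.text for p in adjacent_sibling(SECTION, is_product, "p")]
tilde = [p.text for p in general_siblings(SECTION, is_product, "p")]
```

With this markup, `plus` contains only "First after", while `tilde` also picks up "Second after" even though a `span` sits in between. Neither matches "Intro", because it comes before the product.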

The most modern ones

Some of the latest CSS selectors are not yet fully supported by all modern browsers, but you can already use them in Mantabase!

Negation

The :not() pseudo-selector selects elements that don't match the condition within the parentheses.

css

.product:not(.hidden) /* elements that don't have the class 'hidden' */

Search by text

The :text() pseudo-selector matches the smallest element containing the specified text.

css

button:text("Submit") /* button with the text 'Submit' */

The :has-text() pseudo-selector matches any element containing the specified text somewhere inside, possibly in a child or a descendant element.

css

div:has-text("Submit") /* div with the text 'Submit' in a child element */
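
These two text selectors are not part of standard CSS, so it is worth being precise about the difference. The sketch below is our reading of the descriptions above, written with Python's standard `xml.etree.ElementTree`: it assumes "smallest" means that no child element also contains the text, and it uses a plain case-sensitive substring check. The exact matching rules (case, whitespace) are not covered here, and `full_text`, `has_text` and `smallest_with_text` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Made-up markup: the text lives in the button, wrapped by a form and a div.
FORM = ET.fromstring('<div><form><button>Submit</button></form></div>')


def full_text(el):
    """All text inside an element, including text in its descendants."""
    return "".join(el.itertext())


def has_text(root, tag, needle):
    """Roughly `tag:has-text("needle")`: the text may sit in any descendant."""
    return [el for el in root.iter(tag) if needle in full_text(el)]


def smallest_with_text(root, needle):
    """Roughly `:text("needle")`, assuming 'smallest' means no child also contains it."""
    return [
        el
        for el in root.iter()
        if needle in full_text(el)
        and not any(needle in full_text(child) for child in el)
    ]


wrappers = [el.tag for el in has_text(FORM, "div", "Submit")]
smallest = [el.tag for el in smallest_with_text(FORM, "Submit")]
```

Under these assumptions the `div` qualifies for the has-text style match because the text appears somewhere inside it, while only the `button` is the smallest element holding the text.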

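Condition on children

The :has() pseudo-selector matches an element only if it contains at least one element matching the selector in the parentheses. The contained element can be a direct child or a deeper descendant.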

css

article:has(h1.title) /* article that contains an 'h1' with the class 'title' */
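
As a rough illustration of what `:has()` checks, here is the same condition written by hand with Python's standard `xml.etree.ElementTree`. The page markup, the `id` values and the helpers `has_class` and `articles_with_title` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up page: only the first article contains an <h1 class="title">,
# and it is nested inside a <header> rather than being a direct child.
PAGE = ET.fromstring(
    '<main>'
    '<article id="a1"><header><h1 class="title">Post</h1></header></article>'
    '<article id="a2"><h2 class="title">Teaser</h2></article>'
    '</main>'
)


def has_class(el, name):
    """True if `name` is one of the element's space-separated classes."""
    return name in el.get("class", "").split()


def articles_with_title(root):
    """Mimic `article:has(h1.title)`: keep articles with a matching h1 at any depth."""
    return [
        article
        for article in root.iter("article")
        if any(has_class(h1, "title") for h1 in article.iter("h1"))
    ]


matching_ids = [a.get("id") for a in articles_with_title(PAGE)]
```

Only the first article is kept: the second one has a `title` class too, but on an `h2`, so it does not satisfy the condition.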

That's all we've got! We hope you found this list useful. Happy scraping!

Nicolas Baissas

#Guide

March 22, 2023

The Mantabase Blog