Handy CSS Selectors for Web Scraping
Gone are the days when nerds like us could 'right-click, View Source' on any webpage and find a clean DOM structure with readable IDs and classes. With modern tools like React or Tailwind, DOMs have become a messy jungle of auto-generated IDs and meaningless classes.
Scraping a webpage requires picking CSS selectors for the elements you want to parse and extract. And it's not always straightforward. However, CSS selectors have become more flexible and powerful, and they can save us from tricky scraping nightmares!
Below are a few of the CSS selectors we find most useful when scraping a website:
The Basics: ID, Class, Descendants
You already know these, but they remain the most useful:
#ID /* Select an element with an ID */
.class /* Select an element with a class */
A B /* Select an element B nested in an element A */
HTML attributes
Many DOM elements have HTML attributes that provide useful ways to target them.
Has an attribute
.product[data-price] /* has an HTML attribute called data-price */
Attribute has value
.product[available="true"] /* has an HTML attribute called 'available' with the value "true" */
Attribute contains, starts with, ends with
a[href*="foo"] /* the href contains "foo" */
a[href^="foo"] /* the href begins with "foo" */
a[href$="foo"] /* the href ends with "foo" */
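Interestingly, this also works with the class attribute, which can be particularly useful with classes generated by React components. Note that ^= tests the start of the whole attribute value, so it only matches when the generated class comes first; use *= if it may appear after other classes:
a[class^="ButtonComponent_"] /* the class attribute begins with "ButtonComponent_" */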
Attribute contains value in a list
This is an obscure one, but can sometimes save you from a headache. It applies to attributes containing a list of space-separated values, like:
<p countries="france spain italy">...</p>
You can use the ~= operator to select for one of the values in the list.
.product[countries~="france"] /* has an attribute 'countries' containing a list of words including "france" */
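To make the attribute operators concrete, here is a minimal Python sketch that mimics each one with plain string operations. The helper name attr_matches is hypothetical and only mirrors the matching rules described above; in practice your selector engine does this for you.

```python
def attr_matches(value: str, op: str, needle: str) -> bool:
    """Mimic a CSS attribute operator on one attribute value (illustrative helper)."""
    if op not in {"=", "~=", "*=", "^=", "$="}:
        raise ValueError(f"unsupported operator: {op}")
    if op == "=":
        return value == needle
    if needle == "":
        # In CSS, an empty needle matches nothing for ~=, *=, ^= and $=.
        return False
    if op == "~=":
        # The value is treated as a whitespace-separated list of words.
        return needle in value.split()
    if op == "*=":
        return needle in value
    if op == "^=":
        return value.startswith(needle)
    return value.endswith(needle)  # op == "$="


countries = "france spain italy"
print(attr_matches(countries, "~=", "france"))  # True: 'france' is one of the words
print(attr_matches(countries, "~=", "fran"))    # False: not a whole word
print(attr_matches(countries, "*=", "fran"))    # True: plain substring match

classes = "btn ButtonComponent_x9f2"
print(attr_matches(classes, "^=", "ButtonComponent_"))  # False: the value starts with 'btn'
print(attr_matches(classes, "*=", "ButtonComponent_"))  # True
```

The last two lines show why *= is often the safer choice for generated class names: ^= looks at the start of the whole attribute value, not at the start of each class.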
Nth element
These selectors pick an element by its position among its siblings.
Nth child
.product img:nth-child(2) /* an img within .product that is the second child of its parent */
.product img:nth-last-child(2) /* an img within .product that is the next to last child of its parent */
Nth of type
Similar to nth-child, except that it only considers elements of the same type (ul, p, img) when counting.
.product img:nth-of-type(2) /* the second img within .product */
.product img:nth-last-of-type(2) /* the next to last img within .product */
Shortcuts for first and last
There are a few helpers for the first and last elements.
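.product img:first-child /* an img that is the first child of its parent */
.product img:first-of-type /* the first img among its siblings */
.product img:last-child /* an img that is the last child of its parent */
.product img:last-of-type /* the last img among its siblings */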
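The difference between counting every sibling and counting only siblings of the same type is easy to get wrong, so here is a small Python sketch that emulates both with the standard library. The markup and the helper names nth_child and nth_of_type are made up for illustration, and the helpers only look at direct children of one parent, which is a simplification of what a real selector engine does.

```python
import xml.etree.ElementTree as ET

# Hypothetical product markup: the first child is a heading, not an image.
product = ET.fromstring(
    '<div class="product">'
    '<h2>Lamp</h2><img src="a.png"/><img src="b.png"/><p>Nice lamp</p>'
    '</div>'
)


def nth_child(parent, tag, n):
    """Like `tag:nth-child(n)`: position n counts every child, then the tag must match."""
    children = list(parent)
    if 1 <= n <= len(children) and children[n - 1].tag == tag:
        return children[n - 1]
    return None


def nth_of_type(parent, tag, n):
    """Like `tag:nth-of-type(n)`: position n counts only children with that tag."""
    same_type = [child for child in parent if child.tag == tag]
    return same_type[n - 1] if 1 <= n <= len(same_type) else None


print(nth_child(product, "img", 2).get("src"))    # a.png (second child overall)
print(nth_of_type(product, "img", 2).get("src"))  # b.png (second img)
print(nth_child(product, "img", 1))               # None: the first child is the h2
print(nth_of_type(product, "img", 1).get("src"))  # a.png (what img:first-of-type picks)
```

When the elements you want are mixed with headings or text blocks, the of-type variants are usually the ones that behave as expected.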
Pseudo selectors
Negative selection
.product:not(.disabled) /* doesn't have the class 'disabled' */
There are other pseudo selectors like ::first-line or :checked, but they are rarely useful when scraping.
Combination of Selectors
When IDs and classes don't provide enough structure, it can be useful to target elements based on their position in the DOM tree.
Children
The > operator is similar to A B, but only selects elements that are direct children.
.product > p /* all paragraphs that are direct children of the element with class 'product' */
Siblings
The + operator searches for elements that are directly adjacent. It will select only the element B that is directly preceded by the element A, at the same depth in the DOM.
.product + p /* a paragraph that immediately follows the element with class 'product' */
The ~ operator (the general sibling combinator) is similar to A + B, but instead of only the immediately following element, it selects every matching sibling that comes after A.
.product ~ p /* all paragraphs that are later siblings of the element with class 'product' */
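To compare the four combinators side by side, the following Python sketch emulates each one on a tiny, made-up document using only the standard library. It is an illustration of the matching rules, not how a selector engine is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one .product with nested paragraphs, followed by siblings.
root = ET.fromstring(
    '<section>'
    '<div class="product"><p>inner</p><span><p>deep</p></span></div>'
    '<p>first after</p><h3>heading</h3><p>second after</p>'
    '</section>'
)
product = root.find("div")

# Siblings that come after .product under the same parent.
siblings = list(root)
following = siblings[siblings.index(product) + 1:]

descendant = [p.text for p in product.iter("p")]            # .product p
child = [p.text for p in product if p.tag == "p"]           # .product > p
adjacent = [e.text for e in following[:1] if e.tag == "p"]  # .product + p
general = [e.text for e in following if e.tag == "p"]       # .product ~ p

print(descendant)  # ['inner', 'deep']
print(child)       # ['inner']
print(adjacent)    # ['first after']
print(general)     # ['first after', 'second after']
```

Note that `.product + p` would match nothing if the h3 came directly after the product, whereas `.product ~ p` would still find both paragraphs.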
The most modern ones
Some of these selectors are non-standard extensions or are not yet supported by every browser, but you can already use them in Mantabase!
Negation
The :not() pseudo-selector selects elements that don't match the condition within the parentheses.
.product:not(.hidden) /* elements that don't have the class 'hidden' */
Search by text
The :text() pseudo-selector matches the smallest element containing the specified text.
button:text("Submit") /* button with the text 'Submit' */
The :has-text() pseudo-selector matches any element containing the specified text somewhere inside it, possibly in a child or a descendant element.
div:has-text("Submit") /* div with the text 'Submit' somewhere inside it */
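The difference between the two text selectors can be sketched in Python as follows. This is only an emulation under stated assumptions: matching is a case-sensitive substring test, and "smallest" means that no child element also contains the text. The actual engine may normalize whitespace or case differently, and the markup and helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<form>'
    '<div id="wrap"><p>Ready?</p><button id="go">Submit</button></div>'
    '<button id="cancel">Cancel</button>'
    '</form>'
)


def full_text(element):
    """All text inside the element, including text of its descendants."""
    return "".join(element.itertext())


def has_text(scope, tag, needle):
    """Emulates tag:has-text(needle): the text may sit in any descendant."""
    return [e for e in scope.iter(tag) if needle in full_text(e)]


def smallest_with_text(scope, tag, needle):
    """Emulates tag:text(needle): keep only elements with no child that also contains it."""
    return [
        e for e in has_text(scope, tag, needle)
        if not any(needle in full_text(child) for child in e)
    ]


print([e.get("id") for e in has_text(root, "div", "Submit")])               # ['wrap']
print([e.get("id") for e in smallest_with_text(root, "button", "Submit")])  # ['go']
print([e.get("id") for e in smallest_with_text(root, "div", "Submit")])     # []
```

The div matches :has-text() because the text lives in its button, but it is not the smallest element containing it, so the :text() emulation skips it.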
Condition on children
The :has() pseudo-selector matches an element only if it contains a descendant matching the selector within the parentheses.
article:has(h1.title) /* article that contains an 'h1' with the class 'title' */
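As a rough illustration of what `article:has(h1.title)` checks, here is a Python sketch using only the standard library. The markup and the helper name has_descendant are hypothetical, and the helper handles just a tag plus one class, which is far less than a real :has() accepts.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<main>'
    '<article id="a"><header><h1 class="title big">Hello</h1></header></article>'
    '<article id="b"><h2 class="title">Almost</h2></article>'
    '<article id="c"><h1>No class</h1></article>'
    '</main>'
)


def has_descendant(element, tag, css_class):
    """True if any descendant has the given tag and carries the given class."""
    return any(
        css_class in descendant.get("class", "").split()
        for descendant in element.iter(tag)
        if descendant is not element
    )


matches = [
    article.get("id")
    for article in root.iter("article")
    if has_descendant(article, "h1", "title")
]
print(matches)  # ['a']
```

Only the first article matches: the h1 is nested inside a header, which shows that the condition applies to any descendant, not just direct children.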
That's all we've got!
We hope you found this list useful, happy scraping!