Unlock the Power of Markup Parsing with Cheerio
What is Cheerio?
Node.js runs JavaScript outside the browser, so it has no built-in DOM for parsing and manipulating markup. Enter Cheerio, an open-source JavaScript library designed specifically for this purpose. Cheerio provides a flexible and lean implementation of core jQuery, tailored for the server. With Cheerio, you can parse, manipulate, and render markup quickly through a concise API similar to jQuery's. It also works with XML documents.
Getting Started with Cheerio
To begin, you’ll need:
- Basic familiarity with HTML, CSS, and the DOM
- Familiarity with npm and Node.js
- Familiarity working with the command line and text editors
Setting Up Cheerio
Cheerio can be used in any ES6+, TypeScript, or Node.js project. For this tutorial, we’ll focus on Node.js. First, run npm init -y to generate a new package.json file. Then, install Cheerio by running npm install cheerio. Verify the installation by checking the package.json file for the new “dependencies” entry.
Understanding Cheerio
Loading
The first step in working with Cheerio is to load the HTML/XML document you want to parse or manipulate. Use the cheerio.load() method, which requires the HTML/XML document as an argument. Cheerio will automatically include <html>, <head>, and <body> tags if they’re not already present in your markup. You can disable this feature by setting the third argument to false.
Selectors
Use selectors to tell Cheerio what element you want to work on. Cheerio’s selector implementation is similar to jQuery, following CSS style with a few additions. Some commonly used selectors include:
- $("*"): Selects every element in the provided markup
- $("div"): Selects every instance of the <div> tag
- $(".foo"): Selects every element with the foo class
- $("#bar"): Selects the element with the unique bar id
- $(":focus"): Selects the element that currently has focus
- $("input[type='text']"): Selects any input element with an input type of text
- $(".foo .bar"): Selects all descendant elements with class bar under an element with class foo
DOM Manipulation
Cheerio comes with a range of DOM-related methods for accessing and manipulating HTML elements and their attributes. Some commonly used methods include:
- .text(): Sets or returns the text content of the selected element
- .html(): Sets or returns the inner HTML content of the selected element
- .append(): Inserts provided content as the last child of each selected element
- .prepend(): Inserts provided content as the first child of each selected element
- .addClass() and .removeClass(): Adds or removes provided classes to/from all matched elements
- .hasClass(): Returns a Boolean value indicating if the selected element has the provided class name
- .toggleClass(): Toggles the provided class on the selected element
Rendering
Once you’ve finished parsing and manipulating your markup, access its root content using $.root().html(), where $ is the instance returned by cheerio.load(). By default, when Cheerio renders HTML, void elements such as <br> are left without a closing tag. To render a valid XML document, use Cheerio’s XML utility function, $.xml().
Building FeatRocket
Now that we have a solid understanding of Cheerio, let’s build a sample project. We’ll create FeatRocket, a CLI application that scrapes all featured articles on the LogRocket blog and logs them to the console.
Understanding Website Structure
First, understand how the website content is arranged, including what attributes (class, id, href) are assigned to the elements you want to access.
Downloading Webpage Markup
Next, download the website content using Axios. Then, load the downloaded markup into a new Cheerio instance.
Filtering Out Results
Loop through each targeted div and log its contents to the console.
Conclusion
Cheerio is an excellent library for manipulating and scraping markup on the server. This tutorial has walked through how to get started using Cheerio in a real-life project. For further reference, check out the FeatRocket source code on GitHub.