Unlock the Power of Markup Parsing with Cheerio

What is Cheerio?

Out of the box, Node.js gives you no way to parse and manipulate markup, because it runs outside the browser and has no DOM. But what if you could? Enter Cheerio, an open-source JavaScript library designed specifically for this purpose. Cheerio provides a flexible and lean implementation of core jQuery, tailored for the server. With Cheerio, you can parse, manipulate, and render markup quickly, using a concise and simple API (similar to jQuery). Plus, it works with XML documents too!

Getting Started with Cheerio

To begin, you’ll need:

  • Basic familiarity with HTML, CSS, and the DOM
  • Familiarity with npm and Node.js
  • Familiarity working with the command line and text editors

Setting Up Cheerio

Cheerio can be used in any ES6+, TypeScript, or Node.js project. For this tutorial, we’ll focus on Node.js. First, run npm init -y to generate a new package.json file. Then, install Cheerio by running npm install cheerio. Verify the installation by checking that package.json now lists cheerio under “dependencies”.

Understanding Cheerio

Loading

The first step in working with Cheerio is to load the HTML/XML markup you want to parse or manipulate. Use the cheerio.load() method, which takes the markup as a string in its first argument and returns a jQuery-like function, conventionally named $. Cheerio will automatically add <html>, <head>, and <body> tags if they’re not already present in your markup. You can disable this behavior by passing false as the third argument.

Selectors

Use selectors to tell Cheerio what element you want to work on. Cheerio’s selector implementation is similar to jQuery, following CSS style with a few additions. Some commonly used selectors include:

  • $("*") – Selects every element on the provided markup
  • $("div") – Selects every instance of the <div> tag
  • $(".foo") – Selects every element with the foo class
  • $("#bar") – Selects the element with the unique bar id
  • $(":focus") – Selects the element that currently has focus
  • $("input[type='text']") – Selects any input element with an input type of text
  • $(".foo .bar") – Selects all descendant elements with class bar under an element with class foo

Events and DOM Manipulation

Cheerio comes with a range of DOM-related methods for accessing and manipulating HTML elements and their attributes. Some commonly used methods include:

  • .text() – Sets or returns the combined text content of the selected elements
  • .html() – Sets or returns the inner HTML content of the selected element
  • .append() – Inserts provided content as the last child of each selected element
  • .prepend() – Inserts provided content as the first child of each selected element
  • .addClass() and .removeClass() – Adds or removes provided classes to/from all matched elements
  • .hasClass() – Returns a Boolean value indicating if the selected element has the provided class name
  • .toggleClass() – Toggles the provided class on the selected element

Rendering

Once you’ve finished parsing and manipulating your markup, render it back to a string by calling $.root().html() (or simply $.html()) on the instance returned by cheerio.load(). By default, when rendering HTML content in Cheerio, some tags may be left open, such as void elements like <br>. To render a valid XML document instead, use Cheerio’s XML utility function, $.xml().

Building FeatRocket

Now that we have a solid understanding of Cheerio, let’s build a sample project. We’ll create FeatRocket, a CLI application that scrapes all featured articles on the LogRocket blog and logs them to the console.

Understanding Website Structure

First, understand how the website content is arranged, including what attributes (class, id, href) are assigned to the elements you want to access.

Downloading Webpage Markup

Next, download the website content using Axios. Then, load the downloaded markup into a new Cheerio instance.

Filtering Out Results

Finally, loop through each targeted div and log its contents to the console.

Conclusion

Cheerio is an excellent library for manipulating and scraping markup on the server. This tutorial has walked through how to get started using Cheerio in a real-life project. For further reference, check out the FeatRocket source code on GitHub.
