Automating Web Interactions with Puppeteer

Puppeteer, a Node.js library that provides a high-level API for controlling headless Chrome, makes it possible to automate interactions with web pages. In this article, we’ll explore a basic example of using Puppeteer to search for a keyword on GitHub and fetch the title of the first result.

Setting Up Puppeteer and Node.js

To get started, let’s initialize a Node.js project and install the required packages. Create a new folder and navigate to it in your terminal. Run `npm init` to generate a `package.json` file, then install Puppeteer with `npm install puppeteer`.

Creating a Service File

Create a new file named service.mjs and add the following code to launch a Chrome instance and navigate to a URL:
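```javascript
import puppeteer from 'puppeteer';

// Exported so the server file can import it later.
export async function scrapePage(url) {
  // Launch a visible (non-headless) Chrome window so we can watch the automation.
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(url);
  //...
}
```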
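
The launch-and-navigate snippet above never closes the browser, which can leave Chrome processes running if a later step throws. One way to guard against that, offered only as a minimal sketch, is a small wrapper around the launch call. `withBrowser` is a hypothetical helper name, not part of Puppeteer, and it receives the launch function as a parameter so it makes no assumption about how the browser is started.

```javascript
// Hypothetical helper (not part of Puppeteer): runs `work` with a browser
// and always closes the browser afterwards, even if `work` throws.
async function withBrowser(launch, work) {
  const browser = await launch();
  try {
    return await work(browser);
  } finally {
    await browser.close();
  }
}

// Example usage inside service.mjs (assumes `puppeteer` is imported there):
// const title = await withBrowser(
//   () => puppeteer.launch({ headless: false }),
//   async (browser) => {
//     const page = await browser.newPage();
//     await page.goto(url);
//     return page.title();
//   }
// );
```

The `finally` clause runs on both the success and the failure path, so the caller still sees the original error while the Chrome process is released.
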
Inspecting the Page

To interact with the page, we first need to inspect it manually and identify the DOM elements to target. Open GitHub in a browser and inspect the search input field at the top of the page. We can target the element by its `.header-search-input` class name.

Targeting Elements with Puppeteer

Using Puppeteer, we can focus on the input field element and simulate typing. We’ll use the waitForSelector method to ensure the element is rendered on the page and ready for interaction.
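```javascript
async function scrapePage(url) {
  //...
  // Wait until the search input is rendered and visible before interacting.
  await page.waitForSelector('.header-search-input', { visible: true });
  await page.focus('.header-search-input');
  // With the input focused, keyboard events are delivered to it.
  await page.keyboard.type('react');
  await page.keyboard.press('Enter');
  //...
}
```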
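
Pressing Enter triggers a page navigation, and the next step should not run until the results page has loaded. A common Puppeteer pattern, sketched here, starts waiting for the navigation before pressing the key. `submitSearch` is a hypothetical helper name; the selector passed to it would be the one from this article, which may no longer match GitHub's current markup.

```javascript
// Hypothetical helper: types `text` into the element matching `selector`,
// submits with Enter, and resolves once the resulting navigation finishes.
// `page` is expected to be a Puppeteer Page.
async function submitSearch(page, selector, text) {
  await page.waitForSelector(selector, { visible: true });
  await page.type(selector, text);
  // Start waiting for the navigation *before* pressing Enter so that a fast
  // page load cannot complete before the listener is attached.
  await Promise.all([
    page.waitForNavigation(),
    page.keyboard.press('Enter'),
  ]);
}

// Example usage inside scrapePage:
// await submitSearch(page, '.header-search-input', 'react');
```
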

Scraping Data

After navigating to the search results page, we can scrape the title of the first result using the page.evaluate method.
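```javascript
async function scrapePage(url) {
  //...
  // Wait for the results list, then read the first result's title inside the page.
  const repoList = await page.waitForSelector('.repo-list');
  const title = await page.evaluate((repoList) => {
    const repo = repoList.querySelector('li');
    return repo.querySelector('.f4.text-normal').innerText;
  }, repoList);
  await browser.close(); // release the Chrome process before returning
  return title;
}
```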
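
The snippet above throws a `TypeError` inside the page if the list has no items or the heading class is missing. As a defensive variant, and only a sketch, the extraction logic can live in a named function that returns `null` when nothing matches. `extractFirstTitle` and `scrapeFirstTitle` are hypothetical names; the selectors are the ones used in this article.

```javascript
// Runs inside the browser page when passed to page.evaluate, so it must not
// reference any variables from the Node.js scope.
function extractFirstTitle(list) {
  const item = list.querySelector('li');
  const heading = item ? item.querySelector('.f4.text-normal') : null;
  return heading ? heading.innerText.trim() : null;
}

// `page` is expected to be a Puppeteer Page already showing search results.
async function scrapeFirstTitle(page) {
  const repoList = await page.waitForSelector('.repo-list');
  // The element handle is passed as an argument and arrives in the page
  // context as the actual DOM element.
  return page.evaluate(extractFirstTitle, repoList);
}
```

Keeping the extraction in its own function also makes it possible to exercise it with plain stub objects, without launching a browser.
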

Creating an Express Server

To serve the scraped data, we’ll create an Express server with a single endpoint (install Express with `npm install express`). The endpoint captures the keyword as a route parameter and calls the scrapePage function to fetch the data.
```javascript
import express from 'express';
import { scrapePage } from './service.mjs';

const app = express();

app.get('/:keyword', async (req, res) => {
  // Encode the keyword so spaces and special characters survive in the query string.
  const keyword = encodeURIComponent(req.params.keyword);
  try {
    const title = await scrapePage(`https://github.com/search?q=${keyword}`);
    res.send(title);
  } catch (error) {
    res.status(500).send(error.message);
  }
});

app.listen(3000, () => {
  console.log('Server listening on port 3000');
});
```
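
A keyword taken from the request path can contain spaces or characters such as `&`, so it needs to be encoded before it goes into the query string. One option, sketched here, is Node's built-in `URL` class, which handles the encoding when a search parameter is set. `buildSearchUrl` is a hypothetical helper name.

```javascript
// Hypothetical helper: builds the GitHub search URL for a keyword.
// URL and URLSearchParams are globals in Node.js, so no import is needed.
function buildSearchUrl(keyword) {
  const url = new URL('https://github.com/search');
  // searchParams.set encodes reserved characters for us.
  url.searchParams.set('q', keyword);
  return url.toString();
}

console.log(buildSearchUrl('react'));       // https://github.com/search?q=react
console.log(buildSearchUrl('react hooks')); // https://github.com/search?q=react+hooks
```

In the route handler, the result would be passed straight to `scrapePage`, for example `scrapePage(buildSearchUrl(req.params.keyword))`.
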
Deploying to Google Cloud Functions

To deploy our service to a serverless cloud function, we’ll create a new file named index.js and modify the code to export the Express app object.
```javascript
import express from 'express';
import { scrapePage } from './service.mjs';

const app = express();

app.get('/:keyword', async (req, res) => {
  //...
});

// Export the app instead of calling app.listen(); the platform handles the HTTP server.
export default app;
```

We'll also update the `package.json` file to include the required dependencies and set the `type` to `module`.
```json
{
  "name": "puppeteer-example",
  "version": "1.0.0",
  "type": "module",
  "dependencies": {
    "express": "^4.17.1",
    "puppeteer": "^13.0.1"
  }
}
```

Finally, we'll deploy our cloud function to Google Cloud Functions and set the entry point to the `index.js` file. We can then test our cloud function by invoking the trigger URL, which returns the title of the first repository in the list.
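
A cloud function instance has no display, so the `headless: false` setting used during local development is unlikely to work there. The sketch below picks launch options per environment. `launchOptions` and its `serverless` flag are hypothetical names, and the sandbox flags are an assumption about what the hosting environment requires, so verify them against your runtime before relying on them.

```javascript
// Hypothetical helper: returns options for puppeteer.launch().
// `serverless` is a flag the caller sets; how it is detected is left open.
function launchOptions(serverless) {
  if (serverless) {
    return {
      headless: true,
      // Assumption: the container cannot use Chrome's sandbox.
      args: ['--no-sandbox', '--disable-setuid-sandbox'],
    };
  }
  // Local development: show the browser window for easier debugging.
  return { headless: false };
}

// Example usage in service.mjs:
// const browser = await puppeteer.launch(launchOptions(true));
```
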
