Automating Web Interactions with Puppeteer
Puppeteer is a Node.js library that provides a high-level API for controlling Chrome or Chromium, making it well suited for automating interactions with web pages. In this article, we'll walk through a basic example of using Puppeteer to search for a keyword on GitHub and fetch the title of the first result.
Setting Up Puppeteer and Node.js
To get started, let's initialize a Node.js project and install the required packages. Create a new folder and navigate to it in your terminal. Run `npm init` to generate a `package.json` file, then install Puppeteer with `npm install puppeteer`.
Creating a Service File
Create a new file named `service.mjs` and add the following code to launch a Chrome instance and navigate to a URL:
```javascript
import puppeteer from 'puppeteer';

// Launch a Chrome instance (visible, since headless is false) and open the given URL.
export async function scrapePage(url) {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(url);
  // ...
}
```
Inspecting the Page
To interact with the page, we need to manually inspect it and identify the DOM elements to target. Open GitHub in a browser and inspect the search input field at the top of the page. We can use the `.header-search-input` class name to target the element.
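Before writing any Puppeteer code, it is worth confirming the selector in the browser's DevTools console; this is just a hypothetical quick check, not part of the scraper itself:

```javascript
// Run in the DevTools console on github.com: should return the <input> element, not null.
document.querySelector('.header-search-input');
```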
Targeting Elements with Puppeteer
Using Puppeteer, we can focus the input field element and simulate typing. We'll use the `waitForSelector` method to ensure the element is rendered on the page and ready for interaction.
```javascript
async function scrapePage(url) {
  // ...
  // Wait for the search input to be visible, type a query, and submit it.
  await page.waitForSelector('.header-search-input', { visible: true });
  await page.focus('.header-search-input');
  await page.keyboard.type('react');
  await page.keyboard.press('Enter');
  await page.waitForNavigation();
  // ...
}
```
Scraping Data
After navigating to the search results page, we can scrape the title of the first result using the `page.evaluate` method.
```javascript
async function scrapePage(url) {
  // ...
  // Wait for the results list, then read the title of the first repository in it.
  const repoList = await page.waitForSelector('.repo-list');
  const title = await page.evaluate((repoList) => {
    const repo = repoList.querySelector('li');
    return repo.querySelector('.f4.text-normal').innerText;
  }, repoList);
  await browser.close();
  return title;
}
```
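To sanity-check the scraper before putting a server in front of it, we can call it directly. The snippet below is a minimal sketch: it assumes `scrapePage` is exported from `service.mjs` as shown above, and the file name `test.mjs` is arbitrary.

```javascript
// test.mjs — standalone test of the scraper (run with: node test.mjs)
import { scrapePage } from './service.mjs';

const title = await scrapePage('https://github.com/search?q=react');
console.log(title); // should print the name of the first repository in the results
```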
Creating an Express Server
To serve the scraped data, we’ll create an Express server with a single endpoint. The endpoint will capture the keyword as a route parameter and call the scrapePage
function to fetch the data.
```javascript
import express from 'express';
import { scrapePage } from './service.mjs';

const app = express();

// GET /:keyword — scrape GitHub search results for the given keyword.
app.get('/:keyword', async (req, res) => {
  const keyword = req.params.keyword;
  try {
    const title = await scrapePage(`https://github.com/search?q=${keyword}`);
    res.send(title);
  } catch (error) {
    res.status(500).send(error.message);
  }
});

app.listen(3000, () => {
  console.log('Server listening on port 3000');
});
```
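Once the server is running, the endpoint can be exercised with a quick request. The snippet below is a sketch assuming Node 18+ (where `fetch` is available globally) and an ES module context for top-level `await`:

```javascript
// Smoke test against the local server started above.
const response = await fetch('http://localhost:3000/react');
console.log(await response.text()); // the title of the first "react" result on GitHub
```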
Deploying to Google Cloud Functions
To deploy our service to a serverless cloud function, we'll create a new file named `index.js` and modify the code to export the Express app object.
```javascript
import express from 'express';
import { scrapePage } from './service.mjs';

const app = express();

app.get('/:keyword', async (req, res) => {
  // ...
});

export default app;
```
We'll also update the `package.json` file to include the required dependencies and set the `type` to `module`.

```json
{
  "name": "puppeteer-example",
  "version": "1.0.0",
  "type": "module",
  "dependencies": {
    "express": "^4.17.1",
    "puppeteer": "^13.0.1"
  }
}
```
Finally, we'll deploy our cloud function to Google Cloud Functions, setting the entry point to the app exported from the `index.js` file. We can then test our cloud function by invoking the trigger URL, which returns the title of the first repository in the list.
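If the runtime does not pick up the default export automatically, an explicit alternative is to register the app with Google's Functions Framework. The following is only a sketch, assuming the `@google-cloud/functions-framework` package is added to the dependencies; the entry point name `app` is arbitrary.

```javascript
// index.js — alternative wiring: register the Express app as a named HTTP function.
import { http } from '@google-cloud/functions-framework';
import express from 'express';
import { scrapePage } from './service.mjs';

const app = express();

app.get('/:keyword', async (req, res) => {
  try {
    const title = await scrapePage(`https://github.com/search?q=${req.params.keyword}`);
    res.send(title);
  } catch (error) {
    res.status(500).send(error.message);
  }
});

// "app" is the value to pass as the entry point when deploying the function.
http('app', app);
```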