Unleash the Power of Web Scraping with Colly

Web scraping is an essential technique for extracting valuable data from websites that lack a dedicated API. By leveraging Colly, a powerful Go package, you can build efficient web scrapers and crawlers to collect data from the internet.

Getting Started with Colly

To begin, you’ll need a system with Go installed (version 1.14 or higher). Create a new directory for your project and initialize a Go module using the following commands:

```bash
mkdir celeb-scraping
cd celeb-scraping
go mod init celeb-scraping
```

Next, install Colly as an external package:

```bash
go get -u github.com/gocolly/colly/v2
```

Understanding Colly’s Collector Component

At the heart of Colly is the Collector component, which makes the network calls and can be configured to suit your needs. You can initialize a new Collector with custom options or use the default settings:

```go
c := colly.NewCollector(
	colly.AllowedDomains("imdb.com"),
	colly.UserAgent("Mozilla/5.0"),
)
```

Collectors can also have callbacks attached, such as OnRequest and OnHTML, which are executed at different stages of the collection lifecycle.
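
For example, here's a minimal sketch of attaching lifecycle callbacks to the collector `c` from above (it assumes `fmt` is imported; the log messages are just illustrative):

```go
// Runs before every request is made.
c.OnRequest(func(r *colly.Request) {
	fmt.Println("Visiting:", r.URL.String())
})

// Runs if a request fails.
c.OnError(func(r *colly.Response, err error) {
	fmt.Println("Request to", r.Request.URL, "failed:", err)
})
```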

Scraping Celebrity Data from IMDB

Let’s create a scraper that extracts celebrity data from IMDB. We’ll define two functions: main and crawl. The main function will call crawl to visit and extract the required information from the web page.

```go
func main() {
	month := flag.Int("month", 12, "Month of birth")
	day := flag.Int("day", 25, "Day of birth")
	flag.Parse()
	crawl(*month, *day)
}

func crawl(month int, day int) {
	c := colly.NewCollector(
		colly.AllowedDomains("imdb.com"),
	)
	//…
}
```

Traversing HTML Pages with Colly

To extract the complete list of celebrities born on a specific date, we'll recursively follow the pagination links by attaching an OnHTML callback to the collector object.

```go
c.OnHTML("a.lister-page-next.next-page", func(e *colly.HTMLElement) {
	nextPageURL := e.Request.AbsoluteURL(e.Attr("href"))
	c.Visit(nextPageURL)
})
```

Marshaling HTML to Go Structs

Let’s define the movie and star structs to hold each celebrity’s data. We’ll extract the bio-data from the profile container and loop through the top movies featured on the page.

```go
type movie struct {
	Title string `json:"title"`
	//…
}

type star struct {
	Name   string  `json:"name"`
	Bio    string  `json:"bio"`
	Movies []movie `json:"movies"`
	//…
}
```
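
With the structs in place, an OnHTML callback can populate them. The sketch below is one way that extraction might look; the CSS selectors (`div.lister-item`, `h3 a`, and so on) are illustrative assumptions, since IMDB's markup changes over time, and it assumes `encoding/json` and `fmt` are imported:

```go
c.OnHTML("div.lister-item", func(e *colly.HTMLElement) {
	s := star{
		Name: e.ChildText("h3 a"),         // illustrative selector
		Bio:  e.ChildText("p.text-muted"), // illustrative selector
	}

	// Loop through the top movies listed for this celebrity.
	e.ForEach("p.text-small a", func(_ int, el *colly.HTMLElement) {
		s.Movies = append(s.Movies, movie{Title: el.Text})
	})

	// Marshal the populated struct to JSON for output.
	if data, err := json.MarshalIndent(s, "", "  "); err == nil {
		fmt.Println(string(data))
	}
})
```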

Receiving CLI Arguments using Flags

To make the scraper dynamic, the main function uses CLI flags so you can pass in any day and month as command-line arguments:

```go
func main() {
	month := flag.Int("month", 12, "Month of birth")
	day := flag.Int("day", 25, "Day of birth")
	flag.Parse()
	crawl(*month, *day)
}
```

Build and Run the Scraper

Finally, build and run the scraper using the following commands:

```bash
go build main.go
./main --month=10 --day=10
```

You should see the scraped celebrity data printed to the terminal as JSON.
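
Based on the star struct defined earlier, each scraped celebrity would take a shape like the following (the values are placeholders, not real output):

```json
{
  "name": "…",
  "bio": "…",
  "movies": [
    { "title": "…" }
  ]
}
```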

Take Your Web Scraping Skills to the Next Level

With Colly, you can unlock the full potential of web scraping and crawling. Explore more advanced techniques, such as simulating random delays between requests and configuring collectors to store visited URLs and cookies on Redis.
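
For instance, here's a minimal sketch of adding random delays using Colly's built-in limit rules; the `newPoliteCollector` helper name, the five-second cap, and the parallelism value are assumptions chosen for illustration:

```go
import (
	"log"
	"time"

	"github.com/gocolly/colly/v2"
)

// newPoliteCollector is a hypothetical helper that builds a collector
// with a rate-limiting rule attached.
func newPoliteCollector() *colly.Collector {
	c := colly.NewCollector(colly.AllowedDomains("imdb.com"))

	// Allow at most two concurrent requests per matching domain and
	// sleep a random duration of up to five seconds between requests.
	if err := c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		RandomDelay: 5 * time.Second,
	}); err != nil {
		log.Fatal(err)
	}
	return c
}
```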

Join the LogRocket Community

Interested in exploring more about web scraping and error tracking? Join LogRocket’s developer community to stay updated on the latest trends and best practices.
