Unleash the Power of Web Scraping with Colly

Web scraping is an essential technique for extracting valuable data from websites that lack a dedicated API. By leveraging Colly, a powerful Go package, you can build efficient web scrapers and crawlers to collect data from the internet.

Getting Started with Colly

To begin, you’ll need a system with Go installed (version 1.14 or higher). Create a new directory for your project and initialize a Go module using the following commands:

```bash
mkdir celeb-scraping
cd celeb-scraping
go mod init celeb-scraping
```

Next, install Colly as an external package:

```bash
go get -u github.com/gocolly/colly/v2
```

Understanding Colly’s Collector Component

At the heart of Colly is the Collector, the component that makes the network calls and runs your callbacks. You can initialize a new Collector with custom options or use the default settings:

```go
c := colly.NewCollector(
	// AllowedDomains matches hostnames exactly, so list the www host as well.
	colly.AllowedDomains("imdb.com", "www.imdb.com"),
	colly.UserAgent("Mozilla/5.0"),
)
```

Collectors can also have callbacks attached, such as OnRequest and OnHTML, which are executed at different stages of the collection lifecycle.

Scraping Celebrity Data from IMDB

Let’s create a scraper that extracts celebrity data from IMDB. We’ll define two functions: main and crawl. The main function will call crawl to visit and extract the required information from the web page.

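```go
package main

import (
	"flag"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Read the birth month and day from the command line.
	month := flag.Int("month", 12, "Month of birth")
	day := flag.Int("day", 25, "Day of birth")
	flag.Parse()
	crawl(*month, *day)
}

func crawl(month int, day int) {
	c := colly.NewCollector(
		// AllowedDomains matches hostnames exactly, so list the www host as well.
		colly.AllowedDomains("imdb.com", "www.imdb.com"),
	)
	// …
}
```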

Traversing HTML Pages with Colly

To extract the complete list of celebrities born on a specific date, we’ll recursively visit the next pages by attaching an OnHTML callback to the collector object.

```go
c.OnHTML("a.lister-page-next.next-page", func(e *colly.HTMLElement) {
	nextPageURL := e.Request.AbsoluteURL(e.Attr("href"))
	c.Visit(nextPageURL)
})
```
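
The `href` on a next-page link is often relative, which is why the callback passes it through `AbsoluteURL` before visiting it. As a rough illustration of that kind of resolution, the sketch below does the same job with only the standard library. The `resolveNext` helper and the example.com URLs are made up for this example, and this is an approximation, not Colly's actual implementation.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveNext resolves a possibly relative href against the URL of the page
// it was found on. It is a hypothetical helper for illustration only.
func resolveNext(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference applies the standard relative-reference rules.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	// Both values are placeholders, not real IMDB URLs.
	next, err := resolveNext("https://www.example.com/list?page=1", "/list?page=2")
	if err != nil {
		fmt.Println("could not resolve link:", err)
		return
	}
	fmt.Println(next) // prints https://www.example.com/list?page=2
}
```

An href that is already absolute comes back unchanged, so the same call is safe for both kinds of link.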

Marshaling HTML to Go Structs

Let’s define the movie and star structs to hold each celebrity’s data. We’ll extract the bio-data from the profile container and loop through the top movies featured on the page.

```go
type movie struct {
	Title string `json:"title"`
	// …
}

type star struct {
	Name   string  `json:"name"`
	Bio    string  `json:"bio"`
	Movies []movie `json:"movies"`
	// …
}
```
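
The `json` struct tags control the key names used when a value is serialized. The sketch below shows their effect with `encoding/json`, using only the fields shown above (the elided ones are left out). The `toJSON` helper and the sample values are placeholders, not real scraped data.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type movie struct {
	Title string `json:"title"`
}

type star struct {
	Name   string  `json:"name"`
	Bio    string  `json:"bio"`
	Movies []movie `json:"movies"`
}

// toJSON renders a star as compact JSON, using the struct tags for key names.
// It is a hypothetical helper for this example.
func toJSON(s star) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Placeholder values for illustration only.
	s := star{
		Name:   "Jane Doe",
		Bio:    "Placeholder bio.",
		Movies: []movie{{Title: "Example Film"}},
	}
	out, err := toJSON(s)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(out)
	// prints {"name":"Jane Doe","bio":"Placeholder bio.","movies":[{"title":"Example Film"}]}
}
```

Without the tags, the keys would fall back to the Go field names (`Name`, `Bio`, `Movies`).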

Receiving CLI Arguments using Flags

To make our scraper more dynamic, we’ll add support for CLI flags to pass in any day and month as command-line arguments.

```go
func main() {
	month := flag.Int("month", 12, "Month of birth")
	day := flag.Int("day", 25, "Day of birth")
	flag.Parse()
	crawl(*month, *day)
}
```
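
Because the flags accept any integer, one possible refinement is to reject out-of-range values before crawling. The sketch below is an optional variation, not part of the original scraper. The `parseDate` helper and its range checks are assumptions, and it uses a `flag.FlagSet` so the parsing can be exercised without touching the real command line.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseDate parses --month and --day from args and rejects out-of-range values.
// It is a hypothetical helper; the defaults mirror the ones used in main above.
func parseDate(args []string) (month, day int, err error) {
	fs := flag.NewFlagSet("celeb-scraping", flag.ContinueOnError)
	fs.IntVar(&month, "month", 12, "Month of birth")
	fs.IntVar(&day, "day", 25, "Day of birth")
	if err = fs.Parse(args); err != nil {
		return 0, 0, err
	}
	if month < 1 || month > 12 {
		return 0, 0, fmt.Errorf("month must be between 1 and 12, got %d", month)
	}
	if day < 1 || day > 31 {
		return 0, 0, fmt.Errorf("day must be between 1 and 31, got %d", day)
	}
	return month, day, nil
}

func main() {
	month, day, err := parseDate(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	// In the scraper, this is where crawl(month, day) would be called.
	fmt.Printf("month=%d day=%d\n", month, day)
}
```

The check is deliberately coarse: it does not know that some months have fewer than 31 days.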

Build and Run the Scraper

Finally, build and run the scraper using the following commands:

```bash
go build main.go
./main --month=10 --day=10
```

You should receive a response similar to the screenshot below.

Take Your Web Scraping Skills to the Next Level

With Colly, you can unlock the full potential of web scraping and crawling. Explore more advanced techniques, such as simulating random delays between requests and configuring collectors to store visited URLs and cookies on Redis.
