Unleash the Power of Web Scraping with Colly
Web scraping is an essential technique for extracting valuable data from websites that lack a dedicated API. By leveraging Colly, a powerful Go package, you can build efficient web scrapers and crawlers to collect data from the internet.
Getting Started with Colly
To begin, you’ll need a system with Go installed (version 1.14 or higher). Create a new directory for your project and initialize a Go module using the following commands:
```bash
mkdir celeb-scraping
cd celeb-scraping
go mod init celeb-scraping
```
Next, install Colly as an external package:
```bash
go get -u github.com/gocolly/colly/v2
```
Understanding Colly’s Collector Component
At the heart of Colly lies the Collector component, responsible for making network calls and configurable to suit your needs. You can initialize a new Collector with custom options or use the default settings:
```go
c := colly.NewCollector(
	// Restrict requests to IMDB. Colly matches hostnames exactly,
	// so both the bare and www forms are listed.
	colly.AllowedDomains("imdb.com", "www.imdb.com"),
	colly.UserAgent("Mozilla/5.0"),
)
```
Collectors can also have callbacks attached, such as OnRequest and OnHTML, which are executed at different stages of the collection lifecycle.
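For instance, here is a minimal, self-contained sketch of attaching callbacks; the example.com URL is purely illustrative:

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// OnRequest fires just before each network call is made.
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	// OnHTML fires for every element matching the CSS selector.
	c.OnHTML("title", func(e *colly.HTMLElement) {
		fmt.Println("Page title:", e.Text)
	})

	// OnError fires if a request fails.
	c.OnError(func(_ *colly.Response, err error) {
		fmt.Println("Request failed:", err)
	})

	c.Visit("https://example.com/")
}
```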
Scraping Celebrity Data from IMDB
Let’s create a scraper that extracts celebrity data from IMDB. We’ll define two functions: `main` and `crawl`. The `main` function will call `crawl` to visit and extract the required information from the web page.
```go
func main() {
	month := flag.Int("month", 12, "Month of birth")
	day := flag.Int("day", 25, "Day of birth")
	flag.Parse()
	crawl(*month, *day)
}

func crawl(month int, day int) {
	// Both host forms are allowed, since Colly matches hostnames exactly.
	c := colly.NewCollector(
		colly.AllowedDomains("imdb.com", "www.imdb.com"),
	)
	// ...
}
```
Traversing HTML Pages with Colly
To extract the complete list of celebrities born on a specific date, we’ll recursively visit the next pages by attaching an OnHTML callback to the collector object.
```go
c.OnHTML("a.lister-page-next.next-page", func(e *colly.HTMLElement) {
	// The href attribute is relative, so resolve it against the current request URL.
	nextPageURL := e.Request.AbsoluteURL(e.Attr("href"))
	c.Visit(nextPageURL)
})
```
Marshaling HTML to Go Structs
Let’s define the `movie` and `star` structs to hold each celebrity’s data. We’ll extract the bio-data from the profile container and loop through the top movies featured on the page, as sketched after the struct definitions below.
```go
type movie struct {
	Title string `json:"title"`
	// ...
}

type star struct {
	Name   string  `json:"name"`
	Bio    string  `json:"bio"`
	Movies []movie `json:"movies"`
	// ...
}
```
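Filling those structs in an `OnHTML` callback might look like the following sketch, assuming `encoding/json`, `fmt`, and `log` are imported. The CSS selectors here (`.profile`, `h1`, `.bio`, `.top-movies a`) are placeholders for illustration only; IMDB’s markup changes over time, so inspect the page and substitute the real selectors.

```go
// Hypothetical selectors -- inspect IMDB's current markup before using.
c.OnHTML(".profile", func(e *colly.HTMLElement) {
	s := star{
		Name: e.ChildText("h1"),
		Bio:  e.ChildText(".bio"),
	}

	// Loop through the featured movies and collect their titles.
	e.ForEach(".top-movies a", func(_ int, el *colly.HTMLElement) {
		s.Movies = append(s.Movies, movie{Title: el.Text})
	})

	// Marshal the result to JSON and print it.
	out, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
})
```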
Receiving CLI Arguments using Flags
Recall the `main` function from earlier: to make the scraper dynamic, it uses Go’s flag package so you can pass any day and month as command-line arguments.
```go
func main() {
	month := flag.Int("month", 12, "Month of birth")
	day := flag.Int("day", 25, "Day of birth")
	flag.Parse()
	crawl(*month, *day)
}
```
Build and Run the Scraper
Finally, build and run the scraper using the following commands:
```bash
go build main.go
./main --month=10 --day=10
```
The scraper should print the celebrity data for the given date to your terminal.
Take Your Web Scraping Skills to the Next Level
With Colly, you can unlock the full potential of web scraping and crawling. Explore more advanced techniques, such as simulating random delays between requests and configuring collectors to store visited URLs and cookies on Redis.
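As a taste, here is a minimal sketch of randomized delays using Colly’s built-in `LimitRule`, assuming the collector `c` from earlier and imports of `time` and `log`; the glob pattern and durations are illustrative:

```go
// Throttle requests to matching domains; the values here are illustrative.
err := c.Limit(&colly.LimitRule{
	DomainGlob:  "*imdb.*",
	Parallelism: 1,               // one request at a time
	Delay:       1 * time.Second, // fixed delay between requests
	RandomDelay: 2 * time.Second, // plus up to 2s of extra random delay
})
if err != nil {
	log.Fatal(err)
}
```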
Resources
- Colly’s godoc page: https://godoc.org/github.com/gocolly/colly/v2
- GoQuery: https://godoc.org/github.com/PuerkitoBio/goquery