Unlocking the Power of Web Scraping with Go
When building applications, you often need to extract data from websites or other sources to integrate with your app. While some websites provide APIs for easy data access, others don’t. That’s where web scraping comes in – a technique for extracting data from websites and presenting it in a readable format. In this tutorial, we’ll explore Colly, a Go package that enables you to build web scrapers, and create a basic web scraper that extracts product information from an ecommerce store and saves it to a JSON file.
Introducing Colly
Colly is a Go framework designed for building web scrapers, crawlers, or spiders. With Colly, you can easily extract structured data from websites, which can be used for various applications like data mining, data processing, or archiving. Some of Colly’s key features include:
- Speed: Colly can handle over 1,000 requests per second on a single core
- Support for synchronous, asynchronous, and parallel scraping
- Caching and robots.txt support
Getting Started with Colly
To follow along with this tutorial, you’ll need to have Go installed on your local machine and a basic understanding of Go programming. If you’re new to Go, start by installing it and verifying that you can run Go commands in your terminal.
Building a Web Scraper with Colly
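Let’s dive into the code! Inside a Go module (created with go mod init), add Colly with go get github.com/gocolly/colly, then create a file called main.go and add the following code: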
```go
package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly"
)

func main() {
	// Create a collector with Colly's default settings.
	c := colly.NewCollector()
	//…
}
```
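This code imports the necessary packages, including Colly, and creates a new Colly collector. It will not compile on its own yet: Go rejects unused imports and variables, and fmt, time, and c are only put to use in the next step.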
Configuring Colly
Next, we’ll configure Colly to make requests and handle responses. Modify your main.go file to include the following code:
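```go
package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Give slow pages up to two minutes before the request fails.
	c.SetRequestTimeout(120 * time.Second)

	// Runs before each request is sent.
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Runs after a response has been received.
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Got a response from", r.Request.URL.String())
	})

	// Runs when a request fails; note that the callback receives a *colly.Response.
	c.OnError(func(r *colly.Response, err error) {
		fmt.Println("Error:", err)
	})
	//…
}
```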
This code sets the request timeout to 120 seconds and defines three callback functions: OnRequest, OnResponse, and OnError. These functions are triggered when Colly makes a request, receives a response, or encounters an error, respectively.
Analyzing the Website
Before we proceed, let’s take a closer look at the website we’re scraping. Open the website in your browser and inspect the DOM structure using the developer tools. We’ll focus on the product cards, which contain the product name, price, and discount information.
Extracting Product Information
Now, let’s define the structure of a single product and modify our main.go file to extract the product information:
```go
type Product struct {
	Name     string
	Image    string
	Price    string
	URL      string
	Discount string
}

func main() {
	//…
	// Holds every product found by the OnHTML callback below.
	var products []Product

	// Runs once for each element that matches the CSS selector "a.core".
	c.OnHTML("a.core", func(e *colly.HTMLElement) {
		product := Product{}
		product.Name = e.ChildText("div.name")
		product.Image = e.ChildAttr("img", "data-src")
		product.Price = e.ChildAttr("div.prc", "data-price")
		product.URL = e.Request.AbsoluteURL(e.Attr("href"))
		product.Discount = e.ChildText("div.tag_dsct")
		products = append(products, product)
	})
	//…
}
```
This code defines a Product struct and uses Colly’s OnHTML callback to extract the product information from each product card that matches the a.core selector. Each product is appended to the products slice, which is converted to JSON once the scraping job is complete.
Running the Web Scraper
Finally, let’s run our program! Execute the command go run main.go in your terminal, and you should see a new file called products.json created with the scrape results.
Conclusion
In this tutorial, we’ve successfully built a web scraper with Go using Colly. We’ve extracted product information from an ecommerce store and saved it to a JSON file. With Colly, you can easily build web scrapers for various applications, and we hope this tutorial has provided a solid foundation for your future projects.