Football Fixture Scraper Concept
In today’s data-driven world, information is often scattered across the web. Sometimes, we need to gather data from websites for various purposes, such as research, analysis, or automation. Go, a powerful programming language, provides us with the tools to perform web scraping and data extraction efficiently.
In this blog post, we’ll explore a Go program that demonstrates web scraping and data extraction. This program is designed to extract fixture information for a sports league from a website. Specifically, we aim to find the date of the matches and the teams that are playing. Let’s dive into the code to understand how it accomplishes this task.
The Code
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"

	"github.com/PuerkitoBio/goquery"
	"golang.org/x/net/html"
)

type fixtureResult struct {
	teams     []string
	matchDate []string
}

func main() {
	// Open and parse the HTML file
	file, err := os.Open("./2023-ligue1-fixtures.html")
	if err != nil {
		log.Fatalf("Error opening HTML file: %v", err)
	}
	defer file.Close()

	doc, err := html.Parse(file)
	if err != nil {
		log.Fatalf("Error parsing HTML: %v", err)
	}

	// Find and print contents of li.MatchCardsList_matchCard__DBsrE
	findAndPrintMatchCardContents(doc)
}
func findAndPrintMatchCardContents(node *html.Node) {
	if node.Type == html.ElementNode && node.Data == "li" {
		for _, attr := range node.Attr {
			if attr.Key == "class" && attr.Val == "MatchCardsList_matchCard__DBsrE" {
				// The first child is expected to be the anchor linking to the match page.
				// Skip an empty li, which would otherwise cause a nil pointer dereference.
				anchor := node.FirstChild
				if anchor == nil {
					continue
				}
				for _, anchorAttr := range anchor.Attr {
					if anchorAttr.Key == "href" {
						// Extract fixture information
						getFixtureInfo(anchorAttr.Val)
					}
				}
			}
		}
	}

	// Recursively process child nodes
	for child := node.FirstChild; child != nil; child = child.NextSibling {
		findAndPrintMatchCardContents(child)
	}
}
func getFixtureInfo(fixturePath string) {
	// Build the full URL
	link := fmt.Sprintf(`https://onefootball.com/%s`, fixturePath)

	resp, err := http.Get(link)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Create a goquery document from the response body
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	var date []string
	var teams []string

	// Find and extract the match date
	doc.Find(".MatchScore_numeric__ke8YT").Each(func(i int, s *goquery.Selection) {
		date = append(date, s.Text())
	})

	// Find and extract the team names
	doc.Find(".MatchScoreTeam_name__zzQrD").Each(func(i int, s *goquery.Selection) {
		teams = append(teams, s.Text())
	})

	// Create a fixtureResult struct to store the extracted data
	fixture := fixtureResult{
		teams:     teams,
		matchDate: date,
	}

	// Print the extracted fixture information
	fmt.Printf("%+v\n", fixture)
}
How It Works
Now, let’s break down how this Go program accomplishes its goal:
- Opening and Parsing HTML: The program starts by opening an HTML file named 2023-ligue1-fixtures.html. It uses the os and golang.org/x/net/html packages to do this. The HTML content is parsed into a tree of nodes that the program can traverse.
- Searching for Match Cards: The findAndPrintMatchCardContents function walks the tree recursively, searching for li elements with the class name MatchCardsList_matchCard__DBsrE. When a matching element is found, the program extracts the href attribute from the anchor element within that li. This href attribute contains a path to another web page with detailed fixture information.
- Fetching Detailed Fixture Information: The getFixtureInfo function constructs the full URL for the detailed fixture information, sends an HTTP GET request to that URL, and creates a goquery document from the response body. Using goquery simplifies HTML parsing and querying.
- Extracting Data: The program searches for HTML elements with the class names MatchScore_numeric__ke8YT (which contains match dates) and MatchScoreTeam_name__zzQrD (which contains team names). It extracts this data and stores it in slices.
- Printing Results: Finally, the program creates a fixtureResult struct to store the extracted data and prints it with the %+v verb, which includes the field names in the output.
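Two of the steps above can be made a little more forgiving. A class attribute may hold several space-separated class names, so an exact string comparison can miss a card that carries an extra class, and an href may or may not begin with a slash, which affects how it joins onto the site's base URL. The sketch below uses only the standard library. Both helpers (hasClass and fixtureURL) are hypothetical and not part of the original program, and the path /en/match/123 is made up for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// hasClass reports whether a class attribute value contains the given
// class name as one of its space-separated entries. (Hypothetical helper.)
func hasClass(classAttr, name string) bool {
	for _, c := range strings.Fields(classAttr) {
		if c == name {
			return true
		}
	}
	return false
}

// fixtureURL resolves href against base, so the result is the same whether
// or not href starts with a slash. (Hypothetical helper.)
func fixtureURL(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parsing base URL: %w", err)
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", fmt.Errorf("parsing href: %w", err)
	}
	return b.ResolveReference(ref).String(), nil
}

func main() {
	const card = "MatchCardsList_matchCard__DBsrE"

	// The second class name here is invented to show the multi-class case.
	fmt.Println(hasClass(card+" some-other-class", card))
	fmt.Println(hasClass("unrelated", card))

	// Both spellings of the (made-up) path resolve to the same URL.
	for _, href := range []string{"/en/match/123", "en/match/123"} {
		link, err := fixtureURL("https://onefootball.com/", href)
		if err != nil {
			fmt.Println("bad href:", err)
			continue
		}
		fmt.Println(link)
	}
}
```

In the scraper, hasClass could replace the attr.Val comparison in findAndPrintMatchCardContents, and fixtureURL could replace the fmt.Sprintf call in getFixtureInfo.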
Ways to improve
Concurrency is a powerful concept that is commonly used in Go. Currently, my code retrieves fixture data serially, which can be slow when there are many fixtures to fetch. One way I can expand on this script is to retrieve fixture data using a group of workers running concurrently.
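As a rough sketch of that idea, here is one way a worker pool could look using goroutines, a channel, and sync.WaitGroup from the standard library. Everything named here is an assumption for illustration: fetchAll, fetchFunc, and result do not exist in the program above, and the fetch function in main is a stand-in that returns a string instead of making a real request. In the scraper, a version of getFixtureInfo that returns its data rather than printing it would take its place.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc is a hypothetical stand-in for a function like getFixtureInfo
// that returns the fixture data instead of printing it.
type fetchFunc func(path string) (string, error)

// result pairs a fixture path with what was fetched for it.
type result struct {
	path string
	info string
	err  error
}

// fetchAll runs fetch over paths using a fixed number of worker goroutines.
// Results come back in the same order as paths.
func fetchAll(paths []string, workers int, fetch fetchFunc) []result {
	if workers < 1 {
		workers = 1
	}

	results := make([]result, len(paths))
	jobs := make(chan int) // carries indexes into paths

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				info, err := fetch(paths[i])
				// Each index is handled by exactly one worker, so these
				// writes never touch the same slice element twice.
				results[i] = result{path: paths[i], info: info, err: err}
			}
		}()
	}

	for i := range paths {
		jobs <- i
	}
	close(jobs) // lets the workers' range loops finish
	wg.Wait()

	return results
}

func main() {
	// Made-up paths and a fake fetch function, for demonstration only.
	paths := []string{"match/1", "match/2", "match/3", "match/4"}
	fake := func(path string) (string, error) {
		return "fixture data for " + path, nil
	}

	for _, r := range fetchAll(paths, 2, fake) {
		if r.err != nil {
			fmt.Println(r.path, "failed:", r.err)
			continue
		}
		fmt.Println(r.path, "->", r.info)
	}
}
```

Capping the number of workers keeps the number of simultaneous requests to the site bounded, which is gentler on the server than starting one goroutine per fixture.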
Conclusion
This Go program showcases how to perform web scraping and data extraction efficiently. By leveraging Go packages like github.com/PuerkitoBio/goquery, we can parse HTML, navigate complex structures, and extract valuable information from websites. Web scraping is a versatile technique with applications in various domains, from data analysis to automation. I might end up using this data to create an iOS app or something, who knows?! See ya next time!