🔧 Go Web Scraper: Build and Optimize HTML Parsers


News section: 🔧 Programming
🔗 Source: dev.to

Leapcell: The Next-Gen Serverless Platform for Web Hosting

Installation and Usage of Goquery

Installation

Execute:

go get github.com/PuerkitoBio/goquery

Import

import "github.com/PuerkitoBio/goquery"

Load the Page

Take the IMDb Popular Movies page as an example:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    res, err := http.Get("https://www.imdb.com/chart/moviemeter/")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()
    if res.StatusCode != 200 {
        log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
    }

Get the Document Object

    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        log.Fatal(err)
    }
    // Other ways to create a document:
    // doc, err := goquery.NewDocumentFromReader(strings.NewReader("<p>Example content</p>"))
    // doc, err := goquery.NewDocument(url) // deprecated: fetch with net/http, then use NewDocumentFromReader

    // Print the page title to confirm that the document was parsed.
    fmt.Println(doc.Find("title").Text())
}

Select Elements

Element Selector

Select elements by tag name. For example, dom.Find("p") matches all p tags. Find returns a new selection, so calls can be chained:

ele.Find("h2").Find("a")

Attribute Selector

Filter elements by element attributes and values, with multiple matching methods:

Find("div[my]")        // Filter div elements with the my attribute
Find("div[my=zh]")     // Filter div elements whose my attribute is zh
Find("div[my!=zh]")    // Filter div elements whose my attribute is not zh (including divs without the attribute)
Find("div[my|=zh]")    // Filter div elements whose my attribute is zh or starts with zh-
Find("div[my*=zh]")    // Filter div elements whose my attribute contains the string zh
Find("div[my~=zh]")    // Filter div elements whose my attribute contains the word zh
Find("div[my$=zh]")    // Filter div elements whose my attribute ends with zh
Find("div[my^=zh]")    // Filter div elements whose my attribute starts with zh

parent > child Selector

Select only the direct children of an element. For example, dom.Find("div>p") selects the p tags that are direct children of a div tag; p tags nested deeper are not matched.

element + next Adjacent Selector

Use it when the target elements have no distinguishing pattern of their own but the element before them does. For example, dom.Find("p[my=a]+p") selects the p tag that immediately follows a p tag whose my attribute is a.

element~next Sibling Selector

Select all following siblings under the same parent element, whether or not they are adjacent. For example, dom.Find("p[my=a]~p") selects every p tag that comes after a p tag whose my attribute is a and shares its parent.

ID Selector

An ID selector starts with # and matches the element with that id. For example, dom.Find("#title") matches the element with id=title; to also require a specific tag, use dom.Find("p#title").

ele.Find("#title")

Class Selector

A class selector starts with . and selects the elements that carry the given class name, for example dom.Find(".content1"); to also require a specific tag, use dom.Find("div.content1").

ele.Find(".title")

Selector OR (,) Operation

Combine multiple selectors by separating them with commas. An element is selected if it matches any one of them. For example, Find("div,span").

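package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    html := `<body>
                <div lang="zh">DIV1</div>
                <span>
                    <div>DIV5</div>
                </span>
            </body>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    dom.Find("div,span").Each(func(i int, selection *goquery.Selection) {
        inner, err := selection.Html() // Html returns (string, error)
        if err != nil {
            log.Fatalln(err)
        }
        fmt.Println(inner)
    })
}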

Filters

:contains Filter

Filter elements that contain the specified text. For example, dom.Find("p:contains(a)") filters the p tags that contain a.

dom.Find("div:contains(DIV2)").Each(func(i int, selection *goquery.Selection) {
    fmt.Println(selection.Text())
})

:has(selector)

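Filter elements that have at least one descendant matching the given selector. For example, dom.Find("div:has(p)") selects the div tags that contain a p tag.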

:empty

Filter elements that have no children at all, neither child elements nor text.

:first-child and :first-of-type Filters

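Find("p:first-child") selects p tags that are the first child of their parent; Find("p:first-of-type") selects p tags that are the first p among their siblings, even if elements of other types come before them.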

:last-child and :last-of-type Filters

The counterparts of :first-child and :first-of-type: they match the last child of the parent and the last element of its type among its siblings.

:nth-child(n) and :nth-of-type(n) Filters

:nth-child(n) selects elements that are the nth child of their parent; :nth-of-type(n) selects elements that are the nth among siblings of the same type. Counting starts at 1.

:nth-last-child(n) and :nth-last-of-type(n) Filters

These count from the end, so n = 1 refers to the last element.

:only-child and :only-of-type Filters

Find(":only-child") selects elements that are the sole child element of their parent; Find(":only-of-type") selects elements that have no siblings of the same type.

Get Content

ele.Html()
ele.Text()

Traversal

Use the Each method to traverse the selected elements:

ele.Find(".item").Each(func(index int, elA *goquery.Selection) {
    href, _ := elA.Attr("href")
    fmt.Println(href)
})

Built-in Functions

Array Positioning Functions

Eq(index int) *Selection
First() *Selection
Get(index int) *html.Node
Index...() int
Last() *Selection
Slice(start, end int) *Selection

Extended Functions

Add...()
AndSelf()
Union()

Filtering Functions

End()
Filter...()
Has...()
Intersection()
Not...()

Loop Traversal Functions

Each(f func(int, *Selection)) *Selection
EachWithBreak(f func(int, *Selection) bool) *Selection
Map(f func(int, *Selection) string) (result []string)

Document Modification Functions

After...()
Append...()
Before...()
Clone()
Empty()
Prepend...()
Remove...()
ReplaceWith...()
Unwrap()
Wrap...()
WrapAll...()
WrapInner...()

Attribute Manipulation Functions

Attr*(), RemoveAttr(), SetAttr()
AttrOr(e string, d string)
AddClass(), HasClass(), RemoveClass(), ToggleClass()
Html()
Length()
Size()
Text()

Node Search Functions

Contains()
Is...()

Document Tree Traversal Functions

Children...()
Contents()
Find...()
Next...() *Selection
NextAll() *Selection
Parent[s]...()
Prev...() *Selection
Siblings...()

Type Definitions

Document
Selection
Matcher

Helper Functions

NodeName
OuterHtml

Examples

Getting Started Example

package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    html := `<html>
            <body>
                <h1 id="title">O Captain! My Captain!</h1>
                <p class="content1">
                O Captain! my Captain! our fearful trip is done,
                The ship has weather’d every rack, the prize we sought is won,
                The port is near, the bells I hear, the people all exulting,
                While follow eyes the steady keel, the vessel grim and daring;
                </p>
            </body>
            </html>`
    dom, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatalln(err)
    }
    // Print the text of every p tag in the document.
    dom.Find("p").Each(func(i int, selection *goquery.Selection) {
        fmt.Println(selection.Text())
    })
}

Example of Crawling IMDb Popular Movie Information

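package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // goquery.NewDocument(url) is deprecated; fetch the page with net/http instead.
    res, err := http.Get("https://www.imdb.com/chart/moviemeter/")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()
    if res.StatusCode != http.StatusOK {
        log.Fatalf("status code error: %s", res.Status)
    }
    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        log.Fatal(err)
    }
    // The selector assumes each title link sits inside an element with class titleColumn.
    doc.Find(".titleColumn a").Each(func(i int, selection *goquery.Selection) {
        title := selection.Text()
        href, _ := selection.Attr("href")
        fmt.Printf("Movie Name: %s, Link: https://www.imdb.com%s\n", title, href)
    })
}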

The example above extracts the movie names and links from the IMDb popular movies page. In actual use, adjust the selectors and processing logic to fit the page you are scraping.

Leapcell: The Next-Gen Serverless Platform for Web Hosting

Finally, I would like to recommend the best platform for deploying Go services: Leapcell

1. Multi-Language Support

  • Develop with JavaScript, Python, Go, or Rust.

2. Deploy unlimited projects for free

  • Pay only for usage: no requests, no charges.

3. Unbeatable Cost Efficiency

  • Pay-as-you-go with no idle charges.
  • Example: $25 supports 6.94M requests at a 60ms average response time.

4. Streamlined Developer Experience

  • Intuitive UI for effortless setup.
  • Fully automated CI/CD pipelines and GitOps integration.
  • Real-time metrics and logging for actionable insights.

5. Effortless Scalability and High Performance

  • Auto-scaling to handle high concurrency with ease.
  • Zero operational overhead — just focus on building.

Explore more in the documentation!

Leapcell Twitter: https://x.com/LeapcellHQ
