package html

v1.2.0 | Published: Feb 7, 2026 | License: MIT | Imports: 17 | Imported by: 0

README

HTML Library

A Go library for intelligent HTML content extraction. It is compatible with golang.org/x/net/html: use it as a drop-in replacement and gain enhanced content extraction features on top.

📖 中文文档 (Chinese documentation) - User guide

✨ Core Features

🎯 Content Extraction
  • Article Detection: Identifies main content using scoring algorithms (text density, link density, semantic tags)
  • Smart Text Extraction: Preserves structure, handles newlines, calculates word count and reading time
  • Media Extraction: Images, videos, audio with metadata (URL, dimensions, alt text, type detection)
  • Link Analysis: External/internal detection, nofollow attributes, anchor text extraction
⚡ Performance
  • Content-Addressable Caching: SHA256-based keys with TTL and LRU eviction
  • Batch Processing: Parallel extraction with configurable worker pools
  • Thread-Safe: Concurrent use without external synchronization
  • Resource Limits: Configurable input size, nesting depth, and timeout protection
📖 Use Cases
  • News Aggregators: Extract article content from news sites
  • Web Scrapers: Get structured data from HTML pages
  • Content Management: Convert HTML to Markdown or other formats
  • Search Engines: Index main content without navigation/ads
  • Data Analysis: Extract and analyze web content at scale
  • RSS/Feed Generators: Create feeds from HTML content
  • Documentation Tools: Convert HTML docs to other formats

📦 Installation

go get github.com/cybergodev/html

⚡ 5-Minute Quick Start

import "github.com/cybergodev/html"

// Extract clean text from HTML
htmlBytes := []byte(`
    <html>
        <nav>Navigation</nav>
        <article><h1>Hello World</h1><p>Content here...</p></article>
        <footer>Footer</footer>
    </html>
`)
text, _ := html.ExtractText(htmlBytes)
fmt.Println(text) // "Hello World\nContent here..."

That's it! The library automatically:

  • Removes navigation, footers, ads
  • Extracts main content
  • Cleans up whitespace

🚀 Quick Guide

One-Liner Functions

Just want to get something done? Use these package-level functions:

// Extract text only
text, _ := html.ExtractText(htmlBytes)

// Extract everything
result, _ := html.Extract(htmlBytes)
fmt.Println(result.Title)     // Hello World
fmt.Println(result.Text)      // Content here...
fmt.Println(result.WordCount) // 4

// Extract all resource links
links, _ := html.ExtractAllLinks(htmlBytes)

// Convert formats
markdown, _ := html.ExtractToMarkdown(htmlBytes)
jsonData, _ := html.ExtractToJSON(htmlBytes)

When to use: Simple scripts, one-off tasks, quick prototyping


Basic Processor Usage

Need more control? Create a processor:

// Create processor with default configuration
processor, err := html.New()
if err != nil {
    log.Fatal(err)
}
defer processor.Close()

// Extract with defaults
result, _ := processor.Extract(htmlBytes, html.DefaultExtractConfig())

// Extract from file
result, _ = processor.ExtractFromFile("page.html", html.DefaultExtractConfig())

// Batch processing
htmlContents := [][]byte{html1, html2, html3}
results, _ := processor.ExtractBatch(htmlContents, html.DefaultExtractConfig())

When to use: Multiple extractions, processing many files, web scrapers


Custom Configuration

Fine-tune what gets extracted:

config := html.ExtractConfig{
    ExtractArticle:    true,       // Auto-detect main content
    PreserveImages:    true,       // Extract image metadata
    PreserveLinks:     true,       // Extract link metadata
    PreserveVideos:    false,      // Skip videos
    PreserveAudios:    false,      // Skip audio
    InlineImageFormat: "none",     // Options: "none", "placeholder", "markdown", "html"
    TableFormat:       "markdown", // Options: "markdown", "html"
    Encoding:          "",         // Auto-detect from meta tags, or specify: "utf-8", "windows-1252", etc.
}

processor, err := html.New()
if err != nil {
    log.Fatal(err)
}
defer processor.Close()

result, _ := processor.Extract(htmlBytes, config)

When to use: Specific extraction needs, format conversion, custom output


Advanced Features

Custom Processor Configuration

config := html.Config{
    MaxInputSize:       10 * 1024 * 1024, // 10MB limit
    ProcessingTimeout:  30 * time.Second,
    MaxCacheEntries:    500,
    CacheTTL:           30 * time.Minute,
    WorkerPoolSize:     8,
    EnableSanitization: true,  // Remove <script>, <style> tags
    MaxDepth:           50,    // Prevent deep nesting attacks
}

processor, _ := html.New(config)
defer processor.Close()

Link Extraction

// Extract all resource links
links, _ := html.ExtractAllLinks(htmlBytes)

// Group by type
byType := html.GroupLinksByType(links)
cssLinks := byType["css"]
jsLinks := byType["js"]
images := byType["image"]

// Advanced configuration
processor, err := html.New()
if err != nil {
    log.Fatal(err)
}
linkConfig := html.LinkExtractionConfig{
    BaseURL:               "https://example.com",
    ResolveRelativeURLs:   true,
    IncludeImages:         true,
    IncludeVideos:         true,
    IncludeAudios:         true,
    IncludeCSS:            true,
    IncludeJS:             true,
    IncludeContentLinks:   true,
    IncludeExternalLinks:  true,
    IncludeIcons:          true,
}
links, _ = processor.ExtractAllLinks(htmlBytes, linkConfig)

Caching & Statistics

processor, err := html.New()
if err != nil {
    log.Fatal(err)
}
defer processor.Close()

// Automatic caching enabled
result1, _ := processor.Extract(htmlBytes, html.DefaultExtractConfig())
result2, _ := processor.Extract(htmlBytes, html.DefaultExtractConfig()) // Cache hit!

// Check performance
stats := processor.GetStatistics()
fmt.Printf("Cache hits: %d/%d\n", stats.CacheHits, stats.TotalProcessed)

// Clear cache (preserves statistics)
processor.ClearCache()

// Reset statistics (preserves cache entries)
processor.ResetStatistics()

When to use: Production applications, performance optimization, specific use cases


📖 Common Recipes

Copy-paste solutions for common tasks:

Extract Article Text (Clean)

text, _ := html.ExtractText(htmlBytes)
// Returns clean text without navigation/ads

Extract with Images

result, _ := html.Extract(htmlBytes)
for _, img := range result.Images {
    fmt.Printf("Image: %s (alt: %s)\n", img.URL, img.Alt)
}

Convert to Markdown

markdown, _ := html.ExtractToMarkdown(htmlBytes)
// Images become: ![alt](url)

Extract All Links

links, _ := html.ExtractAllLinks(htmlBytes)
for _, link := range links {
    fmt.Printf("%s: %s\n", link.Type, link.URL)
}

Get Reading Time

result, _ := html.Extract(htmlBytes)
minutes := result.ReadingTime.Minutes()
fmt.Printf("Reading time: %.1f min", minutes)

Batch Process Files
processor, err := html.New()
if err != nil {
    log.Fatal(err)
}
defer processor.Close()

files := []string{"page1.html", "page2.html", "page3.html"}
results, _ := processor.ExtractBatchFiles(files, html.DefaultExtractConfig())

🔧 API Quick Reference

Package-Level Functions

// Extraction
html.Extract(htmlBytes []byte, configs ...ExtractConfig) (*Result, error)
html.ExtractText(htmlBytes []byte) (string, error)
html.ExtractFromFile(filePath string, configs ...ExtractConfig) (*Result, error)

// Format Conversion
html.ExtractToMarkdown(htmlBytes []byte) (string, error)
html.ExtractToJSON(htmlBytes []byte) ([]byte, error)

// Links
html.ExtractAllLinks(htmlBytes []byte, configs ...LinkExtractionConfig) ([]LinkResource, error)
html.GroupLinksByType(links []LinkResource) map[string][]LinkResource

Processor Methods

// Creation
processor, err := html.New()
// or with custom config:
processor, err := html.New(config)
defer processor.Close()

// Extraction
processor.Extract(htmlBytes []byte, configs ...ExtractConfig) (*Result, error)
processor.ExtractFromFile(filePath string, configs ...ExtractConfig) (*Result, error)

// Batch
processor.ExtractBatch(contents [][]byte, configs ...ExtractConfig) ([]*Result, error)
processor.ExtractBatchFiles(paths []string, configs ...ExtractConfig) ([]*Result, error)

// Links
processor.ExtractAllLinks(htmlBytes []byte, configs ...LinkExtractionConfig) ([]LinkResource, error)

// Monitoring
processor.GetStatistics() Statistics
processor.ClearCache()
processor.ResetStatistics()

Configuration Functions

// Processor configuration
html.DefaultConfig() Config

// Extraction configuration
html.DefaultExtractConfig() ExtractConfig

// Link extraction configuration
html.DefaultLinkExtractionConfig() LinkExtractionConfig

Default values for DefaultConfig():

Config{
    MaxInputSize:       50 * 1024 * 1024, // 50MB
    MaxCacheEntries:    2000,
    CacheTTL:           1 * time.Hour,
    WorkerPoolSize:     4,
    EnableSanitization: true,
    MaxDepth:           500,
    ProcessingTimeout:  30 * time.Second,
}

Default values for DefaultExtractConfig():

ExtractConfig{
    ExtractArticle:    true,
    PreserveImages:    true,
    PreserveLinks:     true,
    PreserveVideos:    true,
    PreserveAudios:    true,
    InlineImageFormat: "none",
    TableFormat:       "markdown",
    Encoding:          "", // Auto-detect
}

Default values for DefaultLinkExtractionConfig():

LinkExtractionConfig{
    ResolveRelativeURLs:  true,  // Convert relative URLs to absolute
    BaseURL:              "",    // Base URL for resolution (empty = auto-detect)
    IncludeImages:        true,  // Extract image links
    IncludeVideos:        true,  // Extract video links
    IncludeAudios:        true,  // Extract audio links
    IncludeCSS:           true,  // Extract CSS links
    IncludeJS:            true,  // Extract JavaScript links
    IncludeContentLinks:  true,  // Extract content links
    IncludeExternalLinks: true,  // Extract external domain links
    IncludeIcons:         true,  // Extract favicon/icon links
}

Result Structure

type Result struct {
    Text           string        // Clean text content
    Title          string        // Page/article title
    Images         []ImageInfo   // Image metadata
    Links          []LinkInfo    // Link metadata
    Videos         []VideoInfo   // Video metadata
    Audios         []AudioInfo   // Audio metadata
    WordCount      int           // Total words
    ReadingTime    time.Duration // Estimated reading time (JSON: reading_time_ms in milliseconds)
    ProcessingTime time.Duration // Time taken (JSON: processing_time_ms in milliseconds)
}

type ImageInfo struct {
    URL          string  // Image URL
    Alt          string  // Alt text
    Title        string  // Title attribute
    Width        string  // Width attribute
    Height       string  // Height attribute
    IsDecorative bool    // True when the image has no alt text
    Position     int     // Position in document
}

type LinkInfo struct {
    URL        string  // Link URL
    Text       string  // Anchor text
    Title      string  // Title attribute
    IsExternal bool    // External domain
    IsNoFollow bool    // rel="nofollow"
}

type VideoInfo struct {
    URL      string  // Video URL
    Type     string  // MIME type or "embed"
    Poster   string  // Poster image URL
    Width    string  // Width attribute
    Height   string  // Height attribute
    Duration string  // Duration attribute
}

type AudioInfo struct {
    URL      string  // Audio URL
    Type     string  // MIME type
    Duration string  // Duration attribute
}

type LinkResource struct {
    URL   string  // Resource URL
    Title string  // Resource title
    Type  string  // Resource type: css, js, image, video, audio, icon, link
}

Statistics Structure

type Statistics struct {
    TotalProcessed     int64         // Total number of extractions performed
    CacheHits          int64         // Number of times cache was hit
    CacheMisses        int64         // Number of times cache was missed
    ErrorCount         int64         // Number of errors encountered
    AverageProcessTime time.Duration // Average processing time per extraction
}

🔒 Security Features

The library includes built-in security protections:

HTML Sanitization
  • Dangerous Tag Removal: <script>, <style>, <noscript>, <iframe>, <embed>, <object>, <form>, <input>, <button>
  • Event Handler Removal: All on* attributes (onclick, onerror, onload, etc.)
  • Dangerous Protocol Blocking: javascript:, vbscript:, data: (except safe media types)
  • XSS Prevention: Comprehensive sanitization to prevent cross-site scripting
Input Validation
  • Size Limits: Configurable MaxInputSize prevents memory exhaustion
  • Depth Limits: MaxDepth prevents stack overflow from deeply nested HTML
  • Timeout Protection: ProcessingTimeout prevents hanging on malformed input
  • Path Traversal Protection: ExtractFromFile validates file paths to prevent directory traversal attacks
Data URL Security

Only safe media type data URLs are allowed:

  • Allowed: data:image/*, data:font/*, data:application/pdf
  • Blocked: data:text/html, data:text/javascript, data:text/plain
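The allow/block decision above can be sketched as a simple media-type prefix check. This is a hypothetical helper, not the library's actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// isAllowedDataURL reports whether a data: URL carries one of the safe
// media types listed above. The prefix list mirrors the README's
// allow/block lists; the real library's check may differ in detail.
func isAllowedDataURL(url string) bool {
	lower := strings.ToLower(strings.TrimSpace(url))
	if !strings.HasPrefix(lower, "data:") {
		return false
	}
	for _, safe := range []string{"data:image/", "data:font/", "data:application/pdf"} {
		if strings.HasPrefix(lower, safe) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isAllowedDataURL("data:image/png;base64,iVBOR"))      // true
	fmt.Println(isAllowedDataURL("data:text/html,<script></script>")) // false
}
```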

Performance Benchmarks

Based on benchmark_test.go:

Operation         Performance                 Notes
Text Extraction   ~500ns per HTML document    Fast text extraction
Link Extraction   ~2μs per HTML document      With metadata extraction
Full Extraction   ~5μs per HTML document      With all features enabled
Cache Hit         ~100ns                      Near-instant for cached content

Caching Benefits:

  • SHA256-based keys: Content-addressable caching
  • TTL Support: Configurable cache expiration
  • LRU Eviction: Automatic cache management with doubly-linked list
  • Thread-Safe: Concurrent access without external locks
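Content-addressable keying works by hashing the input bytes, so byte-identical documents always map to the same cache entry. A minimal stdlib sketch (the library's internal key format is an assumption):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey derives a content-addressable key from the raw HTML bytes:
// identical input always produces the same key, so repeated extractions
// of the same document can be served from cache. The exact key format
// used internally by the library is not documented; this is illustrative.
func cacheKey(htmlBytes []byte) string {
	sum := sha256.Sum256(htmlBytes)
	return hex.EncodeToString(sum[:])
}

func main() {
	a := cacheKey([]byte("<p>hello</p>"))
	b := cacheKey([]byte("<p>hello</p>"))
	fmt.Println(a == b) // identical content, identical key
}
```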

See examples/ directory for complete, runnable code:

Example                    Description
01_quick_start.go          Quick start with one-liners
02_content_extraction.go   Content extraction basics
03_links_and_urls.go       Link extraction & URL resolution
04_media_extraction.go     Media file extraction
05_config_performance.go   Configuration & performance tuning
06_http_integration.go     HTTP integration patterns
07_advanced_usage.go       Advanced features & batch processing
08_output_formats.go       JSON and Markdown output formats
09_error_handling.go       Error handling patterns
10_real_world.go           Real-world use cases

Compatibility

This library is a drop-in replacement for golang.org/x/net/html:

// Just change the import
- import "golang.org/x/net/html"
+ import "github.com/cybergodev/html"

// All existing code works
doc, err := html.Parse(reader)
html.Render(writer, doc)
escaped := html.EscapeString("<script>")

The library re-exports all commonly used types, constants, and functions from golang.org/x/net/html:

  • Types: Node, NodeType, Token, Attribute, Tokenizer, ParseOption
  • Constants: All NodeType and TokenType constants
  • Functions: Parse, ParseFragment, ParseWithOptions, ParseFragmentWithOptions, Render, EscapeString, UnescapeString, NewTokenizer, NewTokenizerFragment, ParseOptionEnableScripting

Thread Safety

The Processor is safe for concurrent use:

processor, err := html.New()
if err != nil {
    log.Fatal(err)
}
defer processor.Close()

// Safe to use from multiple goroutines
var wg sync.WaitGroup
for i := 0; i < 100; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        processor.Extract(htmlBytes, html.DefaultExtractConfig())
    }()
}
wg.Wait()

🤝 Contributing

Contributions, issue reports, and suggestions are welcome!

📄 License

MIT License - See LICENSE file for details.


Crafted with care for the Go community ❤️ | If this project helps you, please give it a ⭐️ Star!

Documentation


Constants

const (
	ErrorNode    = stdxhtml.ErrorNode
	TextNode     = stdxhtml.TextNode
	DocumentNode = stdxhtml.DocumentNode
	ElementNode  = stdxhtml.ElementNode
	CommentNode  = stdxhtml.CommentNode
	DoctypeNode  = stdxhtml.DoctypeNode
	RawNode      = stdxhtml.RawNode
)
const (
	ErrorToken          = stdxhtml.ErrorToken
	TextToken           = stdxhtml.TextToken
	StartTagToken       = stdxhtml.StartTagToken
	EndTagToken         = stdxhtml.EndTagToken
	SelfClosingTagToken = stdxhtml.SelfClosingTagToken
	CommentToken        = stdxhtml.CommentToken
	DoctypeToken        = stdxhtml.DoctypeToken
)
const (
	DefaultMaxInputSize      = 50 * 1024 * 1024
	DefaultMaxCacheEntries   = 2000
	DefaultWorkerPoolSize    = 4
	DefaultCacheTTL          = time.Hour
	DefaultMaxDepth          = 500
	DefaultProcessingTimeout = 30 * time.Second
)

Variables

var (
	// ErrInputTooLarge is returned when input exceeds MaxInputSize.
	ErrInputTooLarge = errors.New("html: input size exceeds maximum")

	// ErrInvalidHTML is returned when HTML parsing fails.
	ErrInvalidHTML = errors.New("html: invalid HTML")

	// ErrProcessorClosed is returned when operations are attempted on a closed processor.
	ErrProcessorClosed = errors.New("html: processor closed")

	// ErrMaxDepthExceeded is returned when HTML nesting exceeds MaxDepth.
	ErrMaxDepthExceeded = errors.New("html: max depth exceeded")

	// ErrInvalidConfig is returned when configuration validation fails.
	ErrInvalidConfig = errors.New("html: invalid config")

	// ErrProcessingTimeout is returned when processing exceeds ProcessingTimeout.
	ErrProcessingTimeout = errors.New("html: processing timeout exceeded")

	// ErrFileNotFound is returned when specified file cannot be read.
	ErrFileNotFound = errors.New("html: file not found")

	// ErrInvalidFilePath is returned when file path validation fails.
	ErrInvalidFilePath = errors.New("html: invalid file path")
)

Error definitions for the `cybergodev/html` package.

var (
	ErrBufferExceeded    = stdxhtml.ErrBufferExceeded
	Parse                = stdxhtml.Parse
	ParseFragment        = stdxhtml.ParseFragment
	Render               = stdxhtml.Render
	EscapeString         = htmlstd.EscapeString
	UnescapeString       = htmlstd.UnescapeString
	NewTokenizer         = stdxhtml.NewTokenizer
	NewTokenizerFragment = stdxhtml.NewTokenizerFragment
)

Functions

func ExtractText added in v1.0.2

func ExtractText(htmlBytes []byte) (string, error)

ExtractText extracts plain text from HTML bytes with automatic encoding detection. The method automatically detects character encoding and converts to UTF-8.

Parameters:

htmlBytes - Raw HTML bytes (auto-detects encoding)

Returns:

string - Extracted plain text in UTF-8
error - Error if extraction fails

Example:

bytes, _ := os.ReadFile("document.html")
text, _ := html.ExtractText(bytes)

func ExtractToJSON added in v1.0.4

func ExtractToJSON(htmlBytes []byte) ([]byte, error)

func ExtractToMarkdown added in v1.0.4

func ExtractToMarkdown(htmlBytes []byte) (string, error)

ExtractToMarkdown converts HTML bytes to Markdown with automatic encoding detection. The method automatically detects character encoding (Windows-1252, UTF-8, GBK, Shift_JIS, etc.) from the HTML bytes and converts it to UTF-8 before processing.

Parameters:

htmlBytes - Raw HTML bytes (auto-detects encoding)

Returns:

string - Markdown content in UTF-8
error - Error if conversion fails

Example:

// HTTP response
resp, _ := http.Get(url)
bytes, _ := io.ReadAll(resp.Body)
markdown, _ := html.ExtractToMarkdown(bytes)

// File
bytes, _ := os.ReadFile("document.html")
markdown, _ := html.ExtractToMarkdown(bytes)

func GroupLinksByType added in v1.0.2

func GroupLinksByType(links []LinkResource) map[string][]LinkResource

Types

type Attribute

type Attribute = stdxhtml.Attribute

Type aliases for commonly used types from golang.org/x/net/html

type AudioInfo

type AudioInfo struct {
	URL      string `json:"url"`
	Type     string `json:"type"`
	Duration string `json:"duration"`
}

type Config

type Config struct {
	MaxInputSize       int
	MaxCacheEntries    int
	CacheTTL           time.Duration
	WorkerPoolSize     int
	EnableSanitization bool
	MaxDepth           int
	ProcessingTimeout  time.Duration
}

func DefaultConfig

func DefaultConfig() Config

type ExtractConfig

type ExtractConfig struct {
	ExtractArticle    bool
	PreserveImages    bool
	PreserveLinks     bool
	PreserveVideos    bool
	PreserveAudios    bool
	InlineImageFormat string
	TableFormat       string
	// Encoding specifies the character encoding of the input HTML.
	// If empty, the encoding will be auto-detected from meta tags or BOM.
	// Common values: "utf-8", "windows-1252", "iso-8859-1", "shift_jis", etc.
	Encoding string
}

func DefaultExtractConfig

func DefaultExtractConfig() ExtractConfig

type ImageInfo

type ImageInfo struct {
	URL          string `json:"url"`
	Alt          string `json:"alt"`
	Title        string `json:"title"`
	Width        string `json:"width"`
	Height       string `json:"height"`
	IsDecorative bool   `json:"is_decorative"`
	Position     int    `json:"position"`
}

type LinkExtractionConfig added in v1.0.2

type LinkExtractionConfig struct {
	ResolveRelativeURLs  bool
	BaseURL              string
	IncludeImages        bool
	IncludeVideos        bool
	IncludeAudios        bool
	IncludeCSS           bool
	IncludeJS            bool
	IncludeContentLinks  bool
	IncludeExternalLinks bool
	IncludeIcons         bool
}

func DefaultLinkExtractionConfig added in v1.0.2

func DefaultLinkExtractionConfig() LinkExtractionConfig

type LinkInfo

type LinkInfo struct {
	URL        string `json:"url"`
	Text       string `json:"text"`
	Title      string `json:"title"`
	IsExternal bool   `json:"is_external"`
	IsNoFollow bool   `json:"is_nofollow"`
}

type LinkResource added in v1.0.2

type LinkResource struct {
	URL   string
	Title string
	Type  string
}

func ExtractAllLinks(htmlBytes []byte, configs ...LinkExtractionConfig) ([]LinkResource, error)

ExtractAllLinks extracts all links from HTML bytes with automatic encoding detection. The method automatically detects character encoding and converts to UTF-8.

Parameters:

htmlBytes - Raw HTML bytes (auto-detects encoding)
configs - Optional link extraction configurations

Returns:

[]LinkResource - List of extracted links with UTF-8 encoded titles
error - Error if extraction fails

Example:

bytes, _ := os.ReadFile("document.html")
links, _ := html.ExtractAllLinks(bytes)

type Node

type Node = stdxhtml.Node

Type aliases for commonly used types from golang.org/x/net/html

type NodeType

type NodeType = stdxhtml.NodeType

Type aliases for commonly used types from golang.org/x/net/html

type ParseOption added in v1.1.0

type ParseOption = stdxhtml.ParseOption

Type aliases for commonly used types from golang.org/x/net/html

type Processor

type Processor struct {
	// contains filtered or unexported fields
}

func New

func New(configs ...Config) (*Processor, error)

New creates a new HTML processor with the given configuration. If no configuration is provided, it uses DefaultConfig().

The function signature uses variadic arguments to make the config optional:

processor, err := html.New()              // Uses DefaultConfig()
processor, err := html.New(config)        // Uses custom config

The returned processor must be closed when no longer needed:

processor, err := html.New()
defer processor.Close()

func (*Processor) ClearCache

func (p *Processor) ClearCache()

ClearCache clears the cache contents but preserves cumulative statistics. Use ResetStatistics to reset statistics counters.

func (*Processor) Close

func (p *Processor) Close() error

func (*Processor) Extract

func (p *Processor) Extract(htmlBytes []byte, configs ...ExtractConfig) (*Result, error)

Extract extracts content from HTML bytes with automatic encoding detection. This is the main extraction method that processes HTML bytes after detecting and converting their character encoding to UTF-8.

The method performs the following steps:

 1. Validates processor state (not closed)
 2. Resolves extraction configuration
 3. Checks input size limits
 4. Detects character encoding and converts to UTF-8
 5. Processes content with caching support
 6. Updates statistics and returns result

func (*Processor) ExtractAllLinks

func (p *Processor) ExtractAllLinks(htmlBytes []byte, configs ...LinkExtractionConfig) ([]LinkResource, error)

ExtractAllLinks extracts all links from HTML bytes with automatic encoding detection. The method automatically detects character encoding and converts to UTF-8 before extracting links, ensuring that link titles and text are properly decoded.

func (*Processor) ExtractBatch

func (p *Processor) ExtractBatch(htmlContents [][]byte, configs ...ExtractConfig) ([]*Result, error)

func (*Processor) ExtractBatchFiles

func (p *Processor) ExtractBatchFiles(filePaths []string, configs ...ExtractConfig) ([]*Result, error)

func (*Processor) ExtractFromFile

func (p *Processor) ExtractFromFile(filePath string, configs ...ExtractConfig) (*Result, error)

func (*Processor) GetStatistics

func (p *Processor) GetStatistics() Statistics

func (*Processor) ResetStatistics added in v1.2.0

func (p *Processor) ResetStatistics()

ResetStatistics resets all statistics counters to zero. This preserves cache entries while clearing the accumulated metrics.

type Result

type Result struct {
	Text           string        `json:"text"`
	Title          string        `json:"title"`
	Images         []ImageInfo   `json:"images,omitempty"`
	Links          []LinkInfo    `json:"links,omitempty"`
	Videos         []VideoInfo   `json:"videos,omitempty"`
	Audios         []AudioInfo   `json:"audios,omitempty"`
	ProcessingTime time.Duration `json:"processing_time_ms"`
	WordCount      int           `json:"word_count"`
	ReadingTime    time.Duration `json:"reading_time_ms"`
}

func Extract added in v1.0.2

func Extract(htmlBytes []byte, configs ...ExtractConfig) (*Result, error)

Extract extracts content from HTML bytes with automatic encoding detection. The method automatically detects the character encoding (Windows-1252, UTF-8, GBK, Shift_JIS, etc.) from the HTML bytes and converts it to UTF-8 before processing.

This is the primary method for HTML content extraction when the source encoding may not be UTF-8, such as content from HTTP responses, databases, or files.

Parameters:

htmlBytes - Raw HTML bytes (auto-detects encoding)
configs - Optional extraction configurations

Returns:

*Result - Extracted content with UTF-8 encoded text
error - Error if extraction fails

Example:

// HTTP response
resp, _ := http.Get(url)
bytes, _ := io.ReadAll(resp.Body)
result, _ := html.Extract(bytes)

// File
bytes, _ := os.ReadFile("document.html")
result, _ := html.Extract(bytes)

func ExtractFromFile added in v1.0.2

func ExtractFromFile(filePath string, configs ...ExtractConfig) (*Result, error)

ExtractFromFile extracts content from an HTML file with automatic encoding detection. Use this when you have a file path instead of raw bytes.

Parameters:

filePath - Path to the HTML file
configs - Optional extraction configurations

Returns:

*Result - Extracted content with UTF-8 encoded text
error - Error if file reading or extraction fails

Example:

result, _ := html.ExtractFromFile("document.html", html.ExtractConfig{
    InlineImageFormat: "markdown",
})

func (*Result) MarshalJSON added in v1.1.0

func (r *Result) MarshalJSON() ([]byte, error)

type Statistics

type Statistics struct {
	TotalProcessed     int64
	CacheHits          int64
	CacheMisses        int64
	ErrorCount         int64
	AverageProcessTime time.Duration
}

type Token

type Token = stdxhtml.Token

Type aliases for commonly used types from golang.org/x/net/html

type Tokenizer

type Tokenizer = stdxhtml.Tokenizer

Type aliases for commonly used types from golang.org/x/net/html

type VideoInfo

type VideoInfo struct {
	URL      string `json:"url"`
	Type     string `json:"type"`
	Poster   string `json:"poster"`
	Width    string `json:"width"`
	Height   string `json:"height"`
	Duration string `json:"duration"`
}

Directories

internal: Package internal provides caching functionality for content extraction results.
