gorpora

package module
v0.0.0-...-43cd961 Latest
Warning

This package is not in the latest version of its module.

Published: Oct 18, 2018 License: MIT Imports: 20 Imported by: 0

README

gorpora

Random set of utilities for text corpora processing, primarily for machine learning purposes.

Usage of gorpora:

gorpora command [arguments]

commands are:

fb2text

Parameters:

  • -i string directory with fb2 files, will be processed recursively
  • -l int number of \n's added after each text output block (paragraph) (default 1)
  • -t int number of threads for parallel processing of conversion jobs (default 1)

normalize.html.entities

Parameters:

  • -debug do nothing only print use cases
  • -max int maximum number of lines to process

strip.html

word.tokenizer

Parameters:

  • -debug do nothing only print use cases
  • -lemma output lemmas instead of words
  • -udpipe use Udpipe as tokenizer

sentence.tokenizer

Parameters:

  • -debug do nothing only print use cases
  • -max int maximum sentence length in chars (default 1000000)
  • -min int minimum sentence length in chars (default 10)
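The -min and -max bounds amount to a length filter applied to each sentence. The sketch below shows only that filter, under two assumptions that the README does not confirm: length is counted in Unicode code points rather than bytes, and both bounds are inclusive. withinLength is a hypothetical name, and the sentence splitting itself is not shown.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// withinLength reports whether the trimmed sentence has between min and
// max characters, counted as runes and with inclusive bounds (both are
// assumptions for this sketch).
func withinLength(s string, min, max int) bool {
	n := utf8.RuneCountInString(strings.TrimSpace(s))
	return n >= min && n <= max
}

func main() {
	sentences := []string{
		"Short.",
		"This sentence is long enough to keep.",
		"Привет, мир!",
	}
	for _, s := range sentences {
		// 10 and 1000000 mirror the documented defaults.
		fmt.Printf("%q keep=%v\n", s, withinLength(s, 10, 1000000))
	}
}
```

Counting runes instead of bytes matters for non-Latin corpora, where a single character may occupy several bytes in UTF-8.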

filter.language

Parameters:

  • -debug do nothing only print use cases
  • -lang value set of accepted languages

unique

Accepts text lines on stdin and writes them to stdout, filtering out non-unique lines.

Parameters:

  • -debug do nothing only print use cases

Documentation

Constants

This section is empty.

Variables

var PARSER *udpipe.Parser

Functions

func Collect

func Collect(min, max int, input, extension string, level int) (collectedLineCount, filteredLineCount int)

func FilterLanguage

func FilterLanguage(languages []string)

func GetMD5Hash

func GetMD5Hash(bytes []byte) string
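GetMD5Hash carries no doc comment. A function with this signature typically returns the hex-encoded MD5 digest of its input; the sketch below shows that pattern under that assumption, using the hypothetical name md5Hex rather than the package's own function.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// md5Hex returns the lowercase hex-encoded MD5 digest of data.
// Assumption: this mirrors what GetMD5Hash returns; the source
// documentation does not say so explicitly.
func md5Hex(data []byte) string {
	sum := md5.Sum(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(md5Hex([]byte("hello"))) // prints 5d41402abc4b2a76b9719d911017c592
}
```

MD5 is adequate as a deduplication key but should not be used where collision resistance against an adversary matters.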

func NormalizeHtmlEntities

func NormalizeHtmlEntities()

func Sentesize

func Sentesize(min, max int)

func Split

func Split(use_udpipe, output_lemmas bool)

func StripHtml

func StripHtml()

func Unique

func Unique(DEBUG bool)

Types

type CSS

type CSS string

CSS encapsulates known safe content that matches any of:

  1. The CSS3 stylesheet production, such as `p { color: purple }`.
  2. The CSS3 rule production, such as `a[href=~"https:"].foo#bar`.
  3. CSS3 declaration productions, such as `color: red; margin: 2px`.
  4. The CSS3 value production, such as `rgba(0, 0, 255, 127)`.

See http://www.w3.org/TR/css3-syntax/#parsing and https://web.archive.org/web/20090211114933/http://w3.org/TR/css3-syntax#style

Use of this type presents a security risk: the encapsulated content should come from a trusted source, as it will be included verbatim in the template output.

type HTML

type HTML string

HTML encapsulates a known safe HTML document fragment. It should not be used for HTML from a third-party, or HTML with unclosed tags or comments. The outputs of a sound HTML sanitizer and a template escaped by this package are fine for use with HTML.

Use of this type presents a security risk: the encapsulated content should come from a trusted source, as it will be included verbatim in the template output.

type HTMLAttr

type HTMLAttr string

HTMLAttr encapsulates an HTML attribute from a trusted source, for example, ` dir="ltr"`.

Use of this type presents a security risk: the encapsulated content should come from a trusted source, as it will be included verbatim in the template output.

type JS

type JS string

JS encapsulates a known safe EcmaScript5 Expression, for example, `(x + y * z())`. Template authors are responsible for ensuring that typed expressions do not break the intended precedence and that there is no statement/expression ambiguity as when passing an expression like "{ foo: bar() }\n['foo']()", which is both a valid Expression and a valid Program with a very different meaning.

Use of this type presents a security risk: the encapsulated content should come from a trusted source, as it will be included verbatim in the template output.

Using JS to include valid but untrusted JSON is not safe. A safe alternative is to parse the JSON with json.Unmarshal and then pass the resultant object into the template, where it will be converted to sanitized JSON when presented in a JavaScript context.

type JSStr

type JSStr string

JSStr encapsulates a sequence of characters meant to be embedded between quotes in a JavaScript expression. The string must match a series of StringCharacters:

StringCharacter :: SourceCharacter but not `\` or LineTerminator
                 | EscapeSequence

Note that LineContinuations are not allowed. JSStr("foo\\nbar") is fine, but JSStr("foo\\\nbar") is not.

Use of this type presents a security risk: the encapsulated content should come from a trusted source, as it will be included verbatim in the template output.

type URL

type URL string

URL encapsulates a known safe URL or URL substring (see RFC 3986). A URL like `javascript:checkThatFormNotEditedBeforeLeavingPage()` from a trusted source should go in the page, but by default dynamic `javascript:` URLs are filtered out since they are a frequently exploited injection vector.

Use of this type presents a security risk: the encapsulated content should come from a trusted source, as it will be included verbatim in the template output.

Directories

Path Synopsis
Package cld2 implements language detection using the Compact Language Detector.
