Vectorizers
These are pipeline components responsible for transforming input objects into their vectorized form. The input objects are domain-specific, but the resulting vectors are numerical representations suitable for machine learning. Think of vectorizers as signal generators that emphasize or suppress the aspects being learned. They come in the following forms:
Type | Description |
---|---|
encoders | transform inputs into vectorized forms |
reducers | reduce the dimensionality of other encodings |
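As a rough sketch of how these components slot into a workflow, the snippet below composes an encoder with a reducer in a standard scikit-learn Pipeline. The import path comes from the class reference further down; the input format (a list of item collections) & the fit/transform behaviour are assumed to follow scikit-learn conventions, and TruncatedSVD stands in for a reducer since none is documented in this section.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

from scikitlab.vectorizers.frequential import ItemCountVectorizer

# an encoder turns raw items into (sparse) count vectors,
# then a reducer compresses them into a lower-dimensional space
pipeline = Pipeline([
    ("encode", ItemCountVectorizer()),
    ("reduce", TruncatedSVD(n_components=2)),
])

X = [["red", "round"], ["yellow", "long"], ["red", "long", "round"]]
vectors = pipeline.fit_transform(X)   # dense array of shape (3, 2)
```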
scikitlab.vectorizers.encoder.EnumeratedEncoder
EnumeratedEncoder(handle_unknown='error')
Encodes a list of arrays by enumerating its set of dimensions. Similar to a label encoder but supports N-dimensional arrays. The value -1 is reserved for unknowns, & the component can act as an input (x), output (y) or combined (xy) encoder.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
handle_unknown | str | behaviour when unknown objects or classes are seen. | 'error' |
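A minimal usage sketch, assuming the encoder accepts 2-D arrays of categorical values & exposes the usual scikit-learn fit/transform methods:

```python
import numpy as np
from scikitlab.vectorizers.encoder import EnumeratedEncoder

# enumerate categorical values; handles N-d arrays, unlike a plain label encoder
encoder = EnumeratedEncoder()   # handle_unknown='error' by default
y = np.array([["cat", "small"], ["dog", "large"], ["cat", "large"]])

y_enc = encoder.fit_transform(y)   # integer codes per value, -1 reserved for unknowns
```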
scikitlab.vectorizers.frequential.ItemCountVectorizer
ItemCountVectorizer(
fn_norm=None,
min_freq=0.0,
max_freq=1.0,
max_items=None,
out_of_vocab=None,
binary=False,
**kwargs
)
A general purpose counting vectorizer for arbitrary collections of items throughout a dataset. This class was adapted from scikit-learn's CountVectorizer to generalize items beyond text documents. It also decouples internal item pre-processing into a user-definable normalization function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fn_norm | Callable | how to transform individual items per input. | None |
min_freq | float | filter rare items below this threshold. | 0.0 |
max_freq | float | filter frequent items above this threshold. | 1.0 |
max_items | int | keep only the top most frequent items in the corpus. | None |
out_of_vocab | str | feature name for catching unrecognized/filtered items. | None |
binary | bool | simply flag all non-zero count items. | False |
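A usage sketch under the assumption that each input sample is a collection of hashable items, that `fn_norm` is applied to each item, & that the component follows the scikit-learn fit/transform API:

```python
from scikitlab.vectorizers.frequential import ItemCountVectorizer

# count items per record, lower-casing them first and bucketing
# rare or unseen items into a single out-of-vocabulary feature
vectorizer = ItemCountVectorizer(
    fn_norm=str.lower,      # normalize each item before counting
    min_freq=0.05,          # drop items rarer than this corpus frequency
    out_of_vocab="<OOV>",   # catch-all feature for filtered/unseen items
)

X_train = [["SciFi", "Drama"], ["Drama"], ["Comedy", "Drama"]]
counts = vectorizer.fit_transform(X_train)   # sparse count matrix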
scikitlab.vectorizers.spatial.GeoVectorizer
GeoVectorizer(
resolution,
index_scheme="h3",
items=None,
offset=1,
**kwargs
)
Converts shapely latitude & longitude point coordinates to a geospatial indexing scheme. This is useful to quantize areas & model neighbouring or hierarchical spatial relationships between them. Some relationships are 1:1 but others are 1:many, so the resulting vectors denote occurrence counts over all areas known at train time. Any unrecognized area at inference time is ignored & vectorized as zero. Depending on spatial resolution & coverage, vectors can be high-dimensional & are encoded as sparse. Users may want to cap dimensionality by size or frequency, or apply other dimensionality-reduction techniques.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
resolution | int | Cell size of fetched areas. Range depends on scheme. | required |
index_scheme | str | Geo indexing scheme: | 'h3' |
items | Set[str] | Combination of items to fetch: | None |
offset | int | Number of neighbouring or hierarchical cells away from hits to fetch. | 1 |
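A sketch of vectorizing point coordinates, assuming inputs are shapely Point objects (here given as longitude/latitude) & that the transformer follows the usual fit/transform conventions:

```python
from shapely.geometry import Point
from scikitlab.vectorizers.spatial import GeoVectorizer

# quantize coordinates into H3 cells at a mid-range resolution and
# count the cells (plus 1 ring of neighbours) hit by each point
vectorizer = GeoVectorizer(resolution=7, index_scheme="h3", offset=1)

X_train = [Point(-73.56, 45.50), Point(-73.57, 45.51)]   # Montreal area
vectors = vectorizer.fit_transform(X_train)              # sparse occurrence counts
```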
scikitlab.vectorizers.temporal.DateTimeVectorizer
DateTimeVectorizer(weights=None, utc_norm=False, **kwargs)
Encodes a date-time into a collection of trigonometrically encoded attribute parts. This vectorizer allows emphasizing certain time attributes by weight as well as standardizing to a common timezone.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
weights | Dict[str, float] | set of (weighted) time attributes. Choose from: | None |
utc_norm | bool | converts to coordinated universal time. | False |
kwargs | | other parameters for base transformer. | {} |
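A usage sketch assuming the transformer accepts Python datetime objects; the attribute keys in `weights` are illustrative only, since the allowed names are not listed above.

```python
from datetime import datetime
from scikitlab.vectorizers.temporal import DateTimeVectorizer

# emphasize some time attributes over others
vectorizer = DateTimeVectorizer(
    weights={"hour": 2.0, "month": 1.0},   # hypothetical attribute keys
    utc_norm=True,                         # normalize all inputs to UTC first
)

X_train = [datetime(2023, 6, 1, 14, 30), datetime(2023, 12, 24, 8, 0)]
vectors = vectorizer.fit_transform(X_train)
```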
scikitlab.vectorizers.temporal.PeriodicityVectorizer
PeriodicityVectorizer(period, **kwargs)
Trigonometrically encodes a periodic signal as a combination of sine & cosine values. This is useful to fairly capture cyclical distances between the end & start of periods, such as for dates. This encoding also caps the number of dimensions at 2, acting as a tradeoff between reducing the curse of high dimensionality from one-hot-encoding signal values & the finer approximation of various radial-basis-functions.
Note that since both sine & cosine intercept the axis twice per period, both dimensions are required to precisely disambiguate where along the original signal the encoding lies.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
period | int | how many units before the cycle repeats. | required |
scikitlab.vectorizers.text.WeightedNgramVectorizer
WeightedNgramVectorizer(
vectorizer_type="tfidf",
weight_fn=None,
ngram_range=(1, 1),
n_jobs=None,
verbose=False,
**kwargs
)
Composite scikit-learn component that manages n-gram weighted frequency vectors such as tf-idf or count vectors. Since smaller n-grams are more frequent but less insightful than larger n-grams, this component allows fairly weighting & filtering n-grams proportionally to their token size throughout the corpus.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
vectorizer_type | str | Type of the vectorizer: "tfidf" or "count". | 'tfidf' |
weight_fn | Callable[[int], float] | function to weight n-grams by size. | None |
ngram_range | tuple | min/max n-gram sizes to build. | (1, 1) |
n_jobs | int | number of parallel processes. | None |
verbose | bool | verbosity level. | False |
kwargs | | other parameters for base vectorizer. | {} |
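A sketch that weights larger n-grams more heavily, assuming the component accepts raw text documents like scikit-learn's text vectorizers do:

```python
from scikitlab.vectorizers.text import WeightedNgramVectorizer

# build unigrams through trigrams, weighting each n-gram linearly by its size
vectorizer = WeightedNgramVectorizer(
    vectorizer_type="tfidf",
    ngram_range=(1, 3),
    weight_fn=lambda n: float(n),   # longer n-grams count proportionally more
)

docs = ["the quick brown fox", "the lazy dog", "quick brown dogs"]
vectors = vectorizer.fit_transform(docs)
```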
scikitlab.vectorizers.text.UniversalSentenceEncoder
UniversalSentenceEncoder(resource_dir=None)
Wrapper around a pre-trained Universal Sentence Encoder model that converts short English texts into fixed-size dense semantic vectors without the need for text preprocessing. This component may require a warm start but is useful for clustering or comparing document similarity.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
resource_dir | str | path to zipped model, else downloads from the web. | None |
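A sketch of comparing sentence similarity with the resulting dense vectors; the fit/transform usage is assumed to follow scikit-learn conventions, & cosine similarity comes from scikit-learn.

```python
from sklearn.metrics.pairwise import cosine_similarity
from scikitlab.vectorizers.text import UniversalSentenceEncoder

# downloads the pre-trained model on first use unless a local zip is given
encoder = UniversalSentenceEncoder()
vectors = encoder.fit_transform([
    "How do I reset my password?",
    "I forgot my login credentials.",
    "What is the weather tomorrow?",
])

similarity = cosine_similarity(vectors)   # pairwise semantic similarity matrix
```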