
Vectorizers

These are pipeline components responsible for transforming input objects into their vectorized form. The input objects are domain-specific, but the resulting vectors are numerical representations suitable for machine learning. Think of vectorizers as signal generators that emphasize or suppress the aspects being learned. They come in the following forms:

| Type | Description |
| --- | --- |
| encoders | transform inputs into vectorized forms |
| reducers | dimensionality-reduce other encodings |

scikitlab.vectorizers.encoder.EnumeratedEncoder

EnumeratedEncoder(handle_unknown='error')

Encodes arrays of categorical values into integer codes. Similar to a label encoder, but supports N-dimensional arrays. The code -1 is reserved for unknown values. Acts as an encoder for inputs x, outputs y, or both.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| handle_unknown | str | Behaviour when unknown objects or classes are seen: either "error" or "ignore". | 'error' |
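The encoding idea can be sketched as follows. This is a toy illustration of the concept (map each distinct value seen at fit time to an integer, reserve -1 for unknowns), not scikitlab's actual implementation; the class name `ToyEnumerator` is hypothetical.

```python
# Toy sketch: enumerate distinct values at fit time; -1 marks unknowns.
class ToyEnumerator:
    def __init__(self, handle_unknown="error"):
        self.handle_unknown = handle_unknown
        self.mapping = {}

    def fit(self, values):
        for v in values:
            if v not in self.mapping:
                self.mapping[v] = len(self.mapping)
        return self

    def transform(self, values):
        codes = []
        for v in values:
            if v in self.mapping:
                codes.append(self.mapping[v])
            elif self.handle_unknown == "ignore":
                codes.append(-1)  # -1 reserved for unknown values
            else:
                raise ValueError(f"unknown value: {v!r}")
        return codes

enc = ToyEnumerator(handle_unknown="ignore").fit(["cat", "dog", "cat"])
print(enc.transform(["dog", "bird"]))  # → [1, -1]
```

With `handle_unknown="error"`, the unseen value "bird" would raise instead of mapping to -1.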

scikitlab.vectorizers.frequential.ItemCountVectorizer

ItemCountVectorizer(
    fn_norm=None,
    min_freq=0.0,
    max_freq=1.0,
    max_items=None,
    out_of_vocab=None,
    binary=False,
    **kwargs
)

A general-purpose counting vectorizer for arbitrary collections of items throughout a dataset. This class was adapted from scikit-learn's CountVectorizer to generalize items beyond text documents. The class also decouples any internal item pre-processing into a user-definable normalization function.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| fn_norm | Callable | How to transform individual items per input. | None |
| min_freq | float | Filter rare items below this threshold. | 0.0 |
| max_freq | float | Filter frequent items above this threshold. | 1.0 |
| max_items | int | Keep only the top most frequent items in the corpus. | None |
| out_of_vocab | str | Feature name for catching unrecognized or filtered items. | None |
| binary | bool | Simply flag all non-zero item counts. | False |
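The frequency-filtered counting idea can be sketched in plain Python. This mirrors only the concept (vocabulary built from document frequencies, items filtered by `min_freq`/`max_freq`, optional binary flags); `count_vectorize` is an illustrative stand-in, not scikitlab's actual code.

```python
from collections import Counter

# Toy sketch of frequency-filtered item counting over a dataset of rows.
def count_vectorize(rows, min_freq=0.0, max_freq=1.0, binary=False):
    n = len(rows)
    doc_freq = Counter()
    for row in rows:
        doc_freq.update(set(row))                     # presence per row
    vocab = sorted(i for i, c in doc_freq.items()
                   if min_freq <= c / n <= max_freq)  # frequency filter
    vectors = []
    for row in rows:
        counts = Counter(row)
        vectors.append([min(counts[i], 1) if binary else counts[i]
                        for i in vocab])
    return vocab, vectors

# "b" appears in every row (frequency 1.0 > max_freq) so it is filtered out
vocab, vecs = count_vectorize([["a", "a", "b"], ["b", "c"]], max_freq=0.5)
print(vocab, vecs)
```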

scikitlab.vectorizers.spatial.GeoVectorizer

GeoVectorizer(
    resolution,
    index_scheme="h3",
    items=None,
    offset=1,
    **kwargs
)

Converts shapely latitude & longitude point coordinates to a geospatial indexing scheme. This is useful to quantize areas & model neighboring or hierarchical spatial relationships between them. Some relationships are 1:1 but others are 1:many, so resulting vectors denote occurrence counts of all train-time known areas. Any unrecognized area at inference time is ignored & vectorized as zero. Depending on spatial resolution & coverage, vectors can be high-dimensional & are encoded as sparse. Users may want to cap dimensionality by size or frequency, or apply other dimensionality-reduction techniques.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| resolution | int | Cell size of fetched areas. Range depends on scheme. | required |
| index_scheme | str | Geo-indexing scheme: h3, geohash, s2, etc. | 'h3' |
| items | Set[str] | Combination of items to fetch: cells (default), neighbors, parents or children. | None |
| offset | int | Number of neighbouring or hierarchical cells away from hits to fetch. | 1 |
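The quantize-and-count idea can be illustrated without a real index. The sketch below uses naive lat/lon rounding in place of h3, geohash, or s2; the grid function and names are hypothetical, purely to show how unknown areas vectorize to zero.

```python
# Toy illustration of spatial quantization: coordinates are bucketed into grid
# cells, and vectors count hits against the cells known at train time.
def toy_cell(lat, lon, resolution=1):
    # quantize a coordinate into a grid-cell id; a real scheme (h3, geohash,
    # s2) would index hexagons or curve-ordered cells instead
    return (round(lat, resolution), round(lon, resolution))

def toy_geo_vectorize(train_points, test_points, resolution=1):
    known = sorted(set(toy_cell(*p, resolution) for p in train_points))
    vector = [0] * len(known)
    for p in test_points:
        cell = toy_cell(*p, resolution)
        if cell in known:                  # unrecognized areas stay zero
            vector[known.index(cell)] += 1
    return known, vector

train = [(45.50, -73.57), (45.51, -73.58), (40.71, -74.01)]
cells, vec = toy_geo_vectorize(train, [(45.504, -73.571), (0.0, 0.0)])
print(cells, vec)  # the (0.0, 0.0) point hits no known cell
```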

scikitlab.vectorizers.temporal.DateTimeVectorizer

DateTimeVectorizer(weights=None, utc_norm=False, **kwargs)

Encodes a date-time into a collection of trigonometrically encoded attribute parts. This vectorizer can emphasize certain time attributes by weight, as well as standardize to a common timezone.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| weights | Dict[str, float] | Set of (weighted) time attributes. Choose from: season, month, weekday, hour, minute, second, microsec; else defaults to all with equal weight. | None |
| utc_norm | bool | Converts to coordinated universal time. | False |
| kwargs | | Other parameters for the base transformer. | {} |
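The weighted trigonometric encoding can be sketched for two attributes. This illustrates the idea behind the `weights` parameter (each cyclic attribute maps to a weighted sin/cos pair, so e.g. December sits next to January), not the vectorizer's exact attribute set or output layout; `toy_datetime_encode` is hypothetical.

```python
import math
from datetime import datetime

# Toy sketch: encode cyclic date-time attributes as weighted (sin, cos) pairs.
def toy_datetime_encode(dt, weights=None):
    weights = weights or {"month": 1.0, "hour": 1.0}
    periods = {"month": 12, "hour": 24}          # units per cycle
    parts = {"month": dt.month - 1, "hour": dt.hour}
    vec = []
    for attr, w in weights.items():
        angle = 2 * math.pi * parts[attr] / periods[attr]
        vec += [w * math.sin(angle), w * math.cos(angle)]
    return vec

v = toy_datetime_encode(datetime(2024, 1, 1, 0, 0), weights={"month": 2.0})
print(v)  # January, weight 2: sin(0)=0, cos(0)=1 → [0.0, 2.0]
```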

scikitlab.vectorizers.temporal.PeriodicityVectorizer

PeriodicityVectorizer(period, **kwargs)

Trigonometrically encodes a periodic signal as a combination of sine & cosine values. This is useful to fairly capture cyclical distances between the end & start of periods, such as for dates. The encoding also caps the number of dimensions to 2, acting as a tradeoff between reducing the curse of high dimensionality from one-hot-encoding signal values & retaining fine approximation from using various radial-basis-functions.

Note that since both sin & cosine intercept the axis twice per period, both dimensions are required to precisely disambiguate where along the original signal the encoding lies.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| period | int | How many units before the cycle repeats. | required |
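The sin/cos encoding above is a standard construction and can be sketched directly: a value t with period p maps to (sin(2πt/p), cos(2πt/p)), so the end of a period lands next to its start.

```python
import math

# Periodic value t with period p → point on the unit circle.
def periodic_encode(t, period):
    angle = 2 * math.pi * (t % period) / period
    return (math.sin(angle), math.cos(angle))

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# hour 23 is nearer to hour 0 than to hour 12 on the encoded circle,
# which a plain one-hot or ordinal encoding would miss
d_wrap = dist(periodic_encode(23, 24), periodic_encode(0, 24))
d_far = dist(periodic_encode(23, 24), periodic_encode(12, 24))
print(d_wrap < d_far)  # → True
```

This also shows why both dimensions are needed: sine alone gives the same value twice per period, but the (sin, cos) pair is unambiguous.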

scikitlab.vectorizers.text.WeightedNgramVectorizer

WeightedNgramVectorizer(
    vectorizer_type="tfidf",
    weight_fn=None,
    ngram_range=(1, 1),
    n_jobs=None,
    verbose=False,
    **kwargs
)

Composite scikit component that manages ngram-weighted frequency vectors such as tf-idf or count vectorizers. Since smaller ngrams are more frequent but less insightful than larger ngrams, this component allows fair weighting & filtering of ngrams in proportion to their token size throughout the corpus.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| vectorizer_type | str | Type of the vectorizer: "tfidf" or "count". | 'tfidf' |
| weight_fn | Callable[[int], float] | Function to weight n-grams by size. | None |
| ngram_range | tuple | Min/max ngram sizes to build. | (1, 1) |
| n_jobs | int | Number of parallel processes. | None |
| verbose | bool | Verbosity level. | False |
| kwargs | | Other parameters for the base vectorizer. | {} |
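The size-weighted counting idea behind `weight_fn` can be sketched in plain Python: extract n-grams across a range and scale each count by a function of its token length, so larger, rarer n-grams can be boosted relative to common unigrams. `weighted_ngrams` is illustrative, not the component's actual internals.

```python
from collections import Counter

# Toy sketch: count n-grams over a size range, weighting counts by n-gram size.
def weighted_ngrams(tokens, ngram_range=(1, 2), weight_fn=None):
    weight_fn = weight_fn or (lambda n: 1.0)
    lo, hi = ngram_range
    weighted = {}
    for n in range(lo, hi + 1):
        grams = Counter(tuple(tokens[i:i + n])
                        for i in range(len(tokens) - n + 1))
        for gram, count in grams.items():
            weighted[" ".join(gram)] = count * weight_fn(n)
    return weighted

out = weighted_ngrams(["to", "be", "or", "not", "to", "be"],
                      ngram_range=(1, 2), weight_fn=lambda n: float(n))
print(out["to be"], out["to"])  # bigram weight doubled relative to unigrams
```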

scikitlab.vectorizers.text.UniversalSentenceEncoder

UniversalSentenceEncoder(resource_dir=None)

Wrapper around the pre-trained Universal Sentence Encoder model that converts short English texts into fixed-size dense semantic vectors without the need for text preprocessing. This component may require a warm start but is useful for clustering or comparing document similarities.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| resource_dir | str | Path to zipped model, else downloads from the web. | None |
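Dense semantic vectors like those produced here are typically compared with cosine similarity. The vectors below are toy stand-ins for illustration, not real model output.

```python
import math

# Cosine similarity between two dense vectors: 1.0 for identical direction,
# near 0.0 for unrelated directions.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

v_cat = [0.9, 0.1, 0.0]      # toy embedding of "a small cat"
v_kitten = [0.8, 0.2, 0.1]   # toy embedding of "the kitten"
v_invoice = [0.0, 0.1, 0.9]  # toy embedding of "pay this invoice"
print(cosine(v_cat, v_kitten) > cosine(v_cat, v_invoice))  # → True
```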