Vectorizers
These are pipeline components responsible for transforming input objects into their vectorized form. The input objects are domain-specific, but the resulting vectors are numerical representations suitable for machine learning. Think of vectorizers as signal generators that emphasize or suppress the aspects being learned. They come in the following forms:
Type | Description |
---|---|
encoders | transform inputs into vectorized forms |
reducers | reduce the dimensionality of other encodings |
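As a rough sketch of how these components slot into a workflow, the snippet below composes an encoder with a reducer in a standard scikit-learn Pipeline. The import path comes from the class reference further down; the input format (a list of item collections) & the fit/transform behaviour are assumed to follow scikit-learn conventions, and TruncatedSVD stands in for a reducer since none is documented in this section.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

from scikitlab.vectorizers.frequential import ItemCountVectorizer

# an encoder turns raw items into (sparse) count vectors,
# then a reducer compresses them into a lower-dimensional space
pipeline = Pipeline([
    ("encode", ItemCountVectorizer()),
    ("reduce", TruncatedSVD(n_components=2)),
])

X = [["red", "round"], ["yellow", "long"], ["red", "long", "round"]]
vectors = pipeline.fit_transform(X)   # dense array of shape (3, 2)
```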
scikitlab.vectorizers.encoder.EnumeratedEncoder
EnumeratedEncoder(handle_unknown='error')
Encodes a list of arrays by enumerating its set of dimensions. Similar to a label encoder but supports N-dimensional arrays. The value -1 is reserved for unknowns, & the component can act as an input (x), output (y) or combined (xy) encoder.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
handle_unknown | str | behaviour when unknown objects or classes are seen. | 'error' |
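A minimal usage sketch, assuming the encoder accepts 2-D arrays of categorical values & exposes the usual scikit-learn fit/transform methods:

```python
import numpy as np
from scikitlab.vectorizers.encoder import EnumeratedEncoder

# enumerate categorical values; handles N-d arrays, unlike a plain label encoder
encoder = EnumeratedEncoder()   # handle_unknown='error' by default
y = np.array([["cat", "small"], ["dog", "large"], ["cat", "large"]])

y_enc = encoder.fit_transform(y)   # integer codes per value, -1 reserved for unknowns
```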
scikitlab.vectorizers.frequential.ItemCountVectorizer
ItemCountVectorizer(
fn_norm=None,
min_freq=0.0,
max_freq=1.0,
max_items=None,
out_of_vocab=None,
binary=False,
**kwargs
)
A general purpose counting vectorizer for arbitrary collections of items throughout a dataset. This class was adapted from scikit-learn's CountVectorizer to generalize items beyond text documents. It also decouples internal item pre-processing into a user-definable normalization function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fn_norm | Callable | how to transform individual items per input. | None |
min_freq | float | filter rare items below this threshold. | 0.0 |
max_freq | float | filter frequent items above this threshold. | 1.0 |
max_items | int | keep only the top most frequent items in the corpus. | None |
out_of_vocab | str | feature name for catching unrecognized/filtered items. | None |
binary | bool | simply flag all non-zero count items. | False |
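A usage sketch under the assumption that each input sample is a collection of hashable items, that `fn_norm` is applied to each item, & that the component follows the scikit-learn fit/transform API:

```python
from scikitlab.vectorizers.frequential import ItemCountVectorizer

# count items per record, lower-casing them first and bucketing
# rare or unseen items into a single out-of-vocabulary feature
vectorizer = ItemCountVectorizer(
    fn_norm=str.lower,      # normalize each item before counting
    min_freq=0.05,          # drop items rarer than this corpus frequency
    out_of_vocab="<OOV>",   # catch-all feature for filtered/unseen items
)

X_train = [["SciFi", "Drama"], ["Drama"], ["Comedy", "Drama"]]
counts = vectorizer.fit_transform(X_train)   # sparse count matrix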
scikitlab.vectorizers.spatial.GeoVectorizer
GeoVectorizer(
resolution,
index_scheme="h3",
items=None,
offset=1,
**kwargs
)
Converts shapely latitude & longitude point coordinates to a geospatial indexing scheme. This is useful to quantize areas & model neighbouring or hierarchical spatial relationships between them. Some relationships are 1:1 but others are 1:many, so the resulting vectors denote occurrence counts over all areas known at train time. Any unrecognized area at inference time is ignored & vectorized as zero. Depending on spatial resolution & coverage, vectors can be high-dimensional & are encoded as sparse. Users may want to cap dimensionality by size or frequency, or apply other dimensionality-reduction techniques.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
resolution | int | Cell size of fetched areas. Range depends on scheme. | required |
index_scheme | str | Geo indexing scheme: | 'h3' |
items | Set[str] | Combination of items to fetch: | None |
offset | int | Number of neighbouring or hierarchical cells away from hits to fetch. | 1 |
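A sketch of vectorizing point coordinates, assuming inputs are shapely Point objects (here given as longitude/latitude) & that the transformer follows the usual fit/transform conventions:

```python
from shapely.geometry import Point
from scikitlab.vectorizers.spatial import GeoVectorizer

# quantize coordinates into H3 cells at a mid-range resolution and
# count the cells (plus 1 ring of neighbours) hit by each point
vectorizer = GeoVectorizer(resolution=7, index_scheme="h3", offset=1)

X_train = [Point(-73.56, 45.50), Point(-73.57, 45.51)]   # Montreal area
vectors = vectorizer.fit_transform(X_train)              # sparse occurrence counts
```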
scikitlab.vectorizers.temporal.DateTimeVectorizer
DateTimeVectorizer(weights=None, utc_norm=False, **kwargs)
Encodes a date-time into a collection of trigonometrically encoded attribute parts. This vectorizer allows emphasizing certain time attributes by weight as well as standardizing to a common timezone.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
weights | Dict[str, float] | set of (weighted) time attributes. Choose from: | None |
utc_norm | bool | converts to coordinated universal time. | False |
kwargs | | other parameters for base transformer. | {} |
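A usage sketch assuming the transformer accepts Python datetime objects; the attribute keys in `weights` are illustrative only, since the allowed names are not listed above.

```python
from datetime import datetime
from scikitlab.vectorizers.temporal import DateTimeVectorizer

# emphasize some time attributes over others
vectorizer = DateTimeVectorizer(
    weights={"hour": 2.0, "month": 1.0},   # hypothetical attribute keys
    utc_norm=True,                         # normalize all inputs to UTC first
)

X_train = [datetime(2023, 6, 1, 14, 30), datetime(2023, 12, 24, 8, 0)]
vectors = vectorizer.fit_transform(X_train)
```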
scikitlab.vectorizers.temporal.PeriodicityVectorizer
PeriodicityVectorizer(period, **kwargs)
Trigonometrically encodes a periodic signal as a combination of sine & cosine values. This is useful to fairly capture cyclical distances between the end & start of periods, such as for dates. This encoding also caps the number of dimensions at 2, acting as a tradeoff between reducing the curse of high dimensionality from one-hot-encoding signal values & the finer approximation of various radial-basis-functions.
Note that since both sine & cosine intercept the axis twice per period, both dimensions are required to precisely disambiguate where along the original signal the encoding lies.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
period | int | how many units before the cycle repeats. | required |
scikitlab.vectorizers.text.WeightedNgramVectorizer
WeightedNgramVectorizer(
vectorizer_type="tfidf",
weight_fn=None,
ngram_range=(1, 1),
n_jobs=None,
verbose=False,
**kwargs
)
Composite scikit-learn component that manages n-gram weighted frequency vectors such as tf-idf or count vectors. Since smaller n-grams are more frequent but less insightful than larger n-grams, this component allows fairly weighting & filtering n-grams proportionally to their token size throughout the corpus.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
vectorizer_type | str | Type of the vectorizer: "tfidf" or "count". | 'tfidf' |
weight_fn | Callable[[int], float] | function to weight n-grams by size. | None |
ngram_range | tuple | min/max n-gram sizes to build. | (1, 1) |
n_jobs | int | number of parallel processes. | None |
verbose | bool | verbosity level. | False |
kwargs | | other parameters for base vectorizer. | {} |
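A sketch that weights larger n-grams more heavily, assuming the component accepts raw text documents like scikit-learn's text vectorizers do:

```python
from scikitlab.vectorizers.text import WeightedNgramVectorizer

# build unigrams through trigrams, weighting each n-gram linearly by its size
vectorizer = WeightedNgramVectorizer(
    vectorizer_type="tfidf",
    ngram_range=(1, 3),
    weight_fn=lambda n: float(n),   # longer n-grams count proportionally more
)

docs = ["the quick brown fox", "the lazy dog", "quick brown dogs"]
vectors = vectorizer.fit_transform(docs)
```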
scikitlab.vectorizers.text.UniversalSentenceEncoder
UniversalSentenceEncoder(resource_dir=None)
Wrapper around a pre-trained Universal Sentence Encoder model that converts short English texts into fixed-size dense semantic vectors without the need for text preprocessing. This component may require a warm start but is useful for clustering or comparing document similarity.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
resource_dir | str | path to zipped model, else downloads from the web. | None |
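A sketch of comparing sentence similarity with the resulting dense vectors; the fit/transform usage is assumed to follow scikit-learn conventions, & cosine similarity comes from scikit-learn.

```python
from sklearn.metrics.pairwise import cosine_similarity
from scikitlab.vectorizers.text import UniversalSentenceEncoder

# downloads the pre-trained model on first use unless a local zip is given
encoder = UniversalSentenceEncoder()
vectors = encoder.fit_transform([
    "How do I reset my password?",
    "I forgot my login credentials.",
    "What is the weather tomorrow?",
])

similarity = cosine_similarity(vectors)   # pairwise semantic similarity matrix
```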