Samplers
Samplers are pipeline components that change the volume of training data to match a desired distribution. They are useful when data is imbalanced, scarce, or expensive to obtain. Samplers typically implement filtering, cloning, synthesizing, or permuting strategies that modify input rows only at learning time and have no effect when predicting.
Integrating samplers within the model pipeline ensures downstream data assumptions are preserved and allows experimentation by treating data distributions and volumes as tunable hyperparameters.
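The fit-time-only behaviour can be illustrated with a minimal sketch in plain Python. The classes `CloneMinoritySampler`, `MajorityModel`, and `SampledPipeline` are hypothetical names invented for this illustration and are not part of scikitlab; the point is only that resampling happens inside `fit` and is bypassed in `predict`.

```python
import random


class CloneMinoritySampler:
    """Hypothetical sampler: clones rows of smaller classes up to the largest class size."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def resample(self, X, y):
        rng = random.Random(self.random_state)
        by_label = {}
        for row, label in zip(X, y):
            by_label.setdefault(label, []).append(row)
        target = max(len(rows) for rows in by_label.values())
        X_out, y_out = [], []
        for label, rows in by_label.items():
            # Draw clones at random until this class reaches the target size.
            extra = [rng.choice(rows) for _ in range(target - len(rows))]
            for row in rows + extra:
                X_out.append(row)
                y_out.append(label)
        return X_out, y_out


class MajorityModel:
    """Toy estimator: predicts the most frequent training label."""

    def fit(self, X, y):
        self.n_seen_ = len(X)  # number of rows the model was trained on
        self.label_ = max(sorted(set(y)), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class SampledPipeline:
    """Hypothetical pipeline: applies the sampler in fit() only."""

    def __init__(self, sampler, model):
        self.sampler = sampler
        self.model = model

    def fit(self, X, y):
        X_res, y_res = self.sampler.resample(X, y)
        self.model.fit(X_res, y_res)
        return self

    def predict(self, X):
        # The sampler is skipped: every input row yields exactly one prediction.
        return self.model.predict(X)


X = [[0], [1], [2], [3], [4]]
y = ["a", "a", "a", "a", "b"]
pipe = SampledPipeline(CloneMinoritySampler(), MajorityModel()).fit(X, y)
predictions = pipe.predict([[9], [10]])
print(pipe.model.n_seen_, len(predictions))  # 8 2
```

The model is trained on 8 rows (4 per class) although only 5 were supplied, while prediction returns one output per input row.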
scikitlab.samplers.balancing.StrataBalancer
StrataBalancer(
sampling_mode, columns, random_state=0, **kwargs
)
Enforce fairness by balancing a dataset across sub-population strata of the input. Each stratum is defined by a specific set of variables in X having identical values. This component modifies the dataset only at fit time, by either over- or under-sampling to reach equal volumes per stratum, and has no effect at predict time.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sampling_mode | str | either over- or under-sampling | required |
columns | list | indices or column names in X | required |
random_state | int | for determinism | 0 |
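The strata balancing described above can be sketched in plain Python. This is an illustration of the idea, not the StrataBalancer implementation: the function `balance_strata` is hypothetical, and the mode strings `"over"` and `"under"` are assumed names, since the accepted values of `sampling_mode` are not listed here.

```python
import random
from collections import defaultdict


def balance_strata(rows, columns, sampling_mode, random_state=0):
    """Resample rows so that every stratum ends up with the same number of rows.

    The mode names "over" and "under" are assumptions made for this sketch.
    """
    if sampling_mode not in ("over", "under"):
        raise ValueError("sampling_mode must be 'over' or 'under' in this sketch")
    rng = random.Random(random_state)

    # A stratum is the set of rows sharing identical values in `columns`.
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[c] for c in columns)].append(row)

    sizes = [len(group) for group in strata.values()]
    target = max(sizes) if sampling_mode == "over" else min(sizes)

    balanced = []
    for group in strata.values():
        if len(group) >= target:
            # Under-sample (or keep as is) without replacement.
            balanced.extend(rng.sample(group, target))
        else:
            # Over-sample by cloning randomly chosen rows.
            balanced.extend(group)
            balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


rows = [
    {"gender": "m", "age": 31},
    {"gender": "m", "age": 45},
    {"gender": "m", "age": 52},
    {"gender": "m", "age": 28},
    {"gender": "f", "age": 39},
    {"gender": "f", "age": 47},
]
over = balance_strata(rows, ["gender"], "over")
under = balance_strata(rows, ["gender"], "under")
print(len(over), len(under))  # 8 4
```

Over-sampling grows every stratum to the size of the largest one, whereas under-sampling shrinks every stratum to the size of the smallest. Because rows are accessed with `row[c]`, the same sketch works with lists and integer indices.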
scikitlab.samplers.balancing.RegressionBalancer
RegressionBalancer(
sampling_mode,
fn_classifier=None,
random_state=0,
**kwargs
)
Over- or under-samples a regression dataset based on a category mapping over the target variable. This is useful when certain ranges of the regression target are rare.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sampling_mode | str | either over- or under-sampling | required |
fn_classifier | Callable | maps a regression target value to a class range | None |
random_state | int | for determinism | 0 |
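The category mapping can be sketched as follows: a callable assigns each target value to a class range, and the rows are then resampled so that every range is equally represented. The function `balance_regression` and the mode strings `"over"` and `"under"` are assumptions for illustration only. The behaviour of the real component when `fn_classifier` is left as `None` is not described here, so this sketch requires the callable.

```python
import random
from collections import defaultdict


def balance_regression(X, y, fn_classifier, sampling_mode, random_state=0):
    """Resample (X, y) so each target range produced by fn_classifier is equally sized.

    Hypothetical helper; the mode names "over" and "under" are assumed.
    """
    if sampling_mode not in ("over", "under"):
        raise ValueError("sampling_mode must be 'over' or 'under' in this sketch")
    rng = random.Random(random_state)

    # Group row indices by the class range of their target value.
    groups = defaultdict(list)
    for i, target in enumerate(y):
        groups[fn_classifier(target)].append(i)

    sizes = [len(idx) for idx in groups.values()]
    goal = max(sizes) if sampling_mode == "over" else min(sizes)

    chosen = []
    for idx in groups.values():
        if len(idx) >= goal:
            chosen.extend(rng.sample(idx, goal))
        else:
            chosen.extend(idx + rng.choices(idx, k=goal - len(idx)))
    return [X[i] for i in chosen], [y[i] for i in chosen]


def price_range(price):
    """Example mapping: split prices into two ranges."""
    return "high" if price >= 100 else "low"


X = [[i] for i in range(6)]
y = [10.0, 20.0, 30.0, 40.0, 50.0, 500.0]
X_bal, y_bal = balance_regression(X, y, price_range, "over")
print(len(y_bal), y_bal.count(500.0))  # 10 5
```

The single rare high-priced row is cloned until the "high" range matches the five rows in the "low" range.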
scikitlab.samplers.balancing.VectorBalancer
VectorBalancer(
deduplicate=False,
down_sample=False,
over_sample=False,
synthesize=False,
random_state=0,
**kwargs
)
Balance training data for proper learning. This component can over-sample the non-majority classes near the decision boundary to generate synthetic but relevant examples, as well as under-sample from all classes near the decision boundary to clean up noisy data. In order to interpolate the space, this component operates on vectors rather than raw data points.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
deduplicate | bool | remove duplicate / identical vectors | False |
down_sample | bool | remove redundant vectors | False |
over_sample | bool | clone minority vectors | False |
synthesize | bool | generate borderline synthetic vectors | False |
random_state | int | for determinism | 0 |
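Two of the options above, deduplication and synthesis, can be sketched in plain Python to show why the component needs vectors: new points are produced by interpolating between existing ones. The algorithm used by VectorBalancer is not specified here, so this sketch swaps in a simple SMOTE-style interpolation towards the single nearest minority neighbour and omits the borderline selection mentioned in the description. The helper names `deduplicate` and `synthesize` are hypothetical.

```python
import math
import random


def deduplicate(vectors, labels):
    """Drop repeated (vector, label) pairs, keeping the first occurrence."""
    seen = set()
    out_vectors, out_labels = [], []
    for vec, label in zip(vectors, labels):
        key = (tuple(vec), label)
        if key not in seen:
            seen.add(key)
            out_vectors.append(list(vec))
            out_labels.append(label)
    return out_vectors, out_labels


def synthesize(minority, n_new, random_state=0):
    """Create n_new vectors between a minority vector and its nearest minority neighbour.

    SMOTE-style sketch with a single neighbour; needs at least two distinct vectors.
    """
    rng = random.Random(random_state)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [vec for vec in minority if vec is not base]
        neighbour = min(others, key=lambda vec: math.dist(base, vec))
        t = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


vectors = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [2.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels = ["maj", "maj", "maj", "maj", "min", "min"]

vectors, labels = deduplicate(vectors, labels)
minority = [vec for vec, label in zip(vectors, labels) if label == "min"]
n_missing = labels.count("maj") - labels.count("min")
new_vectors = synthesize(minority, n_missing)
print(len(vectors), len(new_vectors))  # 5 1
```

Each synthetic vector lies on the segment between two existing minority vectors, which is only meaningful in a numeric vector space and not on raw categorical or text inputs.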