
Samplers

Samplers are pipeline components that change the volume of training data so that it matches a desired distribution to learn from. They are useful when data is imbalanced, or when it is scarce or expensive to obtain. Samplers typically implement filtering, cloning, synthesizing or permuting strategies that modify input rows only at learning time and have NO effect when predicting.

Integrating samplers within the model pipeline ensures that downstream data assumptions are preserved, and allows experimentation that treats data distributions and volumes as tunable hyperparameters.
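The fit-time-only behaviour can be illustrated with a minimal, library-independent sketch. Every name here (`HalfSampler`, `MeanModel`, `SampledPipeline`, `resample`) is hypothetical and chosen for illustration; none of them is part of scikitlab.

```python
import random


class HalfSampler:
    """Hypothetical sampler: keeps a random half of the training rows."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def resample(self, X, y):
        rng = random.Random(self.random_state)
        keep = sorted(rng.sample(range(len(X)), k=max(1, len(X) // 2)))
        return [X[i] for i in keep], [y[i] for i in keep]


class MeanModel:
    """Toy regressor that always predicts the mean of its training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        self.n_seen_ = len(y)  # number of rows the model was trained on
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class SampledPipeline:
    """Applies the sampler during fit only; predict bypasses it."""

    def __init__(self, sampler, model):
        self.sampler = sampler
        self.model = model

    def fit(self, X, y):
        X_res, y_res = self.sampler.resample(X, y)  # rows change here only
        self.model.fit(X_res, y_res)
        return self

    def predict(self, X):
        return self.model.predict(X)  # one prediction per input row


X = [[i] for i in range(10)]
y = [float(i) for i in range(10)]
pipe = SampledPipeline(HalfSampler(random_state=0), MeanModel()).fit(X, y)
predictions = pipe.predict(X)
```

In this sketch the model is trained on 5 of the 10 rows, while `predict` still returns 10 predictions, one per input row.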


scikitlab.samplers.balancing.StrataBalancer

StrataBalancer(
    sampling_mode, columns, random_state=0, **kwargs
)

Enforce fairness by balancing a dataset across sub-population strata of the input. Each stratum is the set of rows that share identical values for a specific set of variables in X. This component modifies the dataset only at fit time, by either over- or under-sampling so that every stratum has equal volume, and has no effect at predict time.

Parameters:

Name           Type   Description                                           Default
sampling_mode  str    either over or under sampling                         required
columns        list   indices or column names in X that define the strata   required
random_state   int    for determinism                                       0
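The balancing idea described above can be sketched in plain Python. This is not the StrataBalancer implementation: `balance_strata` is a hypothetical helper, the `"over"` and `"under"` mode strings are assumptions (the accepted values of `sampling_mode` are not listed here), and rows are modelled as dicts for simplicity.

```python
import random
from collections import defaultdict


def balance_strata(rows, columns, mode="over", random_state=0):
    """Resample rows so that every stratum ends up with the same size.

    rows: list of dicts; columns: keys whose combined values define a stratum.
    mode "over" grows each stratum to the largest size (with replacement);
    mode "under" shrinks each stratum to the smallest size (no replacement).
    """
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    if not rows:
        return []
    rng = random.Random(random_state)

    # Group rows by the tuple of values they hold in the strata columns.
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[c] for c in columns)].append(row)

    sizes = [len(group) for group in strata.values()]
    target = max(sizes) if mode == "over" else min(sizes)

    balanced = []
    for group in strata.values():
        if len(group) >= target:
            balanced.extend(rng.sample(group, k=target))
        else:
            # Keep every original row and clone random ones to fill the gap.
            balanced.extend(group + rng.choices(group, k=target - len(group)))
    return balanced


rows = [{"group": "a", "value": i} for i in range(6)]
rows += [{"group": "b", "value": i} for i in range(2)]
over = balance_strata(rows, ["group"], mode="over")
under = balance_strata(rows, ["group"], mode="under")
```

With six rows in stratum `a` and two in stratum `b`, over-sampling yields six rows per stratum and under-sampling yields two per stratum.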

scikitlab.samplers.balancing.RegressionBalancer

RegressionBalancer(
    sampling_mode,
    fn_classifier=None,
    random_state=0,
    **kwargs
)

Over- or under-samples a regression dataset based on a category mapping over the target variable. This is useful when certain ranges of the predicted regression variable are rare.

Parameters:

Name           Type      Description                                         Default
sampling_mode  str       either over or under sampling                       required
fn_classifier  Callable  how to triage regression target into class ranges   None
random_state   int       for determinism                                     0
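The role of a target-to-class mapping can be sketched as follows. This is an illustrative approximation, not the RegressionBalancer code: `balance_regression`, the `"over"`/`"under"` mode strings, and the example `price_band` function are all assumptions.

```python
import random
from collections import defaultdict


def balance_regression(X, y, fn_classifier, mode="over", random_state=0):
    """Balance a regression dataset across the classes fn_classifier assigns.

    fn_classifier maps each target value to a class label (a range bucket).
    Row indices are then over- or under-sampled to equal counts per class.
    """
    rng = random.Random(random_state)

    # Bucket row indices by the class derived from each target value.
    buckets = defaultdict(list)
    for i, target in enumerate(y):
        buckets[fn_classifier(target)].append(i)

    sizes = [len(idx) for idx in buckets.values()]
    goal = max(sizes) if mode == "over" else min(sizes)

    chosen = []
    for idx in buckets.values():
        if len(idx) >= goal:
            chosen.extend(rng.sample(idx, k=goal))
        else:
            chosen.extend(idx + rng.choices(idx, k=goal - len(idx)))
    return [X[i] for i in chosen], [y[i] for i in chosen]


def price_band(target):
    """Hypothetical mapping: targets of 500 or more are the rare class."""
    return "high" if target >= 500 else "low"


X = [[1], [2], [3], [4], [5], [6]]
y = [100, 120, 110, 130, 105, 900]
X_over, y_over = balance_regression(X, y, price_band, mode="over")
X_under, y_under = balance_regression(X, y, price_band, mode="under")
```

Here the single target in the rare `high` range is cloned until it matches the five `low` rows when over-sampling, while under-sampling keeps one row from each range.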

scikitlab.samplers.balancing.VectorBalancer

VectorBalancer(
    deduplicate=False,
    down_sample=False,
    over_sample=False,
    synthesize=False,
    random_state=0,
    **kwargs
)

Balance training data for proper learning. This component can over-sample the non-majority classes near the decision boundary to generate synthetic but relevant examples, as well as under-sample from all classes near the decision boundary to clean up noisy data. In order to interpolate the space, this component operates on vectors rather than raw datapoints.

Parameters:

Name          Type  Description                             Default
deduplicate   bool  remove duplicate / identical vectors    False
down_sample   bool  remove redundant vectors                False
over_sample   bool  clone minority vectors                  False
synthesize    bool  generate borderline synthetic vectors   False
random_state  int   for determinism                         0
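Why interpolation needs vectors can be seen in a small sketch of two of these steps. It is a simplified stand-in rather than the VectorBalancer algorithm: `deduplicate` and `synthesize` here are hypothetical functions, and the synthesis is plain SMOTE-style interpolation between a minority vector and its nearest minority neighbour, without the borderline selection the component describes.

```python
import math
import random


def deduplicate(vectors, labels):
    """Drop repeated (vector, label) pairs, keeping the first occurrence."""
    seen, out_vectors, out_labels = set(), [], []
    for vector, label in zip(vectors, labels):
        key = (tuple(vector), label)
        if key not in seen:
            seen.add(key)
            out_vectors.append(list(vector))
            out_labels.append(label)
    return out_vectors, out_labels


def synthesize(vectors, labels, minority, n_new, random_state=0):
    """Append n_new points interpolated between minority-class neighbours."""
    rng = random.Random(random_state)
    pool = [v for v, l in zip(vectors, labels) if l == minority]
    if len(pool) < 2:
        raise ValueError("need at least two minority vectors to interpolate")
    new_vectors = []
    for _ in range(n_new):
        base = rng.choice(pool)
        # Nearest other minority vector by Euclidean distance.
        neighbour = min(
            (v for v in pool if v is not base),
            key=lambda v: math.dist(v, base),
        )
        t = rng.random()  # position along the segment base -> neighbour
        new_vectors.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return vectors + new_vectors, labels + [minority] * n_new


vectors = [[0, 0], [0, 0], [1, 0], [0, 1], [5, 5], [6, 5]]
labels = ["a", "a", "a", "a", "b", "b"]
unique_vectors, unique_labels = deduplicate(vectors, labels)
balanced_vectors, balanced_labels = synthesize(
    unique_vectors, unique_labels, minority="b", n_new=1
)
```

The synthetic point lies on the segment between the two `b` vectors, which is only meaningful because the inputs are numeric vectors; raw categorical or text datapoints could not be interpolated this way.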