Samplers

Samplers are pipeline components that change the volume of training data so that it matches a desired distribution to learn from. They are useful when data is imbalanced, or when it is scarce or expensive to obtain. Samplers typically implement filtering, cloning, synthesizing or permuting strategies that modify input rows at learning time only and have no effect when predicting.

Integrating samplers within the model pipeline ensures that downstream data assumptions are preserved, and allows experimentation by treating data distributions and volumes as tunable hyperparameters.
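
The fit-time-only behaviour described above can be illustrated with a minimal, standard-library sketch. All class names here (`ToyUnderSampler`, `MeanModel`, `ToyPipeline`) are hypothetical and are not part of scikitlab; they only show the idea that resampling happens inside `fit` and is bypassed inside `predict`.

```python
import random


class ToyUnderSampler:
    """Hypothetical sampler: keeps a random fraction of the rows at fit time."""

    def __init__(self, keep=0.5, random_state=0):
        self.keep = keep
        self.random_state = random_state

    def resample(self, X, y):
        rng = random.Random(self.random_state)
        n_keep = max(1, int(len(X) * self.keep))
        idx = sorted(rng.sample(range(len(X)), n_keep))
        return [X[i] for i in idx], [y[i] for i in idx]


class MeanModel:
    """Hypothetical model: always predicts the mean of the targets it saw."""

    def fit(self, X, y):
        self.n_seen_ = len(X)
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class ToyPipeline:
    """Hypothetical pipeline: the sampler runs in fit() and is skipped in predict()."""

    def __init__(self, sampler, model):
        self.sampler = sampler
        self.model = model

    def fit(self, X, y):
        X_s, y_s = self.sampler.resample(X, y)  # rows change only here
        self.model.fit(X_s, y_s)
        return self

    def predict(self, X):
        return self.model.predict(X)  # every input row gets a prediction


X = [[i] for i in range(10)]
y = [float(i) for i in range(10)]
pipe = ToyPipeline(ToyUnderSampler(keep=0.5), MeanModel()).fit(X, y)
predictions = pipe.predict(X)
```

In this sketch the model is trained on half of the rows, while `predict` still returns one value per input row. Because `keep` is an ordinary constructor argument, it could be searched over like any other hyperparameter.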

scikitlab.samplers.balancing.StrataBalancer

StrataBalancer(
    sampling_mode, columns, random_state=0, **kwargs
)

Enforce fairness by balancing a dataset across sub-population strata of the input. Each stratum is the set of rows that share identical values for a specific set of variables in X. This component modifies the dataset only at fit time, by either over- or under-sampling so that every stratum has equal volume, and has no effect at predict time.

Parameters:

Name	Type	Description	Default
`sampling_mode`	`str`	either `over` or `under` sampling	required
`columns`	`list`	indices or column names in `X` that define the strata	required
`random_state`	`int`	for determinism	`0`
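
As a rough illustration of the balancing idea, the sketch below groups rows into strata and resamples each group to a common size. It uses only the standard library and does not call scikitlab; the helper `balance_strata` is hypothetical, and the choice of target size (largest stratum for `over`, smallest for `under`) is an assumption about what "equal volumes per strata" means.

```python
import random
from collections import defaultdict


def balance_strata(rows, columns, sampling_mode, random_state=0):
    """Hypothetical helper: resample rows so every stratum has the same size."""
    if sampling_mode not in ("over", "under"):
        raise ValueError("sampling_mode must be 'over' or 'under'")
    if not rows:
        return []
    rng = random.Random(random_state)

    # A stratum is the set of rows sharing identical values in `columns`.
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[c] for c in columns)].append(row)

    sizes = [len(group) for group in strata.values()]
    # Assumption: grow to the largest stratum or shrink to the smallest.
    target = max(sizes) if sampling_mode == "over" else min(sizes)

    balanced = []
    for group in strata.values():
        if len(group) > target:
            # Under-sample without replacement.
            balanced.extend(rng.sample(group, target))
        else:
            # Over-sample by cloning randomly chosen rows.
            clones = [rng.choice(group) for _ in range(target - len(group))]
            balanced.extend(group + clones)
    return balanced


rows = [
    {"gender": "f", "region": "north", "age": 31},
    {"gender": "f", "region": "north", "age": 45},
    {"gender": "f", "region": "north", "age": 27},
    {"gender": "m", "region": "north", "age": 52},
]
over = balance_strata(rows, ["gender", "region"], "over")
under = balance_strata(rows, ["gender", "region"], "under")
```

Because rows are indexed with `row[c]`, the same helper accepts column names for dict rows or integer indices for list rows, mirroring the "indices or column names" wording of the `columns` parameter.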

scikitlab.samplers.balancing.RegressionBalancer

RegressionBalancer(
    sampling_mode,
    fn_classifier=None,
    random_state=0,
    **kwargs
)

Over- or under-samples a regression dataset based on a category mapping over the target variable. This is useful when certain ranges of the regression target are rare.

Parameters:

Name	Type	Description	Default
`sampling_mode`	`str`	either `over` or `under` sampling	required
`fn_classifier`	`Callable`	how to triage regression target into class ranges	`None`
`random_state`	`int`	for determinism	`0`
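
The category-mapping idea can be sketched without scikitlab as follows. The helper `balance_regression` and the example mapping `price_range` are hypothetical. The sketch also requires a classifier function to be passed explicitly, since the behaviour of the documented `None` default is not described here.

```python
import random
from collections import defaultdict


def balance_regression(X, y, sampling_mode, fn_classifier, random_state=0):
    """Hypothetical helper: balance rows by the class range of their target."""
    if sampling_mode not in ("over", "under"):
        raise ValueError("sampling_mode must be 'over' or 'under'")
    rng = random.Random(random_state)

    # Triage each continuous target into a discrete class range.
    bins = defaultdict(list)
    for i, target in enumerate(y):
        bins[fn_classifier(target)].append(i)

    sizes = [len(idx) for idx in bins.values()]
    goal = max(sizes) if sampling_mode == "over" else min(sizes)

    keep = []
    for idx in bins.values():
        if len(idx) > goal:
            keep.extend(rng.sample(idx, goal))  # drop surplus rows
        else:
            clones = [rng.choice(idx) for _ in range(goal - len(idx))]
            keep.extend(idx + clones)  # clone rare rows
    return [X[i] for i in keep], [y[i] for i in keep]


def price_range(price):
    """Example mapping (an assumption): split targets at 500."""
    return "high" if price >= 500 else "low"


X = [[50], [60], [70], [80], [300]]
y = [100.0, 120.0, 150.0, 180.0, 900.0]
X_bal, y_bal = balance_regression(X, y, "over", price_range)
```

Here the single "high" target is cloned until both ranges hold four rows each, so a regressor fitted on `X_bal, y_bal` sees the rare range as often as the common one.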

scikitlab.samplers.balancing.VectorBalancer

VectorBalancer(
    deduplicate=False,
    down_sample=False,
    over_sample=False,
    synthesize=False,
    random_state=0,
    **kwargs
)

Balance training data for proper learning. This component can over-sample the non-majority classes near the decision boundary to generate synthetic but relevant examples, as well as under-sample from all classes near the decision boundary to clean up noisy data. In order to interpolate the space, this component operates on vectors rather than raw data points.

Parameters:

Name	Type	Description	Default
`deduplicate`	`bool`	remove duplicate / identical vectors	`False`
`down_sample`	`bool`	remove redundant vectors	`False`
`over_sample`	`bool`	clone minority vectors	`False`
`synthesize`	`bool`	generate borderline synthetic vectors	`False`
`random_state`	`int`	for determinism	`0`
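
To show why operating on vectors makes interpolation possible, here is a simplified standard-library sketch of two of the options above: deduplication and synthesis. The helpers `deduplicate` and `synthesize_minority` are hypothetical, and the synthesis step is a plain SMOTE-style interpolation between random same-class pairs. It does not restrict itself to vectors near the decision boundary, so it should not be read as the algorithm the component actually uses.

```python
import random


def deduplicate(vectors, labels):
    """Hypothetical helper: drop repeated (vector, label) pairs, keeping order."""
    seen = set()
    out_v, out_l = [], []
    for vec, label in zip(vectors, labels):
        key = (tuple(vec), label)
        if key not in seen:
            seen.add(key)
            out_v.append(list(vec))
            out_l.append(label)
    return out_v, out_l


def synthesize_minority(vectors, labels, random_state=0):
    """Hypothetical helper: interpolate new vectors for under-represented classes."""
    rng = random.Random(random_state)
    by_class = {}
    for vec, label in zip(vectors, labels):
        by_class.setdefault(label, []).append(vec)
    goal = max(len(group) for group in by_class.values())

    out_v, out_l = list(vectors), list(labels)
    for label, group in by_class.items():
        for _ in range(goal - len(group)):
            a, b = rng.choice(group), rng.choice(group)
            t = rng.random()
            # A point on the segment between two same-class vectors.
            out_v.append([x + t * (z - x) for x, z in zip(a, b)])
            out_l.append(label)
    return out_v, out_l


vectors = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
           [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]]
labels = ["a", "a", "a", "a", "a", "b", "b"]

unique_v, unique_l = deduplicate(vectors, labels)
balanced_v, balanced_l = synthesize_minority(unique_v, unique_l)
```

The interpolation step is only meaningful for numeric vectors, which matches the note above that the component works on vectors rather than raw data points.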