Property | Value |
URL | https://scikit-learn.org/stable/modules/preprocessing.html |
Created | 2021-06-28T15:54:00.000Z |
Already Read | false |
Name | 6.3. Preprocessing data — scikit-learn 0.24.2 documentation |
parent | References |
Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.
For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variances of the same order of magnitude. If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and prevent the estimator from learning from the other features as expected.
The preprocessing module provides the StandardScaler utility class, which is a quick and easy way to perform the following operation on an array-like dataset:
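A minimal sketch of that operation, assuming a small toy matrix named X_train (the values are illustrative, not taken from a real dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative training matrix: 3 samples, 3 features.
X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# fit() learns the per-feature mean and standard deviation.
scaler = StandardScaler().fit(X_train)
print(scaler.mean_)   # per-feature means
print(scaler.scale_)  # per-feature standard deviations

# transform() subtracts the mean and divides by the standard deviation.
X_scaled = scaler.transform(X_train)
print(X_scaled)
```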
Scaled data has zero mean and unit variance:
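This can be checked numerically. The sketch below reuses the same illustrative toy matrix and compares the column statistics of the scaled output against 0 and 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same illustrative toy matrix as before.
X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

X_scaled = StandardScaler().fit_transform(X_train)

# Column-wise statistics of the scaled data.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(np.allclose(col_means, 0.0))  # means are zero up to rounding
print(np.allclose(col_stds, 1.0))   # standard deviations are one
```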
This class implements the Transformer API to compute the mean and standard deviation on a training set so as to be able to later re-apply the same transformation on the testing set. This class is hence suitable for use in the early steps of a Pipeline:
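One possible pipeline is sketched below. The synthetic dataset from make_classification and the choice of LogisticRegression as the final estimator are assumptions made only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (illustrative only).
X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The scaler is fitted on the training split only; the statistics it
# learned are then re-applied to the test split inside score().
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

accuracy = pipe.score(X_test, y_test)
print(accuracy)
```

Wrapping the scaler in the pipeline keeps test-set statistics from leaking into the fitted mean and standard deviation.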
It is possible to disable either centering or scaling by passing with_mean=False or with_std=False, respectively, to the constructor of StandardScaler.
An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively.
The motivations for using this scaling include robustness to very small standard deviations of features and preservation of zero entries in sparse data.
Here is an example to scale a toy data matrix to the [0, 1] range:
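A minimal sketch, assuming the same illustrative 3x3 toy matrix used for StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative toy matrix: 3 samples, 3 features.
X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

min_max_scaler = MinMaxScaler()  # default feature_range is (0, 1)
X_train_minmax = min_max_scaler.fit_transform(X_train)

# Each column now has minimum 0 and maximum 1.
print(X_train_minmax)
```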
The same instance of the transformer can then be applied to some new test data unseen during the fit call: the same scaling and shifting operations will be applied to be consistent with the transformation performed on the train data:
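As a sketch, with a hypothetical one-row test matrix X_test: values outside the range seen during fit are mapped outside [0, 1], because the scaler reuses the training minimum and maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
min_max_scaler = MinMaxScaler().fit(X_train)

# Hypothetical unseen sample; it is scaled with the training min/max.
X_test = np.array([[-3., -1., 4.]])
X_test_minmax = min_max_scaler.transform(X_test)
print(X_test_minmax)  # first and last entries fall outside [0, 1]
```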
It is possible to introspect the scaler attributes to find out the exact nature of the transformation learned on the training data:
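For example, a fitted MinMaxScaler exposes scale_ (the per-feature multiplier) and min_ (the per-feature offset). The sketch below assumes the same illustrative toy matrix and checks that the transform is simply X * scale_ + min_:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
min_max_scaler = MinMaxScaler().fit(X_train)

print(min_max_scaler.scale_)  # 1 / (max - min) for each feature
print(min_max_scaler.min_)    # offset added after multiplying by scale_

# The learned transformation is affine in X.
X_manual = X_train * min_max_scaler.scale_ + min_max_scaler.min_
same = np.allclose(X_manual, min_max_scaler.transform(X_train))
print(same)
```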
If MinMaxScaler is given an explicit feature_range=(min, max), the full formula is:
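Written out in NumPy, the formula first rescales each column to [0, 1] and then stretches it to the requested range. In this sketch the names lo and hi stand for the min and max of feature_range, and the range (-1, 1) is an arbitrary choice:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
lo, hi = -1.0, 1.0  # illustrative feature_range=(min, max)

# Step 1: map each column to [0, 1].
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# Step 2: stretch and shift to [lo, hi].
X_manual = X_std * (hi - lo) + lo

# Compare with the library result.
X_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X)
print(np.allclose(X_manual, X_sklearn))
```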
MaxAbsScaler works in a very similar fashion, but scales the data so that the training data lies within the range [-1, 1], by dividing each feature by its largest absolute value. It is meant for data that is already centered at zero or for sparse data.
Here is how to use the toy data from the previous example with this scaler:
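A minimal sketch, assuming the same illustrative toy matrix and a hypothetical one-row test matrix:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

max_abs_scaler = MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
print(X_train_maxabs)          # every column now lies within [-1, 1]
print(max_abs_scaler.scale_)   # largest absolute value per feature

# Unseen data is divided by the same per-feature maxima.
X_test = np.array([[-3., -1., 4.]])
X_test_maxabs = max_abs_scaler.transform(X_test)
print(X_test_maxabs)
```

Because no offset is subtracted, zero entries stay zero, which is why this scaler suits sparse data.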
Centering sparse data would destroy the sparseness structure in the data, and thus rarely is a sensible thing to do. However, it can make sense to scale sparse inputs, especially if features are on different scales.
MaxAbsScaler was specifically designed for scaling sparse data, and is the recommended way to go about this. However, StandardScaler can accept scipy.sparse matrices as input, as long as with_mean=False is explicitly passed to the constructor. Otherwise a ValueError will be raised, as silently centering would break the sparsity and would often crash the execution by allocating excessive amounts of memory unintentionally. RobustScaler cannot be fitted to sparse inputs, but you can use the transform method on sparse inputs.
Note that the scalers accept both Compressed Sparse Rows and Compressed Sparse Columns format (see scipy.sparse.csr_matrix and scipy.sparse.csc_matrix). Any other sparse input will be converted to the Compressed Sparse Rows representation. To avoid unnecessary memory copies, it is recommended to choose the CSR or CSC representation upstream.
Finally, if the centered data is expected to be small enough, explicitly converting the input to an array using the toarray method of sparse matrices is another option.
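The sparse-input behaviour described above can be sketched as follows. The small CSR matrix is illustrative, and the try/except only demonstrates the ValueError raised when centering is left enabled:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Illustrative sparse matrix in CSR format (mostly zeros).
X_sparse = sparse.csr_matrix(np.array([[1., 0., 2.],
                                       [0., 0., 3.],
                                       [4., 0., 0.]]))

# MaxAbsScaler keeps the output sparse and leaves zeros untouched.
X_maxabs = MaxAbsScaler().fit_transform(X_sparse)

# StandardScaler works on sparse input only without centering.
X_std = StandardScaler(with_mean=False).fit_transform(X_sparse)

# With the default with_mean=True, fitting sparse input is refused.
try:
    StandardScaler().fit(X_sparse)
    centering_error = None
except ValueError as exc:
    centering_error = exc

print(sparse.issparse(X_maxabs), sparse.issparse(X_std))
print(centering_error)
```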
If your data contains many outliers, scaling using the mean and variance of the data is unlikely to work well. In these cases, you can use RobustScaler as a drop-in replacement. It uses more robust estimates for the center and range of your data: by default, the median and the interquartile range.
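To illustrate the difference, the sketch below scales a single hypothetical feature containing one extreme outlier with both scalers. With default settings RobustScaler centers on the median and divides by the interquartile range:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with four ordinary values and one extreme outlier.
X = np.array([[1.], [2.], [3.], [4.], [1000.]])

robust = RobustScaler().fit(X)
X_robust = robust.transform(X)
print(robust.center_)  # median of the feature
print(robust.scale_)   # interquartile range of the feature

# StandardScaler is dominated by the outlier: the four ordinary
# values end up squeezed into a very narrow interval.
X_standard = StandardScaler().fit_transform(X)
inlier_spread_standard = X_standard[:4].max() - X_standard[:4].min()
inlier_spread_robust = X_robust[:4].max() - X_robust[:4].min()
print(inlier_spread_standard, inlier_spread_robust)
```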
References:
Further discussion on the importance of centering and scaling data is available on this FAQ: Should I normalize/standardize/rescale the data?
Scaling vs Whitening
It is sometimes not enough to center and scale the features independently, since a downstream model can further make some assumption on the linear independence of the features.
To address this issue you can use PCA with whiten=True to further remove the linear correlation across features.
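A minimal sketch of whitening, assuming two synthetic, strongly correlated features generated from a shared latent variable z (the data is made up for illustration). After the transform, the sample covariance is approximately the identity matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Two features driven by the same latent variable, hence correlated.
z = rng.normal(size=(500, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(500, 1)),
               2.0 * z + 0.1 * rng.normal(size=(500, 1))])

# whiten=True rescales each principal component to unit variance.
X_white = PCA(whiten=True).fit_transform(X)

cov_before = np.cov(X, rowvar=False)
cov_after = np.cov(X_white, rowvar=False)
print(cov_before)  # large off-diagonal terms
print(cov_after)   # close to the 2x2 identity matrix
```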
If you have a kernel matrix of a kernel $K$ that computes a dot product in a feature space defined by a function $\phi$, a KernelCenterer can transform the kernel matrix so that it contains inner products in the feature space defined by $\phi$ followed by removal of the mean in that space.
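With a linear kernel the feature map is the identity, so the result can be checked against explicit centering of the data. The sketch below uses an illustrative 3x3 matrix and compares the two routes:

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel
from sklearn.preprocessing import KernelCenterer

X = np.array([[1., -2., 2.],
              [-2., 1., 3.],
              [4., 1., -2.]])

# Linear kernel: K[i, j] = <x_i, x_j>, i.e. phi is the identity map.
K = linear_kernel(X)
K_centered = KernelCenterer().fit_transform(K)

# Equivalent route: center the data first, then take inner products.
X_centered = X - X.mean(axis=0)
K_expected = X_centered @ X_centered.T

print(np.allclose(K_centered, K_expected))
```

For non-linear kernels the feature map is usually not available explicitly, which is where centering the kernel matrix directly becomes useful.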
Two types of transformations are available: quantile transforms and power transforms. Both quantile and power transforms are based on monotonic transformations of the features and thus preserve the rank of the values along each feature.
Quantile transforms put all features into the same desired distribution based on the formula $G^{-1}(F(X))$, where $F$ is the cumulative distribution function of the feature and $G^{-1}$ the quantile function of the desired output distribution $G$. This formula uses the following two facts: (i) if $X$ is a random variable with a continuous cumulative distribution function $F$, then $F(X)$ is uniformly distributed on $[0, 1]$; (ii) if $U$ is a random variable with uniform distribution on $[0, 1]$, then $G^{-1}(U)$ has distribution $G$. By performing a rank transformation, a quantile transform smooths out unusual distributions and is less influenced by outliers than scaling methods. It does, however, distort correlations and distances within and across features.
Power transforms are a family of parametric transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible.
QuantileTransformer provides a non-parametric transformation to map the data to a uniform distribution with values between 0 and 1:
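A sketch on the iris dataset follows. The train/test split and the choice of n_quantiles=100 are assumptions: n_quantiles is set below the number of training samples to avoid a warning. The last line prints the minimum, quartiles, and maximum of the first feature before transformation:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_quantiles must not exceed the number of training samples.
quantile_transformer = QuantileTransformer(n_quantiles=100, random_state=0)
X_train_trans = quantile_transformer.fit_transform(X_train)
X_test_trans = quantile_transformer.transform(X_test)

# Landmarks (0th, 25th, 50th, 75th, 100th percentiles) of feature 0.
raw_landmarks = np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])
print(raw_landmarks)
```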
This feature corresponds to the sepal length in cm. Once the quantile transformation is applied, those landmarks closely approach the percentiles previously defined:
This can be confirmed on an independent testing set with similar remarks:
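Both checks can be sketched together, under the same assumptions as before (iris data, an arbitrary split, n_quantiles=100). On the training set the transformed landmarks should sit near 0, 0.25, 0.5, 0.75, and 1; on the test set the match is looser because the quantiles were estimated from the training data only:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

qt = QuantileTransformer(n_quantiles=100, random_state=0).fit(X_train)
X_train_trans = qt.transform(X_train)
X_test_trans = qt.transform(X_test)

percentiles = [0, 25, 50, 75, 100]
train_landmarks = np.percentile(X_train_trans[:, 0], percentiles)
test_landmarks = np.percentile(X_test_trans[:, 0], percentiles)
print(train_landmarks)  # close to 0, 0.25, 0.5, 0.75, 1
print(test_landmarks)   # similar, but less exact
```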
In many modeling scenarios, normality of the features in a dataset is desirable. Power transforms are a family of parametric, monotonic transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible in order to stabilize variance and minimize skewness.
PowerTransformer currently provides two such power transformations, the Yeo-Johnson transform and the Box-Cox transform.
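The Yeo-Johnson transform is given by:
$$
x_i^{(\lambda)} =
\begin{cases}
\dfrac{(x_i + 1)^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0,\ x_i \geq 0, \\
\ln(x_i + 1) & \text{if } \lambda = 0,\ x_i \geq 0, \\
-\dfrac{(-x_i + 1)^{2 - \lambda} - 1}{2 - \lambda} & \text{if } \lambda \neq 2,\ x_i < 0, \\
-\ln(-x_i + 1) & \text{if } \lambda = 2,\ x_i < 0,
\end{cases}
$$
while the Box-Cox transform is given by:
$$
x_i^{(\lambda)} =
\begin{cases}
\dfrac{x_i^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0, \\
\ln(x_i) & \text{if } \lambda = 0.
\end{cases}
$$
Box-Cox can only be applied to strictly positive data. In both methods, the transformation is parameterized by $\lambda$, which is determined through maximum likelihood estimation. Here is an example of using Box-Cox to map samples drawn from a lognormal distribution to a normal distribution: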
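A minimal sketch, assuming 1000 synthetic lognormal samples (the sample size and seed are arbitrary). Since the logarithm of a lognormal variable is normal, the estimated parameter should land near 0, where Box-Cox reduces to the logarithm:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic, strictly positive, right-skewed data.
rng = np.random.RandomState(616)
X_lognormal = rng.lognormal(size=(1000, 1))

# standardize=False returns the raw Box-Cox output without
# the additional zero-mean, unit-variance rescaling.
pt = PowerTransformer(method="box-cox", standardize=False)
X_trans = pt.fit_transform(X_lognormal)

print(pt.lambdas_)  # one fitted parameter per feature, expected near 0
```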
While the above example sets the standardize option to False, PowerTransformer will apply zero-mean, unit-variance normalization to the transformed output by default.
Below are examples of Box-Cox and Yeo-Johnson applied to various probability distributions. Note that when applied to certain distributions, the power transforms achieve very Gaussian-like results, but with others, they are ineffective. This highlights the importance of visualizing the data before and after transformation.
It is also possible to map data to a normal distribution using QuantileTransformer by setting output_distribution='normal'. Using the earlier example with the iris dataset:
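A sketch, again assuming n_quantiles=100 to stay below the number of samples. The output is finite everywhere and the median of each transformed feature sits near 0:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import QuantileTransformer

X, y = load_iris(return_X_y=True)

quantile_transformer = QuantileTransformer(
    n_quantiles=100, output_distribution="normal", random_state=0)
X_trans = quantile_transformer.fit_transform(X)

# Fitted quantile landmarks, one column per feature.
print(quantile_transformer.quantiles_[:3])

median_sepal_length = np.median(X_trans[:, 0])
print(median_sepal_length)  # close to 0
```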
Thus the median of the input becomes the mean of the output, centered at 0. The normal output is clipped so that the input’s minimum and maximum (corresponding to the 1e-7 and 1 - 1e-7 quantiles respectively) do not become infinite under the transformation.
Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.
This assumption is the basis of the Vector Space Model often used in text classification and clustering contexts.
The function normalize provides a quick and easy way to perform this operation on a single array-like dataset, using either the l1, l2, or max norm:
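A minimal sketch, assuming the same illustrative toy matrix as in the scaling examples. Note that each row (sample) is rescaled, not each column:

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

X_l2 = normalize(X, norm="l2")    # rows have unit Euclidean length
X_l1 = normalize(X, norm="l1")    # absolute values in each row sum to 1
X_max = normalize(X, norm="max")  # largest absolute value per row is 1

print(X_l2)
print(np.linalg.norm(X_l2, axis=1))  # all ones
```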
The preprocessing module further provides a utility class Normalizer that implements the same operation using the Transformer API (even though the fit method is useless in this case: the class is stateless as this operation treats samples independently).
This class is hence suitable for use in the early steps of a Pipeline:
The normalizer instance can then be used on sample vectors as any transformer:
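For instance, in the sketch below (toy matrix and the extra sample are illustrative), fit learns nothing and transform rescales every row to unit l2 norm:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# fit() is a no-op: there are no statistics to learn.
normalizer = Normalizer().fit(X)

X_normalized = normalizer.transform(X)
new_sample = normalizer.transform([[-1., 1., 0.]])

print(X_normalized)
print(new_sample)  # the row is divided by its own norm, sqrt(2)
```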
Note: L2 normalization is also known as spatial sign preprocessing.
Sparse input
normalize and Normalizer accept both dense array-like and sparse matrices from scipy.sparse as input.
For sparse input the data is converted to the Compressed Sparse Rows representation (see scipy.sparse.csr_matrix) before being fed to efficient Cython routines. To avoid unnecessary memory copies, it is recommended to choose the CSR representation upstream.
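A short sketch with an illustrative CSR matrix: the output stays sparse, and a row containing only zeros is left unchanged rather than causing a division by zero.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize

# Illustrative sparse input, already in CSR format; row 1 is all zeros.
X_csr = sparse.csr_matrix(np.array([[3., 0., 4.],
                                    [0., 0., 0.],
                                    [0., 5., 0.]]))

X_norm = normalize(X_csr, norm="l2")
dense = X_norm.toarray()

print(sparse.issparse(X_norm))  # output is still sparse
print(dense)
```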