Skewness in Data#
How to Address Skewness in Data#
Once you’ve identified skewness, several techniques can reduce it and make the data better suited to regression and other modeling tasks. Common methods include:
1. Use Log Transformation (for Positive Skew)#
A log transformation compresses large values and spreads smaller values, making the distribution more symmetric.
Code Example:#
import numpy as np
# Apply log transformation to positively skewed features
log_transformed = np.log1p(industry_df[VALUE_METRICES])  # log(1 + x), applied element-wise
When to use: Non-negative values. np.log1p computes log(1 + x), so zeros are allowed; a plain np.log requires strictly positive values.
Effect: Reduces positive skew (right tail).
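To see the effect on a concrete column, here is a minimal sketch on synthetic data. The DataFrame `df` and the column name `revenue` are hypothetical stand-ins for `industry_df[VALUE_METRICES]`, and the log-normal sample is only an assumption about what a right-skewed feature might look like.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (log-normal); a hypothetical stand-in for a real feature.
rng = np.random.default_rng(0)
df = pd.DataFrame({"revenue": rng.lognormal(mean=0.0, sigma=1.0, size=5000)})

# log1p computes log(1 + x), so a value of 0 maps to 0 instead of -inf.
df["revenue_log"] = np.log1p(df["revenue"])

skew_before = df["revenue"].skew()
skew_after = df["revenue_log"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The transformed column should show a noticeably smaller skewness than the raw one, although how much smaller depends on the data.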
2. Use Square Root Transformation#
The square root transformation can help reduce moderate skewness.
Code Example:#
# Element-wise square root; negative inputs would produce NaN
sqrt_transformed = np.sqrt(industry_df[VALUE_METRICES])
When to use: Works for non-negative data.
Effect: Reduces right skew, but less aggressively than a log transformation.
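The relative strength of the two transformations can be checked directly. The sketch below compares skewness for the raw, square-root, and log versions of one synthetic series; the series `x` and its log-normal distribution are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical strictly positive, right-skewed series.
rng = np.random.default_rng(1)
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000), name="value")

skews = pd.Series({
    "raw": x.skew(),
    "sqrt": np.sqrt(x).skew(),
    "log": np.log(x).skew(),  # plain log is safe here because every value is > 0
})
print(skews.round(2))
```

On data like this, the square root typically lands between the raw and log results, which is why it suits moderate rather than severe skew.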
3. Use Box-Cox Transformation (for Positive Values)#
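The Box-Cox transformation applies a power transformation whose exponent λ is estimated from the data (by maximum likelihood in scipy.stats.boxcox) so that the result is closer to a normal distribution.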
Code Example:#
import pandas as pd
from scipy.stats import boxcox
# Apply Box-Cox to each strictly positive column; columns with zeros or negatives are left unchanged
boxcox_transformed = industry_df[VALUE_METRICES].apply(lambda x: pd.Series(boxcox(x)[0], index=x.index) if (x > 0).all() else x)
When to use: Only for strictly positive data.
Effect: Handles mild to severe skew because λ is fitted per column; λ = 0 corresponds to the log transform.
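Because λ is fitted from the data, it is worth keeping so the transformation can be reapplied or inverted later. A minimal sketch, assuming a synthetic strictly positive array `x` (hypothetical) and using `scipy.special.inv_boxcox` for the inverse:

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox, skew

# Hypothetical strictly positive, right-skewed sample.
rng = np.random.default_rng(2)
x = rng.lognormal(mean=1.0, sigma=0.8, size=5000)

# With lmbda left unset, boxcox returns the transformed data and the fitted lambda.
x_bc, fitted_lambda = boxcox(x)
print(f"fitted lambda: {fitted_lambda:.3f}")
print(f"skew before: {skew(x):.2f}, after: {skew(x_bc):.2f}")

# Keeping lambda lets you map transformed values back to the original scale.
x_back = inv_boxcox(x_bc, fitted_lambda)
```

For log-normal input the fitted λ should come out close to 0, which matches the log transform being the natural choice for such data.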
4. Use PowerTransformer (Yeo-Johnson) for Both Positive and Negative Values#
The Yeo-Johnson transformation is similar to Box-Cox but works for both positive and negative values.
Code Example:#
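import pandas as pd
from sklearn.preprocessing import PowerTransformer
# Fit Yeo-Johnson per column; the output is also standardized (standardize=True by default)
pt = PowerTransformer(method='yeo-johnson')
X_transformed = pd.DataFrame(pt.fit_transform(industry_df[VALUE_METRICES]), columns=VALUE_METRICES, index=industry_df.index)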
When to use: Works with positive and negative data.
Effect: Reduces skewness and stabilizes variance.
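In a modeling workflow, the transformer is usually fitted on training data and then reused on held-out data. The sketch below illustrates that pattern; the columns `profit` and `growth`, the synthetic data, and the 1500/500 split are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed columns, shifted so that they contain negative values.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "profit": rng.lognormal(0.0, 1.0, 2000) - 1.0,
    "growth": rng.exponential(2.0, 2000) - 0.5,
})
train, test = df.iloc[:1500], df.iloc[1500:]

# Fit lambdas on the training rows only, then reuse them for the test rows.
pt = PowerTransformer(method="yeo-johnson")
train_t = pd.DataFrame(pt.fit_transform(train), columns=df.columns, index=train.index)
test_t = pd.DataFrame(pt.transform(test), columns=df.columns, index=test.index)

print("fitted lambdas:", pt.lambdas_)
print(pd.DataFrame({"before": train.skew(), "after": train_t.skew()}).round(2))
```

Fitting on the training split alone keeps information from the test rows out of the estimated lambdas.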
5. Use RobustScaler if Outliers are the Cause#
Rather than reshaping the distribution, RobustScaler centers each feature on its median and scales it by the interquartile range (IQR), so outliers have little influence on the scaling parameters.
Code Example:#
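import pandas as pd
from sklearn.preprocessing import RobustScaler
# Center on the median and scale by the IQR, keeping the original row index
scaler = RobustScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(industry_df[VALUE_METRICES]), columns=VALUE_METRICES, index=industry_df.index)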
When to use: When outliers would distort mean and standard-deviation scaling.
Effect: A linear rescaling, so the skewness coefficient is unchanged; the outliers remain but no longer dominate the feature's scale.
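Since this scaler only shifts and rescales each column, the shape of the distribution stays the same. The sketch below checks that on a synthetic column (the name `value` and the data are hypothetical): skewness is identical before and after, while the median moves to 0 and the IQR becomes 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical right-skewed feature.
rng = np.random.default_rng(4)
x = pd.DataFrame({"value": rng.lognormal(0.0, 1.0, 2000)})

scaled = pd.DataFrame(
    RobustScaler().fit_transform(x), columns=x.columns, index=x.index
)

skew_raw = x["value"].skew()
skew_scaled = scaled["value"].skew()
iqr_scaled = scaled["value"].quantile(0.75) - scaled["value"].quantile(0.25)

print(f"skew raw: {skew_raw:.4f}, skew scaled: {skew_scaled:.4f}")
print(f"scaled median: {scaled['value'].median():.4f}, scaled IQR: {iqr_scaled:.4f}")
```

If the goal is to reduce skewness itself, one option is to apply one of the transformations above first and scale afterwards.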
6. Handle Skewness by Clipping Outliers#
Clip values that fall outside chosen percentiles (sometimes called winsorizing) to limit the impact of outliers.
Code Example:#
# Clip values to the 1st and 99th percentiles
clipped_df = industry_df[VALUE_METRICES].apply(lambda x: x.clip(lower=x.quantile(0.01), upper=x.quantile(0.99)))
When to use: When skewness is caused by extreme outliers.
Effect: Reduces skew while leaving every value inside the chosen percentiles unchanged.
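A quick way to confirm that clipping touches only the tails is to count how many values actually change. This is a minimal sketch on a synthetic series with a few injected extreme values; the series `s` and the injected numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical roughly symmetric data with a handful of extreme values injected.
rng = np.random.default_rng(5)
s = pd.Series(rng.normal(100.0, 10.0, 1000))
s.iloc[:5] = [400.0, 450.0, 500.0, 550.0, 600.0]

# Clip to the 1st and 99th percentiles of the observed data.
lower, upper = s.quantile(0.01), s.quantile(0.99)
clipped = s.clip(lower=lower, upper=upper)

changed = int((clipped != s).sum())
print(f"values changed: {changed} of {len(s)}")
print(f"skew before: {s.skew():.2f}, after: {clipped.skew():.2f}")
```

The percentile bounds are a judgment call: tighter bounds remove more skew but alter more observations.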
7. Evaluate Results#
After applying any transformation, recalculate skewness to check improvement.
Code Example:#
# X_transformed is the Yeo-Johnson output from step 4; substitute any transformed DataFrame
transformed_skewness = X_transformed.skew()
print("Skewness after transformation:")
print(transformed_skewness)
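Comparing skewness before and after in one table makes the improvement easier to judge. The helper `skew_report` below is a hypothetical convenience function, not part of pandas, and the synthetic columns `a` and `b` are placeholders for real features.

```python
import numpy as np
import pandas as pd

def skew_report(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Return per-column skewness before and after, plus the drop in absolute skew."""
    report = pd.DataFrame({"before": before.skew(), "after": after.skew()})
    report["abs_reduction"] = report["before"].abs() - report["after"].abs()
    return report

# Hypothetical right-skewed features.
rng = np.random.default_rng(6)
raw = pd.DataFrame({
    "a": rng.lognormal(0.0, 1.0, 3000),
    "b": rng.exponential(1.0, 3000),
})

report = skew_report(raw, np.log1p(raw))
print(report.round(2))
```

A positive `abs_reduction` means the column moved closer to symmetry; a negative value suggests the transformation overshot or was not appropriate for that column.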
Summary Table of Methods#
| Transformation | Use Case | Handles Negative? | Handles Zero? |
|---|---|---|---|
| Log Transformation | Positive skew, large values | No | Only with log1p |
| Square Root | Moderate positive skew | No | Yes |
| Box-Cox | Positive skew | No | No |
| Yeo-Johnson | Positive/negative skew | Yes | Yes |
| RobustScaler | Outliers distorting feature scale (skewness unchanged) | Yes | Yes |
| Clipping | Outliers causing skew | Yes | Yes |