Skewness in Data#

How to Address Skewness in Data#

Once you’ve identified skewness, there are several techniques to transform the data and reduce skewness, improving the suitability for regression or other modeling tasks. Here are some common methods:

1. Use Log Transformation (for Positive Skew)#

A log transformation compresses large values and spreads out small ones, which pulls in a long right tail and makes the distribution more symmetric.

Code Example:#

import numpy as np

# Apply log transformation to positively skewed features
log_transformed = industry_df[VALUE_METRICES].apply(lambda x: np.log1p(x))  # log(1 + x)

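When to use: Only for non-negative values. np.log1p computes log(1 + x), so zeros are allowed; a plain np.log would require strictly positive values.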
Effect: Reduces positive skew (right tail).
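To make the effect concrete, here is a minimal sketch on synthetic data. The `revenue` column and the log-normal sample are assumptions for illustration only; they stand in for a positively skewed feature such as those in `industry_df`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (log-normal); the column name is hypothetical
rng = np.random.default_rng(0)
demo = pd.DataFrame({"revenue": rng.lognormal(mean=0.0, sigma=1.0, size=5000)})

# Sample skewness before and after log(1 + x)
skew_before = demo["revenue"].skew()
skew_after = np.log1p(demo["revenue"]).skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

The skewness should drop noticeably, although log1p rarely brings it all the way to zero when many values are close to zero.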

2. Use Square Root Transformation#

The square root transformation can help reduce moderate skewness.

Code Example:#

sqrt_transformed = industry_df[VALUE_METRICES].apply(lambda x: np.sqrt(x))

When to use: Works for non-negative data.
Effect: Reduces right skew, but less aggressively than a log transformation.
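The difference in strength between the two transformations can be checked side by side. This is a sketch on a synthetic log-normal sample (an assumption, not the original dataset), with values large enough that log1p behaves almost like a plain log.

```python
import numpy as np
import pandas as pd

# Synthetic non-negative, right-skewed values (illustrative only)
rng = np.random.default_rng(1)
values = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

# Skewness of the raw values and of each transformed version
comparison = pd.Series({
    "raw": values.skew(),
    "sqrt": np.sqrt(values).skew(),
    "log1p": np.log1p(values).skew(),
})
print(comparison.round(2))
```

On data like this the square root leaves a clearly positive skew, while the log brings it close to zero; for mildly skewed data the gentler square root can be the better fit.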

3. Use Box-Cox Transformation (for Positive Values)#

The Box-Cox transformation applies a parameterized power transformation to make data more normal.

Code Example:#

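import pandas as pd
from scipy.stats import boxcox

# Apply Box-Cox to each strictly positive column; leave other columns unchanged.
# boxcox returns (transformed_values, fitted_lambda), so [0] keeps the values.
boxcox_transformed = industry_df[VALUE_METRICES].apply(
    lambda x: pd.Series(boxcox(x)[0], index=x.index) if (x > 0).all() else x
)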

When to use: Only for strictly positive data.
Effect: Handles various degrees of skewness, because the power parameter (lambda) is estimated from the data rather than fixed in advance.
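It can help to inspect the fitted parameter, since it indicates which power the data "asked for" (a value near 0 corresponds to a log transform, near 0.5 to a square root). The sketch below assumes a synthetic log-normal sample rather than the original columns.

```python
import numpy as np
from scipy.stats import boxcox, skew

# Synthetic strictly positive, right-skewed sample (illustrative only)
rng = np.random.default_rng(2)
sample = rng.lognormal(mean=1.0, sigma=0.8, size=3000)

# With no lmbda argument, boxcox estimates lambda by maximum likelihood
transformed, fitted_lambda = boxcox(sample)

print(f"fitted lambda: {fitted_lambda:.3f}")
print(f"skew before: {skew(sample):.2f}, after: {skew(transformed):.2f}")
```

Because the sample is log-normal, the fitted lambda should land near 0, which is the case where Box-Cox reduces to a log transform.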

4. Use PowerTransformer (Yeo-Johnson) for Both Positive and Negative Values#

The Yeo-Johnson transformation is similar to Box-Cox but works for both positive and negative values.

Code Example:#

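import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Fit Yeo-Johnson per column; keep the original index so rows stay aligned
pt = PowerTransformer(method='yeo-johnson')
X_transformed = pd.DataFrame(pt.fit_transform(industry_df[VALUE_METRICES]), columns=VALUE_METRICES, index=industry_df.index)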

When to use: Works with positive and negative data.
Effect: Reduces skewness and stabilizes variance.
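As a rough illustration, the sketch below applies Yeo-Johnson to a synthetic column that contains negative values (the `profit` name and the data are assumptions). It also shows the fitted `lambdas_` attribute and `inverse_transform`, which is useful when results need to be mapped back to the original scale.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical "profit" column: right-skewed, shifted so some values are negative
rng = np.random.default_rng(3)
demo = pd.DataFrame({"profit": rng.lognormal(0.0, 1.0, size=3000) - 1.0})

# standardize=True (the default) also rescales to zero mean and unit variance
pt = PowerTransformer(method="yeo-johnson")
transformed = pd.DataFrame(
    pt.fit_transform(demo), columns=demo.columns, index=demo.index
)

print("fitted lambda per column:", pt.lambdas_)
print("skew before:", round(demo["profit"].skew(), 2))
print("skew after:", round(transformed["profit"].skew(), 2))

# Map the transformed values back to the original scale
restored = pt.inverse_transform(transformed)
```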

5. Use RobustScaler if Outliers are the Cause#

Instead of transforming the data, RobustScaler reduces the influence of outliers by scaling based on the median and IQR.

Code Example:#

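import pandas as pd
from sklearn.preprocessing import RobustScaler

# Center each column on its median and scale by its interquartile range (IQR)
scaler = RobustScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(industry_df[VALUE_METRICES]), columns=VALUE_METRICES, index=industry_df.index)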

When to use: When skewness is due to outliers.
Effect: Preserves the shape of the distribution (the skewness value itself does not change) but keeps outliers from dominating the center and scale of each feature.
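Because RobustScaler is a linear rescaling, it is worth verifying that it leaves skewness untouched. The following sketch uses a synthetic `metric` column (an assumption for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Synthetic right-skewed feature (illustrative only)
rng = np.random.default_rng(4)
demo = pd.DataFrame({"metric": rng.lognormal(0.0, 1.0, size=2000)})

scaled = pd.DataFrame(
    RobustScaler().fit_transform(demo), columns=demo.columns, index=demo.index
)

# Skewness is invariant under a positive linear rescaling
print("skew before:", demo["metric"].skew())
print("skew after: ", scaled["metric"].skew())

# The scaled column has median 0 and interquartile range 1
iqr_after = scaled["metric"].quantile(0.75) - scaled["metric"].quantile(0.25)
print("median:", scaled["metric"].median(), "IQR:", iqr_after)
```

If the goal is a more symmetric distribution rather than a robust scale, one of the transformations above is still needed.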

6. Handle Skewness by Clipping Outliers#

Clip extreme values to a specific percentile to reduce the impact of outliers.

Code Example:#

# Clip values to the 1st and 99th percentiles
clipped_df = industry_df[VALUE_METRICES].apply(lambda x: x.clip(lower=x.quantile(0.01), upper=x.quantile(0.99)))

When to use: When skewness is caused by extreme outliers.
Effect: Reduces skew without changing most data.
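The sketch below shows how few rows clipping actually touches. The data is synthetic: a roughly symmetric sample with five injected outliers, all of which are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Roughly symmetric data with a handful of injected extreme values
rng = np.random.default_rng(5)
values = pd.Series(rng.normal(50.0, 5.0, size=1000))
values.iloc[:5] = [500.0, 650.0, 800.0, 900.0, 1000.0]

# Clip to the 1st and 99th percentiles
lower, upper = values.quantile(0.01), values.quantile(0.99)
clipped = values.clip(lower=lower, upper=upper)

changed = int((clipped != values).sum())
print(f"rows changed: {changed} of {len(values)}")
print(f"skew before: {values.skew():.2f}, after: {clipped.skew():.2f}")
```

Only the values outside the two percentile bounds are modified, which is why this approach suits data where a few extreme points drive the skew.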

7. Evaluate Results#

After applying any transformation, recalculate skewness to check improvement.

Code Example:#

transformed_skewness = X_transformed.skew()
print("Skewness after transformation:")
print(transformed_skewness)
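A before/after comparison is often easier to read than the transformed skewness alone. The helper below is a hypothetical sketch (the name `skew_report` and the synthetic columns are not from the original); it places both values side by side for each column.

```python
import numpy as np
import pandas as pd

def skew_report(original: pd.DataFrame, transformed: pd.DataFrame) -> pd.DataFrame:
    """Per-column skewness before and after a transformation."""
    report = pd.DataFrame({"before": original.skew(), "after": transformed.skew()})
    # Negative values mean the column moved closer to symmetry
    report["abs_change"] = report["after"].abs() - report["before"].abs()
    return report

# Synthetic right-skewed columns (illustrative only)
rng = np.random.default_rng(6)
demo = pd.DataFrame({
    "a": rng.lognormal(0.0, 1.0, size=2000),
    "b": rng.exponential(1.0, size=2000),
})

report = skew_report(demo, np.log1p(demo))
print(report.round(2))
```

The same helper could be called with `industry_df[VALUE_METRICES]` and any of the transformed frames above, provided the column names match.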

Summary Table of Methods#

Transformation	Use Case	Handles Negative?	Handles Zero?
Log Transformation	Positive skew, large values	No	Only with log1p
Square Root	Moderate positive skew	No	Yes
Box-Cox	Positive skew	No	No
Yeo-Johnson	Positive/negative skew	Yes	Yes
RobustScaler	Outliers causing skew	Yes	Yes
Clipping	Outliers causing skew	Yes	Yes
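The sign constraints in the table can also be expressed as a small check on a column's minimum value. The function below is a hypothetical helper, not part of any library, and the method labels are illustrative names only.

```python
import pandas as pd

def applicable_transforms(values: pd.Series) -> list[str]:
    """List the methods whose sign constraints the data satisfies."""
    minimum = values.min()
    # These methods place no restriction on sign
    options = ["yeo-johnson", "robust-scaler", "clipping"]
    if minimum >= 0:
        # Zeros are fine for sqrt and for log(1 + x)
        options += ["sqrt", "log1p"]
    if minimum > 0:
        # Plain log and Box-Cox need strictly positive data
        options += ["log", "box-cox"]
    return options

print(applicable_transforms(pd.Series([-1.0, 2.0, 5.0])))
print(applicable_transforms(pd.Series([0.0, 1.0, 4.0])))
print(applicable_transforms(pd.Series([1.0, 2.0, 3.0])))
```

This only checks which methods are valid for the data; choosing among them still depends on how strong the skew is and whether outliers are the cause.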
