Skewness in Data#

How to Address Skewness in Data#

Once you have identified skewness, you can apply one of several transformations to reduce it and make the data better suited to regression and other modeling tasks. Common methods include:

1. Use Log Transformation (for Positive Skew)#

A log transformation compresses large values and spreads out small ones, which pulls in a long right tail and makes the distribution more symmetric.

Code Example:#

import numpy as np

# Apply log(1 + x) to positively skewed features; log1p is defined at x = 0
log_transformed = np.log1p(industry_df[VALUE_METRICES])
  • When to use: Only for non-negative values (log1p requires x > -1; a plain log requires x > 0).

  • Effect: Reduces positive skew (right tail).
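As a rough illustration of the effect, the sketch below applies log1p to synthetic log-normal data and compares skewness before and after. The `revenue` series and its parameters are made up for the example and are not part of `industry_df`.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed data (log-normal); purely illustrative.
rng = np.random.default_rng(0)
revenue = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5_000), name="revenue")

# log(1 + x) compresses the long right tail.
log_revenue = np.log1p(revenue)

skew_before = revenue.skew()
skew_after = log_revenue.skew()
print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")

# expm1 inverts log1p, so values can be mapped back to the original scale.
recovered = np.expm1(log_revenue)
```

Because `np.expm1` is the exact inverse of `np.log1p`, model outputs on the log scale can be converted back to the original units.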

2. Use Square Root Transformation#

The square root transformation can help reduce moderate positive skewness.

Code Example:#

# Take the element-wise square root; negative inputs would produce NaN
sqrt_transformed = np.sqrt(industry_df[VALUE_METRICES])
  • When to use: Works for non-negative data.

  • Effect: Reduces right skew, but less aggressively than a log transformation.
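To see how the square root compares with the log on the same data, a minimal sketch on a synthetic feature follows. The distribution and its parameters are assumptions chosen only so that the values are large and strictly positive.

```python
import numpy as np
import pandas as pd

# Hypothetical strictly positive, right-skewed feature.
rng = np.random.default_rng(1)
x = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5_000))

# Skewness of the raw data and of each transformed version.
skews = pd.Series({
    "raw": x.skew(),
    "sqrt": np.sqrt(x).skew(),
    "log1p": np.log1p(x).skew(),
})
print(skews.round(2))
```

On data like this, the square root typically lands between the raw values and the log-transformed values, which is why it suits milder skew.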

3. Use Box-Cox Transformation (for Positive Values)#

The Box-Cox transformation applies a power transformation whose parameter (lambda) is estimated from the data so that the result is as close to normally distributed as possible.

Code Example:#

from scipy.stats import boxcox

# Apply Box-Cox to each strictly positive column; leave any other column unchanged
boxcox_transformed = industry_df[VALUE_METRICES].apply(lambda x: boxcox(x)[0] if (x > 0).all() else x)
  • When to use: Only for strictly positive data.

  • Effect: Handles various degrees of skewness.
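The fitted lambda is worth inspecting, since it indicates which power was chosen (a value near 0 behaves like a log, a value near 0.5 like a square root). The sketch below, on synthetic gamma-distributed data chosen for illustration, shows how to retrieve lambda and how to invert the transformation with `scipy.special.inv_boxcox`.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox, skew

# Synthetic strictly positive, right-skewed sample (illustrative only).
rng = np.random.default_rng(2)
x = rng.gamma(shape=2.0, scale=3.0, size=5_000)

# With no lambda supplied, boxcox returns the transformed data and the
# maximum-likelihood estimate of lambda.
# Transform: (x**lam - 1) / lam for lam != 0, and log(x) for lam == 0.
x_bc, lam = boxcox(x)
print(f"fitted lambda: {lam:.3f}")
print(f"skew before: {skew(x):.2f}, after: {skew(x_bc):.2f}")

# Map transformed values back to the original scale.
x_back = inv_boxcox(x_bc, lam)
```

Keeping the fitted lambda around matters in practice: the same value should be reused for new data rather than re-estimated.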

4. Use PowerTransformer (Yeo-Johnson) for Both Positive and Negative Values#

The Yeo-Johnson transformation is similar to Box-Cox but works for both positive and negative values.

Code Example:#

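import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Fit one Yeo-Johnson lambda per column; the output is standardized by default
pt = PowerTransformer(method='yeo-johnson')
X_transformed = pd.DataFrame(pt.fit_transform(industry_df[VALUE_METRICES]), columns=VALUE_METRICES, index=industry_df.index)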
  • When to use: Works with positive and negative data.

  • Effect: Reduces skewness and stabilizes variance.
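When the transformed features feed a model, one reasonable pattern is to fit the transformer on the training split only and reuse it on the test split. The sketch below assumes a hypothetical two-column frame (`profit`, `growth`) with negative values; none of these names come from the original dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

# Hypothetical data: "profit" is right-skewed and contains negative values.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "profit": rng.lognormal(size=2_000) - 1.0,
    "growth": rng.normal(size=2_000),
})
train, test = train_test_split(df, test_size=0.25, random_state=0)

# Fit lambdas on the training data only, then apply them to the test data.
pt = PowerTransformer(method="yeo-johnson")
train_t = pd.DataFrame(pt.fit_transform(train), columns=df.columns, index=train.index)
test_t = pd.DataFrame(pt.transform(test), columns=df.columns, index=test.index)

# One fitted lambda per column.
print(dict(zip(df.columns, pt.lambdas_.round(3))))
print(train_t.skew().round(2))
```

Fitting on the training split alone keeps information from the test set out of the estimated lambdas.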

5. Use RobustScaler if Outliers are the Cause#

RobustScaler does not change the shape of the distribution. It centers each feature on its median and scales it by its interquartile range (IQR), so the scaling parameters themselves are not distorted by outliers.

Code Example:#

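import pandas as pd
from sklearn.preprocessing import RobustScaler

# Center each column on its median and scale by its interquartile range
scaler = RobustScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(industry_df[VALUE_METRICES]), columns=VALUE_METRICES, index=industry_df.index)
  • When to use: When outliers would distort mean- and variance-based scaling.

  • Effect: Preserves the shape of the distribution, including its skewness, while keeping outliers from dominating the center and scale.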
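Because RobustScaler is a linear rescaling, the skewness statistic itself is unchanged by it. A small check on synthetic data (columns `a` and `b` are made up for the example) makes this concrete:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Synthetic right-skewed columns, for illustration only.
rng = np.random.default_rng(4)
X = pd.DataFrame({"a": rng.lognormal(size=1_000), "b": rng.exponential(size=1_000)})

X_scaled = pd.DataFrame(RobustScaler().fit_transform(X), columns=X.columns, index=X.index)

# Skewness is invariant under shifting and positive rescaling,
# so the "before" and "after" columns match.
comparison = pd.DataFrame({"before": X.skew(), "after": X_scaled.skew()})
print(comparison.round(3))
```

If the goal is to reduce skewness itself, RobustScaler is best paired with one of the transformations above rather than used alone.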

6. Handle Skewness by Clipping Outliers#

Clip extreme values to chosen lower and upper percentiles (also known as winsorizing) to reduce the impact of outliers.

Code Example:#

# Clip values to the 1st and 99th percentiles
clipped_df = industry_df[VALUE_METRICES].apply(lambda x: x.clip(lower=x.quantile(0.01), upper=x.quantile(0.99)))
  • When to use: When skewness is caused by extreme outliers.

  • Effect: Reduces skew without changing most data.
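The following sketch shows clipping on a single synthetic series with a few injected outliers and counts how many values are actually altered. The data and the outlier values are invented for the example.

```python
import numpy as np
import pandas as pd

# Roughly symmetric data plus a handful of extreme values.
rng = np.random.default_rng(5)
s = pd.Series(rng.normal(loc=100, scale=10, size=1_000))
s.iloc[:5] = [400, 450, 500, 600, 800]  # injected outliers

# Clip to the 1st and 99th percentiles.
lower, upper = s.quantile([0.01, 0.99])
clipped = s.clip(lower=lower, upper=upper)

changed = int((clipped != s).sum())
print(f"skew before: {s.skew():.2f}, after: {clipped.skew():.2f}, values changed: {changed}")
```

Only the values outside the two percentile bounds are modified; everything in between is left exactly as it was.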

7. Evaluate Results#

After applying any transformation, recalculate skewness to check improvement.

Code Example:#

transformed_skewness = X_transformed.skew()
print("Skewness after transformation:")
print(transformed_skewness)
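A side-by-side view often makes the comparison easier to read than two separate printouts. The helper below is a hypothetical convenience function (`skew_report` is not part of pandas), demonstrated on synthetic data:

```python
import numpy as np
import pandas as pd

def skew_report(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Tabulate per-column skewness before and after a transformation."""
    report = pd.DataFrame({"before": before.skew(), "after": after.skew()})
    # Negative values mean the column moved closer to symmetry.
    report["abs_change"] = report["after"].abs() - report["before"].abs()
    return report

# Synthetic right-skewed columns, for illustration only.
rng = np.random.default_rng(6)
raw = pd.DataFrame({"x": rng.lognormal(size=2_000), "y": rng.exponential(size=2_000)})

report = skew_report(raw, np.log1p(raw))
print(report.round(2))
```

A skewness close to 0 indicates an approximately symmetric column; columns whose `abs_change` is not negative may need a different transformation.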

Summary Table of Methods#

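| Transformation | Use Case | Handles Negative? | Handles Zero? |
| --- | --- | --- | --- |
| Log Transformation | Positive skew, large values | No | Yes with log1p; no with a plain log |
| Square Root | Moderate positive skew | No | Yes |
| Box-Cox | Positive skew | No | No |
| Yeo-Johnson | Positive/negative skew | Yes | Yes |
| RobustScaler | Outliers (rescales only; does not change skew) | Yes | Yes |
| Clipping | Outliers causing skew | Yes | Yes |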

