Data Imputation#

Imputation methods fill in missing values in a dataset so that analyses can use every record instead of discarding incomplete ones. This section gives an overview of common imputation methods, with Python code examples that use pandas, scikit-learn, and statsmodels.

1. Simple Imputation Methods#

These methods are straightforward and assume the missing data can be approximated by basic statistics.

Mean Imputation#

Replaces missing values with the mean of the observed values for a variable.

Example:

Suppose you have a dataset of test scores for a class of students, but one student’s score is missing.

Dataset: [85, 90, 78, NaN, 88]
Imputation: The mean score is (85 + 90 + 78 + 88) / 4 = 85.25
Imputed Dataset: [85, 90, 78, 85.25, 88]

import pandas as pd
import numpy as np

# Create a sample dataset
data = {'score': [85, 90, 78, np.nan, 88]}
df = pd.DataFrame(data)

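# Impute missing values with the mean (mean() skips NaN by default).
# Assigning the result back avoids chained in-place modification,
# which recent pandas versions warn about or ignore.
df['score'] = df['score'].fillna(df['score'].mean())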
print(df)

Median Imputation#

Uses the median instead of the mean, useful for skewed distributions.

Example:

You have a dataset of household incomes, but one income is missing.

Dataset: [45000, 52000, 60000, NaN, 75000]
Imputation: The median of the four observed incomes is (52000 + 60000) / 2 = 56000.
Imputed Dataset: [45000, 52000, 60000, 56000, 75000]

Explanation

When you calculate the median, it’s essential to consider all the non-missing values (excluding np.nan), and then find the middle value of the sorted list.

Remove the missing value (np.nan) and sort the remaining values:
```
[45000, 52000, 60000, 75000]
```
Calculate the median:
- For an even number of values (4 values in this case), the median is the average of the two middle numbers.
- The middle values are 52000 and 60000.
- So, the median is:
  
  \[\text{Median} = \frac{52000 + 60000}{2} = \frac{112000}{2} = 56000\]

Thus df['income'].median(), which skips NaN by default, returns 56000 rather than one of the observed values.

Updated DataFrame: After imputing the missing value with the median 56000, your DataFrame will look like this:

   income
45000.0
52000.0
60000.0
56000.0
75000.0

# Create a sample dataset
data = {'income': [45000, 52000, 60000, np.nan, 75000]}
df = pd.DataFrame(data)

# Impute missing values with the median (assign back instead of inplace=True)
df['income'] = df['income'].fillna(df['income'].median())
print(df)

Mode Imputation#

Fills missing values with the mode, commonly used for categorical data.

Example

You have a dataset of customer preferences, but one preference is missing.

Dataset: ['Red', 'Blue', 'Red', 'Green', NaN]
Imputation: The mode (most frequent) value is ‘Red’.
Imputed Dataset: ['Red', 'Blue', 'Red', 'Green', 'Red']

# Impute missing values with mode (for categorical data)
data = {'color': ['Red', 'Blue', 'Red', 'Green', np.nan]}
df = pd.DataFrame(data)
# mode() returns a Series (there may be ties); [0] takes the first mode
df['color'] = df['color'].fillna(df['color'].mode()[0])
print(df)

   color
0    Red
1   Blue
2    Red
3  Green
4    Red

Constant Imputation#

Assigns a predefined constant (e.g., 0 or -1) to missing values.

Example

A survey includes a question on age, but one response is missing. You decide to impute with a constant value of 0 (assuming the person skipped that question).

Dataset: [25, 30, NaN, 22, 28]
Imputation: Replace NaN with 0.
Imputed Dataset: [25, 30, 0, 22, 28]

# Create a sample dataset (the age survey from the example above)
data = {'age': [25, 30, np.nan, 22, 28]}
df = pd.DataFrame(data)
# Impute missing values with a constant value
df['age'] = df['age'].fillna(0)
print(df)

2. Advanced Statistical Imputation Methods#

These methods consider relationships between variables for more accurate imputations.

Linear Regression Imputation#

Predicts missing values using a regression model based on other variables.

Example: You have a dataset where one feature is missing, but it’s correlated with another feature.
Dataset: [Height (cm), Weight (kg)]
Suppose you know that Weight can be predicted based on Height (e.g., through a linear regression model Weight = 0.5 * Height + 10). If Weight is missing for Height = 170 cm, use the regression to impute it.
Imputation: Weight = 0.5 * 170 + 10 = 95 kg
Imputed Dataset: [170, 95] for the missing Weight value.

Logistic Regression Imputation#

Used for categorical data where the missing values are predicted with logistic regression.
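As a rough illustration of the idea, the sketch below fits a logistic regression on the rows where the category is observed and predicts it for the rows where it is missing. The column names (`age`, `bmi`, `smoker`) and the values are hypothetical, chosen only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 'smoker' is categorical and has two missing entries
df = pd.DataFrame({
    'age':    [22, 25, 31, 38, 45, 52, 58, 63],
    'bmi':    [21.0, 22.5, 24.0, 27.5, 29.0, 30.5, 31.0, 33.0],
    'smoker': ['no', 'no', 'no', np.nan, 'yes', 'yes', np.nan, 'yes'],
})

features = ['age', 'bmi']
missing = df['smoker'].isna()

# Fit the classifier on rows where the category is observed
clf = LogisticRegression(max_iter=1000)
clf.fit(df.loc[~missing, features], df.loc[~missing, 'smoker'])

# Predict the most probable category for rows where it is missing
df.loc[missing, 'smoker'] = clf.predict(df.loc[missing, features])
print(df)
```

Taking the most probable class gives a single deterministic fill. Drawing the class from `clf.predict_proba` instead would preserve more of the uncertainty, at the cost of run-to-run variation.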

K-Nearest Neighbors (KNN) Imputation#

Replaces missing values by averaging or taking the mode of the nearest neighbors based on distance metrics.

Example: You have a dataset of patients with missing values for their cholesterol levels. The missing value is imputed by finding the nearest neighbors (patients with similar characteristics, like age and weight) and averaging their cholesterol levels.
Dataset: [Age, Weight, Cholesterol]
For a patient with missing cholesterol data, the algorithm looks for the closest patients in terms of age and weight and takes the average cholesterol level of those patients.

Multivariate Imputation by Chained Equations (MICE)#

Iteratively imputes missing data using a set of regression models for each variable with missing data.

Example: You have a dataset with missing values across several variables (e.g., Income, Education, Age).
MICE will impute each missing value using a regression model that accounts for other variables. It repeats this process multiple times until convergence is reached.
The imputed values depend on the relationships between all variables, which results in more accurate estimates.

3. Machine Learning-Based Imputation#

These methods leverage machine learning models for more sophisticated imputations.

Random Forest Imputation: Uses random forests to predict missing values based on other data in the dataset.
Gradient Boosting Machines (GBM) Imputation: Similar to random forests but uses boosting algorithms for prediction.
Deep Learning Imputation: Neural networks can be trained to predict missing values, especially in complex datasets.

4. Multiple Imputation#

Instead of filling in a single value, multiple imputation generates several possible datasets by imputing values multiple times and then combines the results.

Rubin’s Multiple Imputation: Imputes missing data multiple times, runs analyses on each complete dataset, and pools the results.

5. Time-Series Specific Imputation#

For time-series data, the following methods are often used:

Last Observation Carried Forward (LOCF): Replaces missing values with the last observed value.
Next Observation Carried Backward (NOCB): Uses the next available observation.
Interpolation: Estimates missing values based on linear, spline, or polynomial interpolation.
Seasonal Decomposition of Time Series (STL) Imputation: Decomposes the time series and imputes based on the trend, seasonality, and residuals.
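To make the difference between the first three options concrete, here is a small sketch that applies LOCF, NOCB, and linear interpolation to the same series. The sales figures are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales with two gaps
sales = pd.Series(
    [200, 210, np.nan, 250, np.nan, 270],
    index=pd.date_range('2024-01-01', periods=6, freq='D'),
)

locf = sales.ffill()                          # last observation carried forward
nocb = sales.bfill()                          # next observation carried backward
linear = sales.interpolate(method='linear')   # straight line between neighbors

print(pd.DataFrame({'raw': sales, 'LOCF': locf, 'NOCB': nocb, 'linear': linear}))
```

LOCF and NOCB each copy a neighboring value, so they produce flat steps, while linear interpolation places the filled value on the line between the two surrounding observations.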

6. Probabilistic Imputation Methods#

These methods use probability distributions to estimate missing values.

Expectation-Maximization (EM) Algorithm: Estimates missing data by iteratively updating the expected value.
Bayesian Imputation: Draws imputations from a posterior distribution based on prior knowledge.

7. Domain-Specific Imputation#

In certain domains, imputation is tailored to the data’s characteristics:

Genetic Data Imputation: Uses linkage disequilibrium patterns to predict missing genotypes.
Spatial Data Imputation: Employs geostatistical techniques like Kriging.

Choosing the Right Method#

Simple imputation works for small datasets with limited missingness.
Advanced and ML-based methods are suitable for larger, complex datasets where relationships between variables are crucial.
Time-series methods work well when data is sequential.
Multiple imputation is recommended when preserving variability and uncertainty is important.
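Before choosing among these options, it often helps to measure how much is actually missing. The following sketch, on a small made-up DataFrame, reports the share of missing values per column and the number of incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with gaps in two columns
df = pd.DataFrame({
    'age':    [25, 30, np.nan, 22, 28],
    'income': [45000, np.nan, 60000, np.nan, 75000],
    'color':  ['Red', 'Blue', 'Red', 'Green', 'Red'],
})

# Fraction of missing values in each column
missing_share = df.isna().mean()
print(missing_share)

# Number of rows with at least one missing value
rows_with_gaps = int(df.isna().any(axis=1).sum())
print(rows_with_gaps)
```

A column with only a few percent missing is a reasonable candidate for simple imputation, whereas a large share of incomplete rows is a sign that a model-based or multiple-imputation approach deserves consideration.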

Example#

3. Machine Learning-Based Imputation#

Random Forest Imputation
- Example: In a dataset of customer information, some values for Annual Income are missing, but other columns like Age, Occupation, and Location are available.
- A random forest model is trained using these features to predict the missing Income values.
- Imputed Dataset: The model predicts and imputes the missing values based on the relationships it learns from other customer data.
Gradient Boosting Machines (GBM) Imputation
- Example: Similar to random forest, but this time a gradient boosting algorithm (like XGBoost or LightGBM) is used to predict the missing Annual Income based on other variables (e.g., Age, Gender, Occupation).
- The model builds several trees and combines their predictions to impute the missing income values.
Deep Learning Imputation
- Example: You have a complex dataset of medical records with missing values. A neural network (e.g., an autoencoder) is trained to predict missing values based on the patterns it learns from the observed data.
- The autoencoder compresses the dataset into a lower-dimensional space and reconstructs it, filling in missing values during this process.

4. Multiple Imputation#

Rubin’s Multiple Imputation
- Example: In a clinical trial, data on patients’ blood pressure is missing for some individuals. Instead of imputing a single value, multiple imputations are performed (e.g., 5 datasets), each with different plausible values for the missing data. The analysis is performed on all 5 datasets, and the results are combined to account for the uncertainty in the imputations.
- Imputed Datasets: 5 datasets are created with different values for the missing blood pressure data.
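The impute, analyze, and pool steps can be sketched as follows. This is a minimal illustration, not a full implementation: it assumes scikit-learn's `IterativeImputer` with `sample_posterior=True` as the source of the five imputed datasets, uses simulated `age` and `bp` columns, and pools only the mean blood pressure using Rubin's rules.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Simulated data (an assumption for this sketch): bp depends on age
rng = np.random.default_rng(0)
n = 200
age = rng.normal(50, 10, n)
bp = 90 + 0.8 * age + rng.normal(0, 8, n)
df = pd.DataFrame({'age': age, 'bp': bp})
df.loc[rng.random(n) < 0.3, 'bp'] = np.nan  # remove about 30% of bp values

m = 5  # number of imputed datasets
estimates, variances = [], []
for i in range(m):
    # sample_posterior=True draws imputations instead of using point predictions
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    estimates.append(completed['bp'].mean())            # analysis: the mean
    variances.append(completed['bp'].var(ddof=1) / n)   # its squared std. error

# Rubin's rules: pooled estimate, within- and between-imputation variance
q_bar = np.mean(estimates)
within = np.mean(variances)
between = np.var(estimates, ddof=1)
total_var = within + (1 + 1 / m) * between

print(f"pooled mean = {q_bar:.2f}, pooled SE = {np.sqrt(total_var):.2f}")
```

The between-imputation term is what a single imputation cannot provide: it widens the standard error to reflect how much the estimate changes from one plausible completion of the data to the next.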

5. Time-Series Specific Imputation#

Last Observation Carried Forward (LOCF)
- Example: In a time-series dataset of daily sales, if a data point is missing for a specific day, it can be imputed by carrying forward the last observed sales value.
- Imputed Dataset: If sales on day 4 are missing, then the value from day 3 is used to fill the gap.
Interpolation
- Example: You have monthly temperature data, and one month’s data is missing. You can use linear interpolation to estimate the missing value based on the values from the adjacent months.
- Imputed Dataset: If the temperatures for January and March are known, the missing temperature for February can be interpolated as the average of January and March temperatures.
Seasonal Decomposition of Time Series (STL) Imputation
- Example: For a monthly sales dataset with a seasonal pattern, STL decomposition can separate the trend, seasonality, and residuals. Missing values are then imputed based on the seasonal and trend components.
- Imputed Dataset: The missing data is imputed based on the seasonal and trend components of the decomposition.

6. Probabilistic Imputation Methods#

Expectation-Maximization (EM) Algorithm
- Example: In a dataset where income is missing, the EM algorithm estimates missing values by iterating between two steps: the expectation step (estimating missing values given the observed data) and the maximization step (estimating model parameters).
- The EM algorithm gradually fills in missing data with the most likely values based on the distribution of the observed data.
Bayesian Imputation
- Example: In a dataset with missing values for a feature like Age, Bayesian imputation would estimate missing values based on the posterior distribution, incorporating prior beliefs about the distribution of age.
- Imputed Dataset: Missing Age values are filled with estimates drawn from the posterior distribution.
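The expectation and maximization steps described for the EM algorithm can be sketched in NumPy. This is a minimal version that assumes the data follow a multivariate normal distribution. The function name `em_impute` and the simulated data are hypothetical, introduced only for this illustration.

```python
import numpy as np

def em_impute(X, n_iter=50):
    """EM for a multivariate normal model; returns (imputed X, mean, cov)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)

    # Start from column means and the covariance of the mean-filled data
    mu = np.nanmean(X, axis=0)
    X_hat = np.where(miss, mu, X)
    sigma = np.cov(X_hat, rowvar=False, bias=True)

    for _ in range(n_iter):
        extra = np.zeros((p, p))  # summed conditional covariances
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():
                # Nothing observed in this row: fall back to the mean
                X_hat[i] = mu
                extra += sigma
                continue
            # E-step: conditional mean and covariance of the missing
            # entries given the observed entries of the same row
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            coef = s_mo @ np.linalg.pinv(s_oo)
            X_hat[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            extra[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ s_mo.T
        # M-step: re-estimate the mean and covariance
        mu = X_hat.mean(axis=0)
        centered = X_hat - mu
        sigma = (centered.T @ centered + extra) / n
    return X_hat, mu, sigma

# Simulated example: two correlated variables, gaps in the second one
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=300)
X = data.copy()
X[rng.random(300) < 0.2, 1] = np.nan

X_imp, mu, sigma = em_impute(X)
print(mu)
print(sigma)
```

Strictly speaking, EM estimates the model parameters (here the mean and covariance). The imputed values are the conditional means under those parameters, so they understate variability unless noise is added or multiple imputation is layered on top.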

7. Domain-Specific Imputation#

Genetic Data Imputation
- Example: In a genetic study, missing genotypic data (like SNPs) can be imputed using information from a reference panel that provides probabilistic estimates of the missing genotypes based on linkage disequilibrium.
- Imputed Dataset: Missing SNP values are imputed based on patterns of genetic variation in the population.
Spatial Data Imputation (Kriging)
- Example: For environmental data, if a few sensor locations are missing temperature readings, Kriging (a geostatistical method) can be used to interpolate values at those missing locations based on surrounding sensor data.
- Imputed Dataset: Missing temperature values are imputed using nearby measurements, with a spatial correlation model.
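For the sensor example, one way to approximate Kriging with standard tools is Gaussian process regression, which is closely related to it. The sketch below uses scikit-learn's `GaussianProcessRegressor` as a stand-in rather than a dedicated geostatistics package. The coordinates, temperatures, and kernel settings are assumptions made for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical sensor coordinates (km) and temperatures (deg C);
# two sensors have no reading
coords = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                   [2.0, 0.0], [2.0, 1.0], [0.5, 0.5], [1.5, 0.5]])
temps = np.array([20.1, 20.6, 21.0, 21.4, 22.2, np.nan, 20.8, np.nan])

observed = ~np.isnan(temps)

# RBF models the spatial correlation; WhiteKernel models measurement noise
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gp.fit(coords[observed], temps[observed])

# Predict at the locations with missing readings, with a standard deviation
pred, std = gp.predict(coords[~observed], return_std=True)

temps_imputed = temps.copy()
temps_imputed[~observed] = pred
print(temps_imputed)
print(std)
```

The returned standard deviation plays the role of the Kriging variance: it grows for locations far from any observed sensor, which is a useful signal of how much to trust each imputed value.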

Summary#

Each method has its strengths and suits different patterns of missing data. Simple methods are usually adequate when only a small share of values is missing and the gaps are unrelated to the data (missing completely at random), while more complex models are better at capturing relationships between variables.

Codes#

from sklearn.linear_model import LinearRegression

# Create a sample dataset
data = {'Height': [150, 160, 170, 180, 190],
        'Weight': [50, 60, np.nan, 80, 90]}
df = pd.DataFrame(data)

# Prepare the data for imputation (remove missing value rows for regression model)
train_data = df.dropna()

# Train a linear regression model to predict missing values
model = LinearRegression()
model.fit(train_data[['Height']], train_data['Weight'])

# Predict missing values for 'Weight'
# (pass a DataFrame so the feature names match those used in fit)
missing_height = df.loc[df['Weight'].isna(), ['Height']]
predicted_weight = model.predict(missing_height)

# Impute missing values
df.loc[df['Weight'].isna(), 'Weight'] = predicted_weight
print(df)

from sklearn.impute import KNNImputer

# Create a sample dataset with missing values
data = {'Height': [150, 160, 170, np.nan, 190],
        'Weight': [50, 60, np.nan, 80, 90]}
df = pd.DataFrame(data)

# KNN Imputer
knn = KNNImputer(n_neighbors=2)
df_imputed = knn.fit_transform(df)
df_imputed = pd.DataFrame(df_imputed, columns=df.columns)

print(df_imputed)

# This import is required: it enables the still-experimental IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Create a sample dataset
data = {'Height': [150, 160, 170, np.nan, 190],
        'Weight': [50, 60, np.nan, 80, 90]}
df = pd.DataFrame(data)

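# MICE-style imputation: each feature is modeled from the others in turn.
# By default IterativeImputer returns a single imputed dataset.
mice = IterativeImputer(max_iter=10, random_state=0)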
df_imputed = mice.fit_transform(df)
df_imputed = pd.DataFrame(df_imputed, columns=df.columns)

print(df_imputed)

from sklearn.ensemble import RandomForestRegressor

# Create a sample dataset
data = {'Height': [150, 160, 170, 180, 190],
        'Weight': [50, 60, np.nan, 80, 90]}
df = pd.DataFrame(data)

# Prepare the data for imputation (remove missing value rows for training)
train_data = df.dropna()

# Train a random forest model
model_rf = RandomForestRegressor()
model_rf.fit(train_data[['Height']], train_data['Weight'])

# Predict missing values for 'Weight'
# (pass a DataFrame so the feature names match those used in fit)
missing_height = df.loc[df['Weight'].isna(), ['Height']]
predicted_weight_rf = model_rf.predict(missing_height)

# Impute missing values
df.loc[df['Weight'].isna(), 'Weight'] = predicted_weight_rf
print(df)

from sklearn.ensemble import GradientBoostingRegressor

# Create a sample dataset
data = {'Height': [150, 160, 170, 180, 190],
        'Weight': [50, 60, np.nan, 80, 90]}
df = pd.DataFrame(data)

# Prepare the data for imputation (remove missing value rows for training)
train_data = df.dropna()

# Train a gradient boosting model
model_gbm = GradientBoostingRegressor()
model_gbm.fit(train_data[['Height']], train_data['Weight'])

# Predict missing values for 'Weight'
# (pass a DataFrame so the feature names match those used in fit)
missing_height = df.loc[df['Weight'].isna(), ['Height']]
predicted_weight_gbm = model_gbm.predict(missing_height)

# Impute missing values
df.loc[df['Weight'].isna(), 'Weight'] = predicted_weight_gbm
print(df)

from sklearn.neural_network import MLPRegressor

# Create a sample dataset
data = {'Height': [150, 160, 170, 180, 190],
        'Weight': [50, 60, np.nan, 80, 90]}
df = pd.DataFrame(data)

# Prepare the data for imputation (remove missing value rows for training)
train_data = df.dropna()

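# Train a simple neural network for imputation (MLP Regressor).
# Note: with four unscaled training rows the prediction is only a rough
# estimate; in practice, standardize the features and use more data.
model_nn = MLPRegressor(hidden_layer_sizes=(5,), max_iter=1000, random_state=0)
model_nn.fit(train_data[['Height']], train_data['Weight'])

# Predict missing values for 'Weight'
# (pass a DataFrame so the feature names match those used in fit)
missing_height = df.loc[df['Weight'].isna(), ['Height']]
predicted_weight_nn = model_nn.predict(missing_height)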

# Impute missing values
df.loc[df['Weight'].isna(), 'Weight'] = predicted_weight_nn
print(df)

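from statsmodels.imputation.mice import MICEData

# Create a sample dataset
data = {'Height': [150, 160, 170, np.nan, 190],
        'Weight': [50, 60, np.nan, 80, 90]}
df = pd.DataFrame(data)

# MICEData fills the gaps with an initial guess when it is constructed;
# update_all() then runs the chained-equation cycles
# (predictive mean matching), here for 10 iterations
imp = MICEData(df)
imp.update_all(10)
df_imputed = imp.data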

print(df_imputed)

import pandas as pd

# Create a time-series dataset
data = {'date': pd.date_range(start='2024-01-01', periods=5, freq='D'),
        'sales': [200, 210, np.nan, 250, np.nan]}
df = pd.DataFrame(data)

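# LOCF: carry the last observation forward.
# The result goes into a new column so the original gaps stay available.
df['sales_locf'] = df['sales'].ffill()

# Linear interpolation between neighboring observations.
# The trailing gap has no right-hand neighbor, so pandas repeats
# the last valid value there.
df['sales_linear'] = df['sales'].interpolate(method='linear')
print(df)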

from statsmodels.tsa.seasonal import STL
import numpy as np
import pandas as pd

# Create a daily series with a weekly pattern (four full weeks, so that
# the seasonal component can be estimated) and a few missing values
dates = pd.date_range(start='2024-01-01', periods=28, freq='D')
days = np.arange(28)
sales = 200 + 2.0 * days + 10 * np.sin(2 * np.pi * days / 7)
df = pd.DataFrame({'sales': sales}, index=dates)
df.iloc[[2, 4, 6], 0] = np.nan

# STL cannot handle NaN, so fill the gaps provisionally by interpolation
filled = df['sales'].interpolate(method='linear')

# Decompose the provisional series (period=7 for weekly seasonality)
result = STL(filled, period=7).fit()

# Replace only the missing points with trend + seasonal (residual dropped)
mask = df['sales'].isna()
df.loc[mask, 'sales'] = (result.trend + result.seasonal)[mask]
print(df)

from statsmodels.imputation import mice

# Create a sample dataset
data = {'Height': [150, 160, 170, np.nan, 190],
        'Weight': [50, 60, np.nan, 80, 90]}
df = pd.DataFrame(data)

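# Note: statsmodels' MICEData implements chained equations with predictive
# mean matching, not the EM algorithm itself; it is used here as a related
# iterative, model-based approach. update_all() runs the imputation cycles.
imp = mice.MICEData(df)
imp.update_all(10)
df_imputed = imp.data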
print(df_imputed)

These examples should help you get started with different imputation methods in Python. Depending on your data type (numeric, categorical, time-series, etc.), you can choose the method that best suits your needs.