Feature Engineering: Handling Missing Data

What is a feature? Machine learning algorithms need input data to work, and that input is made up of features: the individual variables (columns) that describe each observation. The goal of a machine learning algorithm is to produce accurate predictions, and feature engineering is what prepares the inputs to make that possible. It includes everything from filling in missing values, to transforming variables, to building new variables from existing ones.

Here we will walk through a few approaches for handling missing data in numerical variables: complete case analysis, mean/median imputation, and end of distribution imputation. The coding examples are written in Python and use the Titanic dataset. Below is a look at the different features. The feature “Survived” is our target variable, that is, the variable we are trying to predict.

Complete Case Analysis

We will begin by focusing on complete case analysis (CCA). CCA involves discarding every observation that contains a missing value. Its advantage is that it is the easiest approach to implement. CCA is appropriate only when data is missing completely at random, meaning the chance that a value is missing is unrelated to any variable in the data. Let’s take a look at our Titanic dataset.

The features “Age”, “Cabin”, and “Embarked” contain missing values. If we remove every row that has at least one missing value, we are left with only 20% of the original dataset. This is a significant loss of data, and training on so few observations increases the risk of overfitting.
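As a minimal sketch of the step described above, the snippet below measures the share of missing values per column and then applies complete case analysis. It assumes pandas and NumPy are available, and the small DataFrame is a hypothetical stand-in for the Titanic data, so the percentages it prints will not match the 20% quoted for the real dataset.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the Titanic data (not the real values).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan, "E46"],
    "Embarked": ["S", "C", "S", np.nan, "Q", "S"],
})

# Share of missing values in each column.
missing_share = df.isnull().mean()
print(missing_share)

# Complete case analysis: keep only rows with no missing value in any column.
complete_cases = df.dropna()
retained = len(complete_cases) / len(df)
print(f"Rows retained after CCA: {retained:.0%}")
```

On real data, comparing `retained` against a threshold you are comfortable with is a quick way to decide whether CCA is even worth considering.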

Instead, we can try an alternative CCA approach by creating a new field that flags the missing values. In the example below, we created a new field called “Missing Age” that contains a 1 when the “Age” variable is null and a 0 otherwise.

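The survival rate is 40% for observations where “Age” is present and 29% for observations where it is missing. Once again, this tells us CCA is the wrong approach here: the gap suggests that “Age” is not missing completely at random, so dropping those rows would bias the data. For the alternative CCA approach to work, we would want the two distributions to match.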
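The check described above can be sketched roughly as follows. The DataFrame is again a hypothetical stand-in (assuming pandas and NumPy), so the rates it produces are illustrative and are not the 40% and 29% observed on the real Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, not the real Titanic values.
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1, 0, 0],
    "Age": [29.0, np.nan, 41.0, 35.0, np.nan, 18.0, np.nan, 60.0],
})

# Flag column: 1 where Age is null, 0 otherwise.
df["Missing Age"] = df["Age"].isnull().astype(int)

# The mean of a 0/1 target within each group is that group's survival rate.
survival_by_missing = df.groupby("Missing Age")["Survived"].mean()
print(survival_by_missing)
```

If the two rates are close, missingness looks unrelated to the target; a large gap is a hint that the values are not missing completely at random.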

Mean / Median Imputation

Mean / median imputation involves replacing the missing values of a variable with its mean (if the variable follows a normal distribution) or its median (if the variable follows a skewed distribution). The advantages are that it is quick to implement and that it leaves our entire dataset intact. A disadvantage is that it may distort the original variance, typically shrinking it, because every imputed value sits at the center of the distribution.

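One important thing to note is that the imputation value should be computed on the training dataset only and then applied to both the training set and the test set. This keeps information from the test set from leaking into the model, which helps avoid overfitting and overly optimistic evaluation results.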

To decide between the mean and the median, we’ll plot the distribution and then impute the values. “Age” appears to be fairly Gaussian, so we can replace the nulls with the mean.
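A minimal sketch of this procedure, assuming pandas and NumPy and using two small hypothetical DataFrames in place of a real train/test split of the Titanic data: the statistic is learned from the training set only and then reused for the test set. The column name “Age_imputed” is introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical, already-split stand-ins for the training and test sets.
train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, 29.0]})
test = pd.DataFrame({"Age": [np.nan, 54.0, 2.0]})

# Learn the statistics from the training set only (NaN is skipped by default).
age_mean = train["Age"].mean()
age_median = train["Age"].median()

# Use the mean for a roughly symmetric variable, the median for a skewed one.
fill_value = age_mean

# Apply the same training-set value to both sets.
train["Age_imputed"] = train["Age"].fillna(fill_value)
test["Age_imputed"] = test["Age"].fillna(fill_value)

# Imputing a central value shrinks the variance of the variable.
print(train["Age"].var(), train["Age_imputed"].var())
```

Keeping the original “Age” column next to the imputed one makes it easy to compare the distributions before and after imputation.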

End of Distribution Imputation

If we suspect that the values are not missing at random, then capturing that information is important. In this scenario, one would want to replace missing data with values that sit at the tails of the variable’s distribution. The advantage is that this is quick and captures the importance of missingness (if one suspects the missing data is informative). On the flip side, it may distort the variable, mask predictive power if missingness is not important, hide true outliers if the share of missing data is large, or create an unintended outlier if the share of missing data is small. Once again, the replacement value should be computed on the training set and then applied to the test set. Since “Age” appears to follow a normal distribution, the tail value will be computed from the mean (for example, the mean plus three standard deviations) rather than from the median.
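The idea can be sketched as below, assuming pandas and NumPy and the same kind of hypothetical stand-in DataFrames. The article does not spell out the exact tail formula, so the mean plus three standard deviations used here is an assumption based on a common convention for roughly normal variables.

```python
import numpy as np
import pandas as pd

# Hypothetical, already-split stand-ins for the training and test sets.
train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, 29.0]})
test = pd.DataFrame({"Age": [np.nan, 54.0, 2.0]})

# Assumed tail definition: mean + 3 standard deviations, training set only.
tail_value = train["Age"].mean() + 3 * train["Age"].std()

# Replace missing values with the same far-tail value in both sets.
train["Age_imputed"] = train["Age"].fillna(tail_value)
test["Age_imputed"] = test["Age"].fillna(tail_value)

print(f"Tail value used for imputation: {tail_value:.2f}")
```

Because the imputed rows all land at one extreme value, a model can separate them from the observed ages, which is how the missingness signal is preserved.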

Handling missing values is just the tip of the iceberg when it comes to feature engineering. In future posts, we will describe how to handle categorical variables, dates, anomalies, scaling, normalization, and discretization.

Please be aware that there is no one-size-fits-all approach in data science, but these best practices can be leveraged in your search for model optimization.
