Feature Engineering: Handling Missing Data

What is a feature? Machine learning algorithms need input data to work, and that input is made up of features. The goal of a machine learning algorithm is to produce accurate predictions, and feature engineering is what prepares the inputs to make that possible. Feature engineering covers everything from filling in missing values to transforming variables to building new variables from existing ones.

Here we will walk through a few approaches for handling missing data in numerical variables: complete case analysis, mean/median imputation, and end-of-distribution imputation. The coding examples are written in Python and use the Titanic dataset. Below is a look at the different features. The feature “Survived” is our target variable, the one we are trying to predict.

[Image: feature engineering 1]

Complete Case Analysis

We will begin with complete case analysis (CCA). CCA involves discarding observations that contain missing values. Its main advantage is that it is the easiest method to implement. CCA is appropriate when data is missing completely at random, because dropping rows then does not bias the remaining sample. Let’s take a look at our Titanic dataset.

[Image: feature engineering 2]

The features Age, Cabin, and Embarked contain missing values. If we remove every observation that has a missing value, we are left with about 20% of the original dataset. That is a significant loss of data: a model trained on so few observations is more prone to overfitting.
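As a minimal sketch of the two steps just described (measuring missingness, then dropping incomplete rows), the snippet below uses pandas on a small hand-made table. The five rows are illustrative stand-ins for a few Titanic columns, not real passenger records, so the retained share differs from the 20% quoted above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few Titanic columns (illustrative values only).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": ["E46", "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", "Q"],
})

# Share of missing values in each column.
missing_share = df.isnull().mean()

# Complete case analysis: keep only rows with no missing value in any column.
complete_cases = df.dropna()

# Fraction of the original observations that remain after the drop.
retained = len(complete_cases) / len(df)

print(missing_share)
print(f"Rows kept: {len(complete_cases)} of {len(df)} ({retained:.0%})")
```

Comparing `retained` against the per-column `missing_share` shows quickly whether one sparse column (such as Cabin) is responsible for most of the loss.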

[Image: feature engineering 3]

Instead, we can try an alternative CCA approach by creating a new field that flags the missing values. In the example below, we create a new field called “Missing Age” that contains a 1 when the “Age” variable is null and a 0 otherwise.
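One way to build such a flag in pandas is sketched here. The four Age values are made up for illustration; only the column names “Age” and “Missing Age” come from the text.

```python
import numpy as np
import pandas as pd

# Illustrative ages; NaN marks a missing value.
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan]})

# isnull() gives True/False per row; casting to int turns that into 1/0.
df["Missing Age"] = df["Age"].isnull().astype(int)

print(df)
```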

[Image: feature engineering 4]

The survival rate is 40% for observations where Age is present and 29% for observations where it is missing. Once again, this is the wrong approach: the gap suggests that Age is not missing completely at random, so dropping those observations would bias the sample. For alternative CCA to work, the two survival rates would need to match.
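The comparison above can be sketched as a group-by on the missing-value flag. The eight rows below are invented for illustration, so the rates they produce are not the Titanic figures.

```python
import numpy as np
import pandas as pd

# Illustrative data: four rows with Age present, four with Age missing.
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1, 0, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan, np.nan, np.nan, np.nan],
})
df["Missing Age"] = df["Age"].isnull().astype(int)

# The mean of a 0/1 target within each group is that group's survival rate.
survival_by_missing = df.groupby("Missing Age")["Survived"].mean()

print(survival_by_missing)
```

A large difference between the two rates is a warning sign that the missingness carries information about the target.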

Mean / Median Imputation

Mean / median imputation replaces missing data within a variable with the mean (if the variable follows a normal distribution) or the median (if the variable follows a skewed distribution). The advantages are again quick implementation and that the entire dataset stays intact. A disadvantage is that it distorts the original variance: every imputed value sits at the center of the distribution, so the variance shrinks as the share of missing values grows.
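The effect on variance is easy to see on a small example. The sketch below uses five made-up values (two of them missing) and compares the sample variance before and after filling with the mean.

```python
import numpy as np
import pandas as pd

# Illustrative values; two of the five are missing.
age = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan])

# pandas skips NaN when computing the mean, so this is the mean of 20, 30, 40.
imputed = age.fillna(age.mean())

var_before = age.var()      # sample variance of the observed values only
var_after = imputed.var()   # sample variance after imputation

print(f"variance before: {var_before:.1f}, after: {var_after:.1f}")
```

The mean is unchanged, but the variance drops because the filled-in values add observations with zero deviation from the mean.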

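One important thing to note is that the mean or median should be computed on the training dataset only and then used to fill the missing values in both the training and test sets. This keeps information from the test set out of training, which would otherwise make the model's evaluation overly optimistic.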
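A minimal sketch of that train-then-apply pattern, assuming the data has already been split into two DataFrames. The names `train`, `test`, and `Age_imputed` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical, already-split data.
train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"Age": [np.nan, 50.0]})

# Learn the fill value from the training set only.
age_mean = train["Age"].mean()

# Apply the same training-set value to both sets.
train["Age_imputed"] = train["Age"].fillna(age_mean)
test["Age_imputed"] = test["Age"].fillna(age_mean)

print(age_mean)
print(test)
```

Note that the test set's own value (50.0) plays no part in `age_mean`.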

[Image: feature engineering 5]

To determine which one to use, we’ll plot the distribution and then impute the values. Age appears to be fairly Gaussian, so we can replace the nulls with the mean.
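Where a plot is inconvenient, sample skewness can serve as a rough numeric stand-in for eyeballing the histogram. The helper below is a hypothetical sketch, not the method used in this post, and the 0.5 cutoff is an assumed rule of thumb that should be tuned to the data.

```python
import numpy as np
import pandas as pd

def choose_fill_value(values: pd.Series, skew_threshold: float = 0.5) -> float:
    """Return the mean for roughly symmetric data, otherwise the median.

    skew_threshold is an assumed cutoff, not a standard constant.
    """
    if abs(values.skew()) < skew_threshold:
        return values.mean()
    return values.median()

# Illustrative series; NaN values are skipped by skew(), mean(), and median().
symmetric = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, np.nan])
skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 100.0, np.nan])

print(choose_fill_value(symmetric))  # mean is used
print(choose_fill_value(skewed))     # median is used
```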

[Image: feature engineering 6]

End of Distribution Imputation

If there is reason to suspect that values are not missing at random, capturing that information is important. In this scenario, one would replace missing data with a value at the tail of the variable's distribution. The advantage is that it is quick and captures the importance of missingness (if one suspects the missing data is informative). On the flip side, it may distort the variable's distribution, mask predictive power if missingness is not important, hide true outliers if many values are missing, or create an unintended outlier if few values are missing. Once again, the replacement value should be computed on the training set and propagated to the test set. Since “Age” is approximately normally distributed, the tail value will be computed using the mean rather than the median.
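A minimal sketch of end-of-distribution imputation follows. It assumes the common convention of placing the tail value three standard deviations above the mean; the text only says the mean is used, so the factor of 3 is an assumption, as are the `train`/`test` names and values.

```python
import numpy as np
import pandas as pd

# Hypothetical, already-split data.
train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"Age": [np.nan, 25.0]})

# Assumed rule for a roughly normal variable: mean + 3 standard deviations,
# computed on the training set only.
tail_value = train["Age"].mean() + 3 * train["Age"].std()

# Propagate the same training-set value to both sets.
train["Age_imputed"] = train["Age"].fillna(tail_value)
test["Age_imputed"] = test["Age"].fillna(tail_value)

print(tail_value)
print(test)
```

For a skewed variable, a quantile-based tail value (for example one derived from the interquartile range) would be a more natural choice than the mean-based rule.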

[Image: feature engineering 7]

Handling missing values is just the tip of the iceberg when it comes to feature engineering. In future posts, we will describe how to handle categorical variables, dates, and anomalies, as well as scaling, normalization, and discretization.

Please be aware that there is no one-size-fits-all approach in data science, but these best practices can be leveraged in your search for model optimization.
