Feature Engineering: Handling Missing Data

What is a feature? Machine learning algorithms need input data to work, and that input is made up of features: the individual variables that describe each observation. The goal of a machine learning algorithm is to produce accurate predictions, and feature engineering is the work of shaping raw data into features that make this possible. It covers everything from filling in missing values, to transforming variables, to building new variables from existing ones.

Here we will walk through a few approaches for handling missing data in numerical variables: complete case analysis, mean/median imputation, and end of distribution imputation. The coding examples will be performed in Python using the Titanic dataset. Below is a look at the different features. The feature “Survived” is our target variable, the one we are trying to predict.

Complete Case Analysis

We will begin with complete case analysis (CCA). CCA involves discarding every observation that contains a missing value. Its advantage is that it is the easiest method to implement. CCA is appropriate when data is missing completely at random and only a small share of observations is affected. Let’s take a look at our Titanic dataset.

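The features Age, Cabin, and Embarked appear to carry missing values. If we remove every observation that has a missing value, we are left with about 20% of the original dataset. This is a significant loss of data: a model trained on so few observations is more prone to overfitting, and the remaining rows may no longer be representative of the full dataset.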
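As a rough illustration of the step above, the sketch below measures the share of missing values per column and the share of rows that survive CCA. It assumes pandas and uses a small hand-made DataFrame with Titanic-like column names. The values are hypothetical stand-ins, not the real dataset, so the retained share will differ from the 20% quoted above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (hypothetical values, not the real dataset).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan, "E46", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S", "Q", "S", np.nan, "S"],
})

# Share of missing values in each column.
missing_share = df.isnull().mean()

# Complete case analysis: keep only the rows with no missing value at all.
complete_cases = df.dropna()
retained_share = len(complete_cases) / len(df)

print(missing_share)
print(f"Rows retained after CCA: {retained_share:.0%}")
```

Checking the retained share before committing to CCA makes the cost of dropping rows explicit.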

Instead, we can check whether CCA is safe by creating a new field that flags the missing values. In the example below, we created a new field called “Missing Age” that contains a 1 when the “Age” variable is null and a 0 otherwise.

The survival rate is 40% for observations where Age is present and 29% for observations where it is missing. If the values were missing completely at random, the two rates would be roughly equal. The gap suggests that missingness in Age is related to the target, so CCA is once again the wrong approach here: dropping these rows would bias the dataset.
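The comparison described above can be sketched as follows, assuming pandas. The DataFrame is a small hypothetical sample rather than the real Titanic data, so the rates it prints are not the 40% and 29% reported above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (hypothetical values, not the real dataset).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1, 0, 0],
    "Age": [22.0, 38.0, np.nan, 35.0, 26.0, np.nan, 54.0, 2.0, np.nan, 27.0],
})

# Flag column: 1 where Age is null, 0 otherwise.
df["Missing Age"] = df["Age"].isnull().astype(int)

# The mean of a 0/1 target within each group is that group's survival rate.
survival_by_missingness = df.groupby("Missing Age")["Survived"].mean()
print(survival_by_missingness)
```

A large gap between the two group rates is a hint that the values are not missing completely at random.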

Mean / Median Imputation

Mean / median imputation replaces the missing values in a variable with the mean (if the variable is approximately normally distributed) or the median (if its distribution is skewed). The advantages are again quick implementation and that the entire dataset stays intact. A disadvantage is that it shrinks the variance of the variable, because every imputed observation receives the same central value.

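One important thing to note is that the mean or median should be computed on the training set only and then applied to both the training and test sets. This keeps information from the test set from leaking into the model, which would make evaluation results overly optimistic.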

To determine which one to use, we’ll plot the distribution, then impute the values. Age appears to be fairly Gaussian, so we can replace the nulls with the mean.
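A minimal sketch of mean imputation, assuming pandas and two small hypothetical train and test frames in place of a real split of the Titanic data. The column name "Age_imputed" is an illustrative choice. The skewness value is only a quick numeric supplement to the plot; swap in the median if the distribution looks skewed.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test split (toy values, not the real dataset).
train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, 26.0, np.nan, 54.0, 29.0]})
test = pd.DataFrame({"Age": [np.nan, 40.0, 19.0]})

# Skewness near 0 suggests a roughly symmetric distribution.
skewness = train["Age"].skew()

# Learn the statistics from the training set only.
age_mean = train["Age"].mean()
age_median = train["Age"].median()

# Apply the training-set mean to both sets.
train["Age_imputed"] = train["Age"].fillna(age_mean)
test["Age_imputed"] = test["Age"].fillna(age_mean)

# Imputing a constant shrinks the variance of the variable.
print(train["Age"].var(), train["Age_imputed"].var())
```

Keeping the imputed values in a separate column makes it easy to compare the distribution before and after imputation.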

End of Distribution Imputation

If there is suspicion that values are not missing at random, then capturing that information is important. In this scenario, one would replace missing data with a value at the tail of the variable’s distribution. The advantage is that it is quick and captures the importance of missingness (if one suspects the missing data is informative). On the flip side, it may distort the variable’s distribution, mask predictive power if missingness is not important, hide true outliers if many values are missing, or create an unintended outlier if only a few are. Once again, the replacement value should be computed on the training set and applied to the test set. Since “Age” is approximately normally distributed, the tail value is computed from the mean and standard deviation (commonly the mean plus three standard deviations) rather than from the median.
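One way this could look in code, assuming pandas and the same kind of hypothetical train and test frames. The choice of mean plus three standard deviations as the tail value is an assumption based on common practice for roughly normal variables; the text above only states that the mean is used. The column name "Age_eod" is likewise illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test split (toy values, not the real dataset).
train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, 26.0, np.nan, 54.0, 29.0]})
test = pd.DataFrame({"Age": [np.nan, 40.0, 19.0]})

# Tail value learned from the training set only (assumed rule: mean + 3 * std).
tail_value = train["Age"].mean() + 3 * train["Age"].std()

# Replace nulls in both sets with the training-set tail value.
train["Age_eod"] = train["Age"].fillna(tail_value)
test["Age_eod"] = test["Age"].fillna(tail_value)

print(f"Tail value used for imputation: {tail_value:.1f}")
```

Because the imputed observations sit far from the bulk of the data, a model can learn to treat them as a separate group.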

Handling missing values is just the tip of the iceberg when it comes to feature engineering. In future posts, we will describe how to handle categorical variables, dates, anomalies, scaling, normalization, and discretization.

Please be aware that there is no one-size-fits-all approach in data science, but these practices can be leveraged in your search for model optimization.

