
Feature Engineering: Handling Missing Data


What is a feature? Machine learning algorithms require input data to work, and those inputs are composed of features. The goal of a machine learning algorithm is to produce accurate predictions, and feature engineering ties it all together. Feature engineering includes everything from filling missing values, to transforming variables, to building new variables from existing ones.

Here we will walk through a few approaches for handling missing data in numerical variables: complete case analysis, mean/median imputation, and end of distribution imputation. The coding examples will be performed in Python using the Titanic dataset. Below is a look at the different features. The feature “Survived” is what we will refer to as our target variable; this is the variable we are trying to predict.

[Image: feature engineering 1]
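Since the original screenshots are not reproduced here, a minimal sketch of inspecting the features and their missingness, using a small made-up frame in place of the real Titanic data (column names mirror the dataset; the rows are illustrative only):

```python
import pandas as pd

# Small made-up frame standing in for the Titanic data; column names mirror
# the real dataset but the rows here are illustrative only.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 2, 1, 3],
    "Age":      [22.0, 38.0, None, 35.0, None, 54.0],
    "Cabin":    [None, "C85", None, None, "B42", None],
    "Embarked": ["S", "C", "S", "S", None, "Q"],
})

# Fraction of missing values per feature
print(df.isnull().mean().sort_values(ascending=False))
```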

Complete Case Analysis

We will begin with complete case analysis (CCA), which involves discarding observations that contain missing values. The advantage is that it is the easiest method to implement. CCA works when data is missing completely at random. Let’s take a look at our Titanic dataset.

[Image: feature engineering 2]

The features Age, Cabin, and Embarked carry missing values. Removing every observation with a missing value leaves us with only 20% of the original dataset. This is a significant loss of data and will likely lead to overfitting.

[Image: feature engineering 3]
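As a sketch, CCA is a single `dropna` call. A toy frame stands in for the real Titanic data here, so the retained fraction differs from the 20% quoted above:

```python
import pandas as pd

# Toy frame standing in for the Titanic data (rows are illustrative)
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Age":      [22.0, 38.0, None, 35.0, None, 54.0],
    "Cabin":    [None, "C85", None, None, "B42", None],
})

# Complete case analysis: keep only fully observed rows
complete = df.dropna()
retained = len(complete) / len(df)
print(f"retained {retained:.0%} of observations")
```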

Instead, we can try an alternative CCA approach by creating a new field that flags the missing values. In the example below, we created a new field called “Missing Age” that contains a 1 when the “Age” variable is null and a 0 otherwise.

[Image: feature engineering 4]

The survival rate is 40% for observations without missing data and 29% for those with missing data. Once again, this is the wrong approach: for alternative CCA to work, we would want the two distributions to match, and the mismatch suggests the data is not missing completely at random.
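The indicator-field comparison can be sketched as follows. The `Missing Age` name follows the text; the toy rows are made up, so the rates differ from the 40%/29% above:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (rows are illustrative)
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 1, 0],
    "Age":      [22.0, 38.0, None, 35.0, None, 54.0, 2.0, None],
})

# Flag observations where Age is missing
df["Missing Age"] = np.where(df["Age"].isnull(), 1, 0)

# Compare survival rates between the two groups
survival_by_missing = df.groupby("Missing Age")["Survived"].mean()
print(survival_by_missing)
```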

Mean / Median Imputation

Mean/median imputation involves replacing missing data in a variable with the mean (if the variable follows a normal distribution) or the median (if the variable follows a skewed distribution). The advantages are again quick implementation, and the entire dataset is left intact. A disadvantage is that it may distort the variable’s original variance.

One important thing to note is that the imputation value should be computed on the training dataset and then applied to the test set. This avoids leaking information from the test set and, with it, overfitting.

[Image: feature engineering 5]

To determine which statistic to use, we’ll plot the distribution, then impute the values. Age appears to be fairly Gaussian, so we can replace the nulls with the mean.

[Image: feature engineering 6]
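A minimal sketch of mean imputation learned on the training split and applied to the test split, with toy values and a simple positional split standing in for a proper random split:

```python
import pandas as pd

# Toy Age column standing in for the Titanic data (values are illustrative)
df = pd.DataFrame({"Age": [22.0, 38.0, None, 35.0, None, 54.0, 2.0, 27.0]})

# Split first, so the imputation value is learned from the training set only
train, test = df.iloc[:6].copy(), df.iloc[6:].copy()

# Age is roughly normal, so use the mean (use the median for skewed variables)
age_mean = train["Age"].mean()
train["Age"] = train["Age"].fillna(age_mean)
test["Age"] = test["Age"].fillna(age_mean)  # same train-derived value for the test set

print(age_mean)
```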

End of Distribution Imputation

If there is suspicion that values are not missing at random, then capturing that information is important. In this scenario, we replace missing data with values at the tails of the variable’s distribution. The advantage is that it is quick and captures the significance of missingness (when we suspect the missing data is informative). On the flip side, this may distort the variable, mask predictive power if missingness turns out not to be important, hide true outliers if a large share of the data is missing, or create unintended outliers if only a few values are missing. Once again, this method should be performed on the training set and propagated to the test set. Since we know “Age” follows a normal distribution, the tail values will be computed using the mean rather than the median.

[Image: feature engineering 7]
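A sketch of end of distribution imputation for a roughly normal variable, placing imputed values about three standard deviations above the mean. The three-sigma cutoff is a common convention, an assumption here rather than something taken from the original screenshot:

```python
import pandas as pd

# Toy Age column standing in for the Titanic training data
train = pd.DataFrame({"Age": [22.0, 38.0, None, 35.0, None, 54.0]})

# For a normal variable, the far tail sits around mean + 3 standard deviations
tail_value = train["Age"].mean() + 3 * train["Age"].std()

# Impute the nulls with the tail value (reuse the same value on the test set)
train["Age_eod"] = train["Age"].fillna(tail_value)
print(train)
```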

Handling missing values is just the tip of the iceberg when it comes to feature engineering. In future posts, we will describe how to handle categorical variables, dates, anomalies, scaling, normalization, and discretization.

Please be aware that there is no one-size-fits-all solution in data science, but these best practices can be leveraged in your search for model optimization.

