
Feature Engineering: Handling Missing Data


What is a feature? Machine learning algorithms require input data to work, and those inputs are composed of features. The goal of a machine learning algorithm is to produce accurate predictions, and feature engineering ties it all together. Feature engineering includes everything from filling missing values, to transforming variables, to building new variables from existing ones.

Here we will walk through a few approaches for handling missing data in numerical variables: complete case analysis, mean/median imputation, and end of distribution imputation. The coding examples will be performed in Python using the Titanic dataset. Below is a look at the different features. The feature “Survived” is what we will refer to as our target variable; this is the variable we are trying to predict.

[Image: feature engineering 1]
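Since the original screenshots are not reproduced here, a minimal sketch of inspecting the features and their missingness, using a small made-up frame in place of the real Titanic data (column names mirror the dataset; the rows are illustrative only):

```python
import pandas as pd

# Small made-up frame standing in for the Titanic data; column names mirror
# the real dataset but the rows here are illustrative only.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 2, 1, 3],
    "Age":      [22.0, 38.0, None, 35.0, None, 54.0],
    "Cabin":    [None, "C85", None, None, "B42", None],
    "Embarked": ["S", "C", "S", "S", None, "Q"],
})

# Fraction of missing values per feature
print(df.isnull().mean().sort_values(ascending=False))
```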

Complete Case Analysis

We will begin with complete case analysis (CCA), which involves discarding observations that contain missing values. The advantage is that it is the easiest method to implement. CCA works when data is missing completely at random. Let’s take a look at our Titanic dataset.

[Image: feature engineering 2]

The features Age, Cabin, and Embarked carry missing values. Removing every observation with a missing value leaves us with only 20% of the original dataset. This is a significant loss of data and will likely lead to overfitting.

[Image: feature engineering 3]
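As a sketch, CCA is a single `dropna` call. A toy frame stands in for the real Titanic data here, so the retained fraction differs from the 20% quoted above:

```python
import pandas as pd

# Toy frame standing in for the Titanic data (rows are illustrative)
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Age":      [22.0, 38.0, None, 35.0, None, 54.0],
    "Cabin":    [None, "C85", None, None, "B42", None],
})

# Complete case analysis: keep only fully observed rows
complete = df.dropna()
retained = len(complete) / len(df)
print(f"retained {retained:.0%} of observations")
```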

Instead, we can try an alternative CCA approach by creating a new field that flags the missing values. In the example below, we created a new field called “Missing Age” that contains a 1 when the “Age” variable is null and a 0 otherwise.

[Image: feature engineering 4]

The survival rate is 40% for observations without missing data and 29% for those with missing data. Once again, this is the wrong approach: for alternative CCA to work, we would want the two distributions to match, and the mismatch suggests the data is not missing completely at random.
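The indicator-field comparison can be sketched as follows. The `Missing Age` name follows the text; the toy rows are made up, so the rates differ from the 40%/29% above:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (rows are illustrative)
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 1, 0],
    "Age":      [22.0, 38.0, None, 35.0, None, 54.0, 2.0, None],
})

# Flag observations where Age is missing
df["Missing Age"] = np.where(df["Age"].isnull(), 1, 0)

# Compare survival rates between the two groups
survival_by_missing = df.groupby("Missing Age")["Survived"].mean()
print(survival_by_missing)
```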

Mean / Median Imputation

Mean/median imputation involves replacing missing data in a variable with the mean (if the variable follows a normal distribution) or the median (if the variable follows a skewed distribution). The advantages are again quick implementation, and the entire dataset is left intact. A disadvantage is that it may distort the variable’s original variance.

One important thing to note is that the imputation value should be computed on the training dataset and then applied to the test set. This avoids leaking information from the test set and, with it, overfitting.

[Image: feature engineering 5]

To determine which statistic to use, we’ll plot the distribution, then impute the values. Age appears to be fairly Gaussian, so we can replace the nulls with the mean.

[Image: feature engineering 6]
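A minimal sketch of mean imputation learned on the training split and applied to the test split, with toy values and a simple positional split standing in for a proper random split:

```python
import pandas as pd

# Toy Age column standing in for the Titanic data (values are illustrative)
df = pd.DataFrame({"Age": [22.0, 38.0, None, 35.0, None, 54.0, 2.0, 27.0]})

# Split first, so the imputation value is learned from the training set only
train, test = df.iloc[:6].copy(), df.iloc[6:].copy()

# Age is roughly normal, so use the mean (use the median for skewed variables)
age_mean = train["Age"].mean()
train["Age"] = train["Age"].fillna(age_mean)
test["Age"] = test["Age"].fillna(age_mean)  # same train-derived value for the test set

print(age_mean)
```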

End of Distribution Imputation

If there is suspicion that values are not missing at random, then capturing that information is important. In this scenario, we replace missing data with values at the tails of the variable’s distribution. The advantage is that it is quick and captures the significance of missingness (when we suspect the missing data is informative). On the flip side, this may distort the variable, mask predictive power if missingness turns out not to be important, hide true outliers if a large share of the data is missing, or create unintended outliers if only a few values are missing. Once again, this method should be performed on the training set and propagated to the test set. Since we know “Age” follows a normal distribution, the tail values will be computed using the mean rather than the median.

[Image: feature engineering 7]
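A sketch of end of distribution imputation for a roughly normal variable, placing imputed values about three standard deviations above the mean. The three-sigma cutoff is a common convention, an assumption here rather than something taken from the original screenshot:

```python
import pandas as pd

# Toy Age column standing in for the Titanic training data
train = pd.DataFrame({"Age": [22.0, 38.0, None, 35.0, None, 54.0]})

# For a normal variable, the far tail sits around mean + 3 standard deviations
tail_value = train["Age"].mean() + 3 * train["Age"].std()

# Impute the nulls with the tail value (reuse the same value on the test set)
train["Age_eod"] = train["Age"].fillna(tail_value)
print(train)
```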

Handling missing values is just the tip of the iceberg when it comes to feature engineering. In future posts, we will describe how to handle categorical variables, dates, anomalies, scaling, normalization, and discretization.

Please be aware that there is no one-size-fits-all solution in data science, but these best practices can be leveraged in your search for model optimization.

