
Measuring Success in Data Science Efforts


A common challenge in data science efforts is discussing how well a model is performing. One of the biggest lessons we’ve learned in our client engagements is that understanding the measures of success, and translating them into language SMEs and stakeholders can understand, is absolutely vital to this kind of work. In this blog, I am going to look at three measures used to assess the efficacy of a classification model: accuracy, precision, and recall. I will also discuss how these measures have to be taken in context; what counts as success can vary by industry, by challenge, and by how mission-critical the decisions made by the model are. Note that these measures apply only to classification models (models that draw a discrete conclusion from a given set of data).

For the remainder of this blog I will use a simple confusion matrix to illustrate my points. The matrix compares predicted “positives” and “negatives” against actual “positives” and “negatives”. Before we begin, let’s look at an example.

Example: Predicting the Outcome of a Mortgage Application

In this example, let’s assume we use a training set of 10,000 historical mortgage applications. We’ve built and trained a model to predict whether an application would be approved or not. We’ll use nice round numbers just for ease!

Prediction table (10,000 applications):

  • Predicted approved, actually approved: 7,000
  • Predicted approved, actually rejected: 1,000
  • Predicted rejected, actually approved: 500
  • Predicted rejected, actually rejected: 1,500

Of the four cells let’s use this terminology:

  • True Positive: The model predicted the application would be approved and it was approved.
  • True Negative: The model predicted the application would be rejected and it was rejected.
  • False Positive: The model predicted the application would be approved, but it was rejected.
  • False Negative: The model predicted the application would be rejected, but it was approved.

In this case, we can say the following with confidence: 8,500 applications were correctly predicted. 1,500 were not. Now, let’s break down the efficacy of the model by looking at accuracy, precision, and recall.
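To make the four cells concrete, here is a minimal sketch (not from the original post) that tallies them with scikit-learn on a tiny made-up set of labels, where 1 means approved and 0 means rejected:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for eight applications: 1 = approved, 0 = rejected.
y_actual    = [1, 1, 1, 0, 0, 1, 0, 1]   # what actually happened
y_predicted = [1, 1, 0, 0, 1, 1, 0, 1]   # what the model predicted

# For binary labels, ravel() returns the four cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")  # TP=4  TN=2  FP=1  FN=1
```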

Accuracy is the most easily understood measure. It is quite simply the total number of true positives and true negatives divided by the size of the entire data set. This is, of course, a very useful measure because it lets us know how well our model is performing overall.

The accuracy of this sample model is 85% ((7,000 + 1,500) divided by 10,000). Not bad, depending on the scenario, which we will discuss in more depth shortly.
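As a quick sketch of that arithmetic in Python (the variable names are my own, not from the original example):

```python
# Accuracy: correct predictions (true positives + true negatives) over all predictions.
true_positives, true_negatives = 7_000, 1_500
total_applications = 10_000

accuracy = (true_positives + true_negatives) / total_applications
print(accuracy)  # 0.85
```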

Precision is simply a measure of true positives (positives our model identified) divided by true positives plus false positives; in other words, by everything our model predicted to be positive.

In our example, that is 7,000 over 8,000 or 87.5%.
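The same calculation as a short snippet, using the 1,000 false positives implied by the 8,000 predicted approvals:

```python
# Precision: true positives over everything the model predicted as positive.
true_positives, false_positives = 7_000, 1_000

precision = true_positives / (true_positives + false_positives)
print(precision)  # 0.875
```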

Before we talk about recall, let’s understand how these measures differ, and when you should care about one or the other.

If the cost of false positives is high, then precision is very important. In our mortgage example, false positives are certainly undesirable (we don’t want to lend to people who might not be able to keep up with the mortgage!) but might not spell doom for the organization. What if, however, our model were being used by a self-driving car to predict when to apply emergency braking? False positives (i.e., braking when the situation does not actually warrant it) could be extremely dangerous, or even fatal, depending on the road conditions. Conversely, if our model were attempting to diagnose the recurrence of cancer, a false positive would only result in a more in-depth follow-up with an oncologist.

Recall is the measure of true positives (positives our model identified) divided by true positives plus false negatives (positives our model missed); in other words, by all actual positives.

In our sample data, that is 7,000 over 7,500, which gives us a recall of 0.933, or 93.3%.
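And again as code, using the 500 false negatives implied by the 7,500 actual approvals:

```python
# Recall: true positives over all actual positives (true positives + false negatives).
true_positives, false_negatives = 7_000, 500

recall = true_positives / (true_positives + false_negatives)
print(round(recall, 3))  # 0.933
```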

If the cost of false negatives is high, then recall is very important. Again, in our mortgage example, false negatives are undesirable (we don’t want to miss out on a good borrower receiving a mortgage from us!) but potentially not catastrophic. But now let’s consider a model that detects micro-fissures in the casing of nuclear reactors. If our model predicts “there’s no fault, everything is fine” when in fact there is a fault (a false negative), the outcome could be disastrous!

Recall and precision are two very important measures, and which one you should care more about varies from use case to use case.

F1 Score

I know this has been complicated enough, but a discussion of precision and recall would not be complete without mentioning the F1 score. The F1 score is a function of both precision and recall and applies in situations where both false positives and false negatives are extremely important. I won’t go into the calculation in depth here, but know that it is a fourth evaluation criterion for certain applications.
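For readers who do want the calculation, the standard definition is the harmonic mean of precision and recall; here is a quick sketch with this example’s figures:

```python
# F1: the harmonic mean of precision and recall, using the example's numbers.
precision = 7_000 / 8_000   # 0.875
recall    = 7_000 / 7_500   # ~0.933

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # ~0.903
```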

In closing, evaluating a classification model is not quite as straightforward as one might think. It is worth the effort to think through exactly what your model is predicting and to decide early how to measure it in a meaningful way, tailored to the use case. In cases of life and death, extreme care should be taken to ensure the model errs on the side of caution, while a model that recommends a product someone might want to buy can afford to cast a much wider net. I hope this look at how to evaluate a model has been helpful! Thanks for reading.