Accuracy Never Tells the Whole Story
Updated: Mar 9
The data science process always involves, at some point, building a model. Models are simplified representations of some phenomenon we observe in the world. We might build models of how customers interact with a retailer, or we might build computer vision models that see the world around them in a way that attempts to represent human vision. Later in the data science process, we depend on these models to derive business insights. Model outputs often direct critical resource allocation and guide high-impact strategic and investment decisions. We expect these models to deliver real value. So it’s natural to ask: how accurate are they?
We intuitively understand accuracy as a measure of how well a model performs, that is, how well a model actually represents a real-world phenomenon. Usually accuracy is a number between 0 and 100%, and a perfect model would be 100% accurate. But when we try to pin down a more nuanced definition, we find that accuracy has many meanings. A model with 95% accuracy could be very good, or terrible, depending on the context. For example, a self-driving car that recognizes red lights only 95% of the time is a serious safety threat, but a navigation system that predicts travel times with 95% accuracy would be a success. This contextual ambiguity makes accuracy a problematic metric for measuring model performance. In some cases, accuracy can be completely meaningless. Here we explain why accuracy is so commonly cited yet so problematic and offer some helpful advice on contextualizing model performance along the way.
In the context of data science models, accuracy has several technical definitions, but we’ll start with the more colloquial dictionary definition: conformity to truth or to a standard; exactness. So, when a business stakeholder asks a data scientist how accurate a model is or how well it performs, what they are really asking is, “is the output of the model true and can I trust it to make decisions?” Accuracy is a single metric that attempts to answer this question. It’s concise and seems simple to interpret. It feels comforting.
Confusion occurs when the colloquial definition of accuracy collides with the technical definitions. First, let’s walk through an example that demonstrates the difference between these definitions. You would like to classify images of ice cream as either chocolate or vanilla. In data science parlance, we would call these classes. In your data set, you have exactly 500 images of chocolate ice cream and 500 images of vanilla. You train a computer vision model on these images and report an accuracy of 80%, meaning that 80% of the ice cream images were correctly classified by flavor. Does that mean that 80% of chocolate ice cream images were correctly classified? No. If vanilla ice cream images were classified correctly 100% of the time, then chocolate ice cream images could have been classified correctly only 60% of the time. If identifying chocolate ice cream is critical to your business, this performance discrepancy is a problem. Even in this simple classification example, we see that “accuracy” is actually an average accuracy which can hide the underlying variation in how a model performs. To answer the question of whether you can trust the model, something more than average accuracy is needed.
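The arithmetic behind the ice cream example can be sketched in a few lines of Python. The function names and the dictionary layout below are our own illustrative choices, not part of any particular library; the counts come straight from the example (500 images per flavor, all vanilla correct, 300 chocolate correct).

```python
def overall_accuracy(correct_by_class, total_by_class):
    """Fraction of all images classified correctly, pooled across classes."""
    return sum(correct_by_class.values()) / sum(total_by_class.values())


def per_class_accuracy(correct_by_class, total_by_class):
    """Fraction classified correctly within each class separately."""
    return {
        label: correct_by_class[label] / total_by_class[label]
        for label in total_by_class
    }


# Counts taken from the ice cream example in the text.
total = {"chocolate": 500, "vanilla": 500}
correct = {"chocolate": 300, "vanilla": 500}

print(overall_accuracy(correct, total))    # 0.8
print(per_class_accuracy(correct, total))  # {'chocolate': 0.6, 'vanilla': 1.0}
```

The single 80% figure is the pooled result of a 60% class and a 100% class, which is exactly the variation the overall number hides.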
Let’s take our example above and place our model predictions into a table, along with the true classes (chocolate or vanilla).
Here we immediately see the underlying variation that the single accuracy measure was hiding: all of the vanilla images were correctly classified as vanilla, but only 300 of 500 chocolate images (60%) were classified correctly. This type of table is called a confusion matrix. It is an unfortunate name, but a confusion matrix actually adds clarity to how a model performs. It tells us exactly what data went into the model (by summing the columns), and how frequently the model made correct predictions (along the diagonal in green) and incorrect predictions (in red). Confusion matrices are so common that each box has a specific name. We show these names in a table below.
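As a rough sketch, the confusion matrix for the ice cream example can be rebuilt from the true and predicted labels using only the Python standard library. The helper name `confusion_matrix` and the printed layout (rows are predictions, columns are true classes, so that column sums give the data that went in) are assumptions made for illustration.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count how often each (true class, predicted class) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


# Rebuild the ice cream example: 500 images per flavor,
# 300 chocolate images classified correctly, 200 mistaken for vanilla,
# and all 500 vanilla images classified correctly.
true_labels = ["chocolate"] * 500 + ["vanilla"] * 500
predicted_labels = ["chocolate"] * 300 + ["vanilla"] * 200 + ["vanilla"] * 500

matrix = confusion_matrix(true_labels, predicted_labels)
classes = ["chocolate", "vanilla"]

# Assumed layout: one row per predicted class, one column per true class.
print(f"{'':>22}" + "".join(f"{'true ' + c:>18}" for c in classes))
for predicted in classes:
    row = "".join(f"{matrix[(true, predicted)]:>18}" for true in classes)
    print(f"{'predicted ' + predicted:>22}" + row)
```

Each column sums to 500, the number of images of that flavor, and the off-diagonal cell holding 200 is the chocolate images that were mistaken for vanilla.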
You may have seen these terms before. We have positive and negative cases instead of vanilla and chocolate. The terms true positive, false positive, true negative, and false negative depend on which class gets defined as positive and which gets defined as negative. For a two-class problem like this one, a confusion matrix always shows these four values. So, when a data scientist reports model accuracy, one way to clarify that value is to ask for a confusion matrix.
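To make the naming concrete, here is a minimal sketch that counts the four cells for the ice cream example and shows how they change when the positive class is swapped. The function `four_cells` and its `positive` parameter are hypothetical names introduced for this illustration.

```python
def four_cells(true_labels, predicted_labels, positive):
    """Count TP, FP, TN, FN, treating `positive` as the positive class."""
    cells = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive:
            # The model said "positive": right if the truth agrees.
            cells["TP" if truth == positive else "FP"] += 1
        else:
            # The model said "negative": wrong if the truth was positive.
            cells["FN" if truth == positive else "TN"] += 1
    return cells


# Same ice cream data as in the example above.
true_labels = ["chocolate"] * 500 + ["vanilla"] * 500
predicted_labels = ["chocolate"] * 300 + ["vanilla"] * 200 + ["vanilla"] * 500

print(four_cells(true_labels, predicted_labels, positive="chocolate"))
# {'TP': 300, 'FP': 0, 'TN': 500, 'FN': 200}
print(four_cells(true_labels, predicted_labels, positive="vanilla"))
# {'TP': 500, 'FP': 200, 'TN': 300, 'FN': 0}
```

The underlying predictions are identical in both calls; only the labels on the four boxes move, which is why it is worth confirming which class a report treats as positive.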
We’ve now seen an example in which accuracy is misleading, and have observed how a confusion matrix can help. However, this issue with average accuracy (sometimes called overall accuracy) comes up again and again. Next, we’ll examine another case of misleading accuracy: detecting rare events like fraud.
Accurately Predicting Fraud and Other Rare Events
Fraud is an enormous problem across many sectors, from banking to healthcare to telecommunications. Fraud is also rare. For example, imagine that data scientists at Visa build a new fraud detection model. The model labels transactions as either valid or fraudulent. Visa provides the data scientists with a data set of transactions that are 99% valid and 1% fraudulent. The data scientists carefully construct a deep learning model to detect fraudulent transactions and report back to Visa management that the model is 99% accurate. Can you trust the model? The short answer is no, not on this number alone. If the data scientists had simply labeled all transactions as valid, they would have been correct 99% of the time. So in this case, it’s possible to have a fraud detection model that is 99% accurate but does not detect fraud at all.
In data science, we have a term for problems of this type: imbalanced classes. Real-world data sets rarely have balanced classes. A class is just a label for a data point. For a credit card transaction, the class could be either “valid” or “fraudulent.” Models that predict a class for a given data point are called classifiers. The training data for a classifier will contain data for each of the classes that the model tries to predict, but the volume of data for each class varies. In the fraud example, the “fraud” class makes up 1% of the data, and the “valid” class makes up 99% of the data. When the sizes of the classes are not equal, we call the data set imbalanced. We have to account for this imbalance when we interpret accuracy. A good rule of thumb is that the more imbalanced the training data, the less meaningful accuracy becomes. In extreme cases, where data for one class is very rare, accuracy becomes almost meaningless.
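One way to put a number on this rule of thumb is to compute the accuracy of a model that ignores its input and always predicts the largest class. The sketch below uses a hypothetical helper, `majority_baseline_accuracy`, and the class counts from the two examples in this post; any reported accuracy is only informative to the extent that it beats this baseline.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of always predicting the most common class."""
    return max(class_counts.values()) / sum(class_counts.values())


# Balanced ice cream data: guessing one flavor is right half the time.
print(majority_baseline_accuracy({"chocolate": 500, "vanilla": 500}))      # 0.5

# Imbalanced fraud data (1% fraudulent): guessing "valid" is right 99% of the time.
print(majority_baseline_accuracy({"valid": 99_000, "fraudulent": 1_000}))  # 0.99
```

With balanced classes an 80% accurate model is well above the 50% baseline, whereas with 1% fraud a 99% accurate model may be doing no better than a constant guess.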
To understand why overall accuracy becomes meaningless when classes are highly imbalanced, let’s construct a confusion matrix for the above example. Imagine that the Visa data scientists had data on 100,000 transactions. Here is a confusion matrix for the fraud detection model that predicts all transactions as valid, but has an overall accuracy of 99%.
The model seems to perform well (99% accurate!), but it has misclassified all 1,000 cases of fraud as valid. Thanks to our confusion matrix, we can clearly see that without significant improvement this model is hardly the ideal tool for Visa.
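The numbers for this hypothetical Visa model can be reproduced with a short sketch. We assume, as in the text, 100,000 transactions of which 1% are fraudulent, and a model that labels every transaction as valid; the variable names are our own.

```python
from collections import Counter

# 100,000 transactions, 1% fraudulent, as in the example above.
true_labels = ["fraudulent"] * 1_000 + ["valid"] * 99_000

# A "model" that labels every transaction as valid.
predicted_labels = ["valid"] * len(true_labels)

# Confusion matrix cells keyed by (true class, predicted class).
cells = Counter(zip(true_labels, predicted_labels))

correct = cells[("valid", "valid")] + cells[("fraudulent", "fraudulent")]
accuracy = correct / len(true_labels)
fraud_caught = cells[("fraudulent", "fraudulent")]
fraud_missed = cells[("fraudulent", "valid")]

print(accuracy)      # 0.99
print(fraud_caught)  # 0
print(fraud_missed)  # 1000
```

The accuracy comes entirely from the 99,000 valid transactions; the cell that matters to Visa, fraud correctly flagged as fraud, is empty.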
What to Consider When Accuracy is Reported
We started this post with an example of balanced classes in which we classified ice cream flavors, and demonstrated that accuracy can be misleading. We then discussed accuracy in the context of rare events and imbalanced classes and showed that there too, accuracy can mislead. We’ve illustrated how confusion matrices can add clarity to both types of classification problems.
Measuring how well a model represents a real-world phenomenon and how well it makes predictions is a difficult problem that even data scientists disagree on. We often talk about accuracy because it carries a simple colloquial meaning and people tend to understand it. But, as we’ve demonstrated here, accuracy never tells the whole story. Accuracy can mislead to the point of being meaningless.
In summary, when a data scientist reports the accuracy of a model, it is essential to ask for more detail and context. Ask what an acceptable accuracy level would be for your particular problem and recognize that higher is not always better. Are you classifying ice cream flavors or diagnosing cancer? Ask to see a confusion matrix. Always ask for the context and detail you need to truly assess whether a model can be trusted for critical decision-making.