Classification Machine Learning Metrics to Learn and Know
Whether you are a scientist or stakeholder, understanding these metrics is key!
So you built your classification machine learning model and are now in the next stage: measuring its performance. This decision requires much more business context than one may think. The top metric(s) used will depend on the goals, risk factors, and costs of the thing being modeled in your organization. In this blog, I walk through each machine learning metric and why it may be used.
Accuracy
Accuracy is the most straightforward metric and the simplest to understand. It is the number of correct predictions divided by the total number of predictions.
Example: 135 correct predictions / 150 total predictions = 90% Accuracy
Pros: It is the easiest metric to understand and implement when you are working with a balanced dataset and there is an equal amount of risk and/or cost for the different types of errors. It works best when the model needs to stay simple to explain.
Cons: Though this metric is easy to understand, it can be deceiving about how well the model actually performs, especially on imbalanced datasets. In addition, some organizations have a larger amount of risk and/or cost attached to false positives (for example, financial or medical-related problems). In those cases, a different metric should be used to minimize risk and/or cost.
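If you want to check this with code, here is a minimal sketch using scikit-learn's accuracy_score. The y_true and y_pred lists are made-up placeholder labels, not output from a real model:

```python
from sklearn.metrics import accuracy_score

# Hypothetical ground-truth labels and model predictions (placeholders)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

# accuracy = correct predictions / total predictions
print(accuracy_score(y_true, y_pred))  # 0.8 (8 of 10 predictions correct)
```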
Precision
For the next set of metrics, it is a good idea to create a confusion matrix first so that you can calculate the number of true positives, false positives, true negatives, and false negatives. Precision is a good metric to look at when dealing with an imbalanced dataset. It is the number of True Positives divided by the number of True Positives plus False Positives.
Calculation: True Positives / (True Positives + False Positives) = Precision
Pros: Precision is good to use when there is a high risk and/or cost attached to False Positives. It is also great for imbalanced datasets, ensuring that the model does not solely predict the majority class (making up the biggest proportion of the data).
Cons: Precision does not account for False Negatives, so recall should be used instead if there is a high risk and/or cost attached to them. It is also worth checking recall alongside precision, since a model can have high precision but low recall (see the next metric).
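Here is a minimal sketch of how you might build the confusion matrix and compute precision with scikit-learn, assuming made-up placeholder labels for a small imbalanced dataset:

```python
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical labels for an imbalanced dataset (placeholders)
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, confusion_matrix().ravel() returns TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                   # 6 1 1 2

# precision = TP / (TP + FP)
print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.67
```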
Recall (Also Called Sensitivity)
Recall is the proportion of actual positive samples that the model correctly predicts as positive. It is the number of True Positives divided by the number of True Positives plus False Negatives.
Calculation: True Positives / (True Positives + False Negatives) = Recall
Pros: Recall should be used when there is a high risk and/or cost attached to False Negatives. Similar to precision, it is also good to use with imbalanced datasets so that the majority class is not being predicted all the time.
Cons: The inverse of precision's weakness: if the model predicts every sample as positive, it can achieve high recall while precision stays low.
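Using the same placeholder labels as above, a quick sketch of recall with scikit-learn's recall_score could look like this:

```python
from sklearn.metrics import recall_score

# Same hypothetical placeholder labels as in the precision example
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]

# recall = TP / (TP + FN), i.e. the share of actual positives the model finds
print(recall_score(y_true, y_pred))  # 2 / (2 + 1) = 0.67
```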
F1-Score (And No, it Does Not Mean Formula 1 Racing)
If you are not isolating just precision or just recall as your priority metric, then you may want to use the F1-score. The F1-score combines precision and recall so that you do not need to compromise on one or the other and can place equal importance on both. It is 2 times precision times recall, divided by precision plus recall.
Calculation: (2 x Precision x Recall) / (Precision + Recall) = F1-Score
Pros: This metric captures both how well the model correctly predicts positive samples and how well it avoids incorrectly labeling negative samples as positive. It is great to use when there is an equal risk and/or cost attached to False Positives and False Negatives.
Cons: This metric should not be used if the precision and recall are very different, or if there is a different risk and/or cost attached to False Positives and False Negatives.
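A minimal sketch of the F1-score with scikit-learn, again assuming the placeholder labels from the earlier examples:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical placeholder labels (same as the earlier examples)
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)  # 0.67
r = recall_score(y_true, y_pred)     # 0.67

# F1 is the harmonic mean of precision and recall: (2 * p * r) / (p + r)
print(f1_score(y_true, y_pred))      # 0.67
```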
Specificity
Specificity is used more often in the medical field and is the True Negative rate. It is a good metric when you want to answer: of the patients who did not have the disease, how many got a negative result back? It is the counterpart of sensitivity/recall for the negative class and calculates the proportion of actual negative samples that the model correctly predicts as negative. It is the True Negatives divided by the True Negatives plus False Positives.
Calculation: True Negatives / (True Negatives + False Positives) = Specificity
Pros: It is a good metric to use when there is a high risk and/or cost attached to False Positives, and if you want the model to predict the probability of a negative test result given that the patient does not have the disease.
Cons: If sensitivity/recall is low, specificity is not a good sole metric to use; it is important to look at both. It also does not measure False Negatives, so it should not be relied on when those carry a high risk and/or cost.
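scikit-learn does not ship a dedicated specificity function, so a common approach is to compute it from the confusion matrix. Here is a minimal sketch using the same placeholder labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical test results: 1 = has the disease, 0 = does not (placeholders)
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# specificity = TN / (TN + FP), computed by hand since there is no built-in helper
specificity = tn / (tn + fp)
print(specificity)  # 6 / (6 + 1) = 0.86
```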
If you enjoyed this blog post, I would greatly appreciate you taking a moment to browse my other blog posts (I write on lifestyle, beauty, travel, restaurants, working in tech, and cocktails + wine), subscribe, and/or make a donation. Donation proceeds go toward monthly Squarespace fees, PO box fees, website enhancements, ad campaigns, SEO tools, and time investment in addition to my full-time job. Thank you for your readership from the bottom of my heart! xx Nicole