Classification Evaluation Metrics: The Ultimate Guide to Accurate Predictions

MD TAHSEEN EQUBAL
6 min read · Aug 23, 2024


This blog is dedicated to the crucial metrics used in classification problems. You may have come across problem statements where the well-known accuracy score is not enough on its own. Let us try to understand the confusion matrix, accuracy, precision, recall, F1 score, the ROC-AUC curve, and when to use each of them.

Classification Evaluation Metrics

Confusion Matrix:

We need a Confusion Matrix for classification evaluation because it provides a clear picture of how well a model is performing by showing not just the overall accuracy, but also where the model is making specific errors. It helps us understand:

  1. True Successes: How many predictions were correct.
  2. Errors: How many times the model predicted incorrectly, broken down into types of errors (false positives and false negatives).

This detailed view helps in diagnosing issues with the model and improving its performance.

The four counts that make up a confusion matrix are (a short Python sketch follows the list):
  • True Positive (TP): Number of correctly identified positive class instances
  • False Positive (FP): Number of negative class instances wrongly identified as positive class instances
  • True Negative (TN): Number of correctly identified negative class instances
  • False Negative (FN): Number of positive class instances wrongly identified as negative class instances
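
As a minimal sketch, these four counts can be read directly from scikit-learn's confusion_matrix (the y_true and y_pred lists below are the same illustrative labels used in the examples that follow):

from sklearn.metrics import confusion_matrix

# Illustrative labels: 1 = positive class, 0 = negative class
y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0, 1, 1]

# For binary labels, ravel() flattens the 2x2 matrix into TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f'TP={tp}, FP={fp}, TN={tn}, FN={fn}')
Output
TP=4, FP=1, TN=4, FN=1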

Classification Evaluation Metrics

1. Accuracy

Define:

Accuracy is the ratio of correctly predicted instances (both true positives and true negatives) to the total instances. It provides a simple way to evaluate the performance of a classification model.

It is suitable for balanced data.

Formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Where:

  • TP = True Positives
  • TN = True Negatives
  • FP = False Positives
  • FN = False Negatives

Advantage:

  • Easy to understand and calculate.
  • Useful when the classes are balanced (i.e., the number of instances in each class is roughly equal).

Disadvantage:

  • Misleading for imbalanced datasets, where one class dominates the other.
  • Accuracy does not distinguish between false positives and false negatives, so it ignores the different costs of the two error types.

When to Use:

  • When you have a balanced dataset or when class distribution is not skewed.

Desirable Value:

  • The higher the accuracy, the better the model.

Python Implementation:

from sklearn.metrics import accuracy_score

# Example
y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0, 1, 1]

accuracy = accuracy_score(y_true, y_pred)
print(f'Accuracy: {accuracy}')
Output
Accuracy: 0.8
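
To connect the output with the formula: for these labels TP = 4, TN = 4, FP = 1 and FN = 1, so Accuracy = (4 + 4) / (4 + 4 + 1 + 1) = 0.8, which matches the score above.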

2. Precision

Define:

Precision is the ratio of true positive predictions to the total positive predictions made by the model. It measures how many of the predicted positive instances are actually positive.

It is the metric to focus on when false positives are costly.

Formula:

Precision = TP / (TP + FP)

Advantage:

  • Useful in scenarios where the cost of false positives is high (e.g., spam detection)

Disadvantage:

  • Can be misleading if not considered alongside recall, especially in cases of imbalanced datasets.

When to Use:

  • When you want to minimize the number of false positives, such as in medical diagnostics or fraud detection.

Desirable Value:

  • The higher the precision, the better the model.

Python Implementation:

from sklearn.metrics import precision_score

# Example (same y_true and y_pred as in the accuracy example)
precision = precision_score(y_true, y_pred)
print(f'Precision: {precision}')
Output
Precision: 0.8
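
Checked against the formula: TP = 4 and FP = 1, so Precision = 4 / (4 + 1) = 0.8.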

3. Recall

Define:

Recall (also known as sensitivity or true positive rate) is the ratio of true positive predictions to the total actual positives. It measures how many of the actual positive instances the model correctly identified.

It is the metric to focus on when false negatives are costly.

Formula:

Recall = TP / (TP + FN)

Advantage:

  • Useful in scenarios where the cost of false negatives is high (e.g., detecting cancer).

Disadvantage:

  • Can lead to high false positives if not balanced with precision.

When to Use:

  • When missing a positive instance has a higher cost than incorrectly predicting a positive instance, such as in medical testing or security screening.

Desirable Value:

  • The higher the recall, the better the model.

Python Implementation:

from sklearn.metrics import recall_score

# Example (same y_true and y_pred as above)
recall = recall_score(y_true, y_pred)
print(f'Recall: {recall}')
Output
Recall: 0.8
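
Checked against the formula: TP = 4 and FN = 1, so Recall = 4 / (4 + 1) = 0.8.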

4. F1 Score

Define:

The F1 Score is the harmonic mean of precision and recall. It provides a single metric that balances both precision and recall, especially useful in cases of imbalanced datasets.

It is suitable for imbalanced data.

Formula:

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Advantage:

  • Balances the trade-off between precision and recall.
  • Useful when the classes are imbalanced and both precision and recall are important.

Disadvantage:

  • Less interpretable compared to individual precision and recall scores.

When to Use:

  • When you need a balance between precision and recall, especially in cases of imbalanced datasets.

Desirable Value:

  • The higher the F1 score, the better the model.

Python Implementation:

from sklearn.metrics import f1_score

# Example (same y_true and y_pred as above)
f1 = f1_score(y_true, y_pred)
print(f'F1 Score: {f1}')
Output
F1 Score: 0.8
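
Checked against the formula: with Precision = 0.8 and Recall = 0.8, F1 = 2 × (0.8 × 0.8) / (0.8 + 0.8) = 0.8.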

5. AUC-ROC Curve

Define:

The Area Under the Receiver Operating Characteristic Curve (AUC-ROC) measures the model’s ability to distinguish between classes.

The ROC curve plots the true positive rate (recall) against the false positive rate.

The AUC represents the area under this curve.

It is also used for selecting a classification threshold (see the threshold-selection section below).

Formula:

There is no single closed-form formula for AUC-ROC; the ROC curve is plotted as:

  • x-axis: False Positive Rate, FPR = FP / (FP + TN)
  • y-axis: True Positive Rate, TPR = TP / (TP + FN) (i.e., recall)

Advantage:

  • Provides a comprehensive view of model performance across all classification thresholds.
  • Useful for comparing different models.

Disadvantage:

  • Can be less intuitive to interpret.
  • A higher AUC-ROC doesn’t always translate to better real-world performance, especially with imbalanced datasets.

When to Use:

  • When you want to evaluate the model’s ability to separate classes and compare the performance of different models.

Desirable Value:

  • The higher the AUC, the better the model: 0.5 corresponds to random guessing and 1.0 to perfect separation.

Python Implementation:

from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Example
y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.6, 0.2, 0.3, 0.9, 0.5]

auc = roc_auc_score(y_true, y_scores)
fpr, tpr, _ = roc_curve(y_true, y_scores)

plt.plot(fpr, tpr, label=f'AUC = {auc:.2f}')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
Output
The ROC curve plot, with AUC = 0.76 shown in the legend.

How to Select the Optimal Threshold Value:

Depending on which error is more costly, raising or lowering the threshold trades false positives against false negatives.

Steps:

  • Build a data frame from the FPR, TPR and threshold arrays returned by roc_curve, then add a distance column measuring how far each ROC point lies from the ideal corner (FPR = 0, TPR = 1):
df['distance'] = (df['fpr']**2 + (1 - df['tpr'])**2)**0.5
  • Choose the threshold with the minimum distance, i.e., the point on the ROC curve closest to the top-left corner. A complete sketch for the example above follows.
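
As a minimal sketch (assuming pandas is available, and reusing y_true and y_scores from the AUC-ROC example above), the steps can be carried out like this:

import pandas as pd
from sklearn.metrics import roc_curve

y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.6, 0.2, 0.3, 0.9, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# One row per candidate threshold, together with its ROC point
df = pd.DataFrame({'fpr': fpr, 'tpr': tpr, 'threshold': thresholds})

# Distance of each ROC point from the ideal corner (FPR = 0, TPR = 1)
df['distance'] = (df['fpr']**2 + (1 - df['tpr'])**2)**0.5

# The row with the smallest distance gives the optimal threshold
print(df.loc[df['distance'].idxmin()])

For this example data, the smallest distance falls at a threshold of 0.35 (the ROC point at FPR = 0.4, TPR = 1.0).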

Summary

A quick summary of the classification evaluation metrics covered above:

  • Accuracy: overall share of correct predictions; best for balanced data.
  • Precision: share of predicted positives that are truly positive; prioritize it when false positives are costly.
  • Recall: share of actual positives the model catches; prioritize it when false negatives are costly.
  • F1 Score: harmonic mean of precision and recall; useful for imbalanced data.
  • AUC-ROC: threshold-independent measure of how well the model separates the classes; useful for comparing models and selecting a threshold.

Enjoyed this article?

If you found this post helpful and insightful, please take a moment to like it. Your feedback helps me continue creating content that matters to you.

I’d love to hear your thoughts and questions — leave a comment below and let’s start a conversation!

For more articles on Data Science, follow my Medium page to stay updated with the latest content and updates. Your support means a lot!

Thank you for reading, and I look forward to connecting with you!
