Overview

Teaching: 15 min
Exercises: 10 min
Questions
  • How can I tell if my classifier is good?

Objectives
  • Understand what a confusion matrix is

  • Learn about the different metrics used to evaluate classifier performance

Below is a refresher of the code we’ve been using to classify the data.

from sklearn.ensemble import RandomForestClassifier

# Train a random forest on the training set
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(features_train, target_train)

# Predict on the validation set and report the accuracy
pred_validation = clf.predict(features_validation)
print(calcAccuracy(target_validation, pred_validation))
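(The calcAccuracy function comes from earlier in the lesson; if you don't have it to hand, a minimal sketch of such a helper, assuming the targets and predictions are equal-length sequences of labels, could look like the following.)

def calcAccuracy(targets, predictions):
  # Fraction of predictions that exactly match the true labels
  correct = sum(1 for t, p in zip(targets, predictions) if t == p)
  return correct / len(targets)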

True positives, etc

To calculate different performance metrics, we need to know where the classifier is succeeding and where it is failing. The successes are characterized by true positives and true negatives; the failures are characterized by false positives and false negatives. Let’s start by calculating the number of true positives.

true_positives = 0

for i in range(N_validation):
  if target_validation[i] == True and pred_validation[i] == True:
    true_positives += 1

print(true_positives)

Let’s calculate the other three counts: true negatives, false positives and false negatives.

Solution

true_positives = 0
true_negatives = 0
false_positives = 0
false_negatives = 0

for i in range(N_validation):
  if target_validation[i] == True and pred_validation[i] == True:
    true_positives += 1
  elif target_validation[i] == False and pred_validation[i] == True:
    false_positives += 1
  elif target_validation[i] == True and pred_validation[i] == False:
    false_negatives += 1
  elif target_validation[i] == False and pred_validation[i] == False:
    true_negatives += 1

print(true_positives, true_negatives, false_positives, false_negatives)
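If target_validation and pred_validation happen to be NumPy arrays (an assumption; the loop above works just as well for plain lists), the same four counts can be computed more compactly with element-wise comparisons:

import numpy as np

# Each comparison gives a boolean array; & combines them element-wise,
# and summing a boolean array counts its True entries
true_positives = np.sum((target_validation == True) & (pred_validation == True))
false_positives = np.sum((target_validation == False) & (pred_validation == True))
false_negatives = np.sum((target_validation == True) & (pred_validation == False))
true_negatives = np.sum((target_validation == False) & (pred_validation == False))

print(true_positives, true_negatives, false_positives, false_negatives)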

Metrics

With those four counts, you can calculate a variety of different metrics. Together, the four counts make up the confusion matrix: for a binary classification problem (i.e. predicting positive/negative), it is a 2x2 grid. The Wikipedia article on the confusion matrix gives a nice overview of the different metrics that can be calculated from these numbers. We calculate a few of the popular ones below. All of these metrics range from 0 to 1, with higher values meaning better performance.

accuracy = (true_positives + true_negatives) / (true_positives + true_negatives + false_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
print(accuracy, precision, recall)

You should consult more than one metric to decide whether your classifier is successful. That said, many people use the F1 score as their main focus; it is the harmonic mean of the precision and recall scores.

f1score = 2 * (precision * recall) / (precision + recall)
print(f1score)

Using metrics from scikit-learn

You don’t need to code this all manually next time: scikit-learn provides a number of functions for calculating classifier performance metrics. We use several below, including one that calculates the confusion matrix itself. Note that the four counts are the same as we calculated previously, although scikit-learn arranges the matrix with true negatives in the top-left corner (i.e. [[TN, FP], [FN, TP]] for binary labels).

from sklearn.metrics import confusion_matrix
print(confusion_matrix(target_validation,pred_validation))

from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

print(accuracy_score(target_validation,pred_validation))
print(precision_score(target_validation,pred_validation))
print(recall_score(target_validation,pred_validation))
print(f1_score(target_validation,pred_validation))
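As an extra convenience (not part of the lesson code above), scikit-learn also has a classification_report function that prints precision, recall and F1 for each class in a single table:

from sklearn.metrics import classification_report

# Precision, recall, F1 and support for each class in one summary table
print(classification_report(target_validation, pred_validation))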

Final evaluation using the test set

We’ve tried a few different models and the Random Forest classifier seems to do quite well (or perhaps you found another one that does better). Let’s do a final evaluation using the held-out test set. These are the results you would report in any publication or report. You should only evaluate against your test set once.

# Use the classifier to make the test predictions
pred_test = clf.predict(features_test)

print(accuracy_score(target_test,pred_test))
print(precision_score(target_test,pred_test))
print(recall_score(target_test,pred_test))
print(f1_score(target_test,pred_test))

Saving your classifier

We’ve built a classifier that we think performs well and would like to use it on future data. Do we need to retrain it every time? No, we can save the trained model and load it for future use. Scikit-learn models can be saved with Python’s pickle module, as shown below.

import pickle

# Save the trained classifier to "saved_classifier.pickle" file
with open('saved_classifier.pickle','wb') as f:
  pickle.dump(clf,f)

You can then load it in the future with the following code and start predicting straight away (without having to train again). The pickle file stores all the information the classifier needs in order to predict (as long as you trained it before saving it).

# Load the classifier from the file. It's now ready for predicting!
with open('saved_classifier.pickle','rb') as f:
  clf = pickle.load(f)
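Once loaded, the classifier can be used exactly as before. As an illustration, features_new below is a hypothetical feature matrix of new samples (it must have the same columns, in the same order, as the features used for training):

# features_new is a placeholder name for new data to classify
pred_new = clf.predict(features_new)
print(pred_new)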

Key Points
  • True positives, true negatives, false positives and false negatives make up the confusion matrix.

  • Metrics such as accuracy, precision, recall and the F1 score are calculated from those four counts.

  • scikit-learn provides the confusion matrix and common metrics in sklearn.metrics.

  • Evaluate against the held-out test set only once, and report those results.

  • A trained classifier can be saved with pickle and loaded later for prediction.