Francisca Dias
In this paper I will focus on the evaluation metrics available for calssification problems using the dataset Costumer Churn.
Predicting when a costumer is about to churn is valuable in many businesses: communications, banking, movie rental, etc.
With the data at our disposal we can predict, using different models, who are the costumers that are more likely to churn.
My goal here is to show you different metrics and explain each one.
For this purposes I will leave out the results from other models, since I only want to focus in the meaning of results.
This is a binary classification problem, since the predicted results are Yes (1) or No (0).
Many classification models could be used here: KNeighbors Classifier, Logistic Regression, Gradient Boosting Classifier and so on.
I will only deploy the results from the Random Forest Classifier.
This dataset can be found here.
import pandas as pd
import numpy as np
churn_df = pd.read_csv('churn.csv')
# Drop the columns that we have decided won't be used in prediction
churn_df = churn_df.drop(["Phone", "Area Code", "State"], axis=1)
churn_df["Churn?"] = np.where(churn_df["Churn?"] == 'True.',1,0)
churn_df["Int'l Plan"] = np.where(churn_df["Int'l Plan"] == 'yes',1,0)
churn_df["VMail Plan"] = np.where(churn_df["VMail Plan"] == 'yes',1,0)
array = churn_df.values
X = array[:,0:17]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
y = array[:,17]
The metrics to evaluate the machine learning algorithms are very important.
The choice of metrics influence how you weight the importance of diferent characteristics in the results and your ultimate choice of which algorithm to choose.
Below are metrics that can be used to evaluate predictions on classification problems:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
model_rf = RandomForestClassifier()
kfold = KFold(n_splits=10, random_state=7)
results_rf = cross_val_score(model_rf, X, y, cv=kfold, scoring='accuracy')
print('The results for Random Forest are', results_rf.mean())
Here is the problem on using accuracy as a model performance measurement.
There are two scenarios that should be considered and addressed:
First it can happen that my classifier predicted a customer would churn and they didn't.
Second my classifier can predicted a customer would stay within the business, so nothing was done, and in the end they churned.
Second scenario is bad, and that is the one we will try to avoid by minimizing the value.
The confusion matrix is a handy presentation of the accuracy of a model with two or more classes.
We can draw a confusion matrix in two ways:
from IPython.display import Image
Image("conf_matrix.png")
TN - True negatives : we predicted a costumer would not churn and they did not churn.
FP - False positives : We predicted a costumer would churn, but they did not.
FN - False negatives : we predicted a costumer would not churn and they actually churn.
TP - True positives : we predicted a costumer churn and they actually churn.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)
model_rf.fit(X_train, y_train)
y_pred_rf = model_rf.predict(X_test)
confusion_rf = pd.crosstab(y_test, y_pred_rf, rownames=['Actual'], colnames=['Predicted'], margins=True)
confusion_rf
Here are some important questions that I can now answer:
Accuracy
TP = 92
TN = 692
Total = 834
accuracy = (TP + TN)/Total
accuracy
Sensitivity or Recall
TP = 92
Actual_Yes = 129
sensitivity = TP / Actual_Yes
sensitivity
Precision
TP = 92
Predicted_Yes = 105
precision = TP / Predicted_Yes
precision
The scikit-learn library provides a great report when working on classification problems such as this one.
The Classification Report will give you a quick idea of the accuracy of a model using a number of measures:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_rf))