Francisca Dias
The goal here is to predict which customers are most likely to churn in the future.
This is a binary classification problem: customers either churn or they don't.
If the result is negative (0: negative class), the customer is assumed to stay with our service, while if the result is positive (1: positive class), the customer will leave our service.
In this report I will show how to apply machine learning to a predictive modeling problem.
This dataset can be found here.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
churn_df = pd.read_csv('churn.csv')
churn_df.hist(figsize=(16, 10))
plt.show()
We can see that most of the attributes appear to have a Gaussian or nearly Gaussian distribution.
This is interesting because many machine learning techniques assume a Gaussian univariate distribution on the input variables.
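To put a rough number on that visual impression, we can also check the skewness of each numeric attribute; this is a quick sketch of mine, not part of the original analysis:
# Skewness near 0 suggests a roughly symmetric, Gaussian-like distribution
print(churn_df.skew(numeric_only=True).sort_values())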
# Drop the columns that we have decided won't be used in prediction
churn_df = churn_df.drop(["Phone", "Area Code", "State"], axis=1)
churn_df["Churn?"] = np.where(churn_df["Churn?"] == 'True.',1,0)
churn_df["Int'l Plan"] = np.where(churn_df["Int'l Plan"] == 'yes',1,0)
churn_df["VMail Plan"] = np.where(churn_df["VMail Plan"] == 'yes',1,0)
# Separate the features (first 17 columns) from the target (last column)
array = churn_df.values
X = array[:, 0:17]
y = array[:, 17:]
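For reference, the same split can also be written by column name instead of position, which does not silently break if the column order changes; a small equivalent sketch, assuming "Churn?" is the target column:
# Equivalent split by column name rather than position
X = churn_df.drop("Churn?", axis=1).values
y = churn_df["Churn?"].values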
In order to confirm the accuracy of our final model, we should set aside a sample of the data that is kept out of the initial analysis and modelling: this is called a validation hold-out set.
We will use 80% of the data for modeling and hold back 20% for validation, as specified by test_size=0.2.
from sklearn.model_selection import train_test_split
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.2, random_state=7)
I will select 6 different algorithms for this classification problem.
Please note that I will use the default tuning parameters for now (parameter tuning comes later, once I have chosen the model(s) with the best accuracy scores).
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Spot-Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    cv_results = cross_val_score(model, X_train, y_train.ravel(), cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f" % (name, cv_results.mean())
    print(msg)
The results suggest that both KNN (87% accuracy) and the Decision Tree Classifier (90% accuracy) may be worth further study.
When a dataset has attributes with differing distributions, such as this one, the performance of some algorithms can suffer, so it is a good idea to transform the data so that each attribute has a mean value of zero and a standard deviation of one.
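As a quick illustration of what this standardization does (a minimal sketch of mine, separate from the pipelines used below):
from sklearn.preprocessing import StandardScaler
# After scaling, each column should have mean ~0 and standard deviation ~1
demo = StandardScaler().fit_transform(X_train)
print(demo.mean(axis=0).round(2))
print(demo.std(axis=0).round(2))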
Data Leakage
Data leakage is another issue we should avoid when transforming the data.
In order to avoid it, we should use pipelines that both standardize the data and build the model within each fold of the cross-validation.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipelines = []
pipelines.append(('ScaledLR', Pipeline([('Scaler', StandardScaler()), ('LR', LogisticRegression())])))
pipelines.append(('ScaledLDA', Pipeline([('Scaler', StandardScaler()), ('LDA', LinearDiscriminantAnalysis())])))
pipelines.append(('ScaledKNN', Pipeline([('Scaler', StandardScaler()), ('KNN', KNeighborsClassifier())])))
pipelines.append(('ScaledCART', Pipeline([('Scaler', StandardScaler()), ('CART', DecisionTreeClassifier())])))
pipelines.append(('ScaledNB', Pipeline([('Scaler', StandardScaler()), ('NB', GaussianNB())])))
pipelines.append(('ScaledSVM', Pipeline([('Scaler', StandardScaler()), ('SVM', SVC())])))
results = []
names = []
for name, model in pipelines:
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    cv_results = cross_val_score(model, X_train, y_train.ravel(), cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f" % (name, cv_results.mean())
    print(msg)
The standardization of the data has increased the accuracy of both CART and SVM compared to the previous results.
The results suggest that we should work more on SVM, the algorithm with the highest accuracy (91.7%).
Remember that I used the default parameters when estimating accuracy, so we have reached the point where we should search for the right parameters: parameter tuning.
This might yield an even more accurate model.
Tuning SVM
There are two key parameters of the SVM algorithm that we can tune:
the value of C (how much to relax the margin) and
the type of kernel (linear, poly, rbf and sigmoid)
The default for SVM is the rbf kernel with a C value equal to 1.0.
I will perform a grid search using 10-fold cross validation with a standardized copy of the training dataset.
from sklearn.model_selection import GridSearchCV
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
c_values = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.3, 1.5, 1.7, 2.0]
kernel_values = ['linear', 'poly', 'rbf', 'sigmoid']
param_grid = dict(C=c_values, kernel=kernel_values)
model_SVM = SVC()
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
grid = GridSearchCV(estimator=model_SVM, param_grid=param_grid, cv=kfold, scoring='accuracy')
grid_result = grid.fit(rescaledX, y_train.ravel())
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
params = grid_result.cv_results_['params']
for mean, param in zip(means, params):
    print("%f with: %r" % (mean, param))
We can see that the most accurate configuration was SVM with a poly kernel and a C value of 2.0.
The accuracy is now 92.1%, better than the 91.7% we achieved earlier with the default SVM parameters.
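Since GridSearchCV refits the best configuration on the whole training set by default (refit=True), the tuned model is available directly; a short sketch (the variable name best_svm is mine):
# The best model, already refit on all of rescaledX
best_svm = grid_result.best_estimator_
print(best_svm)
# New data would need the same scaling, e.g. best_svm.predict(scaler.transform(X_validation))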
Another way that we can improve the performance of algorithms on this classification problem is by using ensemble methods.
I will run 4 ensemble methods:
I will still use 10-fold cross-validation.
Please note that I will not perform data standardization here: all four ensemble algorithms are based on decision trees, which are less sensitive to data distributions.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
# ensembles
ensembles = []
ensembles.append(('AB', AdaBoostClassifier()))
ensembles.append(('GBM', GradientBoostingClassifier()))
ensembles.append(('RF', RandomForestClassifier()))
ensembles.append(('ET', ExtraTreesClassifier()))
results = []
names = []
for name, model in ensembles:
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    cv_results = cross_val_score(model, X_train, y_train.ravel(), cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f" % (name, cv_results.mean())
    print(msg)
We can see that the boosting and bagging techniques provide very strong accuracy scores with their default configurations.
The results suggest Gradient Boosting (GBM) may be worthy of further study, with the highest accuracy achieved so far: 94.9%.
Tuning GBM
GBM has a few parameters that can dramatically affect the accuracy of the model.
I will focus on two of them:
Learning rate: in general, a small learning rate (and a large number of estimators) will yield a more accurate GBM model, though the model will also take longer to train since it performs more iterations.
Number of estimators: the greater the number of estimators, the greater the risk of overfitting, that is, accurate predictions on the training data but inaccurate predictions on new data. Typical values range from 100 to 1000. I will set the number of estimators to 1000.
model_gbm = GradientBoostingClassifier(learning_rate=0.05, n_estimators=1000)
# Evaluate with the same 10-fold cross-validation used above
cv_results = cross_val_score(model_gbm, X_train, y_train.ravel(), cv=kfold, scoring='accuracy')
print(cv_results.mean())
The GBM (learning_rate=0.05, n_estimators=1000) is so far the model with highest accuracy: 94.97%.
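The two parameters interact, so instead of fixing them by hand they could also be grid-searched jointly, mirroring the SVM tuning above; a sketch with illustrative candidate values, not run as part of this report:
# Joint grid search over learning rate and number of estimators
gbm_param_grid = dict(learning_rate=[0.01, 0.05, 0.1], n_estimators=[100, 500, 1000])
gbm_grid = GridSearchCV(estimator=GradientBoostingClassifier(), param_grid=gbm_param_grid, cv=kfold, scoring='accuracy')
gbm_grid_result = gbm_grid.fit(X_train, y_train.ravel())
print("Best: %f using %s" % (gbm_grid_result.best_score_, gbm_grid_result.best_params_))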
Here I will finalize the model by training it on the entire training dataset and making predictions on the hold-out validation dataset to confirm our findings.
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
# Use the GBM model as before, with the tuned parameters
model_gbm.fit(X_train, y_train.ravel())
# estimate the accuracy on validation dataset
predictions = model_gbm.predict(X_validation)
print(accuracy_score(y_validation, predictions))
print(confusion_matrix(y_validation, predictions))
print(classification_report(y_validation, predictions))
We can see that we achieve an accuracy of nearly 95% on the held-out validation dataset, a score that closely matches the estimate we obtained earlier while tuning the GBM.
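If this model were to be put to use, a natural final step would be to persist it; a minimal sketch using joblib, with an illustrative filename:
import joblib
# Save the tuned GBM so new customers can be scored without retraining
joblib.dump(model_gbm, 'churn_gbm.joblib')
# Later: model = joblib.load('churn_gbm.joblib')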
In this report we worked through a classification predictive modeling machine learning problem using Python.
The steps covered were: