Machine Learning algorithms on a Classification Problem

Francisca Dias

The goal here is to predict which customers are most likely to churn in the future.

This is a binary classification problem: customers either churn or they don't.

If the result is negative (0: negative class), the customer is assumed to stay with our service; if the result is positive (1: positive class), the customer is expected to leave.

In this report I will show how to apply machine learning to a predictive modeling problem.

  • I will use data transforms to improve model performance.
  • I will use algorithm tuning to improve model performance.
  • I will use ensemble methods and tuning of ensemble methods to improve model performance.

This dataset can be found here.

In [1]:
import pandas as pd
import numpy as np

churn_df = pd.read_csv('churn.csv')
In [2]:
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

churn_df.hist(figsize=(16, 10))
plt.show()

We can see that most of the numeric attributes appear to have a Gaussian or nearly Gaussian distribution.

This is interesting because many machine learning techniques assume a Gaussian univariate distribution on the input variables.
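
A histogram only gives a visual impression. As a rough numeric check, one could also look at the skewness of each column, where values near 0 are consistent with a symmetric, roughly Gaussian shape. The sketch below uses synthetic data with hypothetical column names (`roughly_gaussian`, `right_skewed`), not the churn data itself.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the churn data (column names are hypothetical)
rng = np.random.default_rng(7)
demo_df = pd.DataFrame({
    "roughly_gaussian": rng.normal(loc=100, scale=20, size=5000),
    "right_skewed": rng.exponential(scale=2.0, size=5000),
})

# Sample skewness per column: close to 0 for symmetric distributions,
# clearly positive for a long right tail
skewness = demo_df.skew()
print(skewness)
```

The same call on the real data would be `churn_df.skew(numeric_only=True)`, assuming a recent pandas version.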

In [3]:
# Drop the columns that we have decided won't be used in prediction
churn_df = churn_df.drop(["Phone", "Area Code", "State"], axis=1)
In [4]:
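# Encode the target: 1 if the customer churned, 0 otherwise
churn_df["Churn?"] = np.where(churn_df["Churn?"] == 'True.', 1, 0)
In [5]:
# Encode the yes/no plan flags as 1/0
churn_df["Int'l Plan"] = np.where(churn_df["Int'l Plan"] == 'yes', 1, 0)
churn_df["VMail Plan"] = np.where(churn_df["VMail Plan"] == 'yes', 1, 0)
In [6]:
# Columns 0-16 are the input features; the remaining column holds the target ("Churn?")
array = churn_df.values
X = array[:, 0:17]
y = array[:, 17:]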

To confirm the accuracy of our final model, we should set aside a sample of the data that is not used in the analysis and modeling. This is called a validation hold-out set.

We will use 80% of the data for modeling and hold back 20% for validation (test_size=0.2).

In [8]:
from sklearn.model_selection import train_test_split


X_train, X_validation, y_train, y_validation = train_test_split(X, y,test_size=0.2, random_state=7)

I will select 6 different algorithms for this classification problem:

  • 2 are Linear (Logistic Regression and Linear Discriminant Analysis) and
  • 4 are Nonlinear (Decision Tree Classifier, Support Vector Machine, Gaussian Naive Bayes and k-Nearest Neighbors)

Please note that I will use the default parameters here; tuning comes later, once I have chosen the model(s) with the best accuracy score.

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


# Spot-Check Algorithms
models = []
models.append(( 'LR' , LogisticRegression()))
models.append(( 'LDA' , LinearDiscriminantAnalysis()))
models.append(( 'KNN' , KNeighborsClassifier()))
models.append(( 'CART' , DecisionTreeClassifier()))
models.append(( 'NB' , GaussianNB()))
models.append(( 'SVM' , SVC()))
  • I will split the data into 10 folds (n_splits=10).
  • I will evaluate the algorithms using the accuracy metric (scoring='accuracy'), i.e. the fraction of predictions that are correct.
In [10]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

results = []
names = []
for name, model in models:
    # random_state only applies when shuffle=True; recent scikit-learn versions raise an error otherwise
    kfold = KFold(n_splits=10)
    cv_results = cross_val_score(model, X_train, y_train.ravel(), cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f" % (name, cv_results.mean())
    print(msg)
LR: 0.863820
LDA: 0.855194
KNN: 0.878448
CART: 0.905098
NB: 0.864959
SVM: 0.856687

The results suggest that both KNN (87% accuracy) and the Decision Tree Classifier (90% accuracy) may be worth further study.
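
Accuracy is easier to judge against a baseline. The hold-out set reported at the end of this report contains 566 customers who stayed and 101 who churned, so a trivial model that always predicts "no churn" would already be right most of the time on that set. A small sketch of the arithmetic, using only those two counts:

```python
# Class counts taken from the hold-out classification report in this document
n_stay = 566    # class 0: customers who did not churn
n_churn = 101   # class 1: customers who churned

# Accuracy of a trivial model that always predicts the majority class (0)
baseline_accuracy = n_stay / (n_stay + n_churn)
print(round(baseline_accuracy, 4))
```

Assuming the training data has a similar class balance, scores around 85% add little over this baseline, while the Decision Tree Classifier at about 90% is a clearer improvement.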

When the attributes of a dataset have differing scales and distributions, as they do here, the performance of some algorithms can suffer, so it is a good idea to transform the data so that each attribute has a mean of zero and a standard deviation of one.
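
As a minimal sketch of what this transformation does, the example below standardizes two made-up features on very different scales, once by hand and once with scikit-learn's StandardScaler. The data and the `X_demo` name are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales
rng = np.random.default_rng(7)
X_demo = np.column_stack([
    rng.normal(loc=200.0, scale=50.0, size=1000),   # e.g. minutes
    rng.normal(loc=3.0, scale=1.0, size=1000),      # e.g. number of calls
])

# Manual z-score: subtract the column mean, divide by the column standard deviation
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# StandardScaler applies the same transformation
X_scaled = StandardScaler().fit_transform(X_demo)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```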

Data Leakage

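Data leakage is another issue we should avoid when transforming the data. It occurs when information from the data used for evaluation influences training, for example when the scaler is fitted on the full dataset before cross validation.

To avoid it, we should use pipelines that standardize the data and build the model separately within each fold of the cross validation.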

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


pipelines = []
pipelines.append(( 'ScaledLR' , Pipeline([( 'Scaler' , StandardScaler()),( 'LR' ,LogisticRegression())])))
pipelines.append(( 'ScaledLDA' , Pipeline([( 'Scaler' , StandardScaler()),( 'LDA' ,LinearDiscriminantAnalysis())])))
pipelines.append(( 'ScaledKNN' , Pipeline([( 'Scaler' , StandardScaler()),( 'KNN' ,KNeighborsClassifier())])))
pipelines.append(( 'ScaledCART' , Pipeline([( 'Scaler' , StandardScaler()),( 'CART' ,DecisionTreeClassifier())])))
pipelines.append(( 'ScaledNB' , Pipeline([( 'Scaler' , StandardScaler()),( 'NB' ,GaussianNB())])))
pipelines.append(( 'ScaledSVM' , Pipeline([( 'Scaler' , StandardScaler()),( 'SVM' , SVC())])))


results = []
names = []
for name, model in pipelines:
    # random_state only applies when shuffle=True; recent scikit-learn versions raise an error otherwise
    kfold = KFold(n_splits=10)
    cv_results = cross_val_score(model, X_train, y_train.ravel(), cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f" % (name, cv_results.mean())
    print(msg)
ScaledLR: 0.866824
ScaledLDA: 0.855194
ScaledKNN: 0.896828
ScaledCART: 0.899485
ScaledNB: 0.864959
ScaledSVM: 0.916710

Standardizing the data has increased the accuracy of both KNN (from 87.8% to 89.7%) and SVM (from 85.7% to 91.7%) compared to the previous results, while CART dropped slightly (from 90.5% to 89.9%).

The results suggest that we should work more on SVM, the algorithm with the highest accuracy (91.7%).

Remember that I used the default parameters when estimating accuracy. We have now reached the point where we should search for better parameter values, that is, tune the model.

This might yield an even more accurate model.

Tuning SVM

There are two key parameters in SVM algorithm that we can tune:

  • the value of C (the penalty for margin violations: smaller values relax the margin more) and
  • the type of kernel (linear, poly, rbf and sigmoid)

The default for SVM is the rbf kernel with a C value equal to 1.0.

I will perform a grid search using 10-fold cross validation with a standardized copy of the training dataset.

  • I will try the 4 kernel types mentioned above and
  • C values ranging from 0.1 (stronger regularization, i.e. more bias) to 2.0 (weaker regularization, i.e. less bias).
In [12]:
from sklearn.model_selection import GridSearchCV

scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)

c_values = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.3, 1.5, 1.7, 2.0]
kernel_values = [ 'linear' ,  'poly' ,  'rbf' ,  'sigmoid' ]
param_grid = dict(C=c_values, kernel=kernel_values)

model_SVM = SVC()
# random_state only applies when shuffle=True; recent scikit-learn versions raise an error otherwise
kfold = KFold(n_splits=10)
grid = GridSearchCV(estimator=model_SVM, param_grid=param_grid, cv=kfold, scoring='accuracy')
grid_result = grid.fit(rescaledX, y_train.ravel())
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))


means = grid_result.cv_results_[ 'mean_test_score' ]
params = grid_result.cv_results_[ 'params' ]

for mean, param in zip(means, params):
    print("%f with: %r" % (mean, param))
Best: 0.921230 using {'C': 2.0, 'kernel': 'poly'}
0.856714 with: {'C': 0.1, 'kernel': 'linear'}
0.878470 with: {'C': 0.1, 'kernel': 'poly'}
0.856714 with: {'C': 0.1, 'kernel': 'rbf'}
0.856714 with: {'C': 0.1, 'kernel': 'sigmoid'}
0.856714 with: {'C': 0.3, 'kernel': 'linear'}
0.900975 with: {'C': 0.3, 'kernel': 'poly'}
0.888222 with: {'C': 0.3, 'kernel': 'rbf'}
0.844336 with: {'C': 0.3, 'kernel': 'sigmoid'}
0.856714 with: {'C': 0.5, 'kernel': 'linear'}
0.909227 with: {'C': 0.5, 'kernel': 'poly'}
0.906977 with: {'C': 0.5, 'kernel': 'rbf'}
0.825581 with: {'C': 0.5, 'kernel': 'sigmoid'}
0.856714 with: {'C': 0.7, 'kernel': 'linear'}
0.908102 with: {'C': 0.7, 'kernel': 'poly'}
0.912603 with: {'C': 0.7, 'kernel': 'rbf'}
0.816954 with: {'C': 0.7, 'kernel': 'sigmoid'}
0.856714 with: {'C': 0.9, 'kernel': 'linear'}
0.912603 with: {'C': 0.9, 'kernel': 'poly'}
0.915604 with: {'C': 0.9, 'kernel': 'rbf'}
0.807577 with: {'C': 0.9, 'kernel': 'sigmoid'}
0.856714 with: {'C': 1.0, 'kernel': 'linear'}
0.912978 with: {'C': 1.0, 'kernel': 'poly'}
0.916729 with: {'C': 1.0, 'kernel': 'rbf'}
0.805326 with: {'C': 1.0, 'kernel': 'sigmoid'}
0.856714 with: {'C': 1.3, 'kernel': 'linear'}
0.914479 with: {'C': 1.3, 'kernel': 'poly'}
0.918230 with: {'C': 1.3, 'kernel': 'rbf'}
0.800450 with: {'C': 1.3, 'kernel': 'sigmoid'}
0.856714 with: {'C': 1.5, 'kernel': 'linear'}
0.917104 with: {'C': 1.5, 'kernel': 'poly'}
0.917854 with: {'C': 1.5, 'kernel': 'rbf'}
0.793698 with: {'C': 1.5, 'kernel': 'sigmoid'}
0.856714 with: {'C': 1.7, 'kernel': 'linear'}
0.918980 with: {'C': 1.7, 'kernel': 'poly'}
0.920105 with: {'C': 1.7, 'kernel': 'rbf'}
0.792573 with: {'C': 1.7, 'kernel': 'sigmoid'}
0.856714 with: {'C': 2.0, 'kernel': 'linear'}
0.921230 with: {'C': 2.0, 'kernel': 'poly'}
0.920105 with: {'C': 2.0, 'kernel': 'rbf'}
0.785821 with: {'C': 2.0, 'kernel': 'sigmoid'}

We can see that the most accurate configuration was SVM with a poly kernel and a C value of 2.0.

The accuracy is now 92.1%, slightly better than the 91.7% achieved earlier with the default SVM parameters.
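
The search above standardizes the whole training set once before cross validation. As a hedged alternative, the scaler can be placed inside a pipeline so that it is refitted on each training fold, in line with the data leakage discussion earlier. The sketch below uses a synthetic dataset and a reduced grid purely for illustration; `X_demo`, `y_demo` and the step names are assumptions, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=7)

# Scaler and classifier in one estimator: the scaler is refitted on each training fold
pipe = Pipeline([("Scaler", StandardScaler()), ("SVM", SVC())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "SVM__C": [0.5, 1.0, 2.0],
    "SVM__kernel": ["linear", "rbf"],
}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=KFold(n_splits=5), scoring="accuracy")
grid.fit(X_demo, y_demo)
print(grid.best_params_, grid.best_score_)
```

With the real data, `X_train` and `y_train.ravel()` would take the place of the synthetic arrays; the scores may differ slightly from those reported above.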

Another way that we can improve the performance of algorithms on this classification problem is by using ensemble methods.

I will run 4 ensemble methods:

  • 2 Boosting Methods: AdaBoost (AB) and Gradient Boosting (GBM) and
  • 2 Bagging Methods: Random Forests (RF) and Extra Trees (ET).

I will still use the 10-fold cross validation.

Please note that I will not perform data standardization here: all four ensemble algorithms are based on decision trees that are less sensitive to data distributions.
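
To illustrate why standardization matters little for tree-based models, the sketch below trains the same decision tree on raw and on standardized versions of a synthetic dataset and compares the predictions. Tree splits depend on the ordering of feature values, which standardization does not change. The data and variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=7)

# Tree trained on the raw features
raw_tree = DecisionTreeClassifier(random_state=7).fit(X_tr, y_tr)
raw_pred = raw_tree.predict(X_te)

# Same tree settings, trained on standardized features
demo_scaler = StandardScaler().fit(X_tr)
scaled_tree = DecisionTreeClassifier(random_state=7).fit(demo_scaler.transform(X_tr), y_tr)
scaled_pred = scaled_tree.predict(demo_scaler.transform(X_te))

# Fraction of test rows on which the two trees make the same prediction
agreement = (raw_pred == scaled_pred).mean()
print(agreement)
```

The two trees are expected to agree on essentially all rows; small differences could only come from floating-point rounding.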

In [13]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

# ensembles
ensembles = []
ensembles.append(( 'AB' , AdaBoostClassifier()))
ensembles.append(( 'GBM' , GradientBoostingClassifier()))
ensembles.append(( 'RF' , RandomForestClassifier()))
ensembles.append(( 'ET' , ExtraTreesClassifier()))
results = []
names = []

for name, model in ensembles:
    # random_state only applies when shuffle=True; recent scikit-learn versions raise an error otherwise
    kfold = KFold(n_splits=10)
    cv_results = cross_val_score(model, X_train, y_train.ravel(), cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f " % (name, cv_results.mean())
    print(msg)
AB: 0.880336 
GBM: 0.949732 
RF: 0.945607 
ET: 0.922717 

We can see that boosting and bagging techniques provide very strong accuracy scores with the default configurations.

The results suggest that Gradient Boosting (GBM) may be worthy of further study, with the highest accuracy achieved so far: 94.9%.

Tuning GBM

GBM has a few parameters that can dramatically affect the accuracy of the model.

I will focus on two of them:

  • learning rate and
  • number of estimators.

Learning rate: In general, a small learning rate (combined with a large number of estimators) will yield more accurate GBM models, though the model will take longer to train because it needs more boosting iterations.

Number of estimators: The greater the number of estimators, the greater the risk of overfitting, that is, accurate predictions on the training data but inaccurate predictions on new data. Typical values range from 100 to 1000. I will set the number of estimators to 1000.
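
One way to see how accuracy evolves with the number of estimators is to score the model after every boosting stage. The sketch below does this with `staged_predict` on a synthetic dataset and a smaller model (200 estimators) to keep it quick; the data, split and variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=7)

gbm = GradientBoostingClassifier(learning_rate=0.05, n_estimators=200, random_state=7)
gbm.fit(X_tr, y_tr)

# staged_predict yields the predictions after 1, 2, ..., n_estimators boosting stages
test_accuracy = [accuracy_score(y_te, pred) for pred in gbm.staged_predict(X_te)]

# Stage (1-based) with the highest held-out accuracy
best_stage = max(range(len(test_accuracy)), key=test_accuracy.__getitem__) + 1
print("stages fitted:", len(test_accuracy))
print("best held-out accuracy %.3f at %d estimators" % (max(test_accuracy), best_stage))
```

If the held-out accuracy peaks well before the last stage, adding more estimators is not helping. In practice this check should use a separate split or cross validation rather than the final hold-out set.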

In [14]:
model_gbm = GradientBoostingClassifier(learning_rate=0.05, n_estimators=1000)

cv_results = cross_val_score(model_gbm, X_train, y_train.ravel(), cv=kfold, scoring='accuracy')

print(cv_results.mean())
0.949731069246

The tuned GBM (learning_rate=0.05, n_estimators=1000) reaches a cross-validated accuracy of 94.97%, matching the default configuration and remaining the most accurate model so far.

I will now finalize the model by training it on the entire training dataset and making predictions for the hold-out validation dataset to confirm our findings.

In [15]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report


# Use the model GBM as before, with the tuning parameters
model_gbm.fit(X_train, y_train.ravel())

# estimate the accuracy on validation dataset
predictions = model_gbm.predict(X_validation)

print(accuracy_score(y_validation, predictions))
print(confusion_matrix(y_validation, predictions))
print(classification_report(y_validation, predictions))
0.947526236882
[[558   8]
 [ 27  74]]
             precision    recall  f1-score   support

        0.0       0.95      0.99      0.97       566
        1.0       0.90      0.73      0.81       101

avg / total       0.95      0.95      0.95       667

We can see that we achieve an accuracy of nearly 95% (94.75%) on the held-out validation dataset.

This closely matches the cross-validated estimate obtained earlier during the tuning of GBM.
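
The figures in the classification report can be recomputed directly from the confusion matrix, which helps in reading it. A short sketch for the churn class (1), using the four counts printed above:

```python
# Confusion matrix reported above: rows are true classes, columns are predicted classes
tn, fp = 558, 8     # true class 0: predicted 0, predicted 1
fn, tp = 27, 74     # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_churn = tp / (tp + fp)   # of the predicted churners, how many actually churned
recall_churn = tp / (tp + fn)      # of the actual churners, how many were caught
f1_churn = 2 * precision_churn * recall_churn / (precision_churn + recall_churn)

print("accuracy:  %.4f" % accuracy)         # 0.9475
print("precision: %.2f" % precision_churn)  # 0.90
print("recall:    %.2f" % recall_churn)     # 0.73
print("f1-score:  %.2f" % f1_churn)         # 0.81
```

The recall of 0.73 means the model misses 27 of the 101 actual churners, something the headline accuracy alone does not reveal.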

In this report we worked through a classification predictive modeling problem using Python.

The steps covered were:

  • Problem Definition: Predicting whether a customer would churn or not.
  • Analyze Data.
  • Evaluate Algorithms (the two best being KNN with 87% accuracy and Decision Tree Classifier with 90% accuracy).
  • Evaluate Algorithms with Standardization (in which SVM had the highest accuracy: 91.7%).
  • Algorithm Tuning (SVM with a poly kernel and a C value of 2.0 was the best, with 92.1% accuracy).
  • Ensemble Methods (Gradient Boosting (GBM) with 94.9%).
  • Finalize Model (use all training data and confirm using the validation dataset).

This report shows all the steps involved in applying machine learning to a classification problem using Python.