Train/Test Split and Cross Validation Tutorial¶

Francisca Dias

I wrote this paper to show how cross-validation gives a more reliable measure of model quality, which is important when making modelling decisions.

I will explain the concepts of Train/Test Split and Cross Validation in detail.

One of the biggest problems in data science is overfitting, which affects both classification and regression problems.

So our task here is to try to minimize overfitting.

In data science, we split our data into two subsets: training data and testing data, or even three (train, validate and test). We then fit our model on the train data and move on to the predictions on the test data.

When we follow this procedure, one of the following two things can happen:

We overfit our model or
We underfit our model

When either of these happens, the predictive power of our model suffers and we cannot generalize our predictions to new, unseen data.

Overfitting means that our model fits too closely to the training dataset: its predictions are very accurate on the training data, but not on new, unseen data. Because our goal is to make predictions on unseen data, an overfitted model is of little use to us.

Underfitting means that the model does not fit the training dataset well, so its predictions are not accurate even on the training data. It usually happens when the dataset does not have enough independent variables or when we fit a linear model to data that is not linear. Because such a model makes poor predictions, it cannot be generalized to unseen data either.
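To make the two failure modes concrete, here is a minimal sketch on synthetic data (not the churn data; the data set, the choice of decision trees and all variable names are assumptions made for illustration). An unpruned tree stands in for an overfitted model and a depth-1 tree for an underfitted one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary data (illustrative only, not the churn data set).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# An unpruned tree can memorise the training rows, noise included (overfitting).
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
deep_train_acc = deep_tree.score(X_tr, y_tr)
deep_test_acc = deep_tree.score(X_te, y_te)

# A depth-1 tree is too simple to fit even the training data well (underfitting).
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)
stump_train_acc = stump.score(X_tr, y_tr)
stump_test_acc = stump.score(X_te, y_te)

print("deep tree  train/test accuracy:", deep_train_acc, deep_test_acc)
print("depth-1    train/test accuracy:", stump_train_acc, stump_test_acc)
```

A large gap between training and test accuracy is the usual symptom of overfitting, while low accuracy on both is the usual symptom of underfitting.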

The data set I will be using is a telecom customer data set.

Because our goal is to predict whether a customer will churn or not, this is a classification problem.

Each row represents a subscribing telephone customer.

Each column contains customer attributes such as phone number, call minutes used during different times of day, charges incurred for services, lifetime account duration, and whether or not the customer is still a customer (churn column).

This dataset can be found here.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
%matplotlib inline

churn_df = pd.read_csv('churn.csv')

col_names = churn_df.columns.tolist()
len(col_names)

21

to_show = col_names[:6] + col_names[-6:]
len(to_show)

12

Isolate target data¶

churn_result = churn_df['Churn?']

y = np.where(churn_result == 'True.',1,0)

We don't need these columns¶

to_drop = ['State','Area Code','Phone','Churn?']
churn_feat_space = churn_df.drop(to_drop,axis=1)

'yes'/'no' has to be converted to boolean values¶

yes_no_cols = ["Int'l Plan","VMail Plan"]
churn_feat_space[yes_no_cols] = churn_feat_space[yes_no_cols] == 'yes'

features = churn_feat_space.columns
print (features)

Index(['Account Length', 'Int'l Plan', 'VMail Plan', 'VMail Message',
       'Day Mins', 'Day Calls', 'Day Charge', 'Eve Mins', 'Eve Calls',
       'Eve Charge', 'Night Mins', 'Night Calls', 'Night Charge', 'Intl Mins',
       'Intl Calls', 'Intl Charge', 'CustServ Calls'],
      dtype='object')

Matrix X¶

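# as_matrix() and np.float were removed from recent pandas/NumPy releases;
# to_numpy() and the built-in float are the current equivalents.
X = churn_feat_space.to_numpy().astype(float)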

StandardScaler¶

scaler = StandardScaler()
X = scaler.fit_transform(X)

print ("Feature space holds %d observations and %d features" % X.shape)
print ("Unique target labels:", np.unique(y))

Feature space holds 3333 observations and 17 features
Unique target labels: [0 1]
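As a rough illustration of what the scaling step does, the sketch below standardises a toy matrix by hand with NumPy. The values and variable names are made up for the example. StandardScaler applies the same per-column formula, z = (x - mean) / std, using the population standard deviation.

```python
import numpy as np

# Toy matrix: two features on very different scales (hypothetical values).
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# Subtract each column's mean and divide by each column's standard deviation
# (np.std defaults to the population standard deviation, ddof=0).
X_toy_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

col_means = X_toy_scaled.mean(axis=0)
col_stds = X_toy_scaled.std(axis=0)
print(X_toy_scaled)
print(col_means, col_stds)
```

After scaling, every column has mean 0 and standard deviation 1, so no feature dominates simply because of its units.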

len(y)

3333

Train test Split

The data we use to make predictions is usually split into training data and test data.

The training set contains the features (X_train) together with their known output (y_train, churn/not churn); the model learns from this data so that it can be generalized to other data later on.

We hold out the test set (X_test, y_test) in order to test our model's predictions on this subset.

We'll do this using the train_test_split function.

from sklearn.model_selection import train_test_split

The test_size=0.2 indicates that we are splitting the data by 80/20.

You can split the data in other percentages, such as 70/30, so in this case you would have test_size=0.3.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)

(2666, 17) (2666,)
(667, 17) (667,)

Remember from above, our initial dataset has 3333 observations.

By splitting the data by 80/20, we get 2666 observations for the train data, and 667 observations for the test data.

In X, our independent variable matrix, we still have 17 features or predictors.

In y, our dependent variable, the one we are trying to predict, we only have one column, which is the output, churn/not churn. This is a vector, and not a matrix, since it only contains the target values from our initial dataset.
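The split above changes every time it is run. A minimal sketch of a reproducible variant follows, using stand-in arrays with the same shape as the churn data (3333 rows, 17 features, 483 churners). The random features, the seed value and the use of stratify are assumptions for illustration, not something the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as the churn data (random features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(3333, 17))
y_demo = np.zeros(3333, dtype=int)
y_demo[:483] = 1  # 483 churners, as in the data set

# random_state fixes the shuffle so the split is repeatable;
# stratify keeps the churn rate about the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, y_tr.shape)
print(X_te.shape, y_te.shape)
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

The shapes match the 2666/667 split shown above, and the churn rate in each subset stays close to the overall rate.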

Model Selection¶

Here I will use the predictions from just one model, since the goal here is to show the difference between train/test split and cross validation results, and not the different results given by different models.

I will choose Random Forest for this analysis.

Other models could also be used, such as K Neighbors, Logistic Regression, Gradient Boosting or Support Vector Machine.

The process is as follows:

Fit the model on the training data.

After fitting the model we will try to predict on the test data.

We will print these predictions for Random Forest Model.

Print the accuracy score for that model.

Random Forest

from sklearn.ensemble import RandomForestClassifier as RF
rf = RF()
rf_model = rf.fit(X_train, y_train)
predictions_rf = rf.predict(X_test)
print(predictions_rf[0:5])
print(rf_model.score(X_test, y_test))

[1 0 0 0 0]
0.943028485757

Classification accuracy: percentage of correct predictions

In this case, Classification accuracy is 94.3%
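Accuracy is simple enough to compute by hand. The sketch below uses a hypothetical helper, accuracy, on toy labels. The count of 629 correct predictions is inferred from the printed score and the 667 test rows; the notebook does not report it.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Toy example: 4 of the 5 predictions are correct.
toy_true = [1, 0, 0, 0, 1]
toy_pred = [1, 0, 0, 1, 1]
toy_acc = accuracy(toy_true, toy_pred)
print(toy_acc)

# 629 correct predictions out of 667 test rows would give the score above.
implied_acc = 629 / 667
print(implied_acc)
```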

So far here's what I did:

I split our initial dataset into Train and Test datasets;

Fitted the model to the training data;

Made predictions on the test data (only showing the first five predictions);

And scored those predictions against the true test labels (the score result, which is 0.943028485757);

But here is the problem of train/test split:

We don't control how the data is split; we just know that we have two datasets (train and test), and the score depends on which rows happen to land in each. For instance it can happen (not likely, but still) that our training data contains all customers that churn (483 customers). These customers have certain features/characteristics that the customers who didn't churn don't have. The test data would then contain no churners at all, so the score would not tell us how well our analysis generalizes to unseen data.
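The dependence on the split can be seen by repeating the split with different seeds. This is a minimal sketch on synthetic, imbalanced data; the data set, the seeds and the forest size are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (illustrative only, not the churn data set).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=17, weights=[0.85, 0.15], random_state=0
)

split_scores = []
for seed in range(5):
    # Only the split changes between iterations; the model settings stay fixed.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))

print(split_scores)
print("spread:", max(split_scores) - min(split_scores))
```

Any difference between the five scores comes purely from which rows ended up in the test set.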

This is where cross validation comes in.

2 methods on Cross Validation

The cross validation method follows the same logic as the train/test split method: we still need to split the data. But in cross validation we split the data into k folds.

This means that we split the data into k subsets, train on k-1 of them, and hold out the remaining subset for testing.

There are many cross validation methods, but I will go over two of them:

K-Folds Cross Validation

Leave One Out Cross Validation (LOOCV)

K-Folds Cross Validation

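Here we split the data into k different subsets and use k-1 subsets to train our model, leaving the remaining subset as test data. We repeat this k times, so that each subset serves as the test data exactly once.

We then average the k scores to estimate the quality of the model. If a separate test set was held back, the final model can be tested against it at the end.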
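The rotation of folds described above can be sketched in plain Python. The helper kfold_indices is hypothetical and written only for this illustration. It does not shuffle, unlike the KFold configuration used in this notebook.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

With 10 samples and k=3, the test folds have sizes 4, 3 and 3, and together they cover every sample exactly once.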

from sklearn.model_selection import KFold

make sure both X and y are arrays

print(type(X))
print(type(y))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>

kf = KFold(n_splits=3,shuffle=True)

What we have done here:

import the necessary library: from sklearn.model_selection import KFold
make sure both X and y are arrays
configure a splitter that will divide the data into 3 shuffled subsets (n_splits=3, shuffle=True)
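A KFold object only describes how to split; nothing is split until it is applied to data. The sketch below shows one way such a splitter might be used, on small random data with hypothetical names (X_demo, kf_demo). Note that the scoring step later in this notebook passes cv=6 rather than a splitter object.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Small random data set (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(90, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

kf_demo = KFold(n_splits=3, shuffle=True, random_state=0)

# split() yields one pair of index arrays (train rows, test rows) per fold.
fold_shapes = [(len(train_idx), len(test_idx))
               for train_idx, test_idx in kf_demo.split(X_demo)]
print(fold_shapes)

# The same splitter can be handed to cross_val_score through the cv argument.
scores_demo = cross_val_score(
    RandomForestClassifier(n_estimators=20, random_state=0),
    X_demo, y_demo, cv=kf_demo,
)
print(scores_demo)
```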

Leave One Out Cross Validation (LOOCV)

In this type of cross validation, the number of folds (subsets) equals the number of observations we have in the dataset.

Each model is trained on all observations but one and tested on the single observation that was left out. We then average the results over all of these folds.

Just a side note: this method should be used on small datasets because it is very computationally expensive, since we have to fit as many models as there are observations.

So I will introduce this method here (as below), but I will focus on k-fold only.

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
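To see why LOOCV gets expensive, the sketch below applies LeaveOneOut to a toy array of five rows. The array and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Five observations with two features each (toy values).
X_small = np.arange(10).reshape(5, 2)

loo_demo = LeaveOneOut()

# One split, and therefore one model fit, per observation.
n_fits = loo_demo.get_n_splits(X_small)
print("number of fits:", n_fits)

test_sizes = []
for train_idx, test_idx in loo_demo.split(X_small):
    print("train:", train_idx, "test:", test_idx)
    test_sizes.append(len(test_idx))
```

On the churn data this would mean 3333 separate model fits, which is why the notebook sticks to k-fold.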

What method should we use and how many folds?

Here is the logic behind the number of folds to use (think in terms of error due to variance vs. error due to bias):

The more folds we use, the more data each model is trained on, so we reduce the error due to bias. BUT we increase the error due to variance;
The fewer folds we use, the more we reduce the error due to variance. BUT we increase the error due to bias;

Therefore, in big datasets, a small number of folds such as k=3 is usually advised.

In smaller datasets we can use LOOCV.
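The trade-off is easier to see with numbers. For the 3333 rows of this data set, the sketch below computes roughly how many rows each model is trained on and how many models must be fitted for a few values of k. The chosen k values and the variable names are illustrative.

```python
n_samples = 3333  # rows in the churn data set

# For each k: rows available for training when one fold of about n/k rows
# is held out, and the number of models that must be fitted (one per fold).
train_sizes = {}
for k in (3, 6, 10, n_samples):  # k == n_samples corresponds to LOOCV
    train_sizes[k] = n_samples - n_samples // k
    print(f"k={k:>4}: about {train_sizes[k]} training rows per fit, {k} fits")
```

More folds mean each model sees more of the data, but also that many more models have to be fitted.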

from sklearn.model_selection import cross_val_score, cross_val_predict

What is the score after cross validation

K can be any number from 2 up to the number of observations, but I will use K=6.

K=10 is a commonly recommended default (it has often been found in practice to give a good out-of-sample estimate).

For classification problems, stratified sampling is recommended for creating the folds. For example, if the dataset has 2 response classes (churn/not churn):

Churn/Not Churn
20% of observations = Churn
Each cross-validation fold should then consist of approximately 20% Churn

scikit-learn's cross_val_score function does this by default when the estimator is a classifier and cv is an integer
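Stratification can be checked directly with StratifiedKFold, the splitter scikit-learn uses for classifiers when cv is an integer. The sketch below uses a toy label vector with 20% churn; the data and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 20 churners and 80 non-churners (20% churn).
y_demo = np.array([1] * 20 + [0] * 80)
X_demo = np.zeros((100, 1))  # the features are irrelevant to the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Churn rate inside each test fold.
fold_churn_rates = [float(y_demo[test_idx].mean())
                    for _, test_idx in skf.split(X_demo, y_demo)]
print(fold_churn_rates)
```

Each fold keeps the same share of churners as the full data set. When the class counts do not divide evenly by the number of folds, the shares are only approximately equal.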

scores_rf = cross_val_score(rf_model, X, y, cv=6)
print('Random Forest scores:', scores_rf)

Random Forest scores: [ 0.94604317  0.93705036  0.94244604  0.95675676  0.94414414  0.95495495]

In the first iteration, the accuracy is 0.94604317

Second iteration, the accuracy is 0.93705036

Final iteration, the accuracy is 0.95495495

print(scores_rf.mean())

0.946899237367

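Remember that in our first prediction on train/test the score was 0.943!

After cross validation the average score is 0.947, slightly higher than the initial score. More importantly, it is an average over six different splits, so it depends much less on how one particular split happened to fall.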
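The fold scores printed above also say how much a score moves from one split to another. The sketch below copies those six scores and the single-split score from the outputs above and compares them; the variable names are hypothetical.

```python
import numpy as np

# The six fold scores and the single train/test score printed above.
cv_scores = np.array([0.94604317, 0.93705036, 0.94244604,
                      0.95675676, 0.94414414, 0.95495495])
single_split_score = 0.943028485757

cv_mean = cv_scores.mean()
cv_std = cv_scores.std()
print("mean:", cv_mean, "std:", cv_std)

# Is the gap between the two estimates smaller than the fold-to-fold spread?
gap_within_spread = abs(single_split_score - cv_mean) < cv_std
print(gap_within_spread)
```

The gap between 0.943 and 0.947 is smaller than the spread between folds, so it is better read as the natural variability of a single split than as an improvement in the model.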

# Out-of-fold predictions: each row is predicted by a model that did not train on it
cross_val_predict_rf = cross_val_predict(rf_model, X, y, cv=6)
print('Random Forest predictions', cross_val_predict_rf)

Random Forest predictions [0 0 0 ..., 0 0 0]
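One possible use of these out-of-fold predictions is a confusion matrix, which shows how the errors are distributed between the two classes. This is a minimal sketch on synthetic data; the data set, the forest size and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic imbalanced data (illustrative only, not the churn data set).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.85, 0.15], random_state=0
)

# One out-of-fold prediction per row, from six folds.
oof_pred = cross_val_predict(
    RandomForestClassifier(n_estimators=30, random_state=0),
    X_demo, y_demo, cv=6,
)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_demo, oof_pred)
print(cm)
```

With imbalanced classes such as churn, the confusion matrix can reveal weak performance on the minority class that a high accuracy score hides.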