Exploratory data analysis and predicting Employee turnover

Francisca Dias

Table of Contents

In this report I will perform data analysis on the dataset "Human Resources Analytics" and I will apply the Random Forest Model to predict which employees will leave prematurely.

A common problem across companies is employee turnover. Companies often make an investment by training new employees, so every time an employee leaves the company, it represents a lost investment.

Therefore it would be great to have a model that predicts when an employee is likely to leave and offer them incentives to stay. This would impact the company business by saving time and money in training new employees.

Therefore how can we predict when an employee will leave?

I will first go through all features represented in this dataset, with focus on the cluster graph that plots all observations according to satisfaction and evaluation. This is a way of predicting employee turnover without the need to build a model. I will later make predictions by using one of the most popular and easy to understand algorithms, the random forest.

This dataset can be found here.

It consists of 14.999 observations, and for each observation there is information (both quantitative and qualitative) on that employee. We also have information in which we know if the employee left the company or not.

Let's see what is the information we have available:

  • Employee satisfaction level: ranges from 0 to 100
  • Last evaluation: it also goes from 0 to 100
  • Number of projects: The number of projects taken by each employee, from 2 to 7 projects
  • Average monthly hours
  • Time spent at the company: in years
  • Whether they had a work accident: 1 if yes, 0 if no
  • Whether they had a promotion in the last 5 years: 1 if yes, 0 if no
  • Sales: It says the department for each employee works: sales, technical, support, IT, product management, marketing, RandD, accounting, HR and management
  • Salary: It can be Low, Medium and High
  • Whether the employee has left: 1 if yes, 0 if no
In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import display, HTML
In [2]:
churn_df = pd.read_csv('hr_dataset.csv')
In [3]:
churn_df.describe()
Out[3]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean 0.612834 0.716102 3.803054 201.050337 3.498233 0.144610 0.238083 0.021268
std 0.248631 0.171169 1.232592 49.943099 1.460136 0.351719 0.425924 0.144281
min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000 0.000000 0.000000
25% 0.440000 0.560000 3.000000 156.000000 3.000000 0.000000 0.000000 0.000000
50% 0.640000 0.720000 4.000000 200.000000 3.000000 0.000000 0.000000 0.000000
75% 0.820000 0.870000 5.000000 245.000000 4.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000 1.000000 1.000000
In [70]:
from IPython.display import IFrame
IFrame('https://plot.ly/~FranciscaDias/73/', width=900, height=550)
Out[70]:
In [71]:
IFrame('https://plot.ly/~FranciscaDias/67/', width=900, height=550)
Out[71]:
  • Satisfaction level goes from 0 to 100
  • The average satisfaction level is 0.61
  • 30% of employees are within the range 0.7-0.89
  • if we consider that employees are unhappy below level 50, then 30% of employees are unhappy
In [6]:
churn_df.satisfaction_level.describe()
Out[6]:
count    14999.000000
mean         0.612834
std          0.248631
min          0.090000
25%          0.440000
50%          0.640000
75%          0.820000
max          1.000000
Name: satisfaction_level, dtype: float64
In [7]:
col_to_use =['satisfaction_level', 'left']

satisfaction_level_df = churn_df[col_to_use]
In [8]:
# exports data to csv
# satisfaction_level_df.to_csv('satisfaction_level_df')
  • This metric has the same range as Satisfaction level : goes from 0 to 100
  • The average is 0.71
  • It is mildly positive correlated with the number of projects and the average monthly hours, which suggests that employers who take on more projects and work longer hours tend to have better evaluations
In [9]:
churn_df.last_evaluation.describe()
Out[9]:
count    14999.000000
mean         0.716102
std          0.171169
min          0.360000
25%          0.560000
50%          0.720000
75%          0.870000
max          1.000000
Name: last_evaluation, dtype: float64
In [72]:
IFrame('https://plot.ly/~FranciscaDias/69/', width=900, height=550)
Out[72]:
  • On average an employer take on 3.8 projects
  • 29% of employees take 4 projects
In [11]:
churn_df.number_project.describe()
Out[11]:
count    14999.000000
mean         3.803054
std          1.232592
min          2.000000
25%          3.000000
50%          4.000000
75%          5.000000
max          7.000000
Name: number_project, dtype: float64
In [12]:
churn_df.number_project.value_counts()
Out[12]:
4    4365
3    4055
5    2761
2    2388
6    1174
7     256
Name: number_project, dtype: int64
  • On average each employee works 201 hours a month
  • The maximum monthly work hours in this dataset is 310 hours and this corresponds to 18 employees
  • Interesting to note that all these 18 employees left the company: 9 had low salary and 9 had medium salary. None of them has had a promotion in the last 5 years.
  • Their satisfaction level was very low, although their evaluation was very high (0.83); They were doing projects above average (6.2 projects) and their time spent at the company was on average 4 years
  • One third of these 18 employees were working for HR
In [13]:
churn_df.average_montly_hours.describe()
Out[13]:
count    14999.000000
mean       201.050337
std         49.943099
min         96.000000
25%        156.000000
50%        200.000000
75%        245.000000
max        310.000000
Name: average_montly_hours, dtype: float64
In [14]:
average_montly_hours_310 = churn_df[churn_df['average_montly_hours']==310]
In [15]:
average_montly_hours_310
Out[15]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
566 0.11 0.79 7 310 4 0 1 0 hr low
803 0.10 0.77 6 310 4 0 1 0 technical medium
809 0.11 0.97 7 310 4 0 1 0 support medium
1059 0.70 0.93 2 310 3 0 1 0 product_mng low
1228 0.10 0.85 7 310 5 0 1 0 marketing low
1482 0.09 0.77 6 310 4 0 1 0 hr low
1545 0.10 0.78 6 310 4 0 1 0 sales medium
1578 0.10 0.96 6 310 5 0 1 0 support low
1596 0.36 0.48 5 310 3 0 1 0 product_mng medium
1717 0.11 0.96 6 310 4 0 1 0 technical low
1936 0.11 0.97 6 310 4 0 1 0 accounting medium
1939 0.10 0.79 7 310 4 0 1 0 hr medium
1963 0.09 0.89 7 310 4 0 1 0 support low
1985 0.11 0.84 7 310 4 0 1 0 sales medium
12566 0.11 0.79 7 310 4 0 1 0 hr low
14777 0.11 0.79 7 310 4 0 1 0 hr low
14972 0.11 0.97 6 310 4 0 1 0 accounting medium
14975 0.10 0.79 7 310 4 0 1 0 hr medium
In [16]:
average_montly_hours_310.describe()
Out[16]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
count 18.000000 18.000000 18.000000 18.0 18.000000 18.0 18.0 18.0
mean 0.151111 0.838333 6.222222 310.0 4.000000 0.0 1.0 0.0
std 0.149819 0.120208 1.215370 0.0 0.485071 0.0 0.0 0.0
min 0.090000 0.480000 2.000000 310.0 3.000000 0.0 1.0 0.0
25% 0.100000 0.790000 6.000000 310.0 4.000000 0.0 1.0 0.0
50% 0.110000 0.815000 6.500000 310.0 4.000000 0.0 1.0 0.0
75% 0.110000 0.952500 7.000000 310.0 4.000000 0.0 1.0 0.0
max 0.700000 0.970000 7.000000 310.0 5.000000 0.0 1.0 0.0
In [17]:
average_montly_hours_310.sales.value_counts()
Out[17]:
hr             6
support        3
technical      2
accounting     2
product_mng    2
sales          2
marketing      1
Name: sales, dtype: int64
In [18]:
col_to_use =['average_montly_hours', 'satisfaction_level']

average_monthly_hours_df = churn_df[col_to_use]
In [19]:
# exports data to csv
# average_monthly_hours_df.to_csv('average_monthly_hours_df.csv')
  • On average employees have been in the company for 3.5 years
  • 65% have been for 2 and 3 years

Employee Profile of who stayed longer than 10 years

  • There are 214 employees in this dataset that stayed longer than 10 years
  • Their satisfaction level is a little above average - 0.65
  • Also their satisfaction level is a bit higher than average - 0.73
  • Both the number of projects and monthly working hours taken by these employees is lower than average
  • 29% work for management and 27% work in sales
In [20]:
churn_df.time_spend_company.describe()
Out[20]:
count    14999.000000
mean         3.498233
std          1.460136
min          2.000000
25%          3.000000
50%          3.000000
75%          4.000000
max         10.000000
Name: time_spend_company, dtype: float64
In [21]:
churn_df.time_spend_company.value_counts()
Out[21]:
3     6443
2     3244
4     2557
5     1473
6      718
10     214
7      188
8      162
Name: time_spend_company, dtype: int64
In [22]:
time_spent_10_years = churn_df[churn_df['time_spend_company'] == 10]
In [23]:
time_spent_10_years.head()
Out[23]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
11078 0.69 0.88 3 164 10 0 0 0 management medium
11124 0.29 0.75 6 271 10 0 0 0 sales medium
11193 0.75 0.60 4 186 10 1 0 0 marketing low
11194 0.61 0.89 3 242 10 0 0 0 sales high
11195 0.47 0.79 3 284 10 0 0 0 sales low
In [24]:
time_spent_10_years.describe()
Out[24]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
count 214.000000 214.000000 214.000000 214.000000 214.0 214.000000 214.0 214.000000
mean 0.655327 0.731495 3.682243 199.224299 10.0 0.233645 0.0 0.074766
std 0.217827 0.157521 0.925366 47.458073 0.0 0.424141 0.0 0.263631
min 0.140000 0.370000 2.000000 108.000000 10.0 0.000000 0.0 0.000000
25% 0.530000 0.617500 3.000000 154.000000 10.0 0.000000 0.0 0.000000
50% 0.640000 0.730000 4.000000 199.000000 10.0 0.000000 0.0 0.000000
75% 0.832500 0.867500 4.000000 244.250000 10.0 0.000000 0.0 0.000000
max 0.990000 1.000000 6.000000 284.000000 10.0 1.000000 0.0 1.000000
In [25]:
time_spent_10_years.sales.value_counts()
Out[25]:
management     62
sales          58
support        26
technical      20
IT             16
marketing      12
accounting     10
product_mng    10
Name: sales, dtype: int64
In [26]:
churn_df.Work_accident.value_counts()
Out[26]:
0    12830
1     2169
Name: Work_accident, dtype: int64
  • As one would expect, the satisfacion level on average for the employees who left the company is way below the average, at 0.44.
  • Only 0.5% of those employees has had a promotion in the last 5 years.
  • 28% of them were working for sales, so turnover seems to be higher for sales people.

In this dataset, 24% of employees have left the company. Let's see if we can find some pattern within these type of employees.

  • If we plot both the level of satisfaction and last evaluation against employees who both left and stayed in the company, we can see that there are 3 clusters, that is, 3 distinct groups:
  • First Cluster: their level of satisfaction is in the lowest range, although their evaluation is very high;
  • Second Cluster : both satisfaction and evaluation is within the range 0.4 and 0.5
  • Third Cluster: this cluster is a little more spread compared to the other two clusters, but we can find a pattern: their last evaluation is high, above 0.8, and their satisfaction level goes from 0.7 to 0.9
In [74]:
IFrame('https://plot.ly/~FranciscaDias/65/', width=900, height=550)
Out[74]:
In [28]:
churn_df.left.value_counts()
Out[28]:
0    11428
1     3571
Name: left, dtype: int64
In [29]:
who_left_the_company = churn_df[churn_df['left'] ==1]
In [30]:
who_left_the_company.head()
Out[30]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low
In [31]:
who_left_the_company.describe()
Out[31]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
count 3571.000000 3571.000000 3571.000000 3571.000000 3571.000000 3571.000000 3571.0 3571.000000
mean 0.440098 0.718113 3.855503 207.419210 3.876505 0.047326 1.0 0.005321
std 0.263933 0.197673 1.818165 61.202825 0.977698 0.212364 0.0 0.072759
min 0.090000 0.450000 2.000000 126.000000 2.000000 0.000000 1.0 0.000000
25% 0.130000 0.520000 2.000000 146.000000 3.000000 0.000000 1.0 0.000000
50% 0.410000 0.790000 4.000000 224.000000 4.000000 0.000000 1.0 0.000000
75% 0.730000 0.900000 6.000000 262.000000 5.000000 0.000000 1.0 0.000000
max 0.920000 1.000000 7.000000 310.000000 6.000000 1.000000 1.0 1.000000
In [32]:
who_left_the_company.sales.value_counts()
Out[32]:
sales          1014
technical       697
support         555
IT              273
hr              215
accounting      204
marketing       203
product_mng     198
RandD           121
management       91
Name: sales, dtype: int64
In [33]:
churn_df.promotion_last_5years.value_counts()
Out[33]:
0    14680
1      319
Name: promotion_last_5years, dtype: int64
In [34]:
promotion_yes_no = churn_df[churn_df['promotion_last_5years']==1]
In [35]:
promotion_yes_no.sales.value_counts()
Out[35]:
sales         100
management     69
marketing      43
technical      28
RandD          27
support        20
hr             15
accounting     14
IT              3
Name: sales, dtype: int64
In [36]:
col_to_use =['salary', 'left']

salary_df = churn_df[col_to_use]
In [37]:
# exports data to csv
# salary_df.to_csv('salary_df')
In [38]:
churn_df.dtypes
Out[38]:
satisfaction_level       float64
last_evaluation          float64
number_project             int64
average_montly_hours       int64
time_spend_company         int64
Work_accident              int64
left                       int64
promotion_last_5years      int64
sales                     object
salary                    object
dtype: object
In [39]:
# Call it dataframe (df)
df = churn_df
In [40]:
df.dtypes
Out[40]:
satisfaction_level       float64
last_evaluation          float64
number_project             int64
average_montly_hours       int64
time_spend_company         int64
Work_accident              int64
left                       int64
promotion_last_5years      int64
sales                     object
salary                    object
dtype: object
In [41]:
# Convert sales and salary to dummies
df = pd.get_dummies(df)
In [42]:
df.dtypes
Out[42]:
satisfaction_level       float64
last_evaluation          float64
number_project             int64
average_montly_hours       int64
time_spend_company         int64
Work_accident              int64
left                       int64
promotion_last_5years      int64
sales_IT                   uint8
sales_RandD                uint8
sales_accounting           uint8
sales_hr                   uint8
sales_management           uint8
sales_marketing            uint8
sales_product_mng          uint8
sales_sales                uint8
sales_support              uint8
sales_technical            uint8
salary_high                uint8
salary_low                 uint8
salary_medium              uint8
dtype: object
In [43]:
# Isolate target data
y = df['left']
In [44]:
y.value_counts()
Out[44]:
0    11428
1     3571
Name: left, dtype: int64
In [45]:
df.columns
Out[45]:
Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'sales_IT', 'sales_RandD', 'sales_accounting',
       'sales_hr', 'sales_management', 'sales_marketing', 'sales_product_mng',
       'sales_sales', 'sales_support', 'sales_technical', 'salary_high',
       'salary_low', 'salary_medium'],
      dtype='object')
In [46]:
len(df.columns)
Out[46]:
21
In [47]:
cols_to_use = ['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident',
       'promotion_last_5years', 'sales_IT', 'sales_RandD', 'sales_accounting',
       'sales_hr', 'sales_management', 'sales_marketing', 'sales_product_mng',
       'sales_sales', 'sales_support', 'sales_technical', 'salary_high',
       'salary_low', 'salary_medium']
In [48]:
X = df[cols_to_use]
In [49]:
len(X.columns)
Out[49]:
20
In [50]:
# 14999 rows × 20 columns
X.head(2)
Out[50]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident promotion_last_5years sales_IT sales_RandD sales_accounting sales_hr sales_management sales_marketing sales_product_mng sales_sales sales_support sales_technical salary_high salary_low salary_medium
0 0.38 0.53 2 157 3 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
1 0.80 0.86 5 262 6 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
In [51]:
my_model_forest = RandomForestClassifier()
In [52]:
my_model_forest.fit(X, y)
Out[52]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [53]:
my_model_forest.predict(X)
Out[53]:
array([1, 1, 1, ..., 1, 1, 1])
In [54]:
my_model_forest.score(X, y)
Out[54]:
0.99859990666044407

Our model scores almost 100%. But here is the catch: the model was tested in seen data, so we have to make sure that we predict on unseen data, which is the real case when applying this technique.

The model practical value comes from making predictions on new data

Therefore we will split the data by 75% (train) and 25% (test) by using the function train_test_split.

In [55]:
train_X, test_X, train_y, test_y = train_test_split(X, y)
In [56]:
print(train_X.shape)
print(test_X.shape)
print(train_y.shape)
print(test_y.shape)
(11249, 20)
(3750, 20)
(11249,)
(3750,)
In [57]:
my_model_forest_train_test = RandomForestClassifier()
In [58]:
my_model_forest_train_test.fit(train_X, train_y)
Out[58]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [59]:
predictions = my_model_forest_train_test.predict(test_X)
probs = my_model_forest_train_test.predict_proba(test_X)
In [60]:
score = my_model_forest_train_test.score(test_X, test_y)
In [61]:
score
Out[61]:
0.99039999999999995

The score is now below the previous one. That is because we split the dataset and tested the model on unseen data.

In [62]:
confusion_matrix = pd.DataFrame(
    confusion_matrix(test_y, predictions), 
    columns=["Predicted False", "Predicted True"], 
    index=["Actual False", "Actual True"]
)
In [63]:
confusion_matrix
Out[63]:
Predicted False Predicted True
Actual False 2863 2
Actual True 34 851
In [64]:
my_model_forest_train_test.feature_importances_
Out[64]:
array([ 0.36459987,  0.12361047,  0.1481263 ,  0.14263597,  0.17250308,
        0.01077476,  0.0014324 ,  0.00216245,  0.00145016,  0.00156077,
        0.00195613,  0.00167453,  0.00157386,  0.00170969,  0.00306223,
        0.00287791,  0.00347191,  0.00566045,  0.00571655,  0.00344051])
In [65]:
features = train_X.columns
In [66]:
df_f = pd.DataFrame(my_model_forest_train_test.feature_importances_, columns=["importance"])
df_f["labels"] = features
df_f.sort_values("importance", inplace=True, ascending=False)
In [67]:
df_f
Out[67]:
importance labels
0 0.364600 satisfaction_level
4 0.172503 time_spend_company
2 0.148126 number_project
3 0.142636 average_montly_hours
1 0.123610 last_evaluation
5 0.010775 Work_accident
18 0.005717 salary_low
17 0.005660 salary_high
16 0.003472 sales_technical
19 0.003441 salary_medium
14 0.003062 sales_sales
15 0.002878 sales_support
7 0.002162 sales_IT
10 0.001956 sales_hr
13 0.001710 sales_product_mng
11 0.001675 sales_management
12 0.001574 sales_marketing
9 0.001561 sales_accounting
8 0.001450 sales_RandD
6 0.001432 promotion_last_5years

If we plot features importance we have some insight as to what features were most useful in our random forest model.

Satisfaction level comes first with 0.32 importance. Then comes the number of projects (0.19) and the time spent at the company (0.18)

This is useful when refining our model in the future.

In [75]:
IFrame('https://plot.ly/~FranciscaDias/71/', width=900, height=550)
Out[75]:
  • The average satisfaction level is 0.61
  • The average on Last Evaluation is 0.71. This feature is mildly positive correlated with the number of projects and the average monthly hours, which suggests that employers who take on more projects and work longer hours tend to have better evaluations
  • On average each employee works 201 hours a month
  • The maximum monthly work hours in this dataset is 310 hours and this corresponds to 18 employees. All of them left the company
  • Only 1.4% of employees stayed more than 10 years
  • 14% had a work accident and 24% of employees have left the company
  • We can see three distinct clusters when plotting Satisfaction and Evaluation: if you don't know how to build a prediction model, you should be able to spot the employees that are more likely to leave the company by plotting the data and identify these groups
  • The model used to predict employee turnover is the random forest. We tested 0.98 accuracy.