Francisca Dias
In this report I will perform data analysis on the "Human Resources Analytics" dataset and apply a random forest model to predict which employees will leave prematurely.
A common problem across companies is employee turnover. Companies invest in training new employees, so every time an employee leaves, that investment is lost.
It would therefore be valuable to have a model that predicts when an employee is likely to leave, so the company can offer them incentives to stay. This would save the time and money spent on training replacements.
So how can we predict when an employee will leave?
I will first go through all the features in this dataset, with a focus on the cluster graph that plots every observation by satisfaction and evaluation; this offers a way to reason about employee turnover without building a model. I will then make predictions using one of the most popular and easiest-to-understand algorithms, the random forest.
This dataset can be found here.
It consists of 14,999 observations, each containing both quantitative and qualitative information about an employee, including whether that employee left the company.
Let's see what information we have available:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import display, HTML
churn_df = pd.read_csv('hr_dataset.csv')
churn_df.describe()
from IPython.display import IFrame
IFrame('https://plot.ly/~FranciscaDias/73/', width=900, height=550)
IFrame('https://plot.ly/~FranciscaDias/67/', width=900, height=550)
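If the embedded Plotly charts are not available, a similar view of the satisfaction/evaluation clusters can be reproduced locally with matplotlib. This is a minimal sketch, assuming only the columns satisfaction_level, last_evaluation and left as they appear in this dataset.
# Scatter plot of satisfaction vs. last evaluation, coloured by whether the employee left
stayed_df = churn_df[churn_df['left'] == 0]
left_df = churn_df[churn_df['left'] == 1]
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(stayed_df['satisfaction_level'], stayed_df['last_evaluation'], s=5, alpha=0.3, label='Stayed')
ax.scatter(left_df['satisfaction_level'], left_df['last_evaluation'], s=5, alpha=0.3, label='Left')
ax.set_xlabel('Satisfaction level')
ax.set_ylabel('Last evaluation')
ax.legend()
plt.show()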
churn_df.satisfaction_level.describe()
col_to_use =['satisfaction_level', 'left']
satisfaction_level_df = churn_df[col_to_use]
# exports data to csv
# satisfaction_level_df.to_csv('satisfaction_level_df')
churn_df.last_evaluation.describe()
IFrame('https://plot.ly/~FranciscaDias/69/', width=900, height=550)
churn_df.number_project.describe()
churn_df.number_project.value_counts()
churn_df.average_montly_hours.describe()
average_montly_hours_310 = churn_df[churn_df['average_montly_hours']==310]
average_montly_hours_310
average_montly_hours_310.describe()
average_montly_hours_310.sales.value_counts()
col_to_use =['average_montly_hours', 'satisfaction_level']
average_monthly_hours_df = churn_df[col_to_use]
# exports data to csv
# average_monthly_hours_df.to_csv('average_monthly_hours_df.csv')
Profile of employees who stayed longer than 10 years
churn_df.time_spend_company.describe()
churn_df.time_spend_company.value_counts()
time_spent_10_years = churn_df[churn_df['time_spend_company'] == 10]
time_spent_10_years.head()
time_spent_10_years.describe()
time_spent_10_years.sales.value_counts()
churn_df.Work_accident.value_counts()
In this dataset, 24% of employees have left the company. Let's see if we can find a pattern among these employees.
IFrame('https://plot.ly/~FranciscaDias/65/', width=900, height=550)
churn_df.left.value_counts()
who_left_the_company = churn_df[churn_df['left'] ==1]
who_left_the_company.head()
who_left_the_company.describe()
who_left_the_company.sales.value_counts()
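To complement the per-group summaries above, the average value of each numeric feature can be compared for employees who left versus those who stayed. This is a small sketch; the numeric_only flag is an assumption here so that the text columns (sales, salary) are simply skipped.
# Compare mean feature values for leavers (left == 1) vs. stayers (left == 0)
churn_df.groupby('left').mean(numeric_only=True)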
churn_df.promotion_last_5years.value_counts()
promotion_yes_no = churn_df[churn_df['promotion_last_5years']==1]
promotion_yes_no.sales.value_counts()
col_to_use =['salary', 'left']
salary_df = churn_df[col_to_use]
# exports data to csv
# salary_df.to_csv('salary_df')
churn_df.dtypes
# Call it dataframe (df)
df = churn_df
df.dtypes
# Convert sales and salary to dummies
df = pd.get_dummies(df)
df.dtypes
# Isolate target data
y = df['left']
y.value_counts()
df.columns
len(df.columns)
cols_to_use = ['satisfaction_level', 'last_evaluation', 'number_project',
'average_montly_hours', 'time_spend_company', 'Work_accident',
'promotion_last_5years', 'sales_IT', 'sales_RandD', 'sales_accounting',
'sales_hr', 'sales_management', 'sales_marketing', 'sales_product_mng',
'sales_sales', 'sales_support', 'sales_technical', 'salary_high',
'salary_low', 'salary_medium']
X = df[cols_to_use]
len(X.columns)
# 14999 rows × 20 columns
X.head(2)
my_model_forest = RandomForestClassifier()
my_model_forest.fit(X, y)
my_model_forest.predict(X)
my_model_forest.score(X, y)
Our model scores almost 100%. But here is the catch: the model was evaluated on data it had already seen, so we have to make sure it can predict on unseen data, which is how it would be used in practice.
The model's practical value comes from making predictions on new data.
Therefore we will split the data into 75% for training and 25% for testing using the train_test_split function.
train_X, test_X, train_y, test_y = train_test_split(X, y)
print(train_X.shape)
print(test_X.shape)
print(train_y.shape)
print(test_y.shape)
my_model_forest_train_test = RandomForestClassifier()
my_model_forest_train_test.fit(train_X, train_y)
predictions = my_model_forest_train_test.predict(test_X)
probs = my_model_forest_train_test.predict_proba(test_X)
score = my_model_forest_train_test.score(test_X, test_y)
score
The score is now lower than before. That is because we split the dataset and evaluated the model on unseen data.
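A single train/test split can be sensitive to how the rows happen to be divided. As an additional check, not part of the original analysis, k-fold cross-validation averages the score over several splits; below is a minimal sketch assuming 5 folds.
from sklearn.model_selection import cross_val_score
# Average accuracy over 5 different train/test splits
cv_scores = cross_val_score(RandomForestClassifier(), X, y, cv=5)
print(cv_scores.mean())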
# Use a different variable name so we don't shadow the imported confusion_matrix function
cm = pd.DataFrame(
    confusion_matrix(test_y, predictions),
    columns=["Predicted False", "Predicted True"],
    index=["Actual False", "Actual True"]
)
cm
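Since roc_curve is already imported and the class probabilities were stored in probs, we can also look at the ROC curve of the test-set predictions. This is a sketch plotted with matplotlib; probs[:, 1] is the predicted probability that an employee leaves.
# ROC curve for the positive class (left == 1)
fpr, tpr, thresholds = roc_curve(test_y, probs[:, 1])
plt.plot(fpr, tpr, label='Random forest')
plt.plot([0, 1], [0, 1], linestyle='--', label='Chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()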
my_model_forest_train_test.feature_importances_
features = train_X.columns
df_f = pd.DataFrame(my_model_forest_train_test.feature_importances_, columns=["importance"])
df_f["labels"] = features
df_f.sort_values("importance", inplace=True, ascending=False)
df_f
If we plot the feature importances, we get some insight into which features were most useful to our random forest model.
Satisfaction level comes first with an importance of 0.32, followed by the number of projects (0.19) and the time spent at the company (0.18).
This is useful when refining our model in the future.
IFrame('https://plot.ly/~FranciscaDias/71/', width=900, height=550)
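As with the earlier charts, the feature importances can also be plotted locally from df_f. A minimal matplotlib sketch:
# Horizontal bar chart of feature importances, most important at the top
df_f_sorted = df_f.sort_values('importance')
plt.barh(df_f_sorted['labels'], df_f_sorted['importance'])
plt.xlabel('Importance')
plt.show()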