Exploratory data analysis and predicting Employee turnover

Introduction

In this report I analyze the "Human Resources Analytics" dataset and apply a random forest model to predict which employees will leave prematurely.

A common problem across companies is employee turnover. Companies often make an investment by training new employees, so every time an employee leaves the company, it represents a lost investment.

It would therefore be valuable to have a model that predicts when an employee is likely to leave, so that the company can offer them incentives to stay. This would save the company the time and money spent on training new employees.

So how can we predict when an employee will leave?

I will first go through all the features in this dataset, focusing on the cluster graph that plots all observations by satisfaction and evaluation. This graph is a way of spotting likely turnover without building a model. I will then make predictions using one of the most popular and easy-to-understand algorithms, the random forest.

Overview

This dataset was taken from Kaggle.com (https://www.kaggle.com/c/sm).

It consists of 14,999 observations, one per employee, each with both quantitative and qualitative information on that employee. We also know whether each employee left the company or not.

Let's see what information is available:

  • ■  Employee satisfaction level: ranges from 0 to 1

  • ■  Last evaluation: also ranges from 0 to 1

  • ■  Number of projects: The number of projects taken by each employee, from 2 to 7 projects

  • ■  Average monthly hours

  • ■  Time spent at the company: in years

  • ■  Whether they had a work accident: 1 if yes, 0 if no

  • ■  Whether they had a promotion in the last 5 years: 1 if yes, 0 if no

  • ■  Sales: the department in which the employee works: sales, technical, support, IT, product management, marketing, RandD, accounting, HR or management

  • ■  Salary: low, medium or high

  • ■  Whether the employee has left: 1 if yes, 0 if no
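As a minimal sketch of how these fields might be loaded and inspected with pandas: the column names and the four inline rows below are assumptions made for illustration (they are not taken from the dataset), and in practice the inline text would be replaced by the path to the downloaded CSV file.

```python
import io

import pandas as pd

# Toy rows for illustration only; the column names are assumed, not confirmed by the report.
TOY_CSV = """satisfaction_level,last_evaluation,number_project,average_monthly_hours,time_spend_company,work_accident,left,promotion_last_5years,sales,salary
0.38,0.53,2,157,3,0,1,0,sales,low
0.80,0.86,5,262,6,0,1,0,sales,medium
0.72,0.64,4,192,3,0,0,0,technical,medium
0.66,0.73,5,249,2,1,0,1,hr,high
"""


def load_employees(source):
    """Read the employee table and type the qualitative columns as categoricals."""
    df = pd.read_csv(source)
    df["sales"] = df["sales"].astype("category")
    # Salary has a natural order, so encode it explicitly.
    df["salary"] = pd.Categorical(
        df["salary"], categories=["low", "medium", "high"], ordered=True
    )
    return df


employees = load_employees(io.StringIO(TOY_CSV))
print(employees.dtypes)
print(employees.describe())  # count, mean, min, max and quartiles of the numeric columns
```

Typing salary as an ordered categorical keeps comparisons such as "medium or above" meaningful later on.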

Correlation

Satisfaction Level

  • ■  Satisfaction level ranges from 0 to 1

  • ■  The average satisfaction level is 0.61

  • ■  30% of employees are within the range 0.7-0.89

  • ■  If we consider employees with a satisfaction level below 0.5 to be unhappy, then 30% of employees are unhappy

Last Evaluation

  • ■  This metric has the same range as satisfaction level: it goes from 0 to 1

  • ■  The average is 0.71

  • ■  It is mildly positively correlated with the number of projects and the average monthly hours, which suggests that employees who take on more projects and work longer hours tend to have better evaluations
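A correlation of this kind can be checked with a Pearson correlation matrix. The sketch below uses a handful of made-up rows and assumed column names; on the real data the same call would be applied to the full table.

```python
import pandas as pd

# Made-up rows for illustration; the column names are assumptions.
toy = pd.DataFrame({
    "last_evaluation":       [0.50, 0.55, 0.70, 0.85, 0.90, 0.95],
    "number_project":        [2, 3, 3, 5, 6, 6],
    "average_monthly_hours": [150, 160, 200, 240, 270, 260],
})

# Pearson correlation between every pair of numeric columns.
corr = toy.corr(method="pearson")

# Correlation of each other feature with the last evaluation, strongest first.
with_eval = corr["last_evaluation"].drop("last_evaluation").sort_values(ascending=False)
print(with_eval)
```

A positive coefficient only says the two quantities move together; it does not show that longer hours cause better evaluations.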

Number of Projects

  • ■  On average an employee takes on 3.8 projects

  • ■  29% of employees take on 4 projects

Average Monthly Hours

  • ■  On average each employee works 201 hours a month

  • ■  The maximum in this dataset is 310 hours a month, which was worked by 18 employees

  • ■  Interestingly, all 18 of these employees left the company: 9 had a low salary and 9 had a medium salary. None of them had been promoted in the last 5 years

  • ■  Their satisfaction level was very low, although their average evaluation was very high (0.83). They took on more projects than average (6.2) and had spent 4 years at the company on average

  • ■  One third of these 18 employees worked in HR
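A profile like this one can be obtained by filtering on the maximum value and summarizing the remaining columns. This is only a sketch: the rows are made up and the column names are assumptions.

```python
import pandas as pd

# Made-up rows for illustration; the column names are assumptions.
toy = pd.DataFrame({
    "average_monthly_hours": [310, 310, 201, 160, 310, 250],
    "satisfaction_level":    [0.10, 0.09, 0.70, 0.80, 0.11, 0.60],
    "last_evaluation":       [0.85, 0.80, 0.70, 0.60, 0.84, 0.75],
    "number_project":        [6, 7, 4, 3, 6, 4],
    "left":                  [1, 1, 0, 0, 1, 0],
    "salary":                ["low", "medium", "low", "high", "low", "medium"],
})

# Keep only the employees who work the maximum number of monthly hours.
max_hours = toy["average_monthly_hours"].max()
top = toy[toy["average_monthly_hours"] == max_hours]

profile = {
    "employees": len(top),
    "share_left": top["left"].mean(),            # fraction of this group that left
    "mean_projects": top["number_project"].mean(),
    "salary_counts": top["salary"].value_counts().to_dict(),
}
print(profile)
```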

Time spent at the Company

  • ■  On average employees have been at the company for 3.5 years

  • ■  65% have been at the company for 2 or 3 years

Profile of employees who stayed longer than 10 years

  • ■  There are 214 employees in this dataset who stayed longer than 10 years

  • ■  Their satisfaction level is a little above average, at 0.65

  • ■  Their last evaluation is also a bit higher than average, at 0.73

  • ■  Both the number of projects and the monthly working hours of these employees are lower than average

  • ■  29% work in management and 27% work in sales

Left the company?

As one would expect, the average satisfaction level of the employees who left the company is well below the overall average, at 0.44.

Only 0.5% of those employees had a promotion in the last 5 years.

28% of them worked in sales, so turnover seems to be higher among salespeople.
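One caveat on this last point: the share of leavers who worked in sales also depends on how large the sales department is, so it is worth comparing it with the turnover rate inside each department. The sketch below shows the two calculations side by side on made-up rows; the column names are assumptions, and the numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows: 8 people in sales, 2 in hr, 2 in technical (column names assumed).
toy = pd.DataFrame({
    "sales": ["sales"] * 8 + ["hr"] * 2 + ["technical"] * 2,
    "left":  [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0],
})

# Share of all leavers that comes from each department (sums to 1).
share_of_leavers = toy.loc[toy["left"] == 1, "sales"].value_counts(normalize=True)

# Turnover rate inside each department (leavers divided by headcount).
turnover_rate = toy.groupby("sales")["left"].mean()

print(share_of_leavers)
print(turnover_rate)
```

In this toy table sales accounts for half of the leavers yet has the lowest turnover rate, simply because it is the largest department.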

In this dataset, 24% of employees have left the company. Let's see if we can find some pattern within this group of employees.

If we plot the level of satisfaction against the last evaluation for both the employees who left and those who stayed, we can see that those who left fall into 3 clusters, that is, 3 distinct groups:

  • ■  First Cluster: their level of satisfaction is in the lowest range, although their evaluation is very high

  • ■  Second Cluster: both satisfaction and evaluation are within the range 0.4 to 0.5

  • ■  Third Cluster: this cluster is a little more spread out than the other two, but we can still find a pattern: their last evaluation is high, above 0.8, and their satisfaction level goes from 0.7 to 0.9
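The three clusters can be turned into a simple rule that flags an employee without any trained model. This is a sketch only: the second and third rules use the ranges given above, while the bounds for the first cluster are assumptions, since the text describes it only as "lowest range" and "very high".

```python
from typing import Optional

# Assumed bounds for the first cluster (not stated in the text).
LOW_SATISFACTION_MAX = 0.15
HIGH_EVALUATION_MIN = 0.75


def leaver_cluster(satisfaction: float, evaluation: float) -> Optional[int]:
    """Return 1, 2 or 3 if the point falls in one of the leaver clusters, else None."""
    # Cluster 1: very unhappy but highly rated.
    if satisfaction <= LOW_SATISFACTION_MAX and evaluation >= HIGH_EVALUATION_MIN:
        return 1
    # Cluster 2: satisfaction and evaluation both between 0.4 and 0.5.
    if 0.4 <= satisfaction <= 0.5 and 0.4 <= evaluation <= 0.5:
        return 2
    # Cluster 3: satisfaction between 0.7 and 0.9 with evaluation above 0.8.
    if 0.7 <= satisfaction <= 0.9 and evaluation > 0.8:
        return 3
    return None


for point in [(0.10, 0.85), (0.45, 0.48), (0.80, 0.90), (0.60, 0.60)]:
    print(point, "->", leaver_cluster(*point))
```

Employees who stayed can also fall inside these rectangles, so a rule like this is a screening aid rather than a prediction.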

Building the model

The model was written in Python and all the source code is available on the main page.

The model used to predict employee turnover is the random forest. It reached an accuracy of 0.98 in testing.
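The source code itself is not reproduced here, so the following is only a sketch of how such a model might be trained and scored with scikit-learn. The data is randomly generated, and the column names, the 80/20 split and the hyperparameters are all assumptions; the accuracy it prints says nothing about the 0.98 reported above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the employee table (not the real data).
X = pd.DataFrame({
    "satisfaction_level": rng.uniform(0, 1, n),
    "last_evaluation": rng.uniform(0, 1, n),
    "number_project": rng.integers(2, 8, n),             # 2 to 7 projects
    "average_monthly_hours": rng.integers(100, 311, n),  # lower bound is arbitrary
})
# Synthetic label: employees with low satisfaction are marked as leavers.
y = (X["satisfaction_level"] < 0.4).astype(int)

# Hold out 20% of the rows so accuracy is measured on data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")
```

Because only about a quarter of the employees left, accuracy alone can look flattering; precision and recall on the leavers are worth checking as well.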

After running the model, we can get some insight into which features were most useful.

Below is the graph that plots feature importance in our random forest model:
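The graph itself is not reproduced in this text. As a sketch of where such a ranking comes from, scikit-learn exposes the impurity-based importances of a fitted forest through the `feature_importances_` attribute. The data below is synthetic and the feature names are assumptions, so the resulting ranking is not the one from the report.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 400

# Synthetic features (not the real data); only satisfaction drives the label.
X = pd.DataFrame({
    "satisfaction_level": rng.uniform(0, 1, n),
    "last_evaluation": rng.uniform(0, 1, n),
    "time_spend_company": rng.integers(2, 11, n),
})
y = (X["satisfaction_level"] < 0.4).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Importances sum to 1; sort them so the most useful feature comes first.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Calling `importances.plot.barh()` would turn the same series into a bar chart, provided matplotlib is installed.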

Conclusions

  • ■  The average satisfaction level is 0.61

  • ■  The average last evaluation is 0.71. This feature is mildly positively correlated with the number of projects and the average monthly hours, which suggests that employees who take on more projects and work longer hours tend to have better evaluations

  • ■  On average each employee works 201 hours a month

  • ■  The maximum in this dataset is 310 hours a month, which was worked by 18 employees. All of them left the company

  • ■  Only 1.4% of employees stayed more than 10 years

  • ■  14% had a work accident and 24% of employees have left the company

  • ■  We can see three distinct clusters when plotting satisfaction against evaluation: even without building a prediction model, you should be able to spot the employees who are more likely to leave the company by plotting the data and identifying these groups

  • ■  The model used to predict employee turnover is the random forest. It reached an accuracy of 0.98 in testing.