Francisca Dias
In this report I will perform data analysis on the "Human Resources Analytics" dataset and apply a random forest model to predict which employees will leave prematurely.
A common problem across companies is employee turnover. Companies invest in training new employees, so every time an employee leaves, that investment is lost.
It would therefore be valuable to have a model that predicts when an employee is likely to leave, so the company can offer them incentives to stay. This would save the time and money spent on training replacements.
So how can we predict when an employee will leave?
I will first go through all the features in this dataset, with a focus on the cluster graph that plots every observation by satisfaction and evaluation; this offers a way to reason about employee turnover without building a model. I will then make predictions using one of the most popular and easiest-to-understand algorithms, the random forest.
This dataset can be found here.
It consists of 14,999 observations, each containing both quantitative and qualitative information about an employee, including whether that employee left the company.
Let's see what information we have available:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import display, HTML
churn_df = pd.read_csv('hr_dataset.csv')
churn_df.describe()
from IPython.display import IFrame
IFrame('https://plot.ly/~FranciscaDias/73/', width=900, height=550)
IFrame('https://plot.ly/~FranciscaDias/67/', width=900, height=550)
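If the embedded Plotly charts are not available, a similar view of the satisfaction/evaluation clusters can be reproduced locally with matplotlib. This is a minimal sketch, assuming only the columns satisfaction_level, last_evaluation and left as they appear in this dataset.
# Scatter plot of satisfaction vs. last evaluation, coloured by whether the employee left
stayed_df = churn_df[churn_df['left'] == 0]
left_df = churn_df[churn_df['left'] == 1]
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(stayed_df['satisfaction_level'], stayed_df['last_evaluation'], s=5, alpha=0.3, label='Stayed')
ax.scatter(left_df['satisfaction_level'], left_df['last_evaluation'], s=5, alpha=0.3, label='Left')
ax.set_xlabel('Satisfaction level')
ax.set_ylabel('Last evaluation')
ax.legend()
plt.show()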
churn_df.satisfaction_level.describe()
col_to_use =['satisfaction_level', 'left']
satisfaction_level_df = churn_df[col_to_use]
# exports data to csv
# satisfaction_level_df.to_csv('satisfaction_level_df')
churn_df.last_evaluation.describe()
IFrame('https://plot.ly/~FranciscaDias/69/', width=900, height=550)
churn_df.number_project.describe()
churn_df.number_project.value_counts()
churn_df.average_montly_hours.describe()
average_montly_hours_310 = churn_df[churn_df['average_montly_hours']==310]
average_montly_hours_310
average_montly_hours_310.describe()
average_montly_hours_310.sales.value_counts()
col_to_use =['average_montly_hours', 'satisfaction_level']
average_monthly_hours_df = churn_df[col_to_use]
# exports data to csv
# average_monthly_hours_df.to_csv('average_monthly_hours_df.csv')
Profile of employees who stayed longer than 10 years
churn_df.time_spend_company.describe()
churn_df.time_spend_company.value_counts()
time_spent_10_years = churn_df[churn_df['time_spend_company'] == 10]
time_spent_10_years.head()
time_spent_10_years.describe()
time_spent_10_years.sales.value_counts()
churn_df.Work_accident.value_counts()
In this dataset, 24% of employees have left the company. Let's see if we can find a pattern among these employees.
IFrame('https://plot.ly/~FranciscaDias/65/', width=900, height=550)
churn_df.left.value_counts()
who_left_the_company = churn_df[churn_df['left'] ==1]
who_left_the_company.head()
who_left_the_company.describe()
who_left_the_company.sales.value_counts()
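To complement the per-group summaries above, the average value of each numeric feature can be compared for employees who left versus those who stayed. This is a small sketch; the numeric_only flag is an assumption here so that the text columns (sales, salary) are simply skipped.
# Compare mean feature values for leavers (left == 1) vs. stayers (left == 0)
churn_df.groupby('left').mean(numeric_only=True)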
churn_df.promotion_last_5years.value_counts()
promotion_yes_no = churn_df[churn_df['promotion_last_5years']==1]
promotion_yes_no.sales.value_counts()
col_to_use =['salary', 'left']
salary_df = churn_df[col_to_use]
# exports data to csv
# salary_df.to_csv('salary_df')
churn_df.dtypes
# Call it dataframe (df)
df = churn_df
df.dtypes
# Convert sales and salary to dummies
df = pd.get_dummies(df)
df.dtypes
# Isolate target data
y = df['left']
y.value_counts()
df.columns
len(df.columns)
cols_to_use = ['satisfaction_level', 'last_evaluation', 'number_project',
'average_montly_hours', 'time_spend_company', 'Work_accident',
'promotion_last_5years', 'sales_IT', 'sales_RandD', 'sales_accounting',
'sales_hr', 'sales_management', 'sales_marketing', 'sales_product_mng',
'sales_sales', 'sales_support', 'sales_technical', 'salary_high',
'salary_low', 'salary_medium']
X = df[cols_to_use]
len(X.columns)
# 14999 rows × 20 columns
X.head(2)
my_model_forest = RandomForestClassifier()
my_model_forest.fit(X, y)
my_model_forest.predict(X)
my_model_forest.score(X, y)
Our model scores almost 100%. But here is the catch: the model was evaluated on data it had already seen, so we have to make sure it can predict on unseen data, which is how it would be used in practice.
The model's practical value comes from making predictions on new data.
Therefore we will split the data into 75% for training and 25% for testing using the train_test_split function.
train_X, test_X, train_y, test_y = train_test_split(X, y)
print(train_X.shape)
print(test_X.shape)
print(train_y.shape)
print(test_y.shape)
my_model_forest_train_test = RandomForestClassifier()
my_model_forest_train_test.fit(train_X, train_y)
predictions = my_model_forest_train_test.predict(test_X)
probs = my_model_forest_train_test.predict_proba(test_X)
score = my_model_forest_train_test.score(test_X, test_y)
score
The score is now lower than before. That is because we split the dataset and evaluated the model on unseen data.
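A single train/test split can be sensitive to how the rows happen to be divided. As an additional check, not part of the original analysis, k-fold cross-validation averages the score over several splits; below is a minimal sketch assuming 5 folds.
from sklearn.model_selection import cross_val_score
# Average accuracy over 5 different train/test splits
cv_scores = cross_val_score(RandomForestClassifier(), X, y, cv=5)
print(cv_scores.mean())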
# Use a different variable name so we don't shadow the imported confusion_matrix function
cm = pd.DataFrame(
    confusion_matrix(test_y, predictions),
    columns=["Predicted False", "Predicted True"],
    index=["Actual False", "Actual True"]
)
cm
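Since roc_curve is already imported and the class probabilities were stored in probs, we can also look at the ROC curve of the test-set predictions. This is a sketch plotted with matplotlib; probs[:, 1] is the predicted probability that an employee leaves.
# ROC curve for the positive class (left == 1)
fpr, tpr, thresholds = roc_curve(test_y, probs[:, 1])
plt.plot(fpr, tpr, label='Random forest')
plt.plot([0, 1], [0, 1], linestyle='--', label='Chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()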
my_model_forest_train_test.feature_importances_
features = train_X.columns
df_f = pd.DataFrame(my_model_forest_train_test.feature_importances_, columns=["importance"])
df_f["labels"] = features
df_f.sort_values("importance", inplace=True, ascending=False)
df_f
If we plot the feature importances, we get some insight into which features were most useful to our random forest model.
Satisfaction level comes first with an importance of 0.32, followed by the number of projects (0.19) and the time spent at the company (0.18).
This is useful when refining our model in the future.
IFrame('https://plot.ly/~FranciscaDias/71/', width=900, height=550)
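As with the earlier charts, the feature importances can also be plotted locally from df_f. A minimal matplotlib sketch:
# Horizontal bar chart of feature importances, most important at the top
df_f_sorted = df_f.sort_values('importance')
plt.barh(df_f_sorted['labels'], df_f_sorted['importance'])
plt.xlabel('Importance')
plt.show()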