Understand why employees leave a company and apply various machine learning models to predict the next leaver!
"Yeah, they all said that to me...", Bob replied as we were at Starbucks sipping on our dark roast coffee. Bob is a friend of mine and was the owner of a multi-million dollar company, that's right, "m-i-l-l-i-o-n". He used to tell me stories about how his company's productivity and growth has sky rocketed from the previous years and everything has been going great. But recently, he's been noticing some decline within his company. In a five month period, he lost one-fifth of his employees. At least a dozen of them throughout each department made phone calls and even left sticky notes on their tables informing him about their leave. Nobody knew what was happening. In that year, he was contemplating about filing for bankruptcy. Fast-forward seven months later, he's having a conversation with his co-founder of the company. The conversation ends with, "I quit..."
That is the last thing anybody wants to hear from their employees. In a sense, it’s the employees who make the company. It’s the employees who do the work. It’s the employees who shape the company’s culture. Long-term success, a healthy work environment, and high employee retention are all signs of a successful company. But when a company experiences a high rate of employee turnover, something is going wrong, and the loss of these innovative and valuable employees can lead to huge monetary losses.
A healthy organization and culture are always a good sign of future prosperity. Recognizing and understanding the factors associated with employee turnover will allow companies and individuals to keep it from happening, and may even increase employee productivity and growth. These predictive insights give managers the opportunity to take corrective steps to build and preserve their successful business.
** "You don't build a business. You build people, and people build the business." - Zig Ziglar**
Hopefully this kernel adds some new insights/perspectives to the data science community! If there are any suggestions/changes you would like to see in the kernel, please let me know :). I appreciate every ounce of help! #KernelsAward #Kaggle
This notebook will always be a work in progress. Please leave any comments about further improvements to the notebook! Any feedback or constructive criticism is greatly appreciated. Thank you guys!
Bob's multi-million dollar company is about to go bankrupt and he wants to know why his employees are leaving.
Bob the Boss
My goal is to understand what factors contribute most to employee turnover and create a model that can predict if a certain employee will leave the company or not.
I’ll be following a typical data science pipeline, which is called “OSEMN” (pronounced awesome).
Obtaining the data is the first step in solving the problem.
Scrubbing or cleaning the data is the next step. This includes data imputation of missing or invalid data and fixing column names.
Exploring the data follows right after and allows further insight into what our dataset contains: looking for any outliers or anomalous data, and understanding the relationship each explanatory variable has with the response variable, which we can do with a correlation matrix.
Modeling the data will give us our predictive power on whether an employee will leave.
Interpreting the data is last. With all the results and analysis of the data, what conclusion is made? What factors contributed most to employee turnover? What relationships between variables were found?
Note: The data was found from the “Human Resources Analytics” dataset provided by Kaggle’s website. https://www.kaggle.com/ludobenistant/hr-analytics
Note: THIS DATASET IS SIMULATED.
# Import the necessary modules for data manipulation and visual representation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
%matplotlib inline
# Read the analytics csv file and store our dataset into a dataframe called "df"
df = pd.read_csv('HR_comma_sep.csv')
Typically, cleaning the data requires a lot of work and can be a very tedious procedure. This dataset from Kaggle is super clean and contains no missing values. But still, I will have to examine the dataset to make sure that everything else is readable and that the observation values match the feature names appropriately.
# Check to see if there are any missing values in our data set
df.isnull().any()
satisfaction_level False
last_evaluation False
number_project False
average_montly_hours False
time_spend_company False
Work_accident False
left False
promotion_last_5years False
sales False
salary False
dtype: bool
# Get a quick overview of what we are dealing with in our dataset
df.head()
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | sales | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
# Renaming certain columns for better readability
df = df.rename(columns={'satisfaction_level': 'satisfaction',
'last_evaluation': 'evaluation',
'number_project': 'projectCount',
'average_montly_hours': 'averageMonthlyHours',
'time_spend_company': 'yearsAtCompany',
'Work_accident': 'workAccident',
'promotion_last_5years': 'promotion',
'sales' : 'department',
'left' : 'turnover'
})
# Move the response variable "turnover" to the front of the table
front = df['turnover']
df.drop(labels=['turnover'], axis=1,inplace = True)
df.insert(0, 'turnover', front)
df.head()
| | turnover | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident | promotion | department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 0 | sales | low |
| 1 | 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 0 | sales | medium |
| 2 | 1 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 0 | sales | medium |
| 3 | 1 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 0 | sales | low |
| 4 | 1 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 0 | sales | low |
The dataset has 14,999 observations and 10 features:
# The dataset contains 10 columns and 14999 observations
df.shape
(14999, 10)
# Check the type of our features.
df.dtypes
turnover int64
satisfaction float64
evaluation float64
projectCount int64
averageMonthlyHours int64
yearsAtCompany int64
workAccident int64
promotion int64
department object
salary object
dtype: object
# Looks like about 76% of employees stayed and 24% of employees left.
# NOTE: When performing cross-validation, it's important to maintain this turnover ratio
turnover_rate = df.turnover.value_counts() / 14999
turnover_rate
0 0.761917
1 0.238083
Name: turnover, dtype: float64
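Since that note about preserving the turnover ratio matters later, here is a minimal sketch (my addition, not part of the original kernel) of how StratifiedKFold keeps the 76/24 split in every cross-validation fold:
# StratifiedKFold preserves the class proportions of "turnover" in each fold
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
for train_idx, test_idx in skf.split(df.drop('turnover', axis=1), df['turnover']):
    # each fold's turnover rate stays close to the overall 0.238
    print("Fold turnover rate: %.3f" % df['turnover'].iloc[test_idx].mean())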
# Display the statistical overview of the employees
df.describe()
| | turnover | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident | promotion |
|---|---|---|---|---|---|---|---|---|
| count | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 |
| mean | 0.238083 | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.144610 | 0.021268 |
| std | 0.425924 | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.144281 |
| min | 0.000000 | 0.090000 | 0.360000 | 2.000000 | 96.000000 | 2.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.440000 | 0.560000 | 3.000000 | 156.000000 | 3.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.640000 | 0.720000 | 4.000000 | 200.000000 | 3.000000 | 0.000000 | 0.000000 |
| 75% | 0.000000 | 0.820000 | 0.870000 | 5.000000 | 245.000000 | 4.000000 | 0.000000 | 0.000000 |
| max | 1.000000 | 1.000000 | 1.000000 | 7.000000 | 310.000000 | 10.000000 | 1.000000 | 1.000000 |
# Overview of summary (Turnover V.S. Non-turnover)
turnover_Summary = df.groupby('turnover')
turnover_Summary.mean()
| turnover | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident | promotion |
|---|---|---|---|---|---|---|---|
| 0 | 0.666810 | 0.715473 | 3.786664 | 199.060203 | 3.380032 | 0.175009 | 0.026251 |
| 1 | 0.440098 | 0.718113 | 3.855503 | 207.419210 | 3.876505 | 0.047326 | 0.005321 |
Moderate Positively Correlated Features: projectCount vs. averageMonthlyHours (0.417), projectCount vs. evaluation (0.349), averageMonthlyHours vs. evaluation (0.340)
Moderate Negatively Correlated Feature: satisfaction vs. turnover (-0.388)
Stop and Think:
Summary:
From the heatmap, there is a positive (+) correlation between projectCount, averageMonthlyHours, and evaluation, which could mean that the employees who spent more hours and did more projects were evaluated highly.
For the negative (-) relationships, turnover and satisfaction are highly correlated. I'm assuming that people tend to leave a company more when they are less satisfied.
# Correlation Matrix
corr = df.corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.title('Heatmap of Correlation Matrix')
corr
| | turnover | satisfaction | evaluation | projectCount | averageMonthlyHours | yearsAtCompany | workAccident | promotion |
|---|---|---|---|---|---|---|---|---|
| turnover | 1.000000 | -0.388375 | 0.006567 | 0.023787 | 0.071287 | 0.144822 | -0.154622 | -0.061788 |
| satisfaction | -0.388375 | 1.000000 | 0.105021 | -0.142970 | -0.020048 | -0.100866 | 0.058697 | 0.025605 |
| evaluation | 0.006567 | 0.105021 | 1.000000 | 0.349333 | 0.339742 | 0.131591 | -0.007104 | -0.008684 |
| projectCount | 0.023787 | -0.142970 | 0.349333 | 1.000000 | 0.417211 | 0.196786 | -0.004741 | -0.006064 |
| averageMonthlyHours | 0.071287 | -0.020048 | 0.339742 | 0.417211 | 1.000000 | 0.127755 | -0.010143 | -0.003544 |
| yearsAtCompany | 0.144822 | -0.100866 | 0.131591 | 0.196786 | 0.127755 | 1.000000 | 0.002120 | 0.067433 |
| workAccident | -0.154622 | 0.058697 | -0.007104 | -0.004741 | -0.010143 | 0.002120 | 1.000000 | 0.039245 |
| promotion | -0.061788 | 0.025605 | -0.008684 | -0.006064 | -0.003544 | 0.067433 | 0.039245 | 1.000000 |
A one-sample t-test checks whether a sample mean differs from the population mean. Let's test to see whether the average satisfaction level of employees that had a turnover differs from the entire employee population.
Hypothesis Testing: Is there a significant difference in the mean satisfaction level between employees who had a turnover and the entire employee population?
Null Hypothesis: (H0: μTS = μES) There is no difference in the mean satisfaction level between employees who left and the entire employee population.
Alternate Hypothesis: (HA: μTS != μES) There is a difference in the mean satisfaction level between employees who left and the entire employee population.
# Let's compare the means of our employee turnover satisfaction against the employee population satisfaction
emp_population_satisfaction = df['satisfaction'].mean()
emp_turnover_satisfaction = df[df['turnover']==1]['satisfaction'].mean()
print( 'The mean for the employee population is: ' + str(emp_population_satisfaction) )
print( 'The mean for the employees that had a turnover is: ' + str(emp_turnover_satisfaction) )
The mean for the employee population is: 0.6128335222348166
The mean for the employees that had a turnover is: 0.44009801176140917
Let's conduct a t-test at 95% confidence level and see if it correctly rejects the null hypothesis that the sample comes from the same distribution as the employee population. To conduct a one sample t-test, we can use the stats.ttest_1samp() function:
import scipy.stats as stats
stats.ttest_1samp(a= df[df['turnover']==1]['satisfaction'], # Sample of Employee satisfaction who had a Turnover
popmean = emp_population_satisfaction) # Employee Population satisfaction mean
Ttest_1sampResult(statistic=-39.109488943484457, pvalue=9.012781195378076e-279)
The test result shows the test statistic "t" is equal to -39.109. This test statistic tells us how much the sample mean deviates from the null hypothesis. If the t-statistic lies outside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we reject the null hypothesis. We can check the quantiles with stats.t.ppf():
If the t-statistic value we calculated above (-39.109) is outside the quantiles, then we can reject the null hypothesis
degree_freedom = len(df[df['turnover']==1])
LQ = stats.t.ppf(0.025, degree_freedom)  # Left quantile (2.5%)
RQ = stats.t.ppf(0.975, degree_freedom)  # Right quantile (97.5%)
print ('The t-distribution left quantile is: ' + str(LQ))
print ('The t-distribution right quantile is: ' + str(RQ))
The t-distribution left quantile is: -1.9606285216
The t-distribution right quantile is: 1.9606285216
Reject the null hypothesis because: the t-statistic of -39.109 falls far outside the quantile range [-1.96, 1.96], and the p-value is far below 0.05.
Based on the statistical analysis of a one-sample t-test, there seems to be a significant difference between the mean satisfaction of employees who had a turnover and the entire employee population. The extremely low p-value of 9.012e-279 at a 5% significance level is a good indicator to reject the null hypothesis.
But this does not necessarily mean that there is practical significance. We would have to conduct more experiments or maybe collect more data about the employees in order to come up with a more accurate finding.
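As a hedged aside (my addition, not part of the original analysis), one way to gauge practical significance is an effect size such as Cohen's d, sketched here with a simple pooled standard deviation:
# Effect size sketch: Cohen's d for satisfaction, leavers vs. stayers
sat_left = df[df['turnover'] == 1]['satisfaction']
sat_stay = df[df['turnover'] == 0]['satisfaction']
pooled_std = np.sqrt((sat_left.var() + sat_stay.var()) / 2)  # simple pooled std
print("Cohen's d: %.2f" % ((sat_stay.mean() - sat_left.mean()) / pooled_std))
# By convention, d around 0.8 or higher is considered a large effect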
Summary: Let's examine the distribution on some of the employee's features. Here's what I found:
Stop and Think:
# Set up the matplotlib figure
f, axes = plt.subplots(ncols=3, figsize=(15, 6))
# Graph Employee Satisfaction
sns.distplot(df.satisfaction, kde=False, color="g", ax=axes[0]).set_title('Employee Satisfaction Distribution')
axes[0].set_ylabel('Employee Count')
# Graph Employee Evaluation
sns.distplot(df.evaluation, kde=False, color="r", ax=axes[1]).set_title('Employee Evaluation Distribution')
axes[1].set_ylabel('Employee Count')
# Graph Employee Average Monthly Hours
sns.distplot(df.averageMonthlyHours, kde=False, color="b", ax=axes[2]).set_title('Employee Average Monthly Hours Distribution')
axes[2].set_ylabel('Employee Count')
Summary: This is not unusual. Here's what I found:
Stop and Think:
f, ax = plt.subplots(figsize=(15, 4))
sns.countplot(y="salary", hue='turnover', data=df).set_title('Employee Salary Turnover Distribution');
Summary: Let's see more information about the departments. Here's what I found:
Stop and Think:
# Employee distribution across departments
# Types of colors
color_types = ['#78C850','#F08030','#6890F0','#A8B820','#A8A878','#A040A0','#F8D030',
'#E0C068','#EE99AC','#C03028','#F85888','#B8A038','#705898','#98D8D8','#7038F8']
# Count Plot (a.k.a. Bar Plot)
sns.countplot(x='department', data=df, palette=color_types).set_title('Employee Department Distribution');
# Rotate x-labels
plt.xticks(rotation=-45)
f, ax = plt.subplots(figsize=(15, 5))
sns.countplot(y="department", hue='turnover', data=df).set_title('Employee Department Turnover Distribution');
Summary: This graph is quite interesting as well. Here's what I found:
Stop and Think:
ax = sns.barplot(x="projectCount", y="projectCount", hue="turnover", data=df, estimator=lambda x: len(x) / len(df) * 100)
ax.set(ylabel="Percent")
Summary:
# Kernel Density Plot
fig = plt.figure(figsize=(15,4),)
ax=sns.kdeplot(df.loc[(df['turnover'] == 0),'evaluation'] , color='b',shade=True,label='no turnover')
ax=sns.kdeplot(df.loc[(df['turnover'] == 1),'evaluation'] , color='r',shade=True, label='turnover')
plt.title('Employee Evaluation Distribution - Turnover V.S. No Turnover')
Summary:
#KDEPlot: Kernel Density Estimate Plot
fig = plt.figure(figsize=(15,4))
ax=sns.kdeplot(df.loc[(df['turnover'] == 0),'averageMonthlyHours'] , color='b',shade=True, label='no turnover')
ax=sns.kdeplot(df.loc[(df['turnover'] == 1),'averageMonthlyHours'] , color='r',shade=True, label='turnover')
plt.title('Employee Average Monthly Hours Distribution - Turnover V.S. No Turnover')
Summary:
#KDEPlot: Kernel Density Estimate Plot
fig = plt.figure(figsize=(15,4))
ax=sns.kdeplot(df.loc[(df['turnover'] == 0),'satisfaction'] , color='b',shade=True, label='no turnover')
ax=sns.kdeplot(df.loc[(df['turnover'] == 1),'satisfaction'] , color='r',shade=True, label='turnover')
plt.title('Employee Satisfaction Distribution - Turnover V.S. No Turnover')
Summary:
Stop and Think:
# ProjectCount VS AverageMonthlyHours [BOXPLOT]
# Looks like the average employees who stayed worked about 200 hours/month. Those that had a turnover worked about 250 hours/month or 150 hours/month
sns.boxplot(x="projectCount", y="averageMonthlyHours", hue="turnover", data=df)
Summary: This graph looks very similar to the graph above. What I find strange here is the turnover group: there is an increase in evaluation for employees who did more projects within the turnover group. But for the non-turnover group, employees had a consistent evaluation score despite the increase in project counts.
Questions to think about:
# ProjectCount VS Evaluation
# Looks like employees who did not leave the company had an average evaluation of around 70%, even with different projectCounts
# There is a huge skew in employees who had a turnover though. It drastically changes after 3 projectCounts.
# Employees that had two projects and a horrible evaluation left. Employees with more than 3 projects and super high evaluations left.
sns.boxplot(x="projectCount", y="evaluation", hue="turnover", data=df)
Summary: This is by far the most compelling graph. This is what I found:
Cluster 1 (Hard-working and Sad Employee): Satisfaction was below 0.2 and evaluations were greater than 0.75. Which could be a good indication that employees who left the company were good workers but felt horrible at their job.
Cluster 2 (Bad and Sad Employee): Satisfaction between about 0.35~0.45 and evaluations below ~0.58. This could be seen as employees who were badly evaluated and felt bad at work.
Cluster 3 (Hard-working and Happy Employee): Satisfaction between 0.7~1.0 and evaluations were greater than 0.8. Which could mean that employees in this cluster were "ideal". They loved their work and were evaluated highly for their performance.
sns.lmplot(x='satisfaction', y='evaluation', data=df,
           fit_reg=False,  # No regression line
           hue='turnover')  # Color by turnover
Summary: Let's see if there's a point where employees start leaving the company. Here's what I found:
Stop and Think:
ax = sns.barplot(x="yearsAtCompany", y="yearsAtCompany", hue="turnover", data=df, estimator=lambda x: len(x) / len(df) * 100)
ax.set(ylabel="Percent")
Cluster 1 (Blue): Hard-working and Sad Employees
Cluster 2 (Red): Bad and Sad Employee
Cluster 3 (Green): Hard-working and Happy Employee
Clustering PROBLEM:
- How do we know that there are "3" clusters?
- We would need expert domain knowledge to classify the right number of clusters
- Hidden unknown structures could be present
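One common heuristic for the first question is the elbow method: fit KMeans for a range of k values and look for the point where inertia stops dropping sharply. A minimal sketch (my addition, not part of the original kernel):
# Elbow method sketch for choosing the number of clusters
from sklearn.cluster import KMeans
X_cluster = df[df.turnover == 1][["satisfaction", "evaluation"]]
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, random_state=2).fit(X_cluster)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances
plt.plot(range(1, 9), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.title('Elbow Method Sketch')
plt.show()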
# Import KMeans Model
from sklearn.cluster import KMeans
# Graph and create 3 clusters of Employee Turnover
kmeans = KMeans(n_clusters=3,random_state=2)
kmeans.fit(df[df.turnover==1][["satisfaction","evaluation"]])
kmeans_colors = ['green' if c == 0 else 'blue' if c == 2 else 'red' for c in kmeans.labels_]
fig = plt.figure(figsize=(10, 6))
plt.scatter(x="satisfaction",y="evaluation", data=df[df.turnover==1],
alpha=0.25,color = kmeans_colors)
plt.xlabel("Satisfaction")
plt.ylabel("Evaluation")
plt.scatter(x=kmeans.cluster_centers_[:,0],y=kmeans.cluster_centers_[:,1],color="black",marker="X",s=100)
plt.title("Clusters of Employee Turnover")
plt.show()
The best model performance out of the four (Decision Tree Model, AdaBoost Model, Logistic Regression Model, Random Forest Model) was Random Forest!
Note: Base Rate
Note: Evaluating the Model
Precision and Recall / Class Imbalance
This dataset is an example of a class imbalance problem, because of the skewed distribution of employees who did and did not leave. The more skewed the classes are, the less informative accuracy becomes.
In this case, evaluating our model’s algorithm based on accuracy alone is the wrong thing to measure. We would have to know which errors we care about and which correct decisions matter most. Accuracy alone misses two kinds of error that need to be taken into consideration in this type of evaluation: False Positive and False Negative errors.
False Positives (Type I Error): You predict that the employee will leave, but they do not.
False Negatives (Type II Error): You predict that the employee will not leave, but they do leave.
In this problem, what type of errors do we care about more? False Positives or False Negatives?
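To make these two errors concrete, here is a small illustration with made-up labels and predictions (hypothetical data, not model output):
# Toy example: precision/recall vs. Type I and Type II errors
from sklearn.metrics import confusion_matrix, precision_score, recall_score
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = left, 0 = stayed (hypothetical)
y_pred_toy = [1, 0, 0, 1, 1, 0, 1, 0]  # hypothetical predictions
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print("False Positives (Type I):", fp)   # predicted leave, actually stayed
print("False Negatives (Type II):", fn)  # predicted stay, actually left
print("Precision: %.2f" % precision_score(y_true_toy, y_pred_toy))  # tp / (tp + fp)
print("Recall:    %.2f" % recall_score(y_true_toy, y_pred_toy))     # tp / (tp + fn)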
Note: Different Ways to Evaluate Classification Models
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, confusion_matrix, precision_recall_curve
from sklearn.preprocessing import RobustScaler
# Create dummy variables for the 'department' and 'salary' features, since they are categorical
department = pd.get_dummies(data=df['department'],drop_first=True,prefix='dep') #drop first column to avoid dummy trap
salary = pd.get_dummies(data=df['salary'],drop_first=True,prefix='sal')
df.drop(['department','salary'],axis=1,inplace=True)
df = pd.concat([df,department,salary],axis=1)
# Create base rate model
def base_rate_model(X):
y = np.zeros(X.shape[0])
return y
# Create train and test splits
target_name = 'turnover'
X = df.drop('turnover', axis=1)
robust_scaler = RobustScaler()
X = robust_scaler.fit_transform(X)
y=df[target_name]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.15, random_state=123, stratify=y)
# Check accuracy of base rate model
y_base_rate = base_rate_model(X_test)
from sklearn.metrics import accuracy_score
print ("Base rate accuracy is %2.2f" % accuracy_score(y_test, y_base_rate))
Base rate accuracy is 0.76
# Check accuracy of Logistic Model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='l2', C=1)
model.fit(X_train, y_train)
print ("Logistic accuracy is %2.2f" % accuracy_score(y_test, model.predict(X_test)))
Logistic accuracy is 0.79
# Using 10 fold Cross-Validation to train our Logistic Regression Model
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
modelCV = LogisticRegression(class_weight = "balanced")
scoring = 'roc_auc'
results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print("AUC: %.3f (%.3f)" % (results.mean(), results.std()))
AUC: 0.829 (0.011)
# Compare the Logistic Regression Model V.S. Base Rate Model V.S. Random Forest Model
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
print ("---Base Model---")
base_roc_auc = roc_auc_score(y_test, base_rate_model(X_test))
print ("Base Rate AUC = %2.2f" % base_roc_auc)
print(classification_report(y_test, base_rate_model(X_test)))
# NOTE: By adding in "class_weight = balanced", the Logistic AUC increased by about 10%! This reweights the classes in the loss function to compensate for the imbalance
logis = LogisticRegression(class_weight = "balanced")
logis.fit(X_train, y_train)
print ("\n\n ---Logistic Model---")
logit_roc_auc = roc_auc_score(y_test, logis.predict(X_test))
print ("Logistic AUC = %2.2f" % logit_roc_auc)
print(classification_report(y_test, logis.predict(X_test)))
# Decision Tree Model
dtree = tree.DecisionTreeClassifier(
#max_depth=3,
class_weight="balanced",
min_weight_fraction_leaf=0.01
)
dtree = dtree.fit(X_train,y_train)
print ("\n\n ---Decision Tree Model---")
dt_roc_auc = roc_auc_score(y_test, dtree.predict(X_test))
print ("Decision Tree AUC = %2.2f" % dt_roc_auc)
print(classification_report(y_test, dtree.predict(X_test)))
# Random Forest Model
rf = RandomForestClassifier(
n_estimators=1000,
max_depth=None,
min_samples_split=10,
class_weight="balanced"
#min_weight_fraction_leaf=0.02
)
rf.fit(X_train, y_train)
print ("\n\n ---Random Forest Model---")
rf_roc_auc = roc_auc_score(y_test, rf.predict(X_test))
print ("Random Forest AUC = %2.2f" % rf_roc_auc)
print(classification_report(y_test, rf.predict(X_test)))
# Ada Boost
ada = AdaBoostClassifier(n_estimators=400, learning_rate=0.1)
ada.fit(X_train,y_train)
print ("\n\n ---AdaBoost Model---")
ada_roc_auc = roc_auc_score(y_test, ada.predict(X_test))
print ("AdaBoost AUC = %2.2f" % ada_roc_auc)
print(classification_report(y_test, ada.predict(X_test)))
---Base Model---
Base Rate AUC = 0.50
precision recall f1-score support
0 0.76 1.00 0.86 1714
1 0.00 0.00 0.00 536
avg / total 0.58 0.76 0.66 2250
---Logistic Model---
Logistic AUC = 0.78
precision recall f1-score support
0 0.92 0.75 0.83 1714
1 0.50 0.80 0.62 536
avg / total 0.82 0.76 0.78 2250
---Decision Tree Model---
Decision Tree AUC = 0.94
precision recall f1-score support
0 0.97 0.97 0.97 1714
1 0.92 0.91 0.92 536
avg / total 0.96 0.96 0.96 2250
---Random Forest Model---
Random Forest AUC = 0.98
precision recall f1-score support
0 0.99 1.00 0.99 1714
1 0.99 0.96 0.97 536
avg / total 0.99 0.99 0.99 2250
---AdaBoost Model---
AdaBoost AUC = 0.93
precision recall f1-score support
0 0.96 0.97 0.97 1714
1 0.91 0.88 0.90 536
avg / total 0.95 0.95 0.95 2250
# Create ROC Graph
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, logis.predict_proba(X_test)[:,1])
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf.predict_proba(X_test)[:,1])
dt_fpr, dt_tpr, dt_thresholds = roc_curve(y_test, dtree.predict_proba(X_test)[:,1])
ada_fpr, ada_tpr, ada_thresholds = roc_curve(y_test, ada.predict_proba(X_test)[:,1])
plt.figure()
# Plot Logistic Regression ROC
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
# Plot Random Forest ROC
plt.plot(rf_fpr, rf_tpr, label='Random Forest (area = %0.2f)' % rf_roc_auc)
# Plot Decision Tree ROC
plt.plot(dt_fpr, dt_tpr, label='Decision Tree (area = %0.2f)' % dt_roc_auc)
# Plot AdaBoost ROC
plt.plot(ada_fpr, ada_tpr, label='AdaBoost (area = %0.2f)' % ada_roc_auc)
# Plot Base Rate ROC
plt.plot([0,1], [0,1], 'k--', label='Base Rate')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show()
Top 3 Features:
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (12,6)
## plot the importances ##
importances = dtree.feature_importances_
feat_names = df.drop(['turnover'],axis=1).columns
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(12,6))
plt.title("Feature importances by DecisionTreeClassifier")
plt.bar(range(len(indices)), importances[indices], color='lightblue', align="center")
plt.step(range(len(indices)), np.cumsum(importances[indices]), where='mid', label='Cumulative')
plt.xticks(range(len(indices)), feat_names[indices], rotation='vertical',fontsize=14)
plt.xlim([-1, len(indices)])
plt.show()
Note:
This was something interesting to add to the notebook, but I'm still kind of confused on how the root node is (satisfaction <= -0.461). How does satisfaction level become negative? If anybody can respond to this, please do so!
# Requires the "graphviz" Python package and the Graphviz binaries (e.g. pip install graphviz)
import graphviz
from sklearn import tree
dot_data = tree.export_graphviz(dtree, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("Employee Turnover")
dot_data = tree.export_graphviz(dtree, out_file=None,
                                feature_names=feat_names,
                                class_names=['no turnover', 'turnover'],  # names for class 0 and class 1
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph
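As for the negative satisfaction threshold at the root node: most likely it is because the features were transformed with RobustScaler before training, which subtracts each feature's median and divides by its IQR, so scaled values can be negative. A sketch (my addition) of mapping the split point back to the original scale:
# Map the scaled split point back to the original satisfaction scale
# RobustScaler: scaled = (x - center_) / scale_, so x = scaled * scale_ + center_
sat_idx = list(df.drop('turnover', axis=1).columns).index('satisfaction')
center = robust_scaler.center_[sat_idx]  # median of satisfaction
scale = robust_scaler.scale_[sat_idx]    # IQR of satisfaction
print("Root split on the original scale: satisfaction <= %.3f" % (-0.461 * scale + center))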
Summary: With all of this information, this is what Bob should know about his company and why his employees probably left:
"You don't build a business. You build people, and people build the business." - Zig Ziglar
Binary Classification: Turnover V.S. Non Turnover
Instance Scoring: Likelihood of employee responding to an offer/incentive to save them from leaving.
Need for Application: Save employees from leaving
In our employee retention problem, rather than simply predicting whether an employee will leave the company within a certain time frame, we would much rather have an estimate of the probability that he/she will leave the company. We would rank employees by their probability of leaving, then allocate a limited incentive budget to the highest probability instances.
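A minimal sketch of that ranking idea, assuming the fitted Random Forest rf from above and a purely hypothetical budget of 100 incentives:
# Rank all employees by predicted probability of leaving (instance scoring)
leave_prob = rf.predict_proba(X)[:, 1]   # P(turnover = 1) for each employee
ranking = np.argsort(leave_prob)[::-1]   # highest-risk employees first
budget = 100                             # hypothetical incentive budget
at_risk = ranking[:budget]               # target these employees first
print("Top predicted leave probabilities:", leave_prob[at_risk[:5]])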
Consider the employee turnover domain, where an employee is given treatment by Human Resources because they think the employee will leave the company within a month, but the employee actually does not. This is a false positive. This mistake could be expensive, inconvenient, and time consuming for both Human Resources and the employee, but it can still be a good investment in relational growth.
Compare this with the opposite error, where Human Resources does not give treatment/incentives to employees who then do leave. This is a false negative. This type of error is more detrimental because the company loses an employee, which could lead to major setbacks and the cost of rehiring. Depending on these errors, different costs are weighed based on the type of employee being treated. For example, if it’s a high-salary employee, do we need a costlier form of treatment? What if it’s a low-salary employee? The cost of each error is different and should be weighed accordingly.
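One way to make that weighing concrete is a cost matrix. A sketch with purely hypothetical dollar costs:
# Cost matrix sketch (all costs hypothetical): FP = wasted incentive, FN = lost employee
cost_fp = 500     # hypothetical cost of treating an employee who would have stayed
cost_fn = 10000   # hypothetical cost of replacing an employee we failed to treat
tn, fp, fn, tp = confusion_matrix(y_test, rf.predict(X_test)).ravel()
print("Expected error cost on the test set: $%d" % (fp * cost_fp + fn * cost_fn))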
Solution 1:
Solution 2: Develop learning programs for managers. Then use analytics to gauge their performance and measure progress. Some advice:
This problem is about people decisions. When modeling the data, we should not use this predictive metric as the sole decision maker. But we can use it to arm people with much better, more relevant information for decision making.
We would have to conduct more experiments or collect more data about the employees in order to come up with a more accurate finding. I would recommend gathering more variables from the database that could have more impact on determining employee turnover and satisfaction, such as distance from home, gender, and age.
Reverse Engineer the Problem
After trying to understand what caused employees to leave in the first place, we can form another problem to solve by asking ourselves new questions, like the ones raised in the comments below.
Reddit Commenter (DSPublic): I worked in HR for a couple of years, and here are a few questions I have: People that have HIGH salary and have not been promoted, did they leave? If so, could it be a signal that we're not developing people or providing enough opportunities?
How would you define a 'high performer' without using their last evaluation rating? Evaluations tend to be inconsistently applied across departments and highly dependent on your relationship with the person doing that evaluation. Can we do an Evaluation Vs. Departments (see if there are actual differences)? Once defined, did these high performers leave? If so, why? Are we not providing opportunities or recognizing these high performers? Is it a lack of salary?
To add some additional context, a 24% turnover rate is high in general, but do we know what industry this is from? If the industry norm is 50%, this company is doing great! I see you've done turnover by department, which is great. If only we had more info to classify these turnovers.
We have voluntary and involuntary turnovers as well. Also, who are these employees? Is it part-timers or contract workers that turn over? We don't worry about those; they're supposed to go. I'd like to see turnover vs. years of service. In real life, we found a cluster / turning point where people 'turn sour' after about 5 years at the company. Can we see satisfaction vs. years at company?
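A quick sketch (my addition) toward that last question, satisfaction vs. years at the company, split by turnover:
# Satisfaction vs. years at the company, to look for a "turning sour" point
fig = plt.figure(figsize=(15, 4))
sns.boxplot(x="yearsAtCompany", y="satisfaction", hue="turnover", data=df)
plt.title('Employee Satisfaction vs. Years at Company')
plt.show()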
To Do's:
Define "high performers". It's ambiguous and is normally determined by relationships. Could be inconsistent. To verify, do a Evaluation V.S. Department.
Create an Expected Value Model with costs and benefits. Understand the cost of targeting and the cost of an employee leaving, known as a Cost Matrix.
Create a Tableau dashboard to highlight relevant/important information.
Edit 1: Added Hypothesis testing for employee turnover satisfaction and entire employee satisfaction 8/29/2017
Edit 2: Added Turnover VS Satisfaction graph 9/14/2017
Edit 3: Added pd.get_dummies for 'salary' and 'department' features. This increased the AUC score by 2% (76%-78%) 9/23/2017
Edit 4: Added Random Forest Model and updated the ROC Curve. Added Base Rate Model explanation. Added AdaBoost Model. Added Decision Tree Model 9/27/2017
Edit 5: Added decision tree classifier feature importance. Added visualization for decision tree. 9/30/2017
Edit 6: Added more information about precision/recall and class imbalance solutions. Updated potential solution section and included a new section: evaluating model. 10/1/2017
If anybody would like to discuss any other projects or just have a chat about data science topics, I'll be more than happy to connect with you on LinkedIn: https://www.linkedin.com/in/randylaosat/
This notebook will always be a work in progress. Please leave any comments about further improvements to the notebook! Any feedback or constructive criticism is greatly appreciated. Thank you guys!