Create a multivariate prediction model and perform data analyis of a dataset you choose. This can be either a linear regression, or a classification. This includes:
Data came from the UCI Machine Learning Repository. The data contains records of absenteeism at work for a courier company in Brazil.
Citation: Martiniano, A., Ferreira, R. P., Sassi, R. J., & Affonso, C. (2012). Application of a neuro fuzzy network in prediction of absenteeism at work. In Information Systems and Technologies (CISTI), 7th Iberian Conference on (pp. 1-4). IEEE.
# Load libraries
library(lubridate)
library(tidyverse)
library(modeest)
library(forcats)
# set.seed for reproducible results
set.seed(1222)
#import data
url <-"https://archive.ics.uci.edu/ml/machine-learning-databases/00445/Absenteeism_at_work_AAA.zip"
temp <- tempfile()
temp2 <- tempfile()
download.file(url, temp)
unzip(zipfile = temp, exdir = temp2)
df <- read_csv2(file.path(temp2, "Absenteeism_at_work.csv"))
names(df) <- df %>% names() %>% str_replace_all(' |/', '_')
df %>% write_csv("Absenteeism_at_work.csv")
#review data
df %>% glimpse()
Create a new data frame(s) with appropriate data types and data cleaning for the data.
#change data types and column names
df <- df %>% mutate(PK = row_number(),
Employee_ID = ID,
Reason = as.factor(Reason_for_absence),
Month = as.factor(Month_of_absence),
Day_of_week = as.character(Day_of_the_week),
Day_of_week = str_replace(Day_of_week, "2", "Monday"),
Day_of_week = str_replace(Day_of_week, "3", "Tuesday"),
Day_of_week = str_replace(Day_of_week, "4", "Wednesday"),
Day_of_week = str_replace(Day_of_week, "5", "Thursday"),
Day_of_week = str_replace(Day_of_week, "6", "Friday"),
Day_of_week = as.factor(Day_of_week),
Day_of_week = fct_relevel(Day_of_week, "Monday", "Tuesday", "Wednesday","Thursday", "Friday"),
Seasons = as.character(Seasons),
Seasons = str_replace(Seasons, "1", "Summer"),
Seasons = str_replace(Seasons, "2", "Autumn"),
Seasons = str_replace(Seasons, "3", "Winter"),
Seasons = str_replace(Seasons, "4", "Spring"),
Seasons = as.factor(Seasons),
Seasons = fct_relevel(Seasons, "Summer", "Autumn", "Winter", "Spring"),
Commute_distance = Distance_from_Residence_to_Work,
Disciplinary_failure = as.logical(Disciplinary_failure),
Education = as.character(Education),
Education = str_replace(Education, "1", "High_school"),
Education = str_replace(Education, "2", "Undergraduate"),
Education = str_replace(Education, "3", "Graduate"),
Education = str_replace(Education, "4", "Doctorate"),
Education = as.ordered(Education),
Education = fct_relevel(Education, "High_school", "Undergraduate", "Graduate", "Doctorate"),
Num_children = Son,
Social_drinker = as.logical(Social_drinker),
Social_smoker = as.logical(Social_smoker),
Num_pets = Pet,
BMI = Body_mass_index,
Absenteeism_hours = Absenteeism_time_in_hours)
df <- df %>% select(PK, Employee_ID, Reason, Month, Day_of_week, Seasons, Age, Education, Weight, Height, BMI,
Transportation_expense, Commute_distance, Service_time, Work_load_Average_day, Hit_target,
Disciplinary_failure, Num_children, Num_pets, Social_drinker, Social_smoker, Absenteeism_hours)
#look for basic trends
y_var <- "Absenteeism_hours"
gg_scatter <- function(data, x_col, y_var, color) {
if(is.numeric(data[[x_col]])){
plt <- data %>% ggplot(mapping=aes_string(x_col, y_var))+
geom_jitter(alpha=0.5)+
geom_smooth(method="lm", se=FALSE)+
labs(title=str_c("Absenteeism Hours: ", x_col, " by ", y_var), subtitle="Points jittered and alpha blended")
plt %>% print()
}
}
y_col <- c("Absenteeism_hours|PK")
x_cols <- df %>% names()
x_cols <- x_cols[!str_detect(x_cols, y_col)]
x_cols %>% walk(gg_scatter, data=df, y_var=y_var)
#create new data frame for variables with trends
df <- df %>% select(PK, Education, Reason, Day_of_week, Seasons, BMI, Transportation_expense, Service_time, Commute_distance,
Age, Hit_target, Disciplinary_failure, Num_children, Num_pets, Social_drinker, Social_smoker,
Absenteeism_hours)
#group BMI values
normal <- df %>% filter(BMI >= 19, BMI <= 24) %>%
select(PK, BMI) %>% rename(normal=BMI)
overweight <- df %>% filter(BMI >= 25, BMI <= 29) %>%
select(PK, BMI) %>% rename(overweight=BMI)
obese <- df %>% filter(BMI >= 30, BMI <= 39) %>%
select(PK, BMI) %>% rename(obese=BMI)
df <- df %>%
left_join(normal, by="PK") %>%
left_join(overweight, by="PK") %>%
left_join(obese, by="PK")
df <- df %>%
gather(key=BMI_status, value=bmi, normal:obese) %>%
drop_na()
df <- df %>%
mutate(BMI_status = as.ordered(BMI_status) %>%
fct_relevel(c("normal", "overweight", "obese"))) %>%
select(-BMI, -bmi)
#view central tendency statistics
mean <- mean(df$Absenteeism_hours)
cat(str_c(y_var, " mean: ", mean), "\n")
median <- median(df$Absenteeism_hours)
cat(str_c(y_var, " median: ", median), "\n")
mode <- mfv(df$Absenteeism_hours)
cat(str_c(y_var, " mode: ", mode), "\n")
cat(str_c(y_var, " variance: ", var(df$Absenteeism_hours)), "\n")
cat(str_c(y_var, " std. deviation: ", sd(df$Absenteeism_hours)), "\n")
cat(str_c(y_var, " std. error: ", sd(df$Absenteeism_hours) /
sqrt(length(df$Absenteeism_hours))), "\n")
#look for more patterns/trends
gg_facet <- function(data, facet) {
if(!is.numeric(data[[facet]])){
plt <- data %>% ggplot(mapping=aes_string("Absenteeism_hours"))+
geom_density(fill="blue")+
facet_wrap(paste("~", facet))+
labs(title="Density of Absenteeism hours",
subtitle=str_c("facet by ", facet))
plt %>% print()
}
}
facets <- df %>% names()
facets <- facets[!str_detect(facets, y_col)]
for(facet in facets){
gg_facet(data=df, facet)
}
#Would like to understand the correlation between age, BMI and absenteeism hours
df %>% ggplot(aes(Age, Absenteeism_hours, color=BMI_status))+
geom_jitter(alpha=0.5,size=0.2)+
geom_smooth(method="lm", se=FALSE)+
ylim(0,50)+
labs(title="Absenteeism hours by Age and BMI status", subtitle="points jittered, alpha blended, and size reduced")
For normal BMI status, absenteeism hours increase as age increases. For overweight and obese status, absenteeism hours tend to decrease as age increases. The different BMI status lines intersect at around 37 years old, meaning if someone is older than 37 with a normal BMI, they have an increased risk for greater absenteeism hours, compared to being younger or overweight/obese.
#Would like to know which day of the week and seasons have the highest total absenteeism hours
df%>%ggplot(aes(x=Day_of_week,y=Absenteeism_hours,fill=Seasons))+
geom_col(width=0.7)+
labs(title="Sum of Abseentism Hours by Day_of_week and Seasons")
df%>%ggplot(aes(x=Seasons,y=Absenteeism_hours,fill=Seasons))+
geom_col(width=0.7)+
labs(title="Sum of Abseentism Hours by Seasons")
Mondays have the highest total absenteeism hours while Thursdays have the lowest. Winter has the highest absenteeism hours, at almost 1500 hours. Autumn has the lowest around 1150.
#Would like to know correlation between Absenteeism hours and Education level
df %>% ggplot(aes(x=Education, y=Absenteeism_hours))+
geom_boxplot()+
ylim(0,50)+
geom_jitter(alpha=0.5, color="orange", size=0.3)+
geom_hline(yintercept=mean, color="red")+
geom_hline(yintercept=median, color="blue",
linetype="dashed")+
labs(title="Absenteeism hours by Education", subtitle="Showing mean(red), median(blue)")
People with lower degrees do not have the higher average absenteeism hours. There does not appear to be a trend that as Education increases, absenteeism hours decrease.
Further hypothesis testing will be done to understand whether mean absenteeism hours between different education levels are statistically significant.
df_High_school <- df %>% filter(Education == "High_school")
df_High_school%>%glimpse()
df_undergrad<- df %>% filter(Education == "Undergraduate")
df_undergrad%>%glimpse
df_grad <- df %>% filter(Education == "Graduate")
df_grad%>%glimpse
cat(str_c("High_school: mean = ", df_High_school$Absenteeism_hours %>% mean() %>% round(1)))
cat(str_c("Undergrad: mean = ", df_undergrad$Absenteeism_hours %>% mean() %>% round(1)))
cat(str_c("Graduate: mean = ", df_grad$Absenteeism_hours %>% mean() %>% round(1)))
t.test(df_undergrad$Absenteeism_hours, df_grad$Absenteeism_hours,conf.level=0.95,alternative="two.sided")
As p-value of the t-test is greater than 0.05, we cannot reject the null hypothesis. Which means, according to these samples of the whole population, there is no statistically significant difference of average absenteeism hours between people with undergrad and graduate degrees.¶
t.test(df_High_school$Absenteeism_hours, df_grad$Absenteeism_hours,conf.level=0.95,alternative="greater")
As p-value of the t-test is smaller than 0.05, we reject the null hypothesis. Which means, according to these samples of the whole population, the average absenteeism hours of those with high school degrees is higher than people with graduate degrees¶
#Loop plot the density of features
plot.hists <- function(col, df, bins = 20){
p1 <- ggplot(df, aes_string(col)) +
geom_histogram(aes(y = ..density..), bins = bins,
alpha = 0.3, color = 'blue') +
geom_density(size = 1) +
labs(title=str_c("Histogram and density function \n for ", col),
x=str_c("value of ", col))
p1 %>% print()
}
cols <- c('Absenteeism_hours', 'Age', 'Commute_distance','Num_children', 'Num_pets','Hit_target', 'Transportation_expense',
'Service_time')
cols %>% walk(plot.hists, df)
#Update the existing data frame
df_mutate <- df%>%
mutate(Absenteeism_hours.log = log(Absenteeism_hours),
Absenteeism_hours.sqr = sqrt(Absenteeism_hours),
Age.log=log(Age),
Age.sqr=sqrt(Age),
Hit_target.log=log(Hit_target),
Hit_target.sqr=sqrt(Hit_target),
Num_children.log=log(Num_children),
Num_children.sqr=sqrt(Num_children),
Num_pets.log=log(Num_pets),
Num_pets.sqr=sqrt(Num_pets),
Transportation_expense.log=log(Transportation_expense),
Transportation_expense.sqr=sqrt(Transportation_expense))
df_mutate %>% glimpse() %>% summary()
cols <- c("Absenteeism_hours.log", "Absenteeism_hours.sqr", "Age.log", "Age.sqr", "Num_children.log", "Num_children.sqr",
"Num_pets.log", "Num_pets.sqr", "Hit_target.log","Hit_target.sqr", "Transportation_expense.log",
"Transportation_expense.sqr")
cols %>% walk(plot.hists, df=df_mutate)
Compared with the above groups of charts, Absenteeism_hours.sqr is more symmetric than Absenteeism_hours, so we will use Absenteeism_hours.sqr as our label. Others variables do not have significant changes after mutatation so we will try modeling with the different variations.¶
# normalize function
normalize <- function(x) (x - mean(x))/sd(x)
# normalized dataframe
df_mutate_norm <- df_mutate %>%
mutate(Age = normalize(Age),
Age.log = normalize(Age.log),
Num_children = normalize(Num_children),
Num_pets = normalize(Num_pets),
Num_pets.sqr = normalize(Num_pets.sqr),
Hit_target = normalize(Hit_target),
Hit_target.log = normalize(Hit_target.log),
Hit_target.sqr = normalize(Hit_target.sqr),
Commute_distance = normalize(Commute_distance),
Transportation_expense=normalize(Transportation_expense),
Transportation_expense.log = normalize(Transportation_expense.log),
Transportation_expense.sqr = normalize(Transportation_expense.sqr),
Service_time=normalize(Service_time)) %>%
select(-PK, -Age.sqr, -Num_pets.log, -Num_children.log, -Num_children.sqr, -Absenteeism_hours.log)
df_mutate_norm %>% glimpse()
set.seed(2345)
df_train <- df_mutate_norm %>% sample_frac(0.8)
df_test <- df_mutate_norm %>% setdiff(df_train)
df_train %>% glimpse()
df_test %>% glimpse()
#plot features
plot.feature = function(col, df){
p1 = ggplot(df, aes_string(x = col, y = 'Absenteeism_hours.sqr')) +
geom_point() +
geom_smooth(size = 1, color = 'red', method="lm") +
labs(title=str_c("Relationship between ", col, " and sqrt Absenteesim_hours"),
x=col, y="Absenteesim_hours.sqr")
p1 %>% print()
}
cols = c('Age', 'Age.log', 'Commute_distance','Num_children','Num_pets', 'Num_pets.sqr', "Hit_target.log","Hit_target.sqr",
'Hit_target', 'Transportation_expense', 'Transportation_expense.log', 'Transportation_expense.sqr', 'Service_time')
cols %>% walk(plot.feature, df_train)
model_1<- lm(Absenteeism_hours.sqr ~ Commute_distance+Num_children+Transportation_expense+Transportation_expense.log+
Transportation_expense.sqr, data=df_train)
model_1%>%summary
model_2<- lm(Absenteeism_hours.sqr ~ Num_children, data=df_train)
model_2%>%summary
#predict function
predict_score <- function(mod, data){
data %>%
mutate(score = predict(mod, newdata=data),
resids=Absenteeism_hours.sqr - score,
predicted.Absenteeism = (score*score))
}
#plot residuals function
resids_plot <- function(data){
hd <- data %>% ggplot(aes(resids, ..density..))+
geom_histogram(bins=20, alpha=0.3)+
geom_density()+
labs(title="Histogram and density plot for residuals", x="Residual value", subtitle="using test set")
qq <- data %>% ggplot(aes(sample=resids))+
geom_qq()+
labs(title="Quantile-quantile Normal plot of residuals", subtitle="using test set")
scat <- data %>% ggplot(aes(score, resids))+
geom_point()+
geom_smooth(method="loess")+
labs(title="Residuals vs. fitted values", x="Fitted values", y="Residuals", subtitle="using test set")
hd %>% print()
qq %>% print()
scat %>% print()
}
#model 1 performance
df_test_predict1 <- predict_score(model_1, df_test)
df_test_predict1 %>% select(Absenteeism_hours, score, resids, predicted.Absenteeism) %>% return()
df_test_predict1 %>% resids_plot()
model_1$coef
#model 2 performance
df_test_predict2 <- predict_score(model_2, df_test)
df_test_predict2 %>% select(Absenteeism_hours, score, resids, predicted.Absenteeism) %>% return()
df_test_predict2 %>% resids_plot()
model_2$coef
Just doing a basic visual analysis to determine potential features with Absenteeism_hours, it appears that there are trends with Education, BMI, Commute_distance, Age, Hit_target, Num_children, Num_pets, Transportation_expense, Service_time, Social_drinker, and Social_smoker.
Chart 1 does a slightly deeper dive and compares Age and BMI_status to Absenteeism_hours. The data indicates that as someone of a normal BMI_status ages, they are likely to miss more work. Conversely, those that are overweight or obese, tend to miss less work as they age.
The second group of charts shows the relationship that Day_of_week and Seasons has on Absenteeism_hours. Mondays have the highest total absenteesim while Thursdays have the lowest. Companies should get prepared to have the most absences on Mondays, and maybe arrange more team work on Thursday. Winter has the highest absenteesim hours, at almost 1500 hours. Autumn has the lowest around 1150.
Chart 3 is interesting in that those with high school(1) and postgraduate(3) educations appear to miss the least amount of work at a time, but those with an undergraduate degree(2) have the highest mean and appear to miss the most amount of work at a time. However, those with a high school education have more single instances of missing work.
Our visual charts led us to perform t-tests with Education since the results seemed so weird. The results of these tests indicated that there is not a significant difference between those with undergrad and graduate degrees. Alternatively, the average absenteeism hours of people with high school degrees is higher than people with graduate degrees.
Model 1 shows that Absenteeism_hours.sqr is modeled by Commute_distance, Num_children, Transportation_expense, Transportation_expense.log and Transportation_expense.sqr. The R-squared value is very low, however, which suggests a lot of noise in the model. The residuals plot is a fairly normal distribution with a skew to the left. The quantile quantile plot is fairly straight, except at the higher end. The fact that these plots are skewed shows that there are likely to be some incorrect predictions with our model. Our residuals compared to fitted values are in a fairly straight line and close to 0 except at the higher end. Despite this, the model is considered statistically significant.
Model 2 shows that Absenteeism_hours.sqr is modeled by Num_Children. The R-squared value is also very low, however the Num_children is slightly more significant than the features in Model 1. The residuals plot and quantile quantile plot also share the same behavior in Model 1. The residuals compared to fitted values has a lot of variation at the lower end, suggesting that our first model might be slightly better. Regardless, this model is still considered statistically significant.
In summary, the factors that negatively impact absenteeism at work are "normal" BMI status around age 37 and higher, the day of the week being Monday and season being Winter, having an undergraduate degree, a shorter commute distance, a higher transportation expense, and having more children.