Work Absenteeism

Catherine Williams et al.

March 29, 2019

Class Group Project

Project Description

Create a multivariate prediction model and perform data analysis of a dataset you choose. The model can be either a linear regression or a classification. The project includes:

  • Create 2 charts exploring the data with respect to the prediction variable (label)
  • Create a hypothesis and perform a t.test to reject or fail to reject that hypothesis
  • Split the data into training set and testing set for each model and wrangle the data as desired
  • Create 2 prediction models of your chosen type (regression | classification), with at least one multivariate model including visualizing the results of each model
  • Compare the performance of your models
  • Include a written analysis of your predictions, using data to support your conclusions.


Data came from the UCI Machine Learning Repository. The data contains records of absenteeism at work for a courier company in Brazil.

Citation: Martiniano, A., Ferreira, R. P., Sassi, R. J., & Affonso, C. (2012). Application of a neuro fuzzy network in prediction of absenteeism at work. In Information Systems and Technologies (CISTI), 7th Iberian Conference on (pp. 1-4). IEEE.

In [1]:


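# Load libraries
library(tidyverse)  # readr, dplyr, tidyr, ggplot2, stringr, forcats, purrr
library(modeest)    # mfv(), used below for the mode

# set.seed for reproducible results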

Load and explore data structure

In [2]:
#import data
url <-""

temp <- tempfile()
temp2 <- tempfile()

download.file(url, temp)
unzip(zipfile = temp, exdir = temp2)
df <- read_csv2(file.path(temp2, "Absenteeism_at_work.csv"))

names(df) <- df %>% names() %>% str_replace_all(' |/', '_')
df %>% write_csv("Absenteeism_at_work.csv")

#review data
df %>% glimpse()
Observations: 740
Variables: 21
$ ID                              <int> 11, 36, 3, 7, 11, 3, 10, 20, 14, 1,...
$ Reason_for_absence              <int> 26, 0, 23, 7, 23, 23, 22, 23, 19, 2...
$ Month_of_absence                <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,...
$ Day_of_the_week                 <int> 3, 3, 4, 5, 5, 6, 6, 6, 2, 2, 2, 3,...
$ Seasons                         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ Transportation_expense          <int> 289, 118, 179, 279, 289, 179, 361, ...
$ Distance_from_Residence_to_Work <int> 36, 13, 51, 5, 36, 51, 52, 50, 12, ...
$ Service_time                    <int> 13, 18, 18, 14, 13, 18, 3, 11, 14, ...
$ Age                             <int> 33, 50, 38, 39, 33, 38, 28, 36, 34,...
$ Work_load_Average_day           <dbl> 239554, 239554, 239554, 239554, 239...
$ Hit_target                      <int> 97, 97, 97, 97, 97, 97, 97, 97, 97,...
$ Disciplinary_failure            <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ Education                       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1,...
$ Son                             <int> 2, 1, 0, 2, 2, 0, 1, 4, 2, 1, 4, 4,...
$ Social_drinker                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,...
$ Social_smoker                   <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
$ Pet                             <int> 1, 0, 0, 0, 1, 0, 4, 0, 0, 1, 0, 0,...
$ Weight                          <int> 90, 98, 89, 68, 90, 89, 80, 65, 95,...
$ Height                          <int> 172, 178, 170, 168, 172, 170, 172, ...
$ Body_mass_index                 <int> 30, 31, 31, 24, 30, 31, 27, 23, 25,...
$ Absenteeism_time_in_hours       <int> 4, 0, 2, 4, 2, 2, 8, 4, 40, 8, 8, 8...

Data processing

Create a new data frame(s) with appropriate data types and data cleaning for the data.

In [3]:
#change data types and column names
df <- df %>% mutate(PK = row_number(),
                      Employee_ID = ID,
                      Reason = as.factor(Reason_for_absence),
                      Month = as.factor(Month_of_absence),
                      Day_of_week = as.character(Day_of_the_week),
                      Day_of_week = str_replace(Day_of_week, "2", "Monday"),
                      Day_of_week = str_replace(Day_of_week, "3", "Tuesday"),
                      Day_of_week = str_replace(Day_of_week, "4", "Wednesday"),
                      Day_of_week = str_replace(Day_of_week, "5", "Thursday"),
                      Day_of_week = str_replace(Day_of_week, "6", "Friday"),
                      Day_of_week = as.factor(Day_of_week),
                      Day_of_week = fct_relevel(Day_of_week, "Monday", "Tuesday", "Wednesday","Thursday", "Friday"),
                      Seasons = as.character(Seasons),
                      Seasons = str_replace(Seasons, "1", "Summer"),
                      Seasons = str_replace(Seasons, "2", "Autumn"),
                      Seasons = str_replace(Seasons, "3", "Winter"),
                      Seasons = str_replace(Seasons, "4", "Spring"),
                      Seasons = as.factor(Seasons),
                      Seasons = fct_relevel(Seasons, "Summer", "Autumn", "Winter", "Spring"),
                      Commute_distance = Distance_from_Residence_to_Work,
                      Disciplinary_failure = as.logical(Disciplinary_failure),
                      Education = as.character(Education),
                      Education = str_replace(Education, "1", "High_school"),
                      Education = str_replace(Education, "2", "Undergraduate"),
                      Education = str_replace(Education, "3", "Graduate"),
                      Education = str_replace(Education, "4", "Doctorate"),
                      Education = as.ordered(Education),
                      Education = fct_relevel(Education, "High_school", "Undergraduate", "Graduate", "Doctorate"),
                      Num_children = Son,
                      Social_drinker = as.logical(Social_drinker),
                      Social_smoker = as.logical(Social_smoker),
                      Num_pets = Pet,
                      BMI = Body_mass_index,
                      Absenteeism_hours = Absenteeism_time_in_hours)

df <- df %>% select(PK, Employee_ID, Reason, Month, Day_of_week, Seasons, Age, Education, Weight, Height, BMI, 
                    Transportation_expense, Commute_distance, Service_time, Work_load_Average_day, Hit_target, 
                    Disciplinary_failure, Num_children, Num_pets, Social_drinker, Social_smoker, Absenteeism_hours)

#look for basic trends
y_var <- "Absenteeism_hours"

gg_scatter <- function(data, x_col, y_var, color) {
  plt <- data %>% ggplot(mapping=aes_string(x_col, y_var))+
    geom_jitter(alpha=0.5)+  # alpha value assumed; the subtitle only says the points are jittered and blended
    geom_smooth(method="lm", se=FALSE)+
    labs(title=str_c("Absenteeism Hours: ", x_col, " by ", y_var), subtitle="Points jittered and alpha blended")
  plt %>% print()
}

y_col <- c("Absenteeism_hours|PK")
x_cols <- df %>% names()
x_cols <- x_cols[!str_detect(x_cols, y_col)]

x_cols %>% walk(gg_scatter, data=df, y_var=y_var)

#create new data frame for variables with trends
df <- df %>% select(PK, Education, Reason, Day_of_week, Seasons, BMI, Transportation_expense, Service_time, Commute_distance, 
                    Age, Hit_target, Disciplinary_failure, Num_children, Num_pets, Social_drinker, Social_smoker, 
                    Absenteeism_hours)

#group BMI values
normal <- df %>% filter(BMI >= 19, BMI <= 24) %>% 
  select(PK, BMI) %>% rename(normal=BMI)
overweight <- df %>% filter(BMI >= 25, BMI <= 29) %>% 
  select(PK, BMI) %>% rename(overweight=BMI)
obese <- df %>% filter(BMI >= 30, BMI <= 39) %>% 
  select(PK, BMI) %>% rename(obese=BMI)

df <- df %>%
  left_join(normal, by="PK") %>%
  left_join(overweight, by="PK") %>%
  left_join(obese, by="PK")

df <- df %>% 
  gather(key=BMI_status, value=bmi, normal:obese) %>% 
  filter(!is.na(bmi))  # keep only the row for each record's own BMI group

df <- df %>% 
  mutate(BMI_status = as.ordered(BMI_status) %>% 
           fct_relevel(c("normal", "overweight", "obese"))) %>% 
  select(-BMI, -bmi)

#view central tendency statistics
mean <- mean(df$Absenteeism_hours)
cat(str_c(y_var, " mean: ", mean), "\n")
median <- median(df$Absenteeism_hours)
cat(str_c(y_var, " median: ", median), "\n")
mode <- mfv(df$Absenteeism_hours)
cat(str_c(y_var, " mode: ", mode), "\n")
cat(str_c(y_var, " variance: ", var(df$Absenteeism_hours)), "\n")
cat(str_c(y_var, " std. deviation: ", sd(df$Absenteeism_hours)), "\n")
cat(str_c(y_var, " std. error: ", sd(df$Absenteeism_hours) / 
            sqrt(length(df$Absenteeism_hours))), "\n")

#look for more patterns/trends
gg_facet <- function(data, facet) {
  plt <- data %>% ggplot(mapping=aes_string("Absenteeism_hours"))+
    geom_density()+
    facet_wrap(paste("~", facet))+
    labs(title="Density of Absenteeism hours", 
         subtitle=str_c("facet by ", facet))
  plt %>% print()
}

facets <- df %>% names()
facets <- facets[!str_detect(facets, y_col)]

for(facet in facets){
  gg_facet(data=df, facet)
}
Absenteeism_hours mean: 6.92432432432432 
Absenteeism_hours median: 3 
Absenteeism_hours mode: 8 
Absenteeism_hours variance: 177.715510368284 
Absenteeism_hours std. deviation: 13.3309981009782 
Absenteeism_hours std. error: 0.490057236547198 

Explore prediction variable

Chart 1

In [4]:
#Would like to understand the correlation between age, BMI and absenteeism hours 
df %>% ggplot(aes(Age, Absenteeism_hours, color=BMI_status))+
  geom_jitter(alpha=0.5, size=0.3)+  # values assumed, chosen to match the subtitle
  geom_smooth(method="lm", se=FALSE)+
  labs(title="Absenteeism hours by Age and BMI status", subtitle="points jittered, alpha blended, and size reduced")

Analysis of Chart 1

For employees with a normal BMI status, absenteeism hours increase with age. For overweight and obese employees, absenteeism hours tend to decrease with age. The fitted lines for the three BMI statuses intersect at around 37 years old, so in this data an employee older than 37 with a normal BMI is expected to have more absenteeism hours than a younger employee or an overweight or obese employee of the same age.

Chart 2

In [5]:
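#Would like to know which day of the week and seasons have the highest total absenteeism hours 
#(geom_col is assumed here; it stacks rows, so each bar height is the summed hours)
df %>% ggplot(aes(Day_of_week, Absenteeism_hours, fill=Seasons))+ geom_col()+
  labs(title="Sum of Absenteeism Hours by Day_of_week and Seasons")

df %>% ggplot(aes(Seasons, Absenteeism_hours))+ geom_col()+
  labs(title="Sum of Absenteeism Hours by Seasons")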

Analysis of Chart 2

Mondays have the highest total absenteeism hours, while Thursdays have the lowest. Winter has the highest total, at almost 1,500 hours; Autumn has the lowest, at around 1,150 hours.

Chart 3

In [6]:
#Would like to know correlation between Absenteeism hours and Education level
df %>% ggplot(aes(x=Education, y=Absenteeism_hours))+
  geom_jitter(alpha=0.5, color="orange", size=0.3)+
  geom_hline(yintercept=mean, color="red")+
  geom_hline(yintercept=median, color="blue")+
  labs(title="Absenteeism hours by Education", subtitle="Showing mean(red), median(blue)")

Analysis of Chart 3

People with less education do not have the highest average absenteeism hours, and absenteeism hours do not appear to decrease as Education increases.

Hypothesis testing follows to determine whether the differences in mean absenteeism hours between education levels are statistically significant.

Hypothesis Testing

Data frame preparation to do T-test among Education Levels

In [7]:
df_High_school <- df %>% filter(Education == "High_school")
df_undergrad<- df %>% filter(Education == "Undergraduate")
df_grad <- df %>% filter(Education == "Graduate")

cat(str_c("High_school: mean = ", df_High_school$Absenteeism_hours %>% mean() %>% round(1)))

cat(str_c("Undergrad: mean = ", df_undergrad$Absenteeism_hours %>% mean() %>% round(1))) 

cat(str_c("Graduate: mean = ", df_grad$Absenteeism_hours %>% mean() %>% round(1)))
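
The group means above set up a two-sample comparison. As a minimal sketch of how such a test can be run, the block below applies base R's t.test() to two made-up vectors; the values and the names high_school, graduate, and reject_null are hypothetical and are not taken from the dataset.

```r
# Hypothetical absenteeism hours for two education groups (made-up values,
# not taken from the dataset).
high_school <- c(4, 8, 2, 8, 40, 3, 8, 16, 2, 8)
graduate    <- c(2, 3, 8, 1, 4, 2, 8, 3, 2, 1)

# H0: the two group means are equal.
# t.test() defaults to Welch's test, which does not assume equal variances.
result <- t.test(high_school, graduate, alternative = "two.sided", conf.level = 0.95)

cat("t =", round(result$statistic, 3), " p-value =", round(result$p.value, 3), "\n")

# Reject H0 at the 5% level only when the p-value falls below 0.05.
reject_null <- result$p.value < 0.05
cat(if (reject_null) "Reject H0\n" else "Fail to reject H0\n")
```

With the real data, the two vectors would be the Absenteeism_hours columns of two of the filtered data frames.
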
Compared with the above groups of charts, Absenteeism_hours.sqr (the square root of Absenteeism_hours) is more symmetric than Absenteeism_hours, so we will use Absenteeism_hours.sqr as our label. The other variables do not change significantly after transformation, so we will try modeling with the different variations.

In [12]:
# normalize function (z-score: subtract the mean, divide by the standard deviation)
normalize <- function(x) (x - mean(x))/sd(x)

# normalized dataframe
df_mutate_norm <- df_mutate %>% 
  mutate(Age = normalize(Age),
         Age.log = normalize(Age.log), 
         Num_children = normalize(Num_children),
         Num_pets = normalize(Num_pets),
         Num_pets.sqr = normalize(Num_pets.sqr), 
         Hit_target = normalize(Hit_target),
         Hit_target.log = normalize(Hit_target.log),
         Hit_target.sqr = normalize(Hit_target.sqr),
         Commute_distance = normalize(Commute_distance),
         Transportation_expense = normalize(Transportation_expense),
         Transportation_expense.log = normalize(Transportation_expense.log), 
         Transportation_expense.sqr = normalize(Transportation_expense.sqr),
         Service_time = normalize(Service_time)) %>%  
  select(-PK, -Age.sqr, -Num_pets.log, -Num_children.log, -Num_children.sqr, -Absenteeism_hours.log)

df_mutate_norm %>% glimpse()
Observations: 740
Variables: 23
$ Education                  <ord> High_school, High_school, High_school, H...
$ Reason                     <fct> 7, 23, 1, 1, 11, 14, 28, 28, 11, 23, 23,...
$ Day_of_week                <fct> Thursday, Friday, Monday, Tuesday, Wedne...
$ Seasons                    <fct> Summer, Summer, Summer, Summer, Summer, ...
$ Transportation_expense     <dbl> 0.86136453, 0.57758008, 0.57758008, 0.57...
$ Service_time               <dbl> 0.3297577, -0.3544125, -0.3544125, -0.35...
$ Commute_distance           <dbl> -1.6601356, 1.3728658, 1.3728658, 1.3728...
$ Age                        <dbl> 0.3935931, -0.0694576, -0.0694576, -0.06...
$ Hit_target                 <dbl> 0.6382541, 0.6382541, 0.6382541, 0.63825...
$ Disciplinary_failure       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
$ Num_children               <dbl> 0.89311870, 2.71380144, 2.71380144, 2.71...
$ Num_pets                   <dbl> -0.5658572, -0.5658572, -0.5658572, -0.5...
$ Social_drinker             <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE...
$ Social_smoker              <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,...
$ Absenteeism_hours          <int> 4, 4, 8, 8, 8, 8, 4, 4, 4, 4, 2, 8, 0, 0...
$ BMI_status                 <ord> normal, normal, normal, normal, normal, ...
$ Absenteeism_hours.sqr      <dbl> 2.000000, 2.000000, 2.828427, 2.828427, ...
$ Age.log                    <dbl> 0.47823017, 0.01589293, 0.01589293, 0.01...
$ Hit_target.log             <dbl> 0.6326562, 0.6326562, 0.6326562, 0.63265...
$ Hit_target.sqr             <dbl> 0.6356619, 0.6356619, 0.6356619, 0.63566...
$ Num_pets.sqr               <dbl> -0.7205817, -0.7205817, -0.7205817, -0.7...
$ Transportation_expense.log <dbl> 0.8913514, 0.6659129, 0.6659129, 0.66591...
$ Transportation_expense.sqr <dbl> 0.8854488, 0.6289573, 0.6289573, 0.62895...
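
As a quick check of what normalize() does, the sketch below applies it to the first nine Age values shown in the earlier glimpse() output and confirms that the result has mean 0 and standard deviation 1. The variable names age and z are introduced only for this illustration.

```r
# z-score: subtract the mean, divide by the standard deviation
normalize <- function(x) (x - mean(x)) / sd(x)

# First nine Age values from the glimpse() output earlier in the notebook
age <- c(33, 50, 38, 39, 33, 38, 28, 36, 34)
z <- normalize(age)

# After scaling, the mean is 0 and the standard deviation is 1
# (up to floating-point rounding).
cat("mean:", round(mean(z), 10), " sd:", round(sd(z), 10), "\n")
```

Here the scaling statistics come from all rows before the split. One alternative, not used in this notebook, is to compute the mean and standard deviation on the training rows only and reuse them for the test rows.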

Split the training and testing set

In [13]:

df_train <- df_mutate_norm %>% sample_frac(0.8)

df_test <- df_mutate_norm %>% setdiff(df_train)
df_train %>% glimpse()
df_test %>% glimpse()
Observations: 592
Variables: 23
$ Education                  <ord> High_school, High_school, High_school, U...
$ Reason                     <fct> 19, 0, 23, 21, 28, 7, 18, 23, 27, 0, 1, ...
$ Day_of_week                <fct> Wednesday, Thursday, Wednesday, Thursday...
$ Seasons                    <fct> Winter, Winter, Spring, Spring, Summer, ...
$ Transportation_expense     <dbl> 2.34003089, 0.36847574, -0.63223785, -0....
$ Service_time               <dbl> -0.3544125, 0.7858713, 1.2419848, 1.0139...
$ Commute_distance           <dbl> 1.30546573, -0.31213501, 1.44026580, -0....
$ Age                        <dbl> -0.06945760, 0.70229353, 0.23924285, 0.5...
$ Hit_target                 <dbl> 0.3736558, 0.9028525, -0.6847376, -0.420...
$ Disciplinary_failure       <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,...
$ Num_children               <dbl> 0.89311870, -0.92756405, -0.92756405, 0....
$ Num_pets                   <dbl> 2.4684495, -0.5658572, -0.5658572, -0.56...
$ Social_drinker             <lgl> FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, T...
$ Social_smoker              <lgl> TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, ...
$ Absenteeism_hours          <int> 8, 0, 3, 8, 4, 8, 80, 3, 1, 0, 5, 2, 2, ...
$ BMI_status                 <ord> normal, normal, obese, normal, overweigh...
$ Absenteeism_hours.sqr      <dbl> 2.828427, 0.000000, 1.732051, 2.828427, ...
$ Age.log                    <dbl> 0.01589293, 0.76709696, 0.32819233, 0.62...
$ Hit_target.log             <dbl> 0.3805936, 0.8821334, -0.6546186, -0.391...
$ Hit_target.sqr             <dbl> 0.3772863, 0.8927091, -0.6699252, -0.406...
$ Num_pets.sqr               <dbl> 2.1323803, -0.7205817, -0.7205817, -0.72...
$ Transportation_expense.log <dbl> 1.8620241, 0.4889951, -0.5272680, -0.527...
$ Transportation_expense.sqr <dbl> 2.0992345, 0.4339097, -0.5877083, -0.587...
Observations: 139
Variables: 23
$ Education                  <ord> High_school, High_school, Graduate, High...
$ Reason                     <fct> 23, 28, 23, 19, 23, 13, 28, 11, 28, 23, ...
$ Day_of_week                <fct> Friday, Friday, Thursday, Thursday, Wedn...
$ Seasons                    <fct> Summer, Summer, Spring, Spring, Spring, ...
$ Transportation_expense     <dbl> 0.57758008, 0.57758008, -0.63223785, 0.5...
$ Service_time               <dbl> -0.3544125, -0.3544125, -0.8105260, -0.3...
$ Commute_distance           <dbl> 1.3728658, 1.3728658, -0.2447350, 1.3728...
$ Age                        <dbl> -0.0694576, -0.0694576, -0.9955590, -0.0...
$ Hit_target                 <dbl> 0.6382541, -0.6847376, -0.4201393, -0.42...
$ Disciplinary_failure       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
$ Num_children               <dbl> 2.71380144, 2.71380144, -0.92756405, 2.7...
$ Num_pets                   <dbl> -0.5658572, -0.5658572, -0.5658572, -0.5...
$ Social_drinker             <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, F...
$ Social_smoker              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
$ Absenteeism_hours          <int> 4, 4, 1, 16, 1, 3, 2, 8, 2, 2, 8, 16, 8,...
$ BMI_status                 <ord> normal, normal, normal, normal, normal, ...
$ Absenteeism_hours.sqr      <dbl> 2.000000, 2.000000, 1.000000, 4.000000, ...
$ Age.log                    <dbl> 0.01589293, 0.01589293, -1.03722044, 0.0...
$ Hit_target.log             <dbl> 0.6326562, -0.6546186, -0.3916559, -0.39...
$ Hit_target.sqr             <dbl> 0.6356619, -0.6699252, -0.4060221, -0.40...
$ Num_pets.sqr               <dbl> -0.7205817, -0.7205817, -0.7205817, -0.7...
$ Transportation_expense.log <dbl> 0.6659129, 0.6659129, -0.5272680, 0.6659...
$ Transportation_expense.sqr <dbl> 0.6289573, 0.6289573, -0.5877083, 0.6289...
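
The split above yields 592 training rows but only 139 test rows, although 740 - 592 = 148. A likely explanation is that setdiff() is a set operation: it drops duplicate rows and any row identical to a training row, and rows are no longer unique once PK has been removed. The sketch below shows one alternative, splitting by row position, on a toy data frame with deliberately duplicated rows; the data, the seed, and the names toy, train_idx, toy_train, and toy_test are hypothetical.

```r
set.seed(42)  # arbitrary seed, assumed for this sketch only

# Toy data frame with deliberately duplicated rows (hypothetical values).
toy <- data.frame(x = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
                  y = c(0, 0, 1, 1, 0, 0, 1, 1, 0, 0))

# Sample row positions for training, then take the complement by position,
# so duplicated rows are never merged or dropped.
train_idx <- sample(nrow(toy), size = round(0.8 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

cat("train rows:", nrow(toy_train), " test rows:", nrow(toy_test), "\n")
```

The two parts always add up to the original row count (8 and 2 rows here), regardless of duplicates.
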
In [14]:
#plot features
plot.feature = function(col, df){
  p1 = ggplot(df, aes_string(x = col, y = 'Absenteeism_hours.sqr')) + 
    geom_point() + 
    geom_smooth(size = 1, color = 'red', method="lm") + 
    labs(title=str_c("Relationship between ", col, " and sqrt Absenteeism_hours"),
         x=col, y="Absenteeism_hours.sqr")
  p1 %>% print()
}

cols = c('Age', 'Age.log', 'Commute_distance','Num_children','Num_pets', 'Num_pets.sqr', "Hit_target.log","Hit_target.sqr", 
         'Hit_target', 'Transportation_expense', 'Transportation_expense.log', 'Transportation_expense.sqr', 'Service_time')

cols %>% walk(plot.feature, df_train)

Train the model with training set

Model 1

In [15]:
model_1<- lm(Absenteeism_hours.sqr ~ Commute_distance+Num_children+Transportation_expense+Transportation_expense.log+
             Transportation_expense.sqr, data=df_train) 
summary(model_1)

Call:
lm(formula = Absenteeism_hours.sqr ~ Commute_distance + Num_children + 
    Transportation_expense + Transportation_expense.log + Transportation_expense.sqr, 
    data = df_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1933 -0.7614 -0.3385  0.5429  8.9331 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  2.15833    0.06107  35.341  < 2e-16 ***
Commute_distance            -0.16740    0.07055  -2.373  0.01798 *  
Num_children                 0.23603    0.06774   3.485  0.00053 ***
Transportation_expense      14.69546    6.71927   2.187  0.02913 *  
Transportation_expense.log  14.54706    6.71939   2.165  0.03080 *  
Transportation_expense.sqr -29.01346   13.35915  -2.172  0.03027 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.485 on 586 degrees of freedom
Multiple R-squared:  0.04149,	Adjusted R-squared:  0.03331 
F-statistic: 5.073 on 5 and 586 DF,  p-value: 0.0001459
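
Model 1 includes Transportation_expense together with its log and square root. The large, offsetting coefficients (14.7, 14.5, and -29.0) and their wide standard errors are what one would expect if these three terms are nearly collinear. The sketch below illustrates the idea on a hypothetical evenly spaced grid between 118 and 361, two values that appear in the glimpse() output; the grid and the names expense and terms are assumptions for illustration only.

```r
# Hypothetical expense grid between two values seen in the glimpse() output.
expense <- seq(118, 361, length.out = 50)

# The same variable on three scales, as in Model 1.
terms <- cbind(raw = expense, log = log(expense), sqrt = sqrt(expense))

# Pairwise correlations: values close to 1 mean the terms carry nearly the
# same information, which makes individual coefficients unstable.
print(round(cor(terms), 3))
```

When predictors are this strongly correlated, keeping a single version of the variable is one common way to get more interpretable coefficients.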

Model 2

In [16]:
model_2<- lm(Absenteeism_hours.sqr ~ Num_children, data=df_train) 
summary(model_2)

Call:
lm(formula = Absenteeism_hours.sqr ~ Num_children, data = df_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8379 -0.7412 -0.1958  0.6731  8.7991 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.15968    0.06133  35.216  < 2e-16 ***
Num_children  0.24990    0.06225   4.015 6.72e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.492 on 590 degrees of freedom
Multiple R-squared:  0.02659,	Adjusted R-squared:  0.02494 
F-statistic: 16.12 on 1 and 590 DF,  p-value: 6.72e-05

Model Performance Comparison

In [17]:
#predict function
predict_score <- function(mod, data){
  data %>%
    mutate(score = predict(mod, newdata=data),
           resids=Absenteeism_hours.sqr - score,
           predicted.Absenteeism = (score*score))
}

#plot residuals function
resids_plot <- function(data){
  hd <- data %>% ggplot(aes(resids, ..density..))+
    geom_histogram(bins=20, alpha=0.3)+
    geom_density()+
    labs(title="Histogram and density plot for residuals", x="Residual value", subtitle="using test set")
  qq <- data %>% ggplot(aes(sample=resids))+
    geom_qq()+
    labs(title="Quantile-quantile Normal plot of residuals", subtitle="using test set")
  scat <- data %>% ggplot(aes(score, resids))+
    geom_point()+
    labs(title="Residuals vs. fitted values", x="Fitted values", y="Residuals", subtitle="using test set")
  hd %>% print()
  qq %>% print()
  scat %>% print()
}

#model 1 performance
df_test_predict1 <- predict_score(model_1, df_test)
df_test_predict1 %>% select(Absenteeism_hours, score, resids, predicted.Absenteeism)

df_test_predict1 %>% resids_plot()


#model 2 performance
df_test_predict2 <- predict_score(model_2, df_test)
df_test_predict2 %>% select(Absenteeism_hours, score, resids, predicted.Absenteeism)

df_test_predict2 %>% resids_plot()
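
Alongside the residual plots, a single number such as the root mean squared error (RMSE) on held-out rows makes two models directly comparable. The sketch below computes it for a two-feature and a one-feature linear model on synthetic stand-in data; the data, the seed, and the names sim, fit_multi, fit_single, and rmse are hypothetical and are not part of the notebook.

```r
set.seed(1)  # arbitrary seed for this sketch only

# Synthetic stand-in data (hypothetical; not the absenteeism dataset).
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 2 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(n)

# 80/20 split by row position.
train_idx <- sample(n, size = round(0.8 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

fit_multi  <- lm(y ~ x1 + x2, data = train)  # multivariate model
fit_single <- lm(y ~ x1, data = train)       # single-feature model

# Root mean squared error of a model's predictions on the given rows.
rmse <- function(model, data, label) {
  sqrt(mean((data[[label]] - predict(model, newdata = data))^2))
}

cat("RMSE multi: ", rmse(fit_multi, test, "y"), "\n")
cat("RMSE single:", rmse(fit_single, test, "y"), "\n")
```

For the notebook's models, the same function could be called with model_1 or model_2, df_test, and "Absenteeism_hours.sqr"; the lower RMSE indicates the better fit on the test set.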

Analysis and Conclusions

A basic visual analysis of potential features against Absenteeism_hours suggests trends with Education, BMI, Commute_distance, Age, Hit_target, Num_children, Num_pets, Transportation_expense, Service_time, Social_drinker, and Social_smoker.

Chart 1 takes a slightly deeper look, comparing Age and BMI_status to Absenteeism_hours. The data indicates that employees with a normal BMI_status are likely to miss more work as they age. Conversely, those who are overweight or obese tend to miss less work as they age.

The second group of charts shows the relationship of Day_of_week and Seasons with Absenteeism_hours. Mondays have the highest total absenteeism, while Thursdays have the lowest. Companies should prepare for the most absences on Mondays and might schedule more team work on Thursdays. Winter has the highest total absenteeism, at almost 1,500 hours; Autumn has the lowest, at around 1,150 hours.

Chart 3 is interesting in that those with high school (1) and graduate (3) educations appear to miss the least amount of work at a time, while those with an undergraduate degree (2) have the highest mean and appear to miss the most work at a time. However, those with a high school education have more individual instances of missing work.

Our charts led us to perform t-tests on Education because the pattern was unexpected. The tests indicated no significant difference in mean absenteeism hours between those with undergraduate and graduate degrees. In contrast, the mean absenteeism hours of people with a high school education are higher than those of people with graduate degrees.

Model 1 models Absenteeism_hours.sqr with Commute_distance, Num_children, Transportation_expense, Transportation_expense.log, and Transportation_expense.sqr. The R-squared value is very low (0.041), which means the model explains only about 4% of the variance in the label. The histogram of residuals is roughly normal with a skew to the left. The quantile-quantile plot is fairly straight, except at the higher end. This skew shows that the model is likely to make some incorrect predictions. The residuals plotted against fitted values lie in a fairly straight line close to 0, except at the higher end. Despite this, the model is statistically significant overall (F-test p-value 0.0001459).

Model 2 models Absenteeism_hours.sqr with Num_children alone. The R-squared value is also very low (0.027), although Num_children is slightly more significant here (p = 6.72e-05) than any feature in Model 1. The residual histogram and quantile-quantile plot behave as in Model 1. The residuals plotted against fitted values vary widely at the lower end, suggesting that the first model might be slightly better, which agrees with its higher adjusted R-squared (0.033 versus 0.025). Regardless, this model is still statistically significant.

In summary, the factors associated with more absenteeism at work are a "normal" BMI status at around age 37 and older, Mondays, Winter, an undergraduate degree, a shorter commute distance, a higher transportation expense, and more children.