1 Problem

IMDB has immense amounts of data regarding movies. This data can be used to predict movie awards (Oscar, Golden Globes, Bafta, etc). This will help to understand which combination of feature values contribute to bringing a film to success and consequently winning awards. This will almost certainly lead to an increase of profit for those involved in making the film as well.

2 Dataset

36 predictor variables

28 discrete
8 continuous

Label is continuous

2.1 Ground Truths

Number of Wins

nrOfWins - Number of prizes won by the movie. It is an indicative number that certifies how many are the prizes won even after the first year of release (Oscar, Golden Globe, Bafta, …)

2.2 Source

https://www.kaggle.com/gabrielegalimberti/movies-example-for-machine-learning-activities/

3 Setup

3.1 Load libraries

ipak <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
    install.packages(new.pkg, dependencies = TRUE)
sapply(pkg, require, character.only = TRUE)
}

packages <- c("tidyverse","caret","performance","Amelia","rpart","rpart.plot","randomForest","fastDummies")
ipak(packages)

theme_set(theme_classic()) #applies classic theme to all charts

3.2 Import Data

movies <- read.csv2("MACHINE_LEARNING_FINAL.csv")

3.3 Change Data Types

#change variables to appropriate data types. this will be important for charting later
movies <- movies %>% mutate_at(vars(Action:Western),as.logical)
movies <- movies %>% mutate_at(vars(ratingInteger,nrOfGenre),as.factor)

4 Exploratory Data Analysis

Overall, this is high quality data. There are no missing values and there are not too many strong correlations. However, some variables are not normally distributed- they are skewed to the left.

4.1 View Data

4.2 Missing Values

There are no missing values

#Visualize missing values
missmap(movies, main = "Missing Values vs. Observed")

4.3 Correlations

There are some variables with high correlations (>|0.5|) that will need to have feature engineering. There are also some moderate correlations (|0.3| to |0.5|) that should be monitored but do not necessarily need to be excluded initially.

#view correlations, drop the insignificant relationships, sort by highest to lowest, and visualize results graphically
corr_simple <- function(data=movies,drop="nrOfWins"){
  df_cor <- data %>% mutate_if(is.factor, as.numeric) %>% select(-drop)
  
  corr <- cor(df_cor)
  corr[lower.tri(corr,diag=TRUE)] <- NA  #Prepare to drop duplicates and correlations of 1
  corr[corr == 1] <- NA #drop perfect correlations
  corr <- as.data.frame(as.table(corr)) #Turn into a 3-column table
  corr <- na.omit(corr) #remove the NA values from above
  corr <- subset(corr, abs(Freq) > 0.3) #select significant values
  corr <- corr[order(-abs(corr$Freq)),] #Sort by highest correlation
  print(corr)

  #turn corr back into matrix in order to plot with corrplot
  mtx_corr <- reshape2::acast(corr, Var1~Var2, value.var="Freq")
  
  #plot correlations visually
  corrplot::corrplot(mtx_corr, is.corr=FALSE, tl.col="black", na.label=" ")
}

corr_simple()

##                  Var1             Var2       Freq
## 329       ratingCount  nrOfUserReviews  0.7760283
## 147    lifetime_gross      ratingCount  0.6449900
## 293       ratingCount nrOfNewsArticles  0.5837222
## 332        nrOfPhotos  nrOfUserReviews  0.5616223
## 327    lifetime_gross  nrOfUserReviews  0.5584999
## 255    lifetime_gross       nrOfPhotos  0.5553002
## 221       ratingCount  nrOfNominations  0.5403624
## 257       ratingCount       nrOfPhotos  0.5385229
## 291    lifetime_gross nrOfNewsArticles  0.5237748
## 331   nrOfNominations  nrOfUserReviews  0.5152443
## 333  nrOfNewsArticles  nrOfUserReviews  0.4669062
## 295   nrOfNominations nrOfNewsArticles  0.4521882
## 296        nrOfPhotos nrOfNewsArticles  0.4286632
## 148     ratingInteger      ratingCount  0.4147697
## 219    lifetime_gross  nrOfNominations  0.4146209
## 220     ratingInteger  nrOfNominations  0.4088316
## 259   nrOfNominations       nrOfPhotos  0.4008487
## 184     ratingInteger         duration  0.3238626
## 518         Adventure        Animation  0.3206551
## 690          duration            Drama  0.3167713
## 471    lifetime_gross        Adventure  0.3089379
## 1205           Comedy         Thriller -0.3061384
## 222          duration  nrOfNominations  0.3033252

4.4 Variable Trends

Conclusions made from these charts:

Recent years have more data points. This could be due to increased user adoption on IMDB.
It appears there are positive relationships between the number of awards and lifetime gross, rating count, number of pictures, news articles, and user reviews. However, most of the data is clustered at lower values so these may not be very strong predictors.
The higher the rating, the higher the number of awards won with a few outliers. There is a large increase in number of awards with ratings of 8 and 9.
Movie duration doesn't necessarily relate to number of awards.
Unsurprisingly, the number of nominations has a very strong positive relationship with the number of awards.
Movie genres do not appear to play a big role in number of awards except for Biography and War films.

gg_plot <- function(x_col, y_col=movies$nrOfWins, data=movies){
  if(is.numeric(data[[x_col]])){
      p1 <- data %>% ggplot(mapping=aes_string(x_col, y_col))+
        geom_jitter(alpha=0.5)+
        geom_smooth(method="loess", se=FALSE)+
        labs(title=str_c("Awards vs ", x_col), y="label")
      p1 %>% print()
      h <- hist(data[[x_col]], breaks = "FD", plot = FALSE) #histogram with Freedman-Diaconis rule for binwidth
      p2 <- ggplot(data, aes_string(x_col))+
        geom_histogram(aes(y = ..density..), breaks = h$breaks, alpha = 0.3, col = "white")+
        geom_density(size = 1) +
        labs(title=str_c("Histogram and density for ", x_col))
      p2 %>% print()
    }
  else{
    p3 <- ggplot(data, aes_string(x_col, y_col))+
      geom_boxplot()+
      geom_hline(yintercept=mean(y_col), color="red")+
      geom_hline(yintercept=median(y_col), color="blue", linetype="dashed")+
      labs(title=str_c("Awards: Number of Wins by ", x_col), subtitle="Showing mean(red), median(blue)")
    p3 %>% print()
    }
}

cols <- movies %>% select(-1) %>% names()
cols %>% walk(gg_plot)

5 Data Wrangling

5.1 Data Cleaning

Excluded columns that don't have any useful data
Removed outliers for year, ratingCount, and nrOfPhotos (these all improved model performance)

#remove movie title and columns that do not have more than 1 unique value (Adult, RealityTV, TalkShow)
movies <- movies[, sapply(movies, function(col) length(unique(col))) > 1] %>% select(-1)

#Exclude outliers for year
outliers <- boxplot(movies$year, main="Boxplot for year (with outliers)")$out

movies <- movies[-which(movies$year %in% outliers),]
#view new boxplot after outliers have been removed
boxplot(movies$year, main="Boxplot for year")

#Exclude outliers for ratingCount
outliers <- boxplot(movies$ratingCount, main="Boxplot for ratingCount (with outliers)")$out

movies <- movies[-which(movies$ratingCount %in% outliers),]
#view new boxplot after outliers have been removed
boxplot(movies$ratingCount, main="Boxplot for ratingCount")

#Exclude outliers for nrOfPhotos
outliers <- boxplot(movies$nrOfPhotos, main="Boxplot for nrOfPhotos (with outliers)")$out

movies <- movies[-which(movies$nrOfPhotos %in% outliers),]
#view new boxplot after outliers have been removed
boxplot(movies$nrOfPhotos, main="Boxplot for nrOfPhotos")

5.2 Feature Engineering

Popularity was created to remove some correlated variables.

Logarithmic and square root transformations were performed to improve the distributions of numeric variables. Overall, logarithmic transformations performed better but if there were zeros in the data, square root transformations were used instead to avoid having infinite values.

movies <- movies %>% mutate(Popularity = ratingCount+nrOfUserReviews+nrOfNewsArticles+nrOfPhotos) %>% 
                    select(-ratingCount,-nrOfUserReviews,-nrOfNewsArticles,-nrOfPhotos)

movies <- movies %>% 
   mutate(year.log=log(year),
          duration.log=log(duration),
          Popularity.log=log(Popularity),
          lifetime_gross.log=log(lifetime_gross),
          nrOfWins.sqr = sqrt(nrOfWins),
          nrOfNominations.sqr = sqrt(nrOfNominations)) %>% 
  select(-year,-duration,-Popularity,-nrOfNominations,-lifetime_gross)

movies %>% glimpse()

## Observations: 3,234
## Variables: 31
## $ ratingInteger       <fct> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ nrOfWins            <int> 20, 18, 10, 20, 13, 22, 26, 21, 10, 26, 0,...
## $ nrOfGenre           <fct> 3, 3, 3, 3, 2, 1, 2, 2, 2, 3, 3, 2, 2, 3, ...
## $ Action              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Adventure           <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, F...
## $ Animation           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Biography           <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, F...
## $ Comedy              <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TR...
## $ Crime               <lgl> FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FAL...
## $ Documentary         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Drama               <lgl> TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE...
## $ Family              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Fantasy             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Horror              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Music               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Musical             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Mystery             <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, F...
## $ News                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Romance             <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, T...
## $ SciFi               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Short               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Sport               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Thriller            <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, F...
## $ War                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Western             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ year.log            <dbl> 7.584265, 7.585281, 7.587311, 7.587817, 7....
## $ duration.log        <dbl> 8.757784, 8.794825, 8.954157, 8.961879, 8....
## $ Popularity.log      <dbl> 11.99842, 11.62560, 11.76139, 11.98895, 11...
## $ lifetime_gross.log  <dbl> 18.46895, 18.44351, 18.86537, 17.18968, 17...
## $ nrOfWins.sqr        <dbl> 4.472136, 4.242641, 3.162278, 4.472136, 3....
## $ nrOfNominations.sqr <dbl> 3.605551, 3.464102, 2.449490, 4.690416, 4....

cols <- movies %>% select(26:31) %>% names()
cols %>% walk(gg_plot, y_col=movies$nrOfWins.sqr)

5.3 Check Correlations Again

There are still some moderate correlations, in particular between the engineered Popularity.log and lifetime_gross.log. lifetime_gross.log will be dropped.

corr_simple(drop=c("nrOfWins","nrOfWins.sqr"))

##               Var1                Var2       Freq
## 810 Popularity.log  lifetime_gross.log  0.6168315
## 813  ratingInteger nrOfNominations.sqr  0.4215055
## 735          Drama        duration.log  0.3661610
## 839 Popularity.log nrOfNominations.sqr  0.3648694
## 120      Adventure           Animation  0.3443654
## 616         Comedy            Thriller -0.3125063
## 262  ratingInteger               Drama  0.3100243

movies <- movies %>% select(-lifetime_gross.log)

5.4 Normalize Data

Normalize the data frame so that model is not skewed by different types of measurements.

#normalize dataframe
normalize <- function(x)(x - mean(x, na.rm=T))/sd(x, na.rm=T)

movies <- movies %>% mutate_at(vars(year.log,duration.log,Popularity.log,nrOfNominations.sqr), normalize)

movies %>% glimpse()

## Observations: 3,234
## Variables: 30
## $ ratingInteger       <fct> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ nrOfWins            <int> 20, 18, 10, 20, 13, 22, 26, 21, 10, 26, 0,...
## $ nrOfGenre           <fct> 3, 3, 3, 3, 2, 1, 2, 2, 2, 3, 3, 2, 2, 3, ...
## $ Action              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Adventure           <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, F...
## $ Animation           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Biography           <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, F...
## $ Comedy              <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TR...
## $ Crime               <lgl> FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FAL...
## $ Documentary         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Drama               <lgl> TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE...
## $ Family              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Fantasy             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Horror              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Music               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Musical             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Mystery             <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, F...
## $ News                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Romance             <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, T...
## $ SciFi               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Short               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Sport               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Thriller            <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, F...
## $ War                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Western             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ year.log            <dbl> -3.0481693, -2.8506383, -2.4561776, -2.357...
## $ duration.log        <dbl> -0.03526202, 0.18373084, 1.12572274, 1.171...
## $ Popularity.log      <dbl> 1.5243233, 1.2211173, 1.3315538, 1.5166245...
## $ nrOfWins.sqr        <dbl> 4.472136, 4.242641, 3.162278, 4.472136, 3....
## $ nrOfNominations.sqr <dbl> 1.17673661, 1.09519946, 0.51033725, 1.8020...

5.5 Training and Testing Sets

Using 70% data to train the model and 30% to test

#for use with linear regression
#convert to numeric dummy variables
movies_num <- dummy_cols(movies) %>% select(-ratingInteger,-nrOfGenre)
movies_num <- movies_num %>% mutate_if(is.logical,as.integer)

set.seed(123)
train_num <- movies_num %>% sample_frac(0.7)
test_num <- movies_num %>% setdiff(train_num)

#for use with regression trees
set.seed(123)
train <- movies %>% sample_frac(0.7)
test <- movies %>% setdiff(train)

6 Models

6.1 Linear Regression (by AIC)

R-squared: 76.16%

#Linear Regression Model
lmodel <- lm(nrOfWins.sqr ~ . -nrOfWins, data = train_num)

#best model by AIC
lmodel_AIC <- lmodel %>% step() #automatically steps through and removes features based on getting the lowest AIC value for the model

lmodel_AIC %>% summary()

## 
## Call:
## lm(formula = nrOfWins.sqr ~ Action + Adventure + Animation + 
##     Comedy + Crime + Documentary + Family + Fantasy + Musical + 
##     Sport + Thriller + year.log + nrOfNominations.sqr + ratingInteger_8 + 
##     ratingInteger_9 + ratingInteger_7 + ratingInteger_3, data = train_num)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6783 -0.2975 -0.0806  0.2379  5.2776 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.30080    0.03954  32.900  < 2e-16 ***
## Action              -0.09596    0.04243  -2.262 0.023800 *  
## Adventure           -0.16690    0.04962  -3.364 0.000782 ***
## Animation           -0.24769    0.08371  -2.959 0.003119 ** 
## Comedy              -0.05812    0.03571  -1.628 0.103743    
## Crime               -0.06572    0.04167  -1.577 0.114915    
## Documentary          0.29848    0.10263   2.908 0.003671 ** 
## Family              -0.17882    0.06086  -2.938 0.003335 ** 
## Fantasy             -0.16028    0.06083  -2.635 0.008477 ** 
## Musical             -0.32488    0.12400  -2.620 0.008849 ** 
## Sport               -0.15691    0.08348  -1.879 0.060307 .  
## Thriller            -0.16025    0.04687  -3.419 0.000641 ***
## year.log            -0.06703    0.01646  -4.072 4.83e-05 ***
## nrOfNominations.sqr  1.17918    0.01832  64.367  < 2e-16 ***
## ratingInteger_8      0.72878    0.05804  12.557  < 2e-16 ***
## ratingInteger_9      1.13361    0.37559   3.018 0.002571 ** 
## ratingInteger_7      0.09801    0.03670   2.671 0.007626 ** 
## ratingInteger_3      0.53320    0.26342   2.024 0.043075 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7395 on 2246 degrees of freedom
## Multiple R-squared:  0.7616, Adjusted R-squared:  0.7598 
## F-statistic: 422.1 on 17 and 2246 DF,  p-value: < 2.2e-16

#nrOfWins.sqr ~ Adventure + Biography + Crime + Documentary + Drama + Family + Music + Romance + SciFi + War + Western + year.log + duration.log + Popularity.log + lifetime_gross.log + nrOfNominations.sqr + ratingInteger_8 + ratingInteger_9 + ratingInteger_3 + nrOfGenre_2 + nrOfGenre_1

#variable importance sorted from highest to lowest
varImp_sort <- function(mod){
  imp <- as.data.frame(varImp(mod))
  imp <- data.frame(names   = rownames(imp),
                  overall = imp$Overall)
  imp[order(imp$overall,decreasing = T),]
}

varImp_sort(lmodel_AIC)

##                  names   overall
## 13 nrOfNominations.sqr 64.367270
## 14     ratingInteger_8 12.556771
## 12            year.log  4.071592
## 11            Thriller  3.418661
## 2            Adventure  3.363628
## 15     ratingInteger_9  3.018198
## 3            Animation  2.959006
## 7               Family  2.938143
## 6          Documentary  2.908228
## 16     ratingInteger_7  2.670608
## 8              Fantasy  2.634791
## 9              Musical  2.620138
## 1               Action  2.261886
## 17     ratingInteger_3  2.024129
## 10               Sport  1.879488
## 4               Comedy  1.627632
## 5                Crime  1.577093

model_performance(lmodel_AIC, metrics=c("AIC","R2","RMSE"))

##        AIC        R2 R2_adjusted     RMSE
## 1 5078.372 0.7616017   0.7597972 0.736547

6.2 Linear Regression (by p-value)

R-squared: 76.08%

#best model by p-value
lmodel_p <- lm(nrOfWins.sqr ~ Action + Adventure + Animation + Documentary + Family + Fantasy + Musical + Thriller + year.log + nrOfNominations.sqr + ratingInteger_8 + ratingInteger_9 + ratingInteger_7 + ratingInteger_3, data=train_num)

summary(lmodel_p)

## 
## Call:
## lm(formula = nrOfWins.sqr ~ Action + Adventure + Animation + 
##     Documentary + Family + Fantasy + Musical + Thriller + year.log + 
##     nrOfNominations.sqr + ratingInteger_8 + ratingInteger_9 + 
##     ratingInteger_7 + ratingInteger_3, data = train_num)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6465 -0.2846 -0.0875  0.2446  5.2499 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.24363    0.03007  41.355  < 2e-16 ***
## Action              -0.09489    0.04077  -2.328  0.02002 *  
## Adventure           -0.14146    0.04863  -2.909  0.00366 ** 
## Animation           -0.24827    0.08371  -2.966  0.00305 ** 
## Documentary          0.32374    0.10165   3.185  0.00147 ** 
## Family              -0.17522    0.06080  -2.882  0.00399 ** 
## Fantasy             -0.14238    0.06046  -2.355  0.01861 *  
## Musical             -0.30486    0.12381  -2.462  0.01388 *  
## Thriller            -0.12759    0.04371  -2.919  0.00355 ** 
## year.log            -0.06741    0.01647  -4.093  4.4e-05 ***
## nrOfNominations.sqr  1.18655    0.01812  65.472  < 2e-16 ***
## ratingInteger_8      0.73589    0.05730  12.842  < 2e-16 ***
## ratingInteger_9      1.15512    0.37548   3.076  0.00212 ** 
## ratingInteger_7      0.09993    0.03624   2.758  0.00587 ** 
## ratingInteger_3      0.51859    0.26350   1.968  0.04918 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7402 on 2249 degrees of freedom
## Multiple R-squared:  0.7608, Adjusted R-squared:  0.7593 
## F-statistic: 510.9 on 14 and 2249 DF,  p-value: < 2.2e-16

varImp_sort(lmodel_p)

##                  names   overall
## 10 nrOfNominations.sqr 65.472472
## 11     ratingInteger_8 12.841734
## 9             year.log  4.093424
## 4          Documentary  3.184796
## 12     ratingInteger_9  3.076355
## 3            Animation  2.965857
## 8             Thriller  2.918901
## 2            Adventure  2.908852
## 5               Family  2.881881
## 13     ratingInteger_7  2.757768
## 7              Musical  2.462326
## 6              Fantasy  2.354938
## 1               Action  2.327691
## 14     ratingInteger_3  1.968060

model_performance(lmodel_p, metrics=c("AIC","R2","RMSE"))

##        AIC       R2 R2_adjusted      RMSE
## 1 5079.953 0.760802    0.759313 0.7377813

6.3 Linear Regression (by important)

R-squared: 76.08%

#best model by important features
lmodel_i <- lm(nrOfWins.sqr ~ Action + Biography + War + Adventure + Animation + Documentary + Family + Fantasy + Musical + Thriller + year.log + nrOfNominations.sqr + ratingInteger_8 + ratingInteger_9 + Popularity.log, data=train_num)

summary(lmodel_i)

## 
## Call:
## lm(formula = nrOfWins.sqr ~ Action + Biography + War + Adventure + 
##     Animation + Documentary + Family + Fantasy + Musical + Thriller + 
##     year.log + nrOfNominations.sqr + ratingInteger_8 + ratingInteger_9 + 
##     Popularity.log, data = train_num)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7362 -0.2886 -0.0887  0.2452  5.2733 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.28665    0.02418  53.206  < 2e-16 ***
## Action              -0.10255    0.04149  -2.472 0.013518 *  
## Biography            0.11062    0.06860   1.613 0.106968    
## War                  0.15325    0.10614   1.444 0.148924    
## Adventure           -0.14242    0.04880  -2.919 0.003549 ** 
## Animation           -0.22866    0.08379  -2.729 0.006403 ** 
## Documentary          0.35859    0.10397   3.449 0.000573 ***
## Family              -0.18839    0.06066  -3.105 0.001923 ** 
## Fantasy             -0.13555    0.06081  -2.229 0.025914 *  
## Musical             -0.28054    0.12486  -2.247 0.024742 *  
## Thriller            -0.12513    0.04448  -2.813 0.004948 ** 
## year.log            -0.07228    0.01689  -4.279 1.96e-05 ***
## nrOfNominations.sqr  1.19355    0.01833  65.123  < 2e-16 ***
## ratingInteger_8      0.65351    0.05220  12.519  < 2e-16 ***
## ratingInteger_9      1.09466    0.37544   2.916 0.003585 ** 
## Popularity.log       0.01135    0.01862   0.609 0.542377    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7414 on 2248 degrees of freedom
## Multiple R-squared:  0.7601, Adjusted R-squared:  0.7585 
## F-statistic: 474.9 on 15 and 2248 DF,  p-value: < 2.2e-16

varImp_sort(lmodel_i)

##                  names    overall
## 12 nrOfNominations.sqr 65.1232511
## 13     ratingInteger_8 12.5190632
## 11            year.log  4.2785257
## 6          Documentary  3.4491320
## 7               Family  3.1054819
## 4            Adventure  2.9187415
## 14     ratingInteger_9  2.9156455
## 10            Thriller  2.8131797
## 5            Animation  2.7289316
## 1               Action  2.4717679
## 9              Musical  2.2469220
## 8              Fantasy  2.2289763
## 2            Biography  1.6126184
## 3                  War  1.4438378
## 15      Popularity.log  0.6093147

model_performance(lmodel_p, metrics=c("AIC","R2","RMSE"))

##        AIC       R2 R2_adjusted      RMSE
## 1 5079.953 0.760802    0.759313 0.7377813

6.4 Regression Tree

R-squared: 80.17 %

#building a decision tree
dtModel <- rpart(nrOfWins.sqr~.-nrOfWins, data = train, method="anova", cp=.001)

#view results
plotcp(dtModel) #visualize cross validation results

varImp_sort(dtModel) #view variable importance

##                  names    overall
## 13 nrOfNominations.sqr 2.72303897
## 15       ratingInteger 1.64837122
## 18            year.log 1.57368578
## 9         duration.log 1.54856353
## 2            Adventure 1.06531127
## 8                Drama 0.98672139
## 14      Popularity.log 0.90774993
## 3            Animation 0.71807295
## 16             Romance 0.37244225
## 12           nrOfGenre 0.31369780
## 1               Action 0.29138194
## 7          Documentary 0.25872023
## 17            Thriller 0.18495035
## 4            Biography 0.15721218
## 5               Comedy 0.12583402
## 6                Crime 0.04990904
## 10             Fantasy 0.03593187
## 11             Musical 0.01555749
## 19              Family 0.00000000
## 20              Horror 0.00000000
## 21               Music 0.00000000
## 22             Mystery 0.00000000
## 23                News 0.00000000
## 24               SciFi 0.00000000
## 25               Short 0.00000000
## 26               Sport 0.00000000
## 27                 War 0.00000000
## 28             Western 0.00000000

printcp(dtModel) #display results

## 
## Regression tree:
## rpart(formula = nrOfWins.sqr ~ . - nrOfWins, data = train, method = "anova", 
##     cp = 0.001)
## 
## Variables actually used in tree construction:
##  [1] Action              Adventure           Documentary        
##  [4] Drama               duration.log        nrOfGenre          
##  [7] nrOfNominations.sqr Popularity.log      ratingInteger      
## [10] year.log           
## 
## Root node error: 5152/2264 = 2.2756
## 
## n= 2264 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.4433977      0   1.00000 1.00098 0.049720
## 2  0.1180394      1   0.55660 0.59417 0.031568
## 3  0.1147302      2   0.43856 0.41742 0.025002
## 4  0.0385680      3   0.32383 0.36435 0.023607
## 5  0.0234411      4   0.28526 0.31046 0.016329
## 6  0.0109068      5   0.26182 0.28589 0.014945
## 7  0.0089815      6   0.25092 0.27505 0.014540
## 8  0.0033703      7   0.24194 0.26154 0.014258
## 9  0.0032102      9   0.23519 0.26095 0.014439
## 10 0.0032071     10   0.23198 0.26022 0.014391
## 11 0.0028306     11   0.22878 0.25964 0.014416
## 12 0.0027696     12   0.22595 0.25807 0.014297
## 13 0.0023371     13   0.22318 0.25602 0.014012
## 14 0.0023242     14   0.22084 0.25635 0.014049
## 15 0.0022595     15   0.21852 0.25648 0.014064
## 16 0.0018168     16   0.21626 0.25133 0.013882
## 17 0.0015824     18   0.21262 0.25347 0.014268
## 18 0.0015476     19   0.21104 0.25133 0.014172
## 19 0.0015176     20   0.20949 0.25149 0.014190
## 20 0.0013843     21   0.20798 0.25380 0.014951
## 21 0.0013237     22   0.20659 0.25495 0.015035
## 22 0.0012868     23   0.20527 0.25406 0.015015
## 23 0.0011940     24   0.20398 0.25408 0.014860
## 24 0.0011877     25   0.20279 0.25783 0.015078
## 25 0.0011719     26   0.20160 0.25857 0.015080
## 26 0.0010611     27   0.20043 0.26011 0.015089
## 27 0.0010000     29   0.19830 0.25771 0.014955

#plot variable importance
impvar <- varImp_sort(dtModel) %>% filter(overall>0) %>% arrange(overall) %>% mutate(names = factor(names, unique(names)))
impvar %>% ggplot(aes(names, overall))+
  geom_bar(stat = "identity")+
  coord_flip()+
  labs(title="Regression Tree Variable Importance")

#gather R-squared value
tmp <- printcp(dtModel)

## 
## Regression tree:
## rpart(formula = nrOfWins.sqr ~ . - nrOfWins, data = train, method = "anova", 
##     cp = 0.001)
## 
## Variables actually used in tree construction:
##  [1] Action              Adventure           Documentary        
##  [4] Drama               duration.log        nrOfGenre          
##  [7] nrOfNominations.sqr Popularity.log      ratingInteger      
## [10] year.log           
## 
## Root node error: 5152/2264 = 2.2756
## 
## n= 2264 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.4433977      0   1.00000 1.00098 0.049720
## 2  0.1180394      1   0.55660 0.59417 0.031568
## 3  0.1147302      2   0.43856 0.41742 0.025002
## 4  0.0385680      3   0.32383 0.36435 0.023607
## 5  0.0234411      4   0.28526 0.31046 0.016329
## 6  0.0109068      5   0.26182 0.28589 0.014945
## 7  0.0089815      6   0.25092 0.27505 0.014540
## 8  0.0033703      7   0.24194 0.26154 0.014258
## 9  0.0032102      9   0.23519 0.26095 0.014439
## 10 0.0032071     10   0.23198 0.26022 0.014391
## 11 0.0028306     11   0.22878 0.25964 0.014416
## 12 0.0027696     12   0.22595 0.25807 0.014297
## 13 0.0023371     13   0.22318 0.25602 0.014012
## 14 0.0023242     14   0.22084 0.25635 0.014049
## 15 0.0022595     15   0.21852 0.25648 0.014064
## 16 0.0018168     16   0.21626 0.25133 0.013882
## 17 0.0015824     18   0.21262 0.25347 0.014268
## 18 0.0015476     19   0.21104 0.25133 0.014172
## 19 0.0015176     20   0.20949 0.25149 0.014190
## 20 0.0013843     21   0.20798 0.25380 0.014951
## 21 0.0013237     22   0.20659 0.25495 0.015035
## 22 0.0012868     23   0.20527 0.25406 0.015015
## 23 0.0011940     24   0.20398 0.25408 0.014860
## 24 0.0011877     25   0.20279 0.25783 0.015078
## 25 0.0011719     26   0.20160 0.25857 0.015080
## 26 0.0010611     27   0.20043 0.26011 0.015089
## 27 0.0010000     29   0.19830 0.25771 0.014955

rsq.val <- 1-tmp[nrow(tmp),3]   #extract R-squared from final split
cat("\nR-squared: ", round(rsq.val*100,2),"% \n")

## 
## R-squared:  80.17 %

#plot tree
rpart.plot(dtModel, extra=1, varlen=-1, digits=1, main="Regression Tree for Movie Awards")

6.5 Random Forest

R-squared: 76.76%

set.seed(123)
rfModel <- randomForest(nrOfWins.sqr ~ .-nrOfWins, ntree=400, mtry=15, data=train, importance=TRUE, corr.bias=TRUE)
rfModel

## 
## Call:
##  randomForest(formula = nrOfWins.sqr ~ . - nrOfWins, data = train,      ntree = 400, mtry = 15, importance = TRUE, corr.bias = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 400
## No. of variables tried at each split: 15
## 
##           Mean of squared residuals: 0.5288754
##                     % Var explained: 76.76
##   Bias correction applied:
##   Intercept:  -0.0122175 
##       Slope:  1.003082

plot(rfModel)

varImp_sort(rfModel)

##                  names      overall
## 28 nrOfNominations.sqr 129.24593114
## 1        ratingInteger  35.29635020
## 27      Popularity.log  16.65149160
## 25            year.log  15.66764158
## 9          Documentary  11.80645162
## 26        duration.log  11.42987778
## 10               Drama   9.65046160
## 3               Action   8.99498244
## 4            Adventure   7.75123052
## 7               Comedy   6.66576224
## 13              Horror   6.34186205
## 24             Western   5.02248371
## 11              Family   4.46829261
## 8                Crime   4.15115262
## 18             Romance   3.54580046
## 2            nrOfGenre   2.54821846
## 22            Thriller   2.22670331
## 23                 War   2.11520398
## 12             Fantasy   1.90159669
## 19               SciFi   1.80570831
## 5            Animation   1.18555968
## 15             Musical   1.10543935
## 21               Sport   0.43425805
## 16             Mystery   0.04092723
## 17                News   0.00000000
## 20               Short   0.00000000
## 14               Music  -0.01011599
## 6            Biography  -0.73679521

varImpPlot(rfModel, n.var=10, main="Random Forest Top 10 Important Variables")

7 Model Performance

Performance is being measured by R-squared. This value is derived from using the square of the correlation between the actual values and predicted values.

The following R-squared values were calculated using the test set: * lmodel_AIC - 69.27% * lmodel_p - 69.57% * lmodel_i - 69.77% * dtModel - 61.48% * rfModel - 64.85%

#prediction and accuracy function
predict_score <- function(mod, data){
  mod_name <- deparse(substitute(mod))
  
  #predict results
  data <- data %>% mutate(score = predict(mod, newdata=data), #predict results using the model
                         resids = nrOfWins.sqr - score,
                         predictedwins = round(score*score)) #unsquare the results and round to whole number- there cannot be partial awards
  
  #return accuracy
  df_cor <- data %>% select(predictedwins,nrOfWins)
  corr <- cor(df_cor)^2
  corr[lower.tri(corr,diag=TRUE)] <- NA  #Prepare to drop duplicates and correlations of 1
  corr <- as.data.frame(as.table(corr)) #Turn into a 3-column table
  corr <- na.omit(corr) #remove the NA values from above
  cat(mod_name, "R-squared: ", round(corr$Freq*100, 2),"% \n") #print accuracy
  
  #plot results
  h <- hist(data$resids, breaks = "FD", plot = FALSE) #histogram with Freedman-Diaconis rule for binwidth
  hd <- data %>% ggplot(aes(resids, ..density..))+
    geom_histogram(breaks=h$breaks)+
    geom_density(color="red", size=1)+
    labs(title=str_c(mod_name, " histogram and density plot for residuals"), x="Residual value", subtitle="using test set")
  qq <- data %>% ggplot(aes(sample=resids))+
    geom_qq()+
    geom_qq_line(color="red")+
    labs(title=str_c(mod_name, " quantile-quantile Normal plot of residuals"), subtitle="using test set")
  fit <- data %>% ggplot(aes(score, resids))+
    geom_point()+
    geom_smooth(method="loess", color="red")+
    labs(title=str_c(mod_name, " residuals vs. fitted values"), x="Fitted values", y="Residuals", subtitle="using test set")
  hd %>% print()
  qq %>% print()
  fit %>% print()
}

#predict with the test data
AIC_pred <- predict_score(lmodel_AIC, test_num)

## lmodel_AIC R-squared:  69.27 %

p_pred <- predict_score(lmodel_p, test_num)

## lmodel_p R-squared:  69.57 %

i_pred <- predict_score(lmodel_i, test_num)

## lmodel_i R-squared:  69.77 %

dt_pred <- predict_score(dtModel, test)

## dtModel R-squared:  61.48 %

rf_pred <- predict_score(rfModel, test)

## rfModel R-squared:  64.85 %

8 Conclusion

lmodel_i is the best model with an R-squared of 69.77% against the test data. It shows that nrOfWins.sqr is modeled by nrOfNominations.sqr, ratingInteger_8, year.log, Documentary, Family, Adventure, ratingInteger_9, Thriller, Animation, Action, Musical, Fantasy, Biography, War, and Popularity.log. The residuals compared to fitted values are in a fairly straight line and close to 0 except at the higher end. The residuals plot is a fairly normal distribution. The quantile quantile plot is reasonably straight, except at the ends where there is more noise.

All of the other models have very similar results with the decision tree being the worst at 61.28%.

To improve accuracy results, nrOfWins could have been turned into a binary variable for whether an award was earned or not. Doing a really quick generalized linear model of this (not shown since it is out of scope of this project) resulted in ~95% accuracy against the test set. This project is predicting number of awards, however, and not whether an award was received or not.

Another thing that could have been done to improve the models would be to gather more movie data including actors/actresses, producer, etc. These basic movie stats are likely not enough to explain all the variance with number of movie awards.

Movie Awards

Catherine Williams et al.

June 5, 2019

Class Group Project