IMDB has immense amounts of data regarding movies. This data can be used to predict movie awards (Oscar, Golden Globes, Bafta, etc). This will help to understand which combination of feature values contribute to bringing a film to success and consequently winning awards. This will almost certainly lead to an increase of profit for those involved in making the film as well.
36 predictor variables
Label is continuous
Number of Wins
nrOfWins - Number of prizes won by the movie. It is an indicative number that certifies how many are the prizes won even after the first year of release (Oscar, Golden Globe, Bafta, …)
ipak <- function(pkg){
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
install.packages(new.pkg, dependencies = TRUE)
sapply(pkg, require, character.only = TRUE)
}
packages <- c("tidyverse","caret","performance","Amelia","rpart","rpart.plot","randomForest","fastDummies")
ipak(packages)
theme_set(theme_classic()) #applies classic theme to all charts
movies <- read.csv2("MACHINE_LEARNING_FINAL.csv")
#change variables to appropriate data types. this will be important for charting later
movies <- movies %>% mutate_at(vars(Action:Western),as.logical)
movies <- movies %>% mutate_at(vars(ratingInteger,nrOfGenre),as.factor)
Overall, this is high quality data. There are no missing values and there are not too many strong correlations. However, some variables are not normally distributed- they are skewed to the left.
There are no missing values
#Visualize missing values
missmap(movies, main = "Missing Values vs. Observed")
There are some variables with high correlations (>|0.5|) that will need to have feature engineering. There are also some moderate correlations (|0.3| to |0.5|) that should be monitored but do not necessarily need to be excluded initially.
#view correlations, drop the insignificant relationships, sort by highest to lowest, and visualize results graphically
corr_simple <- function(data=movies,drop="nrOfWins"){
df_cor <- data %>% mutate_if(is.factor, as.numeric) %>% select(-drop)
corr <- cor(df_cor)
corr[lower.tri(corr,diag=TRUE)] <- NA #Prepare to drop duplicates and correlations of 1
corr[corr == 1] <- NA #drop perfect correlations
corr <- as.data.frame(as.table(corr)) #Turn into a 3-column table
corr <- na.omit(corr) #remove the NA values from above
corr <- subset(corr, abs(Freq) > 0.3) #select significant values
corr <- corr[order(-abs(corr$Freq)),] #Sort by highest correlation
print(corr)
#turn corr back into matrix in order to plot with corrplot
mtx_corr <- reshape2::acast(corr, Var1~Var2, value.var="Freq")
#plot correlations visually
corrplot::corrplot(mtx_corr, is.corr=FALSE, tl.col="black", na.label=" ")
}
corr_simple()
## Var1 Var2 Freq
## 329 ratingCount nrOfUserReviews 0.7760283
## 147 lifetime_gross ratingCount 0.6449900
## 293 ratingCount nrOfNewsArticles 0.5837222
## 332 nrOfPhotos nrOfUserReviews 0.5616223
## 327 lifetime_gross nrOfUserReviews 0.5584999
## 255 lifetime_gross nrOfPhotos 0.5553002
## 221 ratingCount nrOfNominations 0.5403624
## 257 ratingCount nrOfPhotos 0.5385229
## 291 lifetime_gross nrOfNewsArticles 0.5237748
## 331 nrOfNominations nrOfUserReviews 0.5152443
## 333 nrOfNewsArticles nrOfUserReviews 0.4669062
## 295 nrOfNominations nrOfNewsArticles 0.4521882
## 296 nrOfPhotos nrOfNewsArticles 0.4286632
## 148 ratingInteger ratingCount 0.4147697
## 219 lifetime_gross nrOfNominations 0.4146209
## 220 ratingInteger nrOfNominations 0.4088316
## 259 nrOfNominations nrOfPhotos 0.4008487
## 184 ratingInteger duration 0.3238626
## 518 Adventure Animation 0.3206551
## 690 duration Drama 0.3167713
## 471 lifetime_gross Adventure 0.3089379
## 1205 Comedy Thriller -0.3061384
## 222 duration nrOfNominations 0.3033252
Conclusions made from these charts:
gg_plot <- function(x_col, y_col=movies$nrOfWins, data=movies){
if(is.numeric(data[[x_col]])){
p1 <- data %>% ggplot(mapping=aes_string(x_col, y_col))+
geom_jitter(alpha=0.5)+
geom_smooth(method="loess", se=FALSE)+
labs(title=str_c("Awards vs ", x_col), y="label")
p1 %>% print()
h <- hist(data[[x_col]], breaks = "FD", plot = FALSE) #histogram with Freedman-Diaconis rule for binwidth
p2 <- ggplot(data, aes_string(x_col))+
geom_histogram(aes(y = ..density..), breaks = h$breaks, alpha = 0.3, col = "white")+
geom_density(size = 1) +
labs(title=str_c("Histogram and density for ", x_col))
p2 %>% print()
}
else{
p3 <- ggplot(data, aes_string(x_col, y_col))+
geom_boxplot()+
geom_hline(yintercept=mean(y_col), color="red")+
geom_hline(yintercept=median(y_col), color="blue", linetype="dashed")+
labs(title=str_c("Awards: Number of Wins by ", x_col), subtitle="Showing mean(red), median(blue)")
p3 %>% print()
}
}
cols <- movies %>% select(-1) %>% names()
cols %>% walk(gg_plot)
#remove movie title and columns that do not have more than 1 unique value (Adult, RealityTV, TalkShow)
movies <- movies[, sapply(movies, function(col) length(unique(col))) > 1] %>% select(-1)
#Exclude outliers for year
outliers <- boxplot(movies$year, main="Boxplot for year (with outliers)")$out
movies <- movies[-which(movies$year %in% outliers),]
#view new boxplot after outliers have been removed
boxplot(movies$year, main="Boxplot for year")
#Exclude outliers for ratingCount
outliers <- boxplot(movies$ratingCount, main="Boxplot for ratingCount (with outliers)")$out
movies <- movies[-which(movies$ratingCount %in% outliers),]
#view new boxplot after outliers have been removed
boxplot(movies$ratingCount, main="Boxplot for ratingCount")
#Exclude outliers for nrOfPhotos
outliers <- boxplot(movies$nrOfPhotos, main="Boxplot for nrOfPhotos (with outliers)")$out
movies <- movies[-which(movies$nrOfPhotos %in% outliers),]
#view new boxplot after outliers have been removed
boxplot(movies$nrOfPhotos, main="Boxplot for nrOfPhotos")
Popularity was created to remove some correlated variables.
Logarithmic and square root transformations were performed to improve the distributions of numeric variables. Overall, logarithmic transformations performed better but if there were zeros in the data, square root transformations were used instead to avoid having infinite values.
movies <- movies %>% mutate(Popularity = ratingCount+nrOfUserReviews+nrOfNewsArticles+nrOfPhotos) %>%
select(-ratingCount,-nrOfUserReviews,-nrOfNewsArticles,-nrOfPhotos)
movies <- movies %>%
mutate(year.log=log(year),
duration.log=log(duration),
Popularity.log=log(Popularity),
lifetime_gross.log=log(lifetime_gross),
nrOfWins.sqr = sqrt(nrOfWins),
nrOfNominations.sqr = sqrt(nrOfNominations)) %>%
select(-year,-duration,-Popularity,-nrOfNominations,-lifetime_gross)
movies %>% glimpse()
## Observations: 3,234
## Variables: 31
## $ ratingInteger <fct> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ nrOfWins <int> 20, 18, 10, 20, 13, 22, 26, 21, 10, 26, 0,...
## $ nrOfGenre <fct> 3, 3, 3, 3, 2, 1, 2, 2, 2, 3, 3, 2, 2, 3, ...
## $ Action <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Adventure <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, F...
## $ Animation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Biography <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, F...
## $ Comedy <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TR...
## $ Crime <lgl> FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FAL...
## $ Documentary <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Drama <lgl> TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE...
## $ Family <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Fantasy <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Horror <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Music <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Musical <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Mystery <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, F...
## $ News <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Romance <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, T...
## $ SciFi <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Short <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Sport <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Thriller <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, F...
## $ War <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Western <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ year.log <dbl> 7.584265, 7.585281, 7.587311, 7.587817, 7....
## $ duration.log <dbl> 8.757784, 8.794825, 8.954157, 8.961879, 8....
## $ Popularity.log <dbl> 11.99842, 11.62560, 11.76139, 11.98895, 11...
## $ lifetime_gross.log <dbl> 18.46895, 18.44351, 18.86537, 17.18968, 17...
## $ nrOfWins.sqr <dbl> 4.472136, 4.242641, 3.162278, 4.472136, 3....
## $ nrOfNominations.sqr <dbl> 3.605551, 3.464102, 2.449490, 4.690416, 4....
cols <- movies %>% select(26:31) %>% names()
cols %>% walk(gg_plot, y_col=movies$nrOfWins.sqr)
There are still some moderate correlations, in particular between the engineered Popularity.log and lifetime_gross.log. lifetime_gross.log will be dropped.
corr_simple(drop=c("nrOfWins","nrOfWins.sqr"))
## Var1 Var2 Freq
## 810 Popularity.log lifetime_gross.log 0.6168315
## 813 ratingInteger nrOfNominations.sqr 0.4215055
## 735 Drama duration.log 0.3661610
## 839 Popularity.log nrOfNominations.sqr 0.3648694
## 120 Adventure Animation 0.3443654
## 616 Comedy Thriller -0.3125063
## 262 ratingInteger Drama 0.3100243
movies <- movies %>% select(-lifetime_gross.log)
Normalize the data frame so that model is not skewed by different types of measurements.
#normalize dataframe
normalize <- function(x)(x - mean(x, na.rm=T))/sd(x, na.rm=T)
movies <- movies %>% mutate_at(vars(year.log,duration.log,Popularity.log,nrOfNominations.sqr), normalize)
movies %>% glimpse()
## Observations: 3,234
## Variables: 30
## $ ratingInteger <fct> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, ...
## $ nrOfWins <int> 20, 18, 10, 20, 13, 22, 26, 21, 10, 26, 0,...
## $ nrOfGenre <fct> 3, 3, 3, 3, 2, 1, 2, 2, 2, 3, 3, 2, 2, 3, ...
## $ Action <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Adventure <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, F...
## $ Animation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Biography <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, F...
## $ Comedy <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TR...
## $ Crime <lgl> FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FAL...
## $ Documentary <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Drama <lgl> TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE...
## $ Family <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Fantasy <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Horror <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Music <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Musical <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Mystery <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, F...
## $ News <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Romance <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, T...
## $ SciFi <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Short <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Sport <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Thriller <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, F...
## $ War <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ Western <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ year.log <dbl> -3.0481693, -2.8506383, -2.4561776, -2.357...
## $ duration.log <dbl> -0.03526202, 0.18373084, 1.12572274, 1.171...
## $ Popularity.log <dbl> 1.5243233, 1.2211173, 1.3315538, 1.5166245...
## $ nrOfWins.sqr <dbl> 4.472136, 4.242641, 3.162278, 4.472136, 3....
## $ nrOfNominations.sqr <dbl> 1.17673661, 1.09519946, 0.51033725, 1.8020...
Using 70% data to train the model and 30% to test
#for use with linear regression
#convert to numeric dummy variables
movies_num <- dummy_cols(movies) %>% select(-ratingInteger,-nrOfGenre)
movies_num <- movies_num %>% mutate_if(is.logical,as.integer)
set.seed(123)
train_num <- movies_num %>% sample_frac(0.7)
test_num <- movies_num %>% setdiff(train_num)
#for use with regression trees
set.seed(123)
train <- movies %>% sample_frac(0.7)
test <- movies %>% setdiff(train)
R-squared: 76.16%
#Linear Regression Model
lmodel <- lm(nrOfWins.sqr ~ . -nrOfWins, data = train_num)
#best model by AIC
lmodel_AIC <- lmodel %>% step() #automatically steps through and removes features based on getting the lowest AIC value for the model
lmodel_AIC %>% summary()
##
## Call:
## lm(formula = nrOfWins.sqr ~ Action + Adventure + Animation +
## Comedy + Crime + Documentary + Family + Fantasy + Musical +
## Sport + Thriller + year.log + nrOfNominations.sqr + ratingInteger_8 +
## ratingInteger_9 + ratingInteger_7 + ratingInteger_3, data = train_num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6783 -0.2975 -0.0806 0.2379 5.2776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.30080 0.03954 32.900 < 2e-16 ***
## Action -0.09596 0.04243 -2.262 0.023800 *
## Adventure -0.16690 0.04962 -3.364 0.000782 ***
## Animation -0.24769 0.08371 -2.959 0.003119 **
## Comedy -0.05812 0.03571 -1.628 0.103743
## Crime -0.06572 0.04167 -1.577 0.114915
## Documentary 0.29848 0.10263 2.908 0.003671 **
## Family -0.17882 0.06086 -2.938 0.003335 **
## Fantasy -0.16028 0.06083 -2.635 0.008477 **
## Musical -0.32488 0.12400 -2.620 0.008849 **
## Sport -0.15691 0.08348 -1.879 0.060307 .
## Thriller -0.16025 0.04687 -3.419 0.000641 ***
## year.log -0.06703 0.01646 -4.072 4.83e-05 ***
## nrOfNominations.sqr 1.17918 0.01832 64.367 < 2e-16 ***
## ratingInteger_8 0.72878 0.05804 12.557 < 2e-16 ***
## ratingInteger_9 1.13361 0.37559 3.018 0.002571 **
## ratingInteger_7 0.09801 0.03670 2.671 0.007626 **
## ratingInteger_3 0.53320 0.26342 2.024 0.043075 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7395 on 2246 degrees of freedom
## Multiple R-squared: 0.7616, Adjusted R-squared: 0.7598
## F-statistic: 422.1 on 17 and 2246 DF, p-value: < 2.2e-16
#nrOfWins.sqr ~ Adventure + Biography + Crime + Documentary + Drama + Family + Music + Romance + SciFi + War + Western + year.log + duration.log + Popularity.log + lifetime_gross.log + nrOfNominations.sqr + ratingInteger_8 + ratingInteger_9 + ratingInteger_3 + nrOfGenre_2 + nrOfGenre_1
#variable importance sorted from highest to lowest
varImp_sort <- function(mod){
imp <- as.data.frame(varImp(mod))
imp <- data.frame(names = rownames(imp),
overall = imp$Overall)
imp[order(imp$overall,decreasing = T),]
}
varImp_sort(lmodel_AIC)
## names overall
## 13 nrOfNominations.sqr 64.367270
## 14 ratingInteger_8 12.556771
## 12 year.log 4.071592
## 11 Thriller 3.418661
## 2 Adventure 3.363628
## 15 ratingInteger_9 3.018198
## 3 Animation 2.959006
## 7 Family 2.938143
## 6 Documentary 2.908228
## 16 ratingInteger_7 2.670608
## 8 Fantasy 2.634791
## 9 Musical 2.620138
## 1 Action 2.261886
## 17 ratingInteger_3 2.024129
## 10 Sport 1.879488
## 4 Comedy 1.627632
## 5 Crime 1.577093
model_performance(lmodel_AIC, metrics=c("AIC","R2","RMSE"))
## AIC R2 R2_adjusted RMSE
## 1 5078.372 0.7616017 0.7597972 0.736547
R-squared: 76.08%
#best model by p-value
lmodel_p <- lm(nrOfWins.sqr ~ Action + Adventure + Animation + Documentary + Family + Fantasy + Musical + Thriller + year.log + nrOfNominations.sqr + ratingInteger_8 + ratingInteger_9 + ratingInteger_7 + ratingInteger_3, data=train_num)
summary(lmodel_p)
##
## Call:
## lm(formula = nrOfWins.sqr ~ Action + Adventure + Animation +
## Documentary + Family + Fantasy + Musical + Thriller + year.log +
## nrOfNominations.sqr + ratingInteger_8 + ratingInteger_9 +
## ratingInteger_7 + ratingInteger_3, data = train_num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6465 -0.2846 -0.0875 0.2446 5.2499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.24363 0.03007 41.355 < 2e-16 ***
## Action -0.09489 0.04077 -2.328 0.02002 *
## Adventure -0.14146 0.04863 -2.909 0.00366 **
## Animation -0.24827 0.08371 -2.966 0.00305 **
## Documentary 0.32374 0.10165 3.185 0.00147 **
## Family -0.17522 0.06080 -2.882 0.00399 **
## Fantasy -0.14238 0.06046 -2.355 0.01861 *
## Musical -0.30486 0.12381 -2.462 0.01388 *
## Thriller -0.12759 0.04371 -2.919 0.00355 **
## year.log -0.06741 0.01647 -4.093 4.4e-05 ***
## nrOfNominations.sqr 1.18655 0.01812 65.472 < 2e-16 ***
## ratingInteger_8 0.73589 0.05730 12.842 < 2e-16 ***
## ratingInteger_9 1.15512 0.37548 3.076 0.00212 **
## ratingInteger_7 0.09993 0.03624 2.758 0.00587 **
## ratingInteger_3 0.51859 0.26350 1.968 0.04918 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7402 on 2249 degrees of freedom
## Multiple R-squared: 0.7608, Adjusted R-squared: 0.7593
## F-statistic: 510.9 on 14 and 2249 DF, p-value: < 2.2e-16
varImp_sort(lmodel_p)
## names overall
## 10 nrOfNominations.sqr 65.472472
## 11 ratingInteger_8 12.841734
## 9 year.log 4.093424
## 4 Documentary 3.184796
## 12 ratingInteger_9 3.076355
## 3 Animation 2.965857
## 8 Thriller 2.918901
## 2 Adventure 2.908852
## 5 Family 2.881881
## 13 ratingInteger_7 2.757768
## 7 Musical 2.462326
## 6 Fantasy 2.354938
## 1 Action 2.327691
## 14 ratingInteger_3 1.968060
model_performance(lmodel_p, metrics=c("AIC","R2","RMSE"))
## AIC R2 R2_adjusted RMSE
## 1 5079.953 0.760802 0.759313 0.7377813
R-squared: 76.08%
#best model by important features
lmodel_i <- lm(nrOfWins.sqr ~ Action + Biography + War + Adventure + Animation + Documentary + Family + Fantasy + Musical + Thriller + year.log + nrOfNominations.sqr + ratingInteger_8 + ratingInteger_9 + Popularity.log, data=train_num)
summary(lmodel_i)
##
## Call:
## lm(formula = nrOfWins.sqr ~ Action + Biography + War + Adventure +
## Animation + Documentary + Family + Fantasy + Musical + Thriller +
## year.log + nrOfNominations.sqr + ratingInteger_8 + ratingInteger_9 +
## Popularity.log, data = train_num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7362 -0.2886 -0.0887 0.2452 5.2733
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.28665 0.02418 53.206 < 2e-16 ***
## Action -0.10255 0.04149 -2.472 0.013518 *
## Biography 0.11062 0.06860 1.613 0.106968
## War 0.15325 0.10614 1.444 0.148924
## Adventure -0.14242 0.04880 -2.919 0.003549 **
## Animation -0.22866 0.08379 -2.729 0.006403 **
## Documentary 0.35859 0.10397 3.449 0.000573 ***
## Family -0.18839 0.06066 -3.105 0.001923 **
## Fantasy -0.13555 0.06081 -2.229 0.025914 *
## Musical -0.28054 0.12486 -2.247 0.024742 *
## Thriller -0.12513 0.04448 -2.813 0.004948 **
## year.log -0.07228 0.01689 -4.279 1.96e-05 ***
## nrOfNominations.sqr 1.19355 0.01833 65.123 < 2e-16 ***
## ratingInteger_8 0.65351 0.05220 12.519 < 2e-16 ***
## ratingInteger_9 1.09466 0.37544 2.916 0.003585 **
## Popularity.log 0.01135 0.01862 0.609 0.542377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7414 on 2248 degrees of freedom
## Multiple R-squared: 0.7601, Adjusted R-squared: 0.7585
## F-statistic: 474.9 on 15 and 2248 DF, p-value: < 2.2e-16
varImp_sort(lmodel_i)
## names overall
## 12 nrOfNominations.sqr 65.1232511
## 13 ratingInteger_8 12.5190632
## 11 year.log 4.2785257
## 6 Documentary 3.4491320
## 7 Family 3.1054819
## 4 Adventure 2.9187415
## 14 ratingInteger_9 2.9156455
## 10 Thriller 2.8131797
## 5 Animation 2.7289316
## 1 Action 2.4717679
## 9 Musical 2.2469220
## 8 Fantasy 2.2289763
## 2 Biography 1.6126184
## 3 War 1.4438378
## 15 Popularity.log 0.6093147
model_performance(lmodel_p, metrics=c("AIC","R2","RMSE"))
## AIC R2 R2_adjusted RMSE
## 1 5079.953 0.760802 0.759313 0.7377813
R-squared: 80.17 %
#building a decision tree
dtModel <- rpart(nrOfWins.sqr~.-nrOfWins, data = train, method="anova", cp=.001)
#view results
plotcp(dtModel) #visualize cross validation results
varImp_sort(dtModel) #view variable importance
## names overall
## 13 nrOfNominations.sqr 2.72303897
## 15 ratingInteger 1.64837122
## 18 year.log 1.57368578
## 9 duration.log 1.54856353
## 2 Adventure 1.06531127
## 8 Drama 0.98672139
## 14 Popularity.log 0.90774993
## 3 Animation 0.71807295
## 16 Romance 0.37244225
## 12 nrOfGenre 0.31369780
## 1 Action 0.29138194
## 7 Documentary 0.25872023
## 17 Thriller 0.18495035
## 4 Biography 0.15721218
## 5 Comedy 0.12583402
## 6 Crime 0.04990904
## 10 Fantasy 0.03593187
## 11 Musical 0.01555749
## 19 Family 0.00000000
## 20 Horror 0.00000000
## 21 Music 0.00000000
## 22 Mystery 0.00000000
## 23 News 0.00000000
## 24 SciFi 0.00000000
## 25 Short 0.00000000
## 26 Sport 0.00000000
## 27 War 0.00000000
## 28 Western 0.00000000
printcp(dtModel) #display results
##
## Regression tree:
## rpart(formula = nrOfWins.sqr ~ . - nrOfWins, data = train, method = "anova",
## cp = 0.001)
##
## Variables actually used in tree construction:
## [1] Action Adventure Documentary
## [4] Drama duration.log nrOfGenre
## [7] nrOfNominations.sqr Popularity.log ratingInteger
## [10] year.log
##
## Root node error: 5152/2264 = 2.2756
##
## n= 2264
##
## CP nsplit rel error xerror xstd
## 1 0.4433977 0 1.00000 1.00098 0.049720
## 2 0.1180394 1 0.55660 0.59417 0.031568
## 3 0.1147302 2 0.43856 0.41742 0.025002
## 4 0.0385680 3 0.32383 0.36435 0.023607
## 5 0.0234411 4 0.28526 0.31046 0.016329
## 6 0.0109068 5 0.26182 0.28589 0.014945
## 7 0.0089815 6 0.25092 0.27505 0.014540
## 8 0.0033703 7 0.24194 0.26154 0.014258
## 9 0.0032102 9 0.23519 0.26095 0.014439
## 10 0.0032071 10 0.23198 0.26022 0.014391
## 11 0.0028306 11 0.22878 0.25964 0.014416
## 12 0.0027696 12 0.22595 0.25807 0.014297
## 13 0.0023371 13 0.22318 0.25602 0.014012
## 14 0.0023242 14 0.22084 0.25635 0.014049
## 15 0.0022595 15 0.21852 0.25648 0.014064
## 16 0.0018168 16 0.21626 0.25133 0.013882
## 17 0.0015824 18 0.21262 0.25347 0.014268
## 18 0.0015476 19 0.21104 0.25133 0.014172
## 19 0.0015176 20 0.20949 0.25149 0.014190
## 20 0.0013843 21 0.20798 0.25380 0.014951
## 21 0.0013237 22 0.20659 0.25495 0.015035
## 22 0.0012868 23 0.20527 0.25406 0.015015
## 23 0.0011940 24 0.20398 0.25408 0.014860
## 24 0.0011877 25 0.20279 0.25783 0.015078
## 25 0.0011719 26 0.20160 0.25857 0.015080
## 26 0.0010611 27 0.20043 0.26011 0.015089
## 27 0.0010000 29 0.19830 0.25771 0.014955
#plot variable importance
impvar <- varImp_sort(dtModel) %>% filter(overall>0) %>% arrange(overall) %>% mutate(names = factor(names, unique(names)))
impvar %>% ggplot(aes(names, overall))+
geom_bar(stat = "identity")+
coord_flip()+
labs(title="Regression Tree Variable Importance")
#gather R-squared value
tmp <- printcp(dtModel)
##
## Regression tree:
## rpart(formula = nrOfWins.sqr ~ . - nrOfWins, data = train, method = "anova",
## cp = 0.001)
##
## Variables actually used in tree construction:
## [1] Action Adventure Documentary
## [4] Drama duration.log nrOfGenre
## [7] nrOfNominations.sqr Popularity.log ratingInteger
## [10] year.log
##
## Root node error: 5152/2264 = 2.2756
##
## n= 2264
##
## CP nsplit rel error xerror xstd
## 1 0.4433977 0 1.00000 1.00098 0.049720
## 2 0.1180394 1 0.55660 0.59417 0.031568
## 3 0.1147302 2 0.43856 0.41742 0.025002
## 4 0.0385680 3 0.32383 0.36435 0.023607
## 5 0.0234411 4 0.28526 0.31046 0.016329
## 6 0.0109068 5 0.26182 0.28589 0.014945
## 7 0.0089815 6 0.25092 0.27505 0.014540
## 8 0.0033703 7 0.24194 0.26154 0.014258
## 9 0.0032102 9 0.23519 0.26095 0.014439
## 10 0.0032071 10 0.23198 0.26022 0.014391
## 11 0.0028306 11 0.22878 0.25964 0.014416
## 12 0.0027696 12 0.22595 0.25807 0.014297
## 13 0.0023371 13 0.22318 0.25602 0.014012
## 14 0.0023242 14 0.22084 0.25635 0.014049
## 15 0.0022595 15 0.21852 0.25648 0.014064
## 16 0.0018168 16 0.21626 0.25133 0.013882
## 17 0.0015824 18 0.21262 0.25347 0.014268
## 18 0.0015476 19 0.21104 0.25133 0.014172
## 19 0.0015176 20 0.20949 0.25149 0.014190
## 20 0.0013843 21 0.20798 0.25380 0.014951
## 21 0.0013237 22 0.20659 0.25495 0.015035
## 22 0.0012868 23 0.20527 0.25406 0.015015
## 23 0.0011940 24 0.20398 0.25408 0.014860
## 24 0.0011877 25 0.20279 0.25783 0.015078
## 25 0.0011719 26 0.20160 0.25857 0.015080
## 26 0.0010611 27 0.20043 0.26011 0.015089
## 27 0.0010000 29 0.19830 0.25771 0.014955
rsq.val <- 1-tmp[nrow(tmp),3] #extract R-squared from final split
cat("\nR-squared: ", round(rsq.val*100,2),"% \n")
##
## R-squared: 80.17 %
#plot tree
rpart.plot(dtModel, extra=1, varlen=-1, digits=1, main="Regression Tree for Movie Awards")
R-squared: 76.76%
set.seed(123)
rfModel <- randomForest(nrOfWins.sqr ~ .-nrOfWins, ntree=400, mtry=15, data=train, importance=TRUE, corr.bias=TRUE)
rfModel
##
## Call:
## randomForest(formula = nrOfWins.sqr ~ . - nrOfWins, data = train, ntree = 400, mtry = 15, importance = TRUE, corr.bias = TRUE)
## Type of random forest: regression
## Number of trees: 400
## No. of variables tried at each split: 15
##
## Mean of squared residuals: 0.5288754
## % Var explained: 76.76
## Bias correction applied:
## Intercept: -0.0122175
## Slope: 1.003082
plot(rfModel)
varImp_sort(rfModel)
## names overall
## 28 nrOfNominations.sqr 129.24593114
## 1 ratingInteger 35.29635020
## 27 Popularity.log 16.65149160
## 25 year.log 15.66764158
## 9 Documentary 11.80645162
## 26 duration.log 11.42987778
## 10 Drama 9.65046160
## 3 Action 8.99498244
## 4 Adventure 7.75123052
## 7 Comedy 6.66576224
## 13 Horror 6.34186205
## 24 Western 5.02248371
## 11 Family 4.46829261
## 8 Crime 4.15115262
## 18 Romance 3.54580046
## 2 nrOfGenre 2.54821846
## 22 Thriller 2.22670331
## 23 War 2.11520398
## 12 Fantasy 1.90159669
## 19 SciFi 1.80570831
## 5 Animation 1.18555968
## 15 Musical 1.10543935
## 21 Sport 0.43425805
## 16 Mystery 0.04092723
## 17 News 0.00000000
## 20 Short 0.00000000
## 14 Music -0.01011599
## 6 Biography -0.73679521
varImpPlot(rfModel, n.var=10, main="Random Forest Top 10 Important Variables")
Performance is being measured by R-squared. This value is derived from using the square of the correlation between the actual values and predicted values.
The following R-squared values were calculated using the test set: * lmodel_AIC - 69.27% * lmodel_p - 69.57% * lmodel_i - 69.77% * dtModel - 61.48% * rfModel - 64.85%
#prediction and accuracy function
predict_score <- function(mod, data){
mod_name <- deparse(substitute(mod))
#predict results
data <- data %>% mutate(score = predict(mod, newdata=data), #predict results using the model
resids = nrOfWins.sqr - score,
predictedwins = round(score*score)) #unsquare the results and round to whole number- there cannot be partial awards
#return accuracy
df_cor <- data %>% select(predictedwins,nrOfWins)
corr <- cor(df_cor)^2
corr[lower.tri(corr,diag=TRUE)] <- NA #Prepare to drop duplicates and correlations of 1
corr <- as.data.frame(as.table(corr)) #Turn into a 3-column table
corr <- na.omit(corr) #remove the NA values from above
cat(mod_name, "R-squared: ", round(corr$Freq*100, 2),"% \n") #print accuracy
#plot results
h <- hist(data$resids, breaks = "FD", plot = FALSE) #histogram with Freedman-Diaconis rule for binwidth
hd <- data %>% ggplot(aes(resids, ..density..))+
geom_histogram(breaks=h$breaks)+
geom_density(color="red", size=1)+
labs(title=str_c(mod_name, " histogram and density plot for residuals"), x="Residual value", subtitle="using test set")
qq <- data %>% ggplot(aes(sample=resids))+
geom_qq()+
geom_qq_line(color="red")+
labs(title=str_c(mod_name, " quantile-quantile Normal plot of residuals"), subtitle="using test set")
fit <- data %>% ggplot(aes(score, resids))+
geom_point()+
geom_smooth(method="loess", color="red")+
labs(title=str_c(mod_name, " residuals vs. fitted values"), x="Fitted values", y="Residuals", subtitle="using test set")
hd %>% print()
qq %>% print()
fit %>% print()
}
#predict with the test data
AIC_pred <- predict_score(lmodel_AIC, test_num)
## lmodel_AIC R-squared: 69.27 %
p_pred <- predict_score(lmodel_p, test_num)
## lmodel_p R-squared: 69.57 %
i_pred <- predict_score(lmodel_i, test_num)
## lmodel_i R-squared: 69.77 %
dt_pred <- predict_score(dtModel, test)
## dtModel R-squared: 61.48 %
rf_pred <- predict_score(rfModel, test)
## rfModel R-squared: 64.85 %
lmodel_i is the best model with an R-squared of 69.77% against the test data. It shows that nrOfWins.sqr is modeled by nrOfNominations.sqr, ratingInteger_8, year.log, Documentary, Family, Adventure, ratingInteger_9, Thriller, Animation, Action, Musical, Fantasy, Biography, War, and Popularity.log. The residuals compared to fitted values are in a fairly straight line and close to 0 except at the higher end. The residuals plot is a fairly normal distribution. The quantile quantile plot is reasonably straight, except at the ends where there is more noise.
All of the other models have very similar results with the decision tree being the worst at 61.28%.
To improve accuracy results, nrOfWins could have been turned into a binary variable for whether an award was earned or not. Doing a really quick generalized linear model of this (not shown since it is out of scope of this project) resulted in ~95% accuracy against the test set. This project is predicting number of awards, however, and not whether an award was received or not.
Another thing that could have been done to improve the models would be to gather more movie data including actors/actresses, producer, etc. These basic movie stats are likely not enough to explain all the variance with number of movie awards.