Rotten Tomatoes ratings for 30,000+ movies explained with Machine Learning
SHAP values for director, genre, rating, and more
In this article, I use a dataset with extensive information on about 140,000 unique movies from the Rotten Tomatoes website, collected as of April 2023. The dataset is publicly available on Kaggle. Full details of the analysis can be found in this public Kaggle notebook.
Step 1 — data preprocessing
Here, data preprocessing consists of the following steps:
selecting movies with identified labels — rating scores from both users (audience score) and professional critics (tomato-meter score) — scaled from 0 to 100;
converting movie release dates to decades;
grouping movie runtime lengths into larger (20-minute) bins;
extracting the movie genre, director, and sound mix columns and encoding only those values that appear in at least 25 records in the dataset;
removing unused columns;
finally, encoding the remaining categorical variables (movie ratings, distributors, original languages, runtime lengths, and release decades) so that each column keeps no more than 60 different categories, with at least 100 records in each category.
The result is a dataset of more than 30,000 movies that have ratings from both users and professional critics. This dataset is used in the subsequent analysis.
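The preprocessing steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from the notebook: the function names, the bin labels, the `"Other"` label for rare categories, and the toy values are all hypothetical.

```python
from collections import Counter


def release_decade(year):
    """Map a release year to a decade label, e.g. 1994 -> '1990s'."""
    return f"{year // 10 * 10}s"


def runtime_bin(minutes, width=20):
    """Group a runtime in minutes into a fixed-width bin, e.g. 95 -> '80-99'."""
    low = minutes // width * width
    return f"{low}-{low + width - 1}"


def encode_rare(values, min_count, max_categories=None, other="Other"):
    """Keep frequent categories and replace the rest with a shared label.

    A category is kept if it is among the `max_categories` most frequent
    ones and has at least `min_count` records (assumed handling).
    """
    counts = Counter(values)
    keep = {c for c, n in counts.most_common(max_categories) if n >= min_count}
    return [v if v in keep else other for v in values]


# Hypothetical toy column, not taken from the real dataset
distributors = ["A", "A", "A", "B", "B", "C"]
encoded = encode_rare(distributors, min_count=2, max_categories=60)
print(release_decade(1994), runtime_bin(95), encoded)
```

On the real data, the thresholds from the list above (25 or 100 records, at most 60 categories) would replace the toy `min_count=2`.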
Step 2 — setting a Machine Learning model to predict user rating scores
The data prepared in the previous step are randomly split into training and test samples and modelled with CatBoostRegressor, which handles categorical features natively. The root mean squared error (RMSE) of the resulting model is about 18.3 percentage points, an improvement over the baseline RMSE of about 20.9 points (obtained by predicting the same score of about 62.2 points for every movie).
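To make the comparison with the baseline concrete, here is a minimal sketch of how an RMSE and a constant-prediction baseline can be computed. It assumes the baseline constant is the mean score of the training sample; the helper names are hypothetical, and only the two RMSE figures at the end come from the text.

```python
import math
from statistics import mean


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted scores."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def baseline_rmse(y_train, y_test):
    """RMSE of a model that predicts the training mean for every movie."""
    constant = mean(y_train)
    return rmse(y_test, [constant] * len(y_test))


# Relative improvement implied by the RMSE values reported above
model_rmse, base_rmse = 18.3, 20.9
improvement = (base_rmse - model_rmse) / base_rmse
print(f"{improvement:.1%}")  # 12.4%
```

In other words, the reported model reduces the error by roughly 12% relative to always predicting the average score.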
Step 3 — explanation of the obtained Machine Learning model
Here, we use the SHapley Additive exPlanations (SHAP) method, one of the most common approaches to explaining Machine Learning models. SHAP values are additive contributions of each feature to the predicted score, so they are expressed in the same units as the target, i.e. percentage points.
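To illustrate why SHAP values share the units of the predicted score, the sketch below computes exact Shapley values for a tiny toy model. It is a simplification under stated assumptions: the `predict` function and its coefficients are invented, and a single reference row stands in for the background data, whereas SHAP libraries average over a dataset.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, background):
    """Exact Shapley values for one instance against one reference row.

    Features outside a coalition take their value from `background`.
    """
    n = len(instance)

    def value(coalition):
        row = [instance[i] if i in coalition else background[i] for i in range(n)]
        return predict(row)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def predict(row):
    """Hypothetical score model with two binary features and an interaction."""
    g, r = row
    return 60 + 5 * g + 3 * r - 2 * g * r


instance, background = [1, 1], [0, 0]
phis = shapley_values(predict, instance, background)
print(phis, predict(background) + sum(phis), predict(instance))
```

The contributions sum to the difference between the prediction for the instance and the prediction for the reference row, which is why each of them is measured in score points.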
First, we look at the range of SHAP values for the top features of interest:
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bb5ec6d-69b6-4d47-8742-e43ea9034ee1_776x930.png)
As we can see, the most important features for predicting user rating scores of Rotten Tomatoes movies are the movie genre, runtime, and release date.
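One common way to rank features from a table of SHAP values is the mean absolute SHAP value per feature. The sketch below assumes this ranking rule, which the article does not state explicitly, and uses made-up numbers rather than values from the actual model.

```python
def rank_features(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value, largest first."""
    n = len(shap_rows)
    importance = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical SHAP values: one row per movie, one column per feature
feature_names = ["genre", "runtime", "rating"]
shap_rows = [
    [4.0, -2.0, 0.5],
    [-6.0, 1.0, -0.5],
]
ranking = rank_features(shap_rows, feature_names)
print(ranking)
```

Taking absolute values matters here: a feature that pushes some predictions up and others down would otherwise look unimportant on average.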
Now, we look at individual features.
About movie genres, the highest audience scores are associated with Stand-up, Documentary, Anime, and Animation movies:
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f1abe97-d21a-47a6-955f-6c62211a51d7_800x685.png)
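Per-category comparisons like the one in the plot above can be obtained by averaging the SHAP values of a feature within each of its categories. The following is a minimal sketch with hypothetical values, not the notebook's plotting code.

```python
from collections import defaultdict


def mean_shap_by_category(categories, shap_values):
    """Average the SHAP values of one feature within each category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, value in zip(categories, shap_values):
        sums[category] += value
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


# Hypothetical genre labels and their SHAP values for four movies
genres = ["Documentary", "Horror", "Documentary", "Horror"]
genre_shap = [6.0, -4.0, 8.0, -2.0]
by_genre = mean_shap_by_category(genres, genre_shap)
print(by_genre)
```

A positive average means the category tends to raise the predicted score, and a negative one means it tends to lower it.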
Regarding movie directors, remarkably, the highest audience scores are associated with Akira Kurosawa, Ingmar Bergman, Rainer Werner Fassbinder, Ridley Scott, Steven Spielberg, and Billy Wilder:
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F220f3266-57c3-49de-9579-7a25b44b1b31_800x727.png)
About movie ratings, the highest audience scores are associated with the PG (Parental Guidance) rating:
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30663af0-ccd2-4626-994f-b058b04037f3_800x619.png)
Regarding the movie release date, the highest audience scores are associated with the movies released from the 1920s to 1960s:
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F524a2d2b-ba9b-4c4c-bd10-3ac45f67eb26_800x619.png)
About movie runtimes, remarkably, the highest audience scores are associated with movies having 150–190 minute runtimes:
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a5e3b34-b891-40ea-a408-8f578c6b6fed_800x619.png)
Finally, about movie distributors, the highest audience scores are associated with the movies distributed by Sony Pictures Classics, followed by New Yorker Films, Netflix, United Artists, Focus Features, Walt Disney, and Miramax Films:
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62f0c75b-956a-4324-81c4-4abcce36c958_800x742.png)
Step 4 — modelling of ratings from critics and their explanations in terms of SHAP values
Here, I take a closer look at the other rating provided in the dataset: the average rating from some of the world’s most respected critics (also known as the tomato-meter score). Like the audience scores, these scores range from 0 to 100.
About movie genres, the highest critic scores are associated with Documentary, Anime, and Stand-up genres:
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F601581ab-da40-4958-957f-c42f7f33827e_800x680.png)
Regarding movie directors, remarkably, the highest critic scores are associated with Stephen Frears, followed by Steven Spielberg, Spike Lee, Jonathan Demme, and Claude Chabrol:
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feaf5e88f-3e8c-4dfd-b5dc-f0ef0f1c15df_800x727.png)
About movie ratings, the highest critic scores are associated with movies that have no rating, followed by the R (Restricted) rating:
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ae310a7-081d-4103-8268-52339df845a2_800x619.png)
Regarding the movie release date, the highest critic scores are associated with the movies released from the 1920s to 1940s:
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0f43f1-e525-4cab-add3-d78821a27839_800x619.png)
About movie runtimes, remarkably, the highest critic scores are associated with movies having 30–50 minute runtimes:
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac016d4f-ae80-4869-bc9e-05d63664f0c2_800x619.png)
Finally, about movie distributors, the highest critic scores are associated with the movies distributed by Music Box Films, followed by Sony Pictures Classics, A24, and Kino Lorber:
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b7d817-aafb-4f02-976c-21b2df8e0eb0_800x740.png)
I hope these results can be useful for you. In case of questions/comments, do not hesitate to write in the comments below or reach me directly through LinkedIn or Twitter.