Movie Profitability


This is a case study of a dataset of 45,000 movies with metadata taken from MoviesLens.

The movie metadata was analyzed using Excel, R, and Tableau to identify trends among highly profitable movies based on movie genre and language.

Data Cleaning 

The dataset was downloaded from Kaggle, see  The dataset includes over 45,000 movies and their metadata.  This includes user ratings, genres, runtimes, release dates, spoken languages, among other data points.

To clean the dataset, I used Excel to navigate the file.  Since it was only about 45,000, Excel was able to handle the dataset with ease.   The metadata listed ISO 639-1 language codes for the spoken language of each movie.  I used VLOOKUP() to find the corresponding language for each code.  I deleted any rows with no listed language.  The dataset also listed movies in various stages of production.  I'm looking for profitability trends, so I deleted any movies that were canceled, in production, planned, in post production, or rumored.  Only released movies are relevant.  Also, since I wanted to look into the correlation between runtime and profitability, I deleted movies with 0 or blank runtimes.  I also manually checked movies over 400 minutes, finding miniseries and various film collections.  These won't count as single movies, so I removed them from the dataset too.


After cleaning, I exported the dataset into R to conduct my analysis. See my R Markdown notebook for details of the steps taken in R to analyze the data.  I used the Tidyverse and some of its packages, such as Tidyr and Dplyr, in order to clean and analyze the data. There were many movies in the dataset with either no budget or no revenue listed.  In order to properly analyze movie profitability trends, I filtered these out of the dataset.  I then looked for the movies with the highest budget and highest revenue, so I grabbed the top 100 of each category.

Finally, to look into the correlations between movie genre, movie language, and profitability, I grouped the dataset accordingly.  The Dplyr package in R was very useful for this, since I was able to use filter(), group_by(), and summarize() functions in order to group movies together by genre or language, then extract statistics from those groups, such as average ratings, budgets, and revenues.

In R, I created sample visualizations from the gathered data to try and see if I could identify certain trends.  I generated histograms of the dataset on various parameters, such as runtimes, ratings, and release years.  I also plotted the top 100 movies for budget and revenue as a scatter plot to see the spread of data. After concluding the analysis in R, I exported my findings to Tableau in order to visualize the data.


In Tableau, I created an interactive dashboard to visualize the movie metadata.  The first page of the dashboard shows the movie genre data.  I included a pie chart to convey the amounts of each genre contained in the dataset.  The pie chart is accompanied by a panel showing all statistics for each genre, including the number of movies for each genre, the average budget, revenue,  popularity, rating, and runtime.  Finally, a bar chart visualizating every statistic lines the bottom of the dashboard.  The viewer can choose which statistic is visualized using the filters on the right.

The next page in the dashboard visualizes the movie data by spoken language.  This page follows the same structure as the first page.  There is a pie chart, a table with all statistics, and a bar chart visualization.  Finally, there is a scatter plot of the top 100 movies with the highest budget, plotting each movie's budget with regards to its revenue.


I displayed my findings in a Google Slides presentationFirst, I walked through my findings in movie genre trends.  The highest-rated genres turned out to be western, history, and documentary movies.  However, when looking at budgets and revenue, these genres were nowhere near the top.  Animation, adventure, and family movies had the highest budgets overall.  Inversely, the higher-rated genres actually had lower budgets on average.  The same three highest-budget genres also had the highest revenues, with animation movies taking the throne.

Next, I looked at the correlation between spoken language and move profitability.  Movies in Russian had the lowest average ratings at 5.48, not factoring in the one Vietnamese movie in the dataset with a 5.0 rating.  As expected, movies in English had the highest budgets, with Chinese and Japanese movies trailing behind them.  Continuing this trend, English, Chinese and Japanese movies also had the highest revenues to justify their larger budgets.  

My analysis of the data was eye-opening.  It became clear that higher ratings for movies did not translate to higher revenues.  In fact, the highest rated genres of movies made some of the lowest revenues.  Along with that, animated movies reigned supreme in profitability with regards to runtime.  They made the highest revenue while boasting the lowest runtimes among any movie.  Another movie market investors could look into is the Asian movie market.  Chinese, Japanese, and even Indian movies trailed English movies in profitability.


An R Markdown notebook detailing my analysis process in R.

An interactive Tableau dashboard to browse the visualized data.

A presentation to display my findings.