Movies play an important role in today's popular culture and media, as they are one of the largest forms of entertainment and one of the most profitable industries in our society.
From large movie studios to movie analysts, many people have tried to analyze what makes a given movie earn more money.
In this tutorial, we shall find a few of the most important features of movies, and use those features to predict how much money a movie will make. We will first collect our data, tidy it, explore possible trends, then perform an analysis and make prediction models. And not only will we see that there are some features that are very strong predictors of movie success, but a simple linear model can outperform machine learning algorithms like random forests.
We will be using the following libraries
pandas - For storing datanumpy - For manipulating datare - For regex string processingnumpy - For data tidyingmatplotlib - For plotting datasklearn - For data analysisscipy - For data analysis testwarnings - For warningsimport pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from sklearn import linear_model
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import scipy.stats as stats
import warnings
To begin the first part of the data science lifecycle, we shall start by using a Kaggle dataset found below, and then cleaning up some of the data it was in. Since this was a prepared dataset, it would seem like we'd have little work to do, but there were a few quirks to this dataset which required some work. We shall describe below.
We will be using data from Kaggle, which is one of the most popular online sources for datasets for data science and machine learning.
Load the .csv files which we downloaded from kaggle, and display the first few rows to have a general understanding of the data.
df = pd.read_csv('tmdb_5000_movies.csv')
cast = pd.read_csv('tmdb_5000_credits.csv')
display(df.head())
display(cast.head())
Join the two tables by movie title, so that all of the data is in one table.
df = df.set_index('original_title').join(cast.set_index('title'))
df = df.reset_index()
display(df.head())
As seen in the table above, some of the columns (ie. genres, production companies, production countries, spoken languages, cast, and crew) have more than one element per cell. We will be converting those cells into data structures to make it easier to access all of the data in those cells.
First we will modify the genres. As shown below, each movie can have multiple genres, and each genre contains an id and name. We will make it so that for each movie, all of their genres are stored in an array, and each genre element will be a dictionary with keys for its id and name.
df.iloc[0]["genres"]
#go through each row
for index, row in df.iterrows():
genres = []
line = row["genres"]
cur_genres = line.split('},')
del cur_genres[-1]
#transform info for each genre
for genre in cur_genres:
m = re.search("(\d+), \"name\": \"(\w+)", genre)
cur_id = m.groups()[0]
cur_name = m.groups()[1]
cur_info = {}
cur_info["id"] = cur_id
cur_info["name"] = cur_name
genres.append(cur_info)
df.at[index, "genres"] = genres
Next we will modify the production companies. Each movie can have multiple production companies, and each production company has a name and id. We will make it so that for each movie, all of their production companies are stored in an array, and each production company element will be a dictionary with keys for its name and id.
df.iloc[0]["production_companies"]
#go through each row
for index, row in df.iterrows():
p_comps = []
line = row["production_companies"]
cur_p_comps = line.split('},')
del cur_p_comps[-1]
#transform info for each genre
for p_comp in cur_p_comps:
m = re.search("name\": \"(.+)\", \"id\": (\d+)", p_comp)
cur_name = m.groups()[0]
cur_id = m.groups()[1]
cur_info = {}
cur_info["name"] = cur_name
cur_info["id"] = cur_id
p_comps.append(cur_info)
df.at[index, "production_companies"] = p_comps
Now we will modify the production countries. Each movie can have multiple production countries, and each production country has an iso_3166_1 and a name.
df.iloc[0]["production_countries"]
#go through each row
for index, row in df.iterrows():
p_countries = []
line = row["production_countries"]
cur_p_countries = line.split('},')
del cur_p_countries[-1]
#transform info for each genre
for p_country in cur_p_countries:
m = re.search("\"iso_3166_1\": \"(.+)\", \"name\": \"(.+)\"", p_country)
cur_iso_3166_1 = m.groups()[0]
cur_name = m.groups()[1]
cur_info = {}
cur_info["iso_3166_1"] = cur_iso_3166_1
cur_info["name"] = cur_name
p_countries.append(cur_info)
df.at[index, "production_countries"] = p_countries
After this, we will modify the spoken languages. Each movie can have multiple spoken languages, and each spoken language has an iso_639_1 and a name.
df.iloc[0]["spoken_languages"]
#go through each row
for index, row in df.iterrows():
s_langs = []
line = row["spoken_languages"]
cur_s_langs = line.split('},')
del cur_s_langs[-1]
#transform info for each genre
for s_lang in cur_s_langs:
m = re.search("\"iso_639_1\": \"(.*)\", \"name\": \"(.*)\"", s_lang)
cur_iso_639_1 = m.groups()[0]
cur_name = m.groups()[1]
cur_info = {}
cur_info["iso_639_1"] = cur_iso_639_1
cur_info["name"] = cur_name
s_langs.append(cur_info)
df.at[index, "spoken_languages"] = s_langs
display(df.head())
Subsequently, we will modify the cast. Each movie has multiple cast members, and each member has a cast_id, a character, a credit_id, a gender, an id, a name, and an order. We will convert this string into an array of hashes.
df.iloc[0]["cast"]
#go through each row
for index, row in df.iterrows():
casts = []
line = str(row["cast"])
cur_casts = line.split('},')
del cur_casts[-1]
#transform info for each genre
for cast in cur_casts:
m = re.search("cast_id\": (\d*), \"character\": \"(.*)\", \"credit_id\": \"(.*)\", \"gender\": (.*), \"id\": (\d*), \"name\": \"(.*)\", \"order\": (.*)", cast)
cur_cast_id = m.groups()[0]
cur_character = m.groups()[1]
cur_credit_id = m.groups()[2]
cur_gender = m.groups()[3]
cur_id = m.groups()[4]
cur_name = m.groups()[5]
cur_order = m.groups()[6]
cur_info = {}
cur_info["cast_id"] = cur_cast_id
cur_info["character"] = cur_character
cur_info["credit_id"] = cur_credit_id
cur_info["gender"] = cur_gender
cur_info["id"] = cur_id
cur_info["name"] = cur_name
cur_info["order"] = cur_order
casts.append(cur_info)
df.at[index, "production_countries"] = casts
Lastly, we will modify the crew. Crews can have multiple members, and each member has a credit_id, a department, a gender, an id, a job, and a name.
df.iloc[0]["crew"]
#go through each row
for index, row in df.iterrows():
crews = []
line = str(row["crew"])
cur_crews = line.split('},')
del cur_crews[-1]
#transform info for each genre
for crew in cur_crews:
m = re.search("credit_id\": (.*), \"department\": \"(.*)\", \"gender\": (.*), \"id\": (.*), \"job\": (.*), \"name\": (.*)\"", crew)
cur_credit_id = m.groups()[0]
cur_department = m.groups()[1]
cur_gender = m.groups()[2]
cur_id = m.groups()[3]
cur_job = m.groups()[4]
cur_name = m.groups()[5]
cur_info = {}
cur_info["credit_id"] = cur_credit_id
cur_info["department"] = cur_department
cur_info["gender"] = cur_gender
cur_info["id"] = cur_id
cur_info["job"] = cur_job
cur_info["name"] = cur_name
crews.append(cur_info)
df.at[index, "crew"] = crews
The final tidied table is shown below. Now we will being doing exploratory analysis on our data.
display(df.head())
To start off our data exploration, we will create a correlation table to view the general trends and relationships within our data.
We will only be using the key variables in our table, and we will also include a variable for missing data.
df['missing'] = 0
genres = []
revenues = []
#get the genre and revenue data from the table
for index, row in df.iterrows():
l = list(map(lambda e: e['name'], row['genres']))
genres += l;
revenues += [row['revenue']] * len(row['genres'])
#find the missing data in the table
for index, row in df.iterrows():
missing = 0
#if the budget is missing
if row['budget'] == 0:
missing += 1
#if the genres are missing
if row['genres'] == []:
missing += 1
#if the homepage is missing
if row['homepage'] == np.NaN:
missing += 1
#if the keywords are missing
if row['keywords'] == []:
missing += 1
#if the production companies are missing
if row['production_companies'] == []:
missing += 1
#if the production countries are missing
if row['production_countries'] == []:
missing += 1
#if the spoken languages are missing
if row['spoken_languages'] == []:
missing += 1
#if the revenue is missing
if row['revenue'] == 0:
missing += 1
#if the tagline is missing
if row['tagline'] == np.NaN:
missing += 1
#if the cast info is missing
if row['cast'] == np.NaN:
missing += 1
#if the crew info is missing
if row['crew'] == np.NaN:
missing += 1
df.at[index, 'missing'] = missing
display(df.corr())
From the correlation table, it is clear that there are noticeable relationships between several of the variables, such as one between revenue and budget. It is also interesting to note how there appears to be a relationship between revenue and missing data, as it appears that if a movie earns more money, it is less likely to have missing data in the table. This implies that our missing data is likely missing not at random (MNAR). For now, we'll deal with the data by just dropping missing data from our analysis, but we'd really want to impute them in the future.
We will explore these relationships below, excluding the relationships that are inconsequential such as the correlation between revenue and runtime or id.
First, we start by replacing any zero values in the budget and revenue columns to nans to avoid having the 0s influencing our regression. In the future, we can also impute the nans in some order.
columns = ['budget', 'revenue']
df[columns] = df[columns].replace(0, np.nan)
Now we will explore the relationship between revenue and budget.
Before we start our data exploration, we will turn off warnings for plots, as they do not affect our results.
warnings.filterwarnings('ignore')
We will begin by creating a histogram of the movies' revenues on a log scaling.
plt.figure(figsize=(10,6))
display()
n, bins, patches = plt.hist(df['revenue'], log=True)
plt.xlabel('Revenue')
plt.ylabel('Counts')
plt.title('Distribution of Revenue')
As shown in the histogram, the majority of movies have low revenues, and only a few have high revenues.
Now we will create a histogram for the movies' budgets, again on a log scale.
plt.figure(figsize=(10,6))
display()
n, bins, patches = plt.hist(df['budget'], log=True)
plt.xlabel('Budget')
plt.ylabel('Counts')
plt.title('Distribution of Budget')
Just like the revenues, most of the movies have low budgets. As shown in the histogram, as the budget increases, there are fewer movies.
Now we will create a histogram that shows the counts for both revenue and budget to compare them to analyze their similarities and differences.
sort = df.sort_values('revenue', ascending=False )
plt.figure(figsize=(10,6))
plt.hist(df['revenue'], label = 'Revenue', log=True)
plt.hist(df['budget'], label = 'Budget', log=True)
plt.xlabel('Revenue or Budget')
plt.ylabel('Counts')
plt.title('Revenue and Budget Distribution')
plt.legend()
It is clear that the revenue and budget have very similar trends, as their graphs are nearly identical. This demonstrates how there is a clear correlation between movies' revenues and budgets, and how the majority of movies have low budgets and therefore produce low revenues.
To make the trend even more clear, we will create a scatterplot of revenue vs. budget.
plt.figure(figsize=(10,6))
plt.scatter(df['budget'], df['revenue'])
plt.xlabel('Budget')
plt.ylabel('Revenue')
plt.title('Revenue vs. Budget')
As shown in the scatterplot, there is a positive linear correlation between the revenue and budget. As the budget increases, the revenue increases as well. This shows how the budget plays a critical role in how much a movie ends up earning.
We will create a scatterplot between revenue and imdb vote average to see the relationship between these two variables.
plt.figure(figsize=(10,6))
plt.scatter(df['vote_average'], df['revenue'])
plt.xlabel('Vote average')
plt.ylabel('Revenue')
plt.title('Revenue vs. Vote Average')
This scatterplot has unexpected results, because there doesn't seem to be a clear relationship between vote average and revenue. Even though it would seem like a movie would have a higher vote average if it has more revenue, it appears that the vote average is not completely related to how much it ends up making. It seems like the majority of vote averages hovers in the 6-8 range regardless of a movie's budget.
This information is good to know, because now we know that vote average does not play an important role in how much a movie makes.
plt.figure(figsize=(10,6))
plt.scatter(df['popularity'], df['revenue'])
plt.xlabel('Popularity')
plt.ylabel('Revenue')
plt.title('Revenue vs. Popularity')
As shown in the scatterplot, there is a positive correlation between the revenue and popularity. In general, a higher popularity indicates a higher revenue, which demonstrates how a movie's popularity plays a role in how much it ends up earning.
Now, we will create a scatter plot between the imdb vote count (# of votes a movie has) and movie revenue
plt.figure(figsize=(10,6))
plt.scatter(df['vote_count'], df['revenue'])
plt.xlabel('Vote Count')
plt.ylabel('Revenue')
plt.title('Revenue vs. Vote Count')
As shown in the scatterplot, there is a positive linear correlation between the revenue and vote count. As the vote count increases, the revenue increases as well. This shows how the vote count is one of the indicators of how much a movie ends up earning.
We will create a bar chart for the average revenue vs. genre, to see if there is a correlation between a movie's genre and how much movie they make.
Before we create the chart, we have to find all of the unique genres in the data and then calculate the average revenue for each genre.
genre_revenue = {}
genre_count = {}
data = {'genre' : genres, 'revenue' : revenues}
data = pd.DataFrame(data)
#get the data
for index, row in data.iterrows():
cur_genre = row["genre"]
cur_revenue = row["revenue"]
#if has seen the genre before
if cur_genre in genre_revenue:
genre_revenue[cur_genre] += cur_revenue
genre_count[cur_genre] += 1
#if haven't seen the genre before
else:
genre_revenue[cur_genre] = cur_revenue
genre_count[cur_genre] = 1
genre_avg_revenue = {}
#calculate the average revenue for each genre
for genre in genre_revenue.keys():
genre_avg_revenue[genre] = genre_revenue[genre] / genre_count[genre]
Now that we've calculated the average revenue of movies for each genre, we can create the bar chart.
figure(num=None, figsize=(20, 8), dpi=80, facecolor='w', edgecolor='k')
objects = genre_avg_revenue.keys()
y_pos = np.arange(len(objects))
performance = genre_avg_revenue.values()
plt.bar(y_pos, performance, align='center', alpha=1)
plt.xticks(y_pos, objects)
plt.ylabel('Average Revenue')
plt.title('Average Revenue vs. Genre')
plt.show()
Some of the observations we can get from the graph are that animation, adventure, and fantasy are the genres that have the highest average revenue per movie. The genres that have the lowest average revenue per movie are TV, foreign, and documentary.
From this bar chart, it is clear that a movie's genre impacts how much revenue they end up earning, as shown by how certain genres earn much more than other genres.
First we will create a histogram of how many movies production companies make.
all_prod_comps = {}
#go through each row
for index, row in df.iterrows():
prod_comps = row["production_companies"]
#go through current production companies
for p_c in prod_comps:
#if has seen production company before
if p_c["name"] in all_prod_comps:
all_prod_comps[p_c["name"]] += 1
#if hasn't seen production company before
else:
all_prod_comps[p_c["name"]] = 1
all_prod_comps
reduced = dict( (key, value) for (key, value) in all_prod_comps.items() if value >= 10 )
figure(num=None, figsize=(10, 6), dpi=80, facecolor='w', edgecolor='k')
n, bins, patches = plt.hist(reduced.values())
plt.xlabel('Number of Production Companies')
plt.ylabel('Movies Produced')
plt.title('Movies Produced vs. Number of Production Companies')
As shown in the graph, there are only a few production companies that produce many movies, while the majority of companies only produce a small number of movies.
Now we will create a bar plot to find the relationship between average revenue and the number of movies a production company has produced.
x = []
y = []
#go through each bin
for i in range(len(bins) - 1):
start = bins[i]
end = bins[i+1]
x.append(start)
reduced = list(value for (key, value) in all_prod_comps.items() if value >= start and value < end)
y.append(np.mean(reduced))
plt.figure(figsize=(10,6))
plt.bar(x,y, width=15)
plt.xlabel('Number of Movies A Production Company Produced')
plt.ylabel('Average Revenue')
plt.title('Average Revenue vs. Number of Movies A Production Company Produced')
From this graph, it is clear that if a movie is produced by a production company that has made many movies in the past, it will have a higher average revenue. This shows that a movie's production company/companies will impact its revenue.
Now lets analyze our data!
In our analysis, we will create two regression models, linear regression and random forests regression, to make predictions for revenue. We will only be using budget, popularity, and vote count as the dependent variables, because our models won't be able to handle too many variables and those are some of the most critical variables as discovered in the exploratory section. We will use training data to train our models, and then use test data to predict values for revenue.
Afterwards, we will perform t-tests with 10-fold validation to compare the results from these two regression models, to observe which model has the least amount of error.
First we will create the linear regression model and predict values for revenue. For more information on sklearn's linear regression: here
feature_list = ['budget', 'popularity', 'vote_count']
X = []
Y = []
dt = df.dropna()
#get the dependent and independent variables
for index, row in dt.iterrows():
X.append([row["budget"], row["popularity"], row["vote_count"]])
Y.append(row["revenue"])
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2)
lm = linear_model.LinearRegression()
model = lm.fit(X_train,Y_train)
pred = lm.predict(X_test)
# Plot results. Red line to show actual data points
plt.figure(figsize=(10,6))
plt.title("Linear Regression: Predicted vs. Actual Revenue", fontsize=16)
plt.scatter(Y_test,pred)
plt.plot(Y_test, Y_test, color = "Red")
plt.xlabel("Actual Revenue")
plt.ylabel("Predicted Revenue")
plt.show()
As shown in the graph, the linear regression model predicts the revenue for the data relatively well, as it is able to obtain the same trend as the actual revenue.
Now we will create the random forests model and predict values for revenue. For more information on sklearn's random forests regressor: here
rfr = RandomForestRegressor(100)
model = rfr.fit(X_train, Y_train)
pred = rfr.predict(X_test)
# Plot performance
plt.figure(figsize=(10,6))
plt.title("Random Forests: Predicted vs. Actual Revenue", fontsize=16)
plt.scatter(Y_test,pred)
plt.plot(Y_test, Y_test, color = "Red")
plt.xlabel("Actual Revenue")
plt.ylabel("Predicted Revenue")
plt.show()
As shown in the graph, the random forests model also predicts the revenue for the data well, but it is already clear that its approximation isn't as accurate as the linear regression model's.
Now that we have our two models, we will perform a t-test with 10-fold validation to compare the results with each other.
classifiers = {'Linear Regression' : lm, 'Random Forests Regression' : rfr}
n = len(Y)
error = []
stderror = []
i = 0
#go through the regressioons
for name,regressor in classifiers.items():
score = model_selection.cross_val_score(regressor, X, Y, cv=10, scoring='neg_mean_squared_error')
error.append(-score)
stderror.append(np.mean(error[i] / np.sqrt(n)))
print("{} error: {}, MSE: {}\n".format(name, error[i], stderror[i]))
i = i + 1
diff = error[1] - error[0]
diffBar = np.mean(diff)
sdDiff = np.sqrt(np.sum(np.square(diff - diffBar)) / 9)
seDiff = sdDiff / np.sqrt(n)
t = diffBar / seDiff
pval = stats.t.sf(t, 9)
print("Null Hypothesis: LR_Error = SVR_Error\nAlternative Hypothesis: LR_Error < SVR_Error")
print("\nMean Difference: {}, Test statistic t: {}, P-Val: {}".format(diffBar, t, pval))
As shown in the results, the linear regression model performed better than the random forests model, as expected. The linear regression model's predicted revenue had less error compared to the random forests one, and from this we can conclude that the linear regression model is a more reliable predicting model for revenue.
From our analysis, it is clear that budget, popularity, and vote count are factors that influence how much a movie earns in terms of revenue. Even though we only used a few dependent variables, there are many more variables that also affect a movie's revenue that we didn't take into account in our models. This includes the other variables that we analyzed in our exploratory section, as well as other categorial attributes such as what a movie's plot is. Even though we weren't able to take all of the possible variables into account, from our analysis we have still gained a better understanding of how movies' revenues work.
As for missing data in our data set, because we chose to remove any rows with no values for budget and revenue, the missing data is missing not at random (MNAR). In other projects and in the future, we can instead use imputations to replace the missing data, rather than just removing them altogether. This could possible lead to getting more accurate results and having the models predict revenue more accurately.
Through this project, we hope to have created a better understanding of how movies' earnings work and highlight the multiple factors that go into play in determining a movie's revenue.
Our work only provides a small glimpse into the complexity of how the movie industry works, and we hope that our study was an informative and interesting read!