Project: Investigate a Dataset - [TMDB movie data]

Table of Contents

Introduction

Dataset Description

Dataset Columns

  • id: Movie id in dataset
  • imdb_id: The movie's IMDb id
  • popularity: The movie's popularity rating
  • budget: The movie's budget
  • revenue: The movie's revenue
  • original_title: Original movie title
  • cast: The movie's cast
  • homepage: The movie's homepage
  • director: The movie's director
  • tagline: Tagline describing the movie
  • keywords: Keywords describing the movie
  • overview: Overview of the movie's story
  • runtime: Movie's runtime
  • genres: Movie genre
  • production_companies: Movie's production companies
  • release_date: Movie's release date
  • vote_count: How many people reviewed the movie
  • vote_average: Movie's average rating
  • release_year: Movie's release year
  • budget_adj: Adjusted movie budget based on inflation
  • revenue_adj: Adjusted movie revenue based on inflation

Question(s) for Analysis

  • Do popular movies generate more revenue?
  • Do popular movies receive higher ratings?
  • Do movies with a higher budget generate more revenue?
  • Which release months generate more revenue for movies?
  • Which release years generated more revenue for movies?
  • Which genre generated more revenue for movies?
  • Which genre was more popular?
In [72]:
 # import statements for all of the packages used

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
% matplotlib inline
In [19]:
# Upgrade pandas to use dataframe.explode() function. 
!pip install --upgrade pandas==0.25.0;
Requirement already up-to-date: pandas==0.25.0 in /opt/conda/lib/python3.6/site-packages (0.25.0)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.6.1 in /opt/conda/lib/python3.6/site-packages (from pandas==0.25.0) (2.6.1)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /opt/conda/lib/python3.6/site-packages (from pandas==0.25.0) (1.19.5)
Requirement already satisfied, skipping upgrade: pytz>=2017.2 in /opt/conda/lib/python3.6/site-packages (from pandas==0.25.0) (2017.3)
Requirement already satisfied, skipping upgrade: six>=1.5 in /opt/conda/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas==0.25.0) (1.11.0)

Data Wrangling

General Properties

In [73]:
# Load your data
df = pd.read_csv('tmdb-movies.csv')

# Show a sample of the data
df.head()
Out[73]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. ... Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home ... Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09

5 rows × 21 columns

In [74]:
df.shape
Out[74]:
(10866, 21)
In [75]:
df.describe()
Out[75]:
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10866.000000 10866.000000 1.086600e+04 1.086600e+04 10866.000000 10866.000000 10866.000000 10866.000000 1.086600e+04 1.086600e+04
mean 66064.177434 0.646441 1.462570e+07 3.982332e+07 102.070863 217.389748 5.974922 2001.322658 1.755104e+07 5.136436e+07
std 92130.136561 1.000185 3.091321e+07 1.170035e+08 31.381405 575.619058 0.935142 12.812941 3.430616e+07 1.446325e+08
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000 0.000000e+00 0.000000e+00
25% 10596.250000 0.207583 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000 0.000000e+00 0.000000e+00
50% 20669.000000 0.383856 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000 0.000000e+00 0.000000e+00
75% 75610.000000 0.713817 1.500000e+07 2.400000e+07 111.000000 145.750000 6.600000 2011.000000 2.085325e+07 3.369710e+07
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09
In [76]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

Data information

  • We have 10866 rows and 21 columns
  • The average adjusted budget is 17.55 million
  • The average adjusted revenue is 51.36 million
  • The average runtime is 102 minutes
  • The average rating is 5.97, the minimum is 1.5 and the maximum is 9.2
  • The dataset contains movies from 1960 to 2015
  • popularity ranges from 1 to 32.99
  • Several columns have missing data with homepage having the most missing data

Data Cleaning

In [77]:
# Check for duplicate rows
sum(df.duplicated())
Out[77]:
1
In [78]:
# Remove duplicate rows
df.drop_duplicates(inplace=True);

# Change release_date to datetime
df["release_date"] = pd.to_datetime(df["release_date"])

# Extract month from release_date 
df['month'] = pd.DatetimeIndex(df['release_date']).month

# Remove columns that are specific to the data and won't be used in analysis
df.drop(['id', 'imdb_id', 'original_title', 'homepage', 'tagline', 'keywords', 'overview', 'runtime', 'production_companies', 'cast', 'director', 'release_date'], axis = 1, inplace = True)

df.head() # check remaining data
Out[78]:
popularity budget revenue genres vote_count vote_average release_year budget_adj revenue_adj month
0 32.985763 150000000 1513528810 Action|Adventure|Science Fiction|Thriller 5562 6.5 2015 1.379999e+08 1.392446e+09 6
1 28.419936 150000000 378436354 Action|Adventure|Science Fiction|Thriller 6185 7.1 2015 1.379999e+08 3.481613e+08 5
2 13.112507 110000000 295238201 Adventure|Science Fiction|Thriller 2480 6.3 2015 1.012000e+08 2.716190e+08 3
3 11.173104 200000000 2068178225 Action|Adventure|Science Fiction|Fantasy 5292 7.5 2015 1.839999e+08 1.902723e+09 12
4 9.335014 190000000 1506249360 Action|Crime|Thriller 2947 7.3 2015 1.747999e+08 1.385749e+09 4
In [79]:
df.info() # check remaining data
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 10 columns):
popularity      10865 non-null float64
budget          10865 non-null int64
revenue         10865 non-null int64
genres          10842 non-null object
vote_count      10865 non-null int64
vote_average    10865 non-null float64
release_year    10865 non-null int64
budget_adj      10865 non-null float64
revenue_adj     10865 non-null float64
month           10865 non-null int64
dtypes: float64(4), int64(5), object(1)
memory usage: 933.7+ KB

Exploratory Data Analysis

In [80]:
# Show histograms of popularity and revenue
fig, axes =plt.subplots(1, 2)
fig.tight_layout()
df.popularity.hist(label='popularity', ax=axes[0]);
df.revenue_adj.hist(label='revenue', ax=axes[1]);
axes[0].set_title('Popularity histogram');
axes[0].set_xlabel('Popularity value');
axes[1].set_title('Revenue (adjusted) histogram');
axes[1].set_xlabel('Revenue (adjusted)');
fig.text(-0.05, 0.5, 'Number of movies', va='center', rotation='vertical');
  • Most popularity ratings are around 0 and 2.
  • There are a lot of cells in the revenue column with a value of 0.
In [81]:
# Remove rows with 0 revenue for analysis
df2 = df[df['revenue_adj']>0]

# Scatter plot of revenue (adjusted) vs popularity
df2.plot(x='revenue_adj', y='popularity', kind='scatter');
plt.xlabel('Revenue (adjusted)');
plt.ylabel('Popularity');
plt.title('Scatter plot of Revenue (adjusted) vs Popularity');
  • There is no correlation between a movie's popularity and revenue.
In [82]:
# Show histograms of vote_average and popularity
fig, axes =plt.subplots(1, 2)
fig.tight_layout()
df.popularity.hist(label='popularity', ax=axes[0]);
df.vote_average.hist(label='vote_average', ax=axes[1]);
axes[0].set_title('Popularity histogram');
axes[0].set_xlabel('Popularity value');
axes[1].set_title('Vote average histogram');
axes[1].set_xlabel('Vote average');
fig.text(-0.05, 0.5, 'Number of movies', va='center', rotation='vertical');
  • Most popularity ratings are around 0 and 2.
  • Most vote ratings are at the center between 4 and 7.5
In [83]:
# Scatter plot of popularity vs vote average 
df.plot(x='vote_average', y='popularity', kind='scatter');
plt.xlabel('Vote average');
plt.ylabel('Popularity');
plt.title('Scatter plot of Vote average vs Popularity');
  • There are 3 outliers with very high popularity. Those can be romoved for a better reading.
In [84]:
# Remove the 3 outliers with very high popularity
df4 = df[df['popularity']<15]
df4.plot(x='vote_average', y='popularity', kind='scatter');
plt.xlabel('Vote average');
plt.ylabel('Popularity');
plt.title('Altered scatter plot of Vote average vs Popularity');
  • Highly popular movies receive higher ratings of above 5 whilst high vote average does not necessarily mean high popularity.

Do movies with a higher budget generate more revenue?

In [85]:
# Remove rows with 0 revenue and 0 budget for analysis
df3 = df2[df2['budget_adj']>0]

# Scatter plot of revenue (adjusted) vs budget (adjusted)
df3.plot(x='budget_adj', y='revenue_adj', kind='scatter');
plt.xlabel('Budget (adjusted)');
plt.ylabel('Revenue (adjusted)');
plt.suptitle('Scatter plot of Revenue (adjusted) vs Budget (adjusted)', y=1);  
  • There is no correlation between a movie's budget and the revenue it generates.

Which release months generate more revenue for movies?

In [86]:
# Group by month and get average revenue
df2.groupby('month').revenue_adj.mean().plot(kind= 'bar');
plt.xlabel('Month');
plt.ylabel('Mean revenue (adjusted)');
plt.title('Bar plot of mean revenue (adjusted) per month');
  • Movies released in June generate the most revenue whilst movies released in September generate the lowest revenue on average. The summer months (May, June, July) and the winter months (November and December) generate more revenue than the remaining months.
In [87]:
# How many movies are released each month
df2.groupby('month').month.count().plot(kind= 'bar');
plt.xlabel('Month');
plt.ylabel('Movie count');
plt.title('Bar plot of movie count per month');
  • More movies are released in September than other months, followed by December.

Which release years generated more revenue for movies?

In [88]:
# Average revenue (adjuseted) generated per year
df2.groupby('release_year').revenue_adj.mean().plot(kind= 'bar', figsize=(14,8));
plt.xlabel('Release year', size=20);
plt.ylabel('Mean revenue (adjusted)', size=20);
plt.title('Bar plot of mean revenue (adjusted) per year', size=20);
  • The early years, before 1980, movies generated more revenue (adjusted) than the latter years, with 1965 in particular generating the highest revenue. After 1980 revenue didn't vary much on average.
In [89]:
# Overall revenue generated per year
df2.groupby('release_year').revenue_adj.sum().plot(kind= 'bar', figsize=(14,8));
plt.xlabel('Release year', size=20);
plt.ylabel('Overall revenue (adjusted)', size=20);
plt.title('Bar plot of overall revenue (adjusted) per year', size=20);
  • The sum of movie revenues (adjusted) kept increasing year on year.
In [90]:
# How many movies were released each year?
df2.groupby('release_year').release_year.count().plot(kind= 'bar', figsize=(14,8));
plt.xlabel('Release year', size=20);
plt.ylabel('Movie Count', size=20);
plt.title('Bar plot of movie count per year', size=20);
  • More movies are releasesd from year to year.

Which genre generated more revenue for movies?

In [91]:
# We have to separate the genres' rows with multiple genres into separate rows

# Make several copies of the dataframe
df5=df2.copy()
df6=df2.copy()
df7=df2.copy()
df8=df2.copy()
df9=df2.copy()
df10=df2.copy()

# Drop genre rows with null values
df5.dropna(axis=0, inplace=True)

# Convert each cell in the the genres column to a list 
df5['genres']=df5['genres'].apply(lambda x: x.split("|"))
df5['genres'].str.len().max() # Shows that the maximum number of genres in a genres' cell is 5

# df6 is a dataframe containing rows with only 1 genre
df6=df6[df5['genres'].str.len()==1]
df_part1 = df6.copy()

# df7 is a dataframe containing rows with 2 genres
df7=df7[df5['genres'].str.len()==2]
# Split each row in df7 into 2 rows in 2 dataframes, each one containing one of the genres
df_part2 = df7.copy()
df_part2 ['genres']=df_part2['genres'].apply(lambda x: x.split("|")[0])
df_part3 = df7.copy()
df_part3 ['genres']=df_part3['genres'].apply(lambda x: x.split("|")[1])

# df8 is a dataframe containing rows with 3 genres
df8=df8[df5['genres'].str.len()==3]
# Split each row in df8 into 3 rows in 3 dataframes, each one containing one of the genres
df_part4 = df8.copy()
df_part4 ['genres']=df_part4['genres'].apply(lambda x: x.split("|")[0])
df_part5 = df8.copy()
df_part5 ['genres']=df_part5['genres'].apply(lambda x: x.split("|")[1])
df_part6 = df8.copy()
df_part6 ['genres']=df_part6['genres'].apply(lambda x: x.split("|")[2])

# df9 is a dataframe containing rows with 4 genres
df9=df9[df5['genres'].str.len()==4]
# Split each row in df9 into 4 rows in 4 dataframes, each one containing one of the genres
df_part7 = df9.copy()
df_part7 ['genres']=df_part7['genres'].apply(lambda x: x.split("|")[0])
df_part8 = df9.copy()
df_part8 ['genres']=df_part8['genres'].apply(lambda x: x.split("|")[1])
df_part9 = df9.copy()
df_part9 ['genres']=df_part9['genres'].apply(lambda x: x.split("|")[2])
df_part10 = df9.copy()
df_part10 ['genres']=df_part10['genres'].apply(lambda x: x.split("|")[3])

# df10 is a dataframe containing rows with 5 genres
df10=df10[df5['genres'].str.len()==5]
# Split each row in df10 into 5 rows in 5 dataframes, each one containing one of the genres
df_part11 = df10.copy()
df_part11 ['genres']=df_part11['genres'].apply(lambda x: x.split("|")[0])
df_part12 = df10.copy()
df_part12 ['genres']=df_part12['genres'].apply(lambda x: x.split("|")[1])
df_part13 = df10.copy()
df_part13 ['genres']=df_part13['genres'].apply(lambda x: x.split("|")[2])
df_part14 = df10.copy()
df_part14 ['genres']=df_part14['genres'].apply(lambda x: x.split("|")[3])
df_part15 = df10.copy()
df_part15 ['genres']=df_part15['genres'].apply(lambda x: x.split("|")[4])

# Create a new dataframe combining all the previous created dataframes with only 1 genre in the genres column
new_df=df_part1.append(df_part2).append(df_part3).append(df_part4).append(df_part5).append(df_part6).append(df_part7).append(df_part8).append(df_part9).append(df_part10).append(df_part11).append(df_part12).append(df_part13).append(df_part14).append(df_part15)
In [92]:
# Show which genre produce a higher average revenue (adjusted)
new_df.groupby('genres').revenue_adj.mean().plot(kind= 'bar', figsize=(12,8));
plt.xlabel('Genre', size = 20);
plt.ylabel('Mean revenue (adjusted)', size = 20);
plt.title('Bar plot of mean revenue (adjusted) per genre', size=20);
  • Animation and action movies generated more revenue per movie followed by fantasy and family. On the other hand documentary and foreign movies generated the lowest revenue per movie.

Which genre was more popular?

In [93]:
# Which genres were more popular?
new_df.groupby('genres').popularity.mean().plot(kind= 'bar', figsize=(12,8));
plt.xlabel('Genre', size = 20);
plt.ylabel('Popularity', size = 20);
plt.title('Bar plot of popularity per genre', size=20);
  • Adventure and science fiction were the most popular genres followed by fantasy, animation and action. Documentary, foreign and tv movies were the least popular genres.

  • Comparing this plot with the previous one shows that popular genres and the revenue associated with that genre are correlated.

In [94]:
# Which genre had more movies released? 
new_df.groupby('genres').genres.count().plot(kind= 'bar', figsize=(12,8));
plt.xlabel('Genre', size = 20);
plt.ylabel('Movie count', size = 20);
plt.title('Bar plot of movie count per genre', size=20);
  • Drama movies were the genre with the most releases followed by comedy. TV movie and foreign had the fewest releases followed by western and documentary.

Conclusions

In this project we be analyzed a data set containing information about 1000 movies collected from The Movie Database (TMDb). We were interested in finding correlations and trends between movie revenues, popularity and different properties associated with them.

The data was first cleaned by removing duplicate rows and unnecessary columns for the analysis. The date column was also converted to datetime and the month extracted to a separate column.

Results

The analysis showed that there was no correlation between a movie's popularity or budget and the revenue it generated. The analysis also showed that popular movies tend to have a rating higher than 5 but that a high rating did not necessarily mean a movie being popular.

Movies released in June generate the most revenue whilst movies released in September generate the lowest revenue on average. There is not much variation in average movie revenues (adjusted) throughout the years after the year 1960. More movies and overall revenue is generated throughout the years.

Animation and action movies generated more revenue per movie whereas documentary and foreign movies generated the lowest revenue per movie. Adventure and science fiction were the most popular genres whereas documentary, foreign and tv movies were the least popular genres. There is a correlation between popular genres and the revenue they generate.

Limitations

There was no correlation found between popularity and revenue. A deeper understanding of the value 'popularity' and where it came from would be beneficial.

There was a lot of missing budget and revenue values. They are provided as zeros. I had to remove this data as any assumptions about it would only distort results and analysis. Having this data would improve analysis.

Several cells also had missing values including directors, production companies and genres. I removed the missing genres rows as they were only a few. However several director and production company fields are missing and this can affect future analysis on their correlation with movies revenue and popularity.

Further Analysis

Further analysis can be performed to find out whether certain directors or cast members affect a movie's popularity and revenue. Further investegation of production companies can also be performed with regards to popularity and revenues. The popularity and revenues of certain genres throughout the years can also be analysed.

In [95]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])
Out[95]:
0