Project: Investigate a Dataset - [TMDB movie data]¶

Table of Contents¶

Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions

¶

Introduction¶

Dataset Description¶

In this project we’ll be analyzing a data set containing information about 1000 movies collected from The Movie Database (TMDb). We’ll be interested In finding correlations and trends between movie revenues, popularity and different properties associated with them.¶

Dataset Columns¶

id: Movie id in dataset
imdb_id: The movie's IMDb id
popularity: The movie's popularity rating
budget: The movie's budget
revenue: The movie's revenue
original_title: Original movie title
cast: The movie's cast
homepage: The movie's homepage
director: The movie's director
tagline: Tagline describing the movie
keywords: Keywords describing the movie
overview: Overview of the movie's story
runtime: Movie's runtime
genres: Movie genre
production_companies: Movie's production companies
release_date: Movie's release date
vote_count: How many people reviewed the movie
vote_average: Movie's average rating
release_year: Movie's release year
budget_adj: Adjusted movie budget based on inflation
revenue_adj: Adjusted movie revenue based on inflation

Question(s) for Analysis¶

Do popular movies generate more revenue?
Do popular movies receive higher ratings?
Do movies with a higher budget generate more revenue?
Which release months generate more revenue for movies?
Which release years generated more revenue for movies?
Which genre generated more revenue for movies?
Which genre was more popular?

 # import statements for all of the packages used

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
% matplotlib inline

# Upgrade pandas to use dataframe.explode() function. 
!pip install --upgrade pandas==0.25.0;

Requirement already up-to-date: pandas==0.25.0 in /opt/conda/lib/python3.6/site-packages (0.25.0)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.6.1 in /opt/conda/lib/python3.6/site-packages (from pandas==0.25.0) (2.6.1)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /opt/conda/lib/python3.6/site-packages (from pandas==0.25.0) (1.19.5)
Requirement already satisfied, skipping upgrade: pytz>=2017.2 in /opt/conda/lib/python3.6/site-packages (from pandas==0.25.0) (2017.3)
Requirement already satisfied, skipping upgrade: six>=1.5 in /opt/conda/lib/python3.6/site-packages (from python-dateutil>=2.6.1->pandas==0.25.0) (1.11.0)

Data Wrangling¶

General Properties¶

# Load your data
df = pd.read_csv('tmdb-movies.csv')

# Show a sample of the data
df.head()

df.shape

(10866, 21)

df.describe()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

Data information¶

We have 10866 rows and 21 columns
The average adjusted budget is 17.55 million
The average adjusted revenue is 51.36 million
The average runtime is 102 minutes
The average rating is 5.97, the minimum is 1.5 and the maximum is 9.2
The dataset contains movies from 1960 to 2015
popularity ranges from 1 to 32.99
Several columns have missing data with homepage having the most missing data

Data Cleaning¶

# Check for duplicate rows
sum(df.duplicated())

1

# Remove duplicate rows
df.drop_duplicates(inplace=True);

# Change release_date to datetime
df["release_date"] = pd.to_datetime(df["release_date"])

# Extract month from release_date 
df['month'] = pd.DatetimeIndex(df['release_date']).month

# Remove columns that are specific to the data and won't be used in analysis
df.drop(['id', 'imdb_id', 'original_title', 'homepage', 'tagline', 'keywords', 'overview', 'runtime', 'production_companies', 'cast', 'director', 'release_date'], axis = 1, inplace = True)

df.head() # check remaining data

df.info() # check remaining data

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 10 columns):
popularity      10865 non-null float64
budget          10865 non-null int64
revenue         10865 non-null int64
genres          10842 non-null object
vote_count      10865 non-null int64
vote_average    10865 non-null float64
release_year    10865 non-null int64
budget_adj      10865 non-null float64
revenue_adj     10865 non-null float64
month           10865 non-null int64
dtypes: float64(4), int64(5), object(1)
memory usage: 933.7+ KB

Exploratory Data Analysis¶

Do popular movies generate more revenue?¶

# Show histograms of popularity and revenue
fig, axes =plt.subplots(1, 2)
fig.tight_layout()
df.popularity.hist(label='popularity', ax=axes[0]);
df.revenue_adj.hist(label='revenue', ax=axes[1]);
axes[0].set_title('Popularity histogram');
axes[0].set_xlabel('Popularity value');
axes[1].set_title('Revenue (adjusted) histogram');
axes[1].set_xlabel('Revenue (adjusted)');
fig.text(-0.05, 0.5, 'Number of movies', va='center', rotation='vertical');

Most popularity ratings are around 0 and 2.
There are a lot of cells in the revenue column with a value of 0.

# Remove rows with 0 revenue for analysis
df2 = df[df['revenue_adj']>0]

# Scatter plot of revenue (adjusted) vs popularity
df2.plot(x='revenue_adj', y='popularity', kind='scatter');
plt.xlabel('Revenue (adjusted)');
plt.ylabel('Popularity');
plt.title('Scatter plot of Revenue (adjusted) vs Popularity');

There is no correlation between a movie's popularity and revenue.

Do Popular movies receive higher ratings?¶

# Show histograms of vote_average and popularity
fig, axes =plt.subplots(1, 2)
fig.tight_layout()
df.popularity.hist(label='popularity', ax=axes[0]);
df.vote_average.hist(label='vote_average', ax=axes[1]);
axes[0].set_title('Popularity histogram');
axes[0].set_xlabel('Popularity value');
axes[1].set_title('Vote average histogram');
axes[1].set_xlabel('Vote average');
fig.text(-0.05, 0.5, 'Number of movies', va='center', rotation='vertical');

Most popularity ratings are around 0 and 2.
Most vote ratings are at the center between 4 and 7.5

# Scatter plot of popularity vs vote average 
df.plot(x='vote_average', y='popularity', kind='scatter');
plt.xlabel('Vote average');
plt.ylabel('Popularity');
plt.title('Scatter plot of Vote average vs Popularity');

There are 3 outliers with very high popularity. Those can be romoved for a better reading.

# Remove the 3 outliers with very high popularity
df4 = df[df['popularity']<15]
df4.plot(x='vote_average', y='popularity', kind='scatter');
plt.xlabel('Vote average');
plt.ylabel('Popularity');
plt.title('Altered scatter plot of Vote average vs Popularity');

Highly popular movies receive higher ratings of above 5 whilst high vote average does not necessarily mean high popularity.

Do movies with a higher budget generate more revenue?¶

# Remove rows with 0 revenue and 0 budget for analysis
df3 = df2[df2['budget_adj']>0]

# Scatter plot of revenue (adjusted) vs budget (adjusted)
df3.plot(x='budget_adj', y='revenue_adj', kind='scatter');
plt.xlabel('Budget (adjusted)');
plt.ylabel('Revenue (adjusted)');
plt.suptitle('Scatter plot of Revenue (adjusted) vs Budget (adjusted)', y=1);

There is no correlation between a movie's budget and the revenue it generates.

Which release months generate more revenue for movies?¶

# Group by month and get average revenue
df2.groupby('month').revenue_adj.mean().plot(kind= 'bar');
plt.xlabel('Month');
plt.ylabel('Mean revenue (adjusted)');
plt.title('Bar plot of mean revenue (adjusted) per month');

Movies released in June generate the most revenue whilst movies released in September generate the lowest revenue on average. The summer months (May, June, July) and the winter months (November and December) generate more revenue than the remaining months.

# How many movies are released each month
df2.groupby('month').month.count().plot(kind= 'bar');
plt.xlabel('Month');
plt.ylabel('Movie count');
plt.title('Bar plot of movie count per month');

More movies are released in September than other months, followed by December.

Which release years generated more revenue for movies?¶

# Average revenue (adjuseted) generated per year
df2.groupby('release_year').revenue_adj.mean().plot(kind= 'bar', figsize=(14,8));
plt.xlabel('Release year', size=20);
plt.ylabel('Mean revenue (adjusted)', size=20);
plt.title('Bar plot of mean revenue (adjusted) per year', size=20);

The early years, before 1980, movies generated more revenue (adjusted) than the latter years, with 1965 in particular generating the highest revenue. After 1980 revenue didn't vary much on average.

# Overall revenue generated per year
df2.groupby('release_year').revenue_adj.sum().plot(kind= 'bar', figsize=(14,8));
plt.xlabel('Release year', size=20);
plt.ylabel('Overall revenue (adjusted)', size=20);
plt.title('Bar plot of overall revenue (adjusted) per year', size=20);

The sum of movie revenues (adjusted) kept increasing year on year.

# How many movies were released each year?
df2.groupby('release_year').release_year.count().plot(kind= 'bar', figsize=(14,8));
plt.xlabel('Release year', size=20);
plt.ylabel('Movie Count', size=20);
plt.title('Bar plot of movie count per year', size=20);

More movies are releasesd from year to year.

Which genre generated more revenue for movies?¶

# We have to separate the genres' rows with multiple genres into separate rows

# Make several copies of the dataframe
df5=df2.copy()
df6=df2.copy()
df7=df2.copy()
df8=df2.copy()
df9=df2.copy()
df10=df2.copy()

# Drop genre rows with null values
df5.dropna(axis=0, inplace=True)

# Convert each cell in the the genres column to a list 
df5['genres']=df5['genres'].apply(lambda x: x.split("|"))
df5['genres'].str.len().max() # Shows that the maximum number of genres in a genres' cell is 5

# df6 is a dataframe containing rows with only 1 genre
df6=df6[df5['genres'].str.len()==1]
df_part1 = df6.copy()

# df7 is a dataframe containing rows with 2 genres
df7=df7[df5['genres'].str.len()==2]
# Split each row in df7 into 2 rows in 2 dataframes, each one containing one of the genres
df_part2 = df7.copy()
df_part2 ['genres']=df_part2['genres'].apply(lambda x: x.split("|")[0])
df_part3 = df7.copy()
df_part3 ['genres']=df_part3['genres'].apply(lambda x: x.split("|")[1])

# df8 is a dataframe containing rows with 3 genres
df8=df8[df5['genres'].str.len()==3]
# Split each row in df8 into 3 rows in 3 dataframes, each one containing one of the genres
df_part4 = df8.copy()
df_part4 ['genres']=df_part4['genres'].apply(lambda x: x.split("|")[0])
df_part5 = df8.copy()
df_part5 ['genres']=df_part5['genres'].apply(lambda x: x.split("|")[1])
df_part6 = df8.copy()
df_part6 ['genres']=df_part6['genres'].apply(lambda x: x.split("|")[2])

# df9 is a dataframe containing rows with 4 genres
df9=df9[df5['genres'].str.len()==4]
# Split each row in df9 into 4 rows in 4 dataframes, each one containing one of the genres
df_part7 = df9.copy()
df_part7 ['genres']=df_part7['genres'].apply(lambda x: x.split("|")[0])
df_part8 = df9.copy()
df_part8 ['genres']=df_part8['genres'].apply(lambda x: x.split("|")[1])
df_part9 = df9.copy()
df_part9 ['genres']=df_part9['genres'].apply(lambda x: x.split("|")[2])
df_part10 = df9.copy()
df_part10 ['genres']=df_part10['genres'].apply(lambda x: x.split("|")[3])

# df10 is a dataframe containing rows with 5 genres
df10=df10[df5['genres'].str.len()==5]
# Split each row in df10 into 5 rows in 5 dataframes, each one containing one of the genres
df_part11 = df10.copy()
df_part11 ['genres']=df_part11['genres'].apply(lambda x: x.split("|")[0])
df_part12 = df10.copy()
df_part12 ['genres']=df_part12['genres'].apply(lambda x: x.split("|")[1])
df_part13 = df10.copy()
df_part13 ['genres']=df_part13['genres'].apply(lambda x: x.split("|")[2])
df_part14 = df10.copy()
df_part14 ['genres']=df_part14['genres'].apply(lambda x: x.split("|")[3])
df_part15 = df10.copy()
df_part15 ['genres']=df_part15['genres'].apply(lambda x: x.split("|")[4])

# Create a new dataframe combining all the previous created dataframes with only 1 genre in the genres column
new_df=df_part1.append(df_part2).append(df_part3).append(df_part4).append(df_part5).append(df_part6).append(df_part7).append(df_part8).append(df_part9).append(df_part10).append(df_part11).append(df_part12).append(df_part13).append(df_part14).append(df_part15)

# Show which genre produce a higher average revenue (adjusted)
new_df.groupby('genres').revenue_adj.mean().plot(kind= 'bar', figsize=(12,8));
plt.xlabel('Genre', size = 20);
plt.ylabel('Mean revenue (adjusted)', size = 20);
plt.title('Bar plot of mean revenue (adjusted) per genre', size=20);

Animation and action movies generated more revenue per movie followed by fantasy and family. On the other hand documentary and foreign movies generated the lowest revenue per movie.

Which genre was more popular?¶

# Which genres were more popular?
new_df.groupby('genres').popularity.mean().plot(kind= 'bar', figsize=(12,8));
plt.xlabel('Genre', size = 20);
plt.ylabel('Popularity', size = 20);
plt.title('Bar plot of popularity per genre', size=20);

Adventure and science fiction were the most popular genres followed by fantasy, animation and action. Documentary, foreign and tv movies were the least popular genres.
Comparing this plot with the previous one shows that popular genres and the revenue associated with that genre are correlated.

# Which genre had more movies released? 
new_df.groupby('genres').genres.count().plot(kind= 'bar', figsize=(12,8));
plt.xlabel('Genre', size = 20);
plt.ylabel('Movie count', size = 20);
plt.title('Bar plot of movie count per genre', size=20);

Drama movies were the genre with the most releases followed by comedy. TV movie and foreign had the fewest releases followed by western and documentary.

Conclusions¶

In this project we be analyzed a data set containing information about 1000 movies collected from The Movie Database (TMDb). We were interested in finding correlations and trends between movie revenues, popularity and different properties associated with them.

The data was first cleaned by removing duplicate rows and unnecessary columns for the analysis. The date column was also converted to datetime and the month extracted to a separate column.

Results¶

The analysis showed that there was no correlation between a movie's popularity or budget and the revenue it generated. The analysis also showed that popular movies tend to have a rating higher than 5 but that a high rating did not necessarily mean a movie being popular.

Movies released in June generate the most revenue whilst movies released in September generate the lowest revenue on average. There is not much variation in average movie revenues (adjusted) throughout the years after the year 1960. More movies and overall revenue is generated throughout the years.

Animation and action movies generated more revenue per movie whereas documentary and foreign movies generated the lowest revenue per movie. Adventure and science fiction were the most popular genres whereas documentary, foreign and tv movies were the least popular genres. There is a correlation between popular genres and the revenue they generate.

Limitations¶

There was no correlation found between popularity and revenue. A deeper understanding of the value 'popularity' and where it came from would be beneficial.

There was a lot of missing budget and revenue values. They are provided as zeros. I had to remove this data as any assumptions about it would only distort results and analysis. Having this data would improve analysis.

Several cells also had missing values including directors, production companies and genres. I removed the missing genres rows as they were only a few. However several director and production company fields are missing and this can affect future analysis on their correlation with movies revenue and popularity.

Further Analysis¶

Further analysis can be performed to find out whether certain directors or cast members affect a movie's popularity and revenue. Further investegation of production companies can also be performed with regards to popularity and revenues. The popularity and revenues of certain genres throughout the years can also be analysed.

from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])

0

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	tagline	...	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	The park is open.	...	Twenty-two years after the events of Jurassic ...	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/15	5562	6.5	2015	1.379999e+08	1.392446e+09
1	76341	tt1392190	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	http://www.madmaxmovie.com/	George Miller	What a Lovely Day.	...	An apocalyptic story set in the furthest reach...	120	Action\|Adventure\|Science Fiction\|Thriller	Village Roadshow Pictures\|Kennedy Miller Produ...	5/13/15	6185	7.1	2015	1.379999e+08	3.481613e+08
2	262500	tt2908446	13.112507	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	http://www.thedivergentseries.movie/#insurgent	Robert Schwentke	One Choice Can Destroy You	...	Beatrice Prior must confront her inner demons ...	119	Adventure\|Science Fiction\|Thriller	Summit Entertainment\|Mandeville Films\|Red Wago...	3/18/15	2480	6.3	2015	1.012000e+08	2.716190e+08
3	140607	tt2488496	11.173104	200000000	2068178225	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	http://www.starwars.com/films/star-wars-episod...	J.J. Abrams	Every generation has a story.	...	Thirty years after defeating the Galactic Empi...	136	Action\|Adventure\|Science Fiction\|Fantasy	Lucasfilm\|Truenorth Productions\|Bad Robot	12/15/15	5292	7.5	2015	1.839999e+08	1.902723e+09
4	168259	tt2820852	9.335014	190000000	1506249360	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	http://www.furious7.com/	James Wan	Vengeance Hits Home	...	Deckard Shaw seeks revenge against Dominic Tor...	137	Action\|Crime\|Thriller	Universal Pictures\|Original Film\|Media Rights ...	4/1/15	2947	7.3	2015	1.747999e+08	1.385749e+09

	id	popularity	budget	revenue	runtime	vote_count	vote_average	release_year	budget_adj	revenue_adj
count	10866.000000	10866.000000	1.086600e+04	1.086600e+04	10866.000000	10866.000000	10866.000000	10866.000000	1.086600e+04	1.086600e+04
mean	66064.177434	0.646441	1.462570e+07	3.982332e+07	102.070863	217.389748	5.974922	2001.322658	1.755104e+07	5.136436e+07
std	92130.136561	1.000185	3.091321e+07	1.170035e+08	31.381405	575.619058	0.935142	12.812941	3.430616e+07	1.446325e+08
min	5.000000	0.000065	0.000000e+00	0.000000e+00	0.000000	10.000000	1.500000	1960.000000	0.000000e+00	0.000000e+00
25%	10596.250000	0.207583	0.000000e+00	0.000000e+00	90.000000	17.000000	5.400000	1995.000000	0.000000e+00	0.000000e+00
50%	20669.000000	0.383856	0.000000e+00	0.000000e+00	99.000000	38.000000	6.000000	2006.000000	0.000000e+00	0.000000e+00
75%	75610.000000	0.713817	1.500000e+07	2.400000e+07	111.000000	145.750000	6.600000	2011.000000	2.085325e+07	3.369710e+07
max	417859.000000	32.985763	4.250000e+08	2.781506e+09	900.000000	9767.000000	9.200000	2015.000000	4.250000e+08	2.827124e+09

	popularity	budget	revenue	genres	vote_count	vote_average	release_year	budget_adj	revenue_adj	month
0	32.985763	150000000	1513528810	Action\|Adventure\|Science Fiction\|Thriller	5562	6.5	2015	1.379999e+08	1.392446e+09	6
1	28.419936	150000000	378436354	Action\|Adventure\|Science Fiction\|Thriller	6185	7.1	2015	1.379999e+08	3.481613e+08	5
2	13.112507	110000000	295238201	Adventure\|Science Fiction\|Thriller	2480	6.3	2015	1.012000e+08	2.716190e+08	3
3	11.173104	200000000	2068178225	Action\|Adventure\|Science Fiction\|Fantasy	5292	7.5	2015	1.839999e+08	1.902723e+09	12
4	9.335014	190000000	1506249360	Action\|Crime\|Thriller	2947	7.3	2015	1.747999e+08	1.385749e+09	4