seaborn.boxplot
Boxplots summarize numeric data over a set of categories. The data is divided into four groups called quartiles. A box is drawn spanning the two innermost quartiles, and a horizontal line is drawn at the position of the median (which always falls within the box). Usually, a second set of lines is drawn some distance from the inner box to denote a “maximum” and “minimum” value for the data, and values outside these extremes are considered outliers and plotted as individual points. The location of these “whisker” lines varies and is generally set at some multiple of the interquartile range (IQR) beyond the edges of the box, where the IQR is the range of values covered by the inner box.
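To make the description above concrete, here is a minimal sketch that computes the box, whiskers, and outliers by hand with NumPy. The `durations` array is made-up illustrative data, and the 1.5 multiplier is assumed as the common convention; this is not meant to reproduce seaborn's internals exactly.

```python
import numpy as np

# Hypothetical movie durations in minutes (illustrative values, not from the dataset)
durations = np.array([80, 90, 95, 100, 105, 110, 115, 120, 130, 240])

# The box spans the first to the third quartile; the median sits inside it
q1, median, q3 = np.percentile(durations, [25, 50, 75])
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond the box edges (1.5 is an assumed, common convention)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the most extreme observations that are still inside the fences
inside = durations[(durations >= lower_fence) & (durations <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences would be drawn as an individual outlier point
outliers = durations[(durations < lower_fence) | (durations > upper_fence)]

print(f"box: {q1} to {q3}, median {median}, IQR {iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers.tolist()}")
```

In this toy data the single 240-minute value lands outside the upper fence, so it would appear as a lone point above the whisker.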
dataset: IMDB 5000 Movie Dataset
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.rcParams['figure.figsize'] = (20.0, 10.0)
plt.rcParams['font.family'] = "serif"
df = pd.read_csv('../../../datasets/movie_metadata.csv')
df.head()
| | color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | genres | ... | num_user_for_reviews | language | country | content_rating | budget | title_year | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Color | James Cameron | 723.0 | 178.0 | 0.0 | 855.0 | Joel David Moore | 1000.0 | 760505847.0 | Action\|Adventure\|Fantasy\|Sci-Fi | ... | 3054.0 | English | USA | PG-13 | 237000000.0 | 2009.0 | 936.0 | 7.9 | 1.78 | 33000 |
| 1 | Color | Gore Verbinski | 302.0 | 169.0 | 563.0 | 1000.0 | Orlando Bloom | 40000.0 | 309404152.0 | Action\|Adventure\|Fantasy | ... | 1238.0 | English | USA | PG-13 | 300000000.0 | 2007.0 | 5000.0 | 7.1 | 2.35 | 0 |
| 2 | Color | Sam Mendes | 602.0 | 148.0 | 0.0 | 161.0 | Rory Kinnear | 11000.0 | 200074175.0 | Action\|Adventure\|Thriller | ... | 994.0 | English | UK | PG-13 | 245000000.0 | 2015.0 | 393.0 | 6.8 | 2.35 | 85000 |
| 3 | Color | Christopher Nolan | 813.0 | 164.0 | 22000.0 | 23000.0 | Christian Bale | 27000.0 | 448130642.0 | Action\|Thriller | ... | 2701.0 | English | USA | PG-13 | 250000000.0 | 2012.0 | 23000.0 | 8.5 | 2.35 | 164000 |
| 4 | NaN | Doug Walker | NaN | NaN | 131.0 | NaN | Rob Walker | 131.0 | NaN | Documentary | ... | NaN | NaN | NaN | NaN | NaN | NaN | 12.0 | 7.1 | NaN | 0 |
5 rows × 28 columns
For the box plot, let’s look at the duration of movies in each category, allowing each movie to be counted more than once.
# split each movie's genre list, then form a set from the unwrapped list of all genres
categories = set([s for genre_list in df.genres.unique() for s in genre_list.split("|")])
# one-hot encode each movie's classification
for cat in categories:
    # match whole genre names so that "Music" does not also match "Musical"
    df[cat] = df.genres.transform(lambda s: int(cat in s.split("|")))
# drop other columns
df = df[['director_name','genres','duration'] + list(categories)]
df.head()
| | director_name | genres | duration | Comedy | Sport | Thriller | Music | Adventure | History | Biography | ... | Documentary | Horror | Fantasy | War | Action | Romance | Reality-TV | Drama | Animation | News |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | James Cameron | Action\|Adventure\|Fantasy\|Sci-Fi | 178.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | Gore Verbinski | Action\|Adventure\|Fantasy | 169.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | Sam Mendes | Action\|Adventure\|Thriller | 148.0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | Christopher Nolan | Action\|Thriller | 164.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | Doug Walker | Documentary | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 29 columns
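As an aside, pandas can build the same kind of indicator columns in one call with `Series.str.get_dummies`, which splits on a separator and matches whole tokens. The sketch below uses a small made-up frame (`toy` and its genre strings are illustrative, not taken from the movie dataset) to show that "Music" and "Musical" stay separate, which a plain substring test would not guarantee.

```python
import pandas as pd

# Hypothetical pipe-delimited genre strings, in the same shape as df.genres
toy = pd.DataFrame({
    "genres": ["Action|Adventure", "Musical|Romance", "Music|Drama"],
    "duration": [120.0, 95.0, 110.0],
})

# One indicator column per distinct genre; columns come back sorted by name
dummies = toy["genres"].str.get_dummies(sep="|")

# Attach the indicators next to the original columns
encoded = pd.concat([toy, dummies], axis=1)
print(encoded)
```

The loop in the cell above is still a fine choice when you want explicit control over which categories are created; this is just a more compact alternative.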
# convert from wide to long format and remove null classifications
df = pd.melt(df,
id_vars=['duration'],
value_vars = list(categories),
var_name = 'Category',
value_name = 'Count')
df = df.loc[df.Count>0]
# rank categories by how many movies they contain
top_categories = df.groupby('Category')['Count'].sum().sort_values(ascending=False).index
howmany = 10
df = df.loc[df.Category.isin(top_categories[:howmany])]
# rename on a new frame rather than in place on a slice
df = df.rename(columns={"duration": "Duration"})
df.head()
| | Duration | Category | Count |
|---|---|---|---|
| 7 | 100.0 | Comedy | 1 |
| 19 | 106.0 | Comedy | 1 |
| 35 | 104.0 | Comedy | 1 |
| 41 | 106.0 | Comedy | 1 |
| 43 | 103.0 | Comedy | 1 |
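The wide-to-long step is easier to follow on a tiny example. The sketch below uses a made-up four-movie frame (`wide`, with hypothetical `Comedy` and `Drama` indicator columns) and shows how `pd.melt` produces one row per movie and category, how filtering on `Count` keeps only real memberships, and how the most common categories can then be read off with `value_counts`.

```python
import pandas as pd

# Hypothetical one-hot encoded movies: a movie may belong to several categories
wide = pd.DataFrame({
    "duration": [100.0, 120.0, 90.0, 150.0],
    "Comedy": [1, 0, 1, 0],
    "Drama": [1, 1, 0, 1],
})

# One row per (movie, category) pair; Count is 1 where the movie has that genre
long = pd.melt(wide,
               id_vars=["duration"],
               value_vars=["Comedy", "Drama"],
               var_name="Category",
               value_name="Count")

# Keep only the pairs where the movie actually belongs to the category
long = long.loc[long.Count > 0]

# Categories ordered from most to least common
top = long["Category"].value_counts().index[:1]
print(long)
print(list(top))
```

Note that the first movie appears twice in `long`, once under each of its genres, which is what "counted more than once" means here.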
Basic plot
p = sns.boxplot(data=df,
x = 'Category',
y = 'Duration')
The outliers here are making things a bit squished, so I’ll remove them since I am just interested in demonstrating the visualization tool.
df = df.loc[df.Duration < 250]
p = sns.boxplot(data=df,
x = 'Category',
y = 'Duration')
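If you would rather keep every row and only hide the outlier markers, one option is to pass `showfliers=False`, which seaborn forwards to the underlying matplotlib box drawing as far as I can tell. The sketch below uses a small made-up frame (`toy`) with one extreme value; treat it as an assumption-laden illustration rather than the approach used in this notebook.

```python
import pandas as pd
import seaborn as sns

# Hypothetical durations with one extreme value (500) in the Comedy group
toy = pd.DataFrame({
    "Category": ["Comedy"] * 6 + ["Drama"] * 6,
    "Duration": [90, 95, 100, 105, 110, 500, 100, 110, 120, 130, 140, 145],
})

# showfliers=False is passed through to matplotlib and suppresses outlier points
ax = sns.boxplot(data=toy, x="Category", y="Duration", showfliers=False)

# With the outlier hidden, the y axis no longer has to stretch to 500
upper_limit = ax.get_ylim()[1]
print(upper_limit)
```

The trade-off is that hidden outliers still influence nothing in the box itself, but the reader can no longer see that they exist, so filtering explicitly (as above) is arguably more transparent.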
Change the order of categories
p = sns.boxplot(data=df,
x = 'Category',
y = 'Duration',
order = sorted(df.Category.unique()))
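Alphabetical order is only one choice. A common alternative is to order the boxes by a summary statistic such as the median, which makes the ranking of categories visible at a glance. The sketch below computes such an ordering on a small made-up frame (`toy`); the resulting list can be passed to the `order` parameter in the same way as `sorted(df.Category.unique())`.

```python
import pandas as pd

# Hypothetical durations for three categories
toy = pd.DataFrame({
    "Category": ["Comedy"] * 3 + ["Drama"] * 3 + ["Action"] * 3,
    "Duration": [90, 100, 110, 110, 120, 130, 100, 105, 140],
})

# Median duration per category, longest first
median_order = (toy.groupby("Category")["Duration"]
                   .median()
                   .sort_values(ascending=False)
                   .index
                   .tolist())
print(median_order)
```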
Change the order that the colors are chosen
Change orientation to horizontal
p = sns.boxplot(data=df,
y = 'Category',
x = 'Duration',
order = sorted(df.Category.unique()),
orient="h")
Desaturate
p = sns.boxplot(data=df,
x = 'Category',
y = 'Duration',
order = sorted(df.Category.unique()),
saturation=.25)
Adjust width of boxes
p = sns.boxplot(data=df,
x = 'Category',
y = 'Duration',
order = sorted(df.Category.unique()),
width=.25)
Change the size of outlier markers
p = sns.boxplot(data=df,
x = 'Category',
y = 'Duration',
order = sorted(df.Category.unique()),
fliersize=20)
Adjust the position of the whiskers as a multiple of the IQR (the default is 1.5)
p = sns.boxplot(data=df,
x = 'Category',
y = 'Duration',
order = sorted(df.Category.unique()),
whis=.2)
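To see what a smaller `whis` does numerically, the sketch below uses matplotlib's `cbook.boxplot_stats` helper on a made-up array of durations. I am assuming this helper follows the same whisker convention as the plot above; the values are illustrative only.

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

# Hypothetical movie durations in minutes
durations = np.array([80, 90, 95, 100, 105, 110, 115, 120, 130, 240])

# Whisker ends and outliers for the default multiplier and a much smaller one
default_stats = boxplot_stats(durations, whis=1.5)[0]
narrow_stats = boxplot_stats(durations, whis=0.2)[0]

for label, stats in (("whis=1.5", default_stats), ("whis=0.2", narrow_stats)):
    print(label, stats["whislo"], stats["whishi"], list(stats["fliers"]))
```

Shrinking the multiplier pulls the whiskers in toward the box, so more observations are classified as outliers and drawn as individual points.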
Add a notch to the box indicating a confidence interval for the median
p = sns.boxplot(data=df,
x = 'Category',
y = 'Duration',
order = sorted(df.Category.unique()),
notch=True)
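For reference, the notch width can be approximated by hand. The sketch below assumes matplotlib's documented non-bootstrap approximation, median ± 1.57 × IQR / √n, and applies it to a made-up array of durations; it is an illustration of the formula rather than a reproduction of the plotting code.

```python
import numpy as np

# Hypothetical movie durations in minutes
durations = np.array([80, 90, 95, 100, 105, 110, 115, 120, 130, 240])

q1, median, q3 = np.percentile(durations, [25, 50, 75])
iqr = q3 - q1

# Gaussian-based approximation of the confidence interval around the median
half_width = 1.57 * iqr / np.sqrt(len(durations))
notch_low, notch_high = median - half_width, median + half_width

print(f"median {median}, notch from {notch_low:.2f} to {notch_high:.2f}")
```

Because the half-width shrinks with the square root of the sample size, categories with many movies get narrow notches. A common rule of thumb is that non-overlapping notches between two boxes suggest their medians differ.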
Change the line width
p = sns.boxplot(data=df,
x = 'Category',
y = 'Duration',
order = sorted(df.Category.unique()),
notch=False,
linewidth=2.5)
Finalize
sns.set(rc={"axes.facecolor":"#ccddff",
"axes.grid":False,
'axes.labelsize':30,
'figure.figsize':(20.0, 10.0),
'xtick.labelsize':25,
'ytick.labelsize':20})
p = sns.boxplot(data=df,
x = 'Category',
y = 'Duration',
palette = 'Paired',
order = sorted(df.Category.unique()),
notch=True)
plt.xticks(rotation=45)
l = plt.xlabel('')
plt.ylabel('Duration (min)')
plt.text(5.4,200, "Box Plot", fontsize = 95, color="black", fontstyle='italic')
p.get_figure().savefig('../../figures/boxplot.png')