seaborn.boxplot


Boxplots summarize numeric data over a set of categories. The data is divided into four equal-sized groups by three cut points called quartiles. A box is drawn spanning the middle two groups (from the first to the third quartile), and a line is drawn across the box at the position of the median (which always falls within the box). Usually, a second set of lines is drawn some distance from the box to denote a “maximum” and “minimum” value for the data; values outside these extrema are considered outliers and plotted as individual points. The location of these “whisker” lines is adjustable and is generally some multiple of the interquartile range (IQR), which is the range of values covered by the box.
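
The quantities described above can be computed by hand. The following is a minimal sketch, assuming the common convention of whiskers at 1.5 times the IQR and using a made-up sample of durations (the values are hypothetical, not taken from the dataset):

```python
import numpy as np

# Hypothetical sample of movie durations in minutes (not from the dataset).
durations = np.array([90, 95, 100, 102, 105, 108, 110, 115, 120, 200], dtype=float)

# The three quartiles split the sorted data into four equal-sized groups.
q1, median, q3 = np.percentile(durations, [25, 50, 75])
iqr = q3 - q1  # extent of the box

# Assumed convention: whiskers reach at most 1.5 * IQR beyond the box.
whis = 1.5
lower_fence = q1 - whis * iqr
upper_fence = q3 + whis * iqr

# Each whisker ends at the most extreme observation still inside its fence.
inside = durations[(durations >= lower_fence) & (durations <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier point.
outliers = durations[(durations < lower_fence) | (durations > upper_fence)]

print(f"box: {q1} to {q3}, median {median}, IQR {iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

Note that the whiskers stop at actual data points, so they are usually shorter than the full 1.5 × IQR allowance.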

dataset: IMDB 5000 Movie Dataset

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.rcParams['figure.figsize'] = (20.0, 10.0)
plt.rcParams['font.family'] = "serif"
df = pd.read_csv('../../../datasets/movie_metadata.csv')
df.head()
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0

5 rows × 28 columns

For the box plot, let’s look at the duration of the movies in each genre category, allowing each movie to be counted in more than one category.

# split each movie's genre list, then form a set from the unwrapped list of all genres
categories = {s for genre_list in df.genres.unique() for s in genre_list.split("|")}
# one-hot encode each movie's classification; compare whole genre names so that
# "Music" does not also match movies tagged "Musical"
for cat in categories:
    df[cat] = df.genres.apply(lambda s: int(cat in s.split("|")))
# drop other columns
df = df[['director_name','genres','duration'] + list(categories)]
df.head()
director_name genres duration Comedy Sport Thriller Music Adventure History Biography ... Documentary Horror Fantasy War Action Romance Reality-TV Drama Animation News
0 James Cameron Action|Adventure|Fantasy|Sci-Fi 178.0 0 0 0 0 1 0 0 ... 0 0 1 0 1 0 0 0 0 0
1 Gore Verbinski Action|Adventure|Fantasy 169.0 0 0 0 0 1 0 0 ... 0 0 1 0 1 0 0 0 0 0
2 Sam Mendes Action|Adventure|Thriller 148.0 0 0 1 0 1 0 0 ... 0 0 0 0 1 0 0 0 0 0
3 Christopher Nolan Action|Thriller 164.0 0 0 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
4 Doug Walker Documentary NaN 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0

5 rows × 29 columns
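
The same kind of encoding can also be produced without an explicit loop. This is a sketch on a hypothetical three-movie frame (the `toy` data is made up), using pandas' `Series.str.get_dummies`, which splits on a separator and returns one indicator column per distinct value:

```python
import pandas as pd

# Hypothetical frame mimicking the pipe-delimited `genres` column.
toy = pd.DataFrame({
    "genres": ["Action|Adventure", "Action|Thriller", "Documentary"],
    "duration": [178.0, 164.0, 90.0],
})

# One indicator column per genre; a movie gets a 1 in every genre it lists.
indicators = toy["genres"].str.get_dummies(sep="|")
encoded = pd.concat([toy, indicators], axis=1)

# Column sums give the number of movies per genre, with multi-genre movies
# counted once for each of their genres.
genre_counts = indicators.sum()
print(encoded)
print(genre_counts)
```

Because the split is on the separator, whole genre names are compared, so similar names such as "Music" and "Musical" stay distinct.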

# convert from wide to long format and remove null classifications
df = pd.melt(df,
             id_vars=['duration'],
             value_vars = list(categories),
             var_name = 'Category',
             value_name = 'Count')
df = df.loc[df.Count>0]
top_categories = df.groupby('Category')['Count'].sum().sort_values(ascending=False).index
howmany=10
df = df.loc[df.Category.isin(top_categories[:howmany])]
df = df.rename(columns={"duration": "Duration"})
df.head()
Duration Category Count
7 100.0 Comedy 1
19 106.0 Comedy 1
35 104.0 Comedy 1
41 106.0 Comedy 1
43 103.0 Comedy 1
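
To see what the reshaping step does, here is a small sketch on a hypothetical two-movie frame (the `wide` data is made up). Each indicator column becomes a set of rows, and filtering on `Count > 0` keeps only the genres a movie actually belongs to:

```python
import pandas as pd

# Hypothetical wide frame: one row per movie, one indicator column per genre.
wide = pd.DataFrame({
    "duration": [178.0, 90.0],
    "Action": [1, 0],
    "Documentary": [0, 1],
})

# Wide to long: one row per (movie, genre) pair.
long = pd.melt(wide,
               id_vars=["duration"],
               value_vars=["Action", "Documentary"],
               var_name="Category",
               value_name="Count")
print(long)       # 4 rows: every movie paired with every genre

# Keep only the pairs where the movie really has that genre.
long = long.loc[long.Count > 0]
print(long)       # 2 rows remain
```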

Basic plot

p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration')

The outliers here squash the boxes into a small part of the axis, so I’ll drop every movie running 250 minutes or longer, since I am only interested in demonstrating the visualization tool.

df = df.loc[df.Duration < 250]
p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration')
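
As an alternative to filtering rows, the outlier markers can simply be hidden. The sketch below uses hypothetical data and assumes that `showfliers=False`, a matplotlib boxplot option, is forwarded by `sns.boxplot` through its extra keyword arguments; the boxes and whiskers are still computed from all rows, and the axis is then typically scaled to the whiskers rather than to the extreme points.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical durations with one extreme value per category.
toy = pd.DataFrame({
    "Category": ["Drama"] * 6 + ["Comedy"] * 6,
    "Duration": [95, 100, 105, 110, 120, 330, 85, 90, 95, 100, 105, 300],
})

fig, ax = plt.subplots()
# showfliers=False is passed on to matplotlib, which skips drawing outlier points.
sns.boxplot(data=toy, x="Category", y="Duration", showfliers=False, ax=ax)
```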

Change the order of categories

p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration',
                order = sorted(df.Category.unique()))
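
The `order` argument accepts any list of category names, so the boxes do not have to be alphabetical. As a sketch on hypothetical data, the categories can be sorted by their median duration instead:

```python
import pandas as pd

# Hypothetical long-format data with the same column names as above.
toy = pd.DataFrame({
    "Category": ["Drama", "Drama", "Comedy", "Comedy", "Action", "Action"],
    "Duration": [120.0, 130.0, 90.0, 100.0, 110.0, 112.0],
})

# Median duration per category, sorted from shortest to longest.
medians = toy.groupby("Category")["Duration"].median().sort_values()
median_order = list(medians.index)
print(median_order)

# The list can then be passed as order=median_order to sns.boxplot.
```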

Change the order that the colors are chosen


Change orientation to horizontal

p = sns.boxplot(data=df,
                y = 'Category',
                x = 'Duration',
                order = sorted(df.Category.unique()),
                orient="h")

Desaturate

p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration',
                order = sorted(df.Category.unique()),
                saturation=.25)

Adjust width of boxes

p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration',
                order = sorted(df.Category.unique()),
                width=.25)

Change the size of outlier markers

p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration',
                order = sorted(df.Category.unique()),
                fliersize=20)

Adjust the position of the whiskers as a multiple of the IQR

p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration',
                order = sorted(df.Category.unique()),
                whis=.2)
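
To get a feel for how `whis` changes the picture, the sketch below counts how many points fall outside the whisker limits for two settings. The sample is hypothetical and `count_fliers` is an illustrative helper, not part of seaborn:

```python
import numpy as np

def count_fliers(values, whis):
    """Count points lying more than whis * IQR beyond the box edges."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    beyond = (values < q1 - whis * iqr) | (values > q3 + whis * iqr)
    return int(beyond.sum())

# Hypothetical sample of durations in minutes.
durations = np.array([90, 95, 100, 102, 105, 108, 110, 115, 120, 200], dtype=float)

for whis in (1.5, 0.2):
    print(whis, count_fliers(durations, whis))
```

With the smaller multiple the whiskers hug the box, so more points are drawn as outliers.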

Add a notch to the box indicating a confidence interval for the median

p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration',
                order = sorted(df.Category.unique()),
                notch=True)
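
The notch height can be approximated by hand. This sketch assumes the default, non-bootstrapped approximation used by matplotlib, median ± 1.57 × IQR / √n, and a hypothetical sample:

```python
import numpy as np

# Hypothetical sample of durations in minutes.
durations = np.array([90, 95, 100, 102, 105, 108, 110, 115, 120, 200], dtype=float)

n = len(durations)
q1, median, q3 = np.percentile(durations, [25, 50, 75])
iqr = q3 - q1

# Assumed approximation for the notch: an interval around the median that
# narrows as the sample size grows.
half_width = 1.57 * iqr / np.sqrt(n)
notch_low, notch_high = median - half_width, median + half_width

print(f"median {median}, notch from {notch_low:.2f} to {notch_high:.2f}")
```

When the notches of two boxes do not overlap, that is often read as rough evidence that the medians differ.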

Adjust the line width

p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration',
                order = sorted(df.Category.unique()),
                notch=False,
                linewidth=2.5)

Finalize

sns.set(rc={"axes.facecolor":"#ccddff",
            "axes.grid":False,
            'axes.labelsize':30,
            'figure.figsize':(20.0, 10.0),
            'xtick.labelsize':25,
            'ytick.labelsize':20})
p = sns.boxplot(data=df,
                x = 'Category',
                y = 'Duration',
                palette = 'Paired',
                order = sorted(df.Category.unique()),
                notch=True)
plt.xticks(rotation=45)
l = plt.xlabel('')
plt.ylabel('Duration (min)')
plt.text(5.4,200, "Box Plot", fontsize = 95, color="black", fontstyle='italic')

p.get_figure().savefig('../../figures/boxplot.png')