seaborn.countplot


Bar graphs are useful for displaying the relationship between a categorical variable and at least one numerical variable. seaborn.countplot draws a bar plot in which the dependent variable is the number of occurrences of each value of the independent (categorical) variable.
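
As a rough illustration of what is being counted, the sketch below tallies a small made-up list of genres with pandas. The values in `genres` are invented for the example; the resulting counts are the bar heights a count plot of the same data would show.

```python
import pandas as pd

# Made-up categorical observations (not taken from the movie dataset)
genres = pd.Series(["Drama", "Comedy", "Drama", "Action", "Drama", "Comedy"])

# value_counts() returns the number of occurrences of each distinct value,
# sorted from most to least frequent
counts = genres.value_counts()
print(counts)
```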

dataset: IMDB 5000 Movie Dataset

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (20.0, 10.0)
df = pd.read_csv('../../../datasets/movie_metadata.csv')
df.head()
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0

5 rows × 28 columns

For the bar plot, let’s look at the number of movies in each category, allowing each movie to be counted more than once.

# split each movie's genre list, then collect every distinct genre into a set
categories = {s for genre_list in df.genres.unique() for s in genre_list.split("|")}
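# one-hot encode each movie's classification
# (match whole genre names: a plain substring test would also count "Music" for every "Musical")
for cat in categories:
    df[cat] = df.genres.apply(lambda s: int(cat in s.split("|")))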
# drop other columns
df = df[['director_name','genres','duration'] + list(categories)]
df.head()
director_name genres duration Biography Western Documentary Adventure Drama Musical Reality-TV ... Family Romance Action Thriller History Sport Horror Film-Noir Crime Music
0 James Cameron Action|Adventure|Fantasy|Sci-Fi 178.0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
1 Gore Verbinski Action|Adventure|Fantasy 169.0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
2 Sam Mendes Action|Adventure|Thriller 148.0 0 0 0 1 0 0 0 ... 0 0 1 1 0 0 0 0 0 0
3 Christopher Nolan Action|Thriller 164.0 0 0 0 0 0 0 0 ... 0 0 1 1 0 0 0 0 0 0
4 Doug Walker Documentary NaN 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 29 columns
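
One thing to watch when encoding genres from a delimited string is that a substring test treats "Music" as present in "Musical". The sketch below, on invented toy data, contrasts a substring test with pandas' `Series.str.get_dummies`, which splits on the separator and builds one indicator column per exact genre name. It is offered as a possible alternative to the loop, not as the method this notebook relies on.

```python
import pandas as pd

# Toy data, invented for this example
toy = pd.DataFrame({"genres": ["Action|Adventure", "Musical", "Music|Drama"]})

# Substring test: "Music" is found inside "Musical", so row 1 is miscounted
substring_music = toy.genres.apply(lambda s: int("Music" in s))

# Exact matching: split on "|" and build one 0/1 column per genre
dummies = toy.genres.str.get_dummies(sep="|")

print(substring_music.tolist())   # result of substring matching
print(dummies["Music"].tolist())  # result of exact matching
```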

# convert from wide to long format and remove null classifications
df = pd.melt(df,
             id_vars=['duration'],
             value_vars = list(categories),
             var_name = 'Category',
             value_name = 'Count')
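df = df.loc[df.Count>0].copy()  # copy so the new column below is not written to a view
# add an indicator whether a movie is short or long, split at 100 minutes runtime
# (a missing duration compares as False, so it is flagged 0)
df['islong'] = df.duration.transform(lambda x: int(x > 100))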
# sort in descending order
#df = df.loc[df.groupby('Category').transform(sum).sort_values('Count', ascending=False).index]
df.head()
     duration   Category  Count  islong
113     206.0  Biography      1       1
257     170.0  Biography      1       1
272     165.0  Biography      1       1
289     140.0  Biography      1       1
290     176.0  Biography      1       1
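
To make the reshaping step easier to follow, here is a minimal sketch of the same wide-to-long conversion on a three-movie frame. The frame `wide` and its values are invented for the example, and the long/short flag is computed with a vectorized comparison rather than a lambda, which is a stylistic choice here.

```python
import pandas as pd

# Toy wide-format frame, invented for this example: one indicator column per genre
wide = pd.DataFrame({
    "duration": [178.0, 95.0, None],
    "Action":   [1, 0, 0],
    "Drama":    [0, 1, 1],
})

# Wide to long: one row per (movie, genre) pair
long = pd.melt(wide, id_vars=["duration"], value_vars=["Action", "Drama"],
               var_name="Category", value_name="Count")

# Keep only the pairs where the movie actually belongs to the genre
long = long.loc[long.Count > 0].copy()

# Vectorized flag; a comparison with NaN is False, so a missing duration becomes 0
long["islong"] = (long.duration > 100).astype(int)
print(long)
```

Each remaining row is one genre membership, so a movie with several genres appears several times, which is exactly the "counted more than once" behavior described earlier.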

Basic plot

p = sns.countplot(data=df, x = 'Category')
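
By default the bars appear in the order the categories are found in the data. If a descending order is wanted, one option is the `order` parameter of `countplot` combined with `value_counts()`. The sketch below uses an invented toy frame and a non-interactive matplotlib backend so that it runs anywhere; both are assumptions of the example, not part of the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data, invented for this example
toy = pd.DataFrame({"Category": ["Drama", "Action", "Drama", "Comedy", "Drama", "Action"]})

# value_counts() sorts by frequency, descending, so its index is the desired bar order
order = toy["Category"].value_counts().index.tolist()

fig, ax = plt.subplots()
sns.countplot(data=toy, x="Category", order=order, ax=ax)
```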

Color by a category

p = sns.countplot(data=df,
                  x = 'Category',
                  hue = 'islong')
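
With `hue`, each category's bar is split into one bar per hue level, and each bar's height is the number of rows with that combination. The same numbers can be reproduced with a pandas `groupby`, as in this sketch on invented toy data:

```python
import pandas as pd

# Toy long-format data, invented for this example
toy = pd.DataFrame({
    "Category": ["Drama", "Drama", "Drama", "Action", "Action"],
    "islong":   [1, 0, 1, 1, 1],
})

# Number of rows for every (Category, islong) combination that occurs
grouped = toy.groupby(["Category", "islong"]).size()
print(grouped)
```

Combinations that never occur, such as a short Action movie in this toy frame, are simply absent from the result.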

Make plot horizontal

p = sns.countplot(data=df,
                  y = 'Category',
                  hue = 'islong')

Saturation

p = sns.countplot(data=df,
                  y = 'Category',
                  hue = 'islong',
                  saturation=.5)
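
The `saturation` argument controls how vivid the bar colors are drawn. To get a feel for the numbers, the sketch below applies `seaborn.desaturate` to pure red; that `countplot` produces exactly the same values internally is an assumption of this example.

```python
import seaborn as sns

# Pure red at full saturation, then with the saturation channel halved
full = sns.desaturate("red", 1.0)
half = sns.desaturate("red", 0.5)
print(full, half)
```

Halving the saturation pulls the red channel down and the other two channels up, moving the color toward gray.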

Various palettes

p = sns.countplot(data=df,
                  y = 'Category',
                  hue = 'islong',
                  saturation=.9,
                  palette = 'deep')

p = sns.countplot(data=df,
                  y = 'Category',
                  hue = 'islong',
                  saturation=.9,
                  palette = 'muted')

p = sns.countplot(data=df,
                  y = 'Category',
                  hue = 'islong',
                  saturation=.9,
                  palette = 'pastel')

p = sns.countplot(data=df,
                  y = 'Category',
                  hue = 'islong',
                  saturation=.9,
                  palette = 'bright')

p = sns.countplot(data=df,
                  y = 'Category',
                  hue = 'islong',
                  saturation=.9,
                  palette = 'dark')

p = sns.countplot(data=df,
                  y = 'Category',
                  hue = 'islong',
                  saturation=.9,
                  palette = 'colorblind')

p = sns.countplot(data=df,
                  y = 'Category',
                  hue = 'islong',
                  saturation=.9,
                  palette = ((50/255, 132/255.0, 191/255.0), (255/255.0, 232/255.0, 0/255.0)))
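
The custom palette above is given as RGB tuples scaled to the 0-1 range. If hex codes are more convenient, matplotlib's `to_hex` and `to_rgb` convert between the two forms; this sketch only checks the two colors used above.

```python
from matplotlib.colors import to_hex, to_rgb

# The two custom colors from the palette above, as 0-1 RGB tuples
blue = (50 / 255, 132 / 255, 191 / 255)
yellow = (255 / 255, 232 / 255, 0 / 255)

# Convert to hex strings and back again
blue_hex, yellow_hex = to_hex(blue), to_hex(yellow)
print(blue_hex, yellow_hex)
print(to_rgb(blue_hex))
```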

p = sns.countplot(data=df,
                  y = 'Category',
                  hue = 'islong',
                  saturation=.9,
                  palette = 'Dark2')
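
A palette can also be a dictionary mapping each hue level to a color, which pins a color to a level regardless of ordering. The sketch below uses an invented toy frame with string hue levels ("short" and "long") instead of the 0/1 `islong` flag, and a non-interactive backend; both are assumptions of the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy long-format data, invented for this example
toy = pd.DataFrame({
    "Category": ["Drama", "Drama", "Action", "Action", "Drama"],
    "length":   ["long", "short", "long", "long", "long"],
})

# Explicit color for each hue level
colors = {"short": "#3284bf", "long": "#ffe800"}

fig, ax = plt.subplots()
sns.countplot(data=toy, y="Category", hue="length", palette=colors, ax=ax)
```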

help(sns.color_palette)
Help on function color_palette in module seaborn.palettes:
color_palette(palette=None, n_colors=None, desat=None)
    Return a list of colors defining a color palette.
    Availible seaborn palette names:
        deep, muted, bright, pastel, dark, colorblind
    Other options:
        hls, husl, any named matplotlib palette, list of colors
    Calling this function with ``palette=None`` will return the current
    matplotlib color cycle.
    Matplotlib paletes can be specified as reversed palettes by appending
    "_r" to the name or as dark palettes by appending "_d" to the name.
    (These options are mutually exclusive, but the resulting list of colors
    can also be reversed).
    This function can also be used in a ``with`` statement to temporarily
    set the color cycle for a plot or set of plots.
    Parameters
    ----------
    palette: None, string, or sequence, optional
        Name of palette or None to return current palette. If a sequence, input
        colors are used but possibly cycled and desaturated.
    n_colors : int, optional
        Number of colors in the palette. If ``None``, the default will depend
        on how ``palette`` is specified. Named palettes default to 6 colors,
        but grabbing the current palette or passing in a list of colors will
        not change the number of colors unless this is specified. Asking for
        more colors than exist in the palette will cause it to cycle.
    desat : float, optional
        Proportion to desaturate each color by.
    Returns
    -------
    palette : list of RGB tuples.
        Color palette. Behaves like a list, but can be used as a context
        manager and possesses an ``as_hex`` method to convert to hex color
        codes.
    See Also
    --------
    set_palette : Set the default color cycle for all plots.
    set_color_codes : Reassign color codes like ``"b"``, ``"g"``, etc. to
                      colors from one of the seaborn palettes.
    Examples
    --------
    Show one of the "seaborn palettes", which have the same basic order of hues
    as the default matplotlib color cycle but more attractive colors.
    .. plot::
        :context: close-figs
        >>> import seaborn as sns; sns.set()
        >>> sns.palplot(sns.color_palette("muted"))
    Use discrete values from one of the built-in matplotlib colormaps.
    .. plot::
        :context: close-figs
        >>> sns.palplot(sns.color_palette("RdBu", n_colors=7))
    Make a "dark" matplotlib sequential palette variant. (This can be good
    when coloring multiple lines or points that correspond to an ordered
    variable, where you don't want the lightest lines to be invisible).
    .. plot::
        :context: close-figs
        >>> sns.palplot(sns.color_palette("Blues_d"))
    Use a categorical matplotlib palette, add some desaturation. (This can be
    good when making plots with large patches, which look best with dimmer
    colors).
    .. plot::
        :context: close-figs
        >>> sns.palplot(sns.color_palette("Set1", n_colors=8, desat=.5))
    Use as a context manager:
    .. plot::
        :context: close-figs
        >>> import numpy as np, matplotlib.pyplot as plt
        >>> with sns.color_palette("husl", 8):
        ...    _ = plt.plot(np.c_[np.zeros(8), np.arange(8)].T)


help(sns.countplot)
Help on function countplot in module seaborn.categorical:
countplot(x=None, y=None, hue=None, data=None, order=None, hue_order=None, orient=None, color=None, palette=None, saturation=0.75, ax=None, **kwargs)
    Show the counts of observations in each categorical bin using bars.
    A count plot can be thought of as a histogram across a categorical, instead
    of quantitative, variable. The basic API and options are identical to those
    for :func:`barplot`, so you can compare counts across nested variables.
    Input data can be passed in a variety of formats, including:
    - Vectors of data represented as lists, numpy arrays, or pandas Series
      objects passed directly to the ``x``, ``y``, and/or ``hue`` parameters.
    - A "long-form" DataFrame, in which case the ``x``, ``y``, and ``hue``
      variables will determine how the data are plotted.
    - A "wide-form" DataFrame, such that each numeric column will be plotted.
    - Anything accepted by ``plt.boxplot`` (e.g. a 2d array or list of vectors)
    In most cases, it is possible to use numpy or Python objects, but pandas
    objects are preferable because the associated names will be used to
    annotate the axes. Additionally, you can use Categorical types for the
    grouping variables to control the order of plot elements.    
    Parameters
    ----------
    x, y, hue : names of variables in ``data`` or vector data, optional
        Inputs for plotting long-form data. See examples for interpretation.        
    data : DataFrame, array, or list of arrays, optional
        Dataset for plotting. If ``x`` and ``y`` are absent, this is
        interpreted as wide-form. Otherwise it is expected to be long-form.    
    order, hue_order : lists of strings, optional
        Order to plot the categorical levels in, otherwise the levels are
        inferred from the data objects.        
    orient : "v" | "h", optional
        Orientation of the plot (vertical or horizontal). This is usually
        inferred from the dtype of the input variables, but can be used to
        specify when the "categorical" variable is a numeric or when plotting
        wide-form data.    
    color : matplotlib color, optional
        Color for all of the elements, or seed for :func:`light_palette` when
        using hue nesting.    
    palette : seaborn color palette or dict, optional
        Colors to use for the different levels of the ``hue`` variable. Should
        be something that can be interpreted by :func:`color_palette`, or a
        dictionary mapping hue levels to matplotlib colors.    
    saturation : float, optional
        Proportion of the original saturation to draw colors at. Large patches
        often look better with slightly desaturated colors, but set this to
        ``1`` if you want the plot colors to perfectly match the input color
        spec.    
    ax : matplotlib Axes, optional
        Axes object to draw the plot onto, otherwise uses the current Axes.    
    kwargs : key, value mappings
        Other keyword arguments are passed to ``plt.bar``.
    Returns
    -------
    ax : matplotlib Axes
        Returns the Axes object with the boxplot drawn onto it.    
    See Also
    --------
    barplot : Show point estimates and confidence intervals using bars.    
    factorplot : Combine categorical plots and a class:`FacetGrid`.    
    Examples
    --------
    Show value counts for a single categorical variable:
    .. plot::
        :context: close-figs
        >>> import seaborn as sns
        >>> sns.set(style="darkgrid")
        >>> titanic = sns.load_dataset("titanic")
        >>> ax = sns.countplot(x="class", data=titanic)
    Show value counts for two categorical variables:
    .. plot::
        :context: close-figs
        >>> ax = sns.countplot(x="class", hue="who", data=titanic)
    Plot the bars horizontally:
    .. plot::
        :context: close-figs
        >>> ax = sns.countplot(y="class", hue="who", data=titanic)
    Use a different color palette:
    .. plot::
        :context: close-figs
        >>> ax = sns.countplot(x="who", data=titanic, palette="Set3")
    Use ``plt.bar`` keyword arguments for a different look:
    .. plot::
        :context: close-figs
        >>> ax = sns.countplot(x="who", data=titanic,
        ...                    facecolor=(0, 0, 0, 0),
        ...                    linewidth=5,
        ...                    edgecolor=sns.color_palette("dark", 3))
sns.set(rc={"axes.facecolor":"#ccddff",
            "axes.grid":False,
            'axes.labelsize':30,
            'figure.figsize':(20.0, 10.0),
            'xtick.labelsize':25,
            'ytick.labelsize':20})
p = sns.countplot(data=df, x = 'Category')
plt.text(9,2000, "Color Palettes", fontsize = 95, color='black', fontstyle='italic')

p.get_figure().savefig('../../figures/colors.png')