
Introduction to Plotnine (ggplot port in Python)

Objective

  • The idea of this post is to go over the main aspects of the Plotnine library in Python.
  • Plotnine is a port of ggplot2 (a popular visualization library in R) that has been available in Python for quite some time. It’s based on matplotlib.
  • There are multiple ports of ggplot2, but in my opinion Plotnine is the most solid choice. It’s a great tool for exploratory analysis, as it lets you generate complex graphs with minimal code.
  • I’ll focus mostly on showing how to use the library rather than on the actual data analysis.
import pandas as pd
import numpy as np
import warnings

from plotnine import *
from plotnine.data import economics, mtcars, mpg
warnings.filterwarnings("ignore")

theme_set(theme_gray()) # default theme
%matplotlib inline
  • mpg is a popular dataset used in ggplot2’s R docs, therefore it’s interesting to see how plotnine works on the same data. The original ggplot2 docs can be found here.
mpg.head()
   manufacturer  model  displ  year  cyl  trans       drv  cty  hwy  fl  class
0  audi          a4     1.8    1999  4    auto(l5)    f    18   29   p   compact
1  audi          a4     1.8    1999  4    manual(m5)  f    21   29   p   compact
2  audi          a4     2.0    2008  4    manual(m6)  f    20   31   p   compact
3  audi          a4     2.0    2008  4    auto(av)    f    21   30   p   compact
4  audi          a4     2.8    1999  6    auto(l5)    f    16   26   p   compact
mpg.dtypes

manufacturer    category
model           category
displ           float64
year            int64
cyl             int64
trans           category
drv             category
cty             int64
hwy             int64
fl              category
class           category
dtype: object

1. Scatter Plots - geom_point()

Docs: https://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_point.html#plotnine.geoms.geom_point

In this first example we plot a scatter plot between two numeric variables (displ and hwy).

  • I pass the mpg DataFrame in the ggplot call and then define the x and y variables inside aes().
  • Once the data is mapped, I define a geom to visualize it. In this case I use geom_point.
  • If the expression spans multiple lines, I need to wrap it in parentheses.
# aes in ggplot() call works
(ggplot(mpg, aes(x='displ', y='hwy')) +
 geom_point()
)

  • With a one-liner we don’t need the extra parentheses.
ggplot(mpg, aes(x='displ', y='hwy')) + geom_point()

  • It’s very straightforward to map a variable to color, in this case cyl.
  • cyl is stored as an integer. To get a distinct color for each value of cyl, I first convert it to the category dtype.
mpg['cyl'] = mpg['cyl'].astype("category")
ggplot(mpg) + geom_point(aes(x='displ', y='hwy', color='cyl'))

  • I can also assign a plot to p and keep adding layers to it one statement at a time. I find this approach more verbose and prefer chaining everything with + in a single expression, but it’s a valid way of using the library.
  • In this plot I add a different color gradient to display cty, as it’s a numeric variable.
p = ggplot(mpg)
p = p + geom_point(aes(x='displ', y='hwy', color='cty'))
p = p + scale_color_gradient(low='blue', high='red')
p

2. Box Plots - geom_boxplot()

Docs: https://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_boxplot.html#plotnine.geoms.geom_boxplot

mpg["year"] = mpg["year"].astype("category")
mpg.head()
   manufacturer  model  displ  year  cyl  trans       drv  cty  hwy  fl  class
0  audi          a4     1.8    1999  4    auto(l5)    f    18   29   p   compact
1  audi          a4     1.8    1999  4    manual(m5)  f    21   29   p   compact
2  audi          a4     2.0    2008  4    manual(m6)  f    20   31   p   compact
3  audi          a4     2.0    2008  4    auto(av)    f    21   30   p   compact
4  audi          a4     2.8    1999  6    auto(l5)    f    16   26   p   compact
  • Now cyl and year are converted to category dtypes.
mpg.dtypes

manufacturer    category
model           category
displ           float64
year            category
cyl             category
trans           category
drv             category
cty             int64
hwy             int64
fl              category
class           category
dtype: object

  • Below you can see a boxplot of hwy for each cyl value.
  • As expected, as the number of cylinders increases, the highway miles per gallon decrease.
ggplot(mpg) + geom_boxplot(aes(x='cyl', y='hwy'))

  • It’s also possible to compare boxplots by another categorical variable.
  • In this case we include the manufacturing year of the cars.
  • The y-axis variable is displ (engine displacement, in litres).
# Use two categorical variables by one numeric variable
ggplot(mpg) + geom_boxplot(aes(x='cyl', y='displ', fill='year'))

  • In the next example I use a new dataset, economics, which has US employment data.
  • As we have monthly data, it’s interesting to do a boxplot with year on the x-axis. This is a quick way to get an idea of the long-term trend in the time series and a summary of the variability within each year.
economics.head()
   date        pce    pop     psavert  uempmed  unemploy
0  1967-07-01  507.4  198712  12.5     4.5      2944
1  1967-08-01  510.5  198911  12.5     4.7      2945
2  1967-09-01  516.3  199113  11.7     4.6      2958
3  1967-10-01  512.9  199311  12.5     4.9      3143
4  1967-11-01  518.1  199498  12.5     4.7      3066
economics.dtypes

date        datetime64[ns]
pce         float64
pop         int64
psavert     float64
uempmed     float64
unemploy    int64
dtype: object

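  • I extract the year from date and convert it to category dtype.
  • It’s also more relevant to plot an unemployment rate than the raw count, so I compute it as unemploy divided by pop, in percent.
  • To rotate the x-axis labels vertically I need to add a call to theme.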
economics['year'] = pd.DatetimeIndex(economics.date).year.astype('category')
economics['unemployment_rate'] = (economics['unemploy'] / economics['pop']) * 100
(ggplot(economics, aes(x='year', y='unemployment_rate')) + 
 geom_boxplot() + 
 theme(axis_text_x = element_text(angle=90, hjust=1))
)
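
To make the two derived columns easier to inspect, here is a small pandas-only sketch. It assumes a hand-built DataFrame (df, a hypothetical name) holding the first three rows from the economics.head() output above, and it uses the .dt accessor as an alternative to pd.DatetimeIndex:

```python
import pandas as pd

# First three rows of `economics`, copied from the head() output above.
df = pd.DataFrame({
    'date': pd.to_datetime(['1967-07-01', '1967-08-01', '1967-09-01']),
    'pop': [198712, 198911, 199113],
    'unemploy': [2944, 2945, 2958],
})

# `.dt.year` pulls the year out of a datetime column;
# casting to category makes plotnine treat it as discrete.
df['year'] = df['date'].dt.year.astype('category')

# Unemployed people as a percentage of the total population.
df['unemployment_rate'] = df['unemploy'] / df['pop'] * 100

print(df[['year', 'unemployment_rate']])
```

Because year is categorical, geom_boxplot draws one box per year instead of treating the axis as continuous.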

3. Facets

(ggplot(mpg) + 
 geom_histogram(aes(x='hwy'), bins=15) + 
 facet_wrap("~ class"))

  • It’s sometimes useful to use a free scale on the y-axis. For this I use the scales='free_y' argument in the facet_wrap call.
# Making scales of each plot independent
(ggplot(mpg) + 
 geom_histogram(aes(x='hwy'), bins=15) + 
 facet_wrap("~ class", scales='free_y'))

  • In this example I plot small multiples of scatter plots. The idea is to understand how the class variable affects the relation between displ and hwy.
(ggplot(mpg, aes(x='displ', y='hwy')) + 
 geom_point() + 
 facet_wrap("~ class"))

  • facet_grid is used for faceting by 2 categorical variables (one for rows, one for columns), whereas facet_wrap is mostly used with 1 categorical variable, even though it can also take 2. See the SO question linked above if this sounds confusing.
  • From a stats perspective it’s possible to think of this plot as showing how the interaction of cyl and year affects displ.
(ggplot(mpg, aes(x='displ')) + 
 geom_histogram() + 
 facet_grid("cyl ~ year"))

  • In this plot I’m adding two layers that visualize the data, geom_point and geom_smooth. Finally I use facet_wrap to make 3 scatter plots.
  • You can add models to geom_point plots using the default geom_smooth (which uses a local regression model), a linear model (method='lm') or a robust linear model (method='rlm'), as in the example below.
  • 'factor(gear)' is the way to convert an integer variable to categorical internally for plotting. You can also convert it to category dtype in pandas, but the inline form is more informal and handy for EDA.
(ggplot(mtcars, aes('wt', 'mpg', color='factor(gear)'))
 + geom_point()
 + geom_smooth(method='rlm')
 + facet_wrap('~gear'))