Imagine you had to implement some action across classes in 60 packages. You know what result you want, but you may need to handle each class in a specific way. For example, you might want a tidy() function that summarizes statistical models from many different modelling packages. In this post I will discuss two approaches you could take to do this.
While the first approach makes it easy to find everything in one place (the class definition), it requires coordination and cooperation across packages. It also fattens the class up with additional responsibilities.
On the other hand, the second approach, single dispatch, makes it easy to focus on the action taking place (e.g. representing HTML output or summarizing a model), and separates a class's core job from additional behaviors.
In this post, I will examine how IPython and the R package broom use single dispatch to implement actions that (1) let class internals focus on a small set of responsibilities, (2) don't require going on a sixty-package PR-submitting spree, and (3) allow users to cleanly experiment with new actions.
This post is part of a series discussing why I chose single dispatch for siuba, my port of dplyr to python. For more context on that decision, see this architecture doc.
Python comes with many built-in functions, like repr(), which is short for "represent". You can think of this function as producing a string that, when printed, essentially reproduces what you gave it. For example, the code below shows printed representations of 1 and "1".
int_repr = repr(1)
str_repr = repr('1')
print(int_repr, str_repr)
1 '1'
While this works out of the box with built-in types, python also allows you to implement repr
on your own classes. In order to do this, you define a method named __repr__
.
class Shout:
    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return "Shout({})".format(repr(self.text))
print(repr(Shout("yo!")))
Shout('yo!')
Notice that if I were to paste the representation from above into my code, it would essentially reproduce that shout object!
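As a quick (and admittedly hacky) check, evaluating that representation with eval() rebuilds an equivalent object:

new_shout = eval(repr(Shout("yo!")))   # eval used only to illustrate the round trip
print(new_shout)

Shout('yo!')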
There are many methods like __repr__, and they are foundational for creating new classes. But what if we wanted to create a new function, like repr_html()? This function could output an HTML representation of a class, which would be useful for displaying in jupyter notebooks!
One approach is to ask classes to follow a convention. This is the approach taken by IPython, which essentially tells classes:
if you name a method _repr_html_, then I will use it for displaying in a notebook.

This is shown below.
class Shout2(Shout):
    # method name IPython knows to look for
    def _repr_html_(self):
        return "<h1>" + self.text + "</h1>"
In practice, when a jupyter notebook goes to represent an object like Shout2, it will grab the _repr_html_ method or fall back to __repr__.
This approach can work well, but leaves open a major question: what if classes don’t follow our convention? For example, the maintainer of another package might not want to add custom methods for every new action people want to perform.
Fortunately, IPython's html formatter allows an alternative approach: we can tell it to call (dispatch) a specific function whenever it sees an object of a specific class.
def shout_html(shout_obj):
    return '<h1>{}</h1>'.format(repr(shout_obj.text))
# special ipython function to get the html formatter
html_formatter = get_ipython().display_formatter.formatters['text/html']
# when you see a Shout object, call the shout_html function on it
html_formatter.for_type(Shout, shout_html)
Notice the difference between these two approaches:

- In the first, the class itself must define a specially named method (_repr_html_), so supporting a new action means changing the class.
- In the second, we register a function for a class from the outside, so the class doesn't need to know anything about how it will be displayed.
In this section we looked at how IPython uses single dispatch. In the following section, I’ll quickly review python’s built-in tools for single dispatch, before considering how it could be used to replicate the success of the R package broom.
While the above example focused on IPython, it's worth noting that python's built-in functools library has a handy implementation of single dispatch.
from functools import singledispatch

@singledispatch
def repr_html(obj):
    # default implementation
    return repr(obj)

@repr_html.register(Shout)
def _repr_html_shout(obj):
    # implementation for Shout
    return '<h1>{}</h1>'.format(obj.text)
In this case, repr_html
produces different behaviors, depending on the type of object it receives.
repr_html("yo!") # string class uses default
"'yo!'"
repr_html(Shout("yo!")) # Shout class gets custom
'<h1>yo!</h1>'
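As a side note, if you're on python 3.7 or later, register() can also infer the class from a type annotation, so an implementation for int could look like this:

@repr_html.register
def _repr_html_int(obj: int):
    # class inferred from the type annotation
    return '<code>{}</code>'.format(obj)

repr_html(1)

'<code>1</code>'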
With this tool in hand, let's focus on where it really shines: implementing actions across classes from many packages. In the next section, I'll discuss how it could immensely benefit data science, which is flush with packages implementing different data sources (e.g. pandas) and modelling methods (e.g. scikit-learn).
Consider that in data analysis you often have two kinds of things:

- classes that hold data or the results of a computation (e.g. a DataFrame, a model fit)
- actions you want to perform on them (e.g. summarizing)

Oftentimes, developers lump these two things together.
For example, pandas DataFrames have a .mean()
method, and statsmodels has a method to summarize model parameters.
This is fine within a package–but what happens when we want to implement a new action across packages? We probably don’t want everyone to sit down and duke it out over what methods can be added, and what they should return.
Single dispatch lets people separate development of data classes from the actions taken over them!
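For instance, here is a minimal sketch of that separation, reusing the repr_html function from the previous section: we can register display behavior for a class we don't control, like a pandas DataFrame, without modifying pandas itself.

import pandas as pd

@repr_html.register(pd.DataFrame)
def _repr_html_frame(obj):
    # pandas already knows how to render itself as an HTML table
    return obj.to_html()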
One example of where this was done at a large scale is the R package broom
.
As a person who needs to fit statistical models implemented in different packages, I can relate to the endless frustration of trying to find where in a model fit something is stored. Did my model object (let's call it fit) put the coefficients in fit.coef, fit.coef_, or fit.get_coefficients(robust = True)?
The problems don’t stop once you can find a summary of your model, because identical statistical models, when fit by different packages, may use slightly (or radically) different summaries.
The R package broom recognized the need for consistent reporting, and created:

- tidy() - a single dispatch function in R
- a specification of the data frame tidy() should produce

And that was it.
While they encourage modelling packages to hold and maintain their class-specific implementations of tidy(), it isn't necessary, and broom's github repo contains implementations for many packages.
In this section, I give a rough demo of what tidy() might look like in python, applied to the same model fit with statsmodels, scikit-learn, and pymc3. In the code below I set up the demo.
import numpy as np
import pandas as pd
from functools import singledispatch
from siuba.data import mtcars
pd.set_option('display.precision', 2)
pd.set_option('display.width', 100)
pd.set_option('display.max_columns', 10)
@singledispatch
def tidy(fit):
    raise NotImplementedError("No tidy method for class %s" % fit.__class__.__name__)
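As a quick sanity check, calling tidy() on an object with no registered implementation hits this default and raises an informative error:

try:
    tidy("not a model")
except NotImplementedError as e:
    print(e)

No tidy method for class str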
Below are the library-specific tidy() implementations for each of the three packages: statsmodels, scikit-learn, and pymc3.
# Fit statsmodels -------------------------------------------------------------
import statsmodels.api as sm
import statsmodels.formula.api as smf
results = smf.ols('mpg ~ hp', data=mtcars).fit()
# Tidy implementation ---------------------------------------------------------
@tidy.register(results.__class__)
def _tidy_statsmodels(fit):
    from statsmodels.iolib.summary import summary_params_frame

    tidied = summary_params_frame(fit).reset_index()

    rename_cols = {
        'index': 'term', 'coef': 'estimate', 'std err': 'std_err',
        't': 'statistic', 'P>|t|': 'p_value',
        'Conf. Int. Low': 'conf_int_low', 'Conf. Int. Upp.': 'conf_int_high'
    }

    return tidied.rename(columns = rename_cols)
tidy(results)
term estimate std_err statistic p_value conf_int_low conf_int_high
0 Intercept 30.10 1.63 18.42 6.64e-18 26.76 33.44
1 hp -0.07 0.01 -6.74 1.79e-07 -0.09 -0.05
# fit sklearn -----------------------------------------------------------------
from sklearn.linear_model import LinearRegression
from siuba.data import mtcars
X = mtcars[['hp']]
y = mtcars['mpg']
# fit a linear regression of mpg on hp
reg = LinearRegression().fit(X, y)
# TIDY IMPLEMENTATION ---
import pandas as pd
@tidy.register(reg.__class__)
def _tidy_sklearn(fit, col_names = None):
    estimates = [fit.intercept_, *fit.coef_]

    if col_names is None:
        terms = list(range(len(estimates)))
    else:
        terms = ['intercept', *col_names]

    # sklearn does not report standard errors, so fill that column with NaN
    return pd.DataFrame({
        'term': terms, 'estimate': estimates,
        'std_error': np.nan
    })
tidy(reg, col_names = X.columns)
term estimate std_error
0 intercept 30.10 NaN
1 hp -0.07 NaN
# fit pymc3 -----------------------------------------------------------------
from pymc3 import *
x = mtcars['hp'].values
y = mtcars['mpg'].values
data = dict(x=x, y=y)
np.random.seed(999999)
with Model() as model:  # model specifications in PyMC3 are wrapped in a with-statement
    # Define priors
    sigma = HalfCauchy('sigma', beta=10, testval=1.)
    intercept = Normal('intercept', 0, sigma=20)
    x_coeff = Normal('hp', 0, sigma=20)

    # Define likelihood
    likelihood = Normal('mpg', mu=intercept + x_coeff * x,
                        sigma=sigma, observed=y)

    # Inference!
    trace = sample(500, cores=2, progressbar=False)  # draw 500 posterior samples using NUTS sampling
# Tidy implementation ---------------------------------------------------------
@tidy.register(trace.__class__)
def _tidy_trace(fit, robust = False):
    trace_df = trace_to_dataframe(fit)

    agg_funcs = ['median', 'mad'] if robust else ['mean', 'std']

    # data frame with columns like: median, mad.
    tidied = trace_df.agg(agg_funcs).T.reset_index()
    tidied.columns = ['term', 'estimate', 'std_err']

    return tidied
tidy(trace)
term estimate std_err
0 intercept 29.93 1.67
1 hp -0.07 0.01
2 sigma 4.03 0.53
Notice how differently each package stores and reports its results! Much of what the tidy function does is simple calculations, re-arranging outputs, and renaming columns.
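Because every implementation returns a DataFrame with the same term and estimate columns, the results line up nicely. As a small illustration, reusing the three fits from above, we can stack the tidied results:

fits = {
    'statsmodels': tidy(results),
    'sklearn': tidy(reg, col_names=X.columns),
    'pymc3': tidy(trace),
}
all_tidied = pd.concat(fits, sort=False)
print(all_tidied[['term', 'estimate']])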
And what if people disagree about what tidy() should return? It doesn't matter! Five different people could start their own versions of the broom package, and we wouldn't have to wait for the maintainers of scikit-learn or statsmodels or pymc3 or anyone else.
Single dispatch keeps a practice already embodied by python actions like repr(), while allowing class-specific implementations to be specified outside the class itself.
This has 3 major benefits:

- class internals can focus on a small set of responsibilities
- implementing a new action doesn't require a sixty-package PR-submitting spree
- users can cleanly experiment with new actions
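The last point is easy to demonstrate. As a rough sketch, borrowing broom's glance() name for a one-row model summary, a user could prototype a brand new action on top of the sklearn fit from the demo above, without touching any package's classes:

@singledispatch
def glance(fit):
    raise NotImplementedError("No glance method for class %s" % fit.__class__.__name__)

@glance.register(reg.__class__)
def _glance_sklearn(fit):
    # one row summarizing overall fit (here, just R squared on the training data)
    return pd.DataFrame({'r_squared': [fit.score(X, y)]})

glance(reg)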
While I focused on its beneficial use in IPython and broom, it's worth noting that many packages successfully implement this approach. For example, dbplyr is an R library for executing the same analysis code on many different data sources (such as a local dataframe or a SQL table).
In my next post, I will focus on how siuba uses singledispatch to try to enable the same distributed collaboration as broom and dplyr.
If you have questions or thoughts, please reach out on twitter [@chowthedog](https://twitter.com/chowthedog)!