I’ve had to dive into pandas’ code base over the last year for a project (siuba), and my attitude has shifted dramatically from frustration to appreciation.
I think this is most apparent in the functions that decide what dtype a Block—the most basic thing that stores data in pandas—should be.
For the ubiquitous object dtype, these functions often have to figure out which of the many possible, more specific types to cast to.
Ask yourself what this outputs:
import pandas as pd
import numpy as np
np.array([np.nan, "a"])
Lo and behold—it produces an array where the np.nan has been converted to the string “nan”.
array(['nan', 'a'], dtype='<U32')
And yet,
pd.Series([np.nan, "a"])
0 NaN
1 a
dtype: object
pandas knows this, has your back, and does not stringify it!
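Another quick illustration of this inference at work: hand pandas integers with a missing value, and it upcasts to float64, since NaN can’t be stored in an integer array.
pd.Series([1, 2, np.nan])
0    1.0
1    2.0
2    NaN
dtype: float64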
pandas is also almost pathologically careful about when it tries to convert dtypes, since avoiding all the bad conversion outcomes is a relatively time-intensive process (compared to, e.g., creating a numpy array).
I realize things could be much easier in pandas’ user-facing interface, but I really appreciate the sheer amount of effort that has gone into its dtype wrangling.
Here are three libraries built in large part on pandas.
It’s worth noting that if you are a user coming from dplyr, the ibis library is probably a closer fit to what you’re looking for.
As I’ve worked on siuba—a port of dplyr to python—I’ve had to read code from dplyr’s ecosystem of packages, the Tidyverse. Looking through Tidyverse source code, it is clear to me that Tidyverse developers often have to do two things…
While porting dplyr’s user interface over, I’ve been increasingly surprised that pandas has already done the heavy lifting on the underlying calculations. Often, I am just wrapping a couple pandas methods, handling some edge cases, and resetting indexes.
A good example of this was implementing the spread() function. In dplyr, spread() reshapes data from a longer to a wider format, based on a pair of columns:
The data is transformed so that each kind of measurement gets its own column.
For example, the figure below—taken from R for Data Science—shows the original data on the left, which measures “cases” and “population” for different countries over time. Here, the measure column is named key. It shows how spread() maps the values to create the data on the right.
In pandas, this can roughly be achieved with the code below.
wide = long.pivot_table(index = ["Country", "year"], columns = "key")
wide
value
key cases population
Country year
Afghanistan 1999 745 19987071
2000 2666 20595360
Brazil 1999 37737 172006362
2000 80488 174504898
China 1999 212258 1272915272
2000 213766 1280428583
This gets us most of the way there, but there are two big downsides with this result: the column index has picked up an extra level (left over from the original value column), and Country and year have been moved into the row index rather than kept as regular columns.
These two things can make the DataFrame difficult to use and reason about, because you have to use different strategies when operating on indexes. One of siuba’s main jobs is to reset them, so that as much data as possible is stored as columns.
As it turns out, this is a little bit complicated.
(wide
    .droplevel(0, axis=1)                   # drop the extra "value" level from the columns
    .reset_index()                          # move Country and year back into columns
    .rename_axis(columns = lambda x: None)  # clear the leftover "key" name on the column index
)
Country year cases population
0 Afghanistan 1999 745 19987071
1 Afghanistan 2000 2666 20595360
2 Brazil 1999 37737 172006362
3 Brazil 2000 80488 174504898
4 China 1999 212258 1272915272
5 China 2000 213766 1280428583
When doing data analysis, having to do this kind of fine-grained index and name fiddling is probably not what you want. However, as a developer, this kind of granularity is pretty powerful—and it made implementing spread() a quick task!
Altogether, the final siuba code to produce the above result is more of a convenience wrapper.
from siuba import _, spread
# spreading from longer to wider format
spread(long, _.key, _.value)
(Note: siuba uses the unstack method in its spread implementation, rather than pivot_table, but the idea is the same.)
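For the curious, a rough sketch of that unstack-based approach (not siuba’s exact code) might look like this:
# move key into the index, then unstack it out into columns
(long
    .set_index(["Country", "year", "key"])["value"]
    .unstack("key")              # one column per unique key
    .reset_index()               # Country and year back into columns
    .rename_axis(columns=None)   # clear the leftover "key" name
)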
Recently, dplyr released more general versions of spread and gather, called pivot_wider and pivot_longer.
As a testament to pandas’ flexibility, it looks like all the functionality needed to port these functions is already there.
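For example, the core of pivot_longer maps onto pandas’ melt. As a sketch, assuming the cleaned-up wide table from earlier is saved as tidy, the round trip back to long format is:
# melt is pandas' counterpart to gather / pivot_longer
tidy.melt(id_vars=["Country", "year"], var_name="key", value_name="value")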
pandas is a useful tool for data analysis in its own right. When you’re using it, you might miss some of the difficult jobs it is doing for you, like ensuring reasonable behavior around data types.
But that’s only half the picture. It also serves as a foundation for a lot of interesting tools, like…
It seems like there’s been a lot of frustration surfacing on twitter lately from people coming from R—especially if they’ve used dplyr and ggplot—towards pandas and matplotlib. I can relate. I’m developing a port of dplyr to python. But in the end, it’s probably helpful to view these libraries as foundational to a lot of other, higher-level libraries (some of which will hopefully get things right for you!).