Composing Functions

programming

python

Published

December 6, 2020

R core looks like it’s getting a new pipe operator |> for composing functions. It’s just like the existing magrittr pipe %>%, but has been implemented as a syntax transformation so that it is more computationally efficient and produces better stack traces. The pipe means instead of writing f(g(h(x))) you can write x |> h |> g |> f, which can be really handy when changing dataframes.

Python’s Pandas library doesn’t have this kind of convenience and it opens up a class of error that won’t happen in that R code. Here’s a typical bit of Pandas code:

df_clean = df_raw[(df_raw['colour'] == 'blue') & (df_raw['price'] > 50)]
df_clean.loc[df_clean['price'].isna(), 'price'] = df_clean['price'].mean()

There is so much repetition here that it's easy to make a mistake. On the first line df_raw is typed 3 times; a typo putting in a different dataframe will lead to subtle runtime errors that are hard to pick up, and I've debugged them in my own code many times. The second line has a similar problem where df_clean is typed 3 times (if df_raw was put there by accident it could lead to an error). There are also other Pandas traps here: forgetting the brackets on the first line will lead to an error due to the precedence of &, and I don't know whether the second line actually changes df_raw (I may see some warning about that, and then if I want to preserve df_raw I'll put a .copy() in).
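As a rough sketch of the .copy() point, the snippet below makes the filtered frame an explicit copy before modifying it, so the original is left alone. The sample data is made up for illustration, and the filter is simplified to colour only so that a missing price survives into df_clean (a comparison like price > 50 is False for a missing value, so that filter would drop the row).

```python
import pandas as pd

# Hypothetical sample data, invented for this sketch.
df_raw = pd.DataFrame({
    "colour": ["blue", "blue", "red", "blue"],
    "price": [60.0, None, 90.0, 80.0],
})

# Taking an explicit copy makes df_clean independent of df_raw.
df_clean = df_raw[df_raw["colour"] == "blue"].copy()

# Fill the missing price with the mean of the remaining blue prices.
df_clean.loc[df_clean["price"].isna(), "price"] = df_clean["price"].mean()
```

Here df_raw still contains its missing value afterwards, while df_clean has it filled in.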

In R dplyr it’s much cleaner and you can’t accidentally type the wrong thing because we’re chaining (here using the magrittr %>%, but soon we will be able to use |>):

df_clean <- df_raw %>%
  filter(colour == "blue", price > 50) %>%
  mutate(price = ifelse(is.na(price), mean(price, na.rm = TRUE), price))

In Pandas you can use method chaining (and tools like the pandas pipe) to clean it up. Using query we can get something close to dplyr, but it’s still a bit clunky and query can be very slow:

df_clean = (df_raw
  .query('colour == "blue" & price > 50')
  .assign(price = lambda df: df['price'].fillna(df['price'].mean()))
)
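For the pandas pipe mentioned above, here is a minimal sketch of how plain functions can be chained with DataFrame.pipe. The helper names keep_colour and fill_missing_with_mean and the sample data are hypothetical, made up for this example; the filter is on colour only so that the fill step has something to do.

```python
import pandas as pd

def keep_colour(df: pd.DataFrame, colour: str) -> pd.DataFrame:
    """Keep only the rows with the given colour (hypothetical helper)."""
    return df[df["colour"] == colour]

def fill_missing_with_mean(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a new frame with missing values in `column` set to its mean."""
    return df.assign(**{column: df[column].fillna(df[column].mean())})

# Hypothetical sample data, invented for this sketch.
df_raw = pd.DataFrame({
    "colour": ["blue", "blue", "red", "blue"],
    "price": [60.0, None, 90.0, 80.0],
})

# Each step receives the previous result as its first argument.
df_clean = (df_raw
    .pipe(keep_colour, "blue")
    .pipe(fill_missing_with_mean, "price")
)
```

The dataframe name appears once at the top of the chain, and the helpers are ordinary functions that can be reused on any frame.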

However, there are cases where it's really hard to do in Pandas, like getting the second most common value in a group. Because Pandas is built by attaching methods to the DataFrame class, if there's no method for what you need you have to patch one in like pyjanitor does, and if you write a proper pandas extension it's quite verbose. Because R takes a functional approach, you can easily reuse common functions rather than having to write (and remember the names of!) DataFrame-specific ones.
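For the second-most-common-value case, one possible workaround is to step outside the method chain and hand a plain function to groupby. This is only a sketch: second_most_common is a hypothetical helper, the data is made up, and it assumes there are no ties in the counts.

```python
import pandas as pd

def second_most_common(values: pd.Series):
    """Return the second most frequent value, or None if there is only one."""
    counts = values.value_counts()  # sorted by count, most frequent first
    return counts.index[1] if len(counts) > 1 else None

# Hypothetical sample data, invented for this sketch.
df = pd.DataFrame({
    "colour": ["blue"] * 6 + ["red"] * 3,
    "size": ["S", "S", "S", "M", "M", "L", "S", "S", "L"],
})

# Apply the helper to the `size` values within each colour group.
result = df.groupby("colour")["size"].agg(second_most_common)
```

It works, but the logic lives in a custom function rather than in a chain of reusable verbs, which is the awkwardness described above.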

I think method chaining is a useful way to write data transformations; it exists in most functional languages (Haskell, OCaml, F#) and in Clojure's useful threading macros. It's even in Julia, and there's a proposal for JavaScript. You can implement it in Python in some sense using magic methods for infix operators, like Thinc's combinators, but it's against the grain in Python, which is not a functional language.
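To give a feel for the magic-method approach, here is a minimal sketch of a pipe built on the | operator. The Pipe class is a hypothetical wrapper written for this example (it is not how Thinc does it), and it assumes the value on the left does not handle | itself.

```python
class Pipe:
    """Wrap a one-argument function so that `value | Pipe(func)` calls func(value)."""

    def __init__(self, func):
        self.func = func

    def __ror__(self, value):
        # Python falls back to this when the left operand cannot handle `|`.
        return self.func(value)

def h(x):
    return x + 1

def g(x):
    return x * 2

def f(x):
    return x - 3

nested = f(g(h(5)))                      # inside-out, as in f(g(h(x)))
piped = 5 | Pipe(h) | Pipe(g) | Pipe(f)  # left to right, like x |> h |> g |> f
```

Both expressions compute the same value; the piped form simply reads in the order the functions are applied.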