R core looks like it's getting a new pipe operator |> for composing functions. It's much like the existing magrittr pipe %>%, but it has been implemented as a syntax transformation, so it is more computationally efficient and produces better stack traces. The pipe means that instead of writing f(g(h(x))) you can write x |> h() |> g() |> f(), which can be really handy when transforming dataframes.
Python's Pandas library doesn't have this kind of convenience, and its absence opens up a class of errors that can't happen in the piped R code. Here's a typical bit of Pandas code:
df_clean = df_raw[(df_raw['colour'] == 'blue') & (df_raw['price'] > 50)]
df_clean.loc[df_clean['price'].isna(), 'price'] = df_clean['price'].mean()
There is so much repetition here that it's easy to make a mistake. On the first line df_raw is typed 3 times; a typo that puts in a different dataframe will lead to subtle runtime errors that are hard to pick up, and I've debugged them in my own code many times. The second line has a similar problem, where df_clean is typed 3 times (if df_raw were put there by accident it could lead to an error).
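To make that class of error concrete, here is a minimal sketch of the wrong-dataframe typo. The data and the name df_other are made up for illustration; df_other stands in for any other dataframe that happens to be in scope with the same index.

```python
import pandas as pd

# Hypothetical data: two dataframes that share the same index.
df_raw = pd.DataFrame({"colour": ["blue", "red", "blue"],
                       "price": [60.0, 70.0, 40.0]})
df_other = pd.DataFrame({"colour": ["red", "blue", "blue"],
                         "price": [10.0, 20.0, 30.0]})

# Intended filter: every reference is to df_raw.
intended = df_raw[(df_raw["colour"] == "blue") & (df_raw["price"] > 50)]

# Typo: the colour mask is built from df_other. The indexes line up,
# so pandas applies the mask without any error or warning.
typo = df_raw[(df_other["colour"] == "blue") & (df_raw["price"] > 50)]

print(intended.index.tolist())  # [0]
print(typo.index.tolist())      # [1], a red row that should have been excluded
```

Nothing fails here; the result is just quietly wrong, which is what makes this kind of bug hard to spot.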
There are also other Pandas traps here: forgetting the brackets on the first line will lead to an error due to the precedence of &, and I don't know whether the second line actually changes df_raw (I may see some warning about that, and then if I want to preserve df_raw I'll put a
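The precedence trap can be reproduced with a small sketch. The data is made up, and the exact exception type and message may vary between Pandas versions, so the code below catches both of the likely ones.

```python
import pandas as pd

# Hypothetical data for illustration.
df_raw = pd.DataFrame({"colour": ["blue", "red"], "price": [60.0, 40.0]})

try:
    # Without brackets, & binds more tightly than == and >, so Python
    # tries to evaluate 'blue' & df_raw['price'] first.
    df_raw[df_raw["colour"] == "blue" & df_raw["price"] > 50]
    raised = False
except (TypeError, ValueError) as err:
    raised = True
    print(type(err).__name__)

# With brackets around each comparison the filter behaves as intended.
with_brackets = df_raw[(df_raw["colour"] == "blue") & (df_raw["price"] > 50)]
print(with_brackets.index.tolist())
```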
In R with dplyr it's much cleaner, and you can't accidentally type the wrong thing because we're chaining (here using the magrittr %>%, but soon we will be able to use |>):
df_clean <- df_raw %>%
  filter(colour == "blue", price > 50) %>%
  mutate(price = ifelse(is.na(price), mean(price, na.rm = TRUE), price))
Pandas can get close to this with method chaining:
df_clean = (
    df_raw
    .query('colour == "blue" & price > 50')
    .assign(price=lambda df: df['price'].fillna(df['price'].mean()))
)
However, there are cases where this is really hard to do in Pandas, like getting the second most common value in a group. Because Pandas is built by attaching methods to the DataFrame class, if there's no method for what you need you have to patch one in like pyjanitor does, but if you write a proper Pandas extension it's quite verbose. Because R uses a functional approach, you can easily reuse common functions rather than having to write (and remember the names of!) DataFrame-specific ones.
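One way to approximate the functional style in Pandas without patching the class is DataFrame.pipe, which hands the dataframe to an ordinary function. Here is a rough sketch for the second most common value in a group. The function names nth_most_common and second_most_common_by and the sample data are hypothetical, and this is not how pyjanitor does it.

```python
import pandas as pd


def nth_most_common(values: pd.Series, n: int = 2):
    """Return the n-th most common value, or None if there are fewer than n distinct values."""
    counts = values.value_counts()  # sorted by count, most common first
    return counts.index[n - 1] if len(counts) >= n else None


def second_most_common_by(df: pd.DataFrame, group_col: str, value_col: str) -> pd.Series:
    """For each group, find the second most common value in value_col."""
    return df.groupby(group_col)[value_col].apply(lambda s: nth_most_common(s, 2))


# Hypothetical data, chosen so that no counts are tied within a group.
df = pd.DataFrame({
    "shop": ["a", "a", "a", "a", "a", "a", "b", "b", "b"],
    "colour": ["blue", "blue", "blue", "red", "red", "green", "red", "red", "blue"],
})

# .pipe passes df as the first argument, so the call still reads as a chain.
result = df.pipe(second_most_common_by, "shop", "colour")
print(result)
```

This keeps the helper as a plain function that works on any Series, at the cost of the slightly clunkier .pipe(...) call compared with a real pipe operator. Ties between equally common values are not handled specially in this sketch.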