I really like method chaining in Pandas. It reduces the risk of typos or errors from running assignment out of order. However some things are really difficult to do with method chaining in Pandas; in particular getting the second most common value of each group.

This is much easier to do in R's dplyr with its consistent and flexible syntax than it is with Pandas.


For the table below find the total frequency and the second most common value of y by frequency for each x (in the case of ties any second most common value will suffice).

x y frequency
1 1 3
1 2 2
1 3 2
2 2 2
2 1 1
3 1 1


x frequency second
1 6 2
2 3 1
3 1 NA

Solution with dplyr

This is pretty straightforward to solve in R with dplyr; we can first sort the columns by frequency and pick the second element:

d <- data.frame(x=c(1,1,1,2,2,3),

d %>%
group_by(x) %>%
arrange(desc(n)) %>%
summarise(n = sum(n), second = nth(y, 2))

This gives exactly the result above.

Solving with Pandas

Starting with the same dataframe

df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 3],
                   'y': [1, 2, 3, 1, 2, 1],
                   'n': [3, 2, 1, 1, 2, 1]})

To get the results over multiple lines is straightforward:

totals = df.groupby('y').n.sum()
# Note nth is 0 indexed
second = df.sort_values('n', ascending=False).groupby('x').y.nth(1)
ans = pd.DataFrame({'n': totals, 'second': second})

This is fine, but it means you have to break a chain. You can chain directly with agg if we want to find the top value:

.sort_values('n', ascending=False)
.agg(n=('n', 'sum'), first=('y', 'first'))

Unfortunately there's no built in second function. There is an nth function, but there's no way to pass the argument n in the agg call.

We could try to wrap nth in a partial, but I can't work out where in pandas nth is defined. Passing pandas.core.groupby.generic.DataFrameGroupBy.nth to agg gives an error.

> df.groupby('y').agg(a=('x', lambda x: nth(x,1)))
TypeError: n needs to be an int or a list/set/tuple of ints

We could try to define our own function to find the nth item, iloc almost works, but if there's an item that doesn't have an nth item it raises an IndexError.

> (df
.sort_values('n', ascending=False)
.agg(n=('n', 'sum'), second=('y', lambda y: y.iloc[1]))
IndexError: single positional indexer is out-of-bounds

Another strategy would be to slice into a running count; in dplyr:

d %>%
group_by(x) %>%
arrange(desc(n)) %>%
mutate(rn = row_number(),
       second = ifelse(rn == 2, y, NA)) %>%
summarise(n=sum(n), second=first(na.omit(second)))

We can do this in Pandas because the first function ignores NaN values where it can. Without chaining in Pandas this looks like:

df['rn'] = df.sort_values('n', ascending=False).groupby('x').cumcount()
df['second'] = df.y[df.rn == 1]
df.groupby('x').agg(n=('n', 'sum'), second=('second', 'first'))

I'm still not sure how to chain either of the two ways! One way I can get it to chain is by setting the index and assigning:

.sort_values('n', ascending=False)
.assign(second=lambda df: df.groupby('x').y.nth(1))
.agg(n=('n', 'sum'), second=('second', 'first'))

Unfortunately here the cure is worse than the disease and the chain is hard to manage and unreadable.

The cleanest way is to define an nth function that does what we need:

import numpy as np
def get_nth(n):
  def nth(x):
    return x[n] if len(x) > n else np.nan
  return nth

.sort_values('n', ascending=False)
.agg(n=('n', 'sum'), second=('y', get_nth(1)))

Finally we have a nice way that works but it requires defining a new function, because I can't work out how to use Pandas inbuilt nth functionality. While method chaining can make things easier it can be frustrating trying to work with agg in Pandas because you can't generally pass an argument to string functions.