# Second most common value with Pandas

python
pandas
r
Published

March 22, 2020

I really like method chaining in Pandas. It reduces the risk of typos or errors from running assignment out of order. However some things are really difficult to do with method chaining in Pandas; in particular getting the second most common value of each group.

This is much easier to do in R’s dplyr with its consistent and flexible syntax than it is with Pandas.

# Problem

For the table below find the total frequency and the second most common value of y by frequency for each x (in the case of ties any second most common value will suffice).

x y frequency
1 1 3
1 2 2
1 3 2
2 2 2
2 1 1
3 1 1

x frequency second
1 6 2
2 3 1
3 1 NA

## Solution with dplyr

This is pretty straightforward to solve in R with dplyr; we can first sort the columns by frequency and pick the second element:

``````d <- data.frame(x=c(1,1,1,2,2,3),
y=c(1,2,3,1,2,1),
n=c(3,2,1,1,2,1))

d %>%
group_by(x) %>%
arrange(desc(n)) %>%
summarise(n = sum(n), second = nth(y, 2))``````

This gives exactly the result above.

## Solving with Pandas

Starting with the same dataframe

``````df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 3],
'y': [1, 2, 3, 1, 2, 1],
'n': [3, 2, 1, 1, 2, 1]})``````

To get the results over multiple lines is straightforward:

``````totals = df.groupby('y').n.sum()
# Note nth is 0 indexed
second = df.sort_values('n', ascending=False).groupby('x').y.nth(1)
ans = pd.DataFrame({'n': totals, 'second': second})``````

This is fine, but it means you have to break a chain. You can chain directly with agg if we want to find the top value:

``````(df
.sort_values('n', ascending=False)
.groupby('x')
.agg(n=('n', 'sum'), first=('y', 'first'))
)``````

Unfortunately there’s no built in `second` function. There is an nth function, but there’s no way to pass the argument n in the `agg` call.

We could try to wrap nth in a partial, but I can’t work out where in pandas nth is defined. Passing `pandas.core.groupby.generic.DataFrameGroupBy.nth` to agg gives an error.

``````> df.groupby('y').agg(a=('x', lambda x: nth(x,1)))
TypeError: n needs to be an int or a list/set/tuple of ints``````

We could try to define our own function to find the nth item, `iloc` almost works, but if there’s an item that doesn’t have an nth item it raises an IndexError.

``````> (df
.sort_values('n', ascending=False)
.groupby('x')
.agg(n=('n', 'sum'), second=('y', lambda y: y.iloc))
)
IndexError: single positional indexer is out-of-bounds``````

Another strategy would be to slice into a running count; in dplyr:

``````d %>%
group_by(x) %>%
arrange(desc(n)) %>%
mutate(rn = row_number(),
second = ifelse(rn == 2, y, NA)) %>%
summarise(n=sum(n), second=first(na.omit(second)))``````

We can do this in Pandas because the `first` function ignores NaN values where it can. Without chaining in Pandas this looks like:

``````df['rn'] = df.sort_values('n', ascending=False).groupby('x').cumcount()
df['second'] = df.y[df.rn == 1]
df.groupby('x').agg(n=('n', 'sum'), second=('second', 'first'))``````

I’m still not sure how to chain either of the two ways! One way I can get it to chain is by setting the index and assigning:

``````(df
.sort_values('n', ascending=False)
.set_index('x')
.assign(second=lambda df: df.groupby('x').y.nth(1))
.groupby('x')
.agg(n=('n', 'sum'), second=('second', 'first'))
)``````

Unfortunately here the cure is worse than the disease and the chain is hard to manage and unreadable.

A cleaner way is to define an `nth` function that does what we need:

``````import numpy as np
def get_nth(n):
def nth(x):
return x[n] if len(x) > n else np.nan
return nth

(df
.sort_values('n', ascending=False)
.groupby('x')
.agg(n=('n', 'sum'), second=('y', get_nth(1)))
)``````

Another even better option suggested by Samuel Oranyeli is to use `pipe` to be able to use `nth` with other aggregations:

``````df
.groupby('x')
.pipe(lambda df: pd.DataFrame({'frequency' : df.n.sum(),
'second' : df.y.nth(1)}))``````

While these will do, they’re still quite frustrating to use. I’ll be watching the Python libraries that are built on top of Pandas, like Siuba (an adaptation of dplyr) and Datatable (an adaptation of the R DataTables library), which may make these transformations easier to do.