Testing Pandas transformations with Hypothesis
Pandas and NumPy let you perform fast transformations on large datasets by executing optimised low-level code. However, the syntax is very terse, and it can quickly become hard to see what the code is doing. The logic is often clearer in pure Python, but running a Python function through Pandas’ apply is much slower. Hypothesis gives a way to check that the two versions do the same thing.
For example, I’ve got some code where I have a salary, but I don’t know whether the rate is hourly, daily or annual. I want to infer the period from some rules and return the number of hours it refers to. I can compare a version that works on a Pandas series, series_infer_salary_period_hours, with one that works on individual salaries, infer_salary_period_hours, as follows:
from hypothesis import given
from hypothesis.strategies import floats
from hypothesis.extra.pandas import series
from pandas.testing import assert_series_equal

@given(series(elements=floats(0, 500_000)))
def test_infer_salary_period_hours_apply(s):
    assert_series_equal(series_infer_salary_period_hours(s),
                        s.apply(infer_salary_period_hours).astype('Int64'))
Things to note are that we restrict the elements to a reasonable range, and have to be careful with the types.
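To make the point about types concrete: assert_series_equal checks the dtype as well as the values by default, and apply does not return a nullable Int64 series on its own, so the cast is needed before comparing. Here is a minimal sketch; toy_hours is a hypothetical stand-in for the real scalar function, not the code from this post.

```python
import pandas as pd
from pandas.testing import assert_series_equal


def toy_hours(salary):
    """Hypothetical scalar rule: 1 for hourly-looking rates, otherwise None."""
    return 1 if 15 <= salary <= 100 else None


s = pd.Series([50.0, 5.0])

# The vectorised version builds a nullable integer series directly.
vectorised = pd.Series(None, index=s.index, dtype="Int64")
vectorised[s.between(15, 100)] = 1

# The apply version is not Int64 by default, so cast it before comparing.
applied = s.apply(toy_hours).astype("Int64")

assert_series_equal(vectorised, applied)
```

Without the astype call the comparison would be between series of different dtypes, which assert_series_equal treats as a failure unless check_dtype=False is passed.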
We can also check that it works the same on a one-element series. In this case it’s easy to check special edge cases at the boundaries using the @example decorator.
from hypothesis import example, given
from hypothesis.strategies import floats
import pandas as pd

@given(floats(0, 500_000))
@example(15)
@example(100)
@example(300)
@example(1000)
@example(20_000)
def test_infer_salary_period_hours_element(s):
    s_series = pd.Series([s])
    series_ans = series_infer_salary_period_hours(s_series).iloc[0]
    ans = infer_salary_period_hours(s)
    assert ans == series_ans or (ans is None and pd.isna(series_ans))
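Note that we have to be a bit careful about how we check None. Pandas converts it to a missing value (nan, or pd.NA for a nullable dtype such as Int64), and a missing value never compares equal to None, so the test falls back to pd.isna for that case.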
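The comparison behaviour behind that check can be seen in a few lines. This is only an illustrative sketch of how missing values compare, not part of the tests themselves.

```python
import pandas as pd

nan = float("nan")
assert nan != nan          # NaN never compares equal, even to itself
assert not (None == nan)   # so a None result cannot be matched with ==

# pd.isna recognises every flavour of missing value.
assert pd.isna(nan) and pd.isna(None) and pd.isna(pd.NA)

# A nullable Int64 series stores a missing entry as pd.NA.
missing = pd.Series([None], dtype="Int64").iloc[0]
assert missing is pd.NA
assert pd.isna(missing)
```

This is why the element-level test uses pd.isna rather than an equality check when the scalar function returns None.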
In general it can be useful to check a NumPy or Pandas row-level function against a scalar function written in vanilla Python.
Here’s a full extract of this example:
from hypothesis import example, given
from hypothesis.strategies import floats
from hypothesis.extra.pandas import series
from typing import Optional
import pandas as pd
from pandas.testing import assert_series_equal

def infer_salary_period_hours(salary: float) -> Optional[int]:
    """Infer salary period from a salary.

    Returns None if can't infer a period.
    """
    if 15 <= salary <= 100:
        # Likely hourly rate
        return 1
    elif 300 <= salary <= 1000:
        # Likely daily rate
        return 40
    elif salary >= 20_000:
        # Likely annual
        return 2_000
    return None

def series_infer_salary_period_hours(s: pd.Series) -> pd.Series:
    ans = pd.Series(None, s.index, dtype='Int64')
    ans[s.between(15, 100)] = 1
    ans[s.between(300, 1000)] = 40
    ans[s >= 20_000] = 2_000
    return ans

@given(series(elements=floats(0, 500_000)))
def test_infer_salary_period_hours_apply(s):
    assert_series_equal(series_infer_salary_period_hours(s),
                        s.apply(infer_salary_period_hours).astype('Int64'))

@given(floats(0, 500_000))
@example(15)
@example(100)
@example(300)
@example(1000)
@example(20_000)
def test_infer_salary_period_hours_element(s):
    s_series = pd.Series([s])
    series_ans = series_infer_salary_period_hours(s_series).iloc[0]
    ans = infer_salary_period_hours(s)
    assert ans == series_ans or (ans is None and pd.isna(series_ans))
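The same pattern should carry over to plain NumPy. Below is a rough sketch under stated assumptions: scalar_band and array_band are hypothetical functions invented for illustration (they are not from the salary code above), and the arrays strategy from hypothesis.extra.numpy is used to generate the inputs.

```python
import numpy as np
from hypothesis import given
from hypothesis.extra.numpy import arrays
from hypothesis.strategies import floats, integers


def scalar_band(x: float) -> int:
    """Hypothetical scalar rule: 0 below 100, 1 below 1000, 2 otherwise."""
    if x < 100:
        return 0
    elif x < 1000:
        return 1
    return 2


def array_band(xs: np.ndarray) -> np.ndarray:
    """Vectorised version of scalar_band; np.select takes the first matching condition."""
    return np.select([xs < 100, xs < 1000], [0, 1], default=2)


# Generate 1-d float arrays of length 0 to 20 with elements in a sensible range.
@given(arrays(np.float64, integers(0, 20), elements=floats(0, 500_000)))
def test_array_band_matches_scalar(xs):
    expected = np.array([scalar_band(x) for x in xs], dtype=np.int64)
    np.testing.assert_array_equal(array_band(xs), expected)
```

As with the Pandas tests, the scalar function acts as the readable specification and the vectorised function is what gets checked against it.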