Testing Pandas transformations with Hypothesis
pandas

Pandas and numpy let you perform fast transformations on large datasets by executing optimised low-level code. However, the syntax is very terse and it can quickly become hard to see what the code is doing. Often it's clearer in pure Python, but Pandas' apply function is much slower. Hypothesis gives a way to check that the two versions are doing the same thing. For example, I've got some code that takes a salary, but I don't know whether the rate is hourly, daily or annual.
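
As a sketch of the approach (the hourly/daily/annual thresholds below are made up for illustration, not the ones from my actual code), Hypothesis can generate dataframes and assert that the terse vectorised version agrees with the readable pure-Python one:

import numpy as np
import pandas as pd
from hypothesis import given
from hypothesis.extra.pandas import column, data_frames
from hypothesis.strategies import floats

def annualise_vectorised(salary: pd.Series) -> pd.Series:
    # Terse but fast: a single pass in optimised low-level code
    return pd.Series(
        np.select([salary < 500, salary < 5000],
                  [salary * 8 * 250, salary * 250],
                  default=salary),
        index=salary.index, name=salary.name)

def annualise_python(salary: float) -> float:
    # Clearer, but much slower when used via .apply
    if salary < 500:       # looks like an hourly rate
        return salary * 8 * 250
    if salary < 5000:      # looks like a daily rate
        return salary * 250
    return salary          # already annual

@given(data_frames([column('salary', elements=floats(0, 1e7))]))
def test_annualise_versions_agree(df):
    pd.testing.assert_series_equal(
        annualise_vectorised(df['salary']),
        df['salary'].apply(annualise_python),
        check_dtype=False)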

Writing Pandas Dataframes to S3
pandas

To write a Pandas (or Dask) dataframe to Amazon S3 or Google Cloud Storage, all you need to do is pass an S3 or GCS path to a serialisation function, e.g.

# df is a pandas dataframe
df.to_csv(f's3://{bucket}/{key}')

Under the hood Pandas uses fsspec, which lets you work easily with remote filesystems and abstracts over s3fs for Amazon S3 and gcsfs for Google Cloud Storage (and other backends such as (S)FTP, SSH or HDFS).
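
As a minimal sketch (the bucket name is hypothetical, and it assumes s3fs is installed with AWS credentials configured), reading back works the same way, and storage_options is passed through to fsspec:

import pandas as pd

bucket = 'my-example-bucket'  # hypothetical bucket
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Any serialisation function accepts a remote path
df.to_parquet(f's3://{bucket}/data.parquet')

# Reading is symmetric; storage_options is forwarded to fsspec/s3fs
df2 = pd.read_parquet(f's3://{bucket}/data.parquet',
                      storage_options={'anon': False})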

Fast Pandas DataFrame to Dictionary
pandas

Tabular data in Pandas is very flexible, but sometimes you just want a key-value store for fast lookups. Because Python is slow, while Pandas and Numpy often have fast C implementations under the hood, the way you do something can have a large impact on its speed. The fastest way I've found to convert a dataframe to a dictionary, mapping a key column to a value column, is:

df.set_index(keys)[value].to_dict()

The rest of this article will discuss how I used this to speed up a function by a factor of 20.
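
For instance, with hypothetical columns code and label:

import pandas as pd

df = pd.DataFrame({'code': ['a', 'b', 'c'], 'label': [1, 2, 3]})

# One vectorised pass, then a single conversion to a dict
lookup = df.set_index('code')['label'].to_dict()
# {'a': 1, 'b': 2, 'c': 3}

# Equivalent, but loops row by row in Python and is far slower:
# lookup = {row['code']: row['label'] for _, row in df.iterrows()}

After that, lookup['b'] runs at plain dictionary speed.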

Aggregating Quantiles with Pandas
python

One of my favourite tools in Pandas is agg for aggregation (it's a worse version of dplyr's summarise). Unfortunately it can be difficult to work with for custom aggregates, like the nth largest value. If your aggregate is parameterised, like quantile, you potentially have to define a function for every parameter you use. A neat trick is to use a class to capture the parameters, making it much easier to try out variations.
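
A sketch of the trick (the example data is made up): a callable class captures the quantile level, so each parameterisation is a new instance rather than a new function definition:

import pandas as pd

class Quantile:
    def __init__(self, q):
        self.q = q
        # agg labels the output column with __name__
        self.__name__ = f'q{int(q * 100)}'

    def __call__(self, series):
        return series.quantile(self.q)

df = pd.DataFrame({'g': ['a', 'a', 'b', 'b'],
                   'x': [1.0, 2.0, 3.0, 4.0]})
df.groupby('g')['x'].agg([Quantile(0.25), 'median', Quantile(0.75)])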

Flattening Nested Objects in Python
python

Sometimes I have a nested object of dictionaries and lists, frequently from a JSON object, that I need to deal with in Python. Often I want to load this into a Pandas dataframe, but accessing and mutating a dictionary column is a pain, with a whole bunch of expressions like .apply(lambda x: x[0]['a']['b']). A simple way to handle this is to flatten the objects before I put them into the dataframe, so I can access the fields directly.
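
A minimal sketch of such a flatten (the underscore key scheme is just my choice here):

def flatten(obj, prefix=''):
    # Recursively turn nested dicts and lists into one flat dict
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    result = {}
    for key, value in items:
        new_key = f'{prefix}_{key}' if prefix else str(key)
        result.update(flatten(value, new_key))
    return result

flatten({'x': [{'a': {'b': 1}}], 'c': 3})
# {'x_0_a_b': 1, 'c': 3}

Then pd.DataFrame([flatten(o) for o in records]) gives columns like x_0_a_b that can be accessed directly (records being whatever list of nested objects you start from).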

Decorating Pandas Tables
python

When looking at Pandas dataframes in a Jupyter notebook it can be hard to find what you're looking for in a big mess of numbers. Something that can help is formatting the numbers, making them shorter and using graphics to highlight points of interest. Using Pandas style you can make the story of your dataframe stand out in a Jupyter notebook, and even export the styling to Excel. The Pandas style documentation gives pretty clear examples of how to use it.
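
A small sketch with made-up numbers (background_gradient needs matplotlib, and the Excel export needs openpyxl):

import pandas as pd

df = pd.DataFrame({'revenue': [1200000, 340000, 8900000],
                   'growth': [0.12, -0.05, 0.31]})

styled = (df.style
            .format({'revenue': '{:,.0f}', 'growth': '{:+.0%}'})  # shorter numbers
            .bar(subset=['revenue'], color='#d0e0ff')             # in-cell bars
            .background_gradient(subset=['growth'], cmap='RdYlGn'))

styled                            # renders as a styled table in Jupyter
# styled.to_excel('report.xlsx')  # the styling carries over to Excel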

Second most common value with Pandas
python

I really like method chaining in Pandas. It reduces the risk of typos or errors from running assignments out of order. However, some things are really difficult to do with method chaining in Pandas; in particular, getting the second most common value of each group. This is much easier to do in R's dplyr, with its consistent and flexible syntax, than it is with Pandas.

Problem

For the table below, find the total frequency and the second most common value of y by frequency for each x (in the case of ties any second most common value will suffice).
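
A sketch of one chained approach (the data is made up, since the post's table isn't reproduced in this summary): sum the frequencies per (x, y) pair, rank the y values within each x, and keep the second-ranked one:

import pandas as pd

df = pd.DataFrame({'x': ['a', 'a', 'a', 'b', 'b'],
                   'y': ['p', 'p', 'q', 'r', 's'],
                   'frequency': [3, 2, 4, 1, 5]})

result = (
    df.groupby(['x', 'y'], as_index=False)['frequency'].sum()
      # total frequency per x, and each y's rank within its x
      .assign(total=lambda d: d.groupby('x')['frequency'].transform('sum'),
              rank=lambda d: d.groupby('x')['frequency']
                              .rank(method='first', ascending=False))
      .query('rank == 2')  # method='first' breaks ties arbitrarily
      [['x', 'total', 'y']]
      .rename(columns={'y': 'second_most_common'})
)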