Writing Pandas Dataframes to S3

Tags: pandas, python

Published: May 28, 2021

To write a Pandas (or Dask) dataframe to Amazon S3 or Google Cloud Storage, all you need to do is pass an S3 or GCS path to a serialisation function, e.g.

# df is a pandas dataframe
df.to_csv(f's3://{bucket}/{key}')

Under the hood Pandas uses fsspec, which makes it easy to work with remote filesystems and abstracts over s3fs for Amazon S3 and gcsfs for Google Cloud Storage (as well as other backends such as (S)FTP, SSH or HDFS). In particular, s3fs is very handy for doing simple file operations in S3, because boto is often subtly complex to use.
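
For instance, here is a minimal sketch of what simple file operations with s3fs look like (the bucket and key are placeholders):

# a minimal sketch of using s3fs directly; bucket/key names are placeholders
import s3fs

fs = s3fs.S3FileSystem()           # picks up credentials the same way botocore does
fs.ls(f'{bucket}/some/prefix')     # list objects under a prefix
with fs.open(f'{bucket}/{key}', 'rb') as f:
    head = f.read(100)             # read the first 100 bytes of an object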

Managing access credentials can sometimes be difficult. s3fs uses botocore credentials, trying environment variables first, then configuration files, then IAM metadata. But you can also specify an AWS profile manually, and pass it (along with other arguments) through Pandas using the storage_options keyword argument:

# df is a pandas dataframe
df.to_parquet(f's3://{bucket}/{key}', storage_options={'profile': aws_profile})
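
The same storage_options argument works when reading the data back (in Pandas 1.2 and later), for example:

# read the parquet file back using the same AWS profile
import pandas as pd

df = pd.read_parquet(f's3://{bucket}/{key}', storage_options={'profile': aws_profile})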

One useful alternative is to create AWS Athena tables over the data written to S3, so you can query it with SQL. The fastest way to do this is with AWS Data Wrangler, although PyAthena is also a good option.
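
As a rough sketch (the Glue database 'analytics' and table 'events' below are made-up names, and assume the Glue database already exists), AWS Data Wrangler can write the dataframe to S3 as a dataset, register it in the Glue catalogue, and then query it back through Athena:

# a sketch using AWS Data Wrangler; database and table names are assumptions
import awswrangler as wr

wr.s3.to_parquet(
    df=df,
    path=f's3://{bucket}/tables/events/',
    dataset=True,            # write as a dataset so it can be catalogued
    database='analytics',    # existing Glue database (assumed)
    table='events',          # table to register in the Glue catalogue
)

# query the table with Athena and get the result back as a dataframe
result = wr.athena.read_sql_query('SELECT COUNT(*) FROM events', database='analytics')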