Copy Pandas DataFrames and HDF5 files to PostgreSQL database
This package is part of Harvard Growth Lab’s portfolio of software packages, digital products and interactive data visualizations. To browse our entire portfolio, please visit growthlab.app. To learn more about our research, please visit Harvard Growth Lab’s home page.
Pandas-to-postgres allows you to bulk load the contents of large dataframes into Postgres as quickly as possible. The main differences from pandas' `to_sql` function are:

- Uses `COPY` combined with `to_csv` instead of `execute` / `executemany`, which runs much faster for large volumes of data
- Uses `COPY FROM STDIN` with `StringIO` to avoid IO overhead to intermediate files. This matters in particular for data stored in unusual formats like HDF, STATA, or Parquet, which are common in the scientific world.

Anecdotally, we use this to load approximately 640 million rows of data from a 7.1GB HDF file (zlib compressed), 75% of it spread across 3 of 23 tables, with a mean number of columns of 6. We load this into an m4.xlarge RDS instance running Postgres 10.3 in 54 minutes (approximately 10-15 minutes of which is recreating indexes), using 4 threads.
```python
from pandas_to_postgres import (
    DataFrameCopy,
    hdf_to_postgres,
)

# already loaded DataFrame & SQLAlchemy Table model
table_model = db.metadata.tables['my_awesome_table']

with db.engine.connect() as c:
    DataFrameCopy(df, conn=c, table_obj=table_model).copy()

# HDF from file
hdf_to_postgres('./data.h5', engine_args=["psycopg://..."])

# Parallel HDF from file
hdf_to_postgres('./data.h5', engine_args=["psycopg://..."], processes=4)
```
We also tried `COPY WITH BINARY` to remove the pandas `to_csv` bottleneck, but it didn't provide as good an improvement for us.