Compare tables within or across databases
A data diff is the value-level comparison between two tables, used to identify critical changes to your data and guarantee data quality.
There is a lot you can do with data-diff: you can test SQL code by comparing development or staging environment data to production, or compare source and target data to identify discrepancies when moving data between databases.
data-diff is an open-source utility for running stateless diffs, built as a great single-player experience.
Scale up with Datafold Cloud to make data diffing a company-wide experience: supercharge your data-diff CLI workflow (ex: data-diff --dbt --cloud), run diffs in your CI process, and run them manually within the Datafold UI. Datafold Cloud also includes column-level lineage with BI tool integrations, CI testing, faster cross-database diffing, and diff history.
When developing SQL code, data-diff helps you validate and preview changes by comparing data between development/staging environments and production. Here's how it works:
data-diff is a powerful tool for comparing data when you're moving it between systems. Use it to ensure data accuracy and identify discrepancies during tasks like:
data-diff integrates with dbt Core to seamlessly compare local development to production datasets.
Learn more about how data-diff works with dbt:
Looking to use data-diff in dbt development?
Development testing with Datafold enables you to see the impact of dbt code changes on data as you write the code, whether in your IDE or CLI.
Head over to our data-diff + dbt documentation to get started with a development testing workflow!
Install data-diff with adapters
To compare data between databases, install data-diff with specific database adapters. For example, install it for PostgreSQL and Snowflake like this:
pip install data-diff 'data-diff[postgresql,snowflake]' -U
Additionally, you can install all supported open-source database adapters as follows:
pip install data-diff 'data-diff[all-dbs]' -U
Run data-diff with connection URIs
Then, we compare tables between PostgreSQL and Snowflake using the hashdiff algorithm:
data-diff \
postgresql://<username>:'<password>'@localhost:5432/<database> \
<table> \
"snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
<TABLE> \
-k <primary key column> \
-c <columns to compare> \
-w <filter condition>
You can use a toml configuration file to run your data-diff job. In this example, we compare tables between MotherDuck (hosted DuckDB) and Snowflake using the hashdiff algorithm:
## DATABASE CONNECTION ##
[database.duckdb_connection]
driver = "duckdb"
# filepath = "datafold_demo.duckdb" # local duckdb file example
# filepath = "md:" # default motherduck connection example
filepath = "md:datafold_demo?motherduck_token=${motherduck_token}" # API token recommended for motherduck connection
[database.snowflake_connection]
driver = "snowflake"
database = "DEV"
user = "sung"
password = "${SNOWFLAKE_PASSWORD}" # or "<PASSWORD_STRING>"
# the info below is only required for snowflake
account = "${ACCOUNT}" # by33919
schema = "DEVELOPMENT"
warehouse = "DEMO"
role = "DEMO_ROLE"
## RUN PARAMETERS ##
[run.default]
verbose = true
## EXAMPLE DATA DIFF JOB ##
[run.demo_xdb_diff]
# Source 1 ("left")
1.database = "duckdb_connection"
1.table = "development.raw_orders"
# Source 2 ("right")
2.database = "snowflake_connection"
2.table = "RAW_ORDERS" # note that snowflake table names are case-sensitive
verbose = false
Run your data-diff job
Make sure to export relevant environment variables as needed. For example, we compare data based on the earlier configuration:
# export relevant environment variables, example below
export motherduck_token=<MOTHERDUCK_TOKEN>
# run the configured data-diff job
data-diff --conf datadiff.toml \
--run demo_xdb_diff \
-k "id" \
-c status
# output example
- 1, completed
+ 1, returned
After running your data-diff job, review the output to identify and analyze differences in your data.
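In the output, lines prefixed with `-` show a row as it appears in the first table, and `+` lines show the corresponding row in the second. As an illustration only (this `parse_diff` helper is hypothetical, not part of data-diff), splitting such output into its two sides could look like:

```python
# Hypothetical helper for splitting data-diff style output into the rows
# reported for each side ("-" = first table, "+" = second table).
def parse_diff(output):
    removed, added = [], []
    for line in output.splitlines():
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
    return removed, added

example = "- 1, completed\n+ 1, returned"
print(parse_diff(example))  # -> (['1, completed'], ['1, returned'])
```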
Check out the documentation for the full command reference.
Database | Status | Connection string |
---|---|---|
PostgreSQL >=10 | 🟢 | postgresql://<user>:<password>@<host>:5432/<database> |
MySQL | 🟢 | mysql://<user>:<password>@<hostname>:3306/<database> |
Snowflake | 🟢 | "snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]" |
BigQuery | 🟢 | bigquery://<project>/<dataset> |
Redshift | 🟢 | redshift://<username>:<password>@<hostname>:5439/<database> |
DuckDB | 🟢 | duckdb://<filepath> |
MotherDuck | 🟢 | duckdb://<filepath> |
Microsoft SQL Server* | 🟢 | mssql://<user>:<password>@<host>/<database>/<schema> |
Oracle | 🟡 | oracle://<username>:<password>@<hostname>/service_or_sid |
Presto | 🟡 | presto://<username>:<password>@<hostname>:8080/<database> |
Databricks | 🟡 | databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema> |
Trino | 🟡 | trino://<username>:<password>@<hostname>:8080/<database> |
ClickHouse | 🟡 | clickhouse://<username>:<password>@<hostname>:9000/<database> |
Vertica | 🟡 | vertica://<username>:<password>@<hostname>:5433/<database> |
*MS SQL Server support is limited, with known performance issues that are addressed in Datafold Cloud.
Your database not listed here?
data-diff efficiently compares data using two modes:
joindiff: Ideal for comparing data within the same database, utilizing outer joins for efficient row comparisons. It relies on the database engine for computation and has consistent performance.
hashdiff: Recommended for comparing datasets across different databases or large tables with minimal differences. It uses hashing and binary search, capable of diffing data across distinct database engines.
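To make the hashdiff idea concrete, here is a minimal, self-contained sketch: hash whole key segments, and only split a segment, binary-search style, when its checksums disagree. This is an illustration of the general technique, not data-diff's actual implementation.

```python
# Sketch of the hashdiff technique (illustrative only, not data-diff's code):
# compare one checksum per key segment, and recurse only into segments whose
# checksums differ, so identical data costs a handful of hash comparisons.
import hashlib

def segment_hash(rows):
    """Hash an ordered list of (key, value) rows into one digest."""
    h = hashlib.md5()
    for key, value in rows:
        h.update(f"{key}:{value};".encode())
    return h.hexdigest()

def hashdiff(a, b, lo, hi, threshold=4):
    """Return keys whose rows differ between dicts a and b for keys in [lo, hi)."""
    rows_a = [(k, a[k]) for k in sorted(a) if lo <= k < hi]
    rows_b = [(k, b[k]) for k in sorted(b) if lo <= k < hi]
    if segment_hash(rows_a) == segment_hash(rows_b):
        return []                        # identical segment: one hash comparison
    if hi - lo <= threshold:             # small segment: fetch rows and compare
        keys = {k for k, _ in rows_a} | {k for k, _ in rows_b}
        return sorted(k for k in keys if a.get(k) != b.get(k))
    mid = (lo + hi) // 2                 # mismatch: split the key range and recurse
    return hashdiff(a, b, lo, mid, threshold) + hashdiff(a, b, mid, hi, threshold)

source = {i: "completed" for i in range(1, 101)}
target = {**source, 1: "returned"}       # one changed row, as in the example above
print(hashdiff(source, target, 1, 101))  # -> [1]
```

In the real tool, the per-segment checksums are computed inside each database, so rows only have to leave the warehouse for segments that actually differ.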
For detailed algorithm and performance insights on joindiff and hashdiff, explore here, or head to our docs to learn more about how Datafold diffs data.
We thank everyone who contributed so far!
We'd love to see your face here: Contributing Instructions
This project is licensed under the terms of the MIT License.