# Automated data quality suggestions and analysis with Deequ on AWS Glue
Read our AWS Big Data Blog for an in-depth look at this solution.
Deequ is an open source library built on top of Apache Spark for defining "unit tests for data". It is used internally at Amazon for verifying the quality of large production datasets, particularly to calculate data quality metrics, define and verify data quality constraints, and be informed about changes in the data distribution.
More details on Deequ can be found in this AWS Blog.
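Deequ's real API is Scala on Apache Spark, so the following is only a plain-Python illustration of what a "unit test for data" means conceptually: compute column-level metrics (such as Deequ's Completeness) and turn constraints on them into pass/fail assertions. The dataset and function names here are ours, not Deequ's.

```python
# Illustration only: Deequ itself runs on Spark; this sketch mimics the idea
# of metrics + constraints on a tiny in-memory dataset.
rows = [
    {"id": 1, "product": "thingA", "price": 10.0},
    {"id": 2, "product": "thingB", "price": None},
    {"id": 3, "product": "thingC", "price": 7.5},
]

def completeness(rows, column):
    """Fraction of rows where `column` is non-null (akin to Deequ's Completeness metric)."""
    return sum(r[column] is not None for r in rows) / len(rows)

def is_unique(rows, column):
    """True if every value in `column` is distinct (akin to Deequ's isUnique check)."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

# "Unit tests for data": constraints evaluate to pass/fail like test assertions.
checks = {
    "id is unique": is_unique(rows, "id"),
    "product is complete": completeness(rows, "product") == 1.0,
    "price is complete": completeness(rows, "price") == 1.0,  # fails: one null
}
print(checks)
```

Deequ runs the same kind of checks, but distributed over Spark DataFrames so they scale to billions of rows.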
A serverless data quality framework based on Deequ and running on AWS Glue is showcased in this repository. It takes a database and tables in the AWS Glue Catalog as inputs and outputs various data quality metrics into S3. Additionally, it performs an automatic generation of constraints on previously unseen data. The suggestions are stored in DynamoDB tables and can be reviewed and amended at any point by data owners in a UI. All constraints are disabled by default. Once enabled, they are used by the Glue jobs to carry out the data quality checks on the tables.
To deploy the infrastructure and code, you'll need an AWS account and a correctly configured AWS profile with enough permissions to create the architecture below; Administrator rights are recommended.
```sh
cd ./src
./deploy.sh -p <aws_profile> -r <aws_region>
```
All arguments to the `deploy.sh` script are optional. The default AWS profile and region are used if none are provided.
The script will deploy a CloudFormation stack named `amazon-deequ-glue` holding all the infrastructure listed below. The initial deployment can take 10-15 minutes. The same command can be used for both creating and updating the infrastructure.
If you choose NOT to implement an AWS Amplify frontend, then add a `-d` flag to the deploy script:

```sh
cd ./src
./deploy.sh -p <aws_profile> -r <aws_region> -d
```
A Glue job suggests constraints on previously unseen data and writes them into the `DataQualitySuggestion` DynamoDB table. It also outputs the quality check results based on these suggestions into S3. The suggestions can be reviewed and amended at any point by data owners.

A second Glue job reads from both the `DataQualitySuggestion` and `DataQualityAnalysis` DynamoDB tables and runs 1) a constraint verification and 2) an analysis metrics computation, which it outputs in Parquet format to S3.

The `data-quality-crawler` then crawls the metrics in the data quality S3 bucket. They are stored in a `data_quality_db` database in the AWS Glue Catalog and are immediately available to be queried in Athena.

We assume you have a Glue database hosting one or more tables in the same region where you deployed this framework.
In the AWS Step Functions console, find the state machine named `data-quality-sm` and start an execution, inputting a JSON like the below:
```json
{
    "glueDatabase": "my_database",
    "glueTables": "table1,table2"
}
```
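Note that `glueTables` is a single comma-separated string, not a JSON array. As a sketch, you could build this input programmatically and start the execution with boto3 (the state machine ARN below is a placeholder; you would look up the real one in your account):

```python
import json

def build_execution_input(database, tables):
    """Build the state machine input; glueTables must be a comma-separated string."""
    return json.dumps({"glueDatabase": database, "glueTables": ",".join(tables)})

payload = build_execution_input("my_database", ["table1", "table2"])
print(payload)

# With AWS credentials configured, the execution could then be started like so
# (the ARN is an assumed placeholder):
#   import boto3
#   sfn = boto3.client("stepfunctions")
#   sfn.start_execution(
#       stateMachineArn="arn:aws:states:<region>:<account_id>:stateMachine:data-quality-sm",
#       input=payload,
#   )
```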
The data quality process described in the previous section begins, and you can follow it by looking at the execution runs of the different Glue jobs. By the end of this process, you should see that data quality suggestions were logged in the `DataQualitySuggestion` DynamoDB table and that Glue tables were created in the `data_quality_db` Glue database, which can be queried in Athena.

To review the suggestions, navigate to the AWS Amplify console and open the `deequ-constraints` app. Then click on the highlighted URL (listed as `https://<env>.<appsync_app_id>.amplifyapp.com`) to open the data quality constraints web app. After completing the registration process (i.e. Create Account) and signing in, a UI similar to the below is visible:
It lists the data quality suggestions produced by the Glue job in the previous step. Data owners can add/remove and enable/disable these constraints at any point in this UI. Notice how the `Enabled` field is set to `N` by default for all suggestions. This ensures all constraints are human-reviewed before they are processed. Click on the checkbox button to enable a constraint.

Now navigate to the `Analyzers` tab. These constraints are used by Deequ to calculate column-level statistics on the dataset (e.g. CountDistinct, DataType, Completeness…), called metrics (refer to the Data Analysis section of this blog for more details). Here is an example of an analysis constraint entry:
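As a stand-in sketch of what such an entry might contain, the item below pairs a table column with the Deequ analyzer to compute on it. All attribute names and the key format here are illustrative assumptions, not the deployed table's actual schema; check the `DataQualityAnalysis` DynamoDB table for the real shape.

```python
# Hypothetical analysis constraint item (attribute names are assumptions for
# illustration only).
analysis_entry = {
    "id": "my_database-table1-price-completeness",  # assumed key format
    "database": "my_database",
    "table": "table1",
    "column": "price",
    "analyzerCode": "Completeness",  # the Deequ analyzer to run on this column
    "enable": "Y",                   # like suggestions, disabled entries are skipped
}
print(analysis_entry["analyzerCode"])
```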
An exhaustive list of suggestion and analysis constraints can be found in the docs.
Translate Scala scripts to Python using python-deequ
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.