How DevOps principles can be applied to a Data Pipeline Solution built with Azure Databricks, Data Factory and ADLS Gen2. Moved to: https://github.com/Azure-Samples/modern-data-warehouse-dataops

We've moved!

This repository has moved under the official Azure Samples Github organization: https://github.com/Azure-Samples/modern-data-warehouse-dataops


DataDevOps

The purpose of this repository is to demonstrate how DevOps principles can be applied to a Data Pipeline Solution.

Architecture

The following shows the overall architecture of the solution.

[Architecture diagram]

Design Considerations

  • Data Transformation logic belongs in packages, not Notebooks
    • All main data transformation code should be packaged up within a Python package/JAR/etc. These packages are then uploaded to DBFS and installed on a specifically configured cluster, along with all other third-party dependencies (e.g. the azure-cosmosdb-spark JAR). Notebooks then simply import the package(s) and call any relevant functions (see the sketch after this list). Effectively, Notebooks become a lightweight wrapper around the packages. This ensures separation of concerns and promotes code reuse, testability, and code quality.
  • Data should be tested
    • Two different tests should be performed:
      • Structure (Is the data in the expected shape / schema?)
      • Content (Are there unexpected nulls? Are the summary statistics in expected ranges?)
  • Data should have lineage
    • Just as application deployments should have lineage in order to track which code commit produced which artifacts and deployments, each final loaded data record should be tagged with the appropriate ETL pipeline run id (also illustrated in the sketch after this list). Not only does this ensure traceability, it also helps with recovery from any potential failed / half-run data loads.
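To make the packaging and lineage points concrete, the sketch below shows an illustrative package function and the thin notebook wrapper that calls it. The module, function and column names (ddo_transform.transform, standardize_parking_data, load_id) are assumptions for illustration, not the repository's actual API.

    # ddo_transform/transform.py -- illustrative package module (names are assumptions)
    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    def standardize_parking_data(df: DataFrame, load_id: str) -> DataFrame:
        """Clean a raw parking DataFrame and tag each record with the ETL pipeline run id."""
        return (
            df.dropDuplicates()
              .withColumn("load_id", F.lit(load_id))           # lineage: which pipeline run produced this row
              .withColumn("loaded_on", F.current_timestamp())  # when the record was loaded
        )

    # Notebook cell -- a lightweight wrapper around the installed package
    # (spark and dbutils are provided by the Databricks notebook runtime)
    import ddo_transform.transform as t

    df_raw = spark.read.parquet("/mnt/datalake/data/lnd/parking")
    df_clean = t.standardize_parking_data(df_raw, load_id=dbutils.widgets.get("loadid"))
    df_clean.write.mode("overwrite").saveAsTable("interim_parking")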

Build and Release Pipeline

The following shows the overall CI/CD process end to end.

[CI/CD process diagram]

Both Build and Release Pipelines are built using Azure DevOps (public instance) and can be viewed using the following links:

More information here.

Environments

  • Dev - Development collaboration branch
  • QA - Environment where all integration tests are run (not yet implemented)
  • Staging/UAT - A mirror of the production job, along with state and data. Deploying to staging first gives the ability to "mock" a realistic release into production.
  • Production

In addition to these environments, each developer may choose to have their own development environment for individual use.

Testing

  • Unit Testing - Standard unit tests which test small pieces of functionality within your code. Data transformation code should have unit tests (see the sketch after this list).

  • Integration Testing - This includes end-to-end testing of the ETL pipeline.

  • Data Testing

    1. Structure - Test for correct schema, expected structure.
    2. Content - Can be tested through quantitative summary statistics and qualitative data quality graphs within the notebook.
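For illustration, pytest-style tests covering the categories above might look like the sketch below. It assumes the hypothetical standardize_parking_data function from the design-considerations sketch and a local Spark session; it is not the repository's actual test suite.

    # tests/test_transform.py -- illustrative pytest sketch (not the repo's actual tests)
    import pytest
    from pyspark.sql import SparkSession

    from ddo_transform import transform as t  # hypothetical package module

    @pytest.fixture(scope="session")
    def spark():
        # Local Spark session, sufficient for small schema/content checks
        return SparkSession.builder.master("local[1]").appName("ddo-tests").getOrCreate()

    def test_output_has_expected_schema(spark):
        # Structure test: output contains the expected columns
        df = spark.createDataFrame([(1, "sensor-a")], ["bay_id", "device_id"])
        result = t.standardize_parking_data(df, load_id="test-run")
        assert {"bay_id", "device_id", "load_id", "loaded_on"} <= set(result.columns)

    def test_load_id_is_never_null(spark):
        # Content test: the lineage column is always populated
        df = spark.createDataFrame([(1, "sensor-a"), (2, "sensor-b")], ["bay_id", "device_id"])
        result = t.standardize_parking_data(df, load_id="test-run")
        assert result.filter("load_id IS NULL").count() == 0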

Monitoring

Databricks

Data Factory

Deploy the solution

Pre-requisites:

  1. Github Account
  2. Azure DevOps Account + Project
  3. Azure Account

Software pre-requisites:

  1. For Windows users, Windows Subsystem For Linux
  2. az cli 2.x
  3. Python 3+
  4. databricks-cli
  5. jq

NOTE: This deployment was tested using WSL (Ubuntu 16.04) and Debian GNU/Linux 9.9 (stretch)

Deployment Instructions

  1. Fork this repository. Forking is necessary if you want to set up git integration with Azure Data Factory.

  2. Deploy Azure resources.

    1. Clone the forked repository and cd into the root of the repo
    2. Run ./deploy.sh.
      • This will deploy three Resource Groups (one per environment) each with the following Azure resources.
        • Data Factory (empty) - next steps will deploy actual data pipelines.
        • Data Lake Store Gen2 and Service Principal with Storage Contributor rights assigned.
        • Databricks workspace - notebooks uploaded, SparkSQL tables created, and ADLS Gen2 mounted using SP.
        • KeyVault with all secrets stored.
      • This will create local .env.{environment_name} files containing essential configuration information.
      • All Azure resources are tagged with the correct environment.
      • IMPORTANT: Because Databricks PAT tokens cannot currently be generated automatically, you will be prompted to generate and enter one per environment (a quick verification sketch follows these instructions). See here for more information.
      • The solution is designed such that all initial environment deployment configuration is specified in the arm.parameters files, in order to centralize configuration.
  3. Setup ADF git integration in DEV Data Factory

    1. In the Azure Portal, navigate to the Data Factory in the DEV environment.
    2. Click "Author & Monitor" to launch the Data Factory portal.
    3. On the landing page, select "Set up code repository". For more information, see here.
    4. Fill in the repository settings with the following:
      • Repository type: Github
      • Github Account: your_Github_account
      • Git repository name: forked Github repository
      • Collaboration branch: master
      • Root folder: /adf
      • Import existing Data Factory resources to repository: Unselected
    5. Navigate to the "Author" tab; you should see all the pipelines deployed.
    6. Select Connections > Ls_KeyVault. Update the Base Url to the KeyVault Url of your DEV environment.
    7. Select Connections > Ls_AdlsGen2_01. Update URL to the ADLS Gen2 Url of your DEV environment.
    8. Click Publish to publish changes.
  4. Setup Build Pipelines. You will create two build pipelines: the first triggers on every pull request and runs unit testing and linting; the second triggers on every commit to master and creates the actual build artifacts for release.

    1. In Azure DevOps, navigate to Pipelines. Select "Create Pipeline".
    2. Under "Where is your code?", select Github (YAML).
      • If you have not already done so, you may be prompted to connect your Github account. See here for more information.
    3. Under "Select a repository", select your forked repo.
    4. Under "Configure your pipeline", select "Existing Azure Pipelines YAML file".
      • Branch: master
      • Path: /src/ddo_transform/azure-pipelines-ci-qa.yaml
    5. Select Run.
    6. Repeat steps 1-4, but select /src/ddo_transform/azure-pipelines-ci-artifacts as the path.
  5. Setup Release Pipelines

    WORK IN PROGRESS

    1. In Azure DevOps, navigate to Release. Select "New pipeline".
    2. Under "Select a template", select "Empty job".
    3. Under "Stage", set Stage name to "Deploy to STG".
    4. Under Agent job, fill in information as shown:

    [Screenshot: Agent job configuration (Release_1_AgentJob)]

    5. Add a step to the Agent job by selecting the "+" icon.
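Once a Databricks PAT has been supplied (see the note in step 2), it can be sanity-checked against the Databricks REST API. A minimal sketch, assuming the workspace URL and token are exposed as DATABRICKS_HOST and DATABRICKS_TOKEN environment variables (the variable names are assumptions):

    # check_pat.py -- sanity-check a Databricks PAT (a sketch; variable names are assumptions)
    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]    # e.g. https://<region>.azuredatabricks.net
    token = os.environ["DATABRICKS_TOKEN"]  # the PAT generated in the workspace UI

    # List clusters; a 200 response means the workspace accepts the token.
    resp = requests.get(
        f"{host}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    print([c["cluster_name"] for c in resp.json().get("clusters", [])])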

Known Issues, Limitations and Workarounds

  • Currently, ADLS Gen2 cannot be managed via the az cli 2.0.
    • Workaround: Use the REST API to automate creation of the File System (see the sketch after this list).
  • Databricks KeyVault-backed secret scopes can only be created via the UI; because they cannot be created programmatically, they were not incorporated into the automated deployment of the solution.
    • Workaround: Use normal Databricks secrets with the downside of duplicated information.
  • Databricks Personal Access Tokens can only be created via the UI.
    • Workaround: User is asked to supply the tokens during deployment, which is unfortunately cumbersome.
  • The Data Factory Databricks Linked Service does not support dynamic configuration, requiring a manual step to point it to the new cluster when deploying the pipeline to a new environment.
    • Workaround: An alternative is to use an on-demand cluster; however, this may introduce latency due to cluster spin-up time. Optionally, the user can manually update the Linked Service to point to the correct cluster.
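For the first limitation above, the following is a minimal sketch of creating the file system through the ADLS Gen2 REST API. The storage account variable, token acquisition and file system name are assumptions; deploy.sh has its own implementation.

    # create_filesystem.py -- create an ADLS Gen2 file system via the Storage REST API (a sketch)
    import os
    import requests

    account = os.environ["STORAGE_ACCOUNT"]  # hypothetical variable: storage account name
    token = os.environ["STORAGE_TOKEN"]      # AAD token for https://storage.azure.com/, e.g. from
                                             # `az account get-access-token --resource https://storage.azure.com/`

    resp = requests.put(
        f"https://{account}.dfs.core.windows.net/datalake",  # "datalake" matches the layout in the Data section
        params={"resource": "filesystem"},
        headers={"Authorization": f"Bearer {token}", "x-ms-version": "2018-11-09"},
    )
    resp.raise_for_status()  # expect 201 Created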

Data

Physical layout

ADLS Gen2 is structured as follows:

datalake                    <- filesystem
    /libs                   <- contains all libs, jars, wheels needed for processing
    /data
        /lnd                <- landing folder where all data files are ingested into.
        /interim            <- interim (cleansed) tables
        /dw                 <- final tables 
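A notebook can mount this file system using the service principal and secrets provisioned by the deployment. The sketch below is illustrative: the secret scope and key names are assumptions, and spark/dbutils are only available inside a Databricks notebook.

    # Databricks notebook cell -- mount the "datalake" file system (scope/key names are assumptions)
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": dbutils.secrets.get("storage", "sp-client-id"),
        "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("storage", "sp-client-secret"),
        "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    dbutils.fs.mount(
        source="abfss://datalake@<storage-account>.dfs.core.windows.net/",
        mount_point="/mnt/datalake",
        extra_configs=configs,
    )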

All data procured here: https://www.melbourne.vic.gov.au/about-council/governance-transparency/open-data/Pages/on-street-parking-data.aspx
