# Complete end-to-end sample of doing DevOps with Azure Databricks
This is based on working with lots of customers who have requested a documented approach they can reference. The included code uses a KeyVault for each environment and Azure AD authorization tokens to call the Databricks REST API.

This will show you how to deploy your Databricks assets via GitHub Actions and Azure DevOps Pipelines so that your Notebooks, Clusters, Jobs and Init Scripts are automatically deployed and configured per environment.
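Since both pipelines authenticate this way, here is a minimal sketch of acquiring an Azure AD token for the Databricks resource and calling the REST API with it. It assumes a service principal with access to the workspace; the ids, secret and workspace URL are placeholders (`2ff814a6-3304-4ab8-85cb-cd0e6f879c1d` is the well-known application id of the AzureDatabricks resource):

```bash
# Sketch: acquire an Azure AD token for the AzureDatabricks resource
# (2ff814a6-3304-4ab8-85cb-cd0e6f879c1d is its well-known application id)
# and call the Databricks REST API. Ids, secret and URL are placeholders.
TENANT_ID="00000000-0000-0000-0000-000000000000"
CLIENT_ID="00000000-0000-0000-0000-000000000000"
CLIENT_SECRET="replace-me"

DB_TOKEN=$(curl -s -X POST \
  "https://login.microsoftonline.com/${TENANT_ID}/oauth2/token" \
  --data-urlencode "grant_type=client_credentials" \
  --data-urlencode "client_id=${CLIENT_ID}" \
  --data-urlencode "client_secret=${CLIENT_SECRET}" \
  --data-urlencode "resource=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d" \
  | jq -r '.access_token')

# List the clusters in a workspace (the region/URL is a placeholder)
curl -s -H "Authorization: Bearer ${DB_TOKEN}" \
  "https://eastus2.azuredatabricks.net/api/2.0/clusters/list"
```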
| Description | URL |
|---|---|
| Overview of the Databricks DevOps solution | Right click here and Open Link in New Tab |
| Setting up a Service Principal in Azure and then configuring for GitHub Action and Azure DevOps Pipeline | Right click here and Open Link in New Tab |
| Initialize Azure with KeyVault and Secrets using the GitHub Action | Right click here and Open Link in New Tab |
| Deploy your Databricks artifacts using the GitHub Action | Right click here and Open Link in New Tab |
| Detailed Deployment Review of GitHub Action and Git Setup | Right click here and Open Link in New Tab |
| Configuring GitHub Integration with Databricks | Right click here and Open Link in New Tab |
- Clone this repo to your GitHub
- Click on Settings | Secrets and create a secret named `AZURE_CREDENTIALS`:
```json
{
  "clientId": "REPLACE:00000000-0000-0000-0000-000000000000",
  "clientSecret": "REPLACE: YOUR PASSWORD/SECRET",
  "subscriptionId": "REPLACE:00000000-0000-0000-0000-000000000000",
  "tenantId": "REPLACE:00000000-0000-0000-0000-000000000000",
  "activeDirectoryEndpointUrl": "https://login.microsoftonline.com",
  "resourceManagerEndpointUrl": "https://management.azure.com/",
  "activeDirectoryGraphResourceId": "https://graph.windows.net/",
  "sqlManagementEndpointUrl": "https://management.core.windows.net:8443/",
  "galleryEndpointUrl": "https://gallery.azure.com/",
  "managementEndpointUrl": "https://management.core.windows.net/"
}
```
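If you still need a service principal, the Azure CLI can create one and print credentials JSON in exactly this shape (a sketch; the name, role and subscription id below are placeholders):

```bash
# Creates a service principal scoped to your subscription and emits JSON
# matching the AZURE_CREDENTIALS secret format. Name/scope are placeholders.
az ad sp create-for-rbac \
  --name "Databricks-DevOps-SP" \
  --role Contributor \
  --scopes "/subscriptions/00000000-0000-0000-0000-000000000000" \
  --sdk-auth
```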
NOTE: When you click on Actions you will not see the action if you forked/cloned this repo. I have requested the ability to "import" an existing action. For now you need to create a new (blank) action and then copy the YAML from the `.github/workflows/pipeline.yml` file.
- Click on Actions
- Click Databricks-CI-CD
- Click Run workflow
Fill in the fields:

- Notebook folder in this repo: `notebooks/MyProject`
- Workspace folder to deploy notebooks to: `/MyProject`
- Resource group name: `Databricks-MyProject` (NOTE: "-Dev" will be appended)
- Azure region: `EastUS2`
- Databricks workspace name: `Databricks-MyProject`
- KeyVault name: `KeyVault-MyProject` (NOTE: you need to put a 1 or 2, etc. on the end of this to make it globally unique)
- Azure subscription id: `00000000-0000-0000-0000-000000000000`
- Mode: `Initialize KeyVault`
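If you prefer the command line, the same workflow can be triggered with the GitHub CLI (a sketch; the input name below is hypothetical and must match what `.github/workflows/pipeline.yml` declares):

```bash
# Trigger the Databricks-CI-CD workflow without the browser.
# The -f input name/value is a hypothetical placeholder.
gh workflow run Databricks-CI-CD -f mode="Initialize KeyVault"
```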
- The pipeline will create 3 Azure resource groups
- The pipeline will create 3 Databricks workspaces
- The pipeline will create 3 Azure KeyVaults (you can use your own KeyVault; see later in this document)
In the Azure Portal, open each of the KeyVaults and grant your service principal access:

- Click Add Access Policy
- Configure from template: Secret Management
- Key permissions: 0 Selected
- Secret permissions: 2 Selected (select just Get and List)
- Certificate permissions: 0 Selected

Then add the secrets the pipeline downloads (the exact secret names are defined in the pipeline YAML):

- Tenant id: `00000000-0000-0000-0000-000000000000`
- Client id: `00000000-0000-0000-0000-000000000000`
- Subscription id: `00000000-0000-0000-0000-000000000000`
- Client secret: `some crazy string`
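The same grant-and-set steps can be scripted with the Azure CLI (a sketch; the vault name, service principal id and secret names are illustrative assumptions):

```bash
# Grant the service principal Get/List on secrets, then store the values
# the pipeline reads. All names/ids below are placeholder assumptions.
az keyvault set-policy \
  --name "KeyVault-MyProject-Dev" \
  --spn "00000000-0000-0000-0000-000000000000" \
  --secret-permissions get list

az keyvault secret set --vault-name "KeyVault-MyProject-Dev" \
  --name "tenant-id" --value "00000000-0000-0000-0000-000000000000"
az keyvault secret set --vault-name "KeyVault-MyProject-Dev" \
  --name "client-secret" --value "some crazy string"
```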
Re-run the pipeline, this time with:

- Azure subscription id: `00000000-0000-0000-0000-000000000000`
- Mode: `Databricks` (this is the default, so you really do not need to select it)

To deploy via an Azure DevOps Pipeline instead:

- Create an Azure DevOps project
- Click on the Repos icon on the left menu
- Import this repository: https://github.com/AdamPaternostro/Azure-Databricks-Dev-Ops
- Click on the Pipelines icon on the left menu
- Click on the Gear icon (bottom left)
- Click on the Pipelines icon on the left menu
Fill in the fields:

- Notebook folder in this repo: `notebooks/MyProject`
- Workspace folder to deploy notebooks to: `/MyProject`
- Resource group name: `Databricks-MyProject` (NOTE: "-Dev" will be appended)
- Azure region: `EastUS2`
- Databricks workspace name: `Databricks-MyProject`
- KeyVault name: `KeyVault-MyProject` (NOTE: you need to put a 1 or 2, etc. on the end of this to make it globally unique)
- Azure subscription id: `00000000-0000-0000-0000-000000000000`
- Service connection: `DatabricksDevOpsConnection`
- Mode: `Initialize KeyVault`
The first time the pipeline runs it will create your Databricks workspace and KeyVault. It will skip all the other steps!
As with the GitHub Action, grant your service principal access to each KeyVault in the Azure Portal:

- Click Add Access Policy
- Configure from template: Secret Management
- Key permissions: 0 Selected
- Secret permissions: 2 Selected (select just Get and List)
- Certificate permissions: 0 Selected

Then add the same secrets to each KeyVault:

- Tenant id: `00000000-0000-0000-0000-000000000000`
- Client id: `00000000-0000-0000-0000-000000000000`
- Subscription id: `00000000-0000-0000-0000-000000000000`
- Client secret: `some crazy string`
Re-run the pipeline, this time with:

- Azure subscription id: `00000000-0000-0000-0000-000000000000`
- Service connection: `DatabricksDevOpsConnection`
- Mode: `Databricks` (this is the default, so you really do not need to select it)

You should set approvals on each environment so the pipeline does not deploy to QA or Prod without an approval.
I typically use the exact same name for each of my Azure resources in every environment and simply append the environment ("-Dev", "-QA", "-Prod") to make my deployments easy to author in my pipelines. I always suggest an easy naming standard per environment to make your DevOps code easy to write.
When the pipeline runs:

- Azure resource groups are created (if they do not exist)
- Azure Databricks workspaces are created, or existing ones are set to the state in the ARM template
- Azure KeyVaults are created, or existing ones are set to the state in the ARM template
- KeyVault secrets are downloaded by DevOps
- Init scripts are deployed and `dbfs:/init-scripts` is created (see the sketch after this list)
- Clusters are deployed
- Notebooks are deployed. They are not placed under the `/Users` folder; they go in a new folder that you specify, at the workspace root. I consider notebooks under a user as experimental and they should not be used for official jobs.
- Jobs are deployed

The scripts that perform these steps live in the `deployment-scripts` folder.
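A minimal sketch of what the init-script step amounts to, via the DBFS REST API (the workspace URL, file name and `DB_TOKEN` are placeholder assumptions, not the exact contents of the repo's scripts):

```bash
# Create dbfs:/init-scripts, then upload one init script through the DBFS
# API. DB_TOKEN is an Azure AD token as in the earlier example; the URL
# and file name are placeholders. base64 -w0 assumes GNU coreutils.
curl -s -X POST -H "Authorization: Bearer ${DB_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"path": "/init-scripts"}' \
  "https://eastus2.azuredatabricks.net/api/2.0/dbfs/mkdirs"

curl -s -X POST -H "Authorization: Bearer ${DB_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"path\": \"/init-scripts/my-init.sh\",
       \"contents\": \"$(base64 -w0 my-init.sh)\",
       \"overwrite\": true}" \
  "https://eastus2.azuredatabricks.net/api/2.0/dbfs/put"
```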
You can run the deployment scripts yourself against a scratch workspace, e.g. a workspace named `Databricks-MyProject-WorkArea` in the resource group `Databricks-MyProject-WorkArea`. Run each script with the working directory set to the same folder as the artifacts (e.g. if deploying jobs then run from the jobs folder):
```bash
# Change the below path
cd ./Azure-Databricks-Dev-Ops/notebooks/MyProject

../../deployment-scripts/deploy-notebooks.sh \
  '00000000-0000-0000-0000-000000000000 (tenant id)' \
  '00000000-0000-0000-0000-000000000000 (client id)' \
  '... (client secret)' \
  '00000000-0000-0000-0000-000000000000 (subscription id)' \
  'Databricks-MyProject-Dev' \
  'Databricks-MyProject-Dev' \
  '/ProjectFolder'
```
"existing_cluster_id": "Small"
. The existing cluster id says "Small" which is NOT an actual cluster id. It is actually the name field in the small-cluster.json ("cluster_name": "Small",
). During deployment the deploy-jobs.sh will lookup the existing_cluster_id value in the name field and populate the jobs JSON with the correct Databricks cluster id.
```json
{
  "name": "Test-DevOps-Job-Interactive-Cluster",
  "existing_cluster_id": "Small",  <- this gets replaced with the actual cluster id
  "email_notifications": {},
  "timeout_seconds": 0,
  "notebook_task": {
    "notebook_path": "/MyProject/Pop vs. Price SQL.sql",
    "revision_timestamp": 0
  },
  "max_concurrent_runs": 1
}
```
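A minimal sketch of that lookup-and-replace, reusing the `DB_TOKEN` and workspace URL from the earlier token example (this illustrates the technique; it is not the exact contents of deploy-jobs.sh):

```bash
# Resolve the cluster name in existing_cluster_id to a real cluster id,
# then create the job. job.json and the workspace URL are placeholders.
CLUSTER_NAME=$(jq -r '.existing_cluster_id' job.json)

CLUSTER_ID=$(curl -s -H "Authorization: Bearer ${DB_TOKEN}" \
  "https://eastus2.azuredatabricks.net/api/2.0/clusters/list" \
  | jq -r --arg name "$CLUSTER_NAME" \
      '.clusters[] | select(.cluster_name == $name) | .cluster_id')

jq --arg id "$CLUSTER_ID" '.existing_cluster_id = $id' job.json \
  | curl -s -X POST -H "Authorization: Bearer ${DB_TOKEN}" \
      -H "Content-Type: application/json" --data @- \
      "https://eastus2.azuredatabricks.net/api/2.0/jobs/create"
```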
Use `Sample-REST-API-To-Databricks.sh` to call the List operation to get existing items from a workspace. If you create a new Job in Databricks, run this script calling `jobs/list` to grab the JSON to place in source control.
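For example, something along these lines pulls one job's settings out of a workspace so the JSON can be committed (the job name, output file and `DB_TOKEN` are assumptions for illustration):

```bash
# Fetch an existing job's settings and save them for source control.
# Job name, output path and workspace URL are placeholder assumptions.
curl -s -H "Authorization: Bearer ${DB_TOKEN}" \
  "https://eastus2.azuredatabricks.net/api/2.0/jobs/list" \
  | jq '.jobs[] | select(.settings.name == "Test-DevOps-Job-Interactive-Cluster")
        | .settings' \
  > my-job.json
```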