ML Ops Accelerator: Databricks & Azure Machine Learning Unification
Development for Azure DevOps Deployment in Progress
Features to be included in future releases:
The deployment instructions in the video are slightly outdated (albeit still useful). Please follow the instructions below instead. The video still provides useful content on concepts outside the deployment.
This Repository contains an Azure Databricks Continuous Deployment and Continuous Development Framework for delivering Data Engineering/Machine Learning projects based on the following Azure technologies:
| Azure Databricks | Azure Log Analytics | Azure Monitor Service | Azure Key Vault |
| --- | --- | --- | --- |
Azure Databricks is a powerful technology, used ubiquitously by Data Engineers and Data Scientists. However, operationalizing it within a fully automated Continuous Integration and Deployment setup may prove challenging.
The net effect is that Data Scientists and Engineers spend a disproportionate amount of their time on DevOps matters. This Repository's guiding vision is to automate as much of the infrastructure as possible.
```powershell
az login
# If There Are Multiple Tenants In Your Subscription, Ensure You Specify The Correct Tenant "az login --tenant"
# ** Microsoft Employees Use: az login --tenant fdpo.onmicrosoft.com (New Non Prod Tenant )
echo "Enter Your Git Username... "
# Example: "Ciaran28"
$Git_Configuration = "GitHub_Username"
echo "Enter Your Git Repo Url (this could be any Repository In Your Account )... "
# Example: "https://github.com/ciaran28/dstoolkit-mlops-databricks"
$Repo_ConfigurationURL = ""
echo "From root execute... "
./setup.ps1
```
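Before running `./setup.ps1`, it can help to sanity-check the two values entered above. The following is a minimal Python sketch and is not part of the accelerator: the function name `repo_owner_matches` is hypothetical, and it assumes the repository URL has the form `https://github.com/<owner>/<repo>`. The owner is compared case-insensitively, which is why the example username "Ciaran28" matches the example URL.

```python
from urllib.parse import urlparse


def repo_owner_matches(username: str, repo_url: str) -> bool:
    """Return True if repo_url looks like a GitHub repository owned by username."""
    parsed = urlparse(repo_url)
    # The path looks like "/<owner>/<repo>"; drop the empty segments around slashes.
    parts = [segment for segment in parsed.path.split("/") if segment]
    if parsed.netloc.lower() != "github.com" or len(parts) < 2:
        return False
    # Compare the owner segment with the username, ignoring case.
    return parts[0].lower() == username.lower()


if __name__ == "__main__":
    print(repo_owner_matches(
        "Ciaran28",
        "https://github.com/ciaran28/dstoolkit-mlops-databricks",
    ))
```

An empty URL, as in the unedited script, returns `False`, which makes it a quick reminder to fill in both variables first.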
Follow the naming convention (case sensitive)
For each environment, create GitHub Secrets named ARM_CLIENT_ID, ARM_CLIENT_SECRET and ARM_TENANT_ID, using the output shown in the VS Code PowerShell terminal in the previous step. (Note: the Service Principal shown below has been destroyed, so its credentials are useless.)
In addition, generate a GitHub Personal Access Token and use it to create a secret named PAT_GITHUB:
Secrets in GitHub should look exactly like the example below. Treat the names as case sensitive and enter them exactly as shown.
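As a rough illustration of the naming check, the sketch below looks for the four secret names listed above in a mapping of variables. It assumes the workflow exposes each secret as an environment variable of the same name, which this README does not confirm; the function name `check_secrets` is hypothetical.

```python
import os

# The four secret names this README asks for.
REQUIRED_SECRETS = ("ARM_CLIENT_ID", "ARM_CLIENT_SECRET", "ARM_TENANT_ID", "PAT_GITHUB")


def check_secrets(env):
    """Return (missing, wrong_case) for the required secret names.

    missing:    required names with no entry at all
    wrong_case: (found_name, required_name) pairs that differ only by case
    """
    names = set(env)
    by_upper = {name.upper(): name for name in names}
    missing, wrong_case = [], []
    for required in REQUIRED_SECRETS:
        if required in names:
            continue  # exact match, nothing to report
        if required in by_upper:
            wrong_case.append((by_upper[required], required))
        else:
            missing.append(required)
    return missing, wrong_case


if __name__ == "__main__":
    # Assumption: secrets are available as environment variables in this process.
    missing, wrong_case = check_secrets(os.environ)
    print("missing:", missing)
    print("wrong case:", wrong_case)
```

Only the names are inspected; the secret values are never read or printed.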
The end-to-end machine learning pipeline is pre-configured in the "Workflows" section of Databricks. It uses a Job Cluster, which automatically uploads the necessary dependencies contained in a Python wheel file.
If you wish to run the machine learning scripts from a Notebook instead, first upload the dependencies yourself (automatic upload is in development). Navigate to the Python wheel file in the dist/ folder and manually upload it to the cluster on which you want to run the Notebook.
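If dist/ holds several builds, a small helper can pick out the file to upload. This is a minimal sketch, assuming wheels sit directly in a local `dist` folder and that the most recently modified one is the one you want; `latest_wheel` is a hypothetical name, not something the accelerator provides.

```python
from pathlib import Path


def latest_wheel(dist_dir="dist"):
    """Return the most recently modified .whl file in dist_dir, or None if there is none."""
    directory = Path(dist_dir)
    if not directory.is_dir():
        return None
    # Sort wheels by modification time so the newest build comes last.
    wheels = sorted(directory.glob("*.whl"), key=lambda path: path.stat().st_mtime)
    return wheels[-1] if wheels else None


if __name__ == "__main__":
    wheel = latest_wheel()
    print(wheel if wheel else "No wheel found in dist/")
```

The returned path is the file to upload manually to the cluster.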
The branching strategy I have chosen is configured automatically as part of the accelerator. It follows a GitHub Flow paradigm, with some nuances, to facilitate rapid Continuous Integration. See Footnote 1, which contains the SST Git Flow article written by Willie Ahlers for the Data Science Toolkit; it provides a narrative explaining the numbers below.[^1]
The branching strategy is easy to change by updating the "if conditions" within .github/workflows/onRelease.yaml.
In most situations, Databricks recommends that during the ML development process, you promote code, rather than models, from one environment to the next. Moving project assets this way ensures that all code in the ML development process goes through the same code review and integration testing processes. It also ensures that the production version of the model is trained on production code. For a more detailed discussion of the options and trade-offs, see Model deployment patterns.
https://learn.microsoft.com/en-us/azure/databricks/machine-learning/mlops/deployment-patterns
In an organization, thousands of features are buried in different scripts and in different formats; they are not captured, organized, or preserved, and thus cannot be reused and leveraged by teams other than those who generated them.
Because feature engineering is so important for machine learning models and features cannot be shared, data scientists must duplicate their feature engineering efforts across teams.
To solve these problems, a concept called the feature store was developed, so that: