Road to Azure Data Engineer Part-I: DP-200 - Implementing an Azure Data Solution
The modules covered in DP-200 and their weightings:
Data Storage - (20-30%)
Data Processing - (30-35%)
Data Security - (15-20%)
Monitoring Data Solutions - (10-15%)
Troubleshooting Azure services - (10-15%)
Duration: 210 min
Around 60 multiple-choice questions, plus a lab consisting of around 10 small tasks to be performed on the Azure Portal.
Questions focus on the "skills measured" topics listed for DP-200.
Typical questions test the selection of Azure services for a given case study, as well as the sequence of steps involved in implementing a solution, e.g. ingesting data from ADLS into SQL DW using PolyBase.
7 mandatory multiple-choice questions. (You cannot go back after selecting an answer)
Case Study
Mixed Questions
Four configuration options are available:
Azure Blob
Azure Files
Azure Queues
Azure Tables
Performance:
Access tier:
Note:
Azure Blob Usage
Apache Spark-based analytics platform
Enterprise Security
Integration with other Cloud Services
SQL | DataFrame |
---|---|
SELECT col_1 FROM myTable | df.select(col("col_1")) |
DESCRIBE myTable | df.printSchema() |
SELECT * FROM myTable WHERE col_1 > 0 | df.filter(col("col_1") > 0) |
..GROUP BY col_2 | ..groupBy(col("col_2")) |
..ORDER BY col_2 | ..orderBy(col("col_2")) |
..WHERE year(col_3) > 1990 | ..filter(year(col("col_3")) > 1990) |
SELECT * FROM myTable LIMIT 10 | df.limit(10) |
display(myTable) (text format) | df.show() |
display(myTable) (HTML format) | display(df) |
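A minimal PySpark sketch of the equivalences above, assuming a registered table named myTable with columns col_1, col_2, and col_3 (col_3 a date):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year

spark = SparkSession.builder.appName("sql-vs-dataframe").getOrCreate()

# Assumes a table named "myTable" is already registered in the metastore.
df = spark.table("myTable")

df.printSchema()                                  # DESCRIBE myTable

positives = df.filter(col("col_1") > 0)           # WHERE col_1 > 0
recent    = df.filter(year(col("col_3")) > 1990)  # WHERE year(col_3) > 1990
counts    = df.groupBy(col("col_2")).count()      # GROUP BY col_2 (with an aggregate)

df.orderBy(col("col_2")).limit(10).show()         # ORDER BY col_2 LIMIT 10, text output
```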
Basic Transformations
Advanced Transformations
Globally distributed databases can be built with Cosmos DB.
Azure Cosmos DB indexes every field by default
Azure Cosmos DB (NoSQL)
Request Unit (RU) for a DB
Note: A physical node can hold 10 GB of data, which means each unique partition key can have up to 10 GB of values.
The same resources can also be created using the Azure CLI:
az account list --output table   # lists the Azure subscriptions we have
az account set --subscription "<subscription name>"
az group list --output table     # lists the resource groups
export NAME="<Azure Cosmos DB account name>"
export RESOURCE_GROUP="<rgn>[sandbox resource group name]</rgn>"
export LOCATION="<location>"     # data center location
export DB_NAME="Products"
az group create --name <name> --location <location>
az cosmosdb create --name $NAME --kind GlobalDocumentDB --resource-group $RESOURCE_GROUP
az cosmosdb database create --name $NAME --db-name $DB_NAME --resource-group $RESOURCE_GROUP
az cosmosdb collection create --collection-name "Clothing" --partition-key-path "/productId" --throughput 1000 --name $NAME --db-name $DB_NAME --resource-group $RESOURCE_GROUP
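The same resources can also be created programmatically; a minimal sketch using the azure-cosmos Python SDK (the endpoint and key are placeholders):

```python
from azure.cosmos import CosmosClient, PartitionKey

# Placeholders: substitute your account's endpoint and key.
client = CosmosClient(
    "https://<account-name>.documents.azure.com:443/",
    credential="<account-key>",
)

database = client.create_database_if_not_exists(id="Products")

# Same partition key and throughput as the CLI example above.
container = database.create_container_if_not_exists(
    id="Clothing",
    partition_key=PartitionKey(path="/productId"),
    offer_throughput=1000,
)
```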
Consistency Level | Guarantees |
---|---|
Strong | Linearizability. Reads are guaranteed to return the most recent version of an item |
Bounded Staleness | Consistent prefix. Reads lag behind writes by at most K versions of an item or a time interval T. |
Session | Consistent prefix. Monotonic reads, monotonic writes, read-your-writes, write-follows-reads. |
Consistent Prefix | Updates returned are some prefix of all the updates, with no gaps. |
Eventual | Out of order reads. |
Expect a question on choosing a consistency level for latency :bangbang: :triangular_flag_on_post:
Which consistency level provides the lowest latency for reads and writes? :bangbang: :triangular_flag_on_post: - Eventual consistency
Cosmos DB takes care of data consistency when it is replicated :bangbang: :triangular_flag_on_post:
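For illustration, the consistency level can also be set per client in the Python SDK (the account-level default is configured on the account itself); a hedged sketch, with endpoint and key as placeholders:

```python
from azure.cosmos import CosmosClient

# Eventual gives the lowest read/write latency; Strong gives linearizability.
client = CosmosClient(
    "https://<account-name>.documents.azure.com:443/",
    credential="<account-key>",
    consistency_level="Eventual",
)
```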
To connect to the database:
jay@Azure:~$ az configure --defaults group=ms-dp-200 sql-server=jaysql01
jay@Azure:~$ az sql db list
O/P:
jay@Azure:~$ az sql db list | jq '[.[] | {name: .name}]'
O/P:
[
{
"name": "master"
},
{
"name": "sqldbjay01"
}
]
jay@Azure:~$ az sql db show --name sqldbjay01
az sql db show-connection-string --client sqlcmd --name sqldbjay01
O/P:
"sqlcmd -S tcp:<servername>.database.windows.net,1433 -d sqldbjay01 -U <username> -P <password> -N -l 30"
"sqlcmd -S tcp:sqldbjay01.database.windows.net,1433 -d sqldbjay01 -U jay -P “******” -N -l 30"
SELECT name FROM sys.tables;
GO
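The same query can also be scripted from Python; a minimal pyodbc sketch (server, user, and password are placeholders), assuming the Microsoft ODBC driver is installed:

```python
import pyodbc

# Placeholders: substitute your server, database, user, and password.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=tcp:<servername>.database.windows.net,1433;"
    "DATABASE=sqldbjay01;UID=<username>;PWD=<password>;Encrypt=yes;"
)

cursor = conn.cursor()
cursor.execute("SELECT name FROM sys.tables;")
for row in cursor.fetchall():
    print(row.name)
```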
SQL DB does not take care of data consistency when replicated; it needs to be handled manually. :bangbang: :triangular_flag_on_post:
Bottom-Up Architecture
Top-down Architecture
Elastic scale & performance
Powered by the Cloud
Using PolyBase to Load Data in Azure SQL Data Warehouse :bangbang: :triangular_flag_on_post:
The MPP engine’s integration method with PolyBase
The PolyBase ETL steps for a DW are:
- Go to Resource
Blobs
CREATE MASTER KEY;
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = 'jayDW',
    SECRET = 'THE-VALUE-OF-THE-ACCESS-KEY' -- put key1's value here
;
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://[email protected]',
    CREDENTIAL = AzureStorageCredential
);
CREATE EXTERNAL FILE FORMAT TextFile
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);
-- Load the data from Azure Blob storage to SQL Data Warehouse
CREATE TABLE [dbo].[StageDate]
WITH (
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT * FROM [dbo].[Temp];
-- Create statistics on the new data
CREATE STATISTICS [DateKey] ON [StageDate] ([DateKey]);
CREATE STATISTICS [Quarter] ON [StageDate] ([Quarter]);
CREATE STATISTICS [Month] ON [StageDate] ([Month]);
Linked services are much like connection strings, which define the connection information needed for Data Factory to connect to external resources.
Activities within ADF define the actions performed on the data; they fall into three categories: data movement, data transformation, and control activities.
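For illustration, a linked service is just a JSON definition; a sketch of an Azure Blob Storage linked service (the name and connection string are placeholders), shown here as a Python dict:

```python
# Sketch of an Azure Blob Storage linked service definition (placeholder values),
# as it would be submitted to the Data Factory REST API or kept in source control.
linked_service = {
    "name": "AzureBlobStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        },
    },
}
```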
Securing your network from attacks and unauthorized access is an important part of any architecture.
Databases store sensitive information, such as physical addresses, email addresses, and phone numbers. The following features are used to protect this data:
Azure Monitor provides a holistic monitoring approach by collecting, analyzing, and acting on telemetry from both cloud and on-premises environments
Log Analytics within Azure Monitor has the capability to monitor and measure network activity.
Data Lake Storage
SQL Database
Cosmos DB
Colocation of Resources
SQL Data Warehouse
When using Stream Analytics, a job encapsulates the Stream Analytics work and is made up of three components: input, query, and output.
When trying to resolve data load issues, it is first pragmatic to make the holistic checks on Azure, as well as the network checks and the "diagnose and solve problems" check. After that, check:
Data redundancy is the process of storing data in multiple locations to ensure that it is highly available.
There should be processes in place for backing up or providing failover for databases in an Azure data platform. Depending on circumstances, there are numerous approaches that can be adopted.
Advantages:
Unstructured: Product catalog data
Items only need to be retrieved by ID.
Customers require a high number of read operations with low latency.
Creates and updates will be somewhat infrequent and can have higher latency than read operations.
Latency & throughput: Retrievals by ID need to support low latency and high throughput. Creates and updates can have higher latency than read operations.
Transactional support: Not required
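Given these requirements (retrieval by ID, read-heavy, low latency), a Cosmos DB point read is a natural fit; a hedged sketch reusing the Clothing container from earlier (endpoint, key, and IDs are placeholders):

```python
from azure.cosmos import CosmosClient

client = CosmosClient(
    "https://<account-name>.documents.azure.com:443/",
    credential="<account-key>",
)
container = client.get_database_client("Products").get_container_client("Clothing")

# Point read: the cheapest, lowest-latency operation (1 RU for a 1 KB item)
# when both the item id and the partition key value are known.
item = container.read_item(item="<product-id>", partition_key="<product-id>")
print(item)
```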
Azure Service | Purpose |
---|---|
Azure SQL Data Sync | Bi-directional synchronization of data between Azure SQL and on-premises SQL |
Azure SQL DB elastic pools | Sized by eDTUs or vCores and max data size |
Azure Data Lake Storage | Azure Storage with a hierarchical namespace |
Azure SQL Database Managed Instance | Data migration between on-premises and cloud with almost 100% compatibility, e.g. from an on-premises or IaaS, self-built, or ISV-provided environment to a fully managed PaaS cloud environment, with as little migration effort as possible. |
Azure Resource Manager templates | Used when the same operation needs to be performed frequently or on a daily basis with minimal effort, e.g. clusters |
Data Migration Assistant | Synchronizes data from an on-premises Microsoft SQL Server database to Azure SQL Database and determines whether data will move without compatibility issues |
Azure Data Warehouse | Commonly used as an analytical data store |
Azure Data Factory | Orchestrates and manages the data lifecycle |
Azure Databricks (Spark) | In-memory processing; support for Scala, Java, Python, and R; cluster scale up or scale down |
Data load between any two of SQL <=> Blob <=> Data Warehouse | In 99% of cases we use CTAS (CREATE TABLE AS SELECT) rather than other operations such as INSERT INTO |
Azure Database Migration Service (DMS) | A fully managed service designed to enable seamless migrations from multiple database sources to Azure data platforms with minimal downtime (online migrations). |
Database Experimentation Assistant (DEA) | Helps you evaluate a targeted version of SQL Server for a specific workload. Customers upgrading from earlier versions of SQL Server (starting with 2005) to more recent versions of SQL Server can use the analysis metrics that the tool provides. |
SQL Server Migration Assistant (SSMA) | A tool designed to automate database migration to SQL Server from Microsoft Access, DB2, MySQL, Oracle, and SAP ASE. |
Azure Data Warehouse Table Type | Data Distribution | Reason | Fit For |
---|---|---|---|
Small Dimension Table | Replicated | Data size usually less than 2 GB | star schema with less than 2 GB of storage after compression |
Temporary/Staging Table | Round Robin | Data size usually less than 5 GB | No obvious joining key or good candidate column |
Fact Table | Hash Distributed | Data size is huge, more than 100 GB | Large fact tables |
Type | Fit For |
---|---|
Heap | Staging or temporary table, Small tables with small lookups |
Clustered index | Tables with up to 100 million rows, Large tables (more than 100 million rows) with only 1-2 columns heavily used |
Clustered columnstore index (CCI) (default) | Large tables (more than 100 million rows) |
 | STANDARD | HIGH CONCURRENCY |
---|---|---|
Recommended for... | Single User | Multiple Users |
Language Support | SQL, Python, R, and Scala | SQL, Python, and R (not Scala) |
Notebook Isolation | No | Yes |
TYPE | DESCRIPTION |
---|---|
Schedule | Runs on a wall-clock schedule (e.g. every X minutes/hours/days/weeks/months). |
Tumbling Window | A series of fixed-sized, non-overlapping, and contiguous time intervals. |
Event-based | Runs pipelines in response to an event (e.g. Blob Created/Deleted). |
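For illustration, a trigger is likewise a JSON definition; a sketch of a schedule trigger (names and times are placeholders), shown as a Python dict:

```python
# Sketch of an ADF schedule trigger definition (placeholder names/times).
schedule_trigger = {
    "name": "Every15MinutesTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Minute",   # Minute, Hour, Day, Week, or Month
                "interval": 15,
                "startTime": "2020-01-01T00:00:00Z",
                "timeZone": "UTC",
            }
        },
        "pipelines": [
            {"pipelineReference": {"type": "PipelineReference", "referenceName": "<pipeline-name>"}}
        ],
    },
}
```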
ENTITY | SQL | CASSANDRA | MONGODB | GREMLIN | TABLE |
---|---|---|---|---|---|
Container | Container | Table | Collection | Graph | Table |
Item | Document | Row | Document | Node or Edge | Item |
Database | Database | Keyspace | Database | Database | N/A |
Data Masking Attribute | Masking value | Example |
---|---|---|
Default | Zero value for numeric data types; mask with four or fewer Xs for string data types | Default values (0, xxxx, 01-01-1900) |
Custom | Mask everything except characters at beginning and at the end | Custom string (prefix [padding] suffix) |
Random Number | Returns a random number | Random number range eg: 0 to 100 |
Email | Exposes the first letter and masks the domain as XXX.com | [email protected] |
Credit card | Exposes the last four digits of the designated fields | XXXX-XXXX-XXXX-1234 |
Azure Service | Parameters |
---|---|
Azure Stream Analytics | Depends on Streaming Unit |
Azure SQL DB | Depends on DTU's |
Azure SQL Data Warehouse | Depends on cache used; measured in Data Warehouse Units (DWUs) |
Azure Data Factory | Depends on Data Integration Units (DIUs) |
Azure Cosmos DB | Depends on Request Units (RUs) |