Build, run and manage your data pipelines with Python or SQL on any cloud
VDK DAG (previously vdk-meta-jobs) is the new official name of the plugin that allows users to express dependencies between data jobs. It is released as Beta, with improvements in stability, usability, and documentation.
Check out the plugin page for more details.
Now users can share links with filters applied:
Users can now access the VDK UI using quickstart-vdk. The VDK UI has also been made much more configurable:
Users can now specify the Python version their job should run with when deployed in the VDK Control Service runtime:
vdk deploy --python-version 3.7 ..
Or in job config.ini
[job]
python_version = 3.7
Users can also see which Python versions the VDK Control Service currently supports:
vdk info
would return something like
Getting control service information...
VDK Control service version: PipelinesControlService/0.0.1-SNAPSHOT/5f078fe
...
Supported python versions: 3.9 3.8
Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.13...v0.14
An installed Generative Data Pack plugin automatically expands the data sent for ingestion.
The GDP plugin detects the execution ID of the running Data Job and decorates your data product with it, making it possible to correlate a data record with the particular Data Job execution that ingested it.
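As an illustration (the exact attribute name and value are assumptions here; the plugin documentation has the authoritative details), a payload could be decorated like this:
# Payload as produced by the data job:
payload = {"user_count": 42}
# After the plugin decorates it (attribute name and value are hypothetical):
decorated = {"user_count": 42, "vdk_gdp_execution_id": "my-job-1680000000-abcde"}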
For more information see the plugin documentation
Each job in a DAG can now be passed arguments:
{
"job_name": "name-of-job",
"team_name": "team-of-job",
"fail_meta_job_on_error": false,
"arguments": <ARGUMENTS IN DICTIONARY FORMAT HERE>,
"depends_on": ["name-of-job1", "name-of-job2"]
}
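For illustration, a job entry passing concrete (hypothetical) arguments might look like this:
{
    "job_name": "name-of-job",
    "team_name": "team-of-job",
    "fail_meta_job_on_error": false,
    "arguments": {"target_db": "dev", "batch_size": 100},
    "depends_on": ["name-of-job1", "name-of-job2"]
}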
Users can develop jobs entirely in a Notebook file, with all features of VDK available out of the box. After installing vdk-notebook, users have access to the job_input interface to execute templates, ingest data, and everything else.
To keep production and development code separate, the vdk-notebook integration provides a way for users to mark which cells are deployable and part of their production code and which are not, as sketched below.
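As a sketch, assuming the cell-tag convention described in the plugin documentation (treat the tag name as an assumption), a deployable cell is marked via its metadata in the .ipynb file:
{
    "cell_type": "code",
    "metadata": {"tags": ["vdk"]},
    "source": ["job_input.execute_query(\"SELECT 1\")"]
}
Cells without the tag are treated as development-only and are not deployed as part of the production job.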
When installing quickstart-vdk, VDK Server is available for local testing and now includes the UI:
pip install quickstart-vdk
vdk server --install
For more information see here
The Versatile Data Kit Frontend provides 2 npm (Angular) libraries which can be used to integrate the VDK UI with your own screens:
Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.12...v0.13
The VDK Operations UI enables data practitioners to efficiently manage (operate and monitor) their data jobs. It has been used internally in VMware for some time, and the team open-sourced it last month.
Check out more details at the Operations UI VEP
Look forward to the official launch soon.
Significantly simplified and improved the main README and CONTRIBUTING.md, thanks to @gary-tai and @zverulacis.
Implemented a limit on how many jobs a Meta Job can start at once:
META_JOBS_MAX_CONCURRENT_RUNNING_JOBS=<number>
Learn more about the VDK Meta Jobs features in VDK Meta Jobs VEP
We are working on introducing an optional python_version property to the Control Service API, which allows users to specify the Python version they want to use for their job deployment. This means users no longer have to rely on the service administrator to make changes to the configuration and can deploy their jobs with the version they need.
See more information in the Multiple Python Versions VEP
So far, the recommended way to store secrets in VDK was the Properties API. Though it works well, it doesn't really meet the criteria for storing properly restricted, and likely confidential, data.
The team is working on providing a Secrets interface, similar to the Properties one, backed by HashiCorp Vault.
See more information in the Vault Integration For Secrets Storage VEP
Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.11...v0.12
Allow quality checks to be made before the data is inserted into the target table. Currently, the checks done in the processing step do not cover whether the semantics of the data are correct; therefore, bad data could end up in the target table, which could be unwanted behavior.
Example:
def sample_check(tmp_table_name):
    # Reject the staged data if the temporary table is flagged as bad
    return False if "bad" in tmp_table_name else True

template_args["check"] = sample_check
job_input.execute_template(
    template_name="load/dimension/scd1",
    template_args=template_args,
)
When querying information about jobs, users of the Jobs Query API can now use wildcard matching (for example *search*) in GraphQL filters for job name and team name, in addition to the exact matching of search strings supported before.
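A sketch of such a query (the filter shape here is an assumption; consult the Jobs Query API documentation for the exact schema):
query {
  jobs(pageNumber: 1, pageSize: 20, filter: [{property: "jobName", pattern: "*search*"}]) {
    content {
      jobName
    }
  }
}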
Users want to be able to determine where requests originated from when analyzing and browsing telemetry data about VDK Control Service usage. The user agent can now be set explicitly:
export VDK_CONTROL_SERVICE_USER_AGENT=foo
or in config.ini
[vdk]
vdk_control_service_user_agent=foo
If not set, it defaults to "vdk-control-cli/{version} ({os.name}; {sys.platform})" plus the Python version.
A new VDK plugin that supports running data jobs which consist of .ipynb files. See the VDK Notebook plugin page for more information.
This extension introduces a magic command for Jupyter. The command enables users to load job_input for their current data job and use it freely while working in Jupyter. See the VDK IPython plugin page for more information.
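A usage sketch (magic and helper names follow the plugin's documentation; treat them as assumptions if your version differs):
%load_ext vdk.plugin.ipython
%reload_VDK
job_input = VDK.get_initialized_job_input()
job_input.send_object_for_ingestion(payload={"value": 1}, destination_table="demo_table")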
Check the installation page
Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.10...v0.11
Introduces thread-dump capabilities in Data Jobs.
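A usage sketch (the configuration option name is taken from the plugin documentation at the time of writing and should be treated as an assumption):
export VDK_TROUBLESHOOT_UTILITIES_TO_USE="thread-dump"
vdk run <job-name>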
See more details in the plugin home page and the VDK Enhancement Proposal
Introduces support for Python 3.11 in vdk-core and other plugins
See installation instructions here. The versions of VDK components released under VDK 0.10 are:
control-service 1.5.707959356
vdk-core==0.3.723457889
vdk-lineage-model==0.0.723435904
vdk-meta-jobs==0.1.723435904
vdk-sqlite==0.1.730902357
vdk-jobs-troubleshooting==0.2.741769066
vdk-lineage==0.3.723435904
vdk-control-cli==1.3.736732752
Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.9...v0.10
Using this plugin you can specify dependencies between data jobs as a directed acyclic graph (DAG).
For example:
def run(job_input):
    jobs = [
        {
            "job_name": "name-of-job",
            "team_name": "team-of-job",
            "fail_meta_job_on_error": True,  # or False
            "depends_on": ["name-of-job1", "name-of-job2"]
        },
        ...
    ]
    MetaJobInput().run_meta_job(jobs)
See more details in the plugin home page
During installation of the Control Service, administrators can limit what types of files can be uploaded as part of a data job.
A new configuration option called uploadValidationFileTypesAllowList is added. It is a comma-separated list of file types.
For example, setting
uploadValidationFileTypesAllowList=image/png,text/plain
means only PNG images and plain text files can be uploaded; any other upload request will fail.
See more details in helm chart documentation
This plugin allows for the configuration of the format of VDK logs.
Before, there were separate plugins for each format, but they are now deprecated in favour of this one.
The plugin introduces a new configuration option LOGGING_FORMAT with possible values 'json', 'ltsv' and 'text':
export LOGGING_FORMAT=json
For embedded Control Service metadata storage, the Bitnami PostgreSQL chart has been added as an option.
Now users can install it with:
helm install vdk-control-service --set postgresql.enabled=true --set cockroachdb.enabled=false
See installation instructions here. The versions of VDK components released under VDK 0.9 are:
control-service 1.5.707959356
vdk-core==0.3.692414765
vdk-logging-json==0.1.693641831
vdk-meta-jobs==0.1.684477187
vdk-postgres==0.0.692283840
vdk-trino==0.4.703555598
Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.8...v0.9
This plugin provides the ability to audit and potentially limit user operations; it requires Python 3.8 or newer. The audited operations can be deep within the Python runtime or standard libraries, such as dynamic code compilation, module imports, or OS command invocations.
If we want to forbid some os.* operations we can do it like this:
export AUDIT_HOOK_ENABLED=true
export AUDIT_HOOK_FORBIDDEN_EVENTS_LIST='os.removexattr;os.rename;os.rmdir;os.scandir'
export AUDIT_HOOK_EXIT_ON_FORBIDDEN_EVENT=true
vdk run <job-name>
See more details in the vdk-audit plugin page
Jobs deployed by the Control Service can now automatically use any version of Python, not just 3.7.
This template can be used to load raw data from a Data Lake to a target table in a Data Warehouse. In summary, it appends all records from the source table to the target table. As with all other SQL modeling templates, schema validation is performed, and table refreshes and statistics are computed when necessary.
Example:
def run(job_input):
    # . . .
    template_args = {
        'source_schema': 'source',
        'source_view': 'view_source',
        'target_schema': 'target',
        'target_table': 'destination_table'
    }
    job_input.execute_template('insert', template_args)
See more details in the template documentation page
See installation instructions here. The versions of VDK components released under VDK 0.8 are:
control-service 1.5.671965442
vdk-core==0.3.662978536
vdk-ingest-http==0.2.670842377
vdk-impala==0.4.672320306
Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.7...v0.8
Since template executions are autonomous data job runs, we need to be able to determine whether a template is running at any time, for example to distinguish between root data job finalization and template finalization.
For example, if we want to send telemetry somewhere:
@hookimpl
def finalize_job(self, context: JobContext) -> None:
    template = context.core_context.state.get(ExecutionStateStoreKeys.TEMPLATE_NAME)
    if template:
        telemetry.send(phase="finalize_template", template_name=template)
    else:
        telemetry.send(phase="finalize_job", job_name=context.name)
Enable users to override log levels per module, temporarily (e.g. for debugging or prototyping, to increase the verbosity of a certain module).
For example, assuming the default log level is INFO, we can enable verbose logs for 2 modules, "vdk.api" and "custom.module":
export LOG_LEVEL_MODULE="vdk.api=DEBUG;custom.module=DEBUG"
vdk run job-name
Or in specific job config.ini:
[vdk]
log_level_module=vdk.api=DEBUG;custom.module=DEBUG
A simple plugin that allows a developer or presenter to quickly store properties on the local FS.
It can be used to store secrets/configuration for a dev/demo session without requiring the entire Control Service to be installed and running. It can also be used to test a job run locally without updating the state of the deployed job.
Example:
export PROPERTIES_DEFAULT_TYPE="fs-properties-client"
or in a specific job's config.ini
[vdk]
properties_default_type=fs-properties-client
Now properties are stored in a local file. The file location can be further configured using FS_PROPERTIES_FILENAME and FS_PROPERTIES_DIRECTORY.
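Once configured, the standard job_input properties interface works unchanged; for example:
def run(job_input):
    # With fs-properties-client these calls read and write a local file
    # instead of going through the Control Service Properties API.
    job_input.set_all_properties({"api_token": "demo-token"})
    token = job_input.get_property("api_token")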
Create a new plugin skeleton very easily:
cookiecutter https://github.com/tozka/cookiecutter-vdk-plugin.git
and follow the instructions
A job (or a template) can now be canceled from any step, and all remaining steps in the job (or template) will be skipped. For example, if a data job depends on processing data from a source that has indicated no new entries since the last run, the remaining steps can be skipped.
Example:
from vdk.api.job_input import IJobInput

def run(job_input: IJobInput):
    data = get_last_delta()
    if not data:
        # Nothing new to process - skip the rest of the job's steps
        job_input.skip_remaining_steps()
See installation instructions here. The versions of VDK components released under VDK 0.7 are:
control-service 1.5.622899758
vdk-control-cli==1.3.626767210
vdk-core==0.3.652866366
vdk-properties-fs==0.0.651770458
vdk-kerberos-auth==0.3.631374202
vdk-impala==0.4.651849986
Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.6...v0.7
Previously, configuration options had to be prefixed with "VDK_" when set as environment variables in order to be recognized. This was very error prone, since the options are documented without the prefix.
Now they can be set without a prefix as well.
The following are equivalent:
export VDK_DB_DEFAULT_TYPE='impala'
export DB_DEFAULT_TYPE='impala'
If both are set, the "prefixed" variable has a higher priority.
The VDK Lineage Model plugin aims to abstract emitting lineage data from VDK data jobs, so that different lineage loggers can be configured at run time in any plugin that supports emitting lineage data.
Check out more at the plugin page.
Alongside vdk ingest-csv, which enables users to import (or ingest) CSV data into a table, users can now export the result of a SQL query to CSV with a simple command:
vdk export-csv -q "select * from my_table" --file "output.csv"
Check out more at the plugin page
Until now, properties required the Control Service in order to work. Sometimes, for prototyping and testing purposes, you do not need to connect to external services. Properties can now be kept in memory:
In a specific job's config file (config.ini):
[vdk]
properties_default_type = memory
Or as an environment variable
export PROPERTIES_DEFAULT_TYPE="memory"
An example of how to anonymize data being ingested with VDK, using a plugin.
Check out more at the example page
An example of how to create dependencies between data jobs in Airflow.
Check out more at the example page
See installation instructions here. The versions of VDK components released under VDK 0.6 are:
control-service 1.5.620438292
vdk-core==0.3.620677184
airflow-provider-vdk==0.0.602273476
vdk-lineage-model==0.0.581430542
vdk-kerberos-auth==0.3.584577337
vdk-ingest-http==0.2.616713987
vdk-impala==0.4.613570906
vdk-lineage==0.3.604201902
vdk-trino==0.4.605101952
Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.5...v0.6
The hooks enable users to add behavior around existing SQL queries without modifying the code itself. A hook is invoked before and after each query, enabling its full execution to be tracked. For example:
import time
from typing import Optional

@hookimpl(hookwrapper=True)
def db_connection_execute_operation(execution_cursor: ExecutionCursor) -> Optional[int]:
    start = time.time()
    outcome = yield  # yield so that the query itself is executed
    end = time.time()
    log.info(f"Query duration: {end - start}s.")
Users can integrate with Apache Airflow to orchestrate Data Jobs in a DAG (workflow). Check out more at airflow-provider-vdk
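A minimal DAG sketch using the provider's operator (based on the provider's documented example; the connection id, job name and team name are hypothetical):
from datetime import datetime

from airflow import DAG
from vdk_provider.operators.vdk import VDKOperator

with DAG("example_vdk_dag", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    run_job = VDKOperator(
        conn_id="vdk-default",    # Airflow connection pointing to the VDK Control Service
        job_name="example-job",   # the deployed data job to trigger
        team_name="example-team",
        task_id="run_example_job",
    )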
Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.4...v0.5