Versatile Data Kit Versions

Build, run and manage your data pipelines with Python or SQL on any cloud

v0.14

11 months ago

Major features include:

VDK DAG plugin release

VDK DAG (previously vdk-meta-jobs) is the official name of the plugin that allows users to express dependencies between data jobs. It is released as Beta with stability, usability, and documentation improvements.

Check out the plugin page for more details.

Now users can share links with filters applied:

  • The Data Jobs list (Manage and Explore screens) is shareable through the URL, as every applied filter is persisted to the URL and vice versa
  • The Data Job Executions screen's filter and sort parameters are shareable through the URL, as every applied filter or sort is persisted to the URL and vice versa

VDK UI configuration improvements and easier getting started with quickstart-vdk

Users can now access the VDK UI using quickstart-vdk. The VDK UI has also been made much more configurable:

  • Toggleable authentication (default: enabled) using the 'skipAuth' flag.
  • Configuration of authentication parameters.
  • Ability to specify visual elements displayed, e.g., navigation button to the Explore page.

VDK Control CLI supports specifying the Python version

Users can now specify the Python version their job should run with when it is deployed in the VDK Control Service runtime:

vdk deploy --python-version 3.7 ..

Or in the job's config.ini:

[job]
python_version = 3.7

Users can also see which Python versions the VDK Control Service currently supports:

vdk info

would return something like

Getting control service information...
VDK Control service version: PipelinesControlService/0.0.1-SNAPSHOT/5f078fe
...
Supported python versions: 3.9 3.8


Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.13...v0.14

v0.13

1 year ago

Major features include:

New plugin: vdk-gdp-execution-id

An installed Generative Data Pack plugin automatically expands the data sent for ingestion.

This GDP plugin detects the execution ID of the running Data Job and decorates your data product with it, making it possible to correlate a data record with the execution ID of the particular ingestion Data Job run.
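As a sketch of the idea (the function and field names here are illustrative, not the plugin's actual API), decorating ingested payloads with an execution ID could look like this:

```python
# Hypothetical sketch of a GDP-style decoration step (names are
# illustrative): tag every payload sent for ingestion with the
# current job execution ID so records can be correlated later.
def decorate_with_execution_id(payloads, execution_id):
    """Return copies of the payloads, each tagged with the execution ID."""
    return [{**p, "vdk_gdp_execution_id": execution_id} for p in payloads]
```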

For more information, see the plugin documentation.

vdk-dag: pass arguments to jobs in a DAG

Each job in a DAG can now be passed arguments:

{
    "job_name": "name-of-job",
    "team_name": "team-of-job",
    "fail_meta_job_on_error": false,
    "arguments": <ARGUMENTS IN DICTIONARY FORMAT HERE>,
    "depends_on": ["name-of-job1", "name-of-job2"]
}
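For illustration, a concrete DAG list with the arguments filled in might look like this (the job names, team name, and argument values are placeholders, not from the release):

```python
# Illustrative DAG definition with per-job arguments (all names and
# values are placeholders):
jobs = [
    {
        "job_name": "ingest-sales",
        "team_name": "analytics",
        "fail_meta_job_on_error": False,
        "arguments": {"target_date": "2023-01-01", "full_reload": False},
        "depends_on": [],
    },
    {
        "job_name": "build-report",
        "team_name": "analytics",
        "fail_meta_job_on_error": True,
        "arguments": {"target_date": "2023-01-01"},
        "depends_on": ["ingest-sales"],  # runs only after ingest-sales
    },
]
```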

vdk-notebook: VDK job input in vdk cells

Users can now develop jobs entirely in a notebook file with all VDK features available out of the box. After installing vdk-notebook, users have access to the job_input interface to execute templates, ingest data, and everything else.


vdk-notebook: vdk and non-vdk cells

To enable separation of production and development code, the vdk-notebook integration provides a way for users to mark which cells are deployable and part of their production code and which are not.


quickstart-vdk now includes the Operations UI

When installing quickstart-vdk, the VDK Server is available for local testing and now includes the UI:

pip install quickstart-vdk
vdk server --install 

For more information see here

Versatile Data Kit Frontend npm libraries release

The Versatile Data Kit Frontend provides two npm (Angular) libraries which can be used to integrate the VDK UI with your own screens:

  • @versatiledatakit/data-pipelines: the Versatile Data Kit Data Pipelines library provides UI screens that help manage data jobs via the Versatile Data Kit Control Service
  • @versatiledatakit/shared: the Versatile Data Kit Shared library enables reuse of shared features such as NgRx Redux, error handlers, utils, and generic components


Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.12...v0.13

v0.12

1 year ago

Major features include:

Open-sourcing VDK Operations UI

The VDK Operations UI enables data practitioners to efficiently manage (operate and monitor) their data jobs. It has been used internally at VMware for some time, and the team open-sourced it last month.

Check out more details at the Operations UI VEP

Look forward to the official launch soon.

Documentation Improvements

Significantly simplified and improved the main README and CONTRIBUTING.md, thanks to @gary-tai and @zverulacis.

VDK Meta Jobs Preparation for Alpha release

Implemented a limit on the number of jobs started at once:

META_JOBS_MAX_CONCURRENT_RUNNING_JOBS=<number>

Learn more about the VDK Meta Jobs features in VDK Meta Jobs VEP

Started initiative to support multiple python versions

We are working on introducing an optional python_version property to the Control Service API, which allows users to specify the Python version they want to use for their job deployment. This means users no longer have to rely on the service administrator to make changes to the configuration and can deploy their jobs with the version they need.

See more information in the Multiple Python Versions VEP

Started initiative to create Secrets Interface

So far, the recommended way to store secrets in VDK was to use the Properties API. Though it works well, it doesn't really meet the criteria for storing properly restricted, and likely also confidential, data.

The team is working on providing a Secrets interface, similar to the Properties one, backed by HashiCorp Vault.

See more information in the Vault Integration For Secrets Storage VEP


Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.11...v0.12

v0.11

1 year ago

Major features include:

Introduce data quality checks (pre-alpha) (for scd1 template)

This allows quality checks to be run before the data is inserted into the target table. Currently, the checks done in the processing step do not cover whether the semantics of the data are correct; therefore, bad data could end up in the target table, which could be unwanted behavior.

Example:

    def sample_check(tmp_table_name):
        return False if "bad" in tmp_table_name else True

    template_args["check"] = sample_check
    job_input.execute_template(
        template_name="load/dimension/scd1",
        template_args=template_args,
    )

Jobs Query API (GraphQL) wildcard matching filter for team and job names

When querying information about jobs, users of the Jobs Query API can now use wildcard matching (for example, *search*) in GraphQL filters for job name and team name, in addition to the exact matching of search strings available before.
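The wildcard semantics can be illustrated with Python's fnmatch (this only illustrates the matching behavior; it is not the actual GraphQL query, and the job names are made up):

```python
from fnmatch import fnmatch

# Illustration of *search*-style wildcard matching over job names:
job_names = ["sales-ingest-job", "search-index-job", "daily-report"]
matching = [name for name in job_names if fnmatch(name, "*search*")]
```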

Provide User Agent when using VDK CLI

Users are looking to be able to determine where requests originated from when analyzing and browsing the telemetry data about VDK Control Service usage.

export VDK_CONTROL_SERVICE_USER_AGENT=foo

or in config.ini

[vdk]
vdk_control_service_user_agent=foo

If not set, it defaults to "vdk-control-cli/{version} ({os.name}; {sys.platform})" plus the Python version.
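For illustration, the default User-Agent could be composed roughly like this (the version value is a placeholder, and the exact formatting in vdk-control-cli may differ):

```python
import os
import platform
import sys

# Sketch of the documented default User-Agent composition; the CLI
# version here is a placeholder.
version = "1.3.0"
user_agent = (
    f"vdk-control-cli/{version} ({os.name}; {sys.platform}) "
    f"Python/{platform.python_version()}"
)
```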

New plugin: vdk-notebook

A new VDK plugin that supports running data jobs which consist of .ipynb files. See the VDK Notebook plugin page for more information.

vdk-ipython

This extension introduces a magic command for Jupyter. The command enables the user to load job_input for their current data job and use it freely while working in Jupyter. See the VDK ipython plugin page for more information.

Installation

Check the installation page


Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.10...v0.11

v0.10

1 year ago

Summary

Major features include:

vdk-jobs-troubleshooting - new plugin

Introduces thread-dump capabilities in the Data Jobs

See more details in the plugin home page and the VDK Enhancement Proposal

Support for Python 3.11

Introduces support for Python 3.11 in vdk-core and other plugins

Package versions

See installation instructions here. The versions of VDK components released under VDK 0.10 are:

Main components

control-service 1.5.707959356 vdk-core==0.3.723457889

Plugins

vdk-lineage-model==0.0.723435904 vdk-meta-jobs==0.1.723435904 vdk-sqlite==0.1.730902357 vdk-jobs-troubleshooting==0.2.741769066 vdk-lineage==0.3.723435904 vdk-control-cli==1.3.736732752


Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.9...v0.10

v0.9

1 year ago

Summary

Major features include:

vdk-meta-jobs new plugin

Using this plugin, you can specify dependencies between data jobs as a directed acyclic graph (DAG).

For example

def run(job_input):
    jobs = [
        {
            "job_name": "name-of-job",
            "team_name": "team-of-job",
            "fail_meta_job_on_error": True,  # or False
            "depends_on": ["name-of-job1", "name-of-job2"]
        },
        ...
    ]
    MetaJobInput().run_meta_job(jobs)

See more details in the plugin home page

Control Service security hardening

  • Options for jobs to run in read-only file system
  • Provide credentials configuration for using private images by the Control Service
  • Use a separate file system in the Control Service for storing temporary user-supplied files
  • Enhanced job upload validation against zip exploits and disallowed files

Data Job Upload validation allow list

During the installation of the Control Service, administrators can limit what types of files can be uploaded as part of a data job. A new configuration option called uploadValidationFileTypesAllowList has been added; it is a comma-separated list of file types.

For example, setting

uploadValidationFileTypesAllowList=image/png,text/plain

means that only PNG images and plain text files can be uploaded. Otherwise, upload requests will fail.
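A rough sketch of such an allow-list check (the Control Service's actual file-type detection may differ, e.g., content-based rather than extension-based):

```python
import mimetypes

# Hypothetical allow-list check: map a filename to a MIME type and
# accept it only if that type is in the configured allow list.
ALLOW_LIST = {"image/png", "text/plain"}

def is_upload_allowed(filename: str) -> bool:
    mime_type, _ = mimetypes.guess_type(filename)
    return mime_type in ALLOW_LIST
```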

See more details in helm chart documentation

vdk-logging-format - new plugin

This plugin allows for the configuration of the format of VDK logs.

Previously there were separate plugins for each format; they are now deprecated in favour of this one.

The plugin introduces a new configuration option, LOGGING_FORMAT, with possible values 'json', 'ltsv', and 'text':

export LOGGING_FORMAT=json
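For illustration, a minimal JSON log formatter shows roughly what 'json'-formatted output looks like (this is not the plugin's implementation, and the field names are illustrative):

```python
import json
import logging

# Minimal sketch of a JSON log formatter: serialize each record's
# level and message as a JSON object.
class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {"level": record.levelname, "message": record.getMessage()}
        )
```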

Control Service helm chart support for Postgres

The Bitnami PostgreSQL chart has been added as an option for the embedded database used for Control Service metadata storage.

Users can now install it with:

helm install vdk-control-service --set postgresql.enabled=true --set cockroachdb.enabled=false

Package versions

See installation instructions here. The versions of VDK components released under VDK 0.9 are:

Main components

control-service 1.5.707959356 vdk-core==0.3.692414765

Plugins

vdk-logging-json==0.1.693641831 vdk-meta-jobs==0.1.684477187 vdk-postgres== 0.0.692283840 vdk-trino== 0.4.703555598


Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.8...v0.9

v0.8

1 year ago

Summary

Major features include:

New plugin: VDK Audit

This plugin provides the ability to audit and potentially limit user operations. These operations can be deep within the Python runtime or standard libraries, such as dynamic code compilation, module imports, or OS command invocations. The plugin requires Python 3.8 or newer.

If we want to forbid some os.* operations we can do it like this:

export AUDIT_HOOK_ENABLED=true
export AUDIT_HOOK_FORBIDDEN_EVENTS_LIST='os.removexattr;os.rename;os.rmdir;os.scandir'
export AUDIT_HOOK_EXIT_ON_FORBIDDEN_EVENT=true

vdk run <job-name>
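Under the hood, this kind of restriction builds on the audit hooks introduced in Python 3.8 (sys.addaudithook). A minimal standalone sketch of the mechanism, not the vdk-audit implementation:

```python
import sys

# Illustrative audit hook: abort forbidden operations by raising from
# the hook, which causes the audited call itself to fail.
FORBIDDEN_EVENTS = {"os.rmdir", "os.rename"}

def forbid_events(event, args):
    if event in FORBIDDEN_EVENTS:
        raise RuntimeError(f"Forbidden operation: {event}")

sys.addaudithook(forbid_events)
```

Note that audit hooks cannot be removed once installed, which is why the plugin gates this behavior behind configuration flags.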

See more details in the vdk-audit plugin page

Any version of python in VDK Control Service

Jobs deployed by the Control Service can now use any version of Python automatically, not just 3.7.

Insert only impala load template

This template can be used to load raw data from the Data Lake into a target table in the Data Warehouse. In summary, it appends all records from the source table to the target table. As with all other SQL modeling templates, schema validation is performed, and the table is refreshed and statistics are computed when necessary.

Example:

def run(job_input):
    # . . .
    template_args = {
        'source_schema': 'source',
        'source_view': 'view_source',
        'target_schema': 'target',
        'target_table': 'destination_table'
    }
    job_input.execute_template('insert', template_args)

See more details in the template documentation page

Package versions

See installation instructions here. The versions of VDK components released under VDK 0.8 are:

Main components

control-service 1.5.671965442 vdk-core==0.3.662978536

Plugins

vdk-ingest-http==0.2.670842377 vdk-impala==0.4.672320306


Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.7...v0.8

v0.7

1 year ago

Summary

Major features include:

VDK Template running state detection capability

Since template executions are autonomous data job runs, we need to be able to determine whether a template is running at any time, for example, to distinguish between root data job finalization and template finalization.

For example if we want to send telemetry somewhere:

    @hookimpl
    def finalize_job(self, context: JobContext) -> None:
        template = context.core_context.state.get(ExecutionStateStoreKeys.TEMPLATE_NAME)
        if template:
            telemetry.send(phase="finalize_template", template_name=template)
        else:
            telemetry.send(phase="finalize_job", job_name=context.name)

New Logging configuration LOG_LEVEL_MODULE

Enables users to override the log level per module, temporarily (e.g., for debugging or prototyping, to increase the verbosity of certain modules).

For example, assuming the default log level is INFO, we can enable verbose logs for two modules, "vdk.api" and "custom.module":

export LOG_LEVEL_MODULE="vdk.api=DEBUG;custom.module=DEBUG" 
vdk run job-name 

Or in specific job config.ini:

[vdk]
log_level_module=vdk.api=DEBUG;custom.module=DEBUG
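A sketch of how such a value could be parsed and applied with the standard logging module (vdk-core's actual parsing may differ):

```python
import logging

# Parse "module=LEVEL;module=LEVEL" pairs and apply each level to the
# corresponding logger.
def apply_log_level_module(value: str) -> dict:
    levels = dict(pair.split("=", 1) for pair in value.split(";") if pair)
    for module, level in levels.items():
        logging.getLogger(module).setLevel(level)
    return levels
```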

New plugin backend for Properties: from local file system

A simple plugin that allows a developer or presenter to quickly store properties on the local file system.

It can be used to store secrets and configuration for a dev/demo session without requiring the entire Control Service to be installed and running. It can also be used to test a job run locally without updating the state of the deployed job.

Example:

export PROPERTIES_DEFAULT_TYPE="fs-properties-client"

or in specific job config.ini

[vdk]
properties_default_type=fs-properties-client

Properties are now stored in a local file. The file location can be further configured using FS_PROPERTIES_FILENAME and FS_PROPERTIES_DIRECTORY.
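A minimal sketch of a file-backed properties store (illustrative only; not vdk-properties-fs's actual file format or API):

```python
import json
import os

# Hypothetical file-backed properties store: persist a dict as JSON on
# the local file system and read it back, defaulting to empty.
def write_properties(path: str, props: dict) -> None:
    with open(path, "w") as f:
        json.dump(props, f)

def read_properties(path: str) -> dict:
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)
```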

Cookiecutter for new plugins

Create a new plugin skeleton easily:

cookiecutter https://github.com/tozka/cookiecutter-vdk-plugin.git

and follow the instructions

Add the ability to cancel remaining job steps

A job (or a template) can now be canceled from any step, and all remaining steps in the job (or template) will be skipped. For example, if a data job depends on processing data from a source that has indicated no new entries since the last run, we can skip the remaining steps.

Example:

def run(job_input: IJobInput): 
    data = get_last_delta()
    if not data:
        job_input.skip_remaining_steps()

Package versions

See installation instructions here. The versions of VDK components released under VDK 0.7 are:

Main components

control-service 1.5.622899758

vdk-control-cli==1.3.626767210 vdk-core==0.3.652866366

Plugins

vdk-properties-fs==0.0.651770458 vdk-kerberos-auth==0.3.631374202 vdk-impala==0.4.651849986


Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.6...v0.7

v0.6

1 year ago

Summary

Major features include:

Configuration auto-wiring improvement: detect non vdk_ prefixed environment variables

Previously, a configuration option had to be prefixed with "vdk_" when set as an environment variable in order to be recognized. This was error-prone, since the options are documented without the prefix.

Now they can be set without a prefix as well.

The following are equivalent:

export VDK_DB_DEFAULT_TYPE='impala'
export DB_DEFAULT_TYPE='impala'

If both are set, the "prefixed" variable has a higher priority.
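A sketch of that lookup order (illustrative only, not vdk-core's implementation):

```python
import os

# Resolve a configuration option from the environment: the VDK_-prefixed
# variable wins over the unprefixed one when both are set.
def get_config_value(name: str, default=None):
    return os.environ.get(f"VDK_{name}", os.environ.get(name, default))
```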

New plugin/library: vdk-lineage-model

The VDK Lineage Model plugin aims to abstract emitting lineage data from VDK data jobs, so that different lineage loggers can be configured at run time in any plugin that supports emitting lineage data.

Check out more at the plugin page.

New export-csv command

This complements vdk ingest-csv, which enables users to import (ingest) CSV data into a table. Users can now also export the result of a SQL query to CSV with a simple command:

vdk export-csv -q "select * from my_table" --file 'output.csv'

Check out more at the plugin page.

In memory properties client

Until now properties required Control Service to be able to work. Sometimes for prototyping and testing purposes, you do not need to connect to external services.

A new configuration value can be set in a specific job's config file (config.ini):

[vdk]
properties_default_type = memory

Or as an environment variable:

export PROPERTIES_DEFAULT_TYPE="memory"

The properties are then kept entirely in memory, which means they are "deleted" after the job's run.

New example: Ingest and anonymize

An example of how to anonymize any data being ingested using VDK with a plugin.

Check out more at the example page

New example: Airflow integration

An example of how to create dependencies between data jobs in Airflow.

Check out more at the example page

Package versions

See installation instructions here. The versions of VDK components released under VDK 0.6 are:

Main components

control-service 1.5.620438292 vdk-core==0.3.620677184

Plugins

airflow-provider-vdk==0.0.602273476 vdk-lineage-model== 0.0.581430542 vdk-kerberos-auth==0.3.584577337 vdk-ingest-http==0.2.616713987 vdk-impala==0.4.613570906 vdk-lineage== 0.3.604201902 vdk-trino== 0.4.605101952

What's Changed

New Contributors

Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.5...v0.6

v0.5

1 year ago

Summary

Major features include:

New managed db_connection_execute_operation hook

The hook enables users to add behavior around existing SQL queries without modifying the code itself. It is invoked before and after each query, enabling tracking of its full execution. For example:

@hookimpl(hookwrapper=True)
def db_connection_execute_operation(execution_cursor: ExecutionCursor) -> Optional[int]:
    start = time.time()
    outcome = yield  # yield the execution so that the query is executed
    end = time.time()
    log.info(f"Query duration: {end - start}.")

Airflow Provider VDK release (beta)

Users can integrate with Apache Airflow to orchestrate Data Jobs in a DAG (workflow). Check out more at airflow-provider-vdk.


Full Changelog: https://github.com/vmware/versatile-data-kit/compare/v0.4...v0.5