Smart Data Lake Versions

Smart Automation Tool for building modern Data Lakes and Data Pipelines

2.5.0

1 year ago

Major features

  • Upgrade to Spark 3.3
  • SDL Agents
  • Support for Apache Iceberg
  • Integration with Unity Catalog

Features

  • #541
  • #549
  • #571
  • #582
  • #619
  • #621
  • #625
  • #635
  • #652
  • SmartDataLakeBuilderLab to use DataObjects more interactively in Notebooks
  • many-to-many transformations in Python

Improvements

  • Switch to log4j2 yaml format
  • New variable failSimulationOnMissingInputSubFeeds to configure whether runs should fail when input subfeeds are missing
  • Expectation improvements (SQLQueryExpectation)
  • Improvements on JDBC transaction handling
  • Improvements on Schema Viewer
  • Proxy Support for SftpFileRefConnections
  • FileTransferAction: Support for multiple file transfers in parallel
  • Global Config: allowAsRecursiveInput - allow exceptions on specific DataObjects
  • Improved Xsd and JsonSchema support
  • Improved Metric writing to Azure LogAnalytics
  • Improved support on Amazon Glue
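
The expectation items above can be sketched in SDL's HOCON configuration. This is an illustrative sketch only: the data object, the query, and the exact field names of SQLQueryExpectation are assumptions, not verified syntax from this release.

```hocon
dataObjects {
  btl_orders {
    type = DeltaLakeTableDataObject
    table = { db = "btl", name = "orders" }
    # Hypothetical expectation: fail the run if any written order has a
    # negative amount. Field names below are assumed for illustration.
    expectations = [{
      type = SQLQueryExpectation
      name = noNegativeAmounts
      code = "select count(*) from orders where amount < 0"
      expectation = "= 0"
    }]
  }
}
```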

Bugfixes

  • #599
  • #627
  • #633
  • #653
  • Various smaller bugfixes and improved error handling

Dependencies

  • Spark: update from 3.2 to 3.3
  • Delta Lake: update from 2.0 to 2.2

2.4.2

1 year ago

Bugfixes and improvements:

  • Fix writing to Oracle databases when temporary tables are involved (#633)
  • When saveMode=Overwrite for JdbcTableDataObject, allow writing to the database table even if the column order in the dataframe is different (#633)
  • Add parameters to JdbcTableConnection in order to configure the commit behaviour in JDBC connections (#633)
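
A hedged config sketch of the JDBC items above; the connection details, identifiers, and table names are hypothetical:

```hocon
connections {
  oracleCon {
    type = JdbcTableConnection
    url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"  # hypothetical URL
    driver = "oracle.jdbc.OracleDriver"
  }
}
dataObjects {
  tgt_table {
    type = JdbcTableDataObject
    connectionId = oracleCon
    table = { name = "TARGET_TABLE" }
    # With saveMode = Overwrite, 2.4.2 allows writing a DataFrame whose
    # column order differs from the database table (#633).
    saveMode = Overwrite
  }
}
```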

Note: this release is created as Hotfix Release on top of version 2.4.1, as develop-spark3 branch is already on 2.5.0-SNAPSHOT.

2.4.1

1 year ago

Bugfixes and improvements:

  • Increase spark-extensions version to 3.2.5 (#627): Remove restrictive avro schema equality test
  • Do not write schema file in simulations (#627)
  • Do not throw exception when there is no path for sample file in CustomFileAction (#627)

Note: this release is created as Hotfix Release on top of version 2.4.0, as develop-spark3 branch is already on 2.5.0-SNAPSHOT.

2.4.0

1 year ago

Bugfixes and improvements

  • #518 Schema Viewer shows wrong information
  • #580 Can't use same ExcelFileDataObject for write and read
  • #600 Schema viewer does not indicate whether a field is required
  • #601 Loading Schema from file should be done lazily
  • Leading underscores are preserved when normalizing column names
  • ExecutionMode and executionCondition are only applied in exec phase

Features

  • #591 Column encryption
  • #610 Support DataObjectStateIncrementalMode for KafkaTopicDataObject
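
A hedged sketch of how the incremental mode for Kafka might be configured; the action and data object ids are invented, and only the executionMode type name comes from the release note:

```hocon
actions {
  copyKafkaToTable {
    type = CopyAction
    inputId = src_kafka      # assumed to be a KafkaTopicDataObject
    outputId = tgt_table
    # With #610, consumed offsets are tracked in the job state, so each
    # run reads only records that arrived since the previous run.
    executionMode = { type = DataObjectStateIncrementalMode }
  }
}
```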

Dependencies

Bump commons-net from 3.1 to 3.9.0

2.3.2

1 year ago

Bugfixes

  • #593
  • NotSerializableException with RelaxedCsvFileDataObject

Improvements

  • #577

Dependency Updates

  • commons-text

2.3.1

1 year ago

This is mainly a bugfix release, see #583, #584, #578 and #579.

One new Feature: #575

2.3.0

1 year ago

Version upgrades

  • Spark 3.2.1 -> 3.2.2
  • Delta-Lake 1.1.0 -> 2.0.0

New Features

  • GenericDataFrame implementation to create transformations that run with Spark and Snowpark/Snowflake (#376)
  • Constraints and Expectations (#43, #377, #388), see also http://smartdatalake.ch/docs/reference/dataQuality#constraints
  • Historize with incremental cdc mode (#407), see also http://smartdatalake.ch/blog/sdl-hist
  • Spark file dataobject incremental mode (#517)
  • Spark Dataset transformations using ScalaClassSparkDsTransformer (#489)
  • DataObject schemas from caseClass, jsonSchema, xsdFile and avroSchemaFile (#512)
  • Methods to provide schema in init-phase (#522)
  • Support for json-schema with confluent schema registry (#538)
  • JDBC overall transaction (#254)
  • FinalStateWriter to store state once a job is finished
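
A minimal sketch of constraints and expectations on a data object (see the dataQuality reference linked above); the field names here follow the general shape of that documentation but are assumptions in this context:

```hocon
dataObjects {
  int_customers {
    type = DeltaLakeTableDataObject
    table = { db = "int", name = "customers" }
    # Constraint: evaluated per row on write; a violation fails the action (assumed syntax).
    constraints = [{
      name = customerIdNotNull
      expression = "customer_id is not null"
    }]
    # Expectation: evaluated as an aggregate after write (assumed syntax).
    expectations = [{
      type = SQLExpectation
      name = minRowCount
      aggExpression = "count(*)"
      expectation = "> 0"
    }]
  }
}
```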

Minor Bugfixes and improvements

  • Improve parsing xsd schema
  • Improve Housekeeping
  • Implement ColNamesLowercaseTransformer and remove converting columns to lowercase
  • Make HiveConnection pathPrefix optional; support absolute archivePath in FileIncrementalMoveMode
  • Cleanup partition directories after failure in SparkFileDataObject
  • Fix schema versioning
  • Fix Airbyte: make supportsIncremental optional
  • Fix naming of input views when chaining SQL transformations
  • Fix transformer dataframe output mapping and input partitionvalues
  • Fix calling move/compactPartition only if list is not empty

Full Changelog: https://github.com/smart-data-lake/smart-data-lake/compare/2.2.1...2.3.0

2.2.1

2 years ago

Version upgrades

  • update Spark version 3.2.0 -> 3.2.1

New Features

  • StatusInfo REST-Server (#450)
  • Websocket for live status (#450)
  • DagExporter command line tool to export a basic DAG selected by a feed-selector

Minor Bugfixes and improvements

  • add maven profile to create fat-jar for Spark 3.1 (#465)
  • fix spark 3.1 json4s compatibility
  • fix reading state file from previous versions
  • update spark-extensions: fix execution on Databricks
  • fix and refine validatePartitionValuesExisting
  • move sparkSession from object Environment to GlobalConfig to support running multiple SDLB jobs on the same JVM (e.g. Databricks cluster)
  • fix Airbyte parser issue (#483)
  • update spark-excel and poi dependency because of vulnerability (#485)

2.2.0

2 years ago

Version upgrades

  • Update to Spark 3.2 (#406)
  • Update Delta Lake to version 1.1 (#406); Delta Lake 1.1 requires Spark 3.2
  • Don't use the Delta Lake Table API because of strange errors
  • Update scala-maven-plugin to support Scala 2.12.14+

New Features

  • Implement CustomSnowparkAction (rudimentary Snowpark support, #376)
  • Implement script support and CustomScriptAction (#422)
  • Implement AirbyteDataObject (#365)
  • Implement basic ScalaNotebookDfTransformer (#401)
  • Implement SDL json schema creator (#440)
  • Add Atlas metadata exporter implementation

Minor Bugfixes and improvements

  • Extend StateListener.notifyState with a parameter indicating the changed Action
  • Adapt StateChangeLogger to log only the action for which the notification was emitted
  • Refactor Actions SubFeed handling
  • Refactor integration of SparkSession into ActionPipelineContext and usage of implicit parameters
  • Add SASL authentication for Kafka
  • Avoid losing the full error response text from webservice calls
  • Improve build stability by using linesIterator; on some environments java:String.lines takes precedence over scala:StringLike:lines, which causes compile problems
  • Use json4s instead of hocon/configs to write json-state-files
  • Allow using a custom class loader in order to find classes defined or loaded from notebooks (polynote) when parsing configuration
  • Extend ScalaJWebserviceClient so it can be re-used in getting-started
  • Force SaveMode.Overwrite for DeduplicateAction and HistorizeAction if mergeModeEnable=false
  • Make runtime info public (#454)

1.3.1

2 years ago

Improved Delta Lake support

  • improve schema comparison by ignoring nullability
  • added support for evolving schema when working with DeltaLakeTableDataObject with SDLSaveMode.Append
  • handle missing delta table, _delta_log and missing hadoop path

Data Objects extensions

  • implement DataObjects with state (#365)
  • implement reading partitioned xml-data
  • implement Jdbc table creation and schema evolution

Streaming improvements

  • don't increment runId when all actions are skipped in streaming mode
  • fix ActionDAGRunState.isSkipped for mixed scenarios (async and sync actions)
  • make execActionDAG tail recursive to avoid stack overflow for long running streaming jobs

New SDLSaveMode.merge to do upsert statements

  • implement save mode merge for JdbcTableDataObject and DeltaLakeTableDataObject
  • implement merge mode for CopyAction
  • implement merge mode for DeduplicateAction (#235)
  • implement merge mode for HistorizeAction (#235)
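
Conceptually, SDLSaveMode.merge upserts new and changed records instead of rewriting the whole table. A hedged config sketch (the ids are invented; mergeModeEnable is the flag name used elsewhere in these notes):

```hocon
actions {
  deduplicateCustomers {
    type = DeduplicateAction
    inputId = stg_customers
    # The output must support merge, e.g. JdbcTableDataObject or
    # DeltaLakeTableDataObject (#235).
    outputId = int_customers
    mergeModeEnable = true
  }
}
```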

New sdl-azure module

  • add Azure libraries, AzureADClientGrantAuthMode
  • introduced state change logger, which submits save events to azure log monitoring
  • support for azure key vault secret provider

Small bugfixes & improvements

  • support more type conversions in schema evolution
  • if possible use schemaMin to create empty DataFrame if table for recursive input doesn't exist yet.
  • Prevent file names starting with . in WebserviceFileDataObject (crc files still have original name though)
  • Remove special chars from fileRefs generated by WebserviceFileDataObject (#395)
  • throw exception if config entry for connections, dataObjects or actions is not of type object (#396)
  • fix evaluating to_date and other ReplaceableExpressions with ExpressionEvaluator
  • cleanup kafka dependency from deltalake pom.xml
  • remove wrong error message about missing executionId in SparkStageMetricsListener
  • fix reading data frame from skipped SubFeed if filters are ignored
  • fix parsing event info if appName contains special characters
  • add a transformer to repartition dataframe
  • made SmartDataLakeLogger public
  • Simplify final exception for better usability of log: truncate stacktrace starting from "monix.*" entries, limit logical plan in AnalysisException to 5 lines
  • Simplify logging of TaskFailedException

Cleanup

  • Cleanup deprecated PartitionDiffMode.stopIfNoData