Smart Data Lake Versions

Smart Automation Tool for building modern Data Lakes and Data Pipelines

2.5.0

1 year ago

Major features

  • Upgrade to Spark 3.3
  • SDL Agents
  • Support for Apache Iceberg
  • Integration with Unity Catalog

Features

  • #541
  • #549
  • #571
  • #582
  • #619
  • #621
  • #625
  • #635
  • #652
  • SmartDataLakeBuilderLab to use DataObjects more interactively in Notebooks
  • many-to-many transformations in Python

Improvements

  • Switch to log4j2 yaml format
  • New variable failSimulationOnMissingInputSubFeeds to configure whether runs should fail when input subfeeds are missing
  • Expectation improvements (SQLQueryExpectation)
  • Improvements on JDBC transaction handling
  • Improvements on Schema Viewer
  • Proxy Support for SftpFileRefConnections
  • FileTransferAction: Support for multiple file transfers in parallel
  • Global Config: allowAsRecursiveInput - allow exceptions on specific DataObjects
  • Improved Xsd and JsonSchema support
  • Improved Metric writing to Azure LogAnalytics
  • Improved support on Amazon Glue
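
The expectation items above can be sketched in SDL's HOCON configuration. This is an illustrative sketch only: the data object, the query, and the exact field names of SQLQueryExpectation are assumptions, not verified syntax from this release.

```hocon
dataObjects {
  btl_orders {
    type = DeltaLakeTableDataObject
    table = { db = "btl", name = "orders" }
    # Hypothetical expectation: fail the run if any written order has a
    # negative amount. Field names below are assumed for illustration.
    expectations = [{
      type = SQLQueryExpectation
      name = noNegativeAmounts
      code = "select count(*) from orders where amount < 0"
      expectation = "= 0"
    }]
  }
}
```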

Bugfixes

  • #599
  • #627
  • #633
  • #653
  • Various smaller bugfixes and improved error handling

Dependencies

  • Spark: update from 3.2 to 3.3
  • Delta Lake: update from 2.0 to 2.2

2.4.2

1 year ago

Bugfixes and improvements:

  • Fix writing to Oracle databases when temporary tables are involved (#633)
  • When saveMode=Overwrite for JdbcTableDataObject, allow writing to the database table even if the column order in the dataframe is different (#633)
  • Add parameters to JdbcTableConnection in order to configure the commit behaviour in JDBC connections (#633)
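
A hedged config sketch of the JDBC items above; the connection details, identifiers, and table names are hypothetical:

```hocon
connections {
  oracleCon {
    type = JdbcTableConnection
    url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"  # hypothetical URL
    driver = "oracle.jdbc.OracleDriver"
  }
}
dataObjects {
  tgt_table {
    type = JdbcTableDataObject
    connectionId = oracleCon
    table = { name = "TARGET_TABLE" }
    # With saveMode = Overwrite, 2.4.2 allows writing a DataFrame whose
    # column order differs from the database table (#633).
    saveMode = Overwrite
  }
}
```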

Note: this release is created as Hotfix Release on top of version 2.4.1, as develop-spark3 branch is already on 2.5.0-SNAPSHOT.

2.4.1

1 year ago

Bugfixes and improvements:

  • Increase spark-extensions version to 3.2.5 (#627): Remove restrictive avro schema equality test
  • Do not write schema file in simulations (#627)
  • Do not throw exception when there is no path for sample file in CustomFileAction (#627)

Note: this release is created as Hotfix Release on top of version 2.4.0, as develop-spark3 branch is already on 2.5.0-SNAPSHOT.

2.4.0

1 year ago

Bugfixes and improvements

  • #518 Schema Viewer shows wrong information
  • #580 Can't use same ExcelFileDataObject for write and read
  • #600 Schema viewer does not indicate whether a field is required
  • #601 Loading Schema from file should be done lazily
  • Leading underscores are preserved when normalizing column names
  • ExecutionMode and executionCondition are only applied in exec phase

Features

  • #591 Column encryption
  • #610 Support DataObjectStateIncrementalMode for KafkaTopicDataObject
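
A hedged sketch of how the incremental mode for Kafka might be configured; the action and data object ids are invented, and only the executionMode type name comes from the release note:

```hocon
actions {
  copyKafkaToTable {
    type = CopyAction
    inputId = src_kafka      # assumed to be a KafkaTopicDataObject
    outputId = tgt_table
    # With #610, consumed offsets are tracked in the job state, so each
    # run reads only records that arrived since the previous run.
    executionMode = { type = DataObjectStateIncrementalMode }
  }
}
```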

Dependencies

Bump commons-net from 3.1 to 3.9.0

2.3.2

1 year ago

Bugfixes

  • #593
  • NotSerializableException with RelaxedCsvFileDataObject

Improvements

  • #577

Dependency Updates

  • commons-text

2.3.1

1 year ago

This is mainly a bugfix release, see #583, #584, #578 and #579.

One new Feature: #575

2.3.0

1 year ago

Version upgrades

  • Spark 3.2.1 -> 3.2.2
  • Delta-Lake 1.1.0 -> 2.0.0

New Features

  • GenericDataFrame implementation to create transformations that run with Spark and Snowpark/Snowflake (#376)
  • Constraints and Expectations (#43, #377, #388), see also http://smartdatalake.ch/docs/reference/dataQuality#constraints
  • Historize with incremental cdc mode (#407), see also http://smartdatalake.ch/blog/sdl-hist
  • Spark file dataobject incremental mode (#517)
  • Spark Dataset transformations using ScalaClassSparkDsTransformer (#489)
  • DataObject schemas from caseClass, jsonSchema, xsdFile and avroSchemaFile (#512)
  • Methods to provide schema in init-phase (#522)
  • Support for json-schema with confluent schema registry (#538)
  • JDBC overall transaction (#254)
  • FinalStateWriter to store state once a job is finished
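
A minimal sketch of constraints and expectations on a data object (see the dataQuality reference linked above); the field names here follow the general shape of that documentation but are assumptions in this context:

```hocon
dataObjects {
  int_customers {
    type = DeltaLakeTableDataObject
    table = { db = "int", name = "customers" }
    # Constraint: evaluated per row on write; a violation fails the action (assumed syntax).
    constraints = [{
      name = customerIdNotNull
      expression = "customer_id is not null"
    }]
    # Expectation: evaluated as an aggregate after write (assumed syntax).
    expectations = [{
      type = SQLExpectation
      name = minRowCount
      aggExpression = "count(*)"
      expectation = "> 0"
    }]
  }
}
```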

Minor Bugfixes and improvements

  • Improve parsing xsd schema
  • Improve Housekeeping
  • Implement ColNamesLowercaseTransformer and remove converting columns to lowercase
  • Make HiveConnection pathPrefix optional; support absolute archivePath in FileIncrementalMoveMode
  • Cleanup partition directories after failure in SparkFileDataObject
  • Fix schema versioning
  • Fix Airbyte: make supportsIncremental optional
  • Fix naming of input views when chaining SQL transformations
  • Fix transformer dataframe output mapping and input partitionvalues
  • Fix calling move/compactPartition only if list is not empty

Full Changelog: https://github.com/smart-data-lake/smart-data-lake/compare/2.2.1...2.3.0

2.2.1

2 years ago

Version upgrades

  • update Spark version 3.2.0 -> 3.2.1

New Features

  • StatusInfo REST-Server (#450)
  • Websocket for live status (#450)
  • DagExporter command line tool to export a basic DAG selected by a feed-selector

Minor Bugfixes and improvements

  • add maven profile to create fat-jar for Spark 3.1 (#465)
  • fix spark 3.1 json4s compatibility
  • fix reading state file from previous versions
  • update spark-extensions: fix execution on Databricks
  • fix and refine validatePartitionValuesExisting
  • move sparkSession from object Environment to GlobalConfig to support running multiple SDLB jobs on the same JVM (e.g. Databricks cluster)
  • fix Airbyte parser issue (#483)
  • update spark-excel and poi dependency because of vulnerability (#485)

2.2.0

2 years ago

Version upgrades

  • Update to Spark 3.2 (#406)
  • Update Delta Lake to version 1.1 (#406); Delta Lake 1.1 requires Spark 3.2
  • Don't use the Delta Lake Table API because of strange errors
  • Update scala-maven-plugin to support Scala 2.12.14+

New Features

  • Implement CustomSnowparkAction (rudimentary Snowpark support, #376)
  • Implement script support and CustomScriptAction (#422)
  • Implement AirbyteDataObject (#365)
  • Implement basic ScalaNotebookDfTransformer (#401)
  • Implement SDL json schema creator (#440)
  • Add Atlas metadata exporter implementation

Minor Bugfixes and improvements

  • Extend StateListener.notifyState with a parameter indicating the changed Action
  • Adapt StateChangeLogger to log only the action for which the notification was emitted
  • Refactor Actions SubFeed handling
  • Refactor integration of SparkSession into ActionPipelineContext and usage of implicit parameters
  • Add SASL authentication for Kafka
  • Avoid losing the full error response text from webservice calls
  • Improve build stability by using linesIterator; on some environments java:String.lines takes precedence over scala:StringLike:lines, which causes compile problems
  • Use json4s instead of hocon/configs to write json-state-files
  • Allow using a custom class loader in order to find classes defined or loaded from notebooks (polynote) when parsing configuration
  • Extend ScalaJWebserviceClient so it can be re-used in getting-started
  • Force SaveMode.Overwrite for DeduplicateAction and HistorizeAction if mergeModeEnable=false
  • Make runtime info public (#454)

1.3.1

2 years ago

Improved Delta Lake support

  • improve schema comparison by ignoring nullability
  • added support for evolving schema when working with DeltaLakeTableDataObject with SDLSaveMode.Append
  • handle missing delta table, _delta_log and missing hadoop path

Data Objects extensions

  • implement DataObjects with state (#365)
  • implement reading partitioned xml-data
  • implement Jdbc table creation and schema evolution

Streaming improvements

  • don't increment runId when all actions are skipped in streaming mode
  • fix ActionDAGRunState.isSkipped for mixed scenarios (async and sync actions)
  • make execActionDAG tail recursive to avoid stack overflow for long running streaming jobs

New SDLSaveMode.merge to do upsert statements

  • implement save mode merge for JdbcTableDataObject and DeltaLakeTableDataObject
  • implement merge mode for CopyAction
  • implement merge mode for DeduplicateAction (#235)
  • implement merge mode for HistorizeAction (#235)
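
Conceptually, SDLSaveMode.merge upserts new and changed records instead of rewriting the whole table. A hedged config sketch (the ids are invented; mergeModeEnable is the flag name used elsewhere in these notes):

```hocon
actions {
  deduplicateCustomers {
    type = DeduplicateAction
    inputId = stg_customers
    # The output must support merge, e.g. JdbcTableDataObject or
    # DeltaLakeTableDataObject (#235).
    outputId = int_customers
    mergeModeEnable = true
  }
}
```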

New sdl-azure module

  • add Azure libraries, AzureADClientGrantAuthMode
  • introduced state change logger, which submits save events to azure log monitoring
  • support for azure key vault secret provider

Small bugfixes & improvements

  • support more type conversions in schema evolution
  • if possible use schemaMin to create empty DataFrame if table for recursive input doesn't exist yet.
  • Prevent file names starting with . in WebserviceFileDataObject (crc files still have original name though)
  • Remove special chars from fileRefs generated by WebserviceFileDataObject (#395)
  • throw exception if config entry for connections, dataObjects or actions is not of type object (#396)
  • fix evaluating to_date and other ReplaceableExpressions with ExpressionEvaluator
  • cleanup kafka dependency from deltalake pom.xml
  • remove wrong error message about missing executionId in SparkStageMetricsListener
  • fix reading data frame from skipped SubFeed if filters are ignored
  • fix parsing event info if appName contains special characters
  • add a transformer to repartition dataframe
  • made SmartDataLakeLogger public
  • Simplify final exception for better usability of log: truncate stacktrace starting from "monix.*" entries, limit logical plan in AnalysisException to 5 lines
  • Simplify logging of TaskFailedException

Cleanup

  • Cleanup deprecated PartitionDiffMode.stopIfNoData