Awesome Data Temporality

A curated list to help you manage temporal data across many modalities 🚀.

Data Versioning for Machine Learning

Data versioning is the practice of storing multiple versions of the same data and providing a mechanism for accessing and managing those versions. It is useful when data is accidentally deleted or corrupted, or when you need to see how the data has changed over time. The vast majority of "data versioning" tools you see today focus on managing datasets for machine learning, and a common implementation paradigm is to store versions of your data and models in Git commits. This part of the list is therefore centered on machine learning. Other ways to manage temporal data are covered in later sections.
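
To make the idea of "storing multiple versions and providing a mechanism to access them" concrete, here is a minimal sketch of a content-addressed version store. It is a toy written for this list, not how any tool listed here works internally; `VersionStore`, `commit`, and `checkout` are hypothetical names.

```python
import hashlib
import json


class VersionStore:
    """Toy content-addressed store: every commit keeps a full snapshot.

    Hypothetical helper for illustration only; real tools add remote
    storage, deduplication of large files, branching, and more.
    """

    def __init__(self):
        self.objects = {}   # content hash -> serialized snapshot
        self.history = []   # ordered list of (hash, message)

    def commit(self, data, message):
        # Serialize deterministically so equal data yields an equal hash.
        blob = json.dumps(data, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(blob).hexdigest()
        self.objects[digest] = blob  # identical content is stored once
        self.history.append((digest, message))
        return digest

    def checkout(self, digest):
        # Return the snapshot exactly as it was committed.
        return json.loads(self.objects[digest])


store = VersionStore()
v1 = store.commit([{"id": 1, "label": "cat"}], "initial labels")
v2 = store.commit([{"id": 1, "label": "dog"}], "fix label")
old_labels = store.checkout(v1)  # the dataset before the fix
```

Addressing snapshots by their hash means a version can never change silently after it is committed, which is the property reproducible training runs rely on.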

DVC Management and versioning of datasets and machine learning models.
lakeFS Repeatable, atomic and versioned data lake on top of object storage.
Dolt Dolt is Git for Data!
Pachyderm Pachyderm data versioning technology
wrgl Data version control for data projects
neptune Log, organize, compare, register, and share all your ML model metadata in a single place
Git LFS An open source Git extension for versioning large files
DagsHub Where people build data science projects
Neptune Best 7 Data Version Control Tools
DagsHub Comparing Data Version Control Tools
Aporia Best Data Versioning Tools for MLOps
A Conceptual Framework A Conceptual Framework and Proposed Principles
Australian Research Data Commons What is data versioning?
Research Data Alliance Principles and best practices in data versioning for all data sets big and small
MLFlow + LakeFS Data Versioning for Efficient Workflows with MLFlow and LakeFS
LakeFS Data Versioning: All You Need to Know
Dr. Raj Ramesh How to manage model and data versions
DVC + MLflow (DVC) Data Versioning and Reproducible ML with DVC and MLflow
5 min Explainer (DVC) Version Control for Data Science Explained in 5 Minutes
The Guide to Data Versioning (LakeFS) The Guide to Data Versioning with LakeFS
CodeX Data Versioning for Modern Data Teams and Platforms Advantages & Best Practices
DevGenius Data versioning applied on machine learning projects
Eduonix Data Versioning Data Versioning - How to Version your Data
DZone Data Versioning Data Versioning 101
KDNuggets Data Versioning Data Versioning: Does it mean what you think it means?
ODSC Data Versioning How Data Versioning Can Be Used in Machine Learning
Neuroimaging (Quilt) Use cases for data versioning: debugging, collaboration, and compliance in neuroimaging
Data + AI Summit Europe 2020 (DVC) Data Versioning and Reproducible ML with DVC and MLflow

Time Travel and Temporal Tables

Data time travel refers to the ability to go back in time and access previous versions of data. To enable data time travel, it is necessary to implement a system for versioning data, which involves storing multiple versions of the same data and providing a mechanism for accessing and managing these versions. Temporal tables, also known as system-versioned temporal tables, are tables in a database that automatically track the history of data changes and allow you to query the data as it existed at any point in time. The terms time travel and temporal tables are often used interchangeably, although temporal tables are more of an implementation-specific feature of some databases. These tables are useful for auditing, tracking changes to data over time, and performing point-in-time analysis. You can usually query a temporal table using the FOR SYSTEM_TIME clause in a SELECT statement.
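
As a rough sketch of what a system-versioned table does behind the scenes, the example below emulates one by hand in SQLite, which has no FOR SYSTEM_TIME clause. The table `price_history`, the columns `sys_start` and `sys_end`, and the helpers `set_price` and `price_as_of` are all assumptions made for illustration; databases with native support maintain the period columns automatically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE price_history (
        item      TEXT NOT NULL,
        price     REAL NOT NULL,
        sys_start TEXT NOT NULL,  -- when this row version became current
        sys_end   TEXT NOT NULL   -- when it was superseded (exclusive)
    )
    """
)

# Sentinel meaning "this row version is still current" (an assumption).
OPEN_END = "9999-12-31T00:00:00"


def set_price(item, price, now):
    """Close the current row version, if any, and open a new one."""
    conn.execute(
        "UPDATE price_history SET sys_end = ? WHERE item = ? AND sys_end = ?",
        (now, item, OPEN_END),
    )
    conn.execute(
        "INSERT INTO price_history VALUES (?, ?, ?, ?)",
        (item, price, now, OPEN_END),
    )


def price_as_of(item, ts):
    """Hand-written equivalent of an 'as of' point-in-time query."""
    row = conn.execute(
        "SELECT price FROM price_history "
        "WHERE item = ? AND sys_start <= ? AND ? < sys_end",
        (item, ts, ts),
    ).fetchone()
    return row[0] if row else None


set_price("widget", 10.0, "2023-01-01T00:00:00")
set_price("widget", 12.5, "2023-06-01T00:00:00")
spring_price = price_as_of("widget", "2023-03-15T00:00:00")
```

ISO 8601 timestamps of equal length sort correctly as plain strings, which is why the comparison works without a dedicated date type.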

Postgres Postgres Temporal Tables Extension
Azure MSSQL temporal tables
MariaDB MariaDB Temporal Tables
Cockroach Cockroach documentation about AS OF SYSTEM TIME
Dremio Dremio Iceberg Time Travel
Iceberg Apache iceberg time travel
Teradata Teradata Vantage™ Temporal Table Support
SAP HANA Temporal Tables (History, System-Versioned, Application-Time Period)
Delta Lake Introducing Delta Time Travel for Large Scale Data Lakes
Hopsworks Time travel operations in Hopsworks Feature Store
BigQuery Access historical data using time travel
JOOQ System and application versioned tables
Apache Flink Temporal Table Function
Apache Arrow Voltron as-of join (not exactly bi-temporal, but is temporal joining)
Redpanda Time travel debugging through historical messages to identify and debug problems
Time Travel or Data Versioning (DataBricks) Time Travel/Data Versioning using Delta Lake
Data Reproducibility (DataBricks) Data Reproducibility, Audits, Immediate Rollbacks, and Other Applications of Time Travel
Apache Iceberg (Trino) Apache Iceberg: A table format for data lakes with unforeseen use cases
Hopsworks Point-in-time Joins Python-centric Feature Stores
Iceberg Architecture Data Science DC Nov 2021 Meetup: Apache Iceberg - An Architectural Look Under the Covers
Postgres - Paper Temporal Tables in Postgres
MSSQL (Business Problems) 5 Business Problems You Can Solve Using Temporal Tables
MSSQL (Tim Mitchell) Introduction to SQL Server Temporal Tables Tim Mitchell

Slowly Changing Dimensions Data Modeling

Slowly changing dimensions are those in which the attributes of the dimension change over time, and the changes need to be tracked in the data warehouse. For example, a customer's address or name might change over time, and the data warehouse needs to track these changes so that historical data can be analyzed correctly.
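
The customer-address example can be sketched as a Type 2 slowly changing dimension, where each change closes the current row and appends a new one. This is a minimal in-memory illustration, not the approach of any tool listed here; `CustomerRow`, `apply_change`, and `address_on` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerRow:
    surrogate_key: int                # unique per row version
    customer_id: str                  # business key, repeated across versions
    address: str
    valid_from: date
    valid_to: Optional[date] = None   # None means "still current"


def apply_change(dim: List[CustomerRow], customer_id: str,
                 new_address: str, effective: date) -> None:
    """Type 2 update: expire the current row, then append a new version."""
    current = next(
        (r for r in dim if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None:
        if current.address == new_address:
            return  # attribute unchanged, no new version needed
        current.valid_to = effective
    dim.append(CustomerRow(len(dim) + 1, customer_id, new_address, effective))


def address_on(dim: List[CustomerRow], customer_id: str,
               day: date) -> Optional[str]:
    """Return the address that was in effect on the given day."""
    for r in dim:
        if (r.customer_id == customer_id and r.valid_from <= day
                and (r.valid_to is None or day < r.valid_to)):
            return r.address
    return None


dim: List[CustomerRow] = []
apply_change(dim, "c1", "1 Old Road", date(2022, 1, 1))
apply_change(dim, "c1", "9 New Street", date(2023, 5, 1))
address_in_2022 = address_on(dim, "c1", date(2022, 6, 1))
```

Because old rows are never overwritten, an order placed in 2022 can still be joined to the address the customer had at that time.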

VDK Versatile Data Kit (VDK) is an open source framework including help to manage SCD style data.
dbtvault A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
dataform Common data models for creating type-2 slowly changing dimensions tables from mutable data sources in Dataform.
dbt snapshots DBT snapshots
DeltaLake Databricks change data capture with Delta Live Tables
6 Kinds 6 Different Types of Slowly Changing Dimensions and How to Apply Them?
Data Vault Loading Dimensions from a Data Vault Model
SCD Data Warehouse Slowly Changing Dimension Handling in Data Warehouses Using Temporal Database Features
Redshift Implement a slowly changing dimension in Amazon Redshift

Bi-temporality Tools + Modeling

Bitemporality is a concept in database management that refers to the ability of a database to store and manage data along two time axes: the period during which a fact is true in the real world (often called valid time) and the time at which the database recorded it (often called transaction time). This covers historical data as well as corrections that are entered later. In a bitemporal database, data is stored in multiple versions, with each version tied to a specific point on both axes. This allows users to query the data as it existed, and as it was known, at different points in time, which is useful for understanding how data has changed over time or for tracking the history of a particular piece of data.
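
A small sketch may help here. It assumes the usual two axes, the time a fact applies in the real world and the time it was recorded, and uses an append-only log where a later recording overrides an earlier one for the same period. `Fact`, `BitemporalLog`, and `lookup` are hypothetical names, and this simplified model is not how any database listed here is implemented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    valid_from: str    # ISO date: when the fact starts to apply
    valid_to: str      # ISO date: when it stops applying (exclusive)
    recorded_at: str   # ISO date: when the system learned about it


class BitemporalLog:
    """Append-only log; nothing is ever updated or deleted."""

    def __init__(self) -> None:
        self.facts: List[Fact] = []

    def record(self, fact: Fact) -> None:
        self.facts.append(fact)

    def lookup(self, key: str, valid_at: str, known_at: str) -> Optional[str]:
        """What did we believe at `known_at` about the state at `valid_at`?"""
        candidates = [
            f for f in self.facts
            if f.key == key
            and f.recorded_at <= known_at
            and f.valid_from <= valid_at < f.valid_to
        ]
        if not candidates:
            return None
        # The most recently recorded matching fact wins.
        return max(candidates, key=lambda f: f.recorded_at).value


log = BitemporalLog()
# Recorded in January: the salary is 50000 for the whole year.
log.record(Fact("salary", "50000", "2023-01-01", "2024-01-01", "2023-01-10"))
# Recorded in March: a correction, it was 52000 all along.
log.record(Fact("salary", "52000", "2023-01-01", "2024-01-01", "2023-03-01"))

before_fix = log.lookup("salary", valid_at="2023-02-01", known_at="2023-02-15")
after_fix = log.lookup("salary", valid_at="2023-02-01", known_at="2023-03-15")
```

Both lookups ask about the same day in February, yet they return different answers because the second one is allowed to see the correction. Being able to reproduce the earlier, uncorrected answer is what separates a bitemporal store from a plain history table.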

Martin Fowler Bitemporal History (explained) from world famous Martin Fowler
Crux of Bitemporality The Crux of Bitemporality - Jon Pither
Capgemini Enhancing Time Series Data by Applying Bitemporality (opinionated white paper mentioning KDB+)
GoldenSource A financial services data modeling software company perspective on bitemporality
MarkLogic A deep dive into bitemporality in MarkLogic
XTDB XTDB bitemporal graph database by Juxt with support for bitemporality
ARXIV Bitemporal Property Graphs to Organize Evolving Systems white paper
Axway Decision Insights bitemporal capability
Cloudera - Data Modeling Bi-temporal data modeling with Envelope
Bitemporal Database Book Bitemporal Databases: Modeling and Implementation
Speakerdeck An overview of bitemporality
Val on Programming (Datomic) Datomic: this is not the history you're looking for
Cybertec Implementing "As Of" queries in Postgresql
Bitempura.DB Bitempura.DB is a simple, bitemporal key-value database.
Modeler (Anchormodeler) (Bi-temporal) data modelling tool inspired by Anchor modeler, for PostgreSQL
BarbelHisto Lightweight ultra-fast Java library to store data in bi-temporal format
Robinhood Tracking Temporal Data at Robinhood

Change Data Capture (CDC) Tools

Change data capture (CDC) is a process that captures and stores data about changes made to a database or other data source. It is often used in data warehousing and data integration scenarios to ensure that data in different systems is kept up to date and in sync. CDC involves tracking changes made to a database or data source and storing information about those changes in a separate location, such as a separate database or log file. This allows the data in the original source to be updated, while still maintaining a record of the changes that were made.
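
One simple way to track changes and store them in a separate location is with database triggers that write to a change table. The sketch below does this in SQLite; the table names `accounts` and `accounts_changes` and the trigger names are made up for illustration. Log-based CDC tools take a different route and read the database's transaction log instead of using triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL
    );

    -- Separate location that records every change to accounts.
    CREATE TABLE accounts_changes (
        seq         INTEGER PRIMARY KEY AUTOINCREMENT,
        op          TEXT NOT NULL,
        id          INTEGER NOT NULL,
        old_balance INTEGER,
        new_balance INTEGER
    );

    CREATE TRIGGER accounts_ins AFTER INSERT ON accounts BEGIN
        INSERT INTO accounts_changes (op, id, old_balance, new_balance)
        VALUES ('insert', NEW.id, NULL, NEW.balance);
    END;

    CREATE TRIGGER accounts_upd AFTER UPDATE ON accounts BEGIN
        INSERT INTO accounts_changes (op, id, old_balance, new_balance)
        VALUES ('update', NEW.id, OLD.balance, NEW.balance);
    END;

    CREATE TRIGGER accounts_del AFTER DELETE ON accounts BEGIN
        INSERT INTO accounts_changes (op, id, old_balance, new_balance)
        VALUES ('delete', OLD.id, OLD.balance, NULL);
    END;
    """
)

# Ordinary writes to the source table; the triggers capture each one.
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.execute("UPDATE accounts SET balance = 80 WHERE id = 1")
conn.execute("DELETE FROM accounts WHERE id = 1")

# A downstream consumer would read this ordered stream and replay it.
changes = conn.execute(
    "SELECT op, id, old_balance, new_balance FROM accounts_changes ORDER BY seq"
).fetchall()
```

The source table ends up empty, but the change table still holds the full sequence of events, which is what lets another system stay in sync.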

Debezium Change data capture for a variety of databases
Supabase realtime Broadcast, Presence, and Postgres Changes via WebSockets
airbyte Data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes
Flink CDC Connectors for Apache Flink
gravity A Data Replication Center
brooklin An extensible distributed system for reliable nearline data streaming at scale

Soft Delete in ORM Frameworks

Soft delete is a method of deleting data from a database in a way that allows the data to be recovered. A soft-deleted record is not physically removed from the database. Instead, it is marked as deleted and is typically no longer visible to users. The method is often used to prevent accidental or unintended data loss. It is also useful where data needs to be retained for compliance or regulatory purposes, since the data stays in the database while being unavailable to users.
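
The mechanism can be sketched with a nullable `deleted_at` column, which is a common convention but only one possible design. The example uses plain SQLite rather than any of the ORMs listed here; the `users` table and the helpers `soft_delete`, `restore`, and `visible_users` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        deleted_at TEXT            -- NULL means the row is live
    )
    """
)
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "ada"), (2, "grace")],
)


def soft_delete(user_id, now):
    """Mark the row as deleted instead of removing it."""
    conn.execute(
        "UPDATE users SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
        (now, user_id),
    )


def restore(user_id):
    """Undo a soft delete by clearing the marker."""
    conn.execute("UPDATE users SET deleted_at = NULL WHERE id = ?", (user_id,))


def visible_users():
    """Every normal read has to filter out soft-deleted rows."""
    rows = conn.execute(
        "SELECT name FROM users WHERE deleted_at IS NULL ORDER BY id"
    )
    return [name for (name,) in rows]


soft_delete(1, "2023-01-01T00:00:00")
after_delete = visible_users()
restore(1)
after_restore = visible_users()
```

The cost of this design is that every query must remember the `deleted_at IS NULL` filter, which is the kind of trade-off the critique linked in this section discusses.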

Golang Bun Lightweight Golang ORM for PostgreSQL, MySQL, MSSQL, and SQLite
Golang GORM The fantastic ORM library for Golang
Typescript DeepKit High performance typescript framework
Java Spring How to Implement a Soft Delete with Spring JPA
Typescript TypeOrm Easy CRUD for GraphQL
Typescript Sequelize Sequelize is a modern TypeScript and Node.js ORM for Oracle, Postgres, MySQL, MariaDB, SQLite and SQL Server, and more.
Typescript Prisma Next-generation Node.js and TypeScript ORM
Rust Diesel query builder
Python Django Soft delete for Django ORM, with support for undelete
brandur Soft Deletion Probably Isn't Worth It
Evil Martians Soft deletion with PostgreSQL: but with logic on the database!

Contribution

This list started as a personal collection of interesting things about data versioning. Your contributions and suggestions are warmly welcomed. Read the contribution guidelines.

Open Source Agenda is not affiliated with "Awesome Data Temporality" Project. README Source: daefresh/awesome-data-temporality
