A new-generation decentralized data lake and streaming data pipeline
Using the installer script (Linux / MacOSX / WSL2):

```sh
curl -s "https://get.kamu.dev" | sh
```
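Once installed, a first session might look like this minimal sketch (dataset and file names are placeholders):

```sh
# Create a new kamu workspace in the current directory
kamu init

# Register a dataset from a manifest file, then ingest its data
kamu add my-dataset.yaml
kamu pull my.dataset

# List the datasets in the workspace
kamu list
```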
See `kamu` in action, or try this self-serve demo without needing to install anything.

`kamu` (pronounced kaˈmju) is an easy-to-use command-line tool for managing, transforming, and collaborating on structured data.
In short, it can be described as:
Using `kamu`, any person or even the smallest organization can easily share structured data with the world. Data can be static or flow continuously. In all cases, `kamu` will ensure that it stays:
Teams and data communities can then collaborate on cleaning, enriching, and aggregating data by building arbitrarily complex decentralized data pipelines. Following the "data as code" philosophy, `kamu` doesn't let you touch data manually - instead, you transform it using streaming SQL (we support multiple frameworks; see the sketch below). This ensures that data supply chains are:
Data scientists, analysts, ML/AI researchers, and engineers can then:
This reuse is achieved by maintaining an unbreakable lineage and provenance trail in tamper-proof metadata, which lets you assess the trustworthiness of data no matter how many hands and transformation steps it has passed through.
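As a concrete sketch of this workflow, a derivative dataset is declared as a manifest containing a streaming SQL query and added to the workspace. The manifest below follows our reading of the Open Data Fabric format; all dataset names are placeholders, and field names may differ between `kamu` versions:

```sh
# Sketch: declare a derivative dataset as a streaming SQL transformation
# (names are placeholders; manifest fields may vary across versions)
cat > city-stats.yaml <<'EOF'
kind: DatasetSnapshot
version: 1
content:
  name: city.population.stats
  kind: Derivative
  metadata:
    - kind: SetTransform
      inputs:
        - datasetRef: city.population
          alias: population
      transform:
        kind: Sql
        engine: datafusion
        query: |
          SELECT city, AVG(population) AS avg_population
          FROM population
          GROUP BY city
EOF
kamu add city-stats.yaml
```

The metadata chain behind the lineage and provenance trail can then be inspected and re-checked at any time:

```sh
# Show the dataset's metadata chain: sources, transformations, block hashes
kamu log city.population.stats

# Recompute the derivation and check it against the recorded results
kamu verify city.population.stats
```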
In a larger context, `kamu` is a reference implementation of Open Data Fabric - a Web 3.0 protocol for providing timely, high-quality, and verifiable data for data science, smart contracts, web, and applications.
In general, `kamu` is a great fit for cases where data is exchanged between several independent parties, and for mission-critical data of low to moderate frequency and volume, where a high degree of trustworthiness and protection from malicious actors is required.
Examples:
To share data outside your organization today you have limited options:
Let's acknowledge that for organizations that produce the most valuable data (governments, hospitals, NGOs), publishing data is not part of their business. They typically don't have the incentives, expertise, and resources to be good publishers.
This is why the goal of `kamu` is to make data publishing cheap and effortless:
As opposed to the bare download counter you get on most data portals, `kamu` brings publishers closer to their communities, letting them see who uses their data and how. You no longer send data into "the ether" but create a closed feedback loop with your consumers.
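Publishing in `kamu` amounts to pushing a dataset to a repository that others can pull from. A minimal sketch with placeholder names; the exact push syntax may vary between versions:

```sh
# Register a repository alias (here an S3 bucket; name and URL are placeholders)
kamu repo add my-repo s3://my-bucket/datasets

# Push the local dataset so consumers can pull it and stay in sync
kamu push my.dataset --to my-repo/my.dataset
```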
One of the driving forces behind `kamu`'s design was the ongoing reproducibility crisis in science, which we believe is in large part caused by our poor data management practices.
After incidents like the Surgisphere scandal, the sentiment in research is shifting from assuming that all research is done in good faith to considering any research unreliable until proven otherwise.
Data portals like Dataverse, Dryad, Figshare, and Zenodo are helping reproducibility by archiving data, but this approach:
We believe that the majority of valuable data (weather, census, health records, core financial data) flows continuously, and that most of the interesting insights lie around the latest data, so we designed `kamu` to bring reproducibility and verifiability to near real-time data.
When using `kamu`:
Data-driven journalism is on the rise and has proven to be extremely effective. In a world of misinformation and extremely polarized opinions, data provides an anchoring point for discussing complex problems and analyzing cause and effect. Data itself is non-partisan and has no secret agenda, and arguments about different interpretations of data are infinitely more productive than ones based on gut feelings.

Unfortunately, data too often has issues that undermine its trustworthiness. And even if the data is correct, it's very easy to pose a question about its sources that will take too long to answer - the data will be dismissed, and gut feelings will step in.
This is why `kamu`'s goal is to make data verifiably trustworthy and to make answering provenance questions a matter of seconds. Only when data cannot be easily dismissed will we start to pay proper attention to it.
And once we agree that source data can be trusted, we can build analyses and real-time dashboards that keep track of complex issues like corruption, inequality, climate, epidemics, refugee crises, etc.
`kamu` prevents good research from going stale the moment it's published!
`kamu` aims to be the most reliable data management solution, one that provides recent data while maintaining the highest degree of accountability and tamper-proof provenance, without requiring you to put all your data into a central database.
We're developing it with financial and pharmaceutical use cases in mind, where audit and compliance could be fully automated through our system.
Note that we currently focus on mission-critical data: `kamu` is not well suited for IoT or other high-frequency, high-volume cases, but it can be a good fit for the insights produced from such data that influence your company's decisions and strategy.
Being data geeks, we use `kamu` for data-driven decision-making even in our personal lives.
Actually, our largest data pipelines so far were created for personal finance:
We also scrape a lot of websites to make smarter purchasing decisions. `kamu` lets us keep all this data up to date with minimal effort.
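Keeping an entire workspace current is a single command:

```sh
# Re-ingest all root datasets and propagate updates through derived ones
kamu pull --all
```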
`kamu` connects publishers and consumers of data through a decentralized network and lets people collaborate on extracting insight from data. It offers many perks for everyone who participates in this first-of-its-kind data supply chain:
If you like what we're doing, support us by starring the repo - it helps us a lot!
Subscribe to our YouTube channel to get fresh tech talks and deep dives.
Stop by and say "hi" in our Discord Server - we're always happy to chat about data.
If you'd like to contribute, start here.
Website | Docs | Tutorials | Examples | FAQ | Chat | Contributing | Developer Guide | License