A new-generation decentralized data lake and streaming data pipeline
Using the installer script (Linux / MacOSX / WSL2):

```sh
curl -s "https://get.kamu.dev" | sh
```
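Once installed, a first session might look like this minimal sketch (dataset and file names are placeholders):

```sh
# Create a new kamu workspace in the current directory
kamu init

# Register a dataset from a manifest file, then ingest its data
kamu add my-dataset.yaml
kamu pull my.dataset

# List the datasets in the workspace
kamu list
```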
See `kamu` in action, or try this self-serve demo without needing to install anything.

`kamu` (pronounced kaˈmju) is an easy-to-use command-line tool for managing, transforming, and collaborating on structured data.
In short, it can be described as:
Using `kamu`, any person or even the smallest organization can easily share structured data with the world. Data can be static or flow continuously. In all cases, `kamu` will ensure that it stays:
Teams and data communities can then collaborate on cleaning, enriching, and aggregating data by building arbitrarily complex decentralized data pipelines. Following the "data as code" philosophy, `kamu` doesn't let you touch data manually - instead, you transform it using streaming SQL (we support multiple frameworks; see the sketch below). This ensures that data supply chains are:
Data scientists, analysts, ML/AI researchers, and engineers can then:
This reuse is achieved by maintaining an unbreakable lineage and provenance trail in tamper-proof metadata, which lets you assess the trustworthiness of data no matter how many hands and transformation steps it has passed through.
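As a concrete sketch of this workflow, a derivative dataset is declared as a manifest containing a streaming SQL query and added to the workspace. The manifest below follows our reading of the Open Data Fabric format; all dataset names are placeholders, and field names may differ between `kamu` versions:

```sh
# Sketch: declare a derivative dataset as a streaming SQL transformation
# (names are placeholders; manifest fields may vary across versions)
cat > city-stats.yaml <<'EOF'
kind: DatasetSnapshot
version: 1
content:
  name: city.population.stats
  kind: Derivative
  metadata:
    - kind: SetTransform
      inputs:
        - datasetRef: city.population
          alias: population
      transform:
        kind: Sql
        engine: datafusion
        query: |
          SELECT city, AVG(population) AS avg_population
          FROM population
          GROUP BY city
EOF
kamu add city-stats.yaml
```

The metadata chain behind the lineage and provenance trail can then be inspected and re-checked at any time:

```sh
# Show the dataset's metadata chain: sources, transformations, block hashes
kamu log city.population.stats

# Recompute the derivation and check it against the recorded results
kamu verify city.population.stats
```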
In a larger context, `kamu` is a reference implementation of Open Data Fabric - a Web 3.0 protocol for providing timely, high-quality, and verifiable data for data science, smart contracts, web, and applications.
In general, `kamu` is a great fit for cases where data is exchanged between several independent parties, and for mission-critical data of low to moderate frequency and volume, where a high degree of trustworthiness and protection from malicious actors is required.
Examples:
To share data outside your organization today you have limited options:
Let's acknowledge that for organizations that produce the most valuable data (governments, hospitals, NGOs), publishing data is not part of their business. They typically don't have the incentives, expertise, and resources to be good publishers.
This is why the goal of `kamu` is to make data publishing cheap and effortless:
As opposed to the bare download counter you get on most data portals, `kamu` brings publishers closer to their communities, letting them see who uses their data and how. You no longer send data into "the ether" but create a closed feedback loop with your consumers.
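Publishing in `kamu` amounts to pushing a dataset to a repository that others can pull from. A minimal sketch with placeholder names; the exact push syntax may vary between versions:

```sh
# Register a repository alias (here an S3 bucket; name and URL are placeholders)
kamu repo add my-repo s3://my-bucket/datasets

# Push the local dataset so consumers can pull it and stay in sync
kamu push my.dataset --to my-repo/my.dataset
```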
One of the driving forces behind `kamu`'s design was the ongoing reproducibility crisis in science, which we believe is in large part caused by our poor data management practices.
After incidents like the Surgisphere scandal, the sentiment in research is shifting from assuming that all research is done in good faith to considering any research unreliable until proven otherwise.
Data portals like Dataverse, Dryad, Figshare, and Zenodo are helping reproducibility by archiving data, but this approach:
We believe that the majority of valuable data (weather, census, health records, core financial data) flows continuously, and that most of the interesting insights lie around the latest data, so we designed `kamu` to bring reproducibility and verifiability to near real-time data.
When using `kamu`:
Data-driven journalism is on the rise and has proven to be extremely effective. In a world of misinformation and extremely polarized opinions, data provides an anchoring point for discussing complex problems and analyzing cause and effect. Data itself is non-partisan and has no secret agenda, and arguments about different interpretations of data are infinitely more productive than ones based on gut feelings.

Unfortunately, data too often has issues that undermine its trustworthiness. And even if the data is correct, it's very easy to pose a question about its sources that will take too long to answer - the data will be dismissed, and gut feelings will step in.
This is why `kamu`'s goal is to make data verifiably trustworthy and to make answering provenance questions a matter of seconds. Only when data cannot be easily dismissed will we start to pay proper attention to it.
And once we agree that source data can be trusted, we can build analyses and real-time dashboards that keep track of complex issues like corruption, inequality, climate, epidemics, refugee crises, etc.
`kamu` prevents good research from going stale the moment it's published!
`kamu` aims to be the most reliable data management solution, one that provides recent data while maintaining the highest degree of accountability and tamper-proof provenance, without requiring you to put all your data into a central database.
We're developing it with financial and pharmaceutical use cases in mind, where audit and compliance could be fully automated through our system.
Note that we currently focus on mission-critical data: `kamu` is not well suited for IoT or other high-frequency, high-volume cases, but it can be a good fit for the insights produced from such data that influence your company's decisions and strategy.
Being data geeks, we use `kamu` for data-driven decision-making even in our personal lives.
Actually, our largest data pipelines so far were created for personal finance:
We also scrape a lot of websites to make smarter purchasing decisions. `kamu` lets us keep all this data up to date with minimal effort.
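Keeping an entire workspace current is a single command:

```sh
# Re-ingest all root datasets and propagate updates through derived ones
kamu pull --all
```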
`kamu` connects publishers and consumers of data through a decentralized network and lets people collaborate on extracting insight from data. It offers many perks for everyone who participates in this first-of-its-kind data supply chain:
If you like what we're doing, support us by starring the repo - it helps us a lot!
Subscribe to our YouTube channel to get fresh tech talks and deep dives.
Stop by and say "hi" in our Discord Server - we're always happy to chat about data.
If you'd like to contribute, start here.
Website | Docs | Tutorials | Examples | FAQ | Chat | Contributing | Developer Guide | License