rtdl makes it easy to build and maintain a real-time data lake
rtdl is a universal real-time ingestion and pre-processing layer for every
data lake – regardless of table format, OLAP layer, catalog, or cloud vendor. It is the easiest
way to build and maintain real-time data lakes. You send rtdl a real-time data stream – often
from a tool like Segment – and it builds you a real-time data lake on AWS S3, GCP Cloud
Storage, and Azure Blob Storage.
You provide the data, rtdl builds your lake.
Stay up-to-date on rtdl via our website and blog, and learn how to use rtdl via our documentation.
rtdl's initial feature set is built and working. You can use the API on port 80 to configure streams that ingest JSON from an rtdl endpoint on port 8080, process it into Parquet, and save the files to a destination configured in your stream. rtdl can write files locally, to HDFS, AWS S3, GCP Cloud Storage, and Azure Blob Storage, and you can query your data via Dremio's web UI at http://localhost:9047 (login with username `rtdl` and password `rtdl1234`). rtdl supports writing in the Delta Lake table format as well as integration with the AWS Glue and Snowflake External Tables metadata catalogs.
Use the API on port 80 to create and manage stream configurations. For more detailed instructions, see our Initialize rtdl docs.

**Initialize rtdl:**

1. Run `docker compose -f docker-compose.init.yml up -d`. If initialization fails, run `docker compose -f docker-compose.init.yml down` and retry.
2. After the containers `rtdl_rtdl-db-init`, `rtdl_dremio-init`, and `rtdl_redpanda-init` exit and complete with `EXITED (0)`, kill and delete the rtdl container set by running `docker compose -f docker-compose.init.yml down`.

**Run rtdl:** Run `docker compose up -d` every time after. Run `docker compose down` to stop.

**Note #1:** To start from scratch, run `rm -rf storage/` from the rtdl root folder.
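The initialization steps above can be sketched as a small shell script. This is a non-authoritative sketch, not part of rtdl itself: it assumes Docker Compose v2 and simply polls the three init containers named above until they exit.

```shell
#!/bin/sh
# Sketch of the one-time initialization flow described above.
# Container names are the ones listed in the steps; everything else is assumed.

# Poll a container until it has exited, then print its exit code.
wait_for_exit() {
  until [ "$(docker inspect -f '{{.State.Status}}' "$1" 2>/dev/null)" = "exited" ]; do
    sleep 5
  done
  docker inspect -f '{{.State.ExitCode}}' "$1"
}

# Run the init compose file, verify all three init containers finished with
# EXITED (0), tear the init container set down, then start rtdl normally.
init_rtdl() {
  docker compose -f docker-compose.init.yml up -d
  for c in rtdl_rtdl-db-init rtdl_dremio-init rtdl_redpanda-init; do
    if [ "$(wait_for_exit "$c")" != "0" ]; then
      echo "$c failed; run 'docker compose -f docker-compose.init.yml down' and retry" >&2
      return 1
    fi
  done
  docker compose -f docker-compose.init.yml down
  docker compose up -d
}

# Invoke from the rtdl root folder with:  . ./init.sh && init_rtdl
```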
**Note #2:** If you experience file write issues preventing the Dremio and/or Redpanda services from starting, please add `user: root` to the Dremio and Redpanda service definitions in the `docker-compose.init.yml` and `docker-compose.yml` files. This issue has been encountered on Linux.
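For reference, the added line would sit in the service definition like this. The service and image names below are illustrative, not taken from rtdl's actual compose files:

```yaml
services:
  dremio:                      # likewise for the Redpanda service
    image: dremio/dremio-oss   # illustrative image name
    user: root                 # line added to work around the file-write issue
```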
For more detailed setup instructions for your cloud provider, see our setup docs.

Create an IAM policy like the one below, replacing `<YOUR_BUCKET_NAME>` with the name of the S3 bucket you created in step 1.
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListAllBuckets",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListAllMyBuckets"
            ],
            "Resource": [
                "arn:aws:s3:::*"
            ]
        },
        {
            "Sid": "ListBucket",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<YOUR_BUCKET_NAME>"
            ]
        },
        {
            "Sid": "ManageBucket",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:PutObjectAcl",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<YOUR_BUCKET_NAME>/*"
            ]
        }
    ]
}
```
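If you prefer the AWS CLI over the console, the policy above can be created and attached with commands along these lines. This is a hypothetical sketch: the policy name, user name, file path, and account id are assumptions, not values from the rtdl docs, and the commands require configured AWS credentials.

```shell
#!/bin/sh
# Sketch: create the IAM policy above and attach it to an IAM user via the
# AWS CLI. All names and the account id below are illustrative assumptions.

create_rtdl_policy() {
  # Assumes the policy JSON above was saved to rtdl-s3-policy.json.
  aws iam create-policy \
    --policy-name rtdl-s3-access \
    --policy-document file://rtdl-s3-policy.json
}

attach_rtdl_policy() {
  # Use the policy ARN printed by create-policy; the account id here is fake.
  aws iam attach-user-policy \
    --user-name rtdl-user \
    --policy-arn "arn:aws:iam::123456789012:policy/rtdl-s3-access"
}
```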
Note the **Access Key ID** and **Secret Access Key** for use in configuring your stream in rtdl.

Example `createStream` call body for creating a data lake on AWS S3:
```json
{
    "active": true,
    "message_type": "test-msg-aws",
    "file_store_type_id": 2,
    "region": "us-west-1",
    "bucket_name": "testBucketAWS",
    "folder_name": "testFolderAWS",
    "partition_time_id": 1,
    "compression_type_id": 1,
    "aws_access_key_id": "[aws_access_key_id]",
    "aws_secret_access_key": "[aws_secret_access_key]"
}
```
Example `createStream` curl call for creating a data lake on AWS S3:
```shell
curl --location --request POST 'http://localhost:80/createStream' \
  --header 'Content-Type: application/json' \
  --data-raw '{
      "active": true,
      "message_type": "test-msg-aws",
      "file_store_type_id": 2,
      "region": "us-west-1",
      "bucket_name": "testBucketAWS",
      "folder_name": "testFolderAWS",
      "partition_time_id": 1,
      "compression_type_id": 1,
      "aws_access_key_id": "[aws_access_key_id]",
      "aws_secret_access_key": "[aws_secret_access_key]"
  }'
```
For more detailed instructions, see our Send data to rtdl docs.

All data should be sent to the `ingest` endpoint of the ingest service on port 8080 (e.g. http://localhost:8080/ingest). Include your stream's `stream_id` in the payload and rtdl will add the data to your lake.
```json
{
    "stream_id": "837a8d07-cd06-4e17-bcd8-aef0b5e48d31",
    "name": "user1",
    "array": [1, 2, 3],
    "properties": {"age": 20}
}
```
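As a sketch, the payload above could be posted to the ingest endpoint with curl. The `stream_id` shown is the example value from the payload; a live stream would use the id of a stream you actually created.

```shell
#!/bin/sh
# Sketch: POST the example payload above to rtdl's ingest endpoint on port 8080.
send_event() {
  curl --location --request POST 'http://localhost:8080/ingest' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "stream_id": "837a8d07-cd06-4e17-bcd8-aef0b5e48d31",
        "name": "user1",
        "array": [1, 2, 3],
        "properties": {"age": 20}
    }'
}
```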
You can optionally add `message_type` to a payload should you choose to override the `message_type` specified while creating the stream. rtdl will default to the message type `rtdl_default` if the message type is absent from both the stream definition and the actual message.

rtdl has a multi-service architecture composed of a new generation of open source tools to process and access your data, plus custom-built services that make it easier to interact with them. To learn more about rtdl's services and architecture, visit our Architecture docs.
Contributions are always welcome!
See our CONTRIBUTING guide for ways to get started.

This project adheres to the rtdl code of conduct, a direct adaptation of the Contributor Covenant, version 2.1.