Awesome Scalability Toolbox

My opinionated list of products and tools used for high-scalability projects

Index

Diagrams
API documentation
Message queues
Load balancers, reverse proxy, accelerators, web servers
Service mesh
API Gateway
Structured and unstructured data storage
Distributed consensus management, service discovery and configuration
CRDT and Operational transformation
Infrastructure provisioning
Containers
Kubernetes
Jsonnet
RPC, Communication between system nodes
gRPC
Service monitoring, metrics collection / graphing
Infrastructure information management
Distributed request tracing
Load testing
Log management
Feature Flags
Deployment tools
CI (Continuous Integration)
CDNs
Domain registrars
AWS
Networking
SDN
SRE (Site Reliability Engineering)
Disk storage
TLS
HTTP/3 and QUIC
Authorization and Authentication
Cryptography
UUID
Hashing
Videos
Real User Monitoring
QA Automation
Tools
Misc

Diagrams

PlantUML
Mermaid
C4 (multi-level architecture diagrams)
Structurizr (multi-level DSL)
IcePanel (multi-level modelling and diagrams)
dbdiagrams (DB ERD)
Workload Discovery on AWS
Cloudcraft (AWS only)
Vega (visualization from JSON) and Vega-Lite
arc42

API documentation

OpenAPI
Swagger
Wording, definition syntax and units for RFC specification creation
API Stylebook
Spectral (API linter)
Dredd (API tester)
Zally (API linter)
GraphQL, UI client

Message queues

Real-time (<1ms): Aeron, Chronicle Queue
Brokerless: ZeroMQ, nanomsg, NSQ, nng
Kafka, Kafka Web UI solutions: AKHQ, Kafdrop, Kowl, Lenses Box
Redpanda (Kafka compatible)
RabbitMQ
Pulsar
RocketMQ
MemQ (throughput optimized)
NATS

Load balancers, reverse proxy, accelerators, web servers

HAProxy, Unofficial Web UI
Envoy and Dropbox migration to Envoy from nginx
nginx, nginx config
OpenResty
Varnish
Tomcat
Træfik
Tarantool (mail.ru)
lighttpd
katran (BPF/XDP L4LB, Facebook)
GLB Director (DPDK L4LB, GitHub)
Cloudflare Unimog design

Service mesh

Linkerd
Envoy
Envoy introduction
Learn Envoy
Consul Connect
Kuma (from Kong)
Kong Mesh
xDS control protocol
Rotor (xDS, Turbine Labs)
ModSecurity for Envoy (WAF)
Envoy Java control plane
Istio service mesh controller
Istio introduction
Conduit (Rust, linkerd devs)
Netflix Vizceral (observability)
Kiali (observability, Istio)
Vistio (observability, Istio)

API Gateway

AWS API Gateway
Kong
Cloudflare API Gateway
KrakenD

Structured and unstructured data storage

DynamoDB and its internal design (2022)
PostgreSQL
Postgres Pro (PostgreSQL)
RDS Postgres vs Aurora Postgres 13
MySQL, ProxySQL (for MySQL), mydumper (MySQL multi-threaded backup/restore)
RocksDB (InnoDB replacement by Facebook), Using NVM in Facebook (RocksDB)
Vitess (MySQL auto horizontal scaling)
MariaDB (MySQL)
Percona (MySQL)
MongoDB
Scylla (Cassandra done right), ScyllaDB with Optane
Cassandra
CockroachDB
Aerospike
FoundationDB
TiDB
JSON in PostgreSQL 10.x, 11.x, PostgreSQL 9.6 vs Mongo 3.4
Why Uber Engineering Switched from Postgres to MySQL and Follow up 1, 2, 3, 4, 5, 6, 7
Redis, Community Slack Channel
Redis modules: 5 open source modules, JSON module
Redis UI: RedisInsight, AnotherRedisDesktopManager, Redis-UI, Redis Desktop Manager
iredis (improved CLI for Redis)
KeyDB (Redis fork with I/O multithreading and offloading to flash)
Memcached, extstore storage shim, Caching beyond RAM: the case for NVMe
Memcached-SR with BMC (BPF Memory Cache) and its paper with video
Segcache (in-memory storage optimized for small objects with short TTL, Twitter)
FASTER (Microsoft), official site
Anna (experimental, Berkeley RISE Lab), white paper
LogDevice (Facebook, distributed storage for sequential data)
OrientDB (graph)
Database isolation levels
The Log-Structured Merge-Tree (LSM-Tree) whitepaper
B+ tree
YCSB (Yahoo! Cloud Serving Benchmark)

Distributed consensus management, service discovery and configuration

Raft protocol
Paxos protocol
Paxos made simple
Paxos Made Live - An Engineering Perspective
Consul
etcd
Vault
Secure Production Identity Framework For Everyone (SPIFFE)
ZooKeeper

CRDT and Operational transformation

Operational Transformation
White papers: Original Jupiter document (1995), Jupiter Made Abstract, and Then Refined (2020)
Libraries: sharedb, ottypes, libot
Articles: Collaborative Editing in CodeMirror
CRDT
Libraries: Automerge, Yjs, Diamond Types (speed oriented), Reference CRDT implementation, Yjs (port to Rust), teletype (Atom, deprecated)
CRDT benchmarking
Collection of whitepapers and articles

Infrastructure provisioning

Terraform
Terragrunt
Terraform best practices
Terraform AWS modules
Infracost - calculate Terraform deployment costs (AWS)
modules.tf - Convert Cloudcraft diagrams to Terraform code
Pulumi
Crossplane

Deployment tools

Ansible
Teletraan

CI (Continuous Integration)

GitHub Actions
TeamCity
Jenkins
Jenkins X (for k8s apps)
JetBrains Space
Tekton Pipelines (k8s native using CRD)

Containers

Docker
Docker Registries: Harbor, Quay
Awesome Docker list
docker-autoheal (restart on unhealthy event)
Kubernetes
Container Network Interface
Mesosphere
Mesos
gVisor (sandbox runtime)
Weave Scope (monitoring)
SysDig (monitoring)

Kubernetes

Lens (k8s IDE)
k9s (alternative cli)
minikube
kubectl
Krew (kubectl plugin manager), list of plugins
kustomize
Helm
Knative (run serverless apps on top of Istio)
List of K8s application management tools
Kompose (Docker Compose to k8s)
ksonnet
kubecfg
Skaffold
Draft
Kubespray (cluster setup)
kops (cluster setup)
kubectx & kubens (switch clusters and namespaces)
goldpinger (nodes connectivity test/display)
kube-ps1 (bash prompt)
stern (pod and container logs tailing)
click (cli for large clusters)
Telepresence (for k8s services development)
Cilium
Calico
AWS VPC Kubernetes CNI driver using IPvlan
Contour (Ingress controller using Envoy)
Gimbal (Ingress load balancer to many clusters)
Vault with Kubernetes and Video on improvements
Weave Scope (Monitoring, visualisation & management for k8s)
Guide to Kubernetes networking (part 1), Part 2
Kubernetes Security - Best Practice Guide
RBAC and user management generation using Web UI
Chaos mesh

Jsonnet

jsonnet
jsonnet builds
Visual Studio Code plugin
IntelliJ plugin (alpha)
Style guide (Databricks)

RPC, Communication between system nodes

gRPC
Dubbo (China version of gRPC)
Protocol Buffers
Thrift
Cap'n Proto
MessagePack
FlatBuffers
Motan
Aeron
ZeroMQ
SMF

gRPC

Awesome gRPC list
gRPC status codes
gRPC Field Mask and Netflix guide to using it Get operations and Update operations
gRPC field presence (v3.15+)
Insomnia (test client)
Postman (test client)
Hoppscotch (test client)
Kreya (test client)
httpYac (test client)
milkman test client (via gRPC plugin)
improbable test client (Web)
bloomrpc test client (GUI)
gRPC 2 years in production

Service monitoring, metrics collection / graphing

Grafana
Grafonnet-lib (generate dashboards for Grafana)
Graphite
Prometheus
Thanos (Prometheus long term storage)
Cortex (Prometheus long term storage)
OpenMetrics
eBPF exporter (Prometheus)
Node Exporter (Prometheus)
cAdvisor (container monitoring)
ClickHouse (Yandex)
Druid (Imply)
Pinot (LinkedIn)
Architecture analysis of ClickHouse, Druid and Pinot
HTTP Analytics for 6M requests per second using ClickHouse
NetData
Vector (host monitoring)
okmeter
Datadog
TimescaleDB
KairosDB
Zabbix
PagerDuty

Infrastructure information management

Osquery (Facebook)
Kolide Fleet (osquery)
Doorman (osquery)
OSSEC

Distributed request tracing

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure (Google)
OpenTelemetry
OpenTracing and Jaeger introduction
TraceContext propagation format
Jaeger (Uber)
Zipkin
Lightstep
Tempo (Grafana)
Skywalking
AWS X-Ray

Load testing

Yandex.Tank (C++, Python, Go)
Overload (storage for Yandex.Tank results)
Gatling (Scala)
k6
Locust (Python)
Vegeta (HTTP 1.1/2)
h2load (HTTP 1.1/2)
autocannon (HTTP 1.1)

Log management

What you need to know about real-time logs
Vector
fluentd
Logstash
Graylog2
syslog-ng
rsyslog
fluentbit
filebeat
Kibana
Loki
Splunk
GoAccess
Bookkeeper
LogDevice (Facebook)
Online solutions:
Loggly
Logentries
Papertrail
Scalyr
Sumo Logic
Humio

Feature Flags

Overview site
FF4J
Togglz (Java)
Unleash (simple)
LaunchDarkly (cloud provider)
piranha (Uber tool to refactor feature flag code)

CDNs

Cloudflare
CloudFront (AWS)
Fastly
Akamai
Traffic Control (Self-hosted CDN)

Domain registrars

MarkMonitor
Cloudflare

AWS

AWS Infrastructure overview
awscli
awless
S3 Browser
CloudBerry S3 Explorer
Analyze S3 speed from your location
Analyze AWS S3 and CloudFront logs + GoAccess
EC2 instance cheat sheet
S3 meta information
AWS DNS ALIAS record (vs CNAME)
Understanding IAM

Networking

AWS Networking Fundamentals overview: Networking Fundamentals, Application networking foundations, PrivateLink, Advanced VPC fundamentals
Scalable Reliable Datagram (SRD is available via ENA in AWS instances)
"Soft-unicast" for egress traffic
Proxies primer: based on HTTP Connect and QUIC (MASQUE)
Streams support in modern browsers and Current browser support state
Understanding all DNS records
Understanding cost of bandwidth, AWS egress cost analysis
Peering database
WebTransport protocol (improving on WebSockets and WebRTC use cases)
chrony (NTP) and Facebook measuring chrony vs ntpd
BPF introduction
XDP
BPFd (remote BPF by Google)
bpftrace (high-level language for writing eBPF programs)
BCC (Tools for BPF-based Linux IO analysis, networking, monitoring, and more)
How to achieve low latency with 10Gbps Ethernet (Cloudflare)
BBR: Congestion-based congestion control, BBR, the new kid on the TCP block
Making Linux TCP Fast
SYN packet handling in the wild (Cloudflare)
How TCP backlog works in Linux
Understanding TCP close states
Bind before connect
SYN cookies
On SO_REUSEADDR and SO_REUSEPORT
On Linux history of poll(), select() and epoll(), More on Linux epoll
Monitoring and Tuning the Linux Networking Stack: Receiving Data
Monitoring and Tuning the Linux Networking Stack: Sending Data
MIT's TCP ex Machina: Computer-Generated Congestion Control
Introduction to modern network load balancing and proxying (Envoy)
BGP in 2017
CoreDNS
Knot DNS
Knot Resolver
Maglev: A Fast and Reliable Software Network Load Balancer
MaxMind GeoIP databases
IPVS
Open vSwitch
kTLS in Linux (TLS in kernel space 4.13+), white paper and Intro in Go
DPDK
FD.io
RIPE NCC network information
JLS2009: Generic receive offload
High-Speed Trading: Lines, Radios, and Cables – Oh My
Solving problem with Nagle's algorithm and delayed ACK using TCP_NODELAY
IPFS
S/Kademlia: A Practicable Approach Towards Secure Key-Based Routing
Linux AnyIP
Listen on all ports for AnyIP range on the server
TCP Tracepoints (Linux 4.15/6+)
Kernel Connection Multiplexor (KCM) and more details
Blocking-resistant communication through domain fronting
Anatomy of Linux DNS lookup, part 2, part 3, part 4, part 5
Equal-cost multi-path routing (ECMP)
How LinkedIn used TCP Anycast to make the site faster
Roughtime protocol
List of reserved IPv4 ranges (IANA IPv4 Special-Purpose Address Registry)
List of reserved IPv6 ranges (IANA IPv6 Global Unicast Address Assignments)
Wikipedia on reserved IP addresses
TCP window scaling, timestamps and SACK
DNS SVCB and HTTPS records RFC (draft)
Networking ASICs overview in 2020
How NAT traversal works
Ethernet and IP Networking 101

SDN

Panel discussion "SDN: 10 years after the hype" (RU)
Stratum
P4 language
P4Runtime
OpenFlow
SAI (Switch Abstraction Interface)
ONOS
OpenNFP
OpenConfig

SRE (Site Reliability Engineering)

Napkin math numbers for estimating hardware and software performance
USENIX SREcon APAC 2022: Computing Performance: What's on the Horizon (Great overview by Brendan Gregg)
Google Site Reliability Engineering book
Experience from running Uber payment service
Best practices of on-call (Increment journal issue)
High Performance Browser Networking book
The Docker Book
Site Reliability Engineer HandBook
Linux Performance tools and materials
Understanding swap in Linux and Video: Linux Memory Management at Scale: Under the Hood
How Much Memory Does the Process Really Take on Linux?
U2F devices review
Optimizing web servers for high throughput and low latency (Dropbox)
Shipilev Close Encounters of The Java Memory Model Kind
On disk IO - part 1, part 2, part 3, part 4, part 5
Transparent Hugepages: measuring the performance impact
Introduction 2016 NUMA Deep Dive Series
Understanding PCIe Configuration for Maximum Performance
Netflix Serving 100 Gbps from an Open Connect Appliance
Aphyr Hermitage - info and testing of database isolation levels
A collection of postmortems
Jeff Dean's latency numbers plotted over time
Sakila test DB
Monitoring in the time of Cloud Native
Tyler McMullen - Load Balancing is Impossible
What every programmer should know about memory
What every programmer should know about floating point, floating points format explained, Floating point GUI site, shorter explanation
Chaos Engineering information map
A Gentle Introduction to Erasure Codes
The PMCs of EC2: Measuring IPC
AWS EC2 Virtualization evolution
DNS zone visualization
How Netflix Tunes EC2
Write-Behind Logging
Cache-Oblivious Algorithms and Data Structures
Oracle Graal (Hotspot replacement)
Understanding How Graal Works - a Java JIT Compiler Written in Java
Understanding disk usage in Linux
On time and UTC
The tail at scale (reducing latency long tail)
Optimizing ScyllaDB to run inside Docker container
Using PMM with EverSQL to optimize queries and part 2
Learn where some of the network sysctl variables fit into the Linux/Kernel network flow
Understanding CORS: Mozilla page, Stack Overflow on CORS
A self-service CA for OpenSSH
Shipilev JVM Anatomy Park
How does a relational database work
Using systemd timers instead of cronjobs
Story of age, cache-control headers and prefetching mechanism in modern browsers
Is Your Linux Version Hiding Interrupt CPU Usage From You?
HTTP Caching headers best practices

Disk storage

On Direct vs Buffered I/O and atomic writes, SO on atomic writes from storage specification side, LWN on atomic writes
Minio (local storage with AWS S3 API)
libzbc (direct disk access)
SMR drives at Dropbox
Intel VROC overview and performance testing
Blb (distributed object storage system developed by Upthere)
Configuring OpenZFS to run 24x NVMe drives for high-load MySQL
Achieving 11M IOPS & 66 GB/s IO on a Single ThreadRipper Workstation and follow up video

TLS

The Illustrated TLS Connection
A Readable Specification of TLS 1.3
Sonar
TLS information
Mutual TLS (mTLS)
Mozilla server side TLS information
BadTLS (SSL testing)
testssl.sh
Mozilla Observatory
HTTP security headers testing
Qualys SSL tests
High-Tech Bridge SSL test
HTTP security tools
HSTS preloading
SRI hash generator
Client side TLS test
DNS CAA helper
DNS over TLS
Encrypted Client Hello (ECH) standard, ECH background
TLS Delegated Credentials
Oblivious DNS over HTTPS RFC (draft)
ECH (TLS Encrypted Client Hello) RFC, Introduction to ECH

HTTP/3 and QUIC

HTTP/3 for everyone (video)
HTTP/3 test site (Fastly)
HTTP/3 Explained (book)
The Illustrated QUIC Connection
msquic (QUIC protocol implementation from Microsoft)
quiche (QUIC protocol implementation from Cloudflare)

Authorization and Authentication

OAuth 2.0 information: Practical information, book, online version of the book, best practices (RFC), browser-based apps guideline (RFC), RFC
AppAuth (OAuth 2.0 client library)
JSON Web Token (JWT)
JSON Web Signature (JWS)
JSON Web Encryption (JWE)
JWT playground
CBOR Web Token (CWT)
CBOR information
CBOR playground

Cryptography

OpenSSL
BoringSSL (Google)
s2n (AWS)
LibreSSL (OpenBSD OpenSSL fork)
Google Tink
Themis (encryption framework)
Acra (DB encryption proxy)
Ascon (2023 winner of lightweight cryptography)
Lightweight cryptography algorithms (NIST)
Cryptography Engineering: Design Principles and Practical Applications (book)
Introduction to Modern Cryptography, Second Edition (book)
Security Engineering, 2nd edition (book)
Crypto 101 (concepts, book)
Applied Cryptography Engineering
Ensuring Randomness with Linux's Random Number Generator
Should we MAC-then-encrypt or encrypt-then-MAC?
Authenticated Encryption: Relations among notions and analysis of the generic composition paradigm
How to choose an Authenticated Encryption mode
Awesome cryptography repository
Mind Your Keys? A Security Evaluation of Java Keystores
Hash-based message authentication code
Authenticated Encryption with Associated Data (AEAD)
AES-GCM (AEAD)
AES-GCM-SIV
GCM blockcipher mode
OCB blockcipher mode
ChaCha20 design (stream)
Poly1305 (MAC)
ChaCha20 and Poly1305 (AEAD)
AEGIS-128X (fast authentication cipher with AVX/AES acceleration)
Understanding RSA terms
Elliptic curve introduction
Elliptic Curve Cryptography: a gentle introduction
Safe elliptic curves
Curve25519
Hybrid Public Key Encryption (HPKE) RFC, Example of HPKE usage in Cloudflare
Fully Homomorphic Encryption library (Google, C++)
Understanding HKDF
Database Cryptography Fur the Rest of Us
Intro to Linux Kernel Key Retention Service

Hashing

smhasher testing suite
Article "Programmers Don’t Understand Hash Functions"
Fast Positive Hash
Meow hash
HighwayHash and SipHash (Google)
SipHash (original)
BLAKE3 (crypto)
BLAKE2 (crypto)
xxHash
MurmurHash3
argon2 (password hashing)
Dieharder: A Random Number Test Suite
yescrypt (KDF and password hashing)
"How to Encipher Messages on a Small Domain. Deterministic Encryption and the Thorp Shuffle" (encryption hashing whitepaper)

UUID

UUID version 6/7/8 RFC draft
UUID version 6/7/8 RFC work in progress
UUID version 7 playground
TypeID (type-safe extension of UUIDv7)
Why UUIDv7? (RU)
KSUID

Real User Monitoring

boomerang library and How to use boomerang
Custom backend required for boomerang - could use boomcatch and statsd
Commercial solution is Akamai mPulse
sitespeed.io tools
Matomo
User access information from logs: GoAccess and AWStats
Compress data from ResourceTiming API
JavaScript Performance APIs
JavaScript Navigation Timing API

QA Automation

Learn headless browser automation
Playwright
QA Wolf (Playwright scripts generation)
Headless Recorder (Playwright/Puppeteer scripts generation)
Puppeteer
Selenium
Cypress
mimesis (fake data generator)

Tools

htop
gtop
nvtop
k6 (load testing)
dnstrace
upx
bat
httpie
smenu
awesome tmux
py-spy (python profiler)
kubespy
up
doh
fx
jid
dive
nnn
ethr
termshark (CLI UI for Wireshark)
xdpcap (tcpdump for XDP)
flan (nmap based vulnerability scanner)
broot (files)
bandwidth
sandmap
duf (advanced du)

Misc

High Scalability/Availability/Stability articles list
Another github repo

Videos

Kafka 2017 Summit
CppCon 2017
@Scale 2017
Strange Loop 2017
FOSDEM 2018
Computer Architecture course taught at ETH Zürich in Fall 2017
GrafanaCon 2018
SREcon 2018
KubeCon + CloudNativeCon 2018
Networking @Scale 2018
Highload++ Siberia 2018
GrafanaCon 2019
SREcon 2020 Americas
FAST '21

Open Source Agenda is not affiliated with "Awesome Scalability Toolbox" Project. README Source: krootee/awesome-scalability-toolbox
