Incubator Pegasus Versions

Apache Pegasus - A horizontally scalable, strongly consistent and high-performance key-value store

v2.5.0

4 months ago

Apache Pegasus 2.5.0 is a feature release; the change list is summarized here: https://github.com/apache/incubator-pegasus/issues/1584

Release Note

Behavior changes

  • Use official RocksDB releases for Pegasus instead of the modified versions based on a historical fork. Since Pegasus 2.1.0 most of the modifications have been removed, and the few that remained were kept only for backward compatibility. Pegasus 2.5.0 switches to the official release, so to use 2.5.0 the server must be upgraded from 2.1, 2.2, 2.3 or 2.4, to ensure that the MANIFEST files contain no tags introduced by the modified versions, which the official releases cannot recognize. #1048
  • Logs in servers and C++ clients are now in the increasing severity order DEBUG, INFO, WARNING, ERROR and FATAL; the previously inverted order of DEBUG and INFO has been corrected. #1200
  • All shared log files are now flushed and removed during garbage collection; before 2.5.0 at least one shared log file was never removed, even though private logs had long since replaced shared logs as the WAL. #1594
  • No longer support EOL OS versions, including Ubuntu 16.04 and CentOS 6. #1553, #1557

New Features

  • Add a new ACL based on Apache Ranger to provide fine-grained access control over global-level, database-level and table-level resources, while remaining compatible with the old coarse-grained ACL. #1054
  • Add support to query and update table-level RocksDB options at runtime; currently only num_levels and write_buffer_size are supported, and other options will be added gradually as needed. #1511
  • Add a new rename command for cpp-shell, allowing users to rename a table. #1272
  • Add a configuration [network] enable_udp to control whether the UDP service is started; the service is not started when it is set to false. #1132
  • Add support to dump jemalloc statistics. #1133
  • Support the success_if_exist option in the table-creation interfaces of cpp-shell and the Java and Go clients. #1148
  • Add a new interface listApps to the Java client to list all tables. #1471
  • Add a new option [replication] crash_on_slog_error that makes it possible to exit the replica server if the shared log fails to be replayed, instead of trashing all replicas on the server. #1574
  • Pegasus can now be built on more platforms: Ubuntu 22.04 with Clang 14, and M1 macOS. #1350, #1094
  • Pegasus can be developed and built in a Docker environment via VS Code dev containers (https://code.visualstudio.com/docs/devcontainers/containers), which is friendlier to newcomers. #1544

Performance Improvements

  • Improve the performance of count_data of cpp-shell by only transferring the number of records rather than the real data. #1091

Bug fixes

  • Fix a bug where the RocksDB library was not built in Release mode, which could cause severe performance issues. #1340
  • Fix a bug in the Go client where the startScanPartition() operation could not be performed correctly if some partitions were migrated. #1106
  • Fix a bug where some RocksDB options could not be loaded correctly after updating the config file and restarting the replica server. #1108
  • Create a stat table to avoid errors reported from cpp-collector. #1155
  • Fix bugs where a wrong error code was passed to the callback and the directory was not rolled back correctly on failure while resetting the mutation log with log files from another directory. #1208
  • Fix a bug in admin-cli that reported an incorrect "doesn't have enough space" error when executing the partition-split start command. #1289
  • Fix a bug in the Java client where the batchGetByPartitions() API may throw an IndexOutOfBoundsException if the responses of some partitions fail. #1411
  • Trash a corrupted replica to the path <app_id>.<pid>.pegasus.<timestamp>.err when the RocksDB instance reports a corruption error while executing write operations, instead of leaving it in place, to avoid endless corruption errors. Also trash the corrupted replica to that path when corruption errors occur while executing read operations (instead of doing nothing). The trashed replica can be recovered automatically in a cluster deployment, but must be repaired manually in a singleton deployment. #1422
  • Fix a replica server crash when the .init-info file is missing. #1428
  • Fix a bug where the Go client may hang if some replica servers are down. #1444
  • Fix a bug where the replica server would crash when opening a replica with a corrupted RocksDB instance; the corrupted instance's directory is now marked as trash instead. #1450
  • Fix a bug where there was no interval between two attempts after an ERR_BUSY_CREATING or ERR_BUSY_DROPPING error was received while creating or dropping a table via cpp-shell. #1453
  • Fix a replica server crash in the ingestion step of the bulk load procedure. #1563
  • Mark the data directory as failed and stop assigning new replicas to it once IO errors are found for it. #1383

Configuration Changes

No configurations are removed in 2.5.0. All of the following configurations are added:

[network]
+ enable_udp = true
[metrics]
+ entity_retirement_delay_ms = 600000
[security]
+ enable_ranger_acl = false
[ddl_client]
+ ddl_client_max_attempt_count = 3
+ ddl_client_retry_interval_ms = 10000
[replication]
+ ignore_broken_disk = true
+ crash_on_slog_error = false
[ranger]
+ legacy_table_database_mapping_policy_name = __default__
[pegasus.server]
+ rocksdb_write_global_seqno = false
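
The two new ddl_client_* options above, together with the cpp-shell fix for ERR_BUSY_CREATING / ERR_BUSY_DROPPING (#1453), describe a simple bounded-retry policy: try a DDL operation a limited number of times and wait a fixed interval between attempts. Below is a minimal Java sketch of that policy; the exception type is a stand-in and the constants mirror the defaults above, so treat it as an illustration rather than the actual client implementation.

import java.util.concurrent.Callable;

public class DdlRetryPolicy {
    // Mirrors ddl_client_max_attempt_count = 3 and ddl_client_retry_interval_ms = 10000.
    static final int MAX_ATTEMPTS = 3;
    static final long RETRY_INTERVAL_MS = 10_000;

    // Retry the DDL action only while it keeps reporting a "busy" condition,
    // sleeping between attempts instead of hammering the meta server.
    static <T> T callWithRetry(Callable<T> action) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return action.call();
            } catch (IllegalStateException busy) { // stand-in for ERR_BUSY_CREATING / ERR_BUSY_DROPPING
                last = busy;
                if (attempt < MAX_ATTEMPTS) {
                    Thread.sleep(RETRY_INTERVAL_MS); // the interval that was missing before #1453
                }
            }
        }
        throw last;
    }
}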

All of the pull requests that are related to the added configurations:

Contributors

Thanks all of the contributors!

@acelyc111 @AlexNodex @Apache9 @empiredan @foreverneverer @GehaFearless @liangyuanpeng @littlepangdi @ninsmiracle @padmejin @Praying @ruojieranyishen @shalk @Smityz @totalo @WHBANG

v2.4.0

1 year ago

Release Note

Apache Pegasus 2.4.0 is a feature release; the change list is summarized here: https://github.com/apache/incubator-pegasus/issues/1032

New module

Apache Pegasus contains many ecosystem projects. In the past, they were maintained in different repositories. Starting from this version, they are gradually being donated to the official Apache Pegasus repository. Currently, the following projects have been donated to Apache Pegasus:

  • RDSN: in the past, rDSN was linked into Apache Pegasus as a Git submodule. It has now officially become the core module of Apache Pegasus.
  • PegasusClient: Pegasus supports multiple clients. The following client projects have been donated to Apache Pegasus: Pegasus-Java-Client, Pegasus-Scala-Client, Pegasus-Golang-Client, Pegasus-Python-Client, Pegasus-NodeJS-Client
  • PegasusDocker: Pegasus supports building with Docker. This version officially provides Dockerfile samples for various build environments and uses GitHub Actions to build the corresponding Docker images and upload them to DockerHub.
  • PegasusShell: Pegasus used to provide shell tools built in C++. In this version, we have built new shell tools in Go: AdminCli for administrators and Pegic for users.

New architecture

In testing we found that the shared-log engine, with its single write queue, causes a throughput bottleneck. Since SSDs handle concurrent random writes well, we removed the sequentially written shared log and kept only the private log as the WAL. Our tests show this brings roughly a 15-20% performance improvement.

New feature

Support replica count update dynamically

In the past, once a table was created its replica count could not be changed. The new version supports dynamically changing a table's replica count: users can increase or decrease the replica count of a serving table, transparently to the applications using it.

New batchGet interface

The old batchGet interface was only a thin wrapper around the get interface and had no real batching capability. The new interface optimizes the batch operation: it aggregates multiple requests according to the partition-hash rules and then sends each partition's aggregated request to the corresponding server node. This improves the throughput of batch reads.
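
The aggregation described above can be pictured as a grouping step: compute the partition each hash key belongs to, collect the keys per partition, and send one request per group. The self-contained Java sketch below shows only that grouping step; partitionIndex here is an assumed stand-in for the client's real partition-hash rule, not the actual algorithm.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchGetGrouping {
    // Stand-in for the partition-hash rule: map a hash key to one of N partitions.
    static int partitionIndex(byte[] hashKey, int partitionCount) {
        return Math.floorMod(Arrays.hashCode(hashKey), partitionCount);
    }

    // Group the requested hash keys by partition so that a single aggregated
    // request per partition can be sent to the node serving that partition.
    static Map<Integer, List<byte[]>> groupByPartition(List<byte[]> hashKeys, int partitionCount) {
        Map<Integer, List<byte[]>> grouped = new HashMap<>();
        for (byte[] key : hashKeys) {
            grouped.computeIfAbsent(partitionIndex(key, partitionCount), p -> new ArrayList<>()).add(key);
        }
        return grouped;
    }
}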

Client request limiter

Bursts of client requests can pile up in the task queues. To avoid this, we added a queue controller that limits task traffic in extreme scenarios.

In the past, Pegasus throttled only write traffic. The new version also supports throttling read traffic, which improves cluster stability in emergencies.
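
Conceptually, the queue controller is a rate limiter sitting in front of the task queues, with separate budgets for reads and writes. The sketch below shows one common way to implement such a limiter (a token bucket refilled per second); it illustrates the idea only and is not Pegasus' actual throttling code or configuration.

public class QpsLimiter {
    private final long maxPermitsPerSecond;
    private double available;
    private long lastRefillNanos = System.nanoTime();

    public QpsLimiter(long maxPermitsPerSecond) {
        this.maxPermitsPerSecond = maxPermitsPerSecond;
        this.available = maxPermitsPerSecond;
    }

    // Returns true if the request may enter the task queue,
    // false if it should be delayed or rejected.
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Refill tokens in proportion to elapsed time, capped at one second's budget.
        available = Math.min(maxPermitsPerSecond,
                available + (now - lastRefillNanos) / 1e9 * maxPermitsPerSecond);
        lastRefillNanos = now;
        if (available >= 1.0) {
            available -= 1.0;
            return true;
        }
        return false;
    }
}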

Jemalloc memory management

Jemalloc is an excellent memory management library. In the past we only used tcmalloc for memory management; the new version also supports jemalloc. For detailed benchmark results, see Jemalloc Performance.

Multi architecture support

We have added support for macOS and aarch64 systems, which improves Pegasus' cross-platform capability.

Client Feature

The Java client adds table creation and deletion interfaces, and supports batchGetByPartitions to match the server-side batchGet improvements.

The Go client adds support for server-side RPC interfaces such as bulk load, manual compact and disk-add.

AdminCli supports node-migrator, node-capacity-balance, table-migrator, table-partition-split and other commands.

Feature enhancement

Bulkload

Bulk load gains many performance optimizations, including using direct I/O for data download, repairing the duplicate check, and optimizing the ingest-task strategy, to reduce the impact of I/O load on request latency during bulk load.

Bulk load now supports concurrent tasks on multiple tables. Note, however, that because of the low-level rate limit, concurrency only allows multiple tables to queue their tasks; it does not improve overall task efficiency at the same rate.

Duplication

Duplication no longer depends on remote file systems and supports checkpoint file transmission between clusters.

Duplication supports batch-sending of log files to improve the synchronization efficiency of incremental data.

With the new duplication, when the user creates a task the server first copies the checkpoint files across clusters and then automatically synchronizes the incremental logs, which greatly simplifies the previous process.

Other Important

PerfCounter

In the monitoring system, we fixed CPU cache performance problems caused by false sharing and rebuilt the perf-counter (monitoring point) system.

ManualCompaction

We added a control interface for ManualCompaction so that users can easily trigger a ManualCompaction task and query its current progress in real time.

NFS in Learn

NFS is the module used for checkpoint transfer between nodes. In the past the system was affected by checkpoint transfer; the new version provides fine-grained disk-level rate control to reduce its impact.

The new request tracer supports uploading data to monitoring systems to obtain per-stage (link) latency statistics.

Environment Variables

We extended the deny_write environment variable: it can now also deny reads at the same time, and it provides different response information to the client.

Cold backup

Backup speed can affect request latency; the new version provides dynamic configuration of the HDFS upload rate during backup.

RocksDB log size limit

RocksDB logs can sometimes take up too much space; the new version limits their size.

MetaServer

The meta server list now supports host domain name (FQDN) configuration.

Bug fix

In the latest version, we focused on fixing the following problems:

Server

  • Node crash caused by ASIO's thread safety problem
  • IO amplification caused by improper handling of RPC body
  • Data overflow caused by unreasonable type declaration in AIO module
  • Unexpected error when replica is closed

Client

  • The batchMultiGet interface of the Java client cannot obtain data completely
  • The Go client cannot access the cluster when the server enables the request-drop configuration
  • The Go client cannot recover when it encounters errors such as ERR_INVALID_STATE

Performance

For this benchmark we used new machines; to make the comparison fair, we also re-ran Pegasus Server 2.3:

  • Machine parameters: DDR4 16G * 8 | Intel Silver4210*2 2.20Ghz/3.20Ghz | SSD 480G * 8 SATA
  • Cluster Server: 3 * MetaServerNode 5 * ReplicaServerNode
  • YCSB Client: 3 * ClientNode
  • Request Length: 1KB(set/get)
Case       | Client and threads     | R:W  | R-QPS   | R-Avg | R-P99 | W-QPS  | W-Avg | W-P99
Write Only | 3 clients * 15 threads | 0:1  | -       | -     | -     | 56,953 | 787   | 1,786
Read Only  | 3 clients * 50 threads | 1:0  | 360,642 | 413   | 984   | -      | -     | -
Read Write | 3 clients * 30 threads | 1:1  | 62,572  | 464   | 5,274 | 62,561 | 985   | 3,764
Read Write | 3 clients * 15 threads | 1:3  | 16,844  | 372   | 3,980 | 50,527 | 762   | 1,551
Read Write | 3 clients * 15 threads | 1:30 | 1,861   | 381   | 3,557 | 55,816 | 790   | 1,688
Read Write | 3 clients * 30 threads | 3:1  | 140,484 | 351   | 3,277 | 46,822 | 856   | 2,044
Read Write | 3 clients * 50 threads | 30:1 | 336,106 | 419   | 1,221 | 11,203 | 763   | 1,276

Known issues

We have upgraded the ZooKeeper client to version 3.7. When the server-side ZooKeeper version is older than this, connections may time out.

When configuring periodic manual-compaction tasks with environment variables, a calculation error may cause the task to start immediately.

New features

  • Added dynamic modification of a table's replica count, allowing the replica count to be changed at runtime
  • Support throttling of read operations
  • Support dynamically setting the queue length of different tasks
  • Support table-level read/write switches
  • Support jemalloc
  • Support the aarch64 platform
  • Support building on macOS

Java Client

  • Support the batchGetByPartitions interface, which packs get requests destined for the same partition together to improve performance
  • Support the table-creation interface createApp
  • Support the table-deletion interface dropApp

Go Client

  • Support the Bulk Load control interfaces
  • Support the Manual Compact control interfaces
  • Support the disk-level data migration interface

Admin CLI

  • Control tools for Bulk Load
  • Control tools for Manual Compact
  • Control tools for Duplication
  • Control commands for Partition Split
  • Control tools for node data migration, table migration, disk capacity balancing, etc.

Feature / performance optimizations

  • Removed the shared log and kept only the private log, simplifying the architecture and improving performance
  • Bulk load: several optimizations, including reducing the I/O load of downloading and ingesting files, improving the error-handling logic, and improving the usability of the interfaces
  • Duplication: several optimizations, including migrating historical data by itself without relying on external file systems such as HDFS, batch-sending plog to improve performance, and improving usability
  • Manual compaction: support richer query and control operations
  • Throttling: data migration, data backup and other functions now also support throttling
  • The MetaServer list supports FQDN
  • Limit the size of RocksDB logs
  • Started implementing the new metrics framework (not enabled in this release)

Code refactoring

  • Merged the rDSN subproject, the client libraries for various languages, and the CLI access and control tools into the main Pegasus repository
  • Removed the thrift-generated C++ and Java files

Bugfix

  • Fix a crash caused by multi-thread races under high traffic
  • Fix disk and network traffic amplification caused by the message body size not being set
  • Fix a crash triggered by flushing a log larger than 2 GB
  • Fix replica metadata loss on the XFS file system after a power failure
  • Fix the RocksDB "Shutdown in progress" error logged when closing a replica
  • Fix a crash caused by a '-' character in a table name when Prometheus is enabled
  • Fix the RocksDB-related recent.flush.completed.count and recent.flush.output.bytes metrics not being updated
  • Fix file names in logs being rewritten as compiler_depend.ts
  • Fix a crash caused by a timeout during one-time backup
  • Fix the replica starting normally without reporting an error when its data directory becomes empty
  • Fix the Python 3 client mishandling the str type

Infrastructure

  • Migrated the image repository to the apache/pegasus namespace on DockerHub
  • Improved and fine-tuned the GitHub workflows to make CI more stable and less time-consuming

Performance test

Test environment

  • Framework: YCSB
  • Server: DDR4 16G * 8, Intel Silver4210*2 2.20Ghz/3.20Ghz, SSD 480G * 8 SATA
  • OS: Centos7 5.4.54-2.0.4.std7c.el7.x86_64
  • Cluster: 3 * Meta Server + 5 * Replica Server
  • YCSB Client: 3 * ClientNode
  • Request Size: 1KB (set/get)

Test results

Case       | Client and threads     | R:W  | R-QPS   | R-Avg | R-P99 | W-QPS  | W-Avg | W-P99
Write Only | 3 clients * 15 threads | 0:1  | -       | -     | -     | 56,953 | 787   | 1,786
Read Only  | 3 clients * 50 threads | 1:0  | 360,642 | 413   | 984   | -      | -     | -
Read Write | 3 clients * 30 threads | 1:1  | 62,572  | 464   | 5,274 | 62,561 | 985   | 3,764
Read Write | 3 clients * 15 threads | 1:3  | 16,844  | 372   | 3,980 | 50,527 | 762   | 1,551
Read Write | 3 clients * 15 threads | 1:30 | 1,861   | 381   | 3,557 | 55,816 | 790   | 1,688
Read Write | 3 clients * 30 threads | 3:1  | 140,484 | 351   | 3,277 | 46,822 | 856   | 2,044
Read Write | 3 clients * 50 threads | 30:1 | 336,106 | 419   | 1,221 | 11,203 | 763   | 1,276

Configuration changes

+ [pegasus.server]
+ rocksdb_max_log_file_size = 8388608
+ rocksdb_log_file_time_to_roll = 86400
+ rocksdb_keep_log_file_num = 32

+ [replication]
+ plog_force_flush = false
  
- mutation_2pc_min_replica_count = 2
+ mutation_2pc_min_replica_count = 0 # 0 means the value is decided by the table's max replica count
  
+ enable_direct_io = false # whether to enable direct I/O when downloading files from HDFS; default false
+ direct_io_buffer_pages = 64 # number of pages for the direct I/O buffer; the default of 64 is the value recommended by our tests
+ max_concurrent_manual_emergency_checkpointing_count = 10
  
+ enable_latency_tracer_report = false
+ latency_tracer_counter_name_prefix = trace_latency
  
+ hdfs_read_limit_rate_mb_per_sec = 200
+ hdfs_write_limit_rate_mb_per_sec = 200
  
+ duplicate_log_batch_bytes = 0 # 0 means no batch before sending
  
+ [nfs]
- max_copy_rate_megabytes = 500
+ max_copy_rate_megabytes_per_disk = 0
- max_send_rate_megabytes = 500
+ max_send_rate_megabytes_per_disk = 0
  
+ [meta_server]
+ max_reserved_dropped_replicas = 0
+ bulk_load_verify_before_ingest = false
+ bulk_load_node_max_ingesting_count = 4
+ bulk_load_node_min_disk_count = 1
+ enable_concurrent_bulk_load = false
+ max_allowed_replica_count = 5
+ min_allowed_replica_count = 1
  
+ [task.LPC_WRITE_REPLICATION_LOG_SHARED]
+ enable_trace = true # if true, the task's latency will be traced when the global tracer is enabled

Contributors

@acelyc111 @cauchy1988 @empiredan @foreverneverer @GehaFearless @GiantKing @happydongyaoyao @hycdong @levy5307 @lidingshengHHU @neverchanje @padmejin @Smityz @totalo @WHBANG @xxmazha @ZhongChaoqiang

v2.3.0

2 years ago

Apache Pegasus 2.3.0 is a feature release. The change-list is summarized here: https://github.com/apache/incubator-pegasus/issues/818.

New Features

Partition split

The partition count of a Pegasus table is fixed when the table is created; as the table's total storage grows, a single partition may end up storing too much data, which degrades performance. Partition split makes Pegasus tables scalable: each original partition is divided into two partitions. Please check https://github.com/apache/incubator-pegasus/issues/754 for more details; the design document can be found at partition-split-design.
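
Because the partition count doubles, a key that hashed to partition i (i = hash mod N) can only land in partition i or partition i + N afterwards, which is exactly why each original partition can be divided into two. The tiny Java check below verifies that arithmetic property; it is an illustration of the doubling rule, not the split implementation itself.

public class SplitMappingCheck {
    public static void main(String[] args) {
        int oldCount = 8, newCount = 2 * oldCount;
        for (long hash = 0; hash < 1_000_000; hash++) {
            int before = (int) (hash % oldCount);
            int after = (int) (hash % newCount);
            // After the split, a key stays in its original partition or moves to that partition's child.
            if (after != before && after != before + oldCount) {
                throw new AssertionError("unexpected mapping for hash " + hash);
            }
        }
        System.out.println("every key maps to its original partition or that partition's child");
    }
}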

User-defined compaction strategy

Pegasus supports updating value TTL at table level, which is implemented with the RocksDB compaction filter. To make this function more flexible and expand its usage, we provide several compaction rules and compaction operations: users select table data through rules and apply a compaction strategy through operations. Please check https://github.com/apache/incubator-pegasus/issues/773 for more details.
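
The rule/operation pairing can be thought of as a predicate that selects records plus a transformation applied only to the selected ones (for example, dropping a record or rewriting its TTL). The sketch below shows that shape in plain Java; the Record fields and the interfaces are illustrative assumptions, not Pegasus' compaction-filter API.

import java.util.function.Predicate;
import java.util.function.UnaryOperator;

public class CompactionStrategySketch {
    static class Record {
        byte[] hashKey;
        byte[] sortKey;
        long expireTsSeconds; // 0 means no TTL
    }

    // A rule selects the records to act on; an operation transforms them
    // (returning null here stands for dropping the record during compaction).
    static Record compactOne(Record r, Predicate<Record> rule, UnaryOperator<Record> operation) {
        return rule.test(r) ? operation.apply(r) : r;
    }
}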

Cluster load balance

The Pegasus meta server triggers load balancing when the replica counts of replica servers are unbalanced. Pegasus previously provided only a table-level balance strategy: if all tables are balanced, the meta server considers the cluster balanced. In some cases, especially when a cluster has many replica servers and tables with few partitions, the cluster-wide replica distribution is still unbalanced even though every table is. Cluster load balance is a new strategy that balances replicas across the whole cluster while keeping every table balanced. Please check https://github.com/apache/incubator-pegasus/issues/761 for more details.

One time backup

Pegasus used to manage table backups with policies: users create a backup policy with information such as the start time and interval, then add tables to the policy, and those tables are backed up at the policy's start time and repeatedly at the configured interval. This is too cumbersome when a backup needs to be triggered immediately, so we now provide one-time backup and plan to remove the backup policy in future releases. Please check https://github.com/apache/incubator-pegasus/issues/755 for more details.

Optimizations and improvements

Backup request throttling

We found that backup requests can sometimes make read QPS increase rapidly, so we added a rate limiter to delay or reject the excess requests. Please check https://github.com/XiaoMi/rdsn/pull/855 for more details.

Drop timeout user request

In the current implementation, Pegasus puts requests received from clients into a queue; a request is executed only after all requests before it have been executed. Each request carries a timeout, and if its queueing time exceeds the client timeout there is no need to execute it, because the client has already treated it as timed out. In this release we support dropping such requests and add perf-counters for it. Please check https://github.com/apache/incubator-pegasus/issues/786 for more details. In our production clusters we observed that enabling this option reduces thread queue length, long-tail requests, and the likelihood of rapid memory growth.
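
Put differently, each queued request only needs to remember when it was enqueued and what the client's timeout was; if the queueing delay alone already exceeds that timeout, executing it is wasted work. A minimal sketch of that check follows (illustrative only, not the server's actual data structures):

public class QueuedRequest {
    final long enqueueMillis = System.currentTimeMillis();
    final long clientTimeoutMillis;

    QueuedRequest(long clientTimeoutMillis) {
        this.clientTimeoutMillis = clientTimeoutMillis;
    }

    // The client has already given up on this request,
    // so the server can drop it instead of executing it.
    boolean shouldDrop(long nowMillis) {
        return nowMillis - enqueueMillis > clientTimeoutMillis;
    }
}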

Disk protection

When a replica server's disks are all full, any write request will fail, and the server instance may even generate a coredump. To avoid such a situation, we provide an option to configure a disk space threshold: if the available disk space falls below this value, all user write requests are rejected. Besides, we provide a broken-disk check when starting a replica server and support adding a new disk path dynamically. Please check https://github.com/apache/incubator-pegasus/issues/787 and https://github.com/apache/incubator-pegasus/issues/788 for more details.
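
A sketch of the disk-space guard described above, using the JDK's usable-space query; the 10% threshold in the example is an arbitrary illustration, not Pegasus' default value.

import java.io.File;

public class DiskSpaceGuard {
    // Reject writes destined for this data directory once its free-space ratio
    // drops below the configured threshold.
    static boolean rejectWrites(File dataDir, double minAvailableRatio) {
        double availableRatio = (double) dataDir.getUsableSpace() / dataDir.getTotalSpace();
        return availableRatio < minAvailableRatio;
    }

    public static void main(String[] args) {
        System.out.println(rejectWrites(new File("/tmp"), 0.10) ? "reject writes" : "accept writes");
    }
}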

Read enhancement

In the previous implementation, all read operations were executed in one thread pool. We now separate single reads and range reads into different thread pools, so that a slow range read cannot block all read requests. We also improved the handling of the range-read iteration count. Please check https://github.com/apache/incubator-pegasus/pull/782 and https://github.com/apache/incubator-pegasus/issues/808 for more details.
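
The separation amounts to two dedicated executors, so a slow range read can only exhaust its own pool. A minimal Java sketch of that layout follows; the pool sizes are arbitrary example values, not Pegasus' thread-pool configuration.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReadThreadPools {
    // Point reads (get/multi_get of exact keys) and range reads (scans) no longer
    // share a pool, so a slow scan cannot block point reads queued behind it.
    static final ExecutorService SINGLE_READ_POOL = Executors.newFixedThreadPool(8);
    static final ExecutorService RANGE_READ_POOL  = Executors.newFixedThreadPool(4);

    static void submitPointRead(Runnable read) { SINGLE_READ_POOL.submit(read); }
    static void submitRangeRead(Runnable scan) { RANGE_READ_POOL.submit(scan); }
}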

New perf-counters

In this release we also added many new perf-counters to help manage and monitor the system, such as table-level compaction counters, table-level RocksDB read/write amplification and hit-rate counters, replica server session counters, etc. Please reference https://github.com/apache/incubator-pegasus/issues/818 for all counters.

Please reference https://github.com/apache/incubator-pegasus/issues/818 for all enhancements.

Fixed issues

Duplication was released in previous versions and works well in most normal cases, but in some corner cases, especially around error handling, it still had bugs. In this release we fix several duplication-related bugs to make the function more robust.

Release 2.2.0 has a known issue: shutting down a Pegasus instance that had connected to HDFS for bulk load or backup is very likely to produce a coredump. In this release we fix the related bugs.

Asan bug fix

In this release, we fix many bugs detected by ASan, including potential memory leaks.

Please reference https://github.com/apache/incubator-pegasus/issues/818 for all fix lists.

Performance

We use YCSB to get the following performance test result. We deploy a test cluster with two meta servers, five replica servers and one collector server. The target table has 64 partitions.

Case       | Client and threads     | R:W  | R-QPS  | R-Avg | R-P99 | W-QPS | W-Avg | W-P99
Write Only | 3 clients * 15 threads | 0:1  | -      | -     | -     | 42386 | 1060  | 6628
Read Only  | 3 clients * 50 threads | 1:0  | 331623 | 585   | 2611  | -     | -     | -
Read Write | 3 clients * 30 threads | 1:1  | 38766  | 1067  | 15521 | 38774 | 1246  | 7791
Read Write | 3 clients * 15 threads | 1:3  | 13140  | 819   | 11460 | 39428 | 863   | 4884
Read Write | 3 clients * 15 threads | 1:30 | 1552   | 937   | 9524  | 46570 | 930   | 5315
Read Write | 3 clients * 30 threads | 3:1  | 93746  | 623   | 6389  | 31246 | 996   | 5543
Read Write | 3 clients * 50 threads | 30:1 | 254534 | 560   | 2627  | 8481  | 901   | 3269

Known issues

Use the drop-timeout-user-requests feature carefully. In our heavy-throughput clusters, after enabling this option we found that the replica server might occasionally generate a coredump. The same coredump had happened in previous versions and is recorded in https://github.com/apache/incubator-pegasus/issues/387. There is currently no evidence linking the coredump to this feature, but we note it here based on our observations.

Upgrading Notes

2.3.0 can only be upgraded from 2.x. Servers running 1.x should first upgrade to 2.0.x before upgrading to 2.3.0.

Contributors

Thanks to everyone who contributed to Apache Pegasus (incubating) 2.3.0: acelyc111 cauchy1988 empiredan hycdong levy5307 lidingshengHHU neverchanje padmejin foreverneverer Smityz zhangyifan27 ZhongChaoqiang


New features

Partition split

The partition count of a Pegasus table is fixed after the table is created; as the table's data grows, a single partition may unexpectedly store too much data, which can hurt performance. Partition split expands the partition count of a table by dividing each partition into two, with little impact on reads and writes. See https://github.com/apache/incubator-pegasus/issues/754 for more details, or the design document partition-split-design for more design details.

User-defined compaction strategy

Pegasus supports table-level TTL, implemented with RocksDB's compaction filter. To let users modify TTL more flexibly and extend the compaction function, we developed user-defined compaction strategies: users can configure compaction operations to carry out different strategies, and use compaction rules to select the table data those operations apply to. See https://github.com/apache/incubator-pegasus/issues/773 for more details.

Cluster load balance

When the replica counts on replica servers are unbalanced, the meta server triggers load balancing. Pegasus previously provided only a table-level balancing strategy: if every table in the cluster is balanced across the replica servers, the meta server considers the whole cluster balanced. However, in some scenarios, especially clusters with many replica servers and many tables with few partitions, the whole cluster may still be unbalanced even if every table is. In this release we added cluster-level load balancing, which balances the replica count across the whole cluster without breaking per-table balance. See https://github.com/apache/incubator-pegasus/issues/761 for more details.

One-time cold backup

Previously, Pegasus managed table backups with backup policies: a user creates a backup policy containing the start time, backup interval and other information, then adds tables to the policy, and those tables are backed up at the configured time and periodically thereafter. We have now added a one-time backup feature that starts a backup immediately, and we plan to remove the complex backup-policy feature later. See https://github.com/apache/incubator-pegasus/issues/755 for more details.

Optimizations and improvements

Backup request throttling

We found that backup requests can surge in some scenarios and affect performance, so we added throttling for backup requests. See https://github.com/XiaoMi/rdsn/pull/855 for more details.

Drop user requests that time out while queued

After the server receives a request it first puts it into a queue, and the request is scheduled for execution only after the requests ahead of it have been executed. If a request's queueing time already exceeds the client's timeout, the client has given up on it and the server no longer needs to execute it. In this release we support this configuration and added corresponding perf-counters. See https://github.com/apache/incubator-pegasus/issues/786 for more details. In production we found that enabling this option reduces thread queue length and long-tail requests, and mitigates sudden memory growth.

Disk protection features

If a replica server runs out of disk space, user writes fail and the process may also coredump. To avoid this, we added a disk-space threshold configuration: if a replica server's available disk space falls below this value, all user write requests are rejected. We also added a broken-disk check at startup and support for dynamically adding new disks. See https://github.com/apache/incubator-pegasus/issues/787 and https://github.com/apache/incubator-pegasus/issues/788 for more details.

Read optimizations

Previously all read requests were executed in the same thread pool; in this release we split single reads and range reads into separate thread pools. We also fixed and improved issues such as the inaccurate range-read iteration count. See https://github.com/apache/incubator-pegasus/pull/782 and https://github.com/apache/incubator-pegasus/issues/808 for more details.

New perf-counters

In this release we added many counters to make the cluster easier to manage and observe, such as table-level compaction counters, table-level RocksDB read/write amplification and hit-rate counters, and replica server session-count counters. See https://github.com/apache/incubator-pegasus/issues/818 for all new perf-counters.

For more optimizations and improvements, see: https://github.com/apache/incubator-pegasus/issues/818

Fixed issues

Duplication (hot backup) bug fixes

Duplication was released in earlier versions and works in most scenarios, but it still had bugs in some error-handling cases. In this release we fixed several duplication-related bugs to make the feature more robust.

Graceful exit bug fixes

Version 2.2.0 has a known issue: shutting down a Pegasus process that has connected to HDFS is very likely to coredump. In this release we fixed these graceful-exit bugs.

Asan bug fix

In this release we fixed several bugs found by ASan, such as potential memory leaks.

The full bug-fix list is available at: https://github.com/apache/incubator-pegasus/issues/818

Performance test results

We deployed a test cluster with 2 meta servers, 5 replica servers and 1 collector; the table under test has 64 partitions. YCSB was used for the performance test. The detailed results are as follows:

Case       | Client and threads     | R:W  | R-QPS  | R-Avg | R-P99 | W-QPS | W-Avg | W-P99
Write Only | 3 clients * 15 threads | 0:1  | -      | -     | -     | 42386 | 1060  | 6628
Read Only  | 3 clients * 50 threads | 1:0  | 331623 | 585   | 2611  | -     | -     | -
Read Write | 3 clients * 30 threads | 1:1  | 38766  | 1067  | 15521 | 38774 | 1246  | 7791
Read Write | 3 clients * 15 threads | 1:3  | 13140  | 819   | 11460 | 39428 | 863   | 4884
Read Write | 3 clients * 15 threads | 1:30 | 1552   | 937   | 9524  | 46570 | 930   | 5315
Read Write | 3 clients * 30 threads | 3:1  | 93746  | 623   | 6389  | 31246 | 996   | 5543
Read Write | 3 clients * 50 threads | 30:1 | 254534 | 560   | 2627  | 8481  | 901   | 3269

Known issues

Please enable the drop-timed-out-user-requests feature with caution. In our high-throughput production environment, with this feature enabled we occasionally observed the coredump described in https://github.com/apache/incubator-pegasus/issues/387. This bug also occurred, with very low probability, in earlier versions; we have not yet established a connection between the coredump and the new option, but we note it here based on what we observed.

Upgrading notes

2.3.0 can only be upgraded from 2.x. If the current server version is 1.x, you must first upgrade to 2.0.x before upgrading to 2.3.0.

Acknowledgements

Thanks to all developers who contributed to Apache Pegasus (incubating) 2.3.0: acelyc111 cauchy1988 empiredan hycdong levy5307 lidingshengHHU neverchanje padmejin foreverneverer Smityz zhangyifan27 ZhongChaoqiang

v2.2.0

2 years ago

Apache Pegasus 2.2.0 is a feature release. The change-list is summarized here: https://github.com/apache/incubator-pegasus/issues/696.

Upgrading Notes

  • 2.2.0 can only be upgraded from 2.0.x or 2.1.x. ReplicaServers on earlier versions should first upgrade to 2.0.x or 2.1.x before upgrading to 2.2.0.

New Features

Modification of configs in runtime

Previously, modifying a configuration required a rolling update of all servers to load the new config file, which was particularly cumbersome when operating the service. This feature now allows us to dynamically update a config item through an HTTP API without service downtime. Please check XiaoMi/rdsn#719, XiaoMi/rdsn#704, XiaoMi/rdsn#682 for more details.
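
Conceptually this is one HTTP call per configuration item against the server's admin HTTP port. The sketch below only illustrates that shape; the host, port, endpoint path and parameter name are placeholders rather than the documented API, so consult the rDSN pull requests above for the real interface.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class UpdateConfigSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and parameter names -- see the linked rDSN PRs for the real API.
        URI uri = URI.create("http://replica-server.example:8080/updateConfig?some_config_name=new_value");
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(HttpRequest.newBuilder(uri).GET().build(), HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.statusCode() + " " + resp.body());
    }
}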

Hot-key detection

A hotspot workload is dangerous because it continuously attacks the availability of a single partition. Previously, we had to search the slow log (abnormal reads) on the servers to find the hot key, usually the key that appears most frequently in the logs. This job can now be automated: when a hot partition is confirmed, we send a "hot-key detection RPC" to the corresponding ReplicaServer, which triggers analysis of the incoming requests to that partition. The analysis ends when it finds the hot key or the process times out. Related issue: apache/incubator-pegasus#495
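
The analysis itself is conceptually simple: sample the keys of incoming requests for a while and report the one that dominates the sample. The self-contained sketch below shows that counting step only; the dominance-ratio threshold and sampling approach are illustrative assumptions, and the real detector is described in the linked issue.

import java.util.HashMap;
import java.util.Map;

public class HotKeyCounter {
    private final Map<String, Long> counts = new HashMap<>();
    private long total = 0;

    void record(String hashKey) {
        counts.merge(hashKey, 1L, Long::sum);
        total++;
    }

    // Report a hot key once a single key accounts for more than the given share
    // of the sampled requests; return null if no key dominates this window.
    String findHotKey(double dominanceRatio) {
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (total > 0 && (double) e.getValue() / total > dominanceRatio) {
                return e.getKey();
            }
        }
        return null;
    }
}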

Rate-limiting of reads

Pegasus supported throttling of writes but not of reads. This version introduces QPS-based throttling of reads.

Disk-migrator

Support migrating replicas between two disk volumes on one ReplicaServer node to balance disk capacity. The admin-cli tool supports this feature.

Optimizations and improvements

Support HDFS as a remote storage provider of Bulk-Load/Backup/Restore

For most of our bulk-load use cases, HDFS is the default remote storage for the Spark-generated files. In this version we add support for HDFS, along with rate limiting of HDFS downloads and uploads.

HTTP APIs of the Bulk-Load procedure

Bulk Load in version 2.1.0 was relatively primitive. In this version we make the feature full-featured with a list of APIs that help automate the procedure. (At Xiaomi we have developed a tool called BulkLoad Manager that simplifies the management of tasks; it utilizes the newly exposed APIs, and we plan to open-source this project soon.)

Fixed Issues

  • From this version, we have support for various C++ compilers, including:
    • GCC 5.4.0 (ubuntu1604)
    • GCC 7.5.0 (ubuntu1804)
    • GCC 9.4.0 (ubuntu2004)
    • Clang9
    • Clang10
  • The continuous build tests are here: https://github.com/pegasus-kv/pegasus-docker

Known Issues

  • We have found that shutting down a PegasusServer that has previously run bulk load is highly likely to coredump. This usually does not cause serious problems (as long as the server you shut down has no active replicas), but users should be aware of this issue before putting bulk load into production.

Apache Pegasus 2.2.0 is a feature release. All of the changes are summarized at: https://github.com/apache/incubator-pegasus/issues/696

Upgrading notes

2.2.0 can only be upgraded from 2.0.x or 2.1.x. ReplicaServers on earlier versions must first upgrade to 2.0.x or 2.1.x.

New features

Runtime configuration modification

To modify a configuration item, we used to have to restart every server node in the cluster so that the new config file would be loaded, which caused considerable operational trouble. Now we can use this feature to modify configuration items dynamically through an HTTP API without affecting the service. See XiaoMi/rdsn#719, XiaoMi/rdsn#704, XiaoMi/rdsn#682 for more details.

Hot-key detection

Because hotspot traffic continuously affects the availability of a single partition, it is a very dangerous problem. Previously, when we ran into it, we searched the slow-query log (abnormal reads) on the server nodes for the most frequently appearing key, which is usually the hot key. This process can now be automated: once a partition is confirmed to be hot, we can send a "hot-key detection RPC" to the corresponding ReplicaServer, which triggers hotspot analysis of the requests to that partition. The analysis ends once the hot key is found or the process times out. Related issue: apache/incubator-pegasus#495 Documentation: http://pegasus.apache.org/administration/hotspot-detection

Read throttling

Pegasus had write throttling but no read throttling. In this version we introduce QPS-based read throttling.

Disk data migration

Support migrating partition data between disks to balance capacity usage across the disks of one machine; users can use the admin-cli tool to use this feature.

Optimizations and improvements

Support HDFS for BulkLoad/Backup/Restore

In the vast majority of our BulkLoad use cases, HDFS is the default storage for Spark-generated files. In this version we added HDFS support, along with throttling of HDFS file uploads and downloads.

HTTP APIs for the Bulk Load procedure

Bulk Load in 2.1.0 was relatively primitive. In this version we made the feature much more complete by providing a series of APIs that can be used to automate the BulkLoad procedure. (At Xiaomi we have developed the BulkLoad Manager operations tool, which simplifies BulkLoad task management and uses the new APIs introduced in this version. We plan to open-source it soon.)

Fixed issues

  • Since this version, we provide support for multiple compilers, including:
    • GCC 5.4.0 (ubuntu1604)
    • GCC 7.5.0 (ubuntu1804)
    • GCC 9.4.0 (ubuntu2004)
    • Clang9
    • Clang10
  • Continuous build tests are here: https://github.com/pegasus-kv/pegasus-docker

Known issues

  • We found that shutting down a PegasusServer that has run BulkLoad tasks can easily coredump. This usually does not cause serious problems (as long as the PegasusServer being shut down is not serving any replica), but users who want to put BulkLoad into production should be aware of this issue.

Thanks to everyone who contribute to Apache Pegasus (incubating) 2.2.0, they are: @Shuo-Jia @Smityz @Sunt-ing @ZhongChaoqiang @acelyc111 @empiredan @fredster33 @hycdong @levy5307 @neverchanje @vongosling @zhangyifan27

v1.12-comp-with-2.0

3 years ago

NOTE:

  1. Do not use this version except in the situation described in note 2
  2. This version is ONLY for temporarily downgrading from v2.0.0, in case serious unexpected bugs occur after upgrading from v1.x and the servers cannot keep running on v2.0.0
  3. This version is forward compatible with v2.0.0
  4. This version is NOT backward compatible with v1.12.3 or older versions

What's new in this version

  • Read/Write pegasus value in new data version (#459)
  • Bump rocksdb to v6.6.4 (#473)
  • Support to open rocksdb with or without meta CF (#549)

v2.0.0

3 years ago

Release Note

NOTE: 2.0.0 is backward-compatible only, which means servers upgraded to this version can't be rolled back to previous versions.

To see the detailed release information, please refer to https://github.com/XiaoMi/pegasus/issues/536.

The following are the highlights in this release:

Duplication

Duplication is Pegasus' solution for cross-cluster data replication in real time. We currently limit master-master duplication to 'PUT' and 'MULTI_PUT' operations only. See this document for more details: https://pegasus-kv.github.io/administration/duplication.

Backup Request

Backup Request is a way to reduce tail latency by sacrificing a small amount of consistency: the client falls back to reading from a random secondary when the primary read fails to finish within the expected time. See the discussion here: https://github.com/XiaoMi/pegasus/issues/251.
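
The mechanism is a hedged read: if the primary has not answered within a configured delay, send the same read to a secondary and take whichever result arrives first. The Java sketch below illustrates that timing pattern with CompletableFuture; the reader functions and the delay are placeholders, not the Pegasus client API.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class BackupRequestRead {
    static <T> T hedgedRead(Supplier<T> readPrimary, Supplier<T> readSecondary,
                            long backupDelayMillis) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            CompletableFuture<T> primary = CompletableFuture.supplyAsync(readPrimary, pool);
            try {
                // Fast path: the primary answers before the backup-request delay expires.
                return primary.get(backupDelayMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException slowPrimary) {
                // Hedge: also read from a random secondary and return whichever finishes first.
                CompletableFuture<T> secondary = CompletableFuture.supplyAsync(readSecondary, pool);
                return (T) CompletableFuture.anyOf(primary, secondary).get();
            }
        } finally {
            pool.shutdown();
        }
    }
}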

RocksDB Meta CF

Pegasus currently uses a hacked version of RocksDB that stores some metadata in the MANIFEST file, which makes our RocksDB incompatible with the official version. In this version, we use an additional column family (called 'Meta CF') to store that metadata.

To finally get rid of the legacy RocksDB, you must first upgrade the ReplicaServer to 2.0.0.

Bloom Filter Optimization

This release adds metrics for the utilization of bloom filters in Pegasus, and for critical scenarios we provide configurations for tuning bloom filter performance. See https://github.com/XiaoMi/pegasus/pull/522, https://github.com/XiaoMi/pegasus/pull/521.

Cold-Backup FDS Limit

This feature adds throttling on download and upload during cold-backup. See https://github.com/XiaoMi/rdsn/pull/443.

Adding Node Optimization

We previously suffered from the effects of data migration when adding one or more nodes to a cluster. In latency-critical scenarios (here we mostly focus on read latency), the 3-10x increase in latency usually means the service is briefly unavailable.

In 2.0.0 we support a strategy where the newly added nodes do not serve read requests until most of the migration is done. Although the new nodes still participate in the write 2PC and the overall migration workload does not decrease, read latency improves significantly thanks to this change.

Be aware that this feature merely requires pegasus-tools to be 2.0.0; you don't have to upgrade the server to 2.0.0. See https://github.com/XiaoMi/pegasus/pull/528.

v1.12.3

4 years ago

Release Notes of v1.12.3

The following are the highlights in this release:

New Hotspot Algorithm Support

A new hotspot detection algorithm is supported, which brings the following improvements:

  1. It defines the concept of a threshold and is ready for automatic monitoring.
  2. It has higher stability.

Thanks to @Smityz .

Related PR:

Optimization

Several updates optimize memory usage and reduce performance overhead.

Related PR:

Upgrade from the previous version

[replication]
- allow_non_idempotent_write = false
+ max_allowed_write_size = 1048576 # default = 1MB
+ cold_backup_checkpoint_reserve_minutes = 10

[pegasus.server]
# cluster level restriction {3000, 30MB, 1000, 30s}
+ rocksdb_multi_get_max_iteration_count = 3000
+ rocksdb_multi_get_max_iteration_size = 31457280
+ rocksdb_max_iteration_count = 1000
+ rocksdb_iteration_threshold_time_ms = 30000
+ rocksdb_use_direct_reads = true
+ rocksdb_use_direct_io_for_flush_and_compaction = true
+ rocksdb_compaction_readahead_size = 2097152
+ rocksdb_writable_file_max_buffer_size = 1048576
- perf_counter_cluster_name = %{cluster.name}
- perf_counter_enable_falcon = false
- perf_counter_enable_prometheus = false
+ perf_counter_sink = <falcon | prometheus>

[pegasus.collector]
- cluster = %{cluster.name}
+ hotspot_detect_algorithm = <hotspot_algo_qps_variance | hotspot_algo_qps_skew> 

[replication]
+ cluster_name = %{cluster.name}

v1.12.2

4 years ago

Release Notes of v1.12.2

This patch release mainly provides performance improvements for scan/multiget operations, and also provides more HTTP interfaces.

What's new in this version

  • feat(shell): support resolve IP address in dsn/utility (XiaoMi/rdsn#337)
  • fix(meta): bind task_tracker for each task in meta server (XiaoMi/rdsn#344)
  • fix: fix the bug that threads don't stop when shared_io_service is released (XiaoMi/rdsn#350)
  • feat: add http interface to get perf counter info (XiaoMi/rdsn#349)
  • fix: fix meta response state to client during configuration query (XiaoMi/rdsn#354)
  • feat(http): add HTTP interfaces to query backup info (XiaoMi/rdsn#342)
  • fix(http): add uri decoder for http_server (XiaoMi/rdsn#357)
  • feat(http): print help info on root path of the http server (XiaoMi/rdsn#367)
  • feat(shell): support resolve IP address in some shell commands (#426)
  • feat(collector): add statistics for estimate key number of table (#437)
  • feat: add http_server for info collector (#442)
  • feat(rocksdb): Adapt prefix bloom filter to speedup scans by hashkey (#438)

Upgrade from the previous version

The prefix bloom filter is enabled by default; it improves scan performance and the data format remains compatible. However, you can disable it with:

[pegasus.server]
  rocksdb_filter_type = common

v1.12.1

4 years ago

Release Notes of v1.12.1

This is a patch release. We strongly recommend upgrading to this version instead of using v1.12.0.

What's new in this version

  • fix: fix log bug in fmt logging (XiaoMi/rdsn#346)
  • feat: optimize tcmalloc release memory (XiaoMi/rdsn#343)
  • feat: update config for tcmalloc release memory optimization (#433)
  • feat: add a interface to get perf-counters info of all partitions of all apps (#417)
  • feat(collector): add statistics for estimate key number of partition (#435)

Upgrade from the previous version

[replication]
- mem_release_interval_ms
+ mem_release_check_interval_ms = 3600000
+ mem_release_max_reserved_mem_percentage = 10