Shared memory is the fastest inter-process communication (IPC) method in Linux, but it defines neither how the communicating processes synchronize nor how the shared memory itself is managed, and there is little information about this in the open source community. This article defines, from a performance perspective, a process synchronization mechanism, a shared memory management mechanism, and an error fallback mechanism, forming a shared-memory IPC solution that is both high-performance and suitable for production environments.
As shown in the above figure, when the Client process and the Server process communicate for the first time, they need to establish a TCP connection or a Unix Domain Socket connection, and then complete the protocol initialization through this connection. Initialization includes steps such as creating and mapping the shared memory. After initialization is complete, full-duplex communication can be achieved through shared memory.
The relationship between the Client process and the Server process is N:N. A Client process can connect to multiple Server processes, and a Server process can also connect to multiple Client processes.
The threads of Client processes and Server processes can likewise have an N:N relationship.
The communication protocol defines the message format transmitted over TCP connections or Unix Domain Socket connections.
The definition of a message passed between a Client process and a Server process over a connection is as follows:
| Field | Length | Description |
|---|---|---|
| length | 4 bytes | Total length of the message, including the header. |
| magic | 2 bytes | The magic number is used to identify the protocol itself and is fixed as 0x7758. |
| version | 1 byte | The protocol version number, used for subsequent iterative updates. |
| type | 1 byte | Message type. |
| payload | variable | The message body is defined by the message type; its length and format are customizable. |
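To make the header layout concrete, here is a minimal encode/decode sketch of the 8-byte frame header. The spec above does not state byte order, so big-endian (network byte order) is an assumption here:

```python
import struct

MAGIC = 0x7758  # fixed magic number from the protocol definition

def encode_message(msg_type: int, version: int, payload: bytes) -> bytes:
    """Encode one frame: 4-byte length (header included), 2-byte magic,
    1-byte version, 1-byte type, then the payload."""
    length = 8 + len(payload)
    return struct.pack(">IHBB", length, MAGIC, version, msg_type) + payload

def decode_message(data: bytes):
    """Decode one frame and return (type, version, payload)."""
    length, magic, version, msg_type = struct.unpack(">IHBB", data[:8])
    if magic != MAGIC:
        raise ValueError("not a recognized protocol frame")
    return msg_type, version, data[8:length]
```

A frame carrying a 2-byte payload is therefore 10 bytes in total.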
| Type | Description |
|---|---|
| ExchangeMetadata | Protocol negotiation exchanges metadata, including the list of currently supported features. For more details, please refer to Chapter 4: Protocol Initialization. |
| ShareMemoryByFilePath | Shared memory is mapped through a file path. For more details, please refer to Chapter 4: Protocol Initialization. |
| ShareMemoryByMemfd | Shared memory is mapped through a memfd. For more details, please refer to Chapter 4: Protocol Initialization. |
| AckReadyRecvFD | The receiver is prepared to receive the memfd. For more details, please refer to Chapter 4: Protocol Initialization. |
| AckShareMemory | Shared memory mapping is complete. For more details, please refer to Chapter 4: Protocol Initialization. |
| SyncEvent | A synchronization event used to notify the peer process to handle new data. For more details, please refer to Section 6.1: Process Synchronization. |
| FallbackData | When shared memory is insufficient, data is sent through the connection. For more details, please refer to Chapter 7: Error Recovery. |
| HotRestart | Hot upgrade. For more details, please refer to Chapter 8: Hot Upgrade. |
| HotRestartAck | Hot upgrade is complete. For more details, please refer to Chapter 8: Hot Upgrade. |
The encoding format of u16str is:

`| string length (2 bytes) | string body |`
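A minimal sketch of the u16str encoding, assuming big-endian length and UTF-8 for the string body (neither is fixed by the text above):

```python
import struct

def encode_u16str(s: str) -> bytes:
    """2-byte length prefix followed by the string body."""
    body = s.encode("utf-8")  # assumption: the spec does not fix the body encoding
    return struct.pack(">H", len(body)) + body

def decode_u16str(data: bytes):
    """Return (string, number of bytes consumed)."""
    (n,) = struct.unpack(">H", data[:2])
    return data[2:2 + n].decode("utf-8"), 2 + n
```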
| Type | Payload |
|---|---|
| ExchangeMetadata | The metadata payload uses the JSON format, which provides sufficient extensibility. |
| ShareMemoryByFilePath | Queue path (u16str). |
| ShareMemoryByMemfd | Queue path (u16str). |
| FallbackData | Metadata (8 bytes). |
| HotRestart | The behavior of hot upgrade is related to the programming interface and is not constrained at the protocol layer. |
The currently defined feature list:
Protocol initialization establishes the connection between the Client and the Server and then maps shared memory through this connection; it consists of three parts.
Shared memory is created and initialized by the Client process, and is shared with the Server process through a connection. The subsequent management involves three issues:
This chapter describes a simple and efficient mechanism that achieves allocation, reclamation, and cleanup of shared memory across processes with O(1) complexity. The core idea is to divide the contiguous shared memory area into multiple slices and organize them as a linked list: allocation takes from the head of the list, and reclamation appends to the tail.
The BufferManager is a continuous shared memory area that contains a Header and multiple BufferLists. Each BufferList contains multiple continuous fixed-length buffers.
| Field | Description |
|---|---|
| List size | Indicates how many BufferLists the BufferManager is managing. |
| Version | The version number of the data structure, which identifies the layout of the data structure itself in shared memory. Different versions may have different layouts. |
| Remain length | Indicates how much shared memory remains after accounting for the 8-byte Header. |
| Field | Description |
|---|---|
| size | The number of free, unallocated BufferSlices in the BufferList. |
| capacity | The maximum number of BufferSlices that the BufferList can hold. |
| head | The offset of the head node relative to the start address of the shared memory; BufferSlices are allocated from here. |
| tail | The offset of the tail node relative to the start address of the shared memory; BufferSlices are reclaimed here. |
| specification | The capacity of each BufferSlice, in bytes. |
| push count | A counter for the total number of times BufferSlices have been reclaimed into the BufferList, used for monitoring. |
| pop count | A counter for the total number of times BufferSlices have been allocated from the BufferList, used for monitoring. |
| Field | Description |
|---|---|
| | The actual amount of data contained within the BufferSlice. |
| | The maximum amount of data that can be contained within the BufferSlice. |
| | The offset of the BufferSlice relative to the start address of the shared memory. |
| | A pointer to the next BufferSlice. |
| | A two-byte field containing 16 individual flags; the first flag is … |
| | Two unused bytes that are not currently defined. |
By mapping a continuous shared memory region, we can obtain three data structures: BufferManager, BufferList, and BufferSlice. With BufferList, we can easily allocate and recycle shared memory across processes.
Starting from an example, let's walk through the process of allocating and reclaiming a BufferSlice.
Assumption: There are three slices in the BufferList, and each slice has a capacity of 1 KB.
In the BufferListHeader, head points to the first BufferSlice in the list, tail points to the last BufferSlice (BufferSlice 3), and size is 3.
At this point, in the BufferListHeader, head points to the second BufferSlice in the list, tail points to the last BufferSlice (BufferSlice 3), and size is now 2. BufferSlice 1 has been allocated and is no longer pointing to BufferSlice 2.
At this point, in the BufferListHeader, head points to the second BufferSlice in the list, tail points to the first BufferSlice (BufferSlice 1), and size is now 3. BufferSlice 3 now points to BufferSlice 1, which has been recycled.
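The walkthrough above can be modeled in a few lines. This is a single-process sketch in which Python references stand in for the shared-memory offsets that the real structure stores (the cross-process atomic updates are covered in the race-condition discussion below):

```python
class BufferSlice:
    def __init__(self, offset, capacity):
        self.offset = offset     # offset relative to the shared memory start
        self.capacity = capacity # slice capacity in bytes
        self.next = None         # stands in for the next-slice offset

class BufferList:
    """Free list model: allocate from the head, reclaim to the tail, both O(1)."""
    def __init__(self, slices):
        for a, b in zip(slices, slices[1:]):
            a.next = b                       # link the slices into a list
        self.head, self.tail = slices[0], slices[-1]
        self.size = len(slices)

    def allocate(self):
        if self.size == 0:
            return None                      # exhausted: caller must fall back
        s, self.head = self.head, self.head.next
        s.next = None                        # detach the allocated slice
        self.size -= 1
        return s

    def reclaim(self, s):
        if self.size == 0:
            self.head = self.tail = s        # list was fully drained
        else:
            self.tail.next = s               # append the freed slice at the tail
            self.tail = s
        self.size += 1
```

Replaying the example: with three 1 KB slices, one `allocate()` leaves size 2 with head at BufferSlice 2; reclaiming BufferSlice 1 makes it the new tail and restores size 3.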
Regarding race conditions:
Assuming there are N BufferSlices, allocation and reclamation are O(1) operations in themselves, and concurrent conflicts are rare. Each conflict costs one extra CAS (Compare-And-Swap) retry, so in the worst case a single allocation or reclamation performs a number of CAS operations roughly equal to the number of BufferSlices, i.e. O(N).
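The retry loop can be sketched as follows. `AtomicCell` here merely models a hardware CAS on the head offset within a single process; real code would use the platform's atomic instructions on the word in shared memory:

```python
class AtomicCell:
    """Models one atomic word in shared memory."""
    def __init__(self, value):
        self.value = value

    def compare_and_swap(self, expected, new):
        # Succeeds only if no concurrent writer changed the cell in between.
        if self.value == expected:
            self.value = new
            return True
        return False

def pop_head(head, next_of):
    """Lock-free allocation: retry the CAS on the head offset until it lands.
    Each failed CAS corresponds to one concurrent conflict."""
    while True:
        old = head.value
        if old is None:
            return None                       # free list exhausted
        if head.compare_and_swap(old, next_of[old]):
            return old                        # detached the old head
```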
After a process exits, the cleanup of shared memory depends on the way the shared memory was initially mapped.
The client and server processes require some form of synchronization mechanism to ensure that when one party writes data, the other party is notified and processes it. This chapter describes an efficient synchronization mechanism that reduces latency and improves performance under high-concurrency requests, while introducing no additional latency for low-frequency requests.
The connection established between the client and server is used to send SyncEvent (defined in section 3.2) for inter-process notification. When the sending party writes data to shared memory, it notifies the other party to process it via SyncEvent.
The IO queue is a data structure mapped in shared memory used to describe the position and other metadata of IPC data in shared memory for easy reading by the other party.
QueueHeader is used to describe and maintain the IO queue.
| Field | Description |
|---|---|
| capacity | The maximum number of Events that the IO queue can hold. |
| head/tail | Monotonically increasing counters used to maintain dequeue/enqueue positions. |
| flags | Each flag occupies 1 bit; currently only the Working flag is defined, which is described in detail in Section 6.4. |
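A sketch of how monotonically increasing head/tail counters drive a ring buffer: the slot index is the counter modulo capacity, so the counters never need to wrap or reset (this is a single-process model; the real queue lives in shared memory and updates the counters atomically):

```python
class IOQueue:
    """Ring-buffer model with monotonically increasing head/tail counters."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.head = 0                    # next event to dequeue
        self.tail = 0                    # next free slot to enqueue into
        self.slots = [None] * capacity

    def enqueue(self, event):
        if self.tail - self.head == self.capacity:
            return False                 # full: the writer must wait (backpressure)
        self.slots[self.tail % self.capacity] = event
        self.tail += 1
        return True

    def dequeue(self):
        if self.head == self.tail:
            return None                  # empty
        event = self.slots[self.head % self.capacity]
        self.head += 1
        return event
```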
Event is used to describe the data for IPC communication.
Sections 6.1 and 6.2 described how data is read from shared memory and how events are notified between processes. This section describes the end-to-end flow of a single communication.
The process of the server replying to the client is similar to the above, with the only difference being that the IO Queues are isolated: the Client's SendQueue is the Server's ReceiveQueue, and the Server's SendQueue is the Client's ReceiveQueue.
In a concurrent online scenario, the client may have multiple requests in flight, each corresponding to an IO Event. The first request sets the workingFlag to 1 and sends a SyncEvent. Until the server finishes processing the IO Queue, all subsequent requests sent by the client can see from the workingFlag being 1 that the server is already processing IO Events, so they do not need to send a SyncEvent of their own.
In this way, a single SyncEvent drives the batch harvesting of IO events.
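The flag logic can be sketched as below. This is a single-process model (the plain `deque` stands in for the shared-memory IO queue, and the counter stands in for SyncEvents written to the connection); the real implementation keeps the Working flag in the shared QueueHeader and updates it atomically:

```python
from collections import deque

class Notifier:
    """Model of the working-flag optimization from Section 6.4."""
    def __init__(self):
        self.queue = deque()        # stands in for the shared-memory IO queue
        self.working = 0            # the Working flag in the QueueHeader
        self.sync_events_sent = 0   # SyncEvents written to the connection

    def send(self, event):
        self.queue.append(event)
        if self.working == 0:       # peer is idle: flip the flag and wake it
            self.working = 1
            self.sync_events_sent += 1

    def drain(self):
        """Peer side: harvest every queued event, then clear the flag."""
        batch = []
        while self.queue:
            batch.append(self.queue.popleft())
        self.working = 0            # the next request must notify again
        return batch
```

Three back-to-back requests cost only one SyncEvent; after a drain, the next request notifies again.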
This chapter describes how to handle the scenario where there is not enough shared memory available.
When shared memory is depleted, new data is no longer passed through shared memory; instead, it is sent over the established TCP loopback or Unix Domain Socket connection between Client and Server. Because the socket buffer of a single connection is limited, this also provides backpressure that relieves pressure on the Server.
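The fallback decision can be sketched as a simple two-branch send path. The `allocate()`, `write()`, and `send_fallback()` interfaces here are assumptions for this sketch, not part of the protocol definition:

```python
def send(data, buffer_list, connection):
    """Prefer shared memory; on exhaustion, send a FallbackData message
    over the TCP/Unix Domain Socket connection instead."""
    s = buffer_list.allocate()
    if s is None:
        connection.send_fallback(data)   # fallback path: data goes over the socket
        return "fallback"
    s.write(data)                        # fast path: data stays in shared memory
    return "shared-memory"
```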
When the IO queue is full, the server is under heavy load. The client then blocks until space becomes available in the IO queue. This acts as backpressure, much like a full Unix socket buffer, where a write either fails (EAGAIN) or blocks. Introducing extra machinery here would only add complexity.
Hot upgrade refers to updating the server process code without affecting the client's ability to write data. There are many options for hot upgrade technologies. Considering that IPC provides services in the form of a library embedded in the application process, a simple application-level hot upgrade can be chosen.
On the Choice of Process Synchronization Mechanism
Linux production environments commonly use high-performance process synchronization mechanisms such as TCP loopback, Unix domain sockets, and eventfd. eventfd benchmarks slightly better, but passing fds across processes introduces too much complexity, and its performance gain for IPC is not significant. Therefore, reusing the connection already established between the client and the server reduces complexity, and also works on systems without eventfd.
In addition, this solution targets online scenarios with relatively strict real-time requirements. For offline scenarios, sleeping at long intervals and polling the IO queue is also a good choice, but note that sleep itself requires system calls and has higher overhead than sending a SyncEvent through the connection.
It might be possible to alleviate shared memory shortage by dynamically mapping additional shared memory at runtime, but that adds management complexity. An application's workload is relatively fixed, and memory is no longer a scarce resource, so a relatively large block of shared memory can be allocated up front for subsequent use. The shared memory buffer is in this respect similar to a Unix socket buffer: when it fills up, the server is overloaded and cannot process and release buffers in time, so requests must queue; allocating more shared memory at that point does not help.
About hot upgrade
Server hot upgrade has a standard listener-fd hot-migration scheme that emphasizes being lossless and imperceptible. However, shared memory communication is provided as a library already embedded in the application, so performing the hot upgrade at the application layer is much simpler than listener-fd hot migration.
Why an IO queue is needed
The IO queue is introduced mainly for batch harvesting of IO, avoiding frequent sending and receiving of SyncEvents over the connection, whose impact on performance cannot be ignored.