****************************
RDMA Transport (RTRS)
****************************

RTRS (RDMA Transport) is a reliable high-speed transport library
that establishes an optimal number of connections between client
and server machines over an RDMA (InfiniBand, RoCE, iWARP)
transport. It is optimized for transferring (reading/writing) IO blocks.

In its core interface it follows the BIO semantics of providing the
possibility to either write data from an sg list to the remote side
or to request ("read") data transfer from the remote side into a given
sg list.

RTRS provides I/O failover and load-balancing capabilities by using
multipath I/O (see the "add_path" and "mp_policy" configuration entries in
Documentation/ABI/testing/sysfs-class-rtrs-client).

RTRS is used by the RNBD (RDMA Network Block Device) modules.

An established connection between a client and a server is called an RTRS
session. A session is associated with a set of memory chunks reserved on the
server side for a given client for RDMA transfers. A session
consists of multiple paths, each representing a separate physical link
between client and server. The paths are used for load balancing and failover.
Each path consists of as many connections (QPs) as there are CPUs on
the client.

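The session/path/connection hierarchy can be pictured with a small data
model. This is only an illustrative sketch in Python, not the driver's actual
data structures: the names Session, Path, Connection and build_session, as well
as the link labels, are hypothetical. Only the shape (one path per physical
link, one connection per client CPU) is taken from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Connection:
    cid: int  # hypothetical connection id, one connection (QP) per client CPU


@dataclass
class Path:
    link: str  # hypothetical label for one physical link
    connections: List[Connection] = field(default_factory=list)


@dataclass
class Session:
    name: str
    paths: List[Path] = field(default_factory=list)


def build_session(name: str, links: List[str], client_cpus: int) -> Session:
    """Create a session with one path per link and one connection per CPU."""
    session = Session(name)
    for link in links:
        conns = [Connection(cid) for cid in range(client_cpus)]
        session.paths.append(Path(link, conns))
    return session


# Example: two physical links, client with four CPUs.
session = build_session("client1", ["link0", "link1"], client_cpus=4)
print(len(session.paths), "paths,",
      len(session.paths[0].connections), "connections per path")
```
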
When processing an incoming write or read request, the RTRS client uses memory
chunks reserved for it on the server side. Their number, size and addresses
need to be exchanged between client and server during the connection
establishment phase. Apart from the memory-related information, the client needs
to inform the server about the session name and identify each path and connection.

On an established session, the client sends write or read messages to the server.
The server uses the immediate field to tell the client which request is being
acknowledged and to report the errno. The client uses the immediate field to tell
the server which of the memory chunks has been accessed and at which offset the
message can be found.

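How a chunk index and an offset can share one 32-bit immediate value is
sketched below. The text does not spell out the bit layout, so the layout here
is an assumption made purely for illustration: the chunk index sits in the upper
bits and the byte offset in the lower CHUNK_BITS bits. CHUNK_BITS, pack_req_imm
and unpack_req_imm are hypothetical names.

```python
from typing import Tuple

IMM_BITS = 32    # the immediate field is 32 bits wide
CHUNK_BITS = 17  # assumption: 128 KiB chunks, so an offset fits in 17 bits


def pack_req_imm(chunk_id: int, offset: int) -> int:
    """Encode (chunk index, offset inside the chunk) into one immediate value."""
    if not 0 <= offset < (1 << CHUNK_BITS):
        raise ValueError("offset does not fit into one chunk")
    imm = (chunk_id << CHUNK_BITS) | offset
    if not 0 <= imm < (1 << IMM_BITS):
        raise ValueError("chunk index does not fit into the immediate field")
    return imm


def unpack_req_imm(imm: int) -> Tuple[int, int]:
    """Decode an immediate value back into (chunk index, offset)."""
    return imm >> CHUNK_BITS, imm & ((1 << CHUNK_BITS) - 1)


# Example: the message starts 4096 bytes into chunk 5.
imm = pack_req_imm(5, 4096)
print(hex(imm), unpack_req_imm(imm))
```

With this layout the server needs no extra lookup message: a single shift and
mask tell it where in its own memory the request landed.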
The module parameter always_invalidate was introduced to address the security
problem discussed at the LPC RDMA MC 2019. When always_invalidate=Y, the server
invalidates each RDMA buffer before handing it over to the RNBD server and
then passing it on to the block layer. A new rkey is generated and registered for
the buffer after it returns from the block layer and the RNBD server.
The new rkey is sent back to the client along with the IO result.
This procedure is the default behaviour of the driver. The invalidation and
registration on each IO cause a performance drop of up to 20%. A user of the
driver may choose to load the modules with this mechanism switched off
(always_invalidate=N) if they understand and can accept the risk of a malicious
client being able to corrupt the memory of a server it is connected to. This might
be a reasonable option in a scenario where all the clients and all the servers
are located within a secure datacenter.

Connection establishment
------------------------

1. The client starts establishing the connections belonging to a path of a
session one by one by attaching RTRS_MSG_CON_REQ messages to the rdma_connect
requests. These messages include the uuid of the session and the uuid of the
path to be established. The server uses them to find an existing session/path or
to create a new one when necessary. The message also contains the protocol
version and magic for compatibility, the total number of connections per session
(as many as there are CPUs on the client), the id of the current connection and
the reconnect counter, which is used to resolve situations where the
client is trying to reconnect a path while the server is still destroying the old
one.

2. The server accepts the connection requests one by one and attaches
RTRS_MSG_CON_RSP messages to the rdma_accept. Apart from the magic and
protocol version, the messages include an error code, the queue depth supported
by the server (the number of memory chunks which are going to be allocated for
that session) and the maximum size of one IO. The RTRS_MSG_NEW_RKEY_F flag is
set when always_invalidate=Y.

3. After all connections of a path are established, the client sends the
RTRS_MSG_INFO_REQ message, containing the name of the session, to the server.
This message requests the address information from the server.

4. The server replies to the session info request message with
RTRS_MSG_INFO_RSP, which contains the addresses and keys of the RDMA buffers
allocated for that session.

5. The session becomes connected after all paths to be established are connected
(i.e. steps 1-4 have finished for all paths requested for a session).

6. Server and client periodically exchange heartbeat messages (empty RDMA
messages with an immediate field), which are used to detect a crash on the remote
side or a network outage in the absence of IO.

7. On any RDMA-related error or in the case of a heartbeat timeout, the
corresponding path is disconnected, all the inflight IOs are failed over to a
healthy path, if any, and the reconnect mechanism is triggered.

*for each connection belonging to a path and for each path:
RTRS_MSG_CON_REQ  ------------------->
                  <------------------- RTRS_MSG_CON_RSP

*after all connections are established:
RTRS_MSG_INFO_REQ ------------------->
                  <------------------- RTRS_MSG_INFO_RSP
*heartbeat is started from both sides:
                  -------------------> [RTRS_HB_MSG_IMM]
[RTRS_HB_MSG_ACK] <-------------------
[RTRS_HB_MSG_IMM] <-------------------
                  -------------------> [RTRS_HB_MSG_ACK]

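The server side of steps 1 and 2 can be modelled roughly as follows. This is a
simplified Python sketch, not the driver's code: the class and field names
(ConReq, SrvPath, Server, cid_num, recon_cnt) are hypothetical, and the policy
of rejecting a request whose reconnect counter differs from the one of a path
that still exists is an assumption about how the counter might be used.

```python
import errno
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class ConReq:
    sess_uuid: str
    path_uuid: str
    cid: int        # id of the current connection
    cid_num: int    # total number of connections
    recon_cnt: int  # reconnect counter


@dataclass
class SrvPath:
    recon_cnt: int
    cid_num: int
    established: Set[int] = field(default_factory=set)


class Server:
    def __init__(self) -> None:
        self.paths: Dict[Tuple[str, str], SrvPath] = {}

    def handle_con_req(self, req: ConReq) -> int:
        """Return 0 to accept, or a negative errno value to reject."""
        key = (req.sess_uuid, req.path_uuid)
        path = self.paths.get(key)
        if path is None:
            # Unknown session/path: create a new one.
            path = SrvPath(req.recon_cnt, req.cid_num)
            self.paths[key] = path
        elif path.recon_cnt != req.recon_cnt:
            # Assumption: the old path is still being torn down, so the
            # client is told to retry later.
            return -errno.ECONNRESET
        if not 0 <= req.cid < path.cid_num or req.cid in path.established:
            return -errno.EINVAL
        path.established.add(req.cid)
        return 0

    def path_ready(self, sess_uuid: str, path_uuid: str) -> bool:
        """True once every connection of the path has been accepted."""
        path = self.paths.get((sess_uuid, path_uuid))
        return path is not None and len(path.established) == path.cid_num


srv = Server()
for cid in range(4):
    srv.handle_con_req(ConReq("sess-uuid", "path-uuid", cid, 4, recon_cnt=0))
print("path ready:", srv.path_ready("sess-uuid", "path-uuid"))
```

Only after path_ready() holds would the client move on to the
RTRS_MSG_INFO_REQ / RTRS_MSG_INFO_RSP exchange of steps 3 and 4.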
* Write (always_invalidate=N) *

1. When processing a write request, the client selects one of the memory chunks
on the server side and RDMA-writes the user data, the user header and the
RTRS_MSG_RDMA_WRITE message there. Apart from the type (write), the message only
contains the size of the user header. The client tells the server which chunk has
been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
using the IMM field.

2. When confirming a write request, the server sends an "empty" RDMA message with
an immediate field. The 32-bit field is used to specify the outstanding
inflight IO and the error code.

usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)

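The "(id + errno)" confirmation can be illustrated with a small encoder and
decoder. The split of the 32-bit field used here (ID_BITS for the id of the
inflight IO, ERRNO_BITS for the error code) is an assumption chosen for the
example, and pack_rsp_imm / unpack_rsp_imm are hypothetical helper names.

```python
from typing import Tuple

ID_BITS = 19    # assumption: bits used for the id of the inflight IO
ERRNO_BITS = 9  # assumption: bits used for the (positive) errno value


def pack_rsp_imm(msg_id: int, err: int) -> int:
    """Encode the IO id and an errno (0 or a negative value) into one field."""
    if not 0 <= msg_id < (1 << ID_BITS):
        raise ValueError("id does not fit")
    code = abs(err)
    if code >= (1 << ERRNO_BITS):
        raise ValueError("errno does not fit")
    return (code << ID_BITS) | msg_id


def unpack_rsp_imm(imm: int) -> Tuple[int, int]:
    """Decode the field back into (IO id, errno as 0 or a negative value)."""
    msg_id = imm & ((1 << ID_BITS) - 1)
    code = (imm >> ID_BITS) & ((1 << ERRNO_BITS) - 1)
    return msg_id, -code


# Example: IO number 42 failed with errno 5 (EIO on Linux).
imm = pack_rsp_imm(42, -5)
print(hex(imm), unpack_rsp_imm(imm))
```
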
* Write (always_invalidate=Y) *

1. When processing a write request, the client selects one of the memory chunks
on the server side and RDMA-writes the user data, the user header and the
RTRS_MSG_RDMA_WRITE message there. Apart from the type (write), the message only
contains the size of the user header. The client tells the server which chunk has
been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
using the IMM field. The server first invalidates the rkey associated with the
memory chunk; when the invalidation finishes, it passes the IO to the RNBD
server module.

2. When confirming a write request, the server sends an "empty" RDMA message with
an immediate field. The 32-bit field is used to specify the outstanding
inflight IO and the error code. The new rkey is sent back using a
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, updates the rkey for the rbuffer, finishes the IO and then posts
the recv buffer back for later use.

usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_MSG_RKEY_RSP]                      <----------------- (RTRS_MSG_RKEY_RSP)
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)

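The rkey life cycle with always_invalidate=Y can be summarised in a toy model.
The sketch below is plain Python with no RDMA involved; ServerChunk, Client and
serve_write are hypothetical names, rkeys are just integers from a counter, and
the strict ordering of the two responses is an assumption based on the diagram.

```python
import itertools
from typing import Callable, Dict, List, Optional, Tuple


class ServerChunk:
    _rkeys = itertools.count(100)  # stand-in for keys handed out by the HCA

    def __init__(self) -> None:
        self.rkey: Optional[int] = next(self._rkeys)

    def invalidate(self) -> None:
        # While the rkey is invalid the client cannot touch the buffer.
        self.rkey = None

    def reregister(self) -> int:
        self.rkey = next(self._rkeys)
        return self.rkey


class Client:
    def __init__(self, rkeys: Dict[int, int]) -> None:
        self.rkeys = dict(rkeys)  # chunk id -> rkey known to the client
        self.completed: List[Tuple[int, int]] = []

    def on_rkey_rsp(self, chunk_id: int, new_rkey: int) -> None:
        self.rkeys[chunk_id] = new_rkey  # update the rkey for the rbuffer

    def on_io_rsp(self, chunk_id: int, err: int) -> None:
        self.completed.append((chunk_id, err))  # finish the IO


def serve_write(chunk_id: int, chunk: ServerChunk, client: Client,
                handle_io: Callable[[], int]) -> None:
    chunk.invalidate()                      # step 1: invalidate first
    err = handle_io()                       # hand over to the upper layer
    new_rkey = chunk.reregister()           # new rkey after the IO returns
    client.on_rkey_rsp(chunk_id, new_rkey)  # RTRS_MSG_RKEY_RSP
    client.on_io_rsp(chunk_id, err)         # RTRS_IO_RSP_IMM (id + errno)


chunk = ServerChunk()
client = Client({0: chunk.rkey})
serve_write(0, chunk, client, handle_io=lambda: 0)
print(client.rkeys, client.completed)
```

The point of the model is that the key the client used for the write is never
valid again: every completed IO leaves the client holding a fresh rkey.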
* Read (always_invalidate=N) *

1. When processing a read request, the client selects one of the memory chunks
on the server side and RDMA-writes the user header and the
RTRS_MSG_RDMA_READ message there. This message contains the type (read), the
size of the user header, flags (specifying if memory invalidation is necessary)
and the list of addresses, along with keys, for the data to be read into.

2. When confirming a read request, the server transfers the requested data first,
attaches an invalidation message if requested and finally sends an "empty" RDMA
message with an immediate field. The 32-bit field is used to specify the
outstanding inflight IO and the error code.

usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

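The content of a read request and the choice between the two responses can be
sketched as below. This is an illustrative Python model only: BufDesc, ReadMsg,
build_read_msg and response_kind are hypothetical names, and NEED_INVAL is a
made-up flag value standing in for "memory invalidation is necessary".

```python
from dataclasses import dataclass
from typing import List, Tuple

NEED_INVAL = 0x1  # hypothetical flag bit: client asks for invalidation


@dataclass
class BufDesc:
    addr: int  # address of a client buffer the data is read into
    key: int   # key the server needs to write into that buffer


@dataclass
class ReadMsg:
    usr_len: int          # size of the user header
    flags: int
    descs: List[BufDesc]  # list of addresses along with keys


def build_read_msg(usr_hdr: bytes, bufs: List[Tuple[int, int]],
                   need_inval: bool) -> ReadMsg:
    """Client side: describe where the server should place the data."""
    flags = NEED_INVAL if need_inval else 0
    return ReadMsg(len(usr_hdr), flags, [BufDesc(a, k) for a, k in bufs])


def response_kind(msg: ReadMsg) -> str:
    """Server side: pick the response shown in the diagram above."""
    if msg.flags & NEED_INVAL:
        return "RTRS_IO_RSP_IMM_W_INV"
    return "RTRS_IO_RSP_IMM"


msg = build_read_msg(b"user-header", [(0x1000, 7), (0x2000, 8)], need_inval=True)
print(msg.usr_len, len(msg.descs), response_kind(msg))
```
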
* Read (always_invalidate=Y) *

1. When processing a read request, the client selects one of the memory chunks
on the server side and RDMA-writes the user header and the
RTRS_MSG_RDMA_READ message there. This message contains the type (read), the
size of the user header, flags (specifying if memory invalidation is necessary)
and the list of addresses, along with keys, for the data to be read into.
The server first invalidates the rkey associated with the memory chunk; when the
invalidation finishes, it passes the IO to the RNBD server module.

2. When confirming a read request, the server transfers the requested data first,
attaches an invalidation message if requested and finally sends an "empty" RDMA
message with an immediate field. The 32-bit field is used to specify the
outstanding inflight IO and the error code. The new rkey is sent back using a
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, updates the rkey for the rbuffer, finishes the IO and then posts
the recv buffer back for later use.

usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
[RTRS_MSG_RKEY_RSP]          <-------------- (RTRS_MSG_RKEY_RSP)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)
=========================================
Contributors List (in alphabetical order)
=========================================
Danil Kipnis <danil.kipnis@profitbricks.com>
Fabian Holler <mail@fholler.de>
Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Jack Wang <jinpu.wang@profitbricks.com>
Kleber Souza <kleber.souza@profitbricks.com>
Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
Milind Dumbare <Milind.dumbare@gmail.com>
Roman Penyaev <roman.penyaev@profitbricks.com>