   d) Implementation overview
   g) Other contingencies
2) Writing a user pass-through handler
   a) Discovering and configuring TCMU uio devices
   b) Waiting for events on the device(s)
   c) Managing the command ring

TCM is another name for LIO, an in-kernel iSCSI target (server).
Existing TCM targets run in the kernel. TCMU (TCM in Userspace)
allows userspace programs to be written which act as iSCSI targets.
This document describes the design.

The existing kernel provides modules for different SCSI transport
protocols. TCM also modularizes the data storage. There are existing
modules for file, block device, RAM or using another SCSI device as
storage. These are called "backstores" or "storage engines". These
built-in modules are implemented entirely as kernel code.

In addition to modularizing the transport protocol used for carrying
SCSI commands ("fabrics"), the Linux kernel target, LIO, also modularizes
the actual data storage. These are referred to as "backstores"
or "storage engines". The target comes with backstores that allow a
file, a block device, RAM, or another SCSI device to be used for the
local storage needed for the exported SCSI LUN. Like the rest of LIO,
these are implemented entirely as kernel code.

These backstores cover the most common use cases, but not all. One new
use case that other non-kernel target solutions, such as tgt, are able
to support is using Gluster's GLFS or Ceph's RBD as a backstore. The
target then serves as a translator, allowing initiators to store data
in these non-traditional networked storage systems, while still only
using standard protocols themselves.

If the target is a userspace process, supporting these is easy. tgt,
for example, needs only a small adapter module for each, because the
modules just use the available userspace libraries for RBD and GLFS.

Adding support for these backstores in LIO is considerably more
difficult, because LIO is entirely kernel code. Instead of undertaking
the significant work to port the GLFS or RBD APIs and protocols to the
kernel, another approach is to create a userspace pass-through
backstore for LIO, "TCMU".

In addition to allowing relatively easy support for RBD and GLFS, TCMU
will also allow easier development of new backstores. TCMU combines
with the LIO loopback fabric to become something similar to FUSE
(Filesystem in Userspace), but at the SCSI layer instead of the
filesystem layer. A SUSE, if you will.

The disadvantage is there are more distinct components to configure, and
potentially to malfunction. This is unavoidable, but hopefully not
fatal if we're careful to keep things as simple as possible.

- Good performance: high throughput, low latency
- Cleanly handle if userspace:
- Allow future flexibility in user & kernel implementations
- Be reasonably memory-efficient
- Simple to configure & run
- Simple to write a userspace backend

Implementation overview
-----------------------

The core of the TCMU interface is a memory region that is shared
between kernel and userspace. Within this region are: a control area
(mailbox); a lockless producer/consumer circular buffer for commands
to be passed up, and status returned; and an in/out data buffer area.

TCMU uses the pre-existing UIO subsystem. UIO allows device driver
development in userspace, and this is conceptually very close to the
TCMU use case, except instead of a physical device, TCMU implements a
memory-mapped layout designed for SCSI commands. Using UIO also
benefits TCMU by handling device introspection (e.g. a way for
userspace to determine how large the shared region is) and signaling
mechanisms in both directions.

There are no embedded pointers in the memory region. Everything is
expressed as an offset from the region's starting address. This allows
the ring to still work if the user process dies and is restarted with
the region mapped at a different virtual address.

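
Because every reference is an offset, a handler resolves one by adding it
to wherever the region is currently mapped. The following is a minimal
sketch of that idea; `region_ptr()` is a hypothetical helper, not part of
the TCMU headers, and the bounds check is an assumption about what a
careful handler might want rather than something the interface requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: turn an offset found in the shared region into a
 * pointer that is valid for the current mapping. Returns NULL when the
 * range [off, off + len) does not fit inside the region.
 */
void *region_ptr(void *map, size_t map_len, uint64_t off, size_t len)
{
    assert(map != NULL);
    if (off > map_len || len > map_len - off)
        return NULL;
    return (uint8_t *)map + off;
}
```

The same offset resolves correctly whether the region is mapped at one
address or, after a restart, at another; only the `map` argument changes.
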
See target_core_user.h for the struct definitions.

The mailbox is always at the start of the shared memory region, and
contains a version, details about the starting offset and size of the
command ring, and head and tail pointers to be used by the kernel and
userspace (respectively) to put commands on the ring, and indicate
when the commands are completed.

version - 1 (userspace should abort if otherwise)

flags:
    - TCMU_MAILBOX_FLAG_CAP_OOOC:
        indicates out-of-order completion is supported.
        See "The Command Ring" for details.

cmdr_off
    The offset of the start of the command ring from the start
    of the memory region, to account for the mailbox size.

cmdr_size
    The size of the command ring. This does *not* need to be a

cmd_head
    Modified by the kernel to indicate when a command has been
    placed on the ring.

cmd_tail
    Modified by userspace to indicate when it has completed
    processing of a command.

Commands are placed on the ring by the kernel incrementing
mailbox.cmd_head by the size of the command, modulo cmdr_size, and
then signaling userspace via uio_event_notify(). Once the command is
completed, userspace updates mailbox.cmd_tail in the same way and
signals the kernel via a 4-byte write(). When cmd_head equals
cmd_tail, the ring is empty -- no commands are currently waiting to be
processed by userspace.

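
The head and tail arithmetic described above can be sketched in a few
lines. The function names are hypothetical and exist only for
illustration; the modulo form is used because the ring size is not
assumed to have any particular shape.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Advance a ring offset (cmd_head or cmd_tail) past an entry of len bytes. */
uint32_t ring_advance(uint32_t pos, uint32_t len, uint32_t cmdr_size)
{
    assert(cmdr_size > 0 && pos < cmdr_size);
    return (pos + len) % cmdr_size;
}

/* The ring is empty when the consumer has caught up with the producer. */
bool ring_empty(uint32_t cmd_head, uint32_t cmd_tail)
{
    return cmd_head == cmd_tail;
}

/* Bytes of entries still waiting for userspace, allowing for wrap-around. */
uint32_t ring_pending(uint32_t cmd_head, uint32_t cmd_tail, uint32_t cmdr_size)
{
    return (cmd_head + cmdr_size - cmd_tail) % cmdr_size;
}
```

For example, with a 1000-byte ring, advancing offset 960 by a 64-byte
entry wraps around to offset 24.
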
TCMU commands are 8-byte aligned. They start with a common header
containing "len_op", a 32-bit value that stores the length, as well as
the opcode in the lowest unused bits. It also contains cmd_id and
flags fields for setting by the kernel (kflags) and userspace
(uflags).

Currently only two opcodes are defined, TCMU_OP_CMD and TCMU_OP_PAD.

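
Because entries are 8-byte aligned, every entry length is a multiple of 8
and its three low bits are always zero, which leaves room for the opcode.
The sketch below shows that packing under this assumption. The SKETCH_
and sketch_ names are hypothetical stand-ins; a real handler should use
the tcmu_hdr_get_op() and tcmu_hdr_get_len() helpers from
target_core_user.h instead.

```c
#include <assert.h>
#include <stdint.h>

/* Entries are 8-byte aligned, so the low three bits of a length are free. */
#define SKETCH_OP_ALIGN 8u
#define SKETCH_OP_MASK  (SKETCH_OP_ALIGN - 1)

/* Pack an entry length (a multiple of 8) and an opcode into one word. */
uint32_t sketch_len_op(uint32_t len, uint32_t op)
{
    assert((len & SKETCH_OP_MASK) == 0 && op <= SKETCH_OP_MASK);
    return len | op;
}

/* Recover the opcode from the low bits. */
uint32_t sketch_get_op(uint32_t len_op)
{
    return len_op & SKETCH_OP_MASK;
}

/* Recover the entry length by clearing the opcode bits. */
uint32_t sketch_get_len(uint32_t len_op)
{
    return len_op & ~SKETCH_OP_MASK;
}
```
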
When the opcode is CMD, the entry in the command ring is a struct
tcmu_cmd_entry. Userspace finds the SCSI CDB (Command Data Block) via
tcmu_cmd_entry.req.cdb_off. This is an offset from the start of the
overall shared memory region, not the entry. The data in/out buffers
are accessible via the req.iov[] array. iov_cnt contains the number of
entries in iov[] needed to describe either the Data-In or Data-Out
buffers. For bidirectional commands, iov_cnt specifies how many iovec
entries cover the Data-Out area, and iov_bidi_cnt specifies how many
iovec entries immediately after that in iov[] cover the Data-In
area. Just like other fields, iov.iov_base is an offset from the start
of the region.

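
Since iov_base carries an offset rather than a pointer, a handler has to
rebase each iovec before touching the data. The following is a minimal
sketch of gathering a buffer described by such iovecs; `sketch_gather()`
is a hypothetical helper, and it assumes the offsets have already been
validated against the size of the mapped region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Gather the bytes described by cnt iovecs into dst. Each iov_base holds
 * an offset from the start of the mapped region, not a pointer.
 * Returns the number of bytes copied (at most dst_len).
 */
size_t sketch_gather(void *map, const struct iovec *iov, size_t cnt,
                     void *dst, size_t dst_len)
{
    size_t done = 0;

    assert(map != NULL && dst != NULL);
    for (size_t i = 0; i < cnt && done < dst_len; i++) {
        const uint8_t *src = (uint8_t *)map + (uintptr_t)iov[i].iov_base;
        size_t n = iov[i].iov_len;

        if (n > dst_len - done)
            n = dst_len - done;
        memcpy((uint8_t *)dst + done, src, n);
        done += n;
    }
    return done;
}
```

For a bidirectional command, the Data-Out area would be gathered from
iov[0] through iov[iov_cnt - 1], and the Data-In area is described by the
iov_bidi_cnt entries that start at iov[iov_cnt].
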
When completing a command, userspace sets rsp.scsi_status, and
rsp.sense_buffer if necessary. Userspace then increments
mailbox.cmd_tail by entry.hdr.length (mod cmdr_size) and signals the
kernel via the UIO method, a 4-byte write to the file descriptor.

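
The signaling step is small enough to isolate. This sketch wraps the
4-byte write in a hypothetical helper, `sketch_notify_kernel()`; the value
written is assumed not to matter to TCMU, so 0 is used here.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Signal the kernel through the UIO file descriptor with a 4-byte write.
 * Returns 0 on success, -1 if the write failed or was short.
 */
int sketch_notify_kernel(int uio_fd)
{
    uint32_t val = 0; /* assumption: the written value is not interpreted */
    ssize_t n = write(uio_fd, &val, sizeof(val));

    return n == (ssize_t)sizeof(val) ? 0 : -1;
}
```

A handler would call this once after updating cmd_tail, rather than once
per completed entry, so that several completions share one notification.
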
If TCMU_MAILBOX_FLAG_CAP_OOOC is set in mailbox->flags, the kernel is
capable of handling out-of-order completions. In this case, userspace can
handle commands in an order different from the one in which they were
queued. Because the kernel still processes entries in the order they
appear in the command ring, userspace needs to update cmd->id when
completing a command (that is, it "steals" the original command's entry).

When the opcode is PAD, userspace only updates cmd_tail as above --
it's a no-op. (The kernel inserts PAD entries to ensure each CMD entry
is contiguous within the command ring.)

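
To see why padding is needed, consider an entry that would otherwise
straddle the end of the ring. The sketch below illustrates the rule only;
it is not the kernel's code, and `sketch_pad_needed()` is a hypothetical
name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return how many bytes of padding must be placed at cmd_head so that an
 * entry of len bytes starts at offset 0 and stays contiguous. Returns 0
 * when the entry already fits before the end of the ring.
 */
uint32_t sketch_pad_needed(uint32_t cmd_head, uint32_t len, uint32_t cmdr_size)
{
    uint32_t to_end;

    assert(cmd_head < cmdr_size && len <= cmdr_size);
    to_end = cmdr_size - cmd_head;
    return len > to_end ? to_end : 0;
}
```

With a 1024-byte ring and cmd_head at 1000, a 64-byte entry does not fit
in the remaining 24 bytes, so those 24 bytes become a PAD entry and the
command itself starts at offset 0.
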
More opcodes may be added in the future. If userspace encounters an
opcode it does not handle, it must set the UNKNOWN_OP bit (bit 0) in
hdr.uflags, update cmd_tail, and proceed with processing additional
commands.

The data area is the shared-memory space after the command ring. The
organization of this area is not defined in the TCMU interface, and
userspace should access only the parts referenced by pending iovs.

Other devices may be using UIO besides TCMU. Unrelated user processes
may also be handling different sets of TCMU devices. TCMU userspace
processes must find their devices by scanning sysfs
class/uio/uio*/name. For TCMU devices, these names will be of the
format::

    tcm-user/<hba_num>/<device_name>/<subtype>/<path>

where "tcm-user" is common for all TCMU-backed UIO devices. <hba_num>
and <device_name> allow userspace to find the device's path in the
kernel target's configfs tree. Assuming the usual mount point, it is
found at::

    /sys/kernel/config/target/core/user_<hba_num>/<device_name>

This location contains attributes such as "hw_block_size", that
userspace needs to know for correct operation.

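
Splitting a UIO name into its fields and deriving the configfs directory
can be sketched as follows. The struct, the function names, and the
buffer sizes are all hypothetical choices made for illustration, and the
code assumes configfs is mounted at /sys/kernel/config.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct sketch_uio_name {
    unsigned int hba_num;
    char device_name[64];
    char subtype[64];
    char path[256];
};

/* Returns 0 on success, -1 if name is not in the expected tcm-user form. */
int sketch_parse_uio_name(const char *name, struct sketch_uio_name *out)
{
    int n;

    assert(name != NULL && out != NULL);
    out->path[0] = '\0'; /* the trailing field may be absent */
    n = sscanf(name, "tcm-user/%u/%63[^/]/%63[^/]/%255[^\n]",
               &out->hba_num, out->device_name, out->subtype, out->path);
    return n >= 3 ? 0 : -1;
}

/* Build the configfs directory for the device. Returns 0 on success. */
int sketch_configfs_dir(const struct sketch_uio_name *u, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/sys/kernel/config/target/core/user_%u/%s",
                     u->hba_num, u->device_name);

    return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

A handler would typically compare the parsed subtype against the one it
implements before opening the device, and ignore every other UIO entry.
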
<subtype> will be a userspace-process-unique string to identify the
TCMU device as expecting to be backed by a certain handler, and <path>
will be an additional handler-specific string for the user process to
configure the device, if needed. The name cannot contain ':', due to

For all devices so discovered, the user handler opens /dev/uioX and
calls mmap()::

    mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)

where size must be equal to the value read from
/sys/class/uio/uioX/maps/map0/size.

If a new device is added or removed, a notification will be broadcast
over netlink, using a generic netlink family name of "TCM-USER" and a
multicast group named "config". This will include the UIO name as
described in the previous section, as well as the UIO minor
number. This should allow userspace to identify both the UIO device and
the LIO device, so that after determining the device is supported
(based on subtype) it can take the appropriate action.

Userspace handler process never attaches:

- TCMU will post commands, and then abort them after a timeout period

Userspace handler process is killed:

- It is still possible to restart and re-connect to TCMU
  devices. Command ring is preserved. However, after the timeout period,
  the kernel will abort pending tasks.

Userspace handler process hangs:

- The kernel will abort pending tasks after a timeout period.

Userspace handler process is malicious:

- The process can trivially break the handling of devices it controls,
  but should not be able to access kernel memory outside its shared
  memory.

Writing a user pass-through handler (with example code)
=======================================================

A user process handling a TCMU device must support the following:

a) Discovering and configuring TCMU uio devices
b) Waiting for events on the device(s)
c) Managing the command ring: Parsing operations and commands,
   performing work as needed, setting response fields (scsi_status and
   possibly sense_buffer), updating cmd_tail, and notifying the kernel
   that work has been finished

First, consider instead writing a plugin for tcmu-runner. tcmu-runner
implements all of this, and provides a higher-level API for plugin

TCMU is designed so that multiple unrelated processes can manage TCMU
devices separately. All handlers should make sure to only open their
devices, based upon a known subtype string.

a) Discovering and configuring TCMU UIO devices::

    /* error checking omitted for brevity */
    int fd, dev_fd;
    ssize_t ret;
    char buf[256];
    unsigned long long map_len;
    void *map;

    fd = open("/sys/class/uio/uio0/name", O_RDONLY);
    ret = read(fd, buf, sizeof(buf));
    close(fd);
    buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

    /* we only want uio devices whose name is a format we expect */
    if (strncmp(buf, "tcm-user", 8))
        exit(-1);

    /* Further checking for subtype also needed here */

    fd = open("/sys/class/uio/uio0/maps/map0/size", O_RDONLY);
    ret = read(fd, buf, sizeof(buf));
    close(fd);
    buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

    map_len = strtoull(buf, NULL, 0);

    dev_fd = open("/dev/uio0", O_RDWR);
    map = mmap(NULL, map_len, PROT_READ|PROT_WRITE, MAP_SHARED, dev_fd, 0);

b) Waiting for events on the device(s)::

    char buf[4];
    while (read(dev_fd, buf, 4) == 4) /* will block */
        handle_device_events(dev_fd, map);

c) Managing the command ring::

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <linux/target_core_user.h>

    int handle_device_events(int fd, void *map)
    {
        struct tcmu_mailbox *mb = map;
        struct tcmu_cmd_entry *ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
        int did_some_work = 0;

        /* Process events from cmd ring until we catch up with cmd_head */
        while (ent != (void *)mb + mb->cmdr_off + mb->cmd_head) {
            if (tcmu_hdr_get_op(ent->hdr.len_op) == TCMU_OP_CMD) {
                uint8_t *cdb = (void *)mb + ent->req.cdb_off;
                bool success = true;
                /* Handle command here. */
                printf("SCSI opcode: 0x%x\n", cdb[0]);
                /* Set response fields (status constants are the handler's own) */
                if (success) {
                    ent->rsp.scsi_status = SCSI_NO_SENSE;
                } else {
                    /* Also fill in rsp->sense_buffer here */
                    ent->rsp.scsi_status = SCSI_CHECK_CONDITION;
                }
            } else if (tcmu_hdr_get_op(ent->hdr.len_op) != TCMU_OP_PAD) {
                /* Tell the kernel we didn't handle unknown opcodes */
                ent->hdr.uflags |= TCMU_UFLAG_UNKNOWN_OP;
            } else {
                /* Do nothing for PAD entries except update cmd_tail */
            }

            /* update cmd_tail */
            mb->cmd_tail = (mb->cmd_tail + tcmu_hdr_get_len(ent->hdr.len_op)) % mb->cmdr_size;
            ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
            did_some_work = 1;
        }

        /* Notify the kernel that work has been finished */
        if (did_some_work) {
            uint32_t buf = 0;
            write(fd, &buf, 4);
        }
        return 0;
    }

Please be careful to return codes as defined by the SCSI
specifications. These are different from some values defined in the
scsi/scsi.h include file. For example, CHECK CONDITION's status code

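
As a hedged illustration of spec-defined values: SAM assigns GOOD the
status 0x00 and CHECK CONDITION the status 0x02, and SPC fixed-format
sense data uses response code 0x70 with the sense key in byte 2 and the
additional sense code and qualifier in bytes 12 and 13. The macro and
function names below are hypothetical; only the numeric values come from
the specifications.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Status codes as defined by SAM (names are local to this sketch). */
#define SKETCH_SAM_STAT_GOOD            0x00
#define SKETCH_SAM_STAT_CHECK_CONDITION 0x02

/* Fill a fixed-format sense buffer of at least 18 bytes. */
void sketch_fill_sense(uint8_t *sense, size_t len,
                       uint8_t key, uint8_t asc, uint8_t ascq)
{
    assert(sense != NULL && len >= 18);
    memset(sense, 0, len);
    sense[0] = 0x70;       /* current error, fixed format */
    sense[2] = key & 0x0f; /* sense key */
    sense[7] = 0x0a;       /* additional sense length: bytes 8..17 */
    sense[12] = asc;       /* additional sense code */
    sense[13] = ascq;      /* additional sense code qualifier */
}
```

For an unsupported CDB, a handler might report CHECK CONDITION with sense
key 0x05 (ILLEGAL REQUEST) and additional sense code 0x20 (INVALID COMMAND
OPERATION CODE).
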