3 .NC "The Design of Unix IPC"
The ARGO implementation of
TP and CLNP was designed to fit into the AOS kernel.
All the standard protocol hooks are used.
To understand the design, it is useful to have read
Leffler, Joy, and Fabry:
14 \*(lq4.2 BSD Networking Implementation Notes\*(rq July 1983.
15 This section describes the
16 design of the IPC support in the AOS kernel.
17 .sh 1 "Functional Unit Overview"
The kernel is a monolithic program of considerable size and complexity.
23 The code can be separated into parts of distinct function,
24 but there are no kernel processes per se.
25 The kernel code is either executed on behalf of a user
26 process, in which case the kernel was entered by a system call,
27 or it is executed on behalf of a hardware or software interrupt.
The following sections briefly describe the major functional units
of the kernel.
31 .so figs/func_units.nr
33 shows the arrangement of these kernel units and
35 .sh 2 "The file system."
37 .sh 2 "Virtual memory support."
39 This includes protection, swapping, paging, and
41 .sh 2 "Blocked device drivers (disks, tapes)."
43 All these drivers share some minor functional units,
44 such as buffer management and bus support
45 for the various types of busses on the machine.
46 .sh 2 "Interprocess communication (IPC)."
This includes support for various protocols,
buffer management, and a standard interface for inter-protocol
communication.
52 .sh 2 "Network interface drivers."
54 These drivers are closely tied to the IPC support.
55 They use the IPC's buffer management unit rather
56 than the buffers used by the blocked device drivers.
57 The interface between these drivers and the rest of the kernel
58 differs from the interface used by the blocked devices.
61 This is terminal support, including the user interface
62 and the device drivers.
63 .sh 2 "System call interface."
65 This handles signals, traps, and system calls.
68 The clock is used in various forms by many
70 .sh 2 "User process support (the rest)."
72 This includes support for accounting, process creation,
73 control, scheduling, and destruction.
The major functional unit that supports IPC
can be divided into the following smaller functional units.
80 .sh 3 "Buffer management."
82 All protocols share a pool of buffers called \fImbufs\fR:
.(b
struct mbuf {
    struct mbuf *m_next;       /* next buffer in chain */
    u_long      m_off;         /* offset of data */
    short       m_len;         /* amount of data */
    short       m_type;        /* mbuf type (0 == free) */
    u_char      m_dat[MLEN];   /* data storage */
    struct mbuf *m_act;        /* link in 2-d structure */
};
.)b
There are two forms of mbufs: small ones and large ones.
103 Small ones are 128 octets in
106 in the ARGO release. Small mbufs are copied by byte-to-byte
108 The data in these mbufs are kept in the character
109 array field \fIm_dat\fR in the mbuf structure
111 For this type of mbuf, the field \fIm_off\fR is positive,
112 and is the offset to the beginning of the data from
113 the beginning of the mbuf structure itself.
114 Large mbufs, called \fIclusters\fR, are page-sized
116 They may be \*(lqcopied\*(rq by multiply mapping the pages they occupy.
They consist of a page of memory plus a small mbuf structure
whose fields are used
to link clusters into chains, but whose \fIm_dat\fR array is
not used.
121 The \fIm_off\fR field of the structure
122 is the offset (positive or negative) from the
123 beginning of the mbuf structure to the beginning
124 of the data page part of the cluster.
In the case of clusters, the offset always falls outside the
bounds of the \fIm_dat\fR array, and so it is always possible
to tell from the \fIm_off\fR field whether an mbuf structure
is part of a cluster or is a small mbuf.
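The test on \fIm_off\fR described above can be sketched in user-space C as follows.
This is only an illustration under stated assumptions: the names sk_mbuf, sk_data,
sk_is_cluster, and sk_attach_page are hypothetical, the offset is held in a signed
integer so that a negative value is representable, and the figure of 112 data octets
is the AOS value quoted elsewhere in this document. It is not the kernel's own code.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_MLEN 112   /* data octets in a small mbuf (AOS figure quoted in this document) */

/* Hypothetical user-space stand-in for the mbuf structure shown above. */
struct sk_mbuf {
    struct sk_mbuf *m_next;
    intptr_t        m_off;            /* signed here: offset may be negative */
    short           m_len;
    short           m_type;
    unsigned char   m_dat[SK_MLEN];
    struct sk_mbuf *m_act;
};

/* Start of the data: address of the structure plus m_off. */
static unsigned char *sk_data(struct sk_mbuf *m)
{
    return (unsigned char *)((intptr_t)m + m->m_off);
}

/* A small mbuf keeps its data inside m_dat; any other offset means a cluster. */
static int sk_is_cluster(const struct sk_mbuf *m)
{
    intptr_t lo = (intptr_t)offsetof(struct sk_mbuf, m_dat);
    intptr_t hi = lo + SK_MLEN;

    return m->m_off < lo || m->m_off >= hi;
}

/* Point a header structure at an external page, as a cluster would. */
static void sk_attach_page(struct sk_mbuf *m, unsigned char *page)
{
    m->m_off = (intptr_t)page - (intptr_t)m;
}
```

Because a page can never overlap the header that refers to it, the computed offset
cannot land inside \fIm_dat\fR, which is what makes the single-field test sufficient.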
129 All mbufs permanently reside in memory.
130 The mbuf management unit manages its own page table.
131 The mbuf manager keeps limited statistics on the quantities and
132 types of buffers in use.
133 Mbufs are used for many purposes, and most of these purposes
134 have a type associated with them.
135 Some of the types that buffers may take are
136 MT_FREE (not allocated), MT_DATA,
137 MT_HEADER, MT_SOCKET (socket structure),
138 MT_PCB (protocol control block),
139 MT_RTABLE (routing tables),
MT_SOOPTS (arguments passed to \fIgetsockopt()\fR and
\fIsetsockopt()\fR).
143 Data are passed among functional units by means
144 of queues, the contents of which are
145 either chains of mbufs or groups of chains of mbufs.
146 Mbufs are linked into chains with the \fIm_next\fR field.
147 Chains of mbufs are linked into groups with the \fIm_act\fR
149 The \fIm_act\fR field allows a protocol to retain packet
150 boundaries in a queue of mbufs.
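The two-dimensional linkage can be illustrated with a short sketch.
The structure below is a hypothetical reduction of the mbuf to the three fields
that matter here, and the names pkt_mbuf, pkt_chain_len, and pkt_queue_count are
assumptions made for the example, not kernel routines.

```c
#include <stddef.h>

/* Hypothetical reduced mbuf: only the linkage and length fields. */
struct pkt_mbuf {
    struct pkt_mbuf *m_next;   /* next buffer in the same packet */
    struct pkt_mbuf *m_act;    /* first buffer of the next packet in the queue */
    short            m_len;    /* octets of data held in this buffer */
};

/* Length of one packet: follow m_next and sum the buffer lengths. */
static long pkt_chain_len(const struct pkt_mbuf *m)
{
    long len = 0;

    for (; m != NULL; m = m->m_next)
        len += m->m_len;
    return len;
}

/* Number of packets on a queue: follow m_act from chain head to chain head. */
static int pkt_queue_count(const struct pkt_mbuf *head)
{
    int n = 0;

    for (; head != NULL; head = head->m_act)
        n++;
    return n;
}
```

A queue holding a two-buffer packet followed by a one-buffer packet thus has two
chain heads joined by m_act, and each packet's length is recovered by walking
m_next from its head.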
153 Routing decisions in the kernel are made by the procedure \fIrtalloc()\fR.
154 This procedure will scan the kernel routing tables (stored in mbufs)
looking for a route. A route is represented by
the following structure:
.(b
struct rtentry {
    u_long          rt_hash;      /* to speed lookups */
    struct sockaddr rt_dst;       /* key */
    struct sockaddr rt_gateway;   /* value */
    short           rt_flags;     /* up/down?, host/net */
    short           rt_refcnt;    /* # held references */
    u_long          rt_use;       /* raw # packets forwarded */
    struct ifnet    *rt_ifp;      /* interface to use */
};
.)b
175 When looking for a route, \fIrtalloc()\fR will first hash the entire destination
176 address, and scan the routing tables looking for a complete route. If a route
177 is not found, then \fIrtalloc()\fR will rescan the table looking for a route
178 which matches the \fInetwork\fR portion of the address. If a route is still
179 not found, then a default route is used (if present).
181 If a route is found, the entity which called \fIrtalloc()\fR can use information
182 from the \fIrtentry\fR structure to dispatch the datagram. Specifically, the
183 datagram is queued on the interface identified by the interface
184 pointer \fIrt_ifp\fR.
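The lookup order described above (complete route, then network route, then default)
can be sketched as follows. This is a simplified illustration, not the body of
\fIrtalloc()\fR: it omits hashing, uses a flat array instead of tables held in mbufs,
treats addresses as 32-bit values, and assumes that the network portion is the top
24 bits and that a destination of zero marks the default route. All names beginning
with rtx_ or RTX_ are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define RTX_UP   0x1   /* route is usable */
#define RTX_HOST 0x2   /* entry names a single host rather than a network */

/* Hypothetical, much reduced routing entry. */
struct rtx_entry {
    uint32_t dst;       /* key: host address, network number, or 0 for default */
    uint32_t gateway;   /* value: next hop */
    int      flags;
    int      ifindex;   /* stand-in for the interface pointer */
};

/* Assumption for this sketch: the network portion is the top 24 bits. */
static uint32_t rtx_netof(uint32_t addr)
{
    return addr & 0xffffff00u;
}

static const struct rtx_entry *
rtx_lookup(const struct rtx_entry *tbl, size_t n, uint32_t dst)
{
    size_t i;

    /* Pass 1: a complete route to the destination host. */
    for (i = 0; i < n; i++)
        if ((tbl[i].flags & RTX_UP) && (tbl[i].flags & RTX_HOST) &&
            tbl[i].dst == dst)
            return &tbl[i];

    /* Pass 2: a route to the destination's network. */
    for (i = 0; i < n; i++)
        if ((tbl[i].flags & RTX_UP) && !(tbl[i].flags & RTX_HOST) &&
            tbl[i].dst != 0 && tbl[i].dst == rtx_netof(dst))
            return &tbl[i];

    /* Pass 3: the default route, if one is present. */
    for (i = 0; i < n; i++)
        if ((tbl[i].flags & RTX_UP) && !(tbl[i].flags & RTX_HOST) &&
            tbl[i].dst == 0)
            return &tbl[i];

    return NULL;   /* no route */
}
```

The caller would then use the interface recorded in the returned entry to queue
the datagram, as the preceding paragraph describes.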
187 This is the protocol-independent part of the IPC support.
188 Each communication endpoint (which may or may not be associated
189 with a connection) is represented by the following structure:
.(b
struct socket {
    short   so_type;             /* type, e.g. SOCK_DGRAM */
    short   so_options;          /* from socket call */
    short   so_linger;           /* time to linger @ close */
    short   so_state;            /* internal state flags */
    caddr_t so_pcb;              /* network layer pcb */
    struct protosw *so_proto;    /* protocol handle */
    struct socket  *so_head;     /* ptr to accept socket */
    struct socket  *so_q0;       /* queue of partial connX */
    short   so_q0len;            /* # partials on so_q0 */
    struct socket  *so_q;        /* queue of incoming connX */
    short   so_qlen;             /* # connections on so_q */
    short   so_qlimit;           /* max # queued connX */
    struct sockbuf {
        short   sb_cc;           /* actual chars in buffer */
        short   sb_hiwat;        /* max actual char count */
        short   sb_mbcnt;        /* chars of mbufs used */
        short   sb_mbmax;        /* max chars of mbufs to use */
        short   sb_lowat;        /* low water mark (not used yet) */
        short   sb_timeo;        /* timeout (not used) */
        struct mbuf *sb_mb;      /* the mbuf chain */
        struct proc *sb_sel;     /* process selecting */
        short   sb_flags;        /* flags, see below */
    } so_rcv, so_snd;
    short   so_timeo;            /* connection timeout */
    u_short so_error;            /* error affecting connX */
    short   so_oobmark;          /* oob mark (TCP only) */
    short   so_pgrp;             /* pgrp for signals */
};
.)b
230 The socket code maintains a pair of queues for each socket,
231 \fIso_rcv\fR and \fIso_snd\fR.
232 Each queue is associated with a count of the number of characters
233 in the queue, the maximum number of characters allowed to be put
234 in the queue, some status information (\fIsb_flags\fR), and
235 several unused fields.
236 For a send operation, data are copied from the user's address space
237 into chains of mbufs.
238 This is done by the socket module, which then calls the underlying
transport protocol module to place the data
on the \fIso_snd\fR queue.
241 This is generally done by
242 appending to the chain beginning at \fIsb_mb\fR.
243 The socket module copies data from the \fIso_rcv\fR queue
244 to the user's address space to effect a receive operation.
245 The underlying transport layer is expected to have put incoming
246 data into \fIso_rcv\fR by calling procedures in this module.
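The character-count bookkeeping kept for each queue can be sketched as follows.
This is a deliberately reduced illustration: only the count and the high-water
mark are modeled, the mbuf chain itself is left out, and the names sbx, sbx_space,
sbx_append, and sbx_drop are hypothetical rather than the kernel's own procedures.

```c
/* Hypothetical reduced socket buffer: the two counters only. */
struct sbx {
    short sb_cc;      /* characters currently in the queue */
    short sb_hiwat;   /* maximum number of characters allowed */
};

/* Room left in the queue, in characters. */
static int sbx_space(const struct sbx *sb)
{
    return sb->sb_hiwat - sb->sb_cc;
}

/* Account for len characters being queued; returns 0 if they do not fit. */
static int sbx_append(struct sbx *sb, int len)
{
    if (len < 0 || len > sbx_space(sb))
        return 0;
    sb->sb_cc = (short)(sb->sb_cc + len);
    return 1;
}

/* Account for len characters being removed, e.g. copied out to the user. */
static void sbx_drop(struct sbx *sb, int len)
{
    if (len < 0)
        len = 0;
    if (len > sb->sb_cc)
        len = sb->sb_cc;
    sb->sb_cc = (short)(sb->sb_cc - len);
}
```

Note that every quantity here is a count of characters. A protocol that grants
credit in packets or buffers has no place in this accounting to express it.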
248 .sh 3 "Transport protocol management."
250 All protocols and address types must be \*(lqregistered\*(rq in a
251 common way in order to use the IPC user interface.
252 Each protocol must have an entry in a protocol switch table.
253 Each entry takes the form:
.(b
struct protosw {
    short   pr_type;             /* socket type used for */
    short   pr_family;           /* protocol family */
    short   pr_protocol;         /* protocol # from the database */
    short   pr_flags;            /* status information */
                                 /* protocol-protocol hooks */
    int     (*pr_input)();       /* input (from below) */
    int     (*pr_output)();      /* output (from above) */
    int     (*pr_ctlinput)();    /* control input */
    int     (*pr_ctloutput)();   /* control output */
                                 /* user-protocol hook */
    int     (*pr_usrreq)();      /* user request: see list below */
                                 /* utility hooks */
    int     (*pr_init)();        /* initialization hook */
    int     (*pr_fasttimo)();    /* fast timeout (200ms) */
    int     (*pr_slowtimo)();    /* slow timeout (500ms) */
    int     (*pr_drain)();       /* free some space (not used) */
};
.)b
283 Associated with each protocol are the types of socket
284 abstractions supported by the protocol (\fIpr_type\fR), the
285 format of the addresses used by the protocol (\fIpr_family\fR),
286 the routines to be called to perform
287 a standard set of protocol functions (\fIpr_input\fR,...,\fIpr_drain\fR),
288 and some status information (\fIpr_flags\fR).
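The SS_ flags, which are kept per socket in the \fIso_state\fR field
of the socket structure rather than in \fIpr_flags\fR, record such information as
SS_ISCONNECTED (this socket has a peer),
SS_ISCONNECTING (this socket is in the process of establishing
a connection),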
293 SS_ISDISCONNECTING (this socket is in the process of being disconnected),
294 SS_CANTSENDMORE (this socket is half-closed and cannot send),
295 SS_CANTRCVMORE (this socket is half-closed and cannot receive).
296 There are some flags that are specific to the TCP concept
298 A flag SS_OOBAVAIL was added for the ARGO implementation, to support
299 the TP concept of out-of-band data (expedited data).
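How a protocol switch table is consulted can be sketched as follows.
The sketch assumes that an entry is selected by matching the family and the socket
type, with a protocol number of zero meaning "any protocol". The family and type
codes, the two stub request handlers, and all names beginning with psw_ or PSW_
are hypothetical and chosen only for this example.

```c
#include <stddef.h>

enum { PSW_FAM_A = 1, PSW_FAM_B = 2 };        /* illustrative family codes */
enum { PSW_STREAM = 1, PSW_SEQPACKET = 5 };   /* illustrative socket types */

/* Hypothetical reduced switch entry: the lookup keys and one hook. */
struct psw_entry {
    short pr_type;
    short pr_family;
    short pr_protocol;
    int (*pr_usrreq)(int req);
};

/* Stub user-request handlers; each returns a value identifying itself. */
static int psw_a_usrreq(int req) { (void)req; return 1; }
static int psw_b_usrreq(int req) { (void)req; return 2; }

static const struct psw_entry psw_table[] = {
    { PSW_STREAM,    PSW_FAM_A, 1, psw_a_usrreq },
    { PSW_SEQPACKET, PSW_FAM_B, 2, psw_b_usrreq },
};
#define PSW_NENTRIES (sizeof psw_table / sizeof psw_table[0])

/* Find the entry for a family and socket type; protocol 0 matches any entry. */
static const struct psw_entry *
psw_find(const struct psw_entry *tbl, size_t n,
         short family, short type, short protocol)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (tbl[i].pr_family != family || tbl[i].pr_type != type)
            continue;
        if (protocol == 0 || tbl[i].pr_protocol == protocol)
            return &tbl[i];
    }
    return NULL;
}
```

Once an entry is found, the socket code reaches the protocol only through the
function pointers in that entry, which is what allows a new protocol to be added
by registering a new entry.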
300 .sh 3 "Network Interface Drivers"
302 The drivers for the devices attaching a Unix machine to a network
303 medium share a common interface to the protocol
There is a common data structure for managing queues, which is,
not surprisingly, a chain of mbufs.
307 There is a set of macros that are used to enqueue and
308 dequeue mbuf chains at high priority.
310 delivers an indication to a protocol entity when
311 an incoming packet has been placed on a queue by
315 .sh 3 "Support for individual protocols."
317 Each protocol is written as a separate functional unit.
318 Because all protocols share the clock and the mbuf pool, they
319 are not entirely insulated from each other.
320 The details of TP are described in a section that
322 .\"*****************************************************
327 shows the arrangement of the IPC support.
330 IPC was designed for DoD Internet protocols, all of
331 which run over DoD IP.
332 The assumptions that DoD Internet is the domain
333 and that DoD IP is the network layer
334 appear in the code and data structures in numerous places.
335 For example, it is assumed that addresses can be compared
336 by a bitwise comparison of 4 octets.
337 Another example is that the transport protocols all directly call
339 There are no hooks in the data structures through
340 which the transport layer can choose a network level protocol.
341 A third example is that the host's local addresses
342 are stored in the network interface drivers and the drivers
343 have only one address - an Internet address.
344 A fourth example is that headers are assumed to
345 fit in one small mbuf (112 bytes for data in AOS).
A fifth example is the assumption, made in many places,
that buffer space is managed
in units of characters or octets.
349 The user data are copied from user address space into the kernel mbufs
351 by the socket code, a protocol-independent part of the kernel.
352 This is fine for a stream protocol, but it means that a
353 packet protocol, in order to \*(lqpacketize\*(rq the data,
354 must perform a memory-to-memory copy
355 that might have been avoided had the protocol layer done the original
356 copy from user address space.
357 Furthermore, protocols that count credit in terms of packets or
358 buffers rather than characters do not work efficiently because
359 the computation of buffer space is not in the protocol module,
360 but rather it is in the socket code module.
361 This list of examples is not complete.
363 To summarize, adding a new transport protocol to the kernel consists of
364 adding entries to the tables in the protocol management
366 modifying the network interface driver(s) to recognize
367 new network protocol identifiers,
369 new system calls to the kernel and to the user library,
371 adding code modules for each of the protocols,
372 and correcting deficiencies in the socket code,
373 where the assumptions made about the nature of
374 transport protocols do not apply.