3 .NC "The Design of Unix IPC"
6 The ARGO implementation of
7 TP and CLNP was designed to fit into the AOS
10 All the standard protocol hooks are used.
11 To understand the design, it is useful to have
13 Leffler, Joy, and Fabry:
14 \*(lq4.2 BSD Networking Implementation Notes\*(rq July 1983.
15 This section describes the
16 design of the IPC support in the AOS kernel.
17 .sh 1 "Functional Unit Overview"
22 is a monolithic program of considerable size and complexity.
23 The code can be separated into parts of distinct function,
24 but there are no kernel processes per se.
25 The kernel code is either executed on behalf of a user
26 process, in which case the kernel was entered by a system call,
27 or it is executed on behalf of a hardware or software interrupt.
28 The following sections describe briefly the major functional units
31 .so ../wisc/figs/func_units.nr
33 shows the arrangement of these kernel units and
35 .sh 2 "The file system."
37 .sh 2 "Virtual memory support."
39 This includes protection, swapping, paging, and
41 .sh 2 "Blocked device drivers (disks, tapes)."
43 All these drivers share some minor functional units,
44 such as buffer management and bus support
45 for the various types of busses on the machine.
46 .sh 2 "Interprocess communication (IPC)."
49 support for various protocols,
50 buffer management, and a standard interface for inter-protocol
52 .sh 2 "Network interface drivers."
54 These drivers are closely tied to the IPC support.
55 They use the IPC's buffer management unit rather
56 than the buffers used by the blocked device drivers.
57 The interface between these drivers and the rest of the kernel
58 differs from the interface used by the blocked devices.
61 This is terminal support, including the user interface
62 and the device drivers.
63 .sh 2 "System call interface."
65 This handles signals, traps, and system calls.
68 The clock is used in various forms by many
70 .sh 2 "User process support (the rest)."
72 This includes support for accounting, process creation,
73 control, scheduling, and destruction.
77 The major functional unit that supports IPC
78 can be divided into the following smaller functional
80 .sh 3 "Buffer management."
82 All protocols share a pool of buffers called \fImbufs\fR.
83 The internal structure has changed considerably since 4.3:
92 +struct mbuf+*m_next;+/* next buffer in chain */
93 +struct mbuf+*m_act;+/* link in 2-d structure */
94 +u_long+m_len;+/* amount of data */
95 +char *+m_data;+/* location of data */
96 +short+m_type;+/* type of data */
97 +short+m_flags;+/* note if EOR, Packet HDR, Ext. stored */
98 +++/* If packet header add: */
+int+m_pkthdr.len;+/* total packet length */
+struct ifnet+*m_pkthdr.recvif;+/* rcv interface */
+++/* If external storage add: */
+char+*m_ext.ext_buf;+/* start of buffer */
103 +void+(*m_ext.ext_free)();+/* free routine if not the usual */
104 +u_int+m_ext.ext_size;+/* size of buffer, for ext_free */
105 +++/* For non external */
106 +char+m_dat[depending];+/* done by unions, etc. */
There are two forms of mbufs: with and without external storage.
113 Small ones are 128 octets in 4.4BSD.
114 The data in these mbufs are located
115 in the mbuf structure itself.
116 Large mbufs, called \fIclusters\fR, are page-sized
118 They may be \*(lqcopied\*(rq by multiply mapping the pages they occupy.
119 They consist of a page of memory plus a small mbuf structure
120 whose fields are used
121 to link clusters into chains, but whose \fIm_dat\fR array is
123 The \fIm_data\fR field of the structure
124 is a pointer to the active data in all cases.
The remainder of the description in the ARGO document
is generally obsolete, and I am merely deleting the
rest of it at this point.
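As a rough illustration of the two storage forms described above,
the following C sketch models a much-simplified mbuf.
The names sk_mbuf, SK_MLEN, SK_M_EXT, sk_get, sk_attach_ext, sk_fill and
sk_chain_len are hypothetical, and the layout is not that of the kernel structure;
only the roles of m_next, m_len, m_data, m_flags and m_dat follow the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SK_MLEN  112   /* assumed size of the internal data area (illustration only) */
#define SK_M_EXT 0x1   /* hypothetical flag: data lives in external storage */

struct sk_mbuf {
    struct sk_mbuf *m_next;         /* next buffer in chain */
    size_t          m_len;          /* amount of data in this buffer */
    char           *m_data;         /* location of the active data */
    short           m_flags;        /* SK_M_EXT if external storage is attached */
    char           *ext_buf;        /* start of external buffer, if any */
    char            m_dat[SK_MLEN]; /* internal storage, unused when external */
};

/* Allocate a small mbuf whose data pointer refers to its own m_dat array. */
struct sk_mbuf *sk_get(void)
{
    struct sk_mbuf *m = calloc(1, sizeof *m);
    if (m != NULL)
        m->m_data = m->m_dat;
    return m;
}

/* Attach external storage: m_data now points outside the structure. */
void sk_attach_ext(struct sk_mbuf *m, char *buf)
{
    assert(m != NULL && buf != NULL);
    m->ext_buf  = buf;
    m->m_data   = buf;
    m->m_flags |= SK_M_EXT;
}

/* Copy len bytes into the active data area; internal storage is bounds-checked. */
void sk_fill(struct sk_mbuf *m, const char *src, size_t len)
{
    assert((m->m_flags & SK_M_EXT) || len <= SK_MLEN);
    memcpy(m->m_data, src, len);
    m->m_len = len;
}

/* Total number of data bytes in a chain linked through m_next. */
size_t sk_chain_len(const struct sk_mbuf *m)
{
    size_t total = 0;
    for (; m != NULL; m = m->m_next)
        total += m->m_len;
    return total;
}
```

In either form a consumer reads through m_data and never needs to know
where the bytes actually reside.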
130 Routing decisions in the kernel are made by the procedure \fIrtalloc()\fR.
131 This procedure will scan the kernel routing tables (stored in mbufs)
The ARGO document is also quite obsolete here.
We now keep a tree-structured routing table
and do matching under masks.
The structure for a routing entry contains tree-related
pointers (parent and left and right children for internal nodes; mask and address
for external nodes), and may be completely revised again
to make use of Patricia trees.
141 If a route is not found, then a default route is used (if present).
143 If a route is found, the entity which called \fIrtalloc()\fR can use information
144 from the \fIrtentry\fR structure to dispatch the datagram. Specifically, the
145 datagram is queued on the interface identified by the interface
146 pointer \fIrt_ifp\fR.
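The lookup behavior described above (match under masks, fall back to a default
route, then dispatch through the interface pointer) can be sketched in C as follows.
This is only an illustration: it scans a flat array instead of walking a tree,
it assumes 32-bit addresses with contiguous masks, and the names sk_ifnet,
sk_rtentry and sk_rtlookup are hypothetical rather than kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sk_ifnet { const char *if_name; };

struct sk_rtentry {
    uint32_t         rt_dst;   /* destination, already masked */
    uint32_t         rt_mask;  /* contiguous mask; 0 marks the default route */
    struct sk_ifnet *rt_ifp;   /* interface on which to queue the datagram */
};

/*
 * Return the most specific entry whose masked destination matches dst.
 * A default route (mask 0, destination 0) matches every address and is
 * chosen only when nothing more specific matches. Returns NULL if the
 * table holds no matching entry and no default.
 */
const struct sk_rtentry *
sk_rtlookup(const struct sk_rtentry *tbl, size_t n, uint32_t dst)
{
    const struct sk_rtentry *best = NULL;
    size_t i;

    assert(tbl != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if ((dst & tbl[i].rt_mask) != tbl[i].rt_dst)
            continue;                       /* no match under this mask */
        if (best == NULL || tbl[i].rt_mask > best->rt_mask)
            best = &tbl[i];                 /* longer mask is more specific */
    }
    return best;
}
```

A caller would then queue the datagram on the interface named by the rt_ifp
field of the returned entry.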
149 This is the protocol-independent part of the IPC support.
150 Each communication endpoint (which may or may not be associated
151 with a connection) is represented by the following structure:
160 +short+so_type;+/* type, e.g. SOCK_DGRAM */
161 +short+so_options;+/* from socket call */
162 +short+so_linger;+/* time to linger @ close */
163 +short+so_state;+/* internal state flags */
164 +caddr_t+so_pcb;+/* network layer pcb */
165 +struct protosw+*so_proto;+/* protocol handle */
166 +struct socket+*so_head;+/* ptr to accept socket */
167 +struct socket+*so_q0;+/* queue of partial connX */
168 +short+so_q0len;+/* # partials on so_q0 */
169 +struct socket+*so_q;+/* queue of incoming connX */
170 +short+so_qlen;+/* # connections on so_q */
171 +short+so_qlimit;+/* max # queued connX */
173 ++short+sb_cc;+/* actual chars in buffer */
174 ++short+sb_hiwat;+/* max actual char count */
175 ++short+sb_mbcnt;+/* chars of mbufs used */
176 ++short+sb_mbmax;+/* max chars of mbufs to use */
177 ++short+sb_lowat;+/* low water mark (not used yet) */
++short+sb_timeo;+/* timeout (not used) */
179 ++struct mbuf+*sb_mb;+/* the mbuf chain */
180 ++struct proc+*sb_sel;+/* process selecting */
181 ++short+sb_flags;+/* flags, see below */
183 +short+so_timeo;+/* connection timeout */
184 +u_short+so_error;+/* error affecting connX */
185 +short+so_oobmark;+/* oob mark (TCP only) */
186 +short+so_pgrp;+/* pgrp for signals */
192 The socket code maintains a pair of queues for each socket,
193 \fIso_rcv\fR and \fIso_snd\fR.
194 Each queue is associated with a count of the number of characters
195 in the queue, the maximum number of characters allowed to be put
196 in the queue, some status information (\fIsb_flags\fR), and
197 several unused fields.
198 For a send operation, data are copied from the user's address space
199 into chains of mbufs.
200 This is done by the socket module, which then calls the underlying
201 transport protocol module to place the data
203 This is generally done by
204 appending to the chain beginning at \fIsb_mb\fR.
205 The socket module copies data from the \fIso_rcv\fR queue
206 to the user's address space to effect a receive operation.
207 The underlying transport layer is expected to have put incoming
208 data into \fIso_rcv\fR by calling procedures in this module.
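The character-count bookkeeping of a socket queue can be illustrated with a small C sketch.
It is a simplification under stated assumptions: a fixed byte array stands in for
the mbuf chain at sb_mb, only sb_cc and sb_hiwat are modeled, and the names
sk_sockbuf, SK_HIWAT, sk_sbinit, sk_sbspace, sk_sbappend and sk_sbreceive are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_HIWAT 16   /* assumed high-water mark, chosen small for illustration */

struct sk_sockbuf {
    size_t sb_cc;              /* actual chars in buffer */
    size_t sb_hiwat;           /* max actual char count */
    char   sb_data[SK_HIWAT];  /* stands in for the mbuf chain at sb_mb */
};

void sk_sbinit(struct sk_sockbuf *sb)
{
    sb->sb_cc = 0;
    sb->sb_hiwat = SK_HIWAT;
}

/* Space remaining, counted in characters. */
size_t sk_sbspace(const struct sk_sockbuf *sb)
{
    return sb->sb_hiwat - sb->sb_cc;
}

/* Append up to len bytes at the tail; returns the number actually queued. */
size_t sk_sbappend(struct sk_sockbuf *sb, const char *src, size_t len)
{
    size_t room;

    assert(sb->sb_cc <= sb->sb_hiwat);
    room = sk_sbspace(sb);
    if (len > room)
        len = room;
    memcpy(sb->sb_data + sb->sb_cc, src, len);
    sb->sb_cc += len;
    return len;
}

/* Copy up to len bytes from the head of the queue; returns the number delivered. */
size_t sk_sbreceive(struct sk_sockbuf *sb, char *dst, size_t len)
{
    if (len > sb->sb_cc)
        len = sb->sb_cc;
    memcpy(dst, sb->sb_data, len);
    memmove(sb->sb_data, sb->sb_data + len, sb->sb_cc - len);
    sb->sb_cc -= len;
    return len;
}
```

In this model the transport layer would call sk_sbappend on the receive queue
when data arrive, and the socket module would call sk_sbreceive to satisfy a user's receive.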
210 .sh 3 "Transport protocol management."
212 All protocols and address types must be \*(lqregistered\*(rq in a
213 common way in order to use the IPC user interface.
214 Each protocol must have an entry in a protocol switch table.
215 Each entry takes the form:
224 +short+pr_type;+/* socket type used for */
225 +short+pr_family;+/* protocol family */
226 +short+pr_protocol;+/* protocol # from the database */
227 +short+pr_flags;+/* status information */
228 +++/* protocol-protocol hooks */
229 +int+(*pr_input)();+/* input (from below) */
230 +int+(*pr_output)();+/* output (from above) */
231 +int+(*pr_ctlinput)();+/* control input */
232 +int+(*pr_ctloutput)();+/* control output */
233 +++/* user-protocol hook */
234 +int+(*pr_usrreq)();+/* user request: see list below */
235 +++/* utility hooks */
236 +int+(*pr_init)();+/* initialization hook */
237 +int+(*pr_fasttimo)();+/* fast timeout (200ms) */
238 +int+(*pr_slowtimo)();+/* slow timeout (500ms) */
239 +int+(*pr_drain)();+/* free some space (not used) */
245 Associated with each protocol are the types of socket
246 abstractions supported by the protocol (\fIpr_type\fR), the
247 format of the addresses used by the protocol (\fIpr_family\fR),
248 the routines to be called to perform
249 a standard set of protocol functions (\fIpr_input\fR,...,\fIpr_drain\fR),
250 and some status information (\fIpr_flags\fR).
Per-socket state, by contrast, is kept in the \fIso_state\fR field
of the socket structure, which holds such flags as
252 SS_ISCONNECTED (this socket has a peer),
253 SS_ISCONNECTING (this socket is in the process of establishing
255 SS_ISDISCONNECTING (this socket is in the process of being disconnected),
256 SS_CANTSENDMORE (this socket is half-closed and cannot send),
257 SS_CANTRCVMORE (this socket is half-closed and cannot receive).
258 There are some flags that are specific to the TCP concept
260 A flag SS_OOBAVAIL was added for the ARGO implementation, to support
261 the TP concept of out-of-band data (expedited data).
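How a protocol switch table is consulted can be sketched in C as follows.
The sketch assumes a lookup keyed on family and socket type and a user-request
hook taking a single integer; the real hooks take more arguments.
The names sk_protosw, sk_table, sk_findproto, sk_usrreq and the SK_ constants
are hypothetical, and the protocol numbers are placeholders.

```c
#include <assert.h>
#include <stddef.h>

struct sk_protosw {
    short pr_type;              /* socket type used for */
    short pr_family;            /* protocol family */
    short pr_protocol;          /* protocol number (placeholder values below) */
    int (*pr_usrreq)(int req);  /* user request hook, simplified signature */
};

enum { SK_SOCK_STREAM = 1, SK_SOCK_DGRAM = 2 };
enum { SK_AF_A = 1, SK_AF_B = 2 };   /* hypothetical address families */

static int sk_stream_usrreq(int req) { return 100 + req; }
static int sk_dgram_usrreq(int req)  { return 200 + req; }

/* Each protocol "registers" by having one entry in this table. */
static const struct sk_protosw sk_table[] = {
    { SK_SOCK_STREAM, SK_AF_A, 6,  sk_stream_usrreq },
    { SK_SOCK_DGRAM,  SK_AF_A, 17, sk_dgram_usrreq  },
};

/* Find the entry serving a given family and socket type, or NULL. */
const struct sk_protosw *sk_findproto(short family, short type)
{
    size_t i;

    for (i = 0; i < sizeof sk_table / sizeof sk_table[0]; i++)
        if (sk_table[i].pr_family == family && sk_table[i].pr_type == type)
            return &sk_table[i];
    return NULL;
}

/* Protocol-independent code reaches the protocol only through the hook. */
int sk_usrreq(const struct sk_protosw *pr, int req)
{
    assert(pr != NULL && pr->pr_usrreq != NULL);
    return pr->pr_usrreq(req);
}
```

The point of the indirection is that the socket code never names a protocol
routine directly; it holds a pointer to a table entry, as so_proto does, and calls through it.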
.sh 3 "Network interface drivers."
264 The drivers for the devices attaching a Unix machine to a network
265 medium share a common interface to the protocol
The common data structure for managing queues is,
not surprisingly, a chain of mbufs.
269 There is a set of macros that are used to enqueue and
270 dequeue mbuf chains at high priority.
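A minimal C sketch of such queue macros follows.
It is modeled loosely on the style of the kernel's interface queue macros,
but the names (sk_pkt, sk_ifqueue, SK_IF_QFULL, SK_IF_ENQUEUE, SK_IF_DEQUEUE,
sk_enqueue_or_drop) are hypothetical, a bare packet record stands in for an
mbuf chain, and the raising of processor priority is omitted because it has
no user-space equivalent.

```c
#include <assert.h>
#include <stddef.h>

struct sk_pkt {
    struct sk_pkt *m_act;   /* link to the next packet on the queue */
    int            id;      /* payload stand-in */
};

struct sk_ifqueue {
    struct sk_pkt *ifq_head;
    struct sk_pkt *ifq_tail;
    int            ifq_len;     /* packets currently queued */
    int            ifq_maxlen;  /* queue limit */
};

#define SK_IF_QFULL(q) ((q)->ifq_len >= (q)->ifq_maxlen)

/* Append m at the tail. The kernel would do this at raised priority. */
#define SK_IF_ENQUEUE(q, m) do {                        \
    (m)->m_act = NULL;                                  \
    if ((q)->ifq_tail == NULL)                          \
        (q)->ifq_head = (m);                            \
    else                                                \
        (q)->ifq_tail->m_act = (m);                     \
    (q)->ifq_tail = (m);                                \
    (q)->ifq_len++;                                     \
} while (0)

/* Remove the head into m, or set m to NULL if the queue is empty. */
#define SK_IF_DEQUEUE(q, m) do {                        \
    (m) = (q)->ifq_head;                                \
    if ((m) != NULL) {                                  \
        if (((q)->ifq_head = (m)->m_act) == NULL)       \
            (q)->ifq_tail = NULL;                       \
        (m)->m_act = NULL;                              \
        (q)->ifq_len--;                                 \
    }                                                   \
} while (0)

/* Queue a packet unless the limit is reached; returns 0 or -1. */
int sk_enqueue_or_drop(struct sk_ifqueue *q, struct sk_pkt *m)
{
    assert(q != NULL && m != NULL);
    if (SK_IF_QFULL(q))
        return -1;
    SK_IF_ENQUEUE(q, m);
    return 0;
}
```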
272 delivers an indication to a protocol entity when
273 an incoming packet has been placed on a queue by
277 .sh 3 "Support for individual protocols."
279 Each protocol is written as a separate functional unit.
280 Because all protocols share the clock and the mbuf pool, they
281 are not entirely insulated from each other.
282 The details of TP are described in a section that
284 .\"*****************************************************
286 .so ../wisc/figs/unix_ipc.nr
289 shows the arrangement of the IPC support.
292 IPC was designed for DoD Internet protocols, all of
293 which run over DoD IP.
294 The assumptions that DoD Internet is the domain
295 and that DoD IP is the network layer
296 appear in the code and data structures in numerous places.
297 An example is that the transport protocols all directly call
299 There are no hooks in the data structures through
300 which the transport layer can choose a network level protocol.
301 Another example is that headers are assumed to
302 fit in one small mbuf (112 bytes for data in AOS).
A third example is the assumption, made in many places, that buffer space is managed
in units of characters or octets.
306 The user data are copied from user address space into the kernel mbufs
308 by the socket code, a protocol-independent part of the kernel.
309 This is fine for a stream protocol, but it means that a
310 packet protocol, in order to \*(lqpacketize\*(rq the data,
311 must perform a memory-to-memory copy
312 that might have been avoided had the protocol layer done the original
313 copy from user address space.
Furthermore, protocols that count credit in terms of packets or
buffers rather than characters do not work efficiently, because
buffer space is computed in the socket code module
rather than in the protocol module.
318 This list of examples is not complete.
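The packetizing copy mentioned above can be made concrete with a short C sketch.
It assumes the data have already been gathered into one contiguous run of bytes
and that packets have a fixed maximum size; the names sk_packet, SK_PKT_MAX and
sk_packetize are hypothetical and do not correspond to any routine in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_PKT_MAX 8   /* assumed maximum packet payload, small for illustration */

struct sk_packet {
    size_t len;
    char   data[SK_PKT_MAX];
};

/*
 * Split n bytes of already-buffered data into packets of at most
 * SK_PKT_MAX bytes. Returns the number of packets written, which is
 * limited by max_pkts; any remaining bytes are left unconsumed.
 */
size_t sk_packetize(const char *stream, size_t n,
                    struct sk_packet *out, size_t max_pkts)
{
    size_t npkts = 0;

    assert(stream != NULL && out != NULL);
    while (n > 0 && npkts < max_pkts) {
        size_t chunk = n < SK_PKT_MAX ? n : SK_PKT_MAX;

        /* This is the second, memory-to-memory copy of the same data. */
        memcpy(out[npkts].data, stream, chunk);
        out[npkts].len = chunk;
        stream += chunk;
        n -= chunk;
        npkts++;
    }
    return npkts;
}
```

Had the protocol layer performed the original copy from user space, it could
have written directly into packet-sized buffers and skipped this step.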
320 To summarize, adding a new transport protocol to the kernel consists of
321 adding entries to the tables in the protocol management
323 modifying the network interface driver(s) to recognize
324 new network protocol identifiers,
326 new system calls to the kernel and to the user library,
328 adding code modules for each of the protocols,
329 and correcting deficiencies in the socket code,
330 where the assumptions made about the nature of
331 transport protocols do not apply.
333 (Touchy touchy, aren't we!?! -- Sklower)