1 .\" $NetBSD: trans_design.nr,v 1.2 1998/01/09 06:34:53 perry Exp $
3 .NC "The Design of the ARGO Transport Entity"
6 The design of the AOS kernel IPC support to some
9 Each protocol must provide the following
10 protocol hooks, which are procedures called through a
12 (an array of type \fIprotosw\fR as described in
15 Called when data are to be passed up from a lower layer.
17 Called when data are to be passed down from a higher layer.
19 Called when the system is brought up.
21 Called every 200 milliseconds by the clock functional unit.
23 Called every 500 milliseconds by the clock functional unit.
25 This is meant to be called when buffer space is low.
26 Each protocol is expected to provide this routine to free
27 non-critical buffer space.
28 This is not yet called anywhere.
30 Used for exchanging information between
31 protocols, such as notifying a transport protocol of changes
32 in routing or configuration information.
33 .ip "pr_ctloutput()" 5
34 Supports the protocol-dependent
40 Called by the socket code to pass along a \*(lquser request\*(rq -
41 in other words a service primitive.
42 This call is also used for other protocol functions.
43 The functions served by the \fIpr_usrreq()\fR routine are:
45 Creates a protocol control block and attaches it to a given socket.
46 Called as a result of a \fIsocket()\fR system call.
47 .ip " PRU_DISCONNECT" 10
48 Called as a result of a
49 \fIclose()\fR system call.
50 Initiates disconnection.
52 Disassociates a protocol control block from a socket and recycles
53 the buffer space used for the protocol control block.
54 Called after PRU_DISCONNECT.
55 .ip " PRU_SHUTDOWN" 10
56 Called as a result of a
57 \fIshutdown()\fR system call.
58 If the protocol supports the notion of half-open connections,
59 this closes the connection in one direction or both directions,
60 depending on the arguments passed to
63 Gives an address to a socket.
64 Called as a result of a
65 \fIbind()\fR system call, also
67 socket without a bound address is used.
68 In the latter case, an unused transport suffix is located and
71 Called as a result of a
72 \fIlisten()\fR system call.
73 Marks the socket as willing to queue incoming connection
76 Called as a result of a
77 \fIconnect()\fR system call.
78 Initiates a connection request.
80 Called as a result of an
81 \fIaccept()\fR system call.
82 Dequeues a pending connection request, or blocks waiting for
83 a connection request to arrive.
84 In the latter case, it marks the socket as willing to accept
87 The protocol module is expected to have put incoming data
88 into the socket's receive buffer, \fIso_rcv\fR.
89 When a receive primitive is used
90 (\fIrecv(), recvmsg(), recvfrom(),
91 read(), readv(), \fRand
92 \fIrecvv()\fR system calls)
93 the socket code module copies data from the
94 \fIso_rcv\fR to the user's
96 The protocol module may arrange to be informed each time the socket code
97 does this, in which case the socket code calls \fIpr_usrreq\fR(PRU_RCVD)
98 after the data were copied to the user.
100 This performs the protocol-dependent part of a send primitive
101 (\fIsend(), sendmsg(), sendto(), write(), writev(),
102 \fRand \fIsendv()\fR system calls).
(procedures \fIsendit()\fR and \fIsosend()\fR)
105 moves outgoing data from the user's
106 address space into a chain of \fImbufs\fR.
107 The socket code takes as much data from the user as it
108 determines will fit into the outgoing socket buffer, so_snd.
109 It passes this much data in the form of an mbuf chain to the protocol
110 via \fIpr_usrreq\fR(PRU_SEND).
111 If there are more data than
112 the so_snd can accommodate,
113 the socket code, which is running on behalf of a user process,
114 puts the user process to sleep.
115 The protocol module is expected to wake up the user process when
116 more room appears in so_snd.
118 Called when a socket is closed and that socket
119 is accepting connections and has
121 connection requests or
122 partially open connections.
123 .ip " PRU_CONTROL" 10
124 Called as a result of an
125 \fIioctl()\fR system call.
127 Called as a result of an
128 \fIfstat()\fR system call.
130 Performs the work of receiving \*(lqout-of-band\*(rq data.
131 The socket module has already allocated an mbuf into which
132 the protocol module is expected to put the incoming
133 \*(lqout-of-band\*(rq data.
134 The socket code will then move the data from this mbuf
135 to the user's address space.
136 .ip " PRU_SENDOOB" 10
137 Performs the work of sending \*(lqout-of-band\*(rq data.
138 The socket module has already moved the data
139 from the user's address space into a chain of mbufs,
140 which it now passes to the protocol module.
141 .ip " PRU_SOCKADDR" 10
142 Supports the system call
144 Puts the socket's bound address into an mbuf.
145 .ip " PRU_PEERADDR" 10
146 Supports the system call
148 Puts the peer's address into an mbuf.
149 .ip " PRU_CONNECT2" 10
150 This is used in the Unix domain to support pipes.
151 It is not generally supported by transport protocols.
152 .ip " PRU_FASTTIMO, PRU_SLOWTIMO" 10
153 These are superfluous.
154 None of the transport protocols uses them.
155 .ip " PRU_PROTORCV, PRU_PROTOSEND" 10
156 None of the transport protocols uses these.
157 .ip " PRU_SENDEOT" 10
158 This was added to support TP.
159 This indicates that the end of the data sent in this
160 send primitive should
161 be marked by the protocol as the end of the TSDU.
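.pp
The hook mechanism described above can be pictured with a small sketch.
It is not the AOS source: the structure \fIdemo_protosw\fR, the request
codes, and the handler are hypothetical stand-ins, reduced to the single idea
that protocol-independent code reaches a protocol only through function
pointers in a table, and that \fIpr_usrreq()\fR fans out on a request code.

```c
#include <stddef.h>

/* Hypothetical request codes; the real PRU_* values live in kernel headers. */
enum demo_req { DEMO_ATTACH, DEMO_DETACH, DEMO_SEND };

/* Hypothetical, much reduced protocol switch entry. */
struct demo_protosw {
    int (*pr_usrreq)(void *so, int req, void *data);
    void (*pr_slowtimo)(void);
};

static int demo_attached;   /* control blocks currently attached */

static int demo_usrreq(void *so, int req, void *data)
{
    (void)so;
    (void)data;
    switch (req) {
    case DEMO_ATTACH: demo_attached++; return 0;
    case DEMO_DETACH: demo_attached--; return 0;
    case DEMO_SEND:   return demo_attached > 0 ? 0 : -1;
    default:          return -1;   /* request not supported */
    }
}

static struct demo_protosw demo_sw[] = { { demo_usrreq, NULL } };

/* Protocol-independent code enters the protocol only through the table. */
int demo_request(int proto, int req)
{
    if (proto < 0 || proto >= (int)(sizeof demo_sw / sizeof demo_sw[0]))
        return -1;
    return demo_sw[proto].pr_usrreq(NULL, req, NULL);
}
```

In this sketch a send request fails until an attach request has created
a control block, which mirrors the ordering of \fIsocket()\fR and the send
primitives described above.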
162 .sh 1 "The Interface Between the Transport Entity and Lower Layers"
164 The transport layer may run over a network layer such as IP
165 or the ISO connectionless network layer,
166 or it may run over a multi-purpose layer such as the service
168 X.25 is viewed as a network layer when
169 TP runs over X.25, and as a
171 when IP is running over X.25.
172 The software interface between data link and network layers differs
173 considerably from the software interface between transport and network
175 For this reason some modification of the transport-to-lower-layer
176 interface is necessary to support the suite of protocols included in
179 In AOS it is assumed that the transport layer will run over one
180 and only one network layer, and therefore it may call the
181 network layer output procedure directly.
182 In order to allow TP to run over a set of lower layers,
183 all domain-specific functions have been put into a set of routines
184 that are called indirectly through a domain-specific switch table.
185 The primary reason for this is that the transport and network
186 layers share information, mostly information pertaining to addresses.
187 The protocol control blocks for different network layers
188 differ, so the transport layer cannot just directly
189 access the network layer's pcb.
190 Similarly, a network layer may not directly access the transport
191 pcb because a multitude of transport protocols can run over each
192 of the network protocols.
194 To permit different network-layer protocol control blocks to coexist
195 under one transport layer, all transport-dependent control
196 information was put into a transport-specific protocol control block.
197 A new field, \fIso_tpcb\fR,
198 was added to the \fIsocket\fR structure to hold a pointer to
199 the transport-layer protocol control block.
201 field \fCso_pcb\fR is used for the network layer pcb.
203 The following structure was added to allow domain-specific
204 functions to be called indirectly.
205 All these functions operate on a network-layer pcb.
215 +int+nlp_afamily;+/* address family */
216 +int+(*nlp_putnetaddr)();+/* puts addrs in pcb */
217 +int+(*nlp_getnetaddr)();+/* gets addrs from pcb */
218 +int+(*nlp_putsufx)();+/* transp suffix -> pcb */
219 +int+(*nlp_getsufx)();+/* gets t-suffix */
220 +int+(*nlp_recycle_suffix)();+/* zeroes suffix */
221 +int+(*nlp_mtu)();+/* get maximum
222 +++transmission unit size */
223 +int+(*nlp_pcbbind)();+/* bind to pcb */
224 +int+(*nlp_pcbconn)();+/* connect */
225 +int+(*nlp_pcbdisc)();+/* disconnect */
226 +int+(*nlp_pcbdetach)();+/* detach pcb */
227 +int+(*nlp_pcballoc)();+/* allocate a pcb */
228 +int+(*nlp_output)();+/* emit packet */
229 +int+(*nlp_dgoutput)();+/* emit datagram */
230 +caddr_t+nlp_pcblist;+/* list of pcbs
238 The switch is based on the address family chosen when the
239 \fIsocket()\fR system call is made prior to connection establishment.
240 This unfortunately ties the address family to the domain,
241 but the only alternative is to add an argument to the \fIsocket()\fR
242 system call to let the user specify the desired network layer.
In the case of a connection-oriented environment with no multi-homing,
244 it would be possible to determine which network layer is to be
247 information, but to do this requires unrealistic assumptions
248 about the environment.
249 For these reasons, linking the address family to the network
250 layer protocol is seen as the least of the evils.
251 The transport suffixes are kept in the network layer's pcb
252 as well as in the transport layer because
253 full transport address pairs are used to identify a connection
254 in the Internet domain.
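.pp
Selecting a switch entry by address family might look roughly like the
following sketch.
Only two of the fields listed above are kept, \fInlp_mtu\fR is given a
simplified argument list, and the address family codes, MTU values, and all
\fIdemo_\fR names are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical address family codes, for illustration only. */
#define DEMO_AF_INET 2
#define DEMO_AF_ISO  7

/* Reduced switch entry: only two of the fields listed above. */
struct demo_nl_protosw {
    int nlp_afamily;          /* address family */
    int (*nlp_mtu)(void);     /* get maximum transmission unit size */
};

static int demo_inet_mtu(void) { return 1500; }  /* assumed value */
static int demo_iso_mtu(void)  { return 1497; }  /* assumed value */

static struct demo_nl_protosw demo_nl_sw[] = {
    { DEMO_AF_INET, demo_inet_mtu },
    { DEMO_AF_ISO,  demo_iso_mtu  },
};

/* Pick the entry for the address family given to socket(); NULL if none. */
struct demo_nl_protosw *demo_nl_lookup(int af)
{
    size_t i;

    for (i = 0; i < sizeof demo_nl_sw / sizeof demo_nl_sw[0]; i++)
        if (demo_nl_sw[i].nlp_afamily == af)
            return &demo_nl_sw[i];
    return NULL;
}
```

Once the entry is found, the transport layer never needs to know which
network layer lies beneath it; every domain-specific operation goes through
the pointers in the entry.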
255 .sh 1 "The Architecture of the Transport Protocol Entity"
257 A set of protocol hooks is required
258 by the AOS IPC architecture.
259 These hooks are used by the protocol-independent parts of the kernel
260 to gain entry to protocol-specific code.
261 The protocol code can be entered in one of the following ways:
263 at boot time, when autoconfiguration
264 initializes each protocol through
270 a user program making a system call, through
271 the \fIpr_usrreq()\fR or \fIpr_ctloutput()\fR hooks, or
272 from a higher layer protocol using the
273 \fIpr_output()\fR hook,
275 from below, a device interrupt servicing an incoming packet
276 through the \fIpr_input()\fR and \fIpr_ctlinput()\fR hooks, and
278 from a clock interrupt through the \fIpr_slowtimo()\fR
280 \fIpr_fasttimo()\fR hook.
282 .so figs/trans_flow.nr
283 .\".so figs/trans_flow.grn
285 The protocol code can be divided into
286 the following modules, which are described in more detail below.
288 shows the flow of data and control
291 .ip "Timers and References:" 5
292 The code executed on behalf of \fIpr_slowtimo()\fR.
293 The fast timeout is not used by TP.
295 This is the finite state machine for TP.
297 This is the module that decodes incoming packets,
298 identifies or creates the pcb for which
299 the packet is destined, and creates an "event" to
302 This is the module that creates a packet header of a given type
303 with fields containing
304 values that are appropriate to the connection
305 on which the packet is being sent, appends data if necessary,
307 to the lower layer, according to the transport-to-lower-layer
310 This module packetizes data from the outbound
311 socket buffer, \fIso_snd\fR,
312 handles retransmissions of packetized data, and
313 drops packetized data from the retransmission queue.
315 This module reorders packets if necessary,
316 depacketizes data, passes it to the socket code module,
317 and determines when acknowledgments should be sent.
319 .sh 1 "Timers and References"
321 TP identifies sockets by \fIreference numbers\fR, or
323 which are \*(lqfrozen\*(rq (may not be reassigned)
324 until some locally defined time after
325 a connection is broken and its protocol control block
327 An array of \fIreference blocks\fR is maintained by TP.
328 The reference number of a reference block is its
330 When a reference block is in use it contains
331 a pointer to the pcb for the socket to which the
334 The system clock calls the \fIpr_slowtimo()\fR and
335 \fIpr_fasttimo()\fR hooks for each protocol in the protocol switch table
every 500 and 200 milliseconds, respectively.
337 Each protocol handles its own timers its own way.
338 The timers in TP take two forms
339 - those that typically are cancelled and
340 those that usually expire.
341 The latter form may have more than one instantiation at any given
344 The two are implemented slightly
345 differently for the sake of performance.
347 The timers that normally expire
348 are kept in a queue, their values all relative
to the value of the preceding timer.
350 Thus all timer values are decremented by a single
351 operation on the value of the first timer.
352 The timer is represented by the Ecallout structure:
361 +int+c_time;+/* incremental time */
362 +int+c_func;+/* function to call */
363 +u_int+c_arg1;+/* argument to routine */
364 +u_int+c_arg2;+/* argument to routine */
365 +int+c_arg3;+/* argument to routine */
366 +struct Ecallout+*c_next;
372 When an Ecallout structure migrates to the head
373 of the E timer list, and its \fIc_time\fR
374 field is decremented to zero,
375 the function stored in \fIc_func\fR is
376 called, with \fIc_arg1, c_arg2\fR, and \fIc_arg3\fR
378 Setting and cancelling these timers
379 are accomplished by a linear search and one
380 insertion or deletion from the timer queue.
381 This queue is linked to the
382 reference block associated with a communication endpoint.
This form is used for the reference timer
384 and for the retransmission timers for data TPDUs.
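.pp
The relative-time queue used for these timers can be sketched as follows.
This is a simplified illustration, not the TP code: the structure carries one
argument instead of three, \fIc_func\fR is a real function pointer, and the
names \fIdemo_e_set()\fR, \fIdemo_e_tick()\fR, and \fIdemo_record()\fR are
hypothetical.
Cancellation is omitted; it would be the same linear search followed by
adding the removed entry's \fIc_time\fR to its successor.

```c
#include <stddef.h>

struct demo_ecallout {
    int c_time;                      /* ticks after the preceding entry */
    void (*c_func)(int);             /* function to call on expiry */
    int c_arg;                       /* single argument, for brevity */
    struct demo_ecallout *c_next;
};

int demo_fired[8];                   /* arguments of expired timers, in order */
int demo_nfired;

void demo_record(int arg)
{
    if (demo_nfired < 8)
        demo_fired[demo_nfired++] = arg;
}

/* Insert e so that it fires `ticks` ticks from now (linear search). */
void demo_e_set(struct demo_ecallout **head, struct demo_ecallout *e, int ticks)
{
    struct demo_ecallout **pp = head;

    while (*pp != NULL && (*pp)->c_time <= ticks) {
        ticks -= (*pp)->c_time;      /* make ticks relative to this entry */
        pp = &(*pp)->c_next;
    }
    e->c_time = ticks;
    e->c_next = *pp;
    if (*pp != NULL)
        (*pp)->c_time -= ticks;      /* successor is now relative to e */
    *pp = e;
}

/* One clock tick: decrement only the head, then run every expired entry. */
void demo_e_tick(struct demo_ecallout **head)
{
    if (*head == NULL)
        return;
    (*head)->c_time--;
    while (*head != NULL && (*head)->c_time <= 0) {
        struct demo_ecallout *e = *head;

        *head = e->c_next;
        e->c_func(e->c_arg);
    }
}
```

Timers set for 3, 5, and 1 ticks end up stored as 1, 2, and 2, so a clock
tick touches only the first entry no matter how many timers are pending.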
386 The second form of timer, the type that
387 typically is cancelled, is used for several
388 timers - the inactivity timer, the sendack timer,
389 and the retransmission
390 timer for all types of TPDUs except data TPDUs.
399 +int+c_time;+/* incremental time */
400 +int+c_active;+/* this timer is active? */
406 All of these timers are stored
408 in the reference block.
409 These timers are decremented in one linear scan of
410 the reference blocks.
411 Cancelling, setting, and both
412 cancelling and resetting one of these timers is accomplished by a
413 single assignment to an array element.
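.pp
A minimal sketch of this second form follows.
The array size, the \fIdemo_\fR names, and the use of a bit mask to report
expirations are assumptions; only the two fields of the structure and the
idea of one array per reference block come from the text above.

```c
#define DEMO_N_CTIMERS 4   /* assumed count; the real N_CTIMERS is not shown */

struct demo_ccallout {
    int c_time;     /* ticks remaining */
    int c_active;   /* is this timer active? */
};

struct demo_ref {
    struct demo_ccallout tpr_callout[DEMO_N_CTIMERS];
};

/* Setting or resetting a timer is one assignment to an array element. */
void demo_c_set(struct demo_ref *r, int which, int ticks)
{
    r->tpr_callout[which] = (struct demo_ccallout){ ticks, 1 };
}

/* Cancelling clears only the active flag. */
void demo_c_cancel(struct demo_ref *r, int which)
{
    r->tpr_callout[which].c_active = 0;
}

/* One tick over one reference block; returns a bit mask of expired timers. */
unsigned demo_c_tick(struct demo_ref *r)
{
    unsigned expired = 0;
    int i;

    for (i = 0; i < DEMO_N_CTIMERS; i++) {
        struct demo_ccallout *c = &r->tpr_callout[i];

        if (c->c_active && --c->c_time <= 0) {
            c->c_active = 0;
            expired |= 1u << i;
        }
    }
    return expired;
}
```

The cost is moved from setting and cancelling, which happen on every packet,
to the periodic scan, which happens once per clock tick.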
416 This is the finite state machine for TP.
417 A connection is managed by the finite state machine (fsm).
418 All events that pertain to a connection cause the
419 finite state machine driver to be called.
420 The driver takes two arguments - the pcb for the connection
421 and an event structure.
422 The event structure contains a field that discriminates
423 the different types of events, and a union of
424 structures that are specific to the event types.
425 The driver evaluates a set of predicates based on the current
426 state of the finite state machine (which is kept in the pcb) and the event type.
427 The result of the predicate evaluation determines
428 a set of actions to take and a state transition.
429 The driver takes the actions and if they complete
430 without errors, the driver makes the state transition.
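.pp
The driver's behavior can be illustrated with a hand-written table.
This sketch is an assumption-laden miniature, not the output of xebec: the
states, event types, predicates, and actions are invented, and the event
structure holds a single integer where the real one holds a union.
It shows only the rule stated above, that the state changes only when the
selected actions complete without error.

```c
#include <stddef.h>

enum demo_state { DEMO_CLOSED, DEMO_OPEN };
enum demo_evtype { DEMO_EV_CONNECT, DEMO_EV_CLOSE };

struct demo_event { int ev_type; int ev_arg; };
struct demo_pcb   { int state; int actions_run; };

struct demo_transition {
    int from;                                               /* current state */
    int ev_type;                                            /* event type */
    int (*pred)(struct demo_pcb *, struct demo_event *);    /* NULL: always true */
    int (*action)(struct demo_pcb *, struct demo_event *);  /* 0 on success */
    int to;                                                 /* next state */
};

int demo_arg_ok(struct demo_pcb *p, struct demo_event *e)
{ (void)p; return e->ev_arg >= 0; }

int demo_count(struct demo_pcb *p, struct demo_event *e)
{ (void)e; p->actions_run++; return 0; }

int demo_fail(struct demo_pcb *p, struct demo_event *e)
{ (void)p; (void)e; return -1; }

static struct demo_transition demo_table[] = {
    { DEMO_CLOSED, DEMO_EV_CONNECT, demo_arg_ok, demo_count, DEMO_OPEN   },
    { DEMO_OPEN,   DEMO_EV_CONNECT, NULL,        demo_fail,  DEMO_CLOSED },
    { DEMO_OPEN,   DEMO_EV_CLOSE,   NULL,        demo_count, DEMO_CLOSED },
};

/* Returns 0 if a transition was taken, -1 if none matched or the action failed. */
int demo_driver(struct demo_pcb *p, struct demo_event *e)
{
    size_t i;

    for (i = 0; i < sizeof demo_table / sizeof demo_table[0]; i++) {
        struct demo_transition *t = &demo_table[i];

        if (t->from != p->state || t->ev_type != e->ev_type)
            continue;
        if (t->pred != NULL && !t->pred(p, e))
            continue;
        if (t->action(p, e) != 0)
            return -1;          /* action failed: state is left unchanged */
        p->state = t->to;
        return 0;
    }
    return -1;
}
```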
432 The states, event types, predicates, actions, and state transitions are all
433 specified as a \fIxebec transition file\fR.
434 \fIXebec\fR is a utility that takes a human-readable description
435 of a finite state machine
436 and produces a set of tables and C source code for the driver.
437 The driver procedure is called \fItp_driver()\fR.
438 It is located in a file generated by xebec,
440 For more details about xebec, see the manual page \fIxebec(1)\fR.
442 The transition file for TP is \fCtp.trans\fR,
443 and it is a good place to begin a perusal of the TP
447 This is the module that decodes an incoming packet,
448 locates or creates the pcb for which
449 the packet is destined, and creates an event to
451 The network layer passes a packet up to the appropriate
452 transport layer by indirectly calling a transport input
453 routine through the protocol switch table for the network
455 There is one protocol switch entry for TP for each domain in which
456 TP will run (Internet, ISO).
457 In the Internet domain, the protocol switch field \fIpr_input()\fR
458 takes the value \fItpip_input()\fR.
459 This procedure accepts a packet from IP, with the IP header
461 It extracts the network addresses from the IP header,
462 strips the IP header, and calls the domain-independent
463 input procedure for TP,
467 The multitude of options, the variable-length
468 nature of the options, the semantics of the
469 options, and the possible combinations of concatenated
472 It is sensitive to changes, and from
the point of view of software maintenance, it is a
critical path of TP, however, some compromise
477 was made between maintainability and efficiency.
478 Multiple copies of sections of code were avoided as much as
480 not for the sake of saving space, but rather for the sake
483 this detracts somewhat from the readability of the code.
485 Once a TPDU has been decoded and a pcb has been
486 identified for the TPDU,
487 the appropriate fields of the TPDU
488 are extracted and their values are placed in
490 Finally, \fItp_driver()\fR is called with
491 the event structure and the pcb as parameters.
494 This module creates a TPDU header of a given type
495 with field values that are appropriate to the connection
496 on which the TPDU is being sent, appends data if necessary,
498 to the lower layer according to the transport-to-lower-layer
500 Whenever a TPDU is to be sent to the peer or prospective peer,
501 the function \fItp_emit()\fR
is called, passing as arguments the pcb, a TPDU type, and several miscellaneous
503 other type-specific arguments, possibly including some data.
504 The data are in the form of an mbuf chain.
505 \fITp_emit()\fR prepends to the data an mbuf containing a TP header,
506 fills in the fields of the header according to the parameters
507 given, performs the checksum if appropriate, and
508 calls a domain-specific output routine.
509 For the Internet domain, this output routine is
510 \fItpip_output()\fR, which takes
511 as arguments the mbuf chain representing the TPDU,
512 and a network level pcb.
513 Some protocol errors cannot be associated with
515 but require that TP issue
516 an ER TPDU or a DR TPDU.
517 When these errors occur the routine
518 \fItp_error_emit()\fR is called.
519 This procedure creates the appropriate type of TPDU
520 and passes it to a domain-dependent routine for transmitting datagrams.
521 In the Internet domain,
522 \fItpip_output_dg()\fR is called.
523 This takes as arguments an mbuf chain representing the TPDU,
524 a source network address, and a destination network address.
528 .\".so figs/mbufsnd.grn
530 This module packetizes data from the outbound
531 socket buffer, \fIso_snd\fR,
532 handles retransmissions of packetized data, and
533 drops packetized data from the retransmission queue.
534 The major routine in this module is \fItp_send()\fR, which
535 takes a range of sequence numbers as arguments.
536 For each sequence number in the range,
it packetizes an appropriate amount
538 of outbound data, and places the resulting TPDU on
539 a retransmission control queue subject to the
540 constraints imposed by the rules of expedited data,
541 maximum packet sizes, and end-of-TSDU markers.
543 The most complicating factor is that of managing
545 A normal datum may not be sent (for its first time) before the
546 acknowledgment of any expedited datum
547 that was received from the user after the
548 normal datum was received.
549 In order to enforce this rule,
550 each TPDU must be marked in some way
551 so that it will be known which expedited datum
552 must be delivered and acknowledged by the peer before this TPDU may be transmitted
554 Markers are placed in \fIso_snd\fR
556 outgoing expedited datum arrives from the user.
557 A marker is an mbuf structure with an \fIm_len\fR
558 of zero, but with the data area nevertheless containing
559 the sequence number of an expedited data TPDU.
560 The \fIm_type\fR of a marker is a new type, MT_XPD.
562 \fITp_send()\fR stops packetizing data when it encounters a marker
563 for an unacknowledged expedited datum.
564 If it encounters a marker for an expedited TPDU that has already
565 been acknowledged, the marker is jettisoned.
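.pp
The handling of markers can be sketched with a plain linked list standing in
for \fIso_snd\fR.
The types and names are hypothetical, packetizing is reduced to counting
buffers, and sequence number wraparound is ignored; the parameter
\fIxpd_una\fR is assumed to be the sequence number of the lowest
unacknowledged expedited TPDU.

```c
#include <stddef.h>

enum demo_kind { DEMO_DATA, DEMO_XPD_MARK };

struct demo_buf {
    enum demo_kind kind;
    unsigned xpd_seq;            /* meaningful for markers only */
    struct demo_buf *next;
};

/*
 * Walk the queue from *head. Data buffers are "packetized" (here just
 * unlinked and counted). Markers whose expedited datum is acknowledged
 * (sequence number below xpd_una) are discarded; the walk stops at the
 * first marker that is still unacknowledged.
 */
int demo_send(struct demo_buf **head, unsigned xpd_una)
{
    int sent = 0;

    while (*head != NULL) {
        struct demo_buf *b = *head;

        if (b->kind == DEMO_XPD_MARK) {
            if (b->xpd_seq >= xpd_una)
                break;           /* expedited datum not yet acknowledged */
        } else {
            sent++;
        }
        *head = b->next;
    }
    return sent;
}
```

Normal data queued behind an unacknowledged marker stays on the list until a
later call finds that the expedited datum has been acknowledged.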
567 illustrates the structure of the sending socket buffer used
570 When \fItp_send()\fR moves data from mbufs on \fIso_snd\fR to the retransmission
571 control queue, it needs to know
572 how many octets of data can be placed in each TPDU.
573 The appropriate amount depends on, among other things,
574 the maximum transmission unit of the network layer
575 on the route the packet will take.
576 To determine the maximum transmission unit,
577 TP queries the network layer through
the domain-dependent switch table's field, \fInlp_mtu\fR.
579 In the Internet domain, this resolves to \fItp_inmtu()\fR.
580 The header sizes for the network and transport layers
581 also affect the amount of data that can go into a packet,
582 and these sizes depend on the connection's characteristics.
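.pp
As a rough illustration of the arithmetic, the following sketch computes the
room left for user data in one TPDU.
The function name and parameters are hypothetical, and the sketch assumes
that the negotiated maximum TPDU size counts the transport header; the actual
computation depends on other connection characteristics as well.

```c
/*
 * Octets of user data that fit in one TPDU, or 0 if the headers alone
 * exceed the limit. All four inputs are in octets.
 */
int demo_tpdu_data(int mtu, int nl_hdr, int tp_hdr, int max_tpdu)
{
    int room = mtu - nl_hdr;        /* what the network layer can carry */

    if (room > max_tpdu)
        room = max_tpdu;            /* negotiated TPDU size also bounds it */
    room -= tp_hdr;
    return room > 0 ? room : 0;
}
```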
584 Once the maximum amount of data per TPDU is determined,
585 \fItp_send()\fR can pull this amount off the \fIso_snd\fR queue to form
587 assign a TPDU sequence number,
588 and place the new TPDU on the
589 retransmission control queue.
590 The retransmission control queue is a list of mbuf chains.
591 Each mbuf chain represents one TPDU, preceded by an
601 +struct tp_rtc+*tprt_next;+/* next rtc struct in list */
602 +SeqNum+tprt_seq;+/* seq # of this TPDU */
603 +int+tprt_eot;+/* end of TSDU? */
604 +int+tprt_octets;+/* # octets in this TPDU */
605 +struct mbuf+*tprt_data;+/* ptr to the octets of data */
606 .\"/* Performance measurment info: */
607 .\"int tprt_window; /* in which call to tp_send() was
608 .\" * this TPDU formed?
610 .\"struct timeval tprt_sess_time; /* time session received the
611 .\" * majority of the data for this packet on send;
612 .\" * on recv, this is the time it's given to session
614 .\"struct timeval tprt_net_time; /* time first copy was given to net layer
615 .\" * on send; on receive it's the time received from
623 Once TPDUs are on the retransmission control queue,
624 they are retransmitted or dropped by the actions
626 The procedure \fItp_sbdrop()\fR
627 removes the TPDUs from the retransmission queue.
628 It takes a sequence number as an argument and drops
629 all TPDUs up to and including the TPDU with that sequence number.
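.pp
Dropping acknowledged TPDUs from the head of the list can be sketched as
follows.
The structure keeps only two of the \fItp_rtc\fR fields, the function name is
hypothetical, and the comparison assumes ascending sequence numbers with no
wraparound; a real implementation would also release the mbufs.

```c
#include <stddef.h>

struct demo_rtc {
    struct demo_rtc *tprt_next;   /* next rtc struct in list */
    unsigned tprt_seq;            /* seq # of this TPDU */
};

/*
 * Unlink every entry up to and including the one numbered seq.
 * Returns the number of entries dropped.
 */
int demo_sbdrop(struct demo_rtc **head, unsigned seq)
{
    int n = 0;

    while (*head != NULL && (*head)->tprt_seq <= seq) {
        *head = (*head)->tprt_next;
        n++;
    }
    return n;
}
```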
631 When an AK TPDU arrives, the values from
632 its credit and sequence number fields
633 are passed to \fItp_goodack()\fR, which
634 determines whether or not the AK brought any news with it,
635 and therefore whether TP can send more data
637 If this AK acknowledges something heretofore unacknowledged,
638 \fItp_goodack()\fR drops the appropriate TPDU(s) from the retransmission
639 control list, computes the smoothed average round trip time
640 and standard deviation of the round trip time,
642 the retransmission timer based on these statistics.
643 It sets a flag in the pcb if the TP entity is obliged to
644 send the flow control confirmation parameter on its next
646 \fITp_goodack()\fR returns true if the AK brought some news with it,
647 either with respect to a change in credit or with respect to
650 The function \fItp_goodXack()\fR is called when an XAK TPDU
652 It takes the XAK sequence number as an argument and
653 determines if the XAK acknowledges the last XPD TPDU sent.
654 If so, it drops the expedited data from the outgoing
655 expedited data buffer.
656 By its definition in the TP specification,
657 the expedited data stream has a window
660 only one expedited datum (packet) can be buffered
662 \fITp_goodXack()\fR returns true if the XAK acknowledged
663 the last XPD TPDU sent and the data were dropped,
664 and it returns false if the acknowledgment caused no action to be taken.
667 .\".so figs/mbufrcv.grn
670 This module reorders incoming TPDUs if necessary,
671 depacketizes data, passes it to the socket code module,
672 and determines when acknowledgments should be sent.
takes a DT TPDU as an argument, and if the TPDU is not in
676 sequence, it saves the TPDU in a \fItp_rtc\fR structure in
677 a list, with the TPDUs
679 When the next expected TPDU arrives, the
680 list of out-of-order TPDUs is scanned for
681 more TPDUs in sequence, updating
682 a field in the pcb, \fItp_rcvnxt\fR which
683 always contains the sequence
685 the next expected TPDU.
686 If an acknowledgment is to be generated
687 at any time, the value of tp_rcvnxt goes into the
688 \fIYR-TU-NR\fR\** field of the acknowledgment TPDU.
691 This is the name used in ISO 8073 for the field
692 which indicates the sequence number of the next expected DT TPDU.
695 \fITp_stash()\fR returns true if an acknowledgment needs to be generated
immediately, and false if not.
697 The acknowledgment strategy is therefore implemented in this routine.
698 Acknowledgments may be generated for one or more of several reasons,
700 \fITp_stash()\fR increments a counter for each of these reasons
701 for which an acknowledgment is generated, and a counter for TPDUs
702 that are not acknowledged immediately.
703 .ip "ACK_STRAT_EACH" 5
704 The acknowledgment strategy in use calls for acknowledging each
705 data packet with an AK TPDU.
706 .ip "ACK_STRAT_FULLWIN" 5
707 The acknowledgment strategy in use calls for acknowledging
708 upon receiving the DT TPDU that represents the upper window
709 edge of the last advertised window.
711 A duplicate data TPDU was received.
713 A DT TPDU arrived in the window but out of order.
715 A DT TPDU arrived, and it had the end-of-TSDU flag set.
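.pp
The reordering step performed when a DT TPDU arrives, described earlier in
this section, can be sketched as follows.
This is a simplified illustration under stated assumptions: the function
returns a count of TPDUs that became deliverable rather than the
acknowledgment decision, sequence number wraparound is ignored, and the
\fIdemo_\fR names are hypothetical.
The field names \fItp_rcvnxt\fR and \fItp_rcvnxt_rtc\fR follow the pcb.

```c
#include <stddef.h>

struct demo_rtc {
    struct demo_rtc *tprt_next;
    unsigned tprt_seq;
};

struct demo_pcb {
    unsigned tp_rcvnxt;              /* next DT sequence number expected */
    struct demo_rtc *tp_rcvnxt_rtc;  /* out-of-order TPDUs, ascending */
};

/*
 * Accept one DT TPDU. Returns the number of TPDUs that became
 * deliverable in sequence (0 if r was saved out of order or was a
 * duplicate).
 */
int demo_stash(struct demo_pcb *p, struct demo_rtc *r)
{
    struct demo_rtc **pp = &p->tp_rcvnxt_rtc;
    int delivered;

    if (r->tprt_seq < p->tp_rcvnxt)
        return 0;                                  /* duplicate */
    if (r->tprt_seq > p->tp_rcvnxt) {              /* out of order: save it */
        while (*pp != NULL && (*pp)->tprt_seq < r->tprt_seq)
            pp = &(*pp)->tprt_next;
        if (*pp != NULL && (*pp)->tprt_seq == r->tprt_seq)
            return 0;                              /* duplicate */
        r->tprt_next = *pp;
        *pp = r;
        return 0;
    }
    p->tp_rcvnxt++;                                /* r itself is in sequence */
    delivered = 1;
    while (p->tp_rcvnxt_rtc != NULL &&
           p->tp_rcvnxt_rtc->tprt_seq == p->tp_rcvnxt) {
        p->tp_rcvnxt_rtc = p->tp_rcvnxt_rtc->tprt_next;
        p->tp_rcvnxt++;
        delivered++;
    }
    return delivered;
}
```

If TPDUs 2, 1, and 4 arrive before TPDU 0, the arrival of TPDU 0 releases
three TPDUs at once and leaves TPDU 4 on the list.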
717 Upon receipt of a DT TPDU that is in order, and upon reordering
720 places the TSDUs into the socket's receive
721 socket buffer, \fIso->so_rcv\fR in mbuf chains, with
722 TSDUs delimited by mbufs of the \fIm_type\fR MT_EOT,
723 which is a new type with the ARGO kernel.
725 illustrates the structure of the receiving socket buffer used
728 A separate socket buffer, \fItpcb->tp_Xrcv\fR,
730 buffering expedited data.
731 Only one expedited data packet may reside in this buffer at a time
732 because the TP standard limits the size of the window on expedited flow
734 This means the data structures are straightforward;
735 there is no need to distinguish between separate TSDUs in this socket buffer.
738 by dividing the total amount of available
739 space in the receive buffer
740 by the negotiated maximum TPDU size.
741 TP can often offer a larger credit than this if it uses
742 an average of the measured actual TPDU sizes.
743 This strategy was once an option in the ARGO kernel,
744 but it was removed because unless the actual TPDU size
745 is constant, it leads to reneging of credit,
746 retransmissions, and decreased performance.
747 It does not work well when there is any fluctuation in the sizes
748 of TPDUs and it carries the penalty of lengthening the critical path
750 .sh 1 "Major Data Structures and Types"
752 In addition to the types commonly used in the kernel,
759 +typedef+unsigned char+u_char;
760 +typedef+unsigned int+u_int;
761 +typedef+unsigned short+u_short;
765 TP uses the following types:
+typedef+unsigned int+SeqNum;
772 +typedef+unsigned short+RefNum;
773 +typedef+int+ProtoHook;
778 Sequence numbers can be either 7 or 31 bits.
779 An unsigned integer is used in all cases, and the proper type
780 of arithmetic is performed with bit masks.
781 Reference numbers are 16 bits.
782 ProtoHook is the type of the procedures that are in switch
784 although they are not functions,
785 are declared \fIint\fR rather than \fIvoid\fR
786 to be consistent with the rest of the kernel.
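.pp
Sequence arithmetic with bit masks might be written along the following
lines.
This is a sketch, not the macros used in TP: the structure and function names
are hypothetical, and the three values are modelled on the \fItp_seqmask\fR,
\fItp_seqbit\fR, and \fItp_seqhalf\fR fields of the pcb.
It assumes that \fIunsigned int\fR is at least 32 bits wide.

```c
typedef unsigned int SeqNum;

struct demo_seqspace {
    unsigned int tp_seqmask;   /* bits in use: 0x7f or 0x7fffffff */
    unsigned int tp_seqbit;    /* lowest bit not in use */
    unsigned int tp_seqhalf;   /* half the sequence space */
};

/* Fill in the three values for a sequence space of `bits` bits (7 or 31). */
void demo_seq_init(struct demo_seqspace *s, int bits)
{
    s->tp_seqbit  = 1u << bits;
    s->tp_seqmask = s->tp_seqbit - 1;
    s->tp_seqhalf = s->tp_seqbit >> 1;
}

/* (a + n) modulo the size of the sequence space. */
SeqNum demo_seq_add(const struct demo_seqspace *s, SeqNum a, unsigned int n)
{
    return (a + n) & s->tp_seqmask;
}

/* True if a precedes b, taking wraparound into account. */
int demo_seq_lt(const struct demo_seqspace *s, SeqNum a, SeqNum b)
{
    unsigned int d = (b - a) & s->tp_seqmask;   /* forward distance a -> b */

    return d != 0 && d < s->tp_seqhalf;
}
```

Because the three values are computed once, the same code serves both the
7-bit and the 31-bit formats without testing the format on each operation.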
788 The following structures are fundamental
789 types used throughout TP,
790 in addition to those already described in the
792 "The Design of the Transport Entity".
801 +u_char+tpr_state;+/* REF_FROZEN...*/
802 +struct Ccallout+tpr_callout[N_CTIMERS];+/* C timers */
803 +struct Ecallout+tpr_calltodo;+/* E timers list */
804 +struct tp_pcb+*tpr_pcb;+/* --> PCB */
810 The reference structure is logically a part of the protocol
811 control block and it is linked to a pcb, but it may outlive
813 When a connection is dissolved, the pcb may be recycled
814 but the reference structure must remain until the reference
816 The field \fItpr_state\fR takes the values
817 REF_FROZEN (a reference timer is ticking),
818 REF_OPEN (in use, has timers and an associated pcb),
819 REF_OPENING (has a pcb but no timers), and
820 REF_FREE (free to reallocate).
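.pp
The life cycle of a reference block can be sketched as follows.
The table size, the numeric values of the states, and the \fIdemo_\fR
function names are assumptions, as is the choice to use the array index as
the reference number.

```c
#define DEMO_N_REFS 4          /* assumed table size */

enum { REF_FREE, REF_OPENING, REF_OPEN, REF_FROZEN };  /* values assumed */

struct demo_ref { unsigned char tpr_state; };

static struct demo_ref demo_refs[DEMO_N_REFS];

/* Return the reference number (array index) of a free block, or -1. */
int demo_ref_alloc(void)
{
    int i;

    for (i = 0; i < DEMO_N_REFS; i++)
        if (demo_refs[i].tpr_state == REF_FREE) {
            demo_refs[i].tpr_state = REF_OPENING;
            return i;
        }
    return -1;
}

/* Connection dissolved: the pcb may go, the reference number may not. */
void demo_ref_freeze(int ref) { demo_refs[ref].tpr_state = REF_FROZEN; }

/* Reference timer expired: the number may be reassigned. */
void demo_ref_free(int ref) { demo_refs[ref].tpr_state = REF_FREE; }
```

A frozen block is skipped by the allocator, so its reference number cannot
be handed to a new connection until the reference timer has expired.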
822 The TP protocol control block is too large to fit into
823 one mbuf structure so it comprises two structures
\fItp_pcb\fR structure and the
826 \fItp_pcb_aux\fR structure.
827 The \fItp_pcb_aux\fR structure contains
828 items that are used less frequently than those in
829 the former structure, since each access to these
830 items requires a second pointer dereference.
839 +struct sockbuf+tpa_Xsnd;+/* for expedited data */
840 +struct sockbuf+tpa_Xrcv;+/* for expedited data */
841 +u_char +tpa_vers;+/* protocol version */
842 +u_char +tpa_peer_acktime;+/* to compute DT TPDU
843 +++retrans timer value */
844 +SeqNum+tpa_Xsndnxt;+/* seq # of
845 +++next XPD to send */
846 +SeqNum+tpa_Xuna;+/* seq # of
848 +SeqNum+tpa_Xrcvnxt;+/* next XPD seq #
851 +u_short+tpa_domain;+/* domain AF_ISO,...*/
852 +u_short+tpa_fsuffixlen;+/* foreign suffix */
853 +u_char+tpa_fsuffix[MAX_TSAP_SEL_LEN];+
854 +u_short+tpa_lsuffixlen;+/* local suffix */
855 +u_char+tpa_lsuffix[MAX_TSAP_SEL_LEN];+
858 +/* AK subsequencing */
861 +u_short+tpa_s_subseq;+/* next subseq to send */
862 +u_short+tpa_r_subseq;+/* highest recv subseq */
868 The major portion of the protocol control block is in the
869 \fItp_pcb\fR structure:
876 .\" ***************************************
879 .\" The next line sets the spacing for the table: 1+3 17+3 17+3 13+3
881 .\"456789 123456789- 123456789 123456-789 123456789 1234567890
883 %struct tp_ref%*tp_refp;%
886 %%/* reference structure */%
887 .\" ***************************************
890 %struct tp_pcb_aux%*tp_aux;%
893 %%/*rest of tpcb (auxiliary struct)*/%
894 .\" ***************************************
897 %caddr_t%tp_npcb;%/* to ll pcb */
898 %struct nl_protosw%*tp_nlproto;%
901 % %/* domain-dependent routines */%
902 .\" ***************************************
905 %struct socket%*tp_sock;%/* back ptr */
906 .\" ***************************************
910 /* local and foreign reference numbers: */
915 .\" ***************************************
918 .\"456789 123456789 123456789 123456789 123456789 1234567890
920 /* Stuff for sequence space arithmetic:
921 * Maintaining 2 sequence spaces is a pain so we set these
922 * values once at connection establishment time. Sequence
923 * number arithmetic is a set of macros which uses these.
924 * Sequence numbers are stored as 32 bits.
925 * tp_seqmask tells which of the 32 bits is used.
 * tp_seqbit is the lsb that is not used. When set,
927 * it indicates wraparound has occurred.
928 * tp_seqhalf is the value that is half the sequence space.
929 * (or half plus one).
933 %u_int%tp_seqmask;%/* mask */
934 %u_int%tp_seqbit;%/* wraparound */
935 %u_int%tp_seqhalf;%/* half space */
936 .\" ***************************************
940 /* flags: values are defined in tp_user.h.
941 * Here we keep such info as which options
942 * are in use: checksum, extended format,
943 * flow control in class 2, etc.
944 * See tp(4p) man page.
946 .\" ***************************************
949 %u_short%tp_state;%/* fsm */
953 % % /* # times to retransmit */%
954 .\" ***************************************
958 /* credit & sequencing info for SENDING: */
961 %u_short%tp_fcredit;%
962 % %/* remote real window */%
963 %u_short%tp_cong_win;%
964 % %/* remote congestion window */%
965 .\" ***************************************
969 % %/* seq # of lowest unacked DT */%
970 .\" ***************************************
973 %struct tp_rtc %*tp_snduna_rtc;%
976 % %/* ptr to mbufs containing lowest%
977 %% * unacked TPDUs sent so far%
979 .\" ***************************************
982 %SeqNum%tp_sndhiwat;%
985 % %/* highest DT sent yet */%
986 .\" ***************************************
989 %struct tp_rtc%*tp_sndhiwat_rtc;%
992 % %/* ptr to mbufs containing the last%
993 %% * DT sent - this is the last item %
994 %% * on the list that starts%
995 %% * at tp_snduna_rtc%
997 .\" ***************************************
1000 %int %tp_Nwindow;%/* for perf. measmt */
1001 .\" ***************************************
1005 /* credit & sequencing info for RECEIVING: */
1006 .\" ***************************************
1009 %SeqNum%tp_sent_lcdt;%
1010 %%/* cdt according to last AK sent */%
1011 %SeqNum%tp_sent_uwe;%
1012 % %/* upper window edge, according to%
1013 %% * the last AK sent %
1015 %SeqNum%tp_sent_rcvnxt;%
1016 % %/* rcvnxt, according to%
1017 %% * the last AK sent%
1019 .\" ***************************************
1022 %short%tp_lcredit;%/* local */
1023 .\" ***************************************
1029 % %/* next DT seq# we expect to recv */%
1030 .\" ***************************************
1033 %struct tp_rtc%*tp_rcvnxt_rtc;%
1036 % %/* ptr to mbufs containing unacked %
1037 %% * DTs received out of order, and %
1038 %% * which we haven't acknowledged%
1040 .\" ***************************************
1045 /* Items kept in the aux structure: */
1047 .\" ***************************************
1050 #define tp_vers%tp_aux->tpa_vers
1051 #define tp_peer_acktime%tp_aux->tpa_peer_acktime
1052 #define tp_Xsnd%tp_aux->tpa_Xsnd
1053 #define tp_Xrcv%tp_aux->tpa_Xrcv
1054 #define tp_Xrcvnxt%tp_aux->tpa_Xrcvnxt
1055 #define tp_Xsndnxt%tp_aux->tpa_Xsndnxt
1056 #define tp_Xuna%tp_aux->tpa_Xuna
1057 #define tp_domain%tp_aux->tpa_domain
1058 #define tp_fsuffixlen%tp_aux->tpa_fsuffixlen
1059 #define tp_fsuffix%tp_aux->tpa_fsuffix
1060 #define tp_lsuffixlen%tp_aux->tpa_lsuffixlen
1061 #define tp_lsuffix%tp_aux->tpa_lsuffix
1062 #define tp_s_subseq%tp_aux->tpa_s_subseq
1063 #define tp_r_subseq%tp_aux->tpa_r_subseq
1064 .\" ***************************************
1068 /* parameters per-connection controllable by user: */
1069 .\" ***************************************
1072 %struct%tp_conn_param%_tp_param;
1074 .\" ***************************************
1077 #define tp_Nretrans%_tp_param.p_Nretrans
1078 #define tp_dr_ticks%_tp_param.p_dr_ticks
1079 #define tp_cc_ticks%_tp_param.p_cc_ticks
1080 #define tp_dt_ticks%_tp_param.p_dt_ticks
1081 #define tp_xpd_ticks%_tp_param.p_x_ticks
1082 #define tp_cr_ticks%_tp_param.p_cr_ticks
1083 #define tp_keepalive_ticks%_tp_param.p_keepalive_ticks
1084 #define tp_sendack_ticks%_tp_param.p_sendack_ticks
1085 #define tp_refer_ticks%_tp_param.p_ref_ticks
1086 #define tp_inact_ticks%_tp_param.p_inact_ticks
1087 #define tp_xtd_format%_tp_param.p_xtd_format
1088 #define tp_xpd_service%_tp_param.p_xpd_service
1089 #define tp_ack_strat%_tp_param.p_ack_strat
1090 #define tp_rx_strat%_tp_param.p_rx_strat
1091 #define tp_use_checksum%_tp_param.p_use_checksum
1092 #define tp_tpdusize%_tp_param.p_tpdusize
1093 #define tp_class%_tp_param.p_class
1094 #define tp_winsize%_tp_param.p_winsize
1095 #define tp_netservice%_tp_param.p_netservice
1096 #define tp_no_disc_indications%_tp_param.p_no_disc_indications
1097 #define tp_dont_change_params%_tp_param.p_dont_change_params
1098 .\" ***************************************
1100 .\" ***************************************
1101 .\" ***************************************
1102 .\" ***************************************
1106 .\" The next line sets the spacing for the table: 1+3 17+3 17+3 13+3
1107 .\"456789 123456789- 123456789 123456-789 123456789 1234567890
1111 %%/* log2(the negotiated max size) */%
1114 %int%tp_l_tpdusize;%/* # bytes */
1115 .\" ***************************************
1116 %struct timeval%tp_rtt;%
1119 % %/* smoothed avg round-trip time */%
1120 %struct timeval%tp_rtv;%
1121 % %/* std deviation of round-trip time */%
1122 %struct timeval%tp_rttemit[ TP_RTT_NUM + 1 ];%
1123 %%/* times that the last TP_RTT_NUM %
1124 %% * DT_TPDUs were transmitted %
1126 .\" ***************************************
1128 % tp_sendfcc:1,%/* shall next ack %
1129 % %include flow control conf. param? */%
1130 .\" ***************************************
1133 % tp_trace:1,%/* is this pcb being traced?%
1134 %% * (not used yet) %
1136 .\" ***************************************
1137 % tp_perf_on:1,%/* statistics being kept? */%
1138 .\" ***************************************
1139 % tp_reneged:1,%/* have we reneged on credit%
1140 %% * since the last AK TPDU was sent? %
1142 % tp_decbit:4,%/* congestion experienced? */%
1143 % tp_flags:8,%/* see #defines below */%
1144 .\" ***************************************
1148 #define TPF_XPD_PRESENT%TPFLAG_XPD_PRESENT
1149 #define TPF_NLQOS_PDN%TPFLAG_NLQOS_PDN
1150 #define TPF_PEER_ON_SAMENET%TPFLAG_PEER_ON_SAMENET
1152 .\" ***************************************
1155 %struct tp_pmeas%*tp_p_meas;%
1158 % %/* ptr to mbuf to hold the perf.%
1159 %% * statistics structure %
1161 .\" ***************************************
1166 .\" end of tpcb structure (thank you)
1170 .sh 1 "Sequence Number Arithmetic"
1172 Sequence numbers in TP can be either 7 bits
1173 (\*(lqnormal format\*(rq)
1175 (\*(lqextended format\*(rq).
1176 Sequence numbers are unsigned integers,
1177 regardless of their format.
1178 Three fields are kept in the pcb to manage the sequence
1185 +u_int+tp_seqmask;+/* mask for seq space */
1186 +u_int+tp_seqbit;+/* bit for seq # wraparound */
1187 +u_int+tp_seqhalf;+/* half the seq space */
1193 is a bit mask indicating which bits are legitimate
1194 for a sequence number of either format.
1195 It takes the value 0x7f if 7-bit sequence numbers are in use,
1196 and 0x7fffffff if 31-bit sequence numbers are in use.
1198 is the bit that becomes set when a sequence number wraps around
1199 while being incremented.
1200 Its value is 0x80 for normal format, 0x80000000 for extended format.
1202 takes the value which is in the middle of the sequence space,
1203 0x40 for normal format,
1205 0x40000000 for extended format.
1219 extracts a sequence number from the location
1220 in which it is stored.
1228 +SEQ_GT(tpcb, seq, t)+is seq > t?
1229 +SEQ_GEQ(tpcb, seq, t)+is seq >= t?
1230 +SEQ_LT(tpcb, seq, t)+is seq < t?
1231 +SEQ_LEQ(tpcb, seq, t)+is seq <= t?
1232 +SEQ_INC(tpcb, seq)+seq\+\+
1233 +SEQ_DEC(tpcb, seq)+seq--
1234 +SEQ_SUB(tpcb, seq, amt)+seq -= amt
1235 +SEQ_ADD(tpcb, seq, amt)+seq \+= amt
1240 perform the indicated comparisons and arithmetic
1243 An example of how these macros
1244 are used is as follows.
1245 To determine if a sequence
1246 number \fIseq\fR is in a receive window
1248 \fIlwe\fR and \fIuwe\fR,
1256 #define+IN_RWINDOW(tpcb, seq, lwe, uwe)\\
1257 +( SEQ_GEQ(tpcb, seq, lwe) && SEQ_LT(tpcb, seq, uwe) )
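.pp
The wraparound behavior described in this section can be illustrated with a small, self-contained C sketch.
It is not the ARGO source.
The structure \fIseqspace\fR and the functions \fIseqspace_make()\fR, \fIseq_add()\fR, \fIseq_sub()\fR, \fIseq_gt()\fR, \fIseq_geq()\fR, \fIseq_lt()\fR and \fIin_rwindow()\fR are hypothetical names chosen for illustration.
The comparison rule (one number follows another if it is ahead by less than half the sequence space) is an assumption about how the macros behave, based only on the three fields described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the three pcb fields described above. */
struct seqspace {
    uint32_t mask;  /* like tp_seqmask: bits that carry the number  */
    uint32_t bit;   /* like tp_seqbit: first bit above the mask     */
    uint32_t half;  /* like tp_seqhalf: half of the sequence space  */
};

/* Build the values for normal (7-bit) or extended (31-bit) format. */
struct seqspace seqspace_make(int extended)
{
    struct seqspace s;

    s.mask = extended ? 0x7fffffffu : 0x7fu;
    s.bit = s.mask + 1;     /* 0x80 or 0x80000000 */
    s.half = s.bit >> 1;    /* 0x40 or 0x40000000 */
    return s;
}

/* seq + amt, wrapped back into the sequence space. */
uint32_t seq_add(const struct seqspace *s, uint32_t seq, uint32_t amt)
{
    return (seq + amt) & s->mask;
}

/* seq - amt, wrapped back into the sequence space. */
uint32_t seq_sub(const struct seqspace *s, uint32_t seq, uint32_t amt)
{
    return (seq - amt) & s->mask;
}

/* Assumed rule: a > b when a is ahead of b by less than half the space. */
int seq_gt(const struct seqspace *s, uint32_t a, uint32_t b)
{
    uint32_t ahead = (a - b) & s->mask;

    assert(s->half != 0);
    return ahead != 0 && ahead < s->half;
}

int seq_geq(const struct seqspace *s, uint32_t a, uint32_t b)
{
    return ((a ^ b) & s->mask) == 0 || seq_gt(s, a, b);
}

int seq_lt(const struct seqspace *s, uint32_t a, uint32_t b)
{
    return seq_gt(s, b, a);
}

/* Same shape as IN_RWINDOW: lwe <= seq < uwe, modulo wraparound. */
int in_rwindow(const struct seqspace *s, uint32_t seq,
               uint32_t lwe, uint32_t uwe)
{
    return seq_geq(s, seq, lwe) && seq_lt(s, seq, uwe);
}
```

.pp
With 7-bit sequence numbers and a window running from 0x7c up to, but not including, 0x04, this sketch accepts 0x7e and 0x01 and rejects 0x04 and 0x10.
Two numbers that are exactly half the space apart compare as neither greater nor less under this rule.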
1261 .sh 1 "TP Implementation Options"
1263 The transport protocol specification leaves several
1264 things to the discretion of the implementor,
1265 some of which may affect the performance
1266 of individual connections and
1267 aggregate performance.
1268 Wherever different strategies are likely to favor
1270 individual connections to the detriment of aggregate performance
1272 various strategies are under the control of options via the
1273 \fIgetsockopt()\fR and
1274 \fIsetsockopt()\fR system calls (see the manual pages
1275 \fIgetsockopt(2)\fR,
1280 In some cases the preferred strategies differ for the different
1281 subnetworks, so the strategies chosen will be determined
1282 by the subnetwork in use.
1285 The limitation of the maximum TPDU size to a power of two is
1286 unfortunate in the LAN environment.
1287 For example, if the maximum NSDU size is around 1500, as in the case of an
1289 using a maximum TPDU size of 1024 reduces
1290 the possible throughput by approximately 30%.
1291 TP negotiates a maximum TPDU size of 2048 and
1292 generates TPDUs of size around 1500.
1293 Obviously this works well only when the peer is known to be
1294 using the same scheme (so that the peer
1295 doesn't send TPDUs of size 2048 and cause its
1296 network layer to fragment the TPDUs).
1297 This is likely to be the case in a LAN where
1298 all protocol entities are under the same administrative
1300 The maximum TPDU size negotiated is under the control of the user,
1302 it is possible to prevent this scheme from being used
1304 when the peer is not on the same LAN, by
1305 setting the \fItp.tpdusize\fR parameter in the ARGO directory service
1307 something less than the network's maximum transmission
1309 .\"***********************************************************
1310 .sh 2 "Congestion Window Strategy"
1312 The congestion window strategy from the
1314 was adapted for use with TP.
1315 The strategy is intended to minimize the
1317 of transport's retransmission on an
1318 already congested network.
1320 A TP entity keeps two notions of the peer's window:
1321 the real window, which is that advertised by the peer
1322 in AK TPDUs, and the congestion window, which is a locally
1324 TP uses the smaller of the two windows when transmitting.
1325 The congestion window starts small, which keeps a
1326 new connection from overloading the network with a sudden
1328 immediately after connection establishment.
1329 This is called \fIslow start\fR.
1330 For each successful acknowledgment received, the congestion
1331 window grows by one, until eventually the real window
1333 If a retransmission timer expires, the congestion window
1334 is reset to size one.
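.pp
The interaction of the two windows can be sketched in a few lines of C.
This is an illustration under stated assumptions, not the ARGO code.
The structure \fIsendwin\fR and its functions are hypothetical, although the field names echo \fItp_fcredit\fR and \fItp_cong_win\fR from the pcb.
The initial congestion window of one TPDU is an assumption, since the text says only that the window starts small.

```c
#include <assert.h>

/* Hypothetical per-connection send window state. */
struct sendwin {
    unsigned short fcredit;   /* real window, as advertised in AK TPDUs */
    unsigned short cong_win;  /* locally maintained congestion window   */
};

/* Slow start: begin with a small congestion window (assumed to be 1). */
void sendwin_init(struct sendwin *w, unsigned short fcredit)
{
    w->fcredit = fcredit;
    w->cong_win = 1;
}

/* The sender uses the smaller of the two windows. */
unsigned short sendwin_usable(const struct sendwin *w)
{
    return w->cong_win < w->fcredit ? w->cong_win : w->fcredit;
}

/* A successful acknowledgment grows the congestion window by one,
 * but never beyond the real window. */
void sendwin_on_ack(struct sendwin *w, unsigned short new_fcredit)
{
    assert(w->cong_win >= 1);
    w->fcredit = new_fcredit;
    if (w->cong_win < w->fcredit)
        w->cong_win++;
}

/* Expiry of a retransmission timer resets the congestion window. */
void sendwin_on_rexmit_timeout(struct sendwin *w)
{
    w->cong_win = 1;
}
```

.pp
With an advertised credit of 4, the usable window in this sketch is 1, 2, 3 and then 4 after three successive acknowledgments, and it drops back to 1 when a retransmission timer expires.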
1336 The congestion window strategy is used for class 4 unless
1337 the transport user requests that it not be used.
1338 The slow start strategy is used for traffic over a PDN
1340 the transport user requests that it not be used.
1341 Slow start is not used for traffic over a LAN unless
1342 its use is requested by the transport user.
1343 .\"***********************************************************
1344 .sh 2 "Retransmission Strategies"
1346 A retransmission timer is invoked for each set of DT TPDUs
1347 sent in one send operation (call to \fItp_send()\fR).
1348 This set of packets is called the \fIsend window\fR for the purpose
1349 of this discussion.
1353 depends on the remote credit and the amount of data
1354 in the local send buffers.
1355 When a retransmission timer goes off, the lower
1357 is reevaluated but the upper window edge is not reevaluated.
1359 There are several retransmission strategies implemented in
1361 The choice of strategies is the user's, and is made with the
1362 \fIsetsockopt()\fR system call.
1363 The strategies are summarized here:
1364 .ip "Retransmit LWE TPDU only:" 5
1365 Only the TPDU representing the new lower window edge
1367 This is the default retransmission strategy.
1368 .ip "Retransmit whole send window:" 5
1369 Retransmission begins with the new lower window edge
1370 and continues up to the old upper window edge.
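.pp
The difference between the two strategies can be sketched as a function that lists the sequence numbers to be retransmitted.
The names \fIrx_strategy\fR and \fIrexmit_plan()\fR are hypothetical.
The sketch also assumes that the old upper window edge is given as the highest sequence number already sent, and that at least one TPDU is outstanding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the two strategies listed above. */
enum rx_strategy { RX_LWE_ONLY, RX_WHOLE_WINDOW };

/*
 * Fill out[] with the sequence numbers to retransmit and return how
 * many were written.  lwe is the new lower window edge, hiwat the
 * highest sequence number sent so far (assumed to be inclusive), and
 * mask the sequence mask (0x7f for normal format).
 */
size_t rexmit_plan(enum rx_strategy strat, unsigned lwe, unsigned hiwat,
                   unsigned mask, unsigned *out, size_t max)
{
    size_t n = 0;
    size_t count = ((hiwat - lwe) & mask) + 1;  /* TPDUs outstanding */
    unsigned seq = lwe & mask;

    assert(out != 0 || max == 0);
    if (strat == RX_LWE_ONLY)
        count = 1;                  /* only the lower window edge */
    while (n < count && n < max) {
        out[n++] = seq;
        seq = (seq + 1) & mask;     /* step with wraparound */
    }
    return n;
}
```

.pp
For a 7-bit sequence space with a lower window edge of 0x7e and a highest sent number of 0x01, the first strategy yields only 0x7e, while the second yields 0x7e, 0x7f, 0x00 and 0x01.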
1372 The value of the data retransmission timer
1373 adapts to the average round trip time and the standard deviation of
1374 the round trip time.
1375 A round trip time is the time that passes between
1376 the moment of a packet's first transmission and
1377 the moment it is first acknowledged.
1378 The average round trip time
1379 is kept by the sending side of TP, using
1381 smoothing the average:
1387 #define+TP_RTT_ALPHA+3
1388 #define+TP_RTV_ALPHA+2
1390 #define+SMOOTH(alpha, old, new) \\
1391 +(((new-old) >> alpha ) \+ (old) )
1396 The times included in the average are chosen as follows.
1398 each packet's initial transmission is kept (for the last
1399 \fIN\fR packets, where \fIN\fR is a defined constant).
1400 When an AK TPDU arrives, ARGO TP subtracts the initial transmission
1401 time for the lowest unacknowledged sequence number that was
1402 acknowledged by this AK TPDU from the current time,
1403 and applies the resulting time to the average.
1404 Hence, not all packets are included in this average,
1405 which is as it should be since
1406 the purpose of this measurement is
1407 to find a good value for the retransmission timer.
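.pp
The sampling and smoothing steps above can be sketched as follows.
This is a simplified illustration, not the ARGO implementation: times are plain millisecond counts rather than \fIstruct timeval\fR, the names \fIrtt_state\fR, \fIsmooth()\fR, \fIrtt_on_first_send()\fR and \fIrtt_on_ack()\fR are hypothetical, and the value of 8 for the number of remembered packets is an assumption.
The sketch divides by a power of two instead of shifting, because right-shifting a negative value is implementation-defined in C, so results for negative differences may differ slightly from those of the SMOOTH macro.

```c
#include <assert.h>

#define RTT_ALPHA 3   /* same value as TP_RTT_ALPHA above             */
#define RTT_NUM   8   /* assumed; the text says only that N is fixed  */

struct rtt_state {
    long emit[RTT_NUM];  /* first-transmission time per slot, in ms   */
    long rtt;            /* smoothed average round trip time, in ms   */
};

/* Move the old average toward the sample by 1/2^alpha of the gap. */
long smooth(int alpha, long old, long sample)
{
    assert(alpha >= 0);
    return old + (sample - old) / (1L << alpha);
}

/* Record the time of a packet's initial transmission only. */
void rtt_on_first_send(struct rtt_state *st, unsigned seq, long now)
{
    st->emit[seq % RTT_NUM] = now;
}

/*
 * On arrival of an AK TPDU, take one sample: the age of the lowest
 * previously unacknowledged sequence number that this AK covers.
 */
long rtt_on_ack(struct rtt_state *st, unsigned lowest_acked, long now)
{
    long sample = now - st->emit[lowest_acked % RTT_NUM];

    st->rtt = smooth(RTT_ALPHA, st->rtt, sample);
    return st->rtt;
}
```

.pp
Starting from an average of 100 ms, a sample of 180 ms moves the average to 110 ms, since one eighth of the 80 ms difference is applied.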
1409 Each time part of a window is retransmitted,
1410 the retransmission timer for that window is increased.
1411 This does not affect the retransmission timers for other windows.
1412 .\"***********************************************************
1413 .sh 2 "Acknowledgment Strategies"
1415 The transport protocol specification
1416 requires acknowledgments to be sent immediately
1418 of CC TPDUs (in class 4), XPD TPDUs, and DT TPDUs containing an
1419 EOT marker, and at other times as required for flow control;
1420 otherwise acknowledgments may be delayed.
1421 In addition to the times when an acknowledgment is required,
1422 ARGO TP transmits an AK TPDU whenever the user receives some data,
1423 thereby increasing the size of the window.
1424 For those times when
1425 immediate acknowledgment is optional,
1426 ARGO TP offers two acknowledgment strategies:
1427 .ip " Acknowledge each TPDU" 10
1428 Upon receipt of a DT TPDU, an AK TPDU is sent.
1429 .ip " Acknowledge full window" 10
1430 Acknowledgment is issued
1431 upon receipt of enough data to
1432 consume the last advertised credit.
1435 requires a timer to trigger an acknowledgment
1436 in case the peer doesn't send the entire window
1438 This timer is called the
1439 \fIsendack timer\fR.
1440 The upper bound on the value of this timer
1441 is called the \fIlocal acknowledgment time\fR.
1442 The local acknowledgment time may be \*(lqadvertised\*(rq to the
1443 peer during connection establishment, and the
1444 peer may choose to use this value to
1445 adjust its retransmission timers.
1446 The ARGO TP entity advertises its local acknowledgment time
1447 on a CR TPDU, but it is not
1449 the remote acknowledgment time, should the peer
1452 ARGO TP adapts its sendack timer
1453 to the behavior of the connection.
1455 Under the assumption that the round trip time is
1460 the round trip time in the other direction,
1461 ARGO TP uses the measured average round trip time
1462 to adjust the sendack timer.
1464 The choice of strategies is made with the
1465 \fIsetsockopt()\fR system call.
1466 The default strategy is
1468 delay acknowledgments until the most recently advertised window is filled.
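.pp
The decision made on receipt of each DT TPDU under the two acknowledgment strategies can be sketched as follows.
The names \fIack_strategy\fR, \fIack_state\fR, \fIack_on_dt()\fR and \fIack_sent()\fR are hypothetical, and the sketch counts credit in whole TPDUs.
It covers only the EOT rule and the two optional strategies, and leaves out the sendack timer and the other cases in which an immediate acknowledgment is required.

```c
#include <assert.h>

/* Hypothetical names for the two strategies described above. */
enum ack_strategy { ACK_EACH_TPDU, ACK_FULL_WINDOW };

struct ack_state {
    enum ack_strategy strat;
    unsigned credit_left;  /* TPDUs still allowed by the last AK sent */
};

/* Call after sending an AK TPDU that advertises new_credit TPDUs. */
void ack_sent(struct ack_state *a, unsigned new_credit)
{
    a->credit_left = new_credit;
}

/*
 * Call on receipt of a DT TPDU.  Returns 1 if an AK TPDU should be
 * sent now, 0 if the acknowledgment may wait for the sendack timer.
 */
int ack_on_dt(struct ack_state *a, int has_eot)
{
    assert(a != 0);
    if (a->credit_left > 0)
        a->credit_left--;
    if (has_eot)
        return 1;                   /* EOT is acknowledged immediately */
    if (a->strat == ACK_EACH_TPDU)
        return 1;
    return a->credit_left == 0;     /* last advertised credit consumed */
}
```

.pp
With the full-window strategy and an advertised credit of three, the first two DT TPDUs are not acknowledged at once and the third is.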