.\" Copyright (C) Caldera International Inc. 2001-2002. All rights reserved.
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions are
.\" Redistributions of source code and documentation must retain the above
.\" copyright notice, this list of conditions and the following
.\" Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed or owned by Caldera
.\" International, Inc. Neither the name of Caldera International, Inc.
.\" nor the names of other contributors may be used to endorse or promote
.\" products derived from this software without specific prior written
.\" USE OF THE SOFTWARE PROVIDED FOR UNDER THIS LICENSE BY CALDERA
.\" INTERNATIONAL, INC. AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
.\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
.\" DISCLAIMED. IN NO EVENT SHALL CALDERA INTERNATIONAL, INC. BE LIABLE
.\" FOR ANY DIRECT, INDIRECT INCIDENTAL, SPECIAL, EXEMPLARY, OR
.\" CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
.\" SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
.\" BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
.\" WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
.\" OR OTHERWISE) RISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN
.\" IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
.\" @(#)iosys 8.1 (Berkeley) 6/8/93
.EH 'PSD:3-%''The UNIX I/O System'
.OH 'The UNIX I/O System''PSD:3-%'
AT&T Bell Laboratories
This paper gives an overview of the workings of the UNIX\(dg
.FS
\(dgUNIX is a Trademark of Bell Laboratories.
.FE
I/O system.
It was written with an eye toward providing
guidance to writers of device driver routines,
and is oriented more toward describing the environment
and nature of device drivers than the implementation
of that part of the file system which deals with
ordinary files.
It is assumed that the reader has a good knowledge
of the overall structure of the file system as discussed
in the paper ``The UNIX Time-sharing System.''
A more detailed discussion
appears in
``UNIX Implementation;''
the current document restates parts of that one,
but is still more detailed.
It is most useful in
conjunction with a copy of the system code,
since it is basically an exegesis of that code.
There are two classes of device:
block and character.
The block interface is suitable for devices
like disks, tapes, and DECtape
which work, or can work, with addressable 512-byte blocks.
Ordinary magnetic tape just barely fits in this category,
since by use of forward
and
backward spacing any block can be read, even though
blocks can be written only at the end of the tape.
Block devices can at least potentially contain a mounted
file system.
The interface to block devices is very highly structured;
the drivers for these devices share a great many routines
as well as a pool of buffers.
Character-type devices have a much
more straightforward interface, although
more work must be done by the driver itself.
Devices of both types are named by a
major and a minor device number.
These numbers are generally stored as an integer
with the minor device number
in the low-order 8 bits and the major device number
in the next-higher 8 bits;
macros
are available to access these numbers.
The major device number selects which driver will deal with
the device; the minor device number is not used
by the rest of the system but is passed to the
driver at appropriate times.
Typically the minor number
selects a subdevice attached to
a given controller, or one of
several similar hardware interfaces.
The major device numbers for block and character devices
are used as indices in separate tables;
they both start at 0 and therefore overlap.
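.PP
The packing of the two numbers can be sketched in C as follows.
This is only an illustration under the layout stated above
(minor number in the low-order 8 bits, major number in the next-higher 8 bits);
the names dev_major, dev_minor, dev_make and devnum_t are hypothetical
and are not taken from the system code.

```c
#include <stdint.h>

/* Hypothetical names; only the bit layout comes from the text:
 * minor number in the low-order 8 bits, major number in the
 * next-higher 8 bits. */
typedef uint16_t devnum_t;

/* Extract the major number, which selects the driver. */
static inline unsigned dev_major(devnum_t dev)
{
    return (dev >> 8) & 0377;
}

/* Extract the minor number, which is passed through to the driver. */
static inline unsigned dev_minor(devnum_t dev)
{
    return dev & 0377;
}

/* Combine a major and a minor number into one device number. */
static inline devnum_t dev_make(unsigned maj, unsigned min)
{
    return (devnum_t)(((maj & 0377) << 8) | (min & 0377));
}
```

Because block and character major numbers index separate tables,
the same packed value can name two unrelated devices;
the sketch makes no attempt to distinguish them.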
system calls is to set up entries in three separate
The first of these is the
which is stored in the system's per-process
This table is indexed by
the file descriptor returned by the
and is accessed during
or other operation on the open file.
An entry contains only
a pointer to the corresponding
which is a per-system data base.
There is one entry in the
This table is per-system because the same instance
of an open file must be shared among the several processes
which can result from
after the file is opened.
flags which indicate whether the file
was open for reading or writing or is a pipe, and
a count which is used to decide when all processes
using the entry have terminated or closed the file
(so the entry can be abandoned).
There is also a 32-bit file offset
which is used to indicate where in the file the next read
or write will take place.
Finally, there is a pointer to the
entry for the file in the
which contains a copy of the file's i-node.
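.PP
The relation between the per-process table and the per-system table
can be sketched as below.
This is a simplified model, not the system's declarations:
every structure, field and function name here is hypothetical,
and the fork-like sharing step is an assumption drawn from the
description above of why the table is per-system.

```c
#include <stddef.h>
#include <stdint.h>

#define NOFILE_SKETCH 4     /* hypothetical per-process descriptor limit */

struct inode_entry {        /* stands in for the in-core i-node copy */
    int i_count;
};

struct file_entry {         /* one per instance of an open */
    int      f_flag;        /* reading, writing, or pipe */
    int      f_count;       /* descriptors still referring to this entry */
    int32_t  f_offset;      /* where the next read or write takes place */
    struct inode_entry *f_inode;
};

struct proc_sketch {        /* per-process table, indexed by descriptor */
    struct file_entry *ofile[NOFILE_SKETCH];
};

/* Model of what a fork does to open files: the child refers to the
 * SAME file entries, so the offset is shared. */
static void share_open_files(const struct proc_sketch *parent,
                             struct proc_sketch *child)
{
    for (int fd = 0; fd < NOFILE_SKETCH; fd++) {
        child->ofile[fd] = parent->ofile[fd];
        if (child->ofile[fd] != NULL)
            child->ofile[fd]->f_count++;
    }
}

/* Drop one reference; returns 1 when the entry can be abandoned,
 * 0 when other users remain, -1 if the descriptor was not open. */
static int close_fd(struct proc_sketch *p, int fd)
{
    struct file_entry *fp = p->ofile[fd];
    if (fp == NULL)
        return -1;
    p->ofile[fd] = NULL;
    return --fp->f_count == 0;
}
```

Since both processes point at one entry, a read in either of them
advances the offset seen by the other.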
Certain open files can be designated ``multiplexed''
files, and several other flags apply to such
In such a case, instead of an offset,
there is a pointer to an associated multiplex channel table.
Multiplex channels will not be discussed here.
table corresponds precisely to an instance of
if the same file is opened several times,
entries in this table.
there is at most one entry
table for a given file.
Also, a file may enter the
table not only because it is open,
but also because it is the current directory
of some process or because it
is a special file containing a currently-mounted
table differs somewhat from the
corresponding i-node as stored on the disk;
the modified and accessed times are not stored,
and the entry is augmented
by a flag word containing information about the entry,
a count used to determine when it may be
allowed to disappear,
and the device and i-number
whence the entry came.
Also, the several block numbers that give addressing
information for the file are expanded from
the 3-byte, compressed format used on the disk to full
During the processing of an
call for a special file,
the system always calls the device's
routine to allow for any special processing
required (rewinding a tape, turning on
the data-terminal-ready lead of a modem, etc.).
routine is called only when the last
process closes a file,
that is, when the i-node table entry
is being deallocated.
Thus it is not feasible
for a device to maintain, or depend on,
a count of its users, although it is quite
possible to
implement an exclusive-use device which cannot
be reopened until it has been closed.
table entry are used to set up the
which respectively contain the (user) address
of the I/O target area, the byte-count for the transfer,
and the current location in the file.
If the file referred to is
a character-type special file, the appropriate read
or write routine is called; it is responsible
for transferring data and updating the
count and current location appropriately
Otherwise, the current location is used to calculate
a logical block number in the file.
If the file is an ordinary file the logical block
number must be mapped (possibly using indirect blocks)
to a physical block number; a block-type
special file need not be mapped.
This mapping is performed by the
In any event, the resulting physical block number
is used, as discussed below, to
read or write the appropriate device.
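.PP
The first step, turning the current location into a logical block number,
is plain arithmetic and can be sketched as follows.
The 512-byte block size is the one given earlier for block devices;
the names blkpos, locate and BSIZE_SKETCH are hypothetical,
and the mapping through indirect blocks is deliberately left out.

```c
#include <stdint.h>

#define BSIZE_SKETCH 512u   /* block size assumed from the text */

struct blkpos {
    uint32_t lbn;           /* logical block number within the file */
    unsigned off;           /* byte offset inside that block */
    unsigned n;             /* bytes that can be moved in this block */
};

/* Split a file offset and a byte count into the piece that falls
 * inside a single block; the caller repeats until count is used up. */
static struct blkpos locate(uint32_t offset, unsigned count)
{
    struct blkpos p;
    unsigned room;

    p.lbn = offset / BSIZE_SKETCH;
    p.off = offset % BSIZE_SKETCH;
    room  = BSIZE_SKETCH - p.off;
    p.n   = count < room ? count : room;
    return p;
}
```

For a block-type special file the logical block number would be used directly;
for an ordinary file it would still have to be mapped to a physical block.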
.SH
Character Device Drivers
.PP
table specifies the interface routines present for
Each device provides five routines:
open, close, read, write, and special-function
Any of these may be missing.
If a call on the routine
on non-exclusive devices that require no setup)
entry can be given as
if it should be considered an error,
on read-only devices)
structure also contains a pointer to the
structure associated with the terminal.
routine is called each time the file
is opened with the full device number as argument.
The second argument is a flag which is
non-zero only if the device is to be written upon.
routine is called only when the file
is closed for the last time,
that is, when the very last process in
which the file is open closes it.
This means it is not possible for the driver to
maintain its own count of its users.
The first argument is the device number;
the second is a flag which is non-zero
if the file was open for writing in the process which
is called, it is supplied the device
The per-user variable
the number of characters indicated by the user;
for character devices, this number may be 0
is the address supplied by the user from which to start
The system may call the
routine internally, so the
is supplied that indicates,
refers to the system address space instead of
characters from the user's buffer to the device,
for each character passed.
For most drivers, which work one character at a time,
is used to pick up characters
from the user's buffer.
Successive calls on it return
the characters to be written until
goes to 0 or an error occurs,
when it returns \(mi1.
takes care of interrogating
Write routines which want to transfer
a probably large number of characters into an internal
buffer may also use the routine
.I "iomove(buffer, offset, count, flag)"
which is faster when many characters must be moved.
bytes from the start of the buffer;
(which is 0) in the write case.
the caller is responsible for making sure
the count is not too large and is non-zero.
As an efficiency note,
is much slower if any of
.I "buffer+offset, count"
routine is called under conditions similar to
is guaranteed to be non-zero.
To return characters to the user, the routine
is available; it takes care of housekeeping
and returns \(mi1 as the last character
is returned to the user;
before that time, 0 is returned.
is also usable as with
but the same cautions apply.
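.PP
The return conventions of the two character-at-a-time routines
can be modelled in user space as below.
This is a sketch of the conventions only:
the structure xfer and the names take_char and give_char are hypothetical
stand-ins, and no copying across address spaces is attempted.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-user transfer state:
 * an address in the user's buffer and a count of characters left. */
struct xfer {
    char  *base;
    size_t count;
};

/* Write case: hand the driver the next character from the user's
 * buffer, or -1 once the count has gone to 0. */
static int take_char(struct xfer *x)
{
    if (x->count == 0)
        return -1;
    x->count--;
    return (unsigned char)*x->base++;
}

/* Read case: deliver one character to the user.  Returns 0 while more
 * characters are wanted and -1 as the last requested one is stored. */
static int give_char(struct xfer *x, int c)
{
    if (x->count == 0)
        return -1;
    *x->base++ = (char)c;
    return --x->count == 0 ? -1 : 0;
}
```

A driver's write routine would typically loop on the first routine
until it sees the negative return,
and a read routine would stop supplying characters
as soon as the second one returns a negative value.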
The ``special-functions'' routine
system calls as follows:
is a pointer to the device's routine,
is the device number,
the device is supposed to place up to 3 words of status information
into the vector; this will be returned to the caller.
the device should take up to 3 words of
control information from
Finally, each device should have appropriate interrupt-time
routines.
When an interrupt occurs, it is turned into a C-compatible call
on the device's interrupt routine.
The interrupt-catching mechanism makes
the low-order four bits of the ``new PS'' word in the
trap vector for the interrupt available
to the interrupt handler.
This is conventionally used by drivers
which deal with multiple similar devices
to encode the minor device number.
After the interrupt has been processed,
a return from the interrupt handler will
return from the interrupt itself.
A number of subroutines are available which are useful
to character device drivers.
Most of these handlers, for example, need a place
to buffer characters in the internal interface
between their ``top half'' (read/write)
and ``bottom half'' (interrupt) routines.
For relatively low data-rate devices, the best mechanism
is the character queue maintained by the
A queue header has the structure
.DS
struct {
	int	c_cc;	/* character count */
	char	*c_cf;	/* first character */
	char	*c_cl;	/* last character */
} queue;
.DE
A character is placed on the end of a queue by
The routine returns \(mi1 if there is no space
to put the character, 0 otherwise.
The first character on the queue may be retrieved
which returns either the (non-negative) character
or \(mi1 if the queue is empty.
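.PP
The return conventions just described can be illustrated with a small
fixed-size queue.
This is not the system's implementation:
it uses a private ring of slots rather than pointers into shared storage,
and the names cqueue, q_put, q_get and QSLOTS are hypothetical.
Only the field name c_cc and the return values follow the text.

```c
#define QSLOTS 8            /* hypothetical capacity of this one queue */

struct cqueue {
    int  c_cc;              /* character count, as in the queue header */
    int  head;              /* index of the first character */
    char slot[QSLOTS];      /* private storage for the sketch */
};

/* Place a character on the end of the queue.
 * Returns -1 if there is no space, 0 otherwise. */
static int q_put(int c, struct cqueue *q)
{
    if (q->c_cc == QSLOTS)
        return -1;
    q->slot[(q->head + q->c_cc) % QSLOTS] = (char)c;
    q->c_cc++;
    return 0;
}

/* Retrieve the first character on the queue.
 * Returns the (non-negative) character, or -1 if the queue is empty. */
static int q_get(struct cqueue *q)
{
    int c;

    if (q->c_cc == 0)
        return -1;
    c = (unsigned char)q->slot[q->head];
    q->head = (q->head + 1) % QSLOTS;
    q->c_cc--;
    return c;
}
```

A top-half routine would call the first function and give up or wait
when it reports no space; an interrupt routine would drain the queue
with the second.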
Notice that the space for characters in queues is
shared among all devices in the system
and in the standard system there are only some 600
character slots available.
Thus device handlers,
especially write routines, must take
care to avoid gobbling up excessive numbers of characters.
The other major help available
to device handlers is the sleep-wakeup mechanism.
.I "sleep(event, priority)"
causes the process to wait (allowing other processes to run)
at that time, the process is marked ready-to-run
and the call will return when there is no
has happened, that is, causes processes sleeping
on the event to be awakened.
is an arbitrary quantity agreed upon
by the sleeper and the waker-up.
By convention, it is the address of some data area used
by the driver, which guarantees that events
are unique.
Processes sleeping on an event should not assume
that the event has really happened;
they should check that the conditions which
caused them to sleep no longer hold.
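.PP
The advice to recheck the condition amounts to sleeping inside a loop.
The sketch below simulates this in user space:
sim_sleep is a hypothetical stand-in that merely counts wakeups
and makes the condition true on the third one,
so the first two returns model wakeups for which the event had not
really happened.
The priority value 10 is arbitrary.

```c
/* State shared between the simulated sleeper and waker-up. */
static int data_ready;      /* the condition being waited for */
static int wakeups;         /* how many times the sleeper was awakened */

/* Hypothetical stand-in for sleep(event, priority): returns on every
 * simulated wakeup, but only the third one is "real". */
static void sim_sleep(void *event, int priority)
{
    (void)event;
    (void)priority;
    wakeups++;
    if (wakeups == 3)
        data_ready = 1;
}

/* Wait until data_ready holds.  The condition is tested again after
 * every return from sim_sleep; a return alone proves nothing.
 * Returns the number of times the caller had to sleep. */
static int wait_for_data(void)
{
    int slept = 0;

    while (!data_ready) {
        sim_sleep(&data_ready, 10);  /* event is the address of the data */
        slept++;
    }
    return slept;
}
```

Using the address of the awaited data as the event follows the
convention described above.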
Priorities can range from 0 to 127;
a higher numerical value indicates a less-favored
scheduling situation.
A distinction is made between processes sleeping
at priority less than the parameter
.I PZERO
and those at numerically larger priorities.
The former cannot
be interrupted by signals, although it
is conceivable that it may be swapped out.
Thus it is a bad idea to sleep with
priority less than PZERO on an event which might never occur.
On the other hand, calls to
.I sleep
with larger priority
may never return if the process is terminated by
some signal in the meantime.
Incidentally, it is a gross error to call
.I sleep
in a routine called at interrupt time, since the process
which is running is almost certainly not the
process which should go to sleep.
Likewise, none of the variables in the user area
should be touched, let alone changed, by an interrupt routine.
If a device driver
wishes to wait for some event for which it is inconvenient
or impossible to supply a
.I wakeup
(for example, a device going on-line, which does not
generally cause an interrupt),
the call
.I "sleep(&lbolt, priority)"
may be given.
.I Lbolt
is an external cell whose address is awakened once every 4 seconds
by the clock interrupt routine.
The routines
.I "spl4( ), spl5( ), spl6( ), spl7( )"
are available to
set the processor priority level as indicated to avoid
inconvenient interrupts from the device.
If a device needs to know about real-time intervals,
then
.I "timeout(func, arg, interval)"
will be useful.
This routine arranges that after
.I interval
sixtieths of a second, the
.I func
will be called with
.I arg
as argument, in the style
.I "(*func)(arg)."
Timeouts are used, for example,
to provide real-time delays after function characters
like new-line and tab in typewriter output,
and to terminate an attempt to
read the 201 Dataphone
if there is no response within a specified number
of seconds.
Notice that the number of sixtieths of a second is limited to 32767,
since it must appear to be positive,
and that only a bounded number of timeouts
can be going on at once.
Also, the specified
.I func
is called at clock-interrupt time, so it should
conform to the requirements of interrupt routines
in general.
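.PP
One way to picture the mechanism is a small fixed table of pending calls
that is scanned on every clock tick.
The sketch below is an assumption about how such a table could look,
not a description of the system's code:
the names callout, calltab, sketch_timeout and sketch_clock_tick are
hypothetical, the table size is invented,
and the return value of sketch_timeout exists only so the limits can be observed.

```c
#define NCALL_SKETCH 4      /* hypothetical bound on pending timeouts */
#define MAX_TICKS    32767  /* largest interval, as stated in the text */

struct callout {
    int   ticks;            /* sixtieths of a second left; 0 = free slot */
    void (*func)(int);      /* routine to call */
    int   arg;              /* its argument */
};

static struct callout calltab[NCALL_SKETCH];

/* Arrange for (*func)(arg) to be called after `interval` ticks.
 * Returns 0 on success, -1 if the interval is out of range or the
 * table is full. */
static int sketch_timeout(void (*func)(int), int arg, int interval)
{
    if (interval <= 0 || interval > MAX_TICKS)
        return -1;
    for (int i = 0; i < NCALL_SKETCH; i++) {
        if (calltab[i].ticks == 0) {
            calltab[i].ticks = interval;
            calltab[i].func = func;
            calltab[i].arg = arg;
            return 0;
        }
    }
    return -1;
}

/* Called once per clock tick: count every pending entry down and
 * call those that reach zero, freeing their slots. */
static void sketch_clock_tick(void)
{
    for (int i = 0; i < NCALL_SKETCH; i++) {
        if (calltab[i].ticks > 0 && --calltab[i].ticks == 0)
            (*calltab[i].func)(calltab[i].arg);
    }
}

/* Sample callback used to observe the call. */
static int last_arg;
static void record_arg(int arg) { last_arg = arg; }
```

Since the called routine runs from the tick handler,
it must obey the same restrictions as any interrupt routine.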
.SH
The Block-device Interface
.PP
Handling of block devices is mediated by a collection
of routines that manage a set of buffers containing
the images of blocks of data on the various devices.
The most important purpose of these routines is to assure
that several processes that access the same block of the same
device in multiprogrammed fashion maintain a consistent
view of the data in the block.
A secondary but still important purpose is to increase
the efficiency of the system by
keeping in-core copies of blocks that are being
accessed frequently.
The main data base for this mechanism is the
table of buffers.
Each buffer header contains a pair of pointers
.I "(b_forw, b_back)"
which maintain a doubly-linked list
of the buffers associated with a particular
block device, and a pair of pointers
.I "(av_forw, av_back)"
which generally maintain a doubly-linked list of blocks
which are ``free,'' that is,
eligible to be reallocated for another transaction.
Buffers that have I/O in progress
or are busy for other purposes do not appear in this list.
The buffer header
also contains the device and block number to which the
buffer refers, and a pointer to the actual storage associated with
the buffer.
There is a word count
which is the negative of the number of words
to be transferred to or from the buffer;
there is also an error byte and a residual word
count used to communicate information
from an I/O routine to its caller.
Finally, there is a flag word
with bits indicating the status of the buffer.
These flags will be discussed below.
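.PP
A much-abridged model of the buffer header and its free list may help.
Only the four link names b_forw, b_back, av_forw and av_back come from the text;
every other name in the sketch is hypothetical,
and the assumption that released buffers go to the tail of a circular list
is made for illustration.

```c
#include <stddef.h>

/* Abridged buffer header.  Fields other than the four links are
 * hypothetical names for the quantities the text describes. */
struct buf_sketch {
    struct buf_sketch *b_forw, *b_back;    /* buffers of one device */
    struct buf_sketch *av_forw, *av_back;  /* free-list links */
    int  dev;                              /* device referred to */
    long blkno;                            /* block number on it */
    int  wcount;                           /* NEGATIVE count of words */
};

/* An empty free list is a header whose links point at itself. */
static void freelist_init(struct buf_sketch *head)
{
    head->av_forw = head->av_back = head;
}

/* Release a buffer: append it at the tail of the free list. */
static void freelist_release(struct buf_sketch *head, struct buf_sketch *bp)
{
    bp->av_back = head->av_back;
    bp->av_forw = head;
    head->av_back->av_forw = bp;
    head->av_back = bp;
}

/* Make a buffer busy: unlink it so that it cannot be reallocated. */
static void freelist_take(struct buf_sketch *bp)
{
    bp->av_back->av_forw = bp->av_forw;
    bp->av_forw->av_back = bp->av_back;
    bp->av_forw = bp->av_back = NULL;
}

/* Record a transfer length; the header keeps its negative. */
static void set_word_count(struct buf_sketch *bp, int nwords)
{
    bp->wcount = -nwords;
}
```

A busy buffer stays on its device chain through b_forw and b_back;
only the free-list links are disturbed.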
Seven routines constitute
the most important part of the interface with the
Given a device and block number,
return a pointer to a buffer header for the block;
the difference is that
is guaranteed to return a buffer actually containing the
current data for the block,
returns a buffer which contains the data in the
block only if it is already in core (whether it is
or not is indicated by the
In either case the buffer, and the corresponding
device block, is made ``busy,''
so that other processes referring to it
are obliged to wait until it becomes free.
is used, for example,
when a block is about to be totally rewritten,
so that its previous contents are
not useful;
still, no other process can be allowed to refer to the block
until the new data is placed into it.
routine is used to implement read-ahead.
It is logically similar to
but takes as an additional argument the number of
a block (on the same device) to be read asynchronously
after the specifically requested block is available.
Given a pointer to a buffer,
makes the buffer again available to other processes.
It is called, for example, after
data has been extracted following a
There are three subtly-different write routines,
all of which take a buffer pointer as argument,
and all of which logically release the buffer for
use by others and place it on the free list.
buffer on the appropriate device queue,
waits for the write to be done,
and sets the user's error flag if required.
places the buffer on the device's queue, but does not wait
for completion, so that errors cannot be reflected directly to
the user.
does not start any I/O operation at all,
but merely marks
the buffer so that if it happens
to be grabbed from the free list to contain
data from some other block, the data in it will
first be written out.
is used when one wants to be sure that
I/O takes place correctly, and that
errors are reflected to the proper user;
it is used, for example, when updating i-nodes.
is useful when more overlap is desired
(because no wait is required for I/O to finish)
but when it is reasonably certain that the
write is really required.
is used when there is doubt that the write is
needed at the moment.
is called when the last byte of a
system call falls short of the end of a
block, on the assumption that
will be given soon which will re-use the same block.
as the end of a block is passed,
is called, since probably the block will
not be accessed again soon and one might as
well start the writing process as soon as possible.
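.PP
The choice between starting the write at once and merely marking the
buffer can be reduced to a test on where the transfer ends.
The following sketch assumes the 512-byte block size given earlier;
the enumeration and the function name are hypothetical,
and the third, synchronous kind of write is not modelled.

```c
#define BLK_SKETCH 512ul    /* block size assumed from the text */

enum write_kind {
    WRITE_ASYNC,            /* start I/O now, do not wait */
    WRITE_DELAYED           /* only mark the buffer; write it later */
};

/* Decide how to dispose of the buffer holding the last byte written.
 * If the transfer stops short of the end of a block, another write
 * to the same block is likely, so the write is delayed; if the end
 * of the block has been reached, the write is started. */
static enum write_kind choose_write(unsigned long offset, unsigned long count)
{
    unsigned long end = offset + count;

    return (end % BLK_SKETCH == 0) ? WRITE_ASYNC : WRITE_DELAYED;
}
```

A program writing 100 bytes at a time thus causes one device write
per block rather than one per call.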
In any event, notice that the routines
dedicate the given block exclusively to the
use of the caller, and make others wait,
.I "brelse, bwrite, bawrite,"
must eventually be called to free the block for use by others.
As mentioned, each buffer header contains a flag
word which indicates the status of the buffer.
Since they provide
one important channel for information between the drivers and the
block I/O system, it is important to understand these flags.
The following names are manifest constants which
select the associated flag bits.
This bit is set when the buffer is handed to the device strategy routine
(see below) to indicate a read operation.
is defined as 0 and does not define a flag; it is provided
as a mnemonic convenience to callers of routines like
which have a separate argument
which indicates read or write.
to 0 when a block is handed to the device strategy
routine and is turned on when the operation completes,
whether normally or as the result of an error.
It is also used as part of the return argument of
to indicate, if 1, that the returned
buffer actually contains the data in the requested block.
This bit may be set to 1 when
is set to indicate that an I/O or other error occurred.
byte of the buffer header may contain an error code
is 0 the nature of the error is not specified.
Actually no driver at present sets
the latter is provided for a future improvement
whereby a more detailed error-reporting
scheme may be implemented.
This bit indicates that the buffer header is not on
the free list, i.e. is
dedicated to someone's exclusive use.
The buffer still remains attached to the list of
blocks associated with its device, however.
which calls it) searches the buffer list
for a given device and finds the requested
block with this bit on, it sleeps until the bit
This bit is set for raw I/O transactions that
need to allocate the Unibus map on an 11/70.
This bit is set on buffers that have the Unibus map allocated,
routine knows to deallocate the map.
This flag is used in conjunction with the
Before sleeping as described
Conversely, when the block is freed and the busy bit
is given for the block header whenever
This stratagem avoids the overhead
every time a buffer is freed on the chance that someone
This bit may be set on buffers just before releasing them; if it
is on,
the buffer is placed at the head of the free list, rather than at the
tail.
It is a performance heuristic
used when the caller judges that the same block will not soon be used again.
to indicate to the appropriate device driver
that the buffer should be released when the
write has been finished, usually at interrupt time.
The difference between
is that the former starts I/O, waits until it is done, and
The latter merely sets this bit and starts I/O.
The bit indicates that
should be called for the buffer on completion.
before releasing the buffer.
while searching for a free block,
discovers the bit is 1 in a buffer it would otherwise grab,
it causes the block to be written out before reusing it.
table contains the names of the interface routines
and that of a table for each block device.
Just as for character devices, block device drivers may supply
called respectively on each open and on the final close
Instead of separate read and write routines,
each block device driver has a
routine which is called with a pointer to a buffer
As discussed, the buffer header contains
a read/write flag, the core address,
the block number, a (negative) word count,
and the major and minor device number.
The role of the strategy routine
is to carry out the operation as requested by the
information in the buffer header.
When the transaction is complete the
In cases where the device
is capable, under error-free operation,
of transferring fewer words than requested,
the device's word-count register should be placed
in the residual count slot of
otherwise, the residual count should be set to 0.
This particular mechanism is really for the benefit
of the magtape driver;
when reading this device,
records shorter than requested are quite normal,
and the user should be told the actual length of the record.
Although the most usual argument
to the strategy routines
is a genuine buffer header allocated as discussed above,
all that is actually required
is that the argument be a pointer to a place containing the
appropriate information.
routine, which manages movement
of core images to and from the swapping device,
uses the strategy routine
Care has to be taken that
no extraneous bits get turned on in the
The device's table specified by
byte to contain an active flag and an error count,
a pair of links which constitute the
head of the chain of buffers for the device
.I "(b_forw, b_back),"
and a first and last pointer for a device queue.
Of these things, all are used solely by the device driver
except for the buffer-chain pointers.
Typically the flag encodes the state of the
device, and is used at a minimum to
indicate that the device is currently engaged in
transferring information and no new command should be issued.
The error count is useful for counting retries
when errors occur.
The device queue is used to remember stacked requests;
in the simplest case it may be maintained as a first-in
first-out list.
Since buffers which have been handed over to
the strategy routines are never
on the list of free buffers,
the pointers in the buffer which maintain the free list
.I "(av_forw, av_back)"
are also used to contain the pointers
which maintain the device queues.
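.PP
In the simplest first-in first-out case,
the device queue can be kept with a first and a last pointer
and threaded through the free-list link of each buffer.
The sketch below shows only that bookkeeping;
apart from av_forw, the structure and function names are hypothetical,
and a real driver would do this with interrupts held off.

```c
#include <stddef.h>

struct qbuf {
    struct qbuf *av_forw;   /* reused as the device-queue link */
    int blkno;              /* block the request refers to */
};

struct devtab_sketch {
    struct qbuf *first;     /* oldest stacked request */
    struct qbuf *last;      /* newest stacked request */
};

/* Strategy side: remember a request at the end of the queue. */
static void dq_append(struct devtab_sketch *dp, struct qbuf *bp)
{
    bp->av_forw = NULL;
    if (dp->first == NULL)
        dp->first = bp;
    else
        dp->last->av_forw = bp;
    dp->last = bp;
}

/* Interrupt side: remove and return the oldest request,
 * or NULL if nothing is queued. */
static struct qbuf *dq_next(struct devtab_sketch *dp)
{
    struct qbuf *bp = dp->first;

    if (bp != NULL) {
        dp->first = bp->av_forw;
        if (dp->first == NULL)
            dp->last = NULL;
    }
    return bp;
}
```

Reusing the link is safe only because a buffer handed to a strategy
routine is never on the free list at the same time.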
A couple of routines
are provided which are useful to block device drivers.
arranges that the buffer to which
points be released or awakened,
strategy module has finished with the buffer,
either normally or after an error.
(In the latter case the
bit has presumably been set.)
can be used to examine the error bit in a buffer header
and arrange that any error indication found therein is
reflected to the user.
It may be called only in the non-interrupt
part of a driver when I/O has completed
.SH
Raw Block-device I/O
.PP
A scheme has been set up whereby block device drivers may
provide the ability to transfer information
directly between the user's core image and the device
without the use of buffers and in blocks as large as
the caller requests.
The method involves setting up a character-type special file
corresponding to the raw device
routines which set up what is usually a private,
non-shared buffer header with the appropriate information
and call the device's strategy routine.
If desired, separate
routines may be provided but this is usually unnecessary.
A special-function routine might come in handy, especially for
A great deal of work has to be done to generate the
``appropriate information''
to put in the argument buffer for
the strategy module;
the worst part is to map relocated user addresses to physical addresses.
Most of this work is done by
.I "physio(strat, bp, dev, rw)"
whose arguments are the name of the
and a read-write flag
whose value is either
makes sure that the user's base address and count are
even (because most devices work in words)
and that the core area affected is contiguous
in physical space;
it delays until the buffer is not busy, and makes it
busy while the operation is in progress;
and it sets up user error return information.
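.PP
The first of these checks is simple enough to show directly.
The function below is a hypothetical helper, not part of the system;
it tests only that the base address and the count are even,
and leaves out the contiguity check and the busy handling.

```c
#include <stdint.h>

/* Returns 1 if a raw transfer may proceed as far as alignment is
 * concerned: both the user's base address and the byte count must be
 * even, because the device moves whole words.  Returns 0 otherwise. */
static int raw_args_even(uintptr_t base, unsigned count)
{
    return (base & 1) == 0 && (count & 1) == 0;
}
```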