minix/usr.bin/trace/NOTES

   1 Developer notes regarding trace(1), by David van Moolenbroek.
   2
   3
   4 OVERALL CODE STRUCTURE
   5
   6 The general tracing engine is in trace.c.  It passes IPC-level system call
   7 enter and leave events off to call.c, which handles IPC-level system call
   8 printing and passes off system calls to be interpreted by a service-specific
   9 system call handler whenever possible.  All the service-specific code is in the
  10 service/ subdirectory, grouped by destination service.  IOCTLs are a special
  11 case, which are handled in ioctl.c and passed on to driver-type-grouped IOCTL
  12 handlers in the ioctl/ subdirectory (this grouping is not strict).  Some of the
  13 generated output goes through the formatting code in format.c, and all of it
  14 ends up in output.c.  The remaining source files contain support code.
  15
  16
  17 ADDING A SYSTEM CALL HANDLER
  18
  19 In principle, every system call stops the traced process twice: once when the
  20 system call is started (the call-enter event) and once when the system call
  21 returns (the call-leave event).  The tracer uses the call-enter event to print
  22 the request being made, and the call-leave event to print the result of the
  23 call.  The output format is supposed to mimic largely what the system call
  24 looks like from a C program, although with additional information where that
  25 makes sense.  The general output format for system calls is:
  26
  27   name(parameters) = result
  28
  29 ..where "name" is the name of the system call, "parameters" is a list of system
  30 call parameters, and "result" is the result of the system call.  If possible,
  31 the part up to and including the equals sign is printed from the call-enter
  32 event, and the result is printed from the call-leave event.  However, many
  33 system calls actually pass a pointer to a block of memory that is filled with
  34 meaningful content as part of the system call.  For that reason, it is also
  35 possible that the call-enter event stops printing somewhere inside the
  36 parameters block, and the call-leave event prints the rest of the parameters,
  37 as well as the equals sign and the result after it.  The place in the printed
  38 system call where the call-enter printer stops and the call-leave printer is
  39 supposed to pick up again, is referred to as the "call split".
  40
  41 The tracer has a handler structure for every system call that can be made by
  42 a user program to any of the MINIX3 services.  This handler structure
  43 provides three elements: the name of the system call, an "out" function that
  44 handles printing of the call-enter part of the system call, and an "in"
  45 function that handles printing of the call-leave part of the system call.  The
  46 "out" function is expected to print zero or more call parameters, and then
  47 return a call type, which indicates whether all parameters have been printed
  48 yet, or not.  In fact, there are three call types, shown here with an example
  49 which has a "|" pipe symbol added to indicate the call split:
  50
  51   CT_DONE:       write(5, "foo", 3) = |3
  52   CT_NOTDONE:    read(5, |"foo", 1024) = 3
  53   CT_NORETURN:   execve("foo", ["foo"], [])| = -1 [ENOENT]
  54
  55 The CT_DONE call type indicates that the handler is done printing all the
  56 parameters during the call-enter event, and the call split will be after the
  57 equals sign.  The CT_NOTDONE call type indicates that the handler is not done
  58 printing all parameters yet, thus yielding a call split in the middle of the
  59 parameters block (or even right after the opening parenthesis).  The no-return
  60 (CT_NORETURN) call type is used for a small number of functions that do not
  61 return on success.  Currently, these are the exit(), execve(), and sigreturn()
  62 system calls.  For these calls, no result will be printed at all, unless such
  63 a call fails, in which case a failure result is printed after all.  The call
  64 split is such that the entire parameters block is printed upon entering the
  65 call, but the equals sign and result are printed only if the call does return.
  66
  67 Now more about the handler structure for the system call.  First of all, each
  68 system call has a name, which must be a static string.  It may be supplied
  69 either as a string, or as a function that returns a name string.  The latter is
  70 for cases where one message-level system call is used to implement multiple
  71 C-level system calls (such as setitimer() and getitimer() both going through
  72 PM_ITIMER).  The name function has the following prototype:
  73
  74   const char *svc_syscall_name(const message *m_out);
  75
  76 ..where "m_out" is a local copy of the request message, which the name function
  77 can use to decide what string to return for the system call.  As a sidenote,
  78 in the future, the system call name will be used to implement call filtering.
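
As an illustration only, a name function for a message-level call that backs
two C-level calls could look like the sketch below.  The "message" type here
is a stand-in, and its "new_value" field is a hypothetical way to tell the two
calls apart; the real message layout and the real test are different.

```c
#include <stddef.h>

/* Stand-in for the MINIX message type; the real layout differs. */
typedef struct {
	int m_type;
	const void *new_value;	/* hypothetical: non-NULL when setting a timer */
} message;

/*
 * Name function for one message-level call that backs two C-level calls.
 * Both returned names are string literals, so they have static storage.
 */
const char *
pm_itimer_name(const message *m_out)
{
	return (m_out->new_value != NULL) ? "setitimer" : "getitimer";
}
```

The function only inspects the request message, which is all the information
available at the time the call name has to be printed.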
  79
  80 An "out" printer function has the following prototype:
  81
  82   int svc_syscall_out(struct trace_proc *proc, const message *m_out);
  83
  84 Here, "proc" is a pointer to the process structure containing information about
  85 the process making the system call; proc->pid returns the process PID, but the
  86 function should not access any other fields of this structure directly.
  87 Instead, many of the output primitive and helper functions (which are all
  88 prefixed with "put_") take this pointer as part of the call.  "m_out" is a
  89 local copy of the request message, and the printer may access its fields as it
  90 sees fit.
  91
  92 The printer function should simply print parameters.  The call name and the
  93 opening parenthesis are printed by the main output routine.
  94
  95 All simple call parameters should be printed using the put_field() and
  96 put_value() functions.  The former prints a parameter or field name as flat
  97 text; the latter is a printf-like interface to the former.  By default, call
  98 parameters are simply printed as "value", but if printing all names is enabled,
  99 call parameters are printed as "name=value".  Thus, all parameters should be
 100 given a name, even if this name does not show up by default.  Either way, these
 101 two functions take care of deciding whether to print the name, as well as of
 102 printing separators between the parameters.  More about printing more complex
 103 parameters (such as structures) in a bit.
 104
 105 The out printer function must return one of the three CT_ call type values.  If
 106 it returns CT_DONE, the main output routine will immediately print the closing
 107 parenthesis and equals sign.  If it returns CT_NORETURN, a closing parenthesis
 108 will be printed.  If it returns CT_NOTDONE, only a parameter field separator
 109 (that is, a comma and a space) will be printed--after all, it can be assumed
 110 that more parameters will be printed later.
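
The interplay between an "out" printer and the main output routine can be
sketched as follows.  This is a simplified model, not the tracer's code: the
types, the emit() helper, the call_enter() routine, the three example printers
and the numeric CT_ values are all assumptions, and this put_value() never
prints names.  The model also assumes that no separator is needed when a
CT_NOTDONE printer has printed nothing yet.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

enum { CT_DONE, CT_NOTDONE, CT_NORETURN };	/* numeric values assumed */

typedef struct { int m_type; long arg; } message;	/* stand-in type */

struct trace_proc {		/* stand-in: just an output buffer */
	char out[128];
	int nfields;		/* parameters printed so far */
};

/* Append plain text to the output buffer, truncating if it is full. */
void
emit(struct trace_proc *proc, const char *text)
{
	size_t len = strlen(proc->out);

	snprintf(proc->out + len, sizeof(proc->out) - len, "%s", text);
}

/* Simplified put_value(): prints the bare value, never "name=value". */
void
put_value(struct trace_proc *proc, const char *name, const char *fmt, ...)
{
	va_list ap;
	size_t len;

	(void)name;
	if (proc->nfields++ > 0)
		emit(proc, ", ");
	len = strlen(proc->out);
	va_start(ap, fmt);
	vsnprintf(proc->out + len, sizeof(proc->out) - len, fmt, ap);
	va_end(ap);
}

/* Three hypothetical "out" printers, one per call type. */
int
close_out(struct trace_proc *proc, const message *m_out)
{
	put_value(proc, "fd", "%ld", m_out->arg);
	return CT_DONE;		/* all parameters printed */
}

int
read_out(struct trace_proc *proc, const message *m_out)
{
	put_value(proc, "fd", "%ld", m_out->arg);
	return CT_NOTDONE;	/* buffer and length follow at call-leave */
}

int
exit_out(struct trace_proc *proc, const message *m_out)
{
	put_value(proc, "status", "%ld", m_out->arg);
	return CT_NORETURN;	/* no result unless the call fails */
}

/* What the main output routine does around an "out" printer. */
void
call_enter(struct trace_proc *proc, const char *name,
	int (*out)(struct trace_proc *, const message *), const message *m_out)
{
	proc->out[0] = '\0';
	proc->nfields = 0;
	emit(proc, name);
	emit(proc, "(");
	switch (out(proc, m_out)) {
	case CT_DONE:
		emit(proc, ") = ");
		break;
	case CT_NORETURN:
		emit(proc, ")");
		break;
	default:	/* CT_NOTDONE: more parameters follow at call-leave */
		if (proc->nfields > 0)
			emit(proc, ", ");
		break;
	}
}
```

With a request whose "arg" is 5, the three printers leave "close(5) = ",
"read(5, " and "exit(5)" in the buffer, which is exactly the text before the
call split in each case.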
 111
 112 An "in" printer function has the following prototype:
 113
 114   void svc_syscall_in(struct trace_proc *proc, const message *m_out,
 115           const message *m_in, int failed);
 116
 117 Again, "proc" is the traced process whose current system call has now
 118 returned.  "m_out" is again the request message, guaranteed to be unchanged
 119 since the "out" call.  "m_in" is the reply message from the service.  "failed"
 120 is either 0 to indicate that the call appears to have succeeded, or PF_FAILED
 121 to indicate that the call definitely failed.  If PF_FAILED is set, the call
 122 has failed either at the IPC level or at the system call level (or for another,
 123 less common reason).  In that case, the contents of "m_in" may be garbage and
 124 "m_in" must not be used at all.
 125
 126 For CT_NOTDONE type calls, the in printer function should first print the
 127 remaining parameters.  Here especially, it is important to consider that the
 128 entire call may fail.  In that case, the parameters of which the contents were
 129 still going to be printed may also contain garbage, since they were never
 130 filled.  The expected behavior is to print such parameters as a pointer or "&.."
 131 or something else to indicate that their actual contents are not valid.
 132
 133 Either way, once a CT_NOTDONE type call function is done printing the remaining
 134 parameters, it must call put_equals(proc) to print the closing parenthesis of
 135 the call and the equals sign.  CT_NORETURN calls must also use put_equals(proc)
 136 to print the equals sign.
 137
 138 Then comes the result part.  If the call failed, the in printer function *must*
 139 use put_result(proc) to print the failure result.  This call not only takes
 140 care of converting negative error codes from m_in->m_type into "-1 [ECODE]" but
 141 also prints appropriate failure codes for IPC-level and other exceptional
 142 failures.  Only if the system call did not fail, may the in printer function
 143 choose to not call put_result(proc), which on success simply prints
 144 m_in->m_type as an integer.  Similarly, if the system call succeeded, the in
 145 printer function may print extended results after the primary result, generally
 146 in parentheses.  For example, getpid() and getppid() share the same system call
 147 and thus the tracer prints both return values, one as the primary result of the
 148 actual call and one in parentheses with a clarifying name as extended result:
 149
 150   getpid() = 3 (ppid=1)
 151
 152 It should now be clear that printing extended results makes no sense if the
 153 system call failed.
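
The rules for an "in" printer of a CT_NOTDONE call can be sketched as below.
Everything here is a simplified stand-in: the message fields ("addr", "copy",
"len"), the way the reply code is stored in the process structure, the value
of PF_FAILED, and the put_ functions themselves are assumptions made for the
sake of a self-contained example.  In particular, this put_result() prints a
numeric error code where the tracer would print a symbolic name.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define PF_FAILED 0x01		/* actual value is an assumption */

typedef struct {
	int m_type;		/* in a reply: result, negative on error */
	unsigned long addr;	/* hypothetical: buffer address in tracee */
	const char *copy;	/* hypothetical: local copy of the buffer */
	long len;		/* hypothetical: requested length */
} message;

struct trace_proc {
	char out[128];
	int nfields;		/* parameters printed by this printer */
	int result;		/* reply code, stored for put_result() */
};

void
emit(struct trace_proc *proc, const char *text)
{
	size_t len = strlen(proc->out);

	snprintf(proc->out + len, sizeof(proc->out) - len, "%s", text);
}

void
put_value(struct trace_proc *proc, const char *name, const char *fmt, ...)
{
	va_list ap;
	size_t len;

	(void)name;
	if (proc->nfields++ > 0)
		emit(proc, ", ");
	len = strlen(proc->out);
	va_start(ap, fmt);
	vsnprintf(proc->out + len, sizeof(proc->out) - len, fmt, ap);
	va_end(ap);
}

void
put_equals(struct trace_proc *proc)
{
	emit(proc, ") = ");
}

void
put_result(struct trace_proc *proc)
{
	char buf[32];

	if (proc->result < 0)	/* the tracer prints e.g. "-1 [ENOENT]" */
		snprintf(buf, sizeof(buf), "-1 [error %d]", -proc->result);
	else
		snprintf(buf, sizeof(buf), "%d", proc->result);
	emit(proc, buf);
}

/* "in" printer for a read-like call whose "out" printer printed the fd. */
void
read_in(struct trace_proc *proc, const message *m_out, const message *m_in,
	int failed)
{
	if (failed)	/* buffer was never filled; m_in must not be used */
		put_value(proc, "buf", "&0x%lx", m_out->addr);
	else
		put_value(proc, "buf", "\"%.*s\"", (int)m_in->m_type,
		    m_out->copy);
	put_value(proc, "len", "%ld", m_out->len);
	put_equals(proc);
	put_result(proc);	/* mandatory on failure */
}
```

Starting from a buffer that already holds "read(5, ", a successful reply of 3
completes the line as read(5, "foo", 1024) = 3, while a failed call prints the
buffer as a pointer and ends in the failure result.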
 154
 155 Besides put_equals and put_result, the following more or less generic support
 156 functions are available to print the various parts of the requests and replies.
 157
 158   put_field - output a parameter, structure field, and so on; this function
 159               should be used for just about every actual value
 160   put_value - printf-like version of put_field
 161   put_text  - output plain text; for call handlers, this should be used only
 162               to add things right after a put_field call, never on its own
 163   put_fmt   - printf-like version of put_text, should generally not be used
 164               from call handlers at all
 165   put_open  - open a nested block of fields, surrounded by parentheses,
 166               brackets, or something like that; this is used for structures,
 167               arrays, and any other similar nontrivial case of nesting
 168   put_close - close a previously opened block of fields; the nesting depth is
 169               actually tracked (to keep per-level separators etc), so each
 170               put_open call must have a corresponding put_close call
 171   put_open_struct  - perform several tasks necessary to start printing the
 172                      fields of a structure; note that this function may fail!
 173   put_close_struct - end successful printing of a structure
 174   put_ptr   - print a pointer in the traced process
 175   put_buf   - print a buffer or string
 176   put_flags - print a bitwise flags field
 177   put_tail  - helper function for printing the continuation part of an array
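
To illustrate why put_open and put_close must be paired, here is a rough model
of per-level separator tracking.  The signatures are deliberately simplified
and are assumptions, as are MAX_DEPTH, emit() and print_execve_params(); the
tracer's own functions take more arguments and do more work.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 8		/* arbitrary limit for this sketch */

struct trace_proc {
	char out[128];
	int depth;		/* current nesting level */
	int count[MAX_DEPTH];	/* fields printed so far, per level */
};

void
emit(struct trace_proc *proc, const char *text)
{
	size_t len = strlen(proc->out);

	snprintf(proc->out + len, sizeof(proc->out) - len, "%s", text);
}

/* Simplified put_field(): the name is accepted but not printed. */
void
put_field(struct trace_proc *proc, const char *name, const char *text)
{
	(void)name;
	if (proc->count[proc->depth]++ > 0)
		emit(proc, ", ");
	emit(proc, text);
}

/* Simplified put_open(): the block counts as one field of its parent. */
void
put_open(struct trace_proc *proc, const char *name, const char *opening)
{
	put_field(proc, name, opening);
	assert(proc->depth + 1 < MAX_DEPTH);
	proc->count[++proc->depth] = 0;
}

/* Simplified put_close(): return to the parent level's separator state. */
void
put_close(struct trace_proc *proc, const char *closing)
{
	assert(proc->depth > 0);
	proc->depth--;
	emit(proc, closing);
}

/* Print the parameters of execve("foo", ["foo"], []). */
void
print_execve_params(struct trace_proc *proc)
{
	put_field(proc, "path", "\"foo\"");
	put_open(proc, "argv", "[");
	put_field(proc, NULL, "\"foo\"");
	put_close(proc, "]");
	put_open(proc, "envp", "[");
	put_close(proc, "]");
}
```

A missing put_close would leave the depth counter one level too deep, so every
later field would consult the wrong separator state.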
 178
 179 Many of these support functions take a flags field which takes PF_-prefixed
 180 flags to modify the output they generate.  The value of 'failed' in the in
 181 printer function may actually be passed (bitwise-OR'ed in) as the PF_FAILED
 182 flag to these support functions, and they will do the right thing.  For
 183 example, a call to put_open_struct with the PF_FAILED flag will end up simply
 184 printing the pointer to the structure, and not allow printing of the contents
 185 of the structure.
 186
 187 The above support functions are documented (at a basic level) within the code,
 188 but in many cases, it may be useful to look up how they are used in practice by
 189 the existing handlers.  The same goes for various less clear cases; while there
 190 is basic support for printing structures, support for printing arrays must be
 191 coded fully by hand, as has been done for many places.  A serious attempt has
 192 been made to make the output consistent across the board (mainly thanks to the
 193 output format of strace, on which the output of this tracer has been based,
 194 sometimes very strictly and sometimes more loosely, but that aside) so it is
 195 always advisable to follow the ways of the existing handlers.  Also keep in
 196 mind that there are already printer functions for several generic structures,
 197 and these should be used whenever possible (e.g., see the put_fd() comment).
 198
 199 Finally, the default_out and default_in functions may be used as printer
 200 functions for calls with no parameters, and for functions which need no more
 201 than put_result() to print their system call result, respectively.
 202
 203
 204 ADDING AN IOCTL HANDLER
 205
 206 There are many IOCTL requests, and many have their own associated data types.
 207 Like with system calls, the idea is to provide an actual implementation for any
 208 IOCTLs that can actually occur in the wild.  This consists of printing the full
 209 IOCTL name, as well as its argument.  First something about how handling IOCTLs
 210 is grouped into files in the ioctl subdirectory, then about the actual
 211 procedure by which the IOCTLs are handled.
 212
 213 Grouping of IOCTL handling in the ioctl subdirectory is currently based on the
 214 IOCTLs' associated device type.  This is not a performance optimization: for
 215 any given IOCTL, there is no way for the main IOCTL code (in ioctl.c) to know
 216 which group, if any, contains a handler for the IOCTL, so it simply queries all
 217 groups.  The grouping is there only to keep down the size of individual source
 218 files, and as such not even strict: for example, networking IOCTLs are
 219 technically a subset of character IOCTLs, and kept separate only because there
 220 are so many of them.  The point here is mainly that the separation is not at
 221 all set in stone.  However, the svrctl group is an exception: svrctl(2)
 222 requests are very much like IOCTLs, and thus also treated as such, but they are
 223 in a different namespace.  Thus, their handlers are in a separate file.
 224
 225 As per the ioctl_table structure, each group has a function to return the name
 226 of an IOCTL it knows (typically <group>_ioctl_name), and a function to handle
 227 IOCTL arguments (typically <group>_ioctl_arg).  Whenever an IOCTL system call
 228 is made, each group's name function is queried.  This function has the
 229 following prototype:
 230
 231   const char *group_ioctl_name(unsigned long req);
 232
 233 The "req" parameter contains the IOCTL request code.  The function is to return
 234 a static non-NULL string if it knows the name for the request code, or NULL
 235 otherwise.  If the function returns a non-NULL string, that name will be used
 236 for the IOCTL.  In addition, if the IOCTL has an argument at all, i.e. it is
 237 not of the basic _IO() type, that group (and only that group!) will be queried
 238 about the IOCTL argument, by calling the group's IOCTL argument function.  The
 239 IOCTL argument function has the following prototype:
 240
 241   int group_ioctl_arg(struct trace_proc *proc, unsigned long req, void *ptr,
 242           int dir);
 243
 244 For a single IOCTL, this function may be called up to three times.  The first
 245 time, "ptr" will be NULL, and based on the same IOCTL request code "req", the
 246 function must return any bitwise combination of two flags: IF_OUT and IF_IN.
 247
 248 The returned flags determine whether and how the IOCTL's argument will be
 249 printed: before and/or after performing the IOCTL system call.  These two flags
 250 effectively correspond to the "write" and "read" argument directions of IOCTLs:
 251 IF_OUT indicates that the argument should be printed before the IOCTL request,
 252 and this is to be used only for IOCTLs of type _IOW() and _IOWR().  IF_IN
 253 indicates that the argument should be printed after the IOCTL request (but if
 254 it was successful only), and is to be used only for IOCTLs of type _IOR() and
 255 _IOWR().
 256
 257 The returned flag combination determines how the IOCTL is formatted.  The
 258 following possible return values result in the following output formats, again
 259 with the "|" indicating the call split, "out" being the IOCTL argument contents
 260 printed before the IOCTL call, and "in" being the IOCTL argument printed after
 261 the IOCTL call:
 262
 263   0:             ioctl(3, IOCFOO, &0xaddress) = |0
 264   IF_OUT:        ioctl(3, IOCFOO, {out}) = |0
 265   IF_OUT|IF_IN:  ioctl(3, IOCFOO, {out}) = |0 {in}
 266   IF_IN:         ioctl(3, IOCFOO, |{in}) = 0
 267
 268 Both IF_ flags are optional, mainly because it is not always needed to print
 269 both sides for an _IOWR() request.  However, using the wrong flag (e.g., IF_OUT
 270 for an _IOR() request, which simply makes no sense) will trigger an assert.
 271 Also, the function should basically never return 0 for an IOCTL it recognizes.
 272 Again, for IOCTLs of type _IO(), which have no argument, the argument function
 273 is not called at all.
 274
 275 Now the important part.  For each flag that is returned on the initial call to
 276 the argument function, the argument function will be called again, this time to
 277 perform actual printing of the argument.  For these subsequent calls, "ptr"
 278 will point to the argument data which has been copied to the local address
 279 space, and "dir" will contain one of the returned flags (that is, either IF_OUT
 280 or IF_IN) to indicate whether the function is called before or after the IOCTL
 281 call.  As should now be obvious, if the first call returned IF_OUT | IF_IN, the
 282 function will be called again with "dir" set to IF_OUT, and if the IOCTL call
 283 did not fail, once more (for the third time), now with "dir" set to IF_IN.
 284
 285 For these calls with an actual "ptr" value and a direction, the function should
 286 indeed print the argument as appropriate, using "proc" as process pointer for
 287 use in calls to the printing functions.  The general approach is to print non-
 288 structure arguments as single values with no field name, and structure
 289 arguments by printing their fields with their field names.  The main code (in
 290 ioctl.c) ensures that the output is enclosed in curly brackets, thus making the
 291 output look like a structure anyway.
 292
 293 For these subsequent calls, the argument function's return value should be
 294 IF_ALL if all parts of the IOCTL argument have been printed, or 0 otherwise.
 295 In the latter case, the main code will add a final ".." field to indicate to
 296 the user that not all parts of the argument have been printed, very much like
 297 the "all" parameter of put_close_struct.
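
The calling sequence described above can be sketched for a single request as
follows.  IOCFOO, struct foo and its fields are hypothetical, the IF_ flag
values are assumptions, and trace_ioctl() is only a rough model of what the
main IOCTL code does; it covers the IF_OUT | IF_IN layout only and prints a
bare number as the result.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define IF_OUT	0x1		/* flag values are assumptions */
#define IF_IN	0x2
#define IF_ALL	0x4
#define IOCFOO	0x1234UL	/* hypothetical _IOWR()-type request */

struct foo { int speed; int status; };	/* hypothetical argument type */

struct trace_proc { char out[128]; int nfields; };	/* stand-in */

void
vemit(struct trace_proc *proc, const char *fmt, va_list ap)
{
	size_t len = strlen(proc->out);

	vsnprintf(proc->out + len, sizeof(proc->out) - len, fmt, ap);
}

void
emit(struct trace_proc *proc, const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vemit(proc, fmt, ap);
	va_end(ap);
}

/* Simplified put_value(): prints "name=value", or "value" if no name. */
void
put_value(struct trace_proc *proc, const char *name, const char *fmt, ...)
{
	va_list ap;

	if (proc->nfields++ > 0)
		emit(proc, ", ");
	if (name != NULL)
		emit(proc, "%s=", name);
	va_start(ap, fmt);
	vemit(proc, fmt, ap);
	va_end(ap);
}

/* Argument function of a hypothetical group that knows only IOCFOO. */
int
foo_ioctl_arg(struct trace_proc *proc, unsigned long req, void *ptr, int dir)
{
	struct foo *foo = ptr;

	if (req != IOCFOO)
		return 0;
	if (ptr == NULL)		/* first call: report directions */
		return IF_OUT | IF_IN;
	if (dir == IF_OUT)		/* before the call: input field */
		put_value(proc, "speed", "%d", foo->speed);
	else				/* after the call: output field */
		put_value(proc, "status", "%d", foo->status);
	return IF_ALL;
}

/* Print one side of the argument, enclosed in curly brackets. */
void
print_arg(struct trace_proc *proc, unsigned long req, void *ptr, int dir)
{
	emit(proc, "{");
	proc->nfields = 0;
	if (foo_ioctl_arg(proc, req, ptr, dir) != IF_ALL)
		put_value(proc, NULL, "..");
	emit(proc, "}");
}

/* Rough model of the main IOCTL code; "result" is the call's outcome. */
void
trace_ioctl(struct trace_proc *proc, int fd, unsigned long req, void *ptr,
	int result)
{
	int flags = foo_ioctl_arg(proc, req, NULL, 0);

	proc->out[0] = '\0';
	if (flags != (IF_OUT | IF_IN))
		return;			/* other layouts are not modeled */
	emit(proc, "ioctl(%d, IOCFOO, ", fd);
	print_arg(proc, req, ptr, IF_OUT);
	/* ...the traced process performs the actual IOCTL here... */
	emit(proc, ") = %d", result);
	if (result >= 0) {		/* third call only on success */
		emit(proc, " ");
		print_arg(proc, req, ptr, IF_IN);
	}
}
```

For a successful call this produces a line of the form
ioctl(3, IOCFOO, {speed=9600}) = 0 {status=1}; when the call fails, the
argument function is not called a third time and no {in} part appears.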
 298
 299 If no name can be found for the IOCTL request code, the argument will simply be
 300 printed as a pointer.  The same happens in error cases, for example if copying
 301 in the IOCTL data resulted in an error.
 302
 303 There is no support for dealing with multiple IOCTLs with the exact same
 304 request code--something that should not, but sadly does, occur in practice.
 305 For now, the preferred approach would be to implement only support for the
 306 IOCTL that is most likely to be found in practice, and possibly to put a horse
 307 head in the bed of whoever introduced the duplicate request code.
 308
 309
 310 INTERNALS: MULTIPROCESS OUTPUT AND PREEMPTION
 311
 312 Things get interesting when multiple processes are traced at once.  Due to the
 313 nature of process scheduling, system calls may end up being preempted between
 314 the call-enter and call-leave phases.  This means that the output of a system
 315 call has to be suspended to give way to an event from another traced process.
 316 Such preemption may occur with literally all calls; not just "blocking" calls.
 317
 318 The tracer goes to some lengths to aid the user in following the output in
 319 the light of preemption.  The most important aspect is that the output of the
 320 call-enter phase is recorded, so that in the case of preemption, the call-leave
 321 phase can start by replaying the record.  As a result, the user gets to see the
 322 whole system call on a single line, instead of just the second half.  Such
 323 system call resumptions are marked with a "*" in their prefix, to show that
 324 the call was not just entered.  The output therefore looks like this:
 325
 326       2| syscall() = <..>
 327       3| othercall() = 0
 328       2|*syscall() = 0
 329
 330 Signals that arrive during a call will cause a resumption of the call as well.
 331 As a result, a call may be resumed multiple times:
 332
 333       2| syscall() = <..>
 334       3| othercall() = 0
 335       2|*syscall() = ** SIGUSR1 ** ** SIGUSR2 ** <..>
 336       3| othercall() = -1 [EBUSY]
 337       2|*syscall() = ** SIGHUP ** <..>
 338       3| othercall() = 0
 339       2|*syscall() = 0
 340
 341 This entire scenario shows one single system call from process 2.
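
The record-and-replay idea can be modeled in a few lines.  This is a sketch
under simplifying assumptions, not the tracer's implementation: the structure
layouts, the function names and the exact prefix width are made up, output is
collected in a buffer instead of being written out, and signals are not
modeled at all.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

struct trace_proc {
	int pid;
	char record[64];	/* text printed during the call-enter phase */
};

struct tracer {
	char out[512];		/* all output, standing in for the terminal */
	struct trace_proc *open;/* process whose call line is unfinished */
};

void
emit(struct tracer *t, const char *fmt, ...)
{
	va_list ap;
	size_t len = strlen(t->out);

	va_start(ap, fmt);
	vsnprintf(t->out + len, sizeof(t->out) - len, fmt, ap);
	va_end(ap);
}

/* Start a line for "proc", first suspending any unfinished line. */
void
start_line(struct tracer *t, struct trace_proc *proc, int resumed)
{
	if (t->open != NULL)
		emit(t, "<..>\n");	/* preemption marker */
	emit(t, "%6d|%c", proc->pid, resumed ? '*' : ' ');
	t->open = proc;
}

/* Call-enter: print the first part of the call and record it. */
void
call_enter(struct tracer *t, struct trace_proc *proc, const char *text)
{
	start_line(t, proc, 0);
	snprintf(proc->record, sizeof(proc->record), "%s", text);
	emit(t, "%s", proc->record);
}

/* Call-leave: replay the record if the call was preempted. */
void
call_leave(struct tracer *t, struct trace_proc *proc, const char *result)
{
	if (t->open != proc) {
		start_line(t, proc, 1);
		emit(t, "%s", proc->record);
	}
	emit(t, "%s\n", result);
	t->open = NULL;
}
```

Entering a call for process 2, then entering and leaving a call for process 3,
and finally leaving the call of process 2 yields three lines in the style of
the first example above: a suspended line ending in "<..>", the complete line
of process 3, and a replayed line marked with "*".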
 342
 343 In the current implementation, the output that should be recorded and/or cause
 344 the "<..>" preemption marker, as well as the cases where the recorded text must
 345 be replayed, are marked by the code explicitly.  Replay takes place in three
 346 cases: upon the call-leave event (obviously), upon receiving a signal (as shown
 347 above), and when it is required that a suspended no-return call is shown as
 348 completed before continuing with other output.  The last case applies to exit()
 349 and execve(), and both are documented in the code quite extensively.  Generally
 350 speaking, in all output lines where no recording or replay actions are
 351 performed, the recording will not be replayed but also not removed.  This
 352 allows for intermediate lines for that process in the output.  Practically
 353 speaking, future support for job control could even print when a process gets
 354 stopped and continued, for that process, while preempting the output for the
 355 ongoing system call for that same process.
 356
 357 It is possible that the output of the call-enter phase exhausts the recording
 358 buffer for its process.  In this case, a new, shorter text is generated upon
 359 process resumption.  There are many other aspects to proper output formatting
 360 in the light of preemption, but most of them should be documented as part of
 361 the code reasonably well.