internal.doc

Internals of the Netwide Assembler
==================================

The Netwide Assembler is intended to be a modular, re-usable x86
assembler, which can be embedded in other programs, for example as
the back end to a compiler.

The assembler is composed of modules. The interfaces between them
look like:

                  +---- parser.c ----+
                  |        |         |
                  |     float.c      |
                  |                  |
                  +--- assemble.c ---+
                  |        |         |
        nasm.c ---+     insnsa.c     +--- nasmlib.c
                  |                  |
                  +---- labels.c ----+
                  |                  |
                  +--- outform.c ----+
                  |                  |
                  +----- *out.c -----+

In other words, `parser.c', `assemble.c', `labels.c', `outform.c'
and each of the output format modules `*out.c' are independent
modules, which do not communicate with one another except through
the main program.

The Netwide *Disassembler* is not intended to be particularly
portable or reusable or anything, however. So I won't bother
documenting it here. :-)

nasmlib.c
---------

This is a library module; it contains simple library routines which
may be referenced by all other modules. Among these is a set of
wrappers around the standard `malloc' routines, which will report a
fatal error if they run out of memory, rather than returning NULL.
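
The wrapper idea can be sketched as follows. This is a minimal
illustration, not the code in `nasmlib.c': the names
`checked_malloc' and `checked_strdup' and the wording of the error
message are assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical wrapper: returns usable memory or terminates the
 * program. The size must be non-zero, because malloc(0) is allowed
 * to return NULL even when memory is available.
 */
void *checked_malloc(size_t size)
{
    void *p;

    assert(size > 0);
    p = malloc(size);
    if (p == NULL) {
        fprintf(stderr, "fatal: out of memory\n");
        exit(EXIT_FAILURE);
    }
    return p;
}

/* Hypothetical helper built on the wrapper: duplicate a string. */
char *checked_strdup(const char *s)
{
    size_t len;
    char *copy;

    assert(s != NULL);
    len = strlen(s) + 1;            /* include the terminating NUL */
    copy = checked_malloc(len);
    memcpy(copy, s, len);
    return copy;
}
```

With wrappers of this kind, callers never need to test the result
for NULL, which keeps the other modules free of allocation-failure
handling.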

parser.c
--------

This contains a source-line parser. It parses `canonical' assembly
source lines, containing some combination of the `label', `opcode',
`operand' and `comment' fields: it does not process directives or
macros. It exports two functions: `parse_line' and `cleanup_insn'.

`parse_line' is the main parser function: you pass it a source line
in ASCII text form, and it returns you an `insn' structure
containing all the details of the instruction on that line. The
parameters it requires are:

- The location (segment, offset) where the instruction on this line
  will eventually be placed. This is necessary in order to evaluate
  expressions containing the Here token, `$'.

- A function which can be called to retrieve the value of any
  symbols the source line references.

- Which pass the assembler is on: an undefined symbol only causes an
  error condition on pass two.

- The source line to be parsed.

- A structure to fill with the results of the parse.

- A function which can be called to report errors.
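
How the symbol-lookup function, the pass number and the error
function in the list above might interact can be sketched like
this. It is not the real `parse_line' interface: the callback
types, `resolve_symbol' and the toy callbacks are all hypothetical
names invented for the example.

```c
#include <assert.h>
#include <string.h>

#define NO_SEG (-1L)   /* "no segment", as described later in this file */

/* Hypothetical callback types; the real prototypes may differ. */
typedef int (*lookup_fn)(const char *name, long *segment, long *offset);
typedef void (*error_fn)(const char *message);

/*
 * Resolve a symbol the way a two-pass parser might: an unknown
 * symbol is tolerated on pass one and reported on pass two.
 * Returns the symbol's offset, or 0 if it could not be resolved.
 */
long resolve_symbol(const char *name, int pass,
                    lookup_fn lookup, error_fn error, long *segment)
{
    long offset = 0;

    assert(name != NULL && segment != NULL);
    if (lookup(name, segment, &offset))
        return offset;

    *segment = NO_SEG;
    if (pass == 2)
        error("symbol not defined");
    return 0;
}

/* Toy callbacks, used only to exercise the sketch. */
int error_count = 0;

int toy_lookup(const char *name, long *segment, long *offset)
{
    if (strcmp(name, "start") != 0)
        return 0;
    *segment = 2;
    *offset = 0x100;
    return 1;
}

void toy_error(const char *message)
{
    (void)message;
    error_count++;
}
```

Passing the lookup and error routines in as function pointers is
what lets the parser stay independent of the label manager and of
the main program's error handling.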

Some instructions (DB, DW, DD for example) can require an arbitrary
amount of storage, and so some of the members of the resulting
`insn' structure will be dynamically allocated. The other function
exported by `parser.c' is `cleanup_insn', which can be called to
deallocate any dynamic storage associated with the results of a
parse.

names.c
-------

This doesn't count as a module - it defines a few arrays which are
shared between NASM and NDISASM, so it's a separate file which is
#included by both parser.c and disasm.c.

float.c
-------

This is essentially a library module: it exports one function,
`float_const', which converts an ASCII representation of a
floating-point number into an x86-compatible binary representation,
without using any built-in floating-point arithmetic (so it will run
on any platform, portably). It calls nothing, and is called only by
`parser.c'. Note that the function `float_const' must be passed an
error reporting routine.

assemble.c
----------

This module contains the code generator: it translates `insn'
structures as returned from the parser module into actual generated
code which can be placed in an output file. It exports two
functions, `assemble' and `insn_size'.

`insn_size' is designed to be called on pass one of assembly: it
takes an `insn' structure as input, and returns the amount of space
that would be taken up if the instruction described in the structure
were to be converted to real machine code. `insn_size' also needs
to be told the location (as a segment/offset pair) where the
instruction would be assembled, the mode of assembly (16/32 bit
default), and a function it can call to report errors.

`assemble' is designed to be called on pass two: it takes all the
parameters that `insn_size' does, but has an extra parameter which
is an output driver. `assemble' actually converts the input
instruction into machine code, and outputs the machine code by means
of calling the `output' function of the driver.
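
The division of labour between the two passes can be sketched as
below. Everything here is a simplified stand-in: `insn_sketch',
`driver_sketch' and the `_sketch' functions are hypothetical, and
the location, mode and error-function parameters described above
are left out to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the `insn' structure and a driver. */
struct insn_sketch {
    const unsigned char *bytes;   /* pre-encoded machine code */
    size_t length;
};

struct driver_sketch {
    void (*output)(const unsigned char *data, size_t length);
};

/* Pass one: report the size without emitting anything. */
size_t insn_size_sketch(const struct insn_sketch *ins)
{
    assert(ins != NULL);
    return ins->length;
}

/* Pass two: emit the code through the driver's output function. */
size_t assemble_sketch(const struct insn_sketch *ins,
                       const struct driver_sketch *drv)
{
    assert(ins != NULL && drv != NULL);
    drv->output(ins->bytes, ins->length);
    return ins->length;
}

/* Toy driver that collects output in a fixed buffer. */
unsigned char collected[16];
size_t collected_length = 0;

void collect_output(const unsigned char *data, size_t length)
{
    size_t i;

    for (i = 0; i < length && collected_length < sizeof collected; i++)
        collected[collected_length++] = data[i];
}
```

A caller would sum the pass-one sizes to learn every instruction's
offset, and only then run pass two, so that forward references are
already resolved when code is emitted.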

insnsa.c
--------

This is another library module: it exports one very big array of
instruction translations. It has to be a separate module so that DOS
compilers, with less memory to spare than typical Unix ones, can
cope with it.

labels.c
--------

This module contains a label manager. It exports six functions:

`init_labels' should be called before any other function in the
module. `cleanup_labels' may be called after all other use of the
module has finished, to deallocate storage.

`define_label' is called to define new labels: you pass it the name
of the label to be defined, and the (segment,offset) pair giving the
value of the label. It is also passed an error-reporting function,
and an output driver structure (so that it can call the output
driver's label-definition function). `define_label' mentally
prepends the name of the most recently defined non-local label to
any label beginning with a period.

`define_label_stub' is designed to be called in pass two, once all
the labels have already been defined: it does nothing except to
update the "most-recently-defined-non-local-label" status, so that
references to local labels in pass two will work correctly.
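
The local-label rule can be illustrated with a small sketch. It is
not the code in `labels.c': `qualify_label', `last_global' and the
`MAX_LABEL' limit are hypothetical, and `snprintf' assumes a C99
library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LABEL 64   /* arbitrary limit chosen for the sketch */

/* The most recently seen non-local label. */
static char last_global[MAX_LABEL] = "";

/*
 * Write the full name of `name' into `out', which must have room
 * for MAX_LABEL bytes. A name starting with '.' is local and gets
 * the most recent non-local label prepended; any other name is
 * used as is and becomes the new most recent non-local label.
 */
void qualify_label(const char *name, char *out)
{
    assert(name != NULL && out != NULL);
    if (name[0] == '.') {
        snprintf(out, MAX_LABEL, "%s%s", last_global, name);
    } else {
        snprintf(out, MAX_LABEL, "%s", name);
        strcpy(last_global, out);   /* out is already bounded */
    }
}
```

In this model, the pass-two role described for `define_label_stub'
corresponds to the bookkeeping in the `else' branch alone: the
label table is left untouched and only `last_global' is updated.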

`declare_as_global' is used to declare that a label should be
global. It must be called _before_ the label in question is defined.

Finally, `lookup_label' attempts to translate a label name into a
(segment,offset) pair. It returns non-zero on success.

The label manager module is (theoretically :) restartable: after
calling `cleanup_labels', you can call `init_labels' again, and
start a new assembly with a new set of symbols.

outform.c
---------

This small module contains a set of routines to manage a list of
output formats, and select one given a keyword. It contains three
small routines: `ofmt_register' which registers an output driver as
part of the managed list, `ofmt_list' which lists the available
drivers on stdout, and `ofmt_find' which tries to find the driver
corresponding to a given name.
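
A registry of this shape could look roughly like the following.
The structure layout, the fixed-size table and the function names
`register_format' and `find_format' are assumptions for
illustration; they are not the signatures of `ofmt_register' and
`ofmt_find'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FORMATS 8   /* arbitrary capacity for the sketch */

/* Hypothetical, much reduced output driver description. */
struct ofmt_sketch {
    const char *fullname;    /* human-readable description */
    const char *shortname;   /* keyword used to select the format */
};

static const struct ofmt_sketch *formats[MAX_FORMATS];
static int format_count = 0;

/* Add a driver to the list; returns 0 if the list is full. */
int register_format(const struct ofmt_sketch *fmt)
{
    assert(fmt != NULL);
    if (format_count >= MAX_FORMATS)
        return 0;
    formats[format_count++] = fmt;
    return 1;
}

/* Return the driver whose keyword matches `name', or NULL. */
const struct ofmt_sketch *find_format(const char *name)
{
    int i;

    assert(name != NULL);
    for (i = 0; i < format_count; i++)
        if (strcmp(formats[i]->shortname, name) == 0)
            return formats[i];
    return NULL;
}
```

Because drivers are registered at run time, the main program can
be built with any subset of the output modules without changes to
the lookup code.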

The output modules
------------------

Each of the output modules, `binout.o', `elfout.o' and so on,
exports only one symbol, which is an output driver data structure
containing pointers to all the functions needed to produce output
files of the appropriate type.

The exception to this is `coffout.o', which exports _two_ output
driver structures, since COFF and Win32 object file formats are very
similar and most of the code is shared between them.

nasm.c
------

This is the main program: it calls all the functions in the above
modules, and puts them together to form a working assembler. We
hope. :-)

Segment Mechanism
-----------------

In NASM, the term `segment' is used to cover all the different
sections/segments/groups of which an object file is composed.
Essentially, every address NASM is capable of understanding is
expressed as an offset from the beginning of some segment.

The defining property of a segment is that if two symbols are
declared in the same segment, then the distance between them is
fixed at assembly time. Hence every externally-declared variable
must be declared in its own segment, since none of the locations of
these are known, and so no distances may be computed at assembly
time.

The special segment value NO_SEG (-1) is used to denote an absolute
value, e.g. a constant whose value does not depend on relocation,
such as the _size_ of a data object.

Apart from NO_SEG, segment indices all have their least significant
bit clear, if they refer to actual in-memory segments. For each
segment of this type, there is an auxiliary segment value, defined
to be the same number but with the LSB set, which denotes the
segment-base value of that segment, for object formats which support
it (Microsoft .OBJ, for example).

Hence, if `textsym' is declared in a code segment with index 2, then
referencing `SEG textsym' would return zero offset from
segment-index 3. Or, in object formats which don't understand such
references, it would return an error instead.
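
The numbering convention can be written down in a few lines of C.
The helper names `is_memory_segment' and `segment_base_index' are
hypothetical; only NO_SEG and the use of the least significant bit
come from the description above.

```c
#include <assert.h>

#define NO_SEG (-1L)   /* absolute value, not tied to any segment */

/* True if `segment' names an actual in-memory segment (LSB clear). */
int is_memory_segment(long segment)
{
    return segment != NO_SEG && (segment & 1L) == 0;
}

/* Index denoting the segment-base value of an in-memory segment. */
long segment_base_index(long segment)
{
    assert(is_memory_segment(segment));
    return segment | 1L;
}
```

For the `textsym' example, `segment_base_index(2)' gives 3, which
is the segment index that `SEG textsym' refers to.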

The next twist is SEG_ABS. Some symbols may be declared with a
segment value of SEG_ABS plus a 16-bit constant: this indicates that
they are far-absolute symbols, such as the BIOS keyboard buffer
under MS-DOS, which always resides at 0040h:001Eh. Far-absolutes are
handled with care in the parser, since they are supposed to evaluate
simply to their offset part within expressions, but applying SEG to
one should yield its segment part. A far-absolute should never find
its way _out_ of the parser, unless it is enclosed in a WRT clause,
in which case Microsoft 16-bit object formats will want to know
about it.
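
The encoding of far-absolute segment values can be sketched as
follows. The numeric value given to SEG_ABS here is an assumption
made so that the example compiles; the NASM headers are the
authority on the real definition. The helper names are
hypothetical as well.

```c
#include <assert.h>

/* Assumed value, for illustration only. */
#define SEG_ABS 0x40000000L

/* Build the segment value for a far-absolute symbol. */
long far_absolute(unsigned int segment16)
{
    assert(segment16 <= 0xFFFFu);
    return SEG_ABS + (long)segment16;
}

/* True if `segment' is SEG_ABS plus some 16-bit constant. */
int is_far_absolute(long segment)
{
    return segment >= SEG_ABS && segment <= SEG_ABS + 0xFFFFL;
}

/* The segment part, i.e. what applying SEG should yield. */
long far_absolute_segment(long segment)
{
    assert(is_far_absolute(segment));
    return segment - SEG_ABS;
}
```

For the keyboard buffer at 0040h:001Eh, the symbol would carry the
segment value `far_absolute(0x0040)' and the offset 001Eh; inside
an expression only the offset is used, while SEG recovers 0040h.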

Porting Issues
--------------

We have tried to write NASM in portable ANSI C: we do not assume
little-endianness or any hardware characteristics (in order that
NASM should work as a cross-assembler for x86 platforms, even when
run on other, stranger machines).

Assumptions we _have_ made are:

- We assume that `short' is at least 16 bits, and `long' at least
  32. This really _shouldn't_ be a problem, since Kernighan and
  Ritchie tell us we are entitled to do so.

- We rely on having more than 6 characters of significance on
  externally linked symbols in the NASM sources. This may get fixed
  at some point. We haven't yet come across a linker brain-dead
  enough to get it wrong anyway.

- We assume that `fopen' using the mode "wb" can be used to write
  binary data files. This may be wrong on systems like VMS, with a
  strange file system. Though why you'd want to run NASM on VMS is
  beyond me anyway.

That's it. Subject to those caveats, NASM should be completely
portable. If not, we _really_ want to know about it.
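
The first assumption in the list can be made explicit with
<limits.h>. This is a sketch of one way to do it, with
hypothetical function names; on a conforming compiler the checks
should never fail, since the C standard guarantees these minimum
ranges.

```c
#include <assert.h>
#include <limits.h>

/* Refuse to compile if the integer types are too narrow. */
#if USHRT_MAX < 0xFFFF
#error "short must be at least 16 bits"
#endif
#if ULONG_MAX < 0xFFFFFFFF
#error "long must be at least 32 bits"
#endif

/* The same checks as a run-time test; returns 1 on success. */
int type_widths_ok(void)
{
    return USHRT_MAX >= 0xFFFFu && ULONG_MAX >= 0xFFFFFFFFuL;
}

/* Could be called once at start-up to fail early and loudly. */
void require_type_widths(void)
{
    assert(type_widths_ok());
}
```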

Porting Non-Issues
------------------

The following is _not_ a portability problem, although it looks like
one.

- When compiling with some versions of DJGPP, you may get messages
  such as `warning: ANSI C forbids braced-groups within
  expressions'. This isn't NASM's fault - the problem seems to be
  that DJGPP's definitions of the <ctype.h> macros include a
  GNU-specific C extension. So when compiling using -ansi and
  -pedantic, DJGPP complains about its own header files. It isn't a
  problem anyway, since it still generates correct code.
 267   -pedantic, DJGPP complains about its own header files. It isn't a
 268   problem anyway, since it still generates correct code.