1 PseudoC assembly language
2 =========================
4 Author: Paul Sokolovsky
6 ScratchABlock uses PseudoC as its Intermediate Representation (IR)
7 language. PseudoC uses familiar, C-like syntax to represent a generic
8 RISC-like assembler. Following were requirements for choosing such
11 * Syntax should be automagically understood by any (qualified)
12 programmer, what can be more familiar than C?
13 * A typical RISC assembler should map almost statement to statement
14 to IR, to ensure human readability. (Situation is not so bright with
15 CISC, but that's it - CISC.)
17 Syntactic elements of PseudoC are described below.
24 Start with `$` (chosen because some C compilers, like GCC, allow `$` to
25 be part of an identifier), followed by a normal identifer. Examples:
28 $a0, $a1, $r, $eax, $long_virtual_reg
31 Virtual registers are effectively "variables" of a PseudoC program, and
32 may be sometimes called as such in the documentation/code. But generally,
33 this should be avoided, to not confuse them with C-level variables.
34 Currently, registers are assumed to be unsigned. Where for a particular
35 operation it's important that register's contents are interpreted as a
36 signed value, it's explicitly cast just for that operation (see below).
41 PseudoC uses a very simple type system, consisting of signed and unsigned
42 integers, represented in a form contracted from `<stdint.h>`:
44 * Unsigned: `u8`, `u16`, `u32`, `u64`, etc.
45 * Signed: `i8`, `i16`, `i32`, `i64`, etc.
47 The sizes of types (in bits) are generally expected to be powers of 2,
48 up to word length of a CPU, but syntax allows arbitrary sizes.
50 For memory references (see below), there can be pointers to the types
53 Types appear in few places in PseudoC program:
55 * Declaring types of CPU registers (tentative).
56 * When accessing memory, to cast numeric value in a register into a pointer
57 to a type of needed size.
58 * To mark places where signed variants of operations are needed (or
59 vice-versa, unsigned). In this usage, l-values remain l-values after
60 cast, unlike C. E.g. `(i32)$r0 >>= 1` performs arithmetic shift right
62 * For narrowing or widening values, where this usage has normal C
63 semantics, e.g. `$r32 = (i16)$r32`, assuming `$r32` is 32-bit, will
64 truncate it to 16 bits, then sign-extend the value to the full 32 bits.
67 ### Symbolic addresses
69 Addresses of memory locations are represented by an identifier. The
70 value of this identifier is a memory address. In other words, symbol's
71 value is an address constant. Note that in general, there're no numeric
72 addresses in a correct PseudoC program - i.e., PseudoC already assumes
73 that numeric data / addresses were already properly classified.
75 There can be exceptions to this rule, for example, MMIO addresses can
76 be represented by numeric constants. On the other hand, addresses of
77 jumps and variables are expected to be symbolic. But there still can
78 be necessary exeptions to this rule. For example, consider that a
79 particular program has two `u32` variables: `var1` at `0x10` and
80 `var2` at `0x14`. The code could access `var2` as:
84 $r1 = *(u32*)($r0 + 1)
87 This effectively represents an aliasing problem. Numeric constant
88 `0x13` above doesn't have any symbolic association. (And it means
89 that PseudoC listing alone may not be enough for all kinds of program
90 transformations, it may need to be accompanied by other data, e.g.
91 a symbol table in the case above. Considerations of additional data
92 is however outside the scope of PseudoC specification per se.)
97 PseudoC supports decimal constants (consisting of all digits) and
98 hexadical constants (starting with `0x`). As mentioned above, numeric data
99 is assumed to be such - any address would be replaced by a symbol, e.g.
100 if `0x406ef` is an address, it should be represented as e.g. `data_406ef`.
101 A constant may be prefixed with `-` to signify that it's negative. PseudoC
102 assumes two's complement representation of negative numbers (and may
103 convert negative number to unsigned representation if needed).
108 The C syntax and precedence rules are used for operators, including
109 compound assignments for "two address instruction" form. The following
110 operators are supported: `-` (unary minus), `!` (logical not), `(<type>)`
111 (type cast), `*` (unary, pointer dereference, must be followed by explicit
112 cast to a pointer type), `*`, `/`, `%` (modulo), `+`, `-`, `<<`, `>>`,
113 `<`, `<=`, `>=`, `>`, `==`, `!=`, `&`, `^`, `|`, `&&`, `||`.
115 The following C operators are not supported by PseudoC:
117 * `++` and `--` - use `$var += 1` and `$var -= 1` instead.
118 * Unary `&` - in assemebler (and thus PseudoC), there's no concept of
119 "memory variable", just an address and way to access (dereference) it.
120 Thus, unary `&` isn't needed either. (Unary `&` might appear in lifted
122 * `=` and compound assignments - these are supported in PseudoC, but are
123 statements, not operators. In other words, syntax like
124 `$r0 = $r1 = $r2 + 1` is not supported (should be 2 separate stamenents,
125 `$r1 = $r2 + 1` and `$r0 = $r1`). *Future*: If an actual CPU architecture
126 is found with "multiple assignments", this syntax could be support as
127 "macro-like extension" (see below).
128 * `.`, `->` - As there are no structures in PseudoC, there are no operators
129 for field access. `.` actually can be a part of identifier in ScratchABlock
130 (not formally part of PseudoC).
131 * `?:` - ternary operator is not supported, explicit `if` statement should be
133 * `,` - sequential operator not supported. *Future*: as alternative, multiple
134 `;`-separated statements on the same line could be allowed.
136 Example of operators (used as part of assignment statements):
143 Operations like `*`, `/`, `%`, `>>` may have different forms for signed and
144 unsigned cases. As mentioned above, PseudoC currently defaults to
145 unsigned operations. Where signed operations are performed, arguments
146 should be cast to a signed type. E.g.:
149 $a0 = (i32)$a1 * (i32)$a2
150 $eax = (i32)$eax >> 2
154 ### Special functions
156 Any CPU operation not representable by C-inspired operations above,
157 or any special operation in general, can be represented by a special
158 function. PseudoC defines inventory of generic special functions,
159 useful across different CPUs, and any particular CPU may define
160 additional special functions. An example of generic special functions:
165 // Extract bitfield, with keyword-like arguments
166 $a0 = bitfield($a0, /*lsb*/0, /*sz*/5)
169 CPU specific special functions:
172 // Maybe, disable interrupts
174 // Maybe, read status register of coprocessor 2
178 Note that this syntax is NOT used to represent calls to
179 functions/procedures defined in the PseudoC code. See the `call`
180 statement below for that.
183 ### Memory expressions
185 The syntax is: `*(type*)expr`, i.e. a typical C pointer dereference,
186 with explicit cast of an expression to pointer type. Examples:
190 $a0 = *(u8*)($a1 + $a2)
191 $eax = *(u32*)($ebx + $ecx * 8 + 3)
199 A most basic statement is an assigment, which computes expression on
200 the right hand side and assigns it to the location on left hand side
201 (commonly known as l-value). E.g.:
207 As the example above shows, assignments are not limited to "3-address
208 form", where there is only a destination on the LHS and operator
209 connecting 2 arguments on th RHS, instead arbitrary expressions are
212 L-values are also different to what they are in C. Following expressions
217 * Casts (including narrowing casts) of the above
218 * bitfield() special function, with constant *lsb* ans *sz* arguments,
221 Using casts and bitfield() allows to concisely represent accesses to
222 sub-registers, common in CISC CPUs. E.g. taking x86 ECX register as the
223 base register, CL sub-register can be assigned as:
232 bitfield($ecx, /*lsb*/8, /*sz*/8) = 2
235 ### Compound assignments
237 Compound assignments are similar to C, and are useful to represent
238 2-address form of assembly statements. E.g.:
259 (i32)$r1 = (i32)$r1 >> 1
264 Unconditional direct jumps are represented by:
270 Where `label` is an address symbol.
272 Conditional direct jumps:
275 if (condition) goto label
278 Where *condition* is an expression. A condition can be (among other things):
280 * `$Z == 0` (Z flag is zero)
281 * `$S == 1 && $V == 0` (a combination of flags)
282 * `$a0 == $a1` (direct register comparison, typical for RISC)
284 Note that a comparison is another case when signed casts may be required:
287 if ((i32)$a1 > (i32)$a2) goto less
290 `if` statement can be generalized to accept multiple conditions and optional
294 if ($a0 < 2) goto label1, ($a1 + $a2) goto label2, else goto label3
297 This form can be used to represent C's `switch` statement, and in general
298 to faithfully represent CFGs with multiple successors. Note that there is
299 no requirement that conditions were comparisons of the same variable with
300 a constant (like for C `switch`).
302 Indirect jumps use an expression which computes address instead of a static
308 if ($val < 10) goto *(u32*)(jtable + $val * 4)
312 ### Calling functions/procedures
314 The syntax is similar to unconditional jumps: `call label` or `call expr`,
320 call *(u32*)($r0 + $r1 * 4)
323 ### Returning from functions/procedures
331 However, return may also take an argument or list of arguments:
336 return UINT64($edx, $eax)
339 This has a number of uses:
341 * Making return value explicit (as a kind of self-documenting).
342 * As a shortcut for `uses()` special function.
343 * In lifted PsuedoC, as a result of decompilation process (in which case
344 it may be an expression, not just a register).
346 Macro-like functionality
347 ------------------------
349 PseudoC may support additional syntactic sugar, but generally, it's
350 expected that a parser may/will convert them to a number of individual
351 instructions from the repertoir above. Examples:
353 ### Conditional instructions
355 Some pseudo-RISC CPUs support conditional instruction execution. PseudoC
362 But it will be converted by the parser to:
365 if (!cond) goto _skip
371 ### Multiply-accumulate
373 Could be represented with:
376 $bigreg += $small1 * $small2
380 ### Macro-like special functions
382 (Not yet implemented.)
384 Special functions described above in general allow for concise
385 representation of cumbersome constructs, e.g.:
388 // lt() hides flag combination for signed less
392 ### Conditional flags
394 (Macros and pragmas are not yet implemented.)
396 The idea of dealing with conditional flags (as used in conditional jump
397 instructions in many architectures) is to maintain and assign
398 explicitly to virtual registers representing these conditinal flags. For
399 example, x86's Z flag can be named ``$Z``, ``$z``, ``$flags_z``, etc.
400 Potentially, it can be named even ``$flags.Z`` (i.e. structure field
401 syntax - likely, a bitfield).
403 As an example, x86 ``cmp eax, ebx`` instruction's effect on Z flag could
411 Of course, this instruction sets more flags, e.g. effect on Z, C, and S
412 flags could be represented as:
421 An instruction which sets flags may overwrite a source register, but its
422 original value may be needed to properly calculate flags, so it may need
423 to be preserved, e.g. for ``sub eax, ebx``:
434 As can be seen, such conversion can lead to quite verbose PseudoC
435 instruction sequences. Many superfluous assignments will be removed
436 by further processing (expression propagation and dead code elimination),
437 but input PseudoC is still very verbose. As human-readability is one
438 of the main goals of the format, there are still further ideas how
439 to make it less verbose:
441 1. Implicit macros. Perhaps, these would be activated by "pragma", and
442 could automatically add assignments to save original contents of the
446 #pragma save_org_regs
451 would translate (during parsing) to:
462 2. A macro to set flags, e.g. x86 ``sub eax, ebx`` and ``add ecx, edx``
463 could be converted to:
466 #pragma save_org_regs
473 Instruction addresses (interpreted symbolically)
474 ------------------------------------------------
476 For a particular PseudoC program, each line may be prefixed by an
477 address. This address is interpreted symbolically, not numerically.
478 This is to facilitate insertion of new statements in a program.
479 Addresses are compared lexicographically, not numerically, so for
480 proper operation they should be e.g. of the same width (padded with
481 zeroes on the left). If explicit addresses are not given, an
482 implicit address is assigned to each instruction, based on the line
483 number of the instruction.
485 Representation of functions/procedures
486 --------------------------------------
488 Canonical PseudoC representation is one function/procedure per files.
489 For the cases when this is not practical, multiple functions can be
490 put into a file, each starting with `.func` pragma (RFC). A function
491 should start with a label, representing function name.
493 A PseudoC program (i.e. file) may start with declararion of register
494 sizes/signedness (RFC), e.g.:
502 Examples of PseudoC programs
503 ----------------------------
507 01 if (!$a1) goto l30
519 RFC - Request for comments from interested parties. Most of RFCs marked
520 in texts are not yet implemented in the reference PseudoC processing
521 library, ScratchABlock.