Goal: Translate from C to assembly language Idea: Gradually lower C towards target language Example: int *p = ...; uintptr_t n = ...; p += n; => (remove pointer arithmetic) p = (int *)((uintptr_t)p + n * sizeof(int)); => (remove nested subexpressions) uintptr_t t1 = n * sizeof(int); p = (int *)((uintptr_t)p + t1); => (strength reduction, assuming sizeof(int) == 4) uintptr_t t1 = n << 2; p = (int *)((uintptr_t)p + t1); => (direct translation by assembly, assuming p and n are in registers) slli t1, n, 2 // uintptr_t t1 = n << 2 add p, t1 // p = (int *)((uintptr_t)p + t1); Features in C that may not be available in an assembly language: - Local and global variables (vs registers, memory locations) - Non-word-sized data types (uint8, int64, arrays, structs/unions) - "Missing" operations (e.g. sign/zero extension) - Pointer arithmetic (implicit multiplication and division) - Nested/compound expressions - Structured control flow - Short-circuiting expressions with implicit control flow - Functions Local variables: Initially assume all local variables fit in registers. Assume the target machine only has word-sized registers (e.g. 32 bits). Then easy to do word-sized operations that correspond 1:1 to instructions: uint32 a = ...; uint32 b = ...; uint32 c = a + b; => (direct translation to assembly) add c, a, b // uint32 c = a + b; But there are still missing word-sized operations, e.g. no seq on RISC-V, uint32 c = a == b; => (lower from set-if-equal to set-if-zero) uint32 c = a - b == 0; => (remove nested subexpressions) uint32 c = a - b; c = c == 0; => (direct translation to assembly) sub c, a, b // uint32 c = a - b; seqz c // c = c == 0; => (expansion of seqz pseudoinstruction) sub c, a, b // uint32 c = a - b; sltiu c, 1 // c = c == 0; --- Subword locals have to live in word-sized registers and use word-sized ops. uint8 a = ...; uint8 b = ...; uint8 c = a + b; Note that C promotes a and b to ints before addition, so this is implicitly equivalent to uint8 c = int(a) + int(b); And int is then implicitly converted back to uint8 for the assignment/initialization: uint8 c = uint8(int(a) + int(b)); For most operations, information flows from less significant bits to more significant bits or don't flow at all. E.g. bitwise operations like and, or, xor don't have any flow of information between bits, and addition and multiplication have carries from lower bits to higher bits. This means that you can often do a long sequence of subword ops with word ops and only do word-to-subword conversion "at the end". But in any case, this suggests that it's a good idea to avoid using anything but word-sized arithmetic/operations. Try to treat uint8, uint16, etc, purely as storage types that are used for memory loads/stores, and do most things with native ints, uints if possible. Naive conversion: uint8 a = ...; uint8 b = ...; uint8 c = ...; uint8 d = uint8(int(a) + int(b)); int e = c < d; => (remove nested subexpressions) int t1 = int(a) + int(b); uint8 d = uint8(t1); int e = c < d; => (direct conversion to assembly) add d, a, b // int t1 = int(a) + int(b); zext8 d // d = uint8(t1); slt e, c, d // int e = c < d; => (expand zext8 pseudoinstruction) add d, a, b // int t1 = int(a) + int(b); andi d, 255 // d = uint8(t1); slt e, c, d // int e = c < d; If we're using int8 rather than uint8, it would be sext8 instead of zext8, and sext8 d would expand to slli d, 24 srai d, 24 Keep in mind that RISC-V and other architectures, while they may not have subword arithmetic operations, they always have subword (byte and halfword) store-to-memory operations, in which case zero/sign extension can be elided: static uint8 s; uint8 a = ...; uint8 b = ...; s = a + b; => (remove nested subexpressions, assignment to static is an implicit store) uint32 t1 = a + b; s = t1; => (direct translation to assembly) add t1, a, b // uint32 t1 = a + b; sb s, t1 // s = t1; Note that sb/sh/sw may be a pseudoinstruction when used with a label like s as its destination operand, but we won't further expand it here. --- For operations on larger-than-word-sized operands like uint64 on a 32-bit platform, you can store each local variable in two registers and synthesize double-width operations from single-width operations. The details vary with the architecture, e.g. it's easier/faster to generate double-width adds if the arch has a carry flag, so on x86 you can do add eax, ecx adc ebx, edx for a 64-bit add, where eax:ebx is the first operand and ecx:edx is the second operand. Here adc is "add with carry", which adds an extra 1 if the previous add had a carry. This kind of thing has traditionally always been available, e.g. here's how you'd do a 16-bit add on 6502, which has 8-bit operations: lda X add Y sta Z lda X+1 adc Y+1 sta Z+1 However, RISC-V (and a few others) chose not to have a carry flag, so it's actually a bit more work to synthesize a double-width add on that architecture. Interestingly, while the multiply in the M extension of RISC-V doesn't produce a double-width result, it has a mulh/mulhu instruction that you can call to generate the higher half of the result separately. Suppose we want to add al:ah to bl:bh where al, ah, bl, bh are 32-bit words. The result will be stored in cl:ch. add cl, al, bl add ch, ah, bh sltu t1, cl, al // t1 will be the carry of cl = al + bl add ch, t1 // add in the carry Suppose we want to calculate c = a * b. As before, break them into 32-bit parts: cl + ch * 2^32 = (al + ah * 2^32) * (bl + bh * 2^32) = al * bl + (al * bh + ah * bl) * 2^32 + ah * bh * 2^64 We want to calculate this mod 2^64, so the last term vanishes. And then we want to break this into the upper and lower 32-bit halves, so cl = lo(al * bl) ch = hi(al * bl) + lo(al * bh) + lo(ah * bl) You can calculate the lo and hi products using mul/mulh and then you have to do the adds. You don't have to worry about carries here for the adds since only the upper half of the result involves explicit additions, and we don't care about anything beyond 64 bits. So the code would look something like this for an unsigned 64x64 = 64 bit mul: mul cl, al, bl mulhu ch, al, bl mul t1, al, bh add ch, t1 mul t1, ah, bl add ch, t1 What about 64-bit comparisons in terms of 32-bit comparisons? Equality is easy: al:ah == bl:bh iff al == bl and ah == bh. Depending on the context, you can implement the 'and' with cascaded branches or bitwise and. Inequality is only a little bit trickier. We just have to do lexicographic comparisons: al:ah < bl:bh iff ah < bh or (ah == bh and al < bl). Again, you can implement the logic with bitwise ops or branches. Division is quite a bit more complicated, so I won't cover that here. The M extension of RISC-V has no provision like mulh for assisting with double-width divides. --- Arrays are implemented via pointer arithmetic like in C, so let's cover pointer arithmetic. If given a base pointer p, p[i] is equivalent to *(p + i) and p + i is in turn equivalent to *(int *)((uintptr_t)p + i * sizeof(*p)). We already showed in the intro how this translates to assembly: int *p = ...; int x = p[i]; => (conversion to pointer arithmetic, removal of pointer arithmetic) int x = *(int *)((uintptr_t)p + i * sizeof(*p)); => (remove nested expressions, strength reduction of * to << when sizeof(*p) == 4) uint32 t1 = i << 2; int *t2 = (int *)((uintptr_t)p + t1); int x = *t2; => (direct translation to assembly) slli x, i, 2 add x, p lw x, [x] Note that here (as earlier) I'm reusing registers rather than using separate temporary registers where it's clear and makes sense. I won't point that out explicitly anymore. --- Subtraction of pointers is a little subtle. It looks innocent enough at the C level but actually involves an implicit divide at the assembly level. For example, typedef struct { int x, y, z; } Vector; Vector *p = ...; Vector *q = ...; ptrdiff_t n = p - q; => (removal of pointer arithmetic) ptrdiff_t n = ((uintptr_t)p - (uintptr_t)q) / sizeof(Vector); => (naive conversion to assembly, don't do this) sub n, p, q li t1, 12 div n, t1 Divides are still extremely high latency operations even on modern computers, so this is pretty bad. If the division was by a power of two, it could be reduced to a right shift, but here we're dividing by 12, so no simple optimization like that is available. However, there is a trick used by every compiler for optimizing divisions by constants. We won't cover that in detail here, but you can look it up. --- Structs and unions are handled by calculating the sizes and offsets of the fields and then doing explicit offset arithmetic when referencing fields. typedef struct { int x; // offset 0 int y; // offset 4 int z; // offset 8 } Vector; static Vector v; int y = v.y; => (convert field access to explicit offset arithmetic) int y = *(int *)((uintptr_t)&v + offsetof(Vector, y)); => (remove nested expressions) uintptr_t p = (uintptr_t)&v + offsetof(Vector, y); int y = *(int *)p; => (one possible translation to assembly) la p, v lw y, [p, OFFSET_VECTOR_Y] => (another possible translation to assembly) la p, v + OFFSET_VECTOR_Y lw y, [p] This assumes v is a label that requires the pseudoinstruction la to load its address. OFFSET_VECTOR_Y is an assembler constant which corresponds to offsetof(Vector, y), so it equals the constant 4 in this case. The la pseudoinstruction will in general expand to 2 instructions: AUIPC or LUI to load the upper 20 immediate bits, and ADDI to load the lower 12 immediate bits. While in the above case, the two possible translations take the same number of cycles, if you're loading/storing multiple fields, you always want to exploit the load/store addressing modes with immediate offsets: int x = v.x; int y = v.y; int z = v.z; => la p, v lw x, [p, OFFSET_VECTOR_X] lw y, [p, OFFSET_VECTOR_Y] lw z, [p, OFFSET_VECTOR_Z] In order to calculate offsets and sizes for structs/unions, you have to know the ABI's packing and alignment conventions. --- Assembly language only directly supports "unstructured" control flow with direct jumps to a location, indirect jumps (where the target location is provided in a register), and conditional branches. How do you translate C-style structured control flow with if, while, for, etc into that vocabulary? if: if (x) { // A } // B => if (!x) goto B_start; // A B_start: // B if-else: if (x) { // A } else { // B } // C => if (!x) goto B_start; // A goto C_start; B_start: // B C_start: // C do-while: do { // A } while (x); // B => A_start: // A A_continue: if (x) goto A_start; A_break: // B If there is a break or continue in the scope of the do-while loop, they act as a goto to the A_break or A_continue labels. Rather than formulating while loops directly in terms of unstructured control flow, you can see them as a variant of do-while loops: while (x) { // A } // B => goto A_continue; do { // A A_continue: } while (x); // B => goto A_continue; A_start: // A A_continue: if (x) goto A_start; A_break: // B This is one case where it's normal to reverse the ordering between the C code and assembly code (the code for the while condition gets moved past the body). There is a translation that preserves the code order and probably feels more natural if you're coming from C: while (x) { // A } // B => A_continue: if (!x) goto A_break; // A goto A_continue; A_break: // B However, this is less efficient since every iteration contains an unconditional jump. The expectation is that loops execute more than one iteration, so you'd rather have the unconditional jump at the entry to the loop where it's executed only once. That is why the do-while-based translation is preferred. Now we can directly translate any of the goto-based C snippets to assembly code: goto A_continue; A_start: // A A_continue: if (x) goto A_start; A_break: // B => jmp A_continue A_start: // A A_continue: bne x, 0, A_start // B Or if you want to use anonymous labels: jmp 2> 1: // A 2: bne x, 0, <1 // B Here we were assuming the condition was based on the zeroness or nonzeroness of a variable x. In general, that is sufficient to support any condition. You can reduce a compound logical condition to something that generates a value into a variable x that can then be used with beq or bne. However, you should exploit the fused compare-and-branch instructions that RISC-V offers. Don't do this: if (x < y) { // A } // B => (remove expression from if condition) int t1 = x < y; if (t1) { // A } // B => (direct translation to assembly) slt t1, x, y beq t1, 0, >1 // A 1: // B Do this instead: if (x < y) { // A } // B => (reduction to unstructured control flow) if (x >= y) goto B_start; // A B_start: // B => (direct translation to assembly) bge x, y, B_start // A B_start: // B --- Short-circuiting logical expressions are translated to control flow. Logical and: if (/* A */ && /* B */) { // C } // D => if (!(/* A */)) goto D_start; if (!(/* B */)) goto D_start; // C D_start: // D Logical or: if (/* A */ || /* B */) { // C } // D => if (/* A */) goto C_start; if (/* B */) goto C_start; goto D_start; C_start: // C // D --- Switch statements are generally used if you want the compiler to generate jump tables. If we assume computed gotos are available as a C language extension (as they are in GCC/Clang), we can actually translate switch statements to computed gotos and explicit jump tables, like we've been doing with other constructs above. int x = ...; switch (x) { case ' ': // A break; case '\n': // B break; case '0': // C break; // ... default: // D } // E The compiler is allowed to turn a switch statement like this into a cascaded if-elseif chain and indeed some compilers will choose to do so in practice. But let's focus on the case where a jump table is going to be generated. enum { MIN_VALUE = '\n', MAX_VALUE = '0', }; void *switch_table[MAX_VALUE - MIN_VALUE + 1] = { [' ' - MIN_VALUE] = &&case_space, ['\n' - MIN_VALUE] = &&case_newline, ['0' - MIN_VALUE] = &&case_zero, // Fill remaining entries with &&case_default }; if (MIN_VALUE <= x && x <= MAX_VALUE) { goto *switch_table[x - MIN_VALUE]; } goto case_default; case_space: // A goto switch_break; case_newline: // B goto switch_break; case_zero: // C goto switch_break; // ... case_default: // D switch_break: // E You can also add an extra level of indirection to the table lookup to compress the table when it has few distinct entries relative to the total number of keys: void *cases[] = { &&case_default, &&case_space, &&space_newline, &&case_zero, }; uint8_t switch_table[MAX_VALUE - MIN_VALUE + 1] = { [' ' - MIN_VALUE] = 1, ['\n' - MIN_VALUE] = 2, ['0' - MIN_VALUE] = 3, }; if (MIN_VALUE <= x && x <= MAX_VALUE) { goto *cases[switch_table[x - MIN_VALUE]]; } goto case_default; The switch table with MAX_VALUE - MIN_VALUE + 1 entries now only stores 1 byte per entry instead of 4 bytes per entry. A further optimization when there are a lot of entries in cases[] is to store them as 16-bit offsets rather than pointers/absolute addresses (a 2x savings with 32-bit pointers and a 4x savings with 64-bit pointers): int16_t cases[] = { &&case_default - &&case_default, &&case_space - &&case_default, &&space_newline - &&case_default, &&case_zero - &&case_default, }; uint8_t switch_table[MAX_VALUE - MIN_VALUE + 1] = { [' ' - MIN_VALUE] = 1, ['\n' - MIN_VALUE] = 2, ['0' - MIN_VALUE] = 3, }; if (MIN_VALUE <= x && x <= MAX_VALUE) { goto *(cases[switch_table[x - MIN_VALUE]] + &&case_default); } goto case_default; Anyway, assuming we're dealing with the first case, without the extra table direction, the translation to assembly code is simple and direct except for the designated table initialization which you can ignore if you want (it's just doing the same thing as the C code with some basic assembler tricks and a helper macro). MIN_VALUE = '\n' MAX_VALUE = '0' .macro switch_case value, label temp = $ $ = switch_table + ($value - MIN_VALUE) * 4 .uint32 $label $ = temp .endmacro switch_table: .times MAX_VALUE - MIN_VALUE + 1, .uint32 case_default $switch_case ' ', case_space $switch_case '\n', case_newline $switch_case '0', case_zero // ... li t1, MIN_VALUE blt x, t1, case_default li t2, MAX_VALUE bgt x, t2, case_default sub t1, x, t1 slli t1, 2 la t2, switch_table add t1, t2 lw t1, [t1] jmp [t1] case_space: // A jmp switch_break case_newline: // B jmp switch_break case_zero: // C jmp switch_break case_default: // D switch_break: // E Here's an additional optimization for the range check: li t1, MIN_VALUE blt x, t1, case_default li t2, MAX_VALUE bgt x, t2, case_default sub t1, x, t1 => sub x, MIN_VALUE li t1, MAX_VALUE - MIN_VALUE bgtu x, t1, case_default This works because the subtraction by MIN_VALUE makes a zero-based [0, MAX_VALUE - MIN_VALUE] range, and we check membership in that range with a single unsigned comparison. As a bonus, we avoid the subtraction afterward for the table index, but it would still be worth doing range checks this way without that additional benefit. Another small optimization and a note on the dangers of using pseudoinstructions: la t2, switch_table add t1, t2 lw t1, [t1] jmp [t1] The la (load address) pseudoinstruction expands to two instructions: lui t2, hi(switch_table) addi t2, lo(switch_table) By hiding the underlying instructions, it can be easy to miss obvious optimization opportunities. In this case, the addi t2, lo(switch_table) can be folded into the lw address calculation: la t2, switch_table add t1, t2 lw t1, [t1] jmp [t1] => (expand la pseudoinstruction) lui t2, hi(switch_table) addi t2, lo(switch_table) add t1, t2 lw t1, [t1] jmp [t1] => (fold lo(switch_table) offset into lw address calculation) lui t2, hi(switch_table) add t1, t2 lw t1, [t1, lo(switch_table)] jmp [t1] --- Functions: Even if your own function doesn't have so much register pressure that it needs to spill local variables to stack or other memory, as soon as you're interacting with other functions, you typically need to save and restore certain registers using the stack. Two types of registers as far as calling conventions go: caller-save registers must be preserved by the caller around function calls if the caller wants to use the value in the register after the called function returns. So, for example, if you're writing a recursive function like factorial: int fact(int n) { if (n == 0) { return 1; } else { return n * fact(n-1); } } If you translate this directly to assembly, in the else-branch you're required to call fact recursively with n-1, and then multiply the result, when it returns, by n, so you have to preserve the value of n until after the recursive call returns. If n is stored in a caller-save register, then you have to save n on the stack before you do the recursive call, and then after returning you restore n's value from its stack slot. On an architecture that has push/pop instructions, this usually looks like this: push n call f pop n However, on RISC architectures there usually aren't dedicated push/pop instructions or indeed an explicit notion of a stack at all, and "the stack" is purely an ABI/calling convention. For example, with RISC-V's standard ABI the stack pointer sp is stored in register x2, and the stack grows downwards, so you might naively implement this kind of push/pop sequence as follows: sw [sp, 0], n call fact lw n, [sp, 0] This assumes that the function has already adjusted sp in its entry code to have sufficient space. Since the stack grows downward, this is done with a subtract: fact: sub sp, 8 // ... Before the function returns, the sp adjustment has to be undone: // ... add sp, 8 ret That is for caller-save registers. Callee-save registers are 'the opposite'. If n lived in a callee-save register, then I could just recursively call fact directly, and I'm guaranteed as the caller that n will have its same value upon return. This means that if the callee (the called function) wants to use a callee-save register for its own purposes, it must save and restore it. This is usually done in the epilogue/prologue of the function, near the sp adjustment. So, if n were in a callee-save register, we might have code that looks like this: fact: sub sp, 8 sw [sp, 0], n // Now the code can overwrite n, since we've saved the original copy on the stack. lw n, [sp, 0] add sp, 8 ret What about calls and returns themselves? On older CISC architectures, there were dedicated call/ret instructions that pushed and popped the return address from the stack. As with most other things related to stacks, this changed with RISC architectures. Instead of a call instruction that pushes the return address on the stack directly as a side effect, RISC-V (and MIPS and most other RISCs) has a jump-and-link (jal) instruction which puts the return address in a designated register: jal x1, f // put the return address in register x1 and jump to f And for returns, there isn't any special support at all. You just do an indirect jump based on the value in the link register, which contains the return address. jmp [x1] // jump to the return address stored in x1 In RISC-V there are pseudoinstructions call and ret corresponding to the above two cases. For non-leaf functions you do need to save and restore return addresses on the stack, but with RISC you have do the load/store instructions explicitly: fact: sub sp, 8 sw [sp, 4], x1 // Assume x1 is the link register containing the return address // ... lw x1, [sp, 4] // Restore the return address from the stack add sp, 8 jmp [x1] // Jump to the return address That pretty much covers the basics. Let's try writing out the code for all of fact: fact: // a0 contains the argument/return value, ra contains the return address sub sp, 8 sw [sp, 4], ra beq a0, 0, >1 sw [sp, 0], a0 sub a0, 1 call fact // jal ra, fact lw t0, [sp, 0] mul a0, t0 jmp >2 1: li a0, 1 2: lw ra, [sp, 4] add sp, 8 ret // jmp [ra] This is about as simple as a recursive function gets (note that you shouldn't actually implement factorial as a recursive function, so this is only a toy example). You can see the two values we need to save on the stack are the return address and the variable n whose lifetime survives the recursive call. Let me show the version using a callee-save register instead. The RISC-V API specifies that the s registers are callee-save, so we'll use s0 to store the value of n across the call. fact: // a0 contains the argument/return value, ra contains the return address sub sp, 8 sw [sp, 4], ra sw [sp, 0], s0 beq a0, 0, >1 mov s0, a0 sub a0, 1 call fact mul a0, s0 jmp >2 1: li a0, 1 2: lw s0, [sp, 0] lw ra, [sp, 4] add sp, 8 ret For a recursive function like fact, the distinction between caller-save and callee-save registers doesn't really help writing more efficient code. These conventions (passing the return address in a register, designating some registers as caller save and others as callee save) show their strengths in leaf-level functions: int square(int x) { return x * x; } => (direct translation to assembly) square: mul a0, a0 ret Because the return address is passed in a register when calling and read from a register when returning, and the leaf function (by definition) isn't calling other functions, we never have to save/restore the return address on the stack. For a leaf function that needs more registers to do its job, it should prioritize using caller-save registers since the leaf function doesn't have to save them since it never calls anything. --- A quick note on flattening expressions and register allocation: A generic recipe for translating "A op B" is to generate the code for evaluating A and putting the result in t1, then generate the code for evaluating B and putting the result in t2, and then generate t1 = t1 op t2 to combine the results. Here t1 and t2 stand for free temporary registers, not fixed registers. This recipe is applied recursively to the subexpressions until they bottom out in literals or variables or whatever. For example, to evaluate a + b + c + d = ((a + b) + c) + d, we would do t1 = a; t2 = b; t1 += t2; t2 = c; t1 += t2; t2 = d; t1 += t2; => (direct translation to assembly) lw t1, a lw t2, b add t1, t2 lw t2, c add t1, t2 lw t2, d add t1, t2 However, suppose we used the same left-to-right translation on the corresponding right-associative grouping of the same expression, a + (b + (c + d)): lw t1, a lw t2, b lw t3, c lw t4, d add t3, t4 add t2, t3 add t1, t2 You can see that with this naive translation approach, the right-associative expression uses as many temporary registers as there are operands, which is a disaster. Look up the Sethi-Ullman algorithm if you want to see how to do this optimally in general. However, even a simple-minded left-to-right translation approach is fine as long as you're dealing with left-associative expression trees.