preknowledge

1 byte has 8 bits

  • a char is 1 byte
  • a short is 2 bytes
  • an int is 4 bytes
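
These sizes can be checked on your own toolchain. The sketch below is only an illustration: the helper name sizes_match_notes is made up here, and the C standard guarantees minimum widths only, so the result describes typical targets such as AArch64 Linux rather than every compiler.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: returns true when this compiler's type sizes match
   the sizes listed in the notes (an assumption, not a C guarantee). */
bool sizes_match_notes(void)
{
    return CHAR_BIT == 8          /* 1 byte has 8 bits */
        && sizeof(char) == 1
        && sizeof(short) == 2
        && sizeof(int) == 4;
}
```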

gcc -E hello.c -o hello.i   # preprocess
gcc -S hello.i -o hello.s   # compile to assembly
gcc -c hello.s -o hello.o   # assemble to an object file
gcc hello.o -o hello        # link
  • All AARCH64 instructions are 4 bytes in width.

  • All AARCH64 pointers are 8 bytes in width†.

    † While this is technically true, Linux systems typically use only the lower 39, 42 or 48 bits of an address, i.e. the virtual address space of an ARM Linux process is smaller than 64 bits. For user-space addresses the upper bits are zero when the address is viewed as an 8-byte value.

register

register access speed

Latency

To put latency in perspective: if we liken accessing a register (which can be done at least once per CPU clock cycle) to one second, accessing RAM would be like a 3.5 to 5.5 minute wait.

register type

  • rn means register “of some type” number n.

The kind of register is specified by a letter. Which register within a given type is specified by a number. There are some exceptions to this. Here is an introductory summary:

Letter  Type
x       64 bit integer or pointer
w       32 bit or smaller integer
d       64 bit floats (doubles)
s       32 bit floats

Some register types have been left out.

(Chapter 9.1)(Cortex-A Series Programmer’s Guide for ARMv8-A)


  • x29 is the frame pointer (FP)
  • x30 is the link register (LR, i.e. the return address)

The registers used for floating point types (and vector operations) are coincident:


  • q registers are a massive 16 bytes wide - quad words. (They are aliases of the vn registers, used mainly in SIMD/Neon instructions.)
  • v registers are also 16 bytes wide and are synonyms for the q registers.
  • d registers for doubles which are 8 bytes wide - double precision. 2 per v.
  • s registers for floats which are 4 bytes wide - single precision. 4 per v.
  • h registers for half precisions floats which are 2 bytes wide. 8 per v.
  • b registers for byte operations. 16 per v.

register and C type

Integers

This declares an integer    This IS an integer
char                        wn
short                       wn
int                         wn
long                        xn

Pointers

This declares a pointer    This IS a pointer
type *                     xn

All pointers are stored in x registers. X registers are 64 bits wide, but many operating systems do not support a full 64 bit address space because keeping track of an address space that big would itself use a lot of space. Instead, operating systems typically use virtual address spaces of at most 48 to 52 bits.

Floating Point

This declares a float    This IS a float
float                    sn
double                   dn
__fp16 (half)            hn


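vn is the actual physical register name. It is the recommended form and supports the widest range of access types (floating point + SIMD).

qn is an alias of vn, used mainly with SIMD/Neon instructions (Single Instruction - Multiple Data).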

instructions

preknowledge

EVERY AARCH64 instruction is 4 bytes wide. Everything the CPU needs to know about what the instruction is and what variation it might be plus what data it will use will be found in those 4 bytes.

  • Most (but not all) AARCH64 instructions have three operands. These are read in the following way:
op     ra, rb, rc

means:

ra = rb op rc

examples:

sub    x0, x0, x1 // means x0 = x0 - x1
mov    x0, x1     // means x0 = x1
  • [ ]

the [ and ] serve the same purpose as the asterisk in C and C++, indicating “dereference.” They mean: use what’s inside the brackets as an address for going out to memory.

When a ! follows the closing ], for example:

stp     x21, x30, [sp, -16]!

stp     x29, x30, [sp, -16]!

the exclamation point means that the stack pointer should be changed (i.e. the -16 applied to it) before the value of the stack pointer is used as the address in memory to which the registers will be copied. This is a predecrement (pre-index addressing).

For the second instruction it means:

  1. sp = sp - 16 (the stack pointer moves down by 16 bytes)
  2. x29 is stored at [sp] and x30 is stored at [sp + 8]

The matching restore is:

ldp     x29, x30, [sp], 16

it means:

  1. 8 bytes are read from [sp] into x29 and 8 bytes are read from [sp + 8] into x30
  2. sp = sp + 16 (the stack frame space is released)
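
The push and pop steps above can be modelled in C. This is a toy sketch, not how the hardware is implemented: the array stack_mem, the slot index sp and the helper names are all assumptions made for illustration, with one array slot standing for 8 bytes.

```c
#include <stdint.h>

/* Toy model of a descending stack; `sp` is an index into `stack_mem`
   counted in 8-byte slots. All names here are illustrative assumptions. */
enum { STACK_SLOTS = 64 };
static uint64_t stack_mem[STACK_SLOTS];
static unsigned sp = STACK_SLOTS;   /* empty stack: sp starts past the top */

/* Models: stp a, b, [sp, -16]!  (pre-index: adjust sp first, then store) */
void push_pair(uint64_t a, uint64_t b)
{
    sp -= 2;                 /* sp = sp - 16 bytes (two 8-byte slots) */
    stack_mem[sp] = a;       /* first register goes to [sp]           */
    stack_mem[sp + 1] = b;   /* second register goes to [sp + 8]      */
}

/* Models: ldp a, b, [sp], 16  (post-index: load first, then adjust sp) */
void pop_pair(uint64_t *a, uint64_t *b)
{
    *a = stack_mem[sp];      /* read from [sp]          */
    *b = stack_mem[sp + 1];  /* read from [sp + 8]      */
    sp += 2;                 /* sp = sp + 16 bytes      */
}
```

Calling push_pair and then pop_pair returns the same two values and leaves sp where it started, which is the property a function prologue and epilogue rely on.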

The stack pointer in ARM V8 must be aligned to 16 bytes whenever it is used to access memory, so in practice it is only manipulated in multiples of 16.


x29 is the frame pointer register, but saving it is not mandatory.

memory access

ldr

load register

ldr    x0, [sp]   // load 8 bytes from address specified by sp
ldr w0, [sp] // load 4 bytes from address specified by sp
ldrh w0, [sp] // load 2 bytes from address specified by sp
ldrb w0, [sp] // load 1 byte from address specified by sp

When misaligned accesses to RAM are made, the processor may have to split the access into several smaller ones, which can be a big performance hit. Properly aligned access is important to performance.

str

store register

str    x0, [sp]   // store 8 bytes to address specified by sp
str w0, [sp] // store 4 bytes to address specified by sp
strh w0, [sp] // store 2 bytes to address specified by sp
strb w0, [sp] // store 1 byte to address specified by sp

Casting between integer types is in some cases accomplished by anding with 255 and 65535 (for char and short), or by relying on the following behavior:

Whenever a narrower portion of a register is written to, the remainder of the register is zeroed out. That is: ldrb overwrites the least significant byte of an x register and zeros out the upper 7 bytes.
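
A small C sketch of both ideas, under the assumption that a 64-bit unsigned integer stands in for an x register; the function names are invented for illustration.

```c
#include <stdint.h>

/* Models ldrb w0, [x1]: read one byte and zero-extend it to 64 bits,
   so the upper 7 bytes of the result are zero. */
uint64_t load_byte_zero_extended(const uint8_t *addr)
{
    return (uint64_t)*addr;
}

/* Narrowing casts expressed as the masks mentioned above. */
uint64_t truncate_to_char(uint64_t v)  { return v & 255; }
uint64_t truncate_to_short(uint64_t v) { return v & 65535; }
```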

ldp

load pair: the same as ldr, but loads a pair of values

stp

store pair: the same as str, but stores a pair of values

offsets

1) LDR Xt, [Xn|SP{, #pimm}] ; 64-bit general registers
2) LDR Xt, [Xn|SP], #simm ; 64-bit general registers, Post-index
3) LDR Xt, [Xn|SP, #simm]! ; 64-bit general registers, Pre-index
  • simm can be in the range of -256 to 255 (a 9 bit signed value).
  • pimm can be in the range of 0 to 32760 in multiples of 8.

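three patterns

  1. Plain offset mode

LDR Xt, [Xn, #pimm]

Loads data into Xt from the address Xn + pimm. The address register Xn is unchanged.

pimm is a positive immediate; it must be a multiple of 8 and can be at most 32760.

  2. Post-index mode

LDR Xt, [Xn], #simm

First loads data into Xt using the original value of Xn as the address, then adds simm to Xn. The address register Xn changes.

  3. Pre-index mode

LDR Xt, [Xn, #simm]!

First computes Xn + simm, loads data into Xt from that address, and writes the updated address back to Xn. The address register Xn changes.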
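
The three patterns map closely onto C pointer idioms. The following is a rough sketch with made-up function names; a pointer to uint64_t plays the role of Xn, and the offsets are fixed at 16 and 8 bytes purely for the example.

```c
#include <stdint.h>

/* Offset mode, models: ldr xt, [xn, #16]. The base pointer is unchanged. */
uint64_t load_offset(const uint64_t *base)
{
    return base[2];              /* byte offset 16 == 2 elements */
}

/* Post-index, models: ldr xt, [xn], #8. Load from *base, then advance. */
uint64_t load_post_index(const uint64_t **base)
{
    uint64_t value = **base;
    *base += 1;                  /* xn = xn + 8 after the load */
    return value;
}

/* Pre-index, models: ldr xt, [xn, #8]!. Advance first, then load. */
uint64_t load_pre_index(const uint64_t **base)
{
    *base += 1;                  /* xn = xn + 8 before the load */
    return **base;
}
```

Post-index corresponds to the familiar *(p++) and pre-index to *(++p).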

pseudo instruction

ldr     x1, =label
  • the assembler puts the address of the label into a special region of memory called a “literal pool.” What matters is this region of memory is placed immediately after (therefore nearby) your code.

  • Then, the assembler computes the difference between the address of the current instruction (the ldr itself) and the address of the data in the literal pool made from the labeled data.

  • The assembler generates a different ldr instruction which uses the difference (or offset) of the data relative to the program counter (pc). The pc is none other than the address of the current instruction.

  • Because the literal pool for your code is located near your code, the offset from the current instruction to the data in the pool is a relatively small number. Small enough to fit inside a four byte ldr instruction.

ldr    x1, [pc, offset to data in literal pool]

A downside of this approach is that the literal pool, from which the address is loaded, resides in RAM. This means each of these ldr pseudo instructions incurs a memory reference.

literal pool

compare

ldr x1, =q
ldr x1, q

aarch64

        .global     main       // expose main to linker                                        
.text // begin to write code
.align 2 // the code should certainly begin on an even address

main: str x30, [sp, -16]!

ldr x0, =fmt
ldr x1, =q
ldr x2, [x1]
bl printf

ldr x0, =fmt
ldr x1, q
ldr x2, [x1]
bl printf

ldr x30, [sp], 16
mov w0, wzr
ret

.data
q: .quad 0x1122334455667788
fmt: .asciz "address: %p value: %lx\n"

.end

disassembling the binary machine code:

0000000000007a0 <main>:
7a0: f81f0ffe str x30, [sp, #-16]!
7a4: 58000160 ldr x0, 7d0 <main+0x30>
7a8: 58000181 ldr x1, 7d8 <main+0x38>
7ac: f9400022 ldr x2, [x1]
7b0: 97ffffb4 bl 680 <printf@plt>
7b4: 580000e0 ldr x0, 7d0 <main+0x30>
7b8: 580842c1 ldr x1, 11010 <q>
7bc: f9400022 ldr x2, [x1]
7c0: 97ffffb0 bl 680 <printf@plt>
7c4: f84107fe ldr x30, [sp], #16
7c8: 2a1f03e0 mov w0, wzr
7cc: d65f03c0 ret

and

000000000011010 <q>:
11010: 55667788
11014: 11223344
  • It says 000000000011010 <q>:. This means that what comes next is the data corresponding to what is labeled q in our source code. Notice the relocatable address of 11010. We will explain “relocatable address” below.

  • Now, look at the disassembled code on the line beginning with 7b8. It reads ldr x1, 11010. So the disassembled executable is saying “go to address 11010 and fetch its contents” which are our 1122334455667788.

Instruction      Meaning
ldr r, =label    Load the address of the label into r
ldr r, label     Load the value found at the label into r
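
In C terms the difference is the one between &q and q. A minimal sketch, assuming a global variable named q that stands in for the assembly label; the two accessor functions are invented for illustration.

```c
#include <stdint.h>

static uint64_t q = 0x1122334455667788ULL;   /* plays the role of label q */

/* Like `ldr x1, =q`: the result is the address of q. */
const uint64_t *address_of_q(void) { return &q; }

/* Like `ldr x1, q`: the result is the 8 bytes stored at q. */
uint64_t value_at_q(void) { return q; }
```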

relocation of address when executing

None of the addresses we have seen so far are the final addresses that will be used once the program is actually running. All addresses will be relocated.

One reason for this is a guard against malware. A technique called Address Space Layout Randomization (ASLR) prevents malware writers from being able to know ahead where to modify your executable in order to accomplish their nefarious purposes.

64 bit ARM Linux kernels allocate 39, 42 or 48 bits for the size of a process’s virtual address space. Notice 42 and 48 bit values require 6 bytes to hold them. A virtual address space is all of the addresses a process can generate / use. Further, all addresses used by processes are virtual addresses.
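
The byte counts quoted above follow from simple arithmetic, sketched here in C. The helper names are made up for this example.

```c
#include <stdint.h>

/* Number of whole bytes needed to hold an address that is `bits` wide. */
unsigned bytes_for_bits(unsigned bits)
{
    return (bits + 7) / 8;
}

/* Size in bytes of a virtual address space `bits` wide (requires bits < 64). */
uint64_t address_space_size(unsigned bits)
{
    return (uint64_t)1 << bits;
}
```

With these, 39 bits needs 5 bytes while 42 and 48 bits both need 6, and a 39 bit address space spans 512 GiB.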

The following pair of instructions computes the address of s without using a literal pool:

adrp    x0, s
add x0, x0, :lo12:s

examples

loading (storing) various sizes of integers

Instruction      Meaning
ldr x0, [x1]     Fetches a 64 bit value from the address specified by x1 and places it in x0
ldr w0, [x1]     Fetches a 32 bit value from the address specified by x1 and places it in w0
ldrh w0, [x1]    Fetches a 16 bit value from the address specified by x1 and places it in w0
ldrb w0, [x1]    Fetches an 8 bit value from the address specified by x1 and places it in w0
  • Pointers and longs use x registers.
  • All other integer sizes use w registers where the instruction itself specifies the size.

array indexing

long Sum(long * values, long length)   
{
long sum = 0;
for (long i = 0; i < length; i++)
{
sum += values[i];
}
return sum;
}

Notice we’re using the index variable i for nothing more than traipsing through the array. This is fantastically inefficient (in this case).

long Sum(long * values, long length)         
{
long sum = 0;
long * end = values + length;
while (values < end)
{
sum += *(values++);
}
return sum;
}

Notice we don’t use an index variable any longer. Instead, we use the pointer itself for both the dereferencing and to tell us when to stop the loop.

    .global Sum                                           
.text
.align 4

// x0 is the pointer to data
// x1 is the length and is reused as `end`
// x2 is the sum
// x3 is the current dereferenced value

Sum:
mov x2, xzr // x2 = 0
add x1, x0, x1, lsl 3 // x1 = x0+x1*8
b 2f

1: ldr x3, [x0], 8
add x2, x2, x3
2: cmp x0, x1
blt 1b

mov x0, x2
ret

.end

faster memory copy

Suppose you needed to copy 16 bytes of memory from one place to another. You might do it like this:

void SillyCopy16(uint8_t * dest, uint8_t * src)
{
for (int i = 0; i < 16; i++)
*(dest++) = *(src++);
}

This is especially silly: why go through 16 loop iterations when you could simply copy two 8-byte values?

void SillyCopy16(uint64_t * dest, uint64_t * src)
{
    *(dest++) = *(src++);
    *dest = *src;
}

in aarch64

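// x0 = dest, x1 = src
SillyCopy16:
        ldr     x2, [x1], 8     // load the first 8 bytes from src, then src += 8
        str     x2, [x0], 8     // store them to dest, then dest += 8
        ldr     x2, [x1]        // load the second 8 bytes from src
        str     x2, [x0]        // store them to dest
        ret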

using ldp

SillyCopy16:
        ldp     x2, x3, [x1]    // load 16 bytes from src (x1)
        stp     x2, x3, [x0]    // store 16 bytes to dest (x0)
        ret

using q register

SillyCopy16:
        ldr     q2, [x1]        // load 16 bytes from src (x1)
        str     q2, [x0]        // store 16 bytes to dest (x0)
        ret

indexing through an array of struct

#include <stdio.h>                                       

struct Person
{
char * fname;
char * lname;
int age;
};

extern int rand();
extern struct Person * FindOldestPerson(struct Person *, int);

struct Person * OriginalFindOldestPerson(struct Person * people, int length)
{
int oldest_age = 0;
struct Person * oldest_ptr = NULL;

if (people)
{
struct Person * end_ptr = people + length;
while (people < end_ptr)
{
if (people->age > oldest_age)
{
oldest_age = people->age;
oldest_ptr = people;
}
people++;
}
}
return oldest_ptr;
}

#define LENGTH 20

int main()
{
struct Person array[LENGTH];
for (int i = 0; i < LENGTH; i++)
{
array[i].age = rand() % 5000;
}
struct Person * oldest = FindOldestPerson(array, LENGTH);
for (int i = 0; i < LENGTH; i++)
{
printf("%d", array[i].age);
if (oldest == &array[i])
printf("*");
printf("\n");
}
}

The extern declaration of FindOldestPerson tells us that somewhere else there is a function with that name. That function must have a .global directive specifying the same name so that the linker can resolve the reference to FindOldestPerson.

gcc with -O2 or -O3 optimization rendered OriginalFindOldestPerson() into 18 lines of assembly language.

        .global FindOldestPerson                                        // 1 
.text // 2
.align 2 // 3
// 4
// x0 has struct Person * people // 5
// will be used for oldest_ptr as this is the return value // 6
// w1 has int length // 7
// w2 used for oldest_age // 8
// x3 used for Person * // 9
// x4 used for end_ptr // 10
// w5 used for scratch // 11
// 12
FindOldestPerson: // 13
cbz x0, 99f // short circuit // 14
mov w2, wzr // initial oldest age is 0 // 15
mov x3, x0 // initialize loop pointer // 16
mov x0, xzr // initialize return value // 17
mov w5, 24 // struct is 24 bytes wide // 18
smaddl x4, w1, w5, x3 // initialize end_ptr // 19
b 10f // enter loop // 20
// 21
1: ldr w5, [x3, p.age] // fetch loop ptr -> age // 22
cmp w2, w5 // compare to oldest_age // 23
csel w2, w2, w5, ge // keep oldest_age unless age is greater // 24
csel x0, x0, x3, ge // keep oldest_ptr unless age is greater // 25
add x3, x3, 24 // increment loop ptr // 26
10: cmp x3, x4 // has loop ptr reached end_ptr? // 27
blt 1b // no, not yet // 28
// 29
99: ret // 30
// 31
.data // 32
.struct 0 // 33
p.fn: .skip 8 // 34
p.ln: .skip 8 // 35
p.age: .skip 4 // 36
p.pad: .skip 4 // 37
// 38
.end // 39

control flow

cmp

compare

cmp subtracts its second operand from its first. It discards the result of the subtraction but keeps a record of whether the result was less than, equal to or greater than zero by setting the condition flags.

br

Branch to Register

br <register>

An unconditional jump to the address held in the register, similar to

goto *(ptr)

ble

Branch if less than or equal

bl

Branch with Link

Jumps to the address of a function (subroutine) and saves the return address in register x30 (also called lr, the Link Register).

cbz

Compare and Branch if Zero

cbz <register>, <label>

If the value in <register> is 0, jump to <label>.

Otherwise continue with the next instruction.

csel

Conditional Select

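csel <dest>, <src1>, <src2>, <condition>

If <condition> holds, the value of <src1> is assigned to <dest>.

Otherwise the value of <src2> is assigned to <dest>.

examples:

cmp w2, w5
csel w2, w2, w5, gt // if w2 > w5, w2 keeps its value; otherwise it becomes w5

This is a branchless conditional assignment, which is often more efficient than an if-else implemented with branches.

this is equal to

w2 = (w2 > w5) ? w2 : w5;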

calculate

shift operations

lsl

Logical Shift Left

The LSL instruction performs multiplication by a power of 2.

lsr

Logical Shift Right

The LSR instruction performs unsigned division by a power of 2.

asr

Arithmetic Shift Right

The ASR instruction performs signed division by a power of 2, preserving the sign bit. Negative results round toward negative infinity.

ror

rotate right

The ROR instruction performs a bitwise rotation, wrapping the bits rotated out of the LSB back into the MSB.
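
The four operations can be written out in C to make the bit movement explicit. This is a sketch for 64 bit values with invented function names; it assumes a shift amount below 64 for lsl64, lsr64 and asr64.

```c
#include <stdint.h>

uint64_t lsl64(uint64_t x, unsigned n) { return x << n; }   /* x * 2^n (mod 2^64) */
uint64_t lsr64(uint64_t x, unsigned n) { return x >> n; }   /* unsigned x / 2^n   */

/* Arithmetic shift right: copies of the sign bit fill the vacated bits.
   Written without right-shifting a negative value, which C leaves
   implementation-defined. */
int64_t asr64(int64_t x, unsigned n)
{
    return x < 0 ? ~(~x >> n) : x >> n;
}

/* Rotate right: bits shifted out of the LSB re-enter at the MSB. */
uint64_t ror64(uint64_t x, unsigned n)
{
    n &= 63;
    return n == 0 ? x : (x >> n) | (x << (64 - n));
}
```

Note that asr64(-7, 1) gives -4, whereas the C expression -7 / 2 gives -3, because C division truncates toward zero.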

bit manipulation

mvn

mvn (Move Not) writes the bitwise NOT of its operand to the destination register.

orr

orr (bitwise inclusive OR) performs a bitwise OR of its two operands and writes the result to the destination register.

bfi

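bfi (Bit Field Insert).

bfi <Xd>, <Xn>, #<lsb>, #<width>

<Xd>: the destination register (the result is written here)

<Xn>: the source register (the low bits are taken from here)

<lsb>: the bit position in the destination register where the inserted field starts (its least significant bit)

<width>: how many bits to insert

Suppose:

Xd = 0b1111 0000
Xn = 0b1011 (only the low 4 bits are shown)
lsb = 1
width = 3

Executing:

bfi Xd, Xn, #1, #3

Result:
The low 3 bits of Xn (011) are inserted into bits 1 to 3 of Xd, replacing the original bits there.
The result is Xd = 0b1111 0110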
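
The same insertion can be expressed with masks in C. A minimal sketch, assuming width is at least 1 and lsb + width does not exceed 64; the function name bfi64 is made up for this example.

```c
#include <stdint.h>

/* Models bfi xd, xn, #lsb, #width: copy the low `width` bits of src into
   dst starting at bit `lsb`, leaving every other bit of dst unchanged. */
uint64_t bfi64(uint64_t dst, uint64_t src, unsigned lsb, unsigned width)
{
    uint64_t field = (width == 64) ? ~(uint64_t)0 : (((uint64_t)1 << width) - 1);
    uint64_t mask = field << lsb;                 /* bits that get replaced */
    return (dst & ~mask) | ((src << lsb) & mask);
}
```

Calling bfi64(0xF0, 0xB, 1, 3) reproduces the worked example and returns 0xF6.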

ubfm

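ubfm = Unsigned Bitfield Move

Basic form, as used in these notes (the architecture manual names the two immediates immr and imms):

ubfm <dst>, <src>, #lsb, #msb

<dst>: the destination register

<src>: the source register
lsb: the first bit of the field (low bit index)
msb: the last bit of the field (high bit index)
When msb >= lsb, this instruction extracts an unsigned bit field (a run of consecutive bits) from src and places it in the low bits of dst (starting at bit 0); all other bits are cleared.
In other words:

  1. Take bits lsb through msb of src
  2. Extract that bit field
  3. Right-align it in the low bits of dst (bit 0); all other bits are cleared

Example:

ubfm    w1, w2, #8, #15

  1. Extract bits 8 to 15 (8 bits in total) from w2
  2. Place them in bits 0 to 7 of w1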
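
The extraction case described above is a shift followed by a mask. A C sketch for 32 bit values, assuming lsb <= msb <= 31; the name extract_bits is invented here.

```c
#include <stdint.h>

/* Extract bits lsb..msb (inclusive) of src, right-aligned; other bits zero. */
uint32_t extract_bits(uint32_t src, unsigned lsb, unsigned msb)
{
    unsigned width = msb - lsb + 1;
    uint32_t mask = (width == 32) ? ~(uint32_t)0 : (((uint32_t)1 << width) - 1);
    return (src >> lsb) & mask;
}
```

For example, extract_bits(0x12345678, 8, 15) returns 0x56.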

ubfiz

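ubfiz (Unsigned Bitfield Insert in Zeros) places the low bits of an unsigned value at a chosen position in the destination register, with every other bit of the destination cleared.
It is a specialized form of ubfm (Unsigned Bitfield Move) with similar semantics.

Instruction format:

ubfiz  <dst>, <src>, #lsb, #width

Put simply: ubfiz copies the low width bits of src into dst starting at bit lsb, and clears all the other bits.

Where:

<src>: the source register (e.g. w1)

<dst>: the destination register (e.g. w2); the final result is placed here
lsb: the bit position in the destination where the field starts (counted from 0)
width: the number of bits to insert (counted from the lowest bit of <src>)

All other bits of the destination register are cleared.

Example:

ubfiz   w1, w1, #3, #5

This means:

  1. Take the lowest 5 bits of w1 (bit 0 to bit 4)
  2. Place them in bits 3 to 7 of the destination register (w1)
  3. Clear all the other bits of w1 (bits 0 to 2 and bits 8 to 31)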
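
In C this is a mask followed by a left shift. A sketch for 32 bit values, assuming width is at least 1 and lsb + width does not exceed 32; the name ubfiz32 is made up for illustration.

```c
#include <stdint.h>

/* Models ubfiz wd, wn, #lsb, #width: keep the low `width` bits of src,
   move them up to bit `lsb`, and leave every other result bit zero. */
uint32_t ubfiz32(uint32_t src, unsigned lsb, unsigned width)
{
    uint32_t mask = (width == 32) ? ~(uint32_t)0 : (((uint32_t)1 << width) - 1);
    return (src & mask) << lsb;
}
```

With all bits set in the source, ubfiz32(0xFFFFFFFF, 3, 5) returns 0xF8: only bits 3 to 7 survive.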

other

adr

Address

adrp

Address of page

    .section .rodata
fmt:
.asciz "%p a: 0x%lx b: %x c: %x\n"

.text

adrp x0, fmt
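add x0, x0, :lo12:fmt // :lo12:fmt supplies the low 12 bits of fmt's address (the offset within the page) as the immediate
  • Purpose: adrp loads into x0 the address of the 4KB-aligned page that contains the symbol fmt.
  • adrp = Address of Page
  • It ignores the low 12 bits of the symbol's address and keeps only the upper bits.
  • Example: if fmt is at address 0x400123, adrp x0, fmt places 0x400000 in x0.
  • In other words, adrp x0, fmt rounds the address of fmt down to the nearest 4KB boundary (it clears the low 12 bits).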
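
The split between the page address and the low 12 bits is plain bit masking, sketched below in C. The function names are invented; the real adrp computes the page address relative to the pc, which this sketch does not model.

```c
#include <stdint.h>

/* The value adrp produces for a symbol at `addr`: low 12 bits cleared. */
uint64_t page_base(uint64_t addr) { return addr & ~(uint64_t)0xFFF; }

/* The value :lo12: contributes: the offset of `addr` within its 4KB page. */
uint64_t low12(uint64_t addr) { return addr & 0xFFF; }
```

Adding the two parts gives back the full address, which is what the adrp + add pair does.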

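Why not simply use ldr x0, =fmt?

  • On ARM64, ldr x0, =fmt may implicitly introduce a literal pool, which is a poor fit for relocatable code, especially with dynamic linking or in a PIE (Position Independent Executable).
  • adrp + add is the recommended way to write relocatable, PIC-compliant code.
  • This form also tends to work better with the Linux dynamic linker (ld.so).

Instruction  Meaning                                                Offset range                 Typical use
adr          Address of a location near the current instruction    ±1MB                         Local jump targets, temporary variables
adrp         The 4KB page-aligned upper part of an address          ±4GB (page-aligned offset)   Addresses of global variables, strings, constant tables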

smaddl

Signed Multiply Add Long

Multiplies two signed 32-bit integers, adds a 64-bit integer, and stores the result in a 64-bit register.

smaddl <Xd>, <Wn>, <Wm>, <Xa>

It performs the following operation:

Xd = (int64_t)(int32_t)Wn * (int64_t)(int32_t)Wm + Xa;

programming

if statement

if

if (a > b)                                                              
{
// CODE BLOCK
}

in aarch64

    // Assume value of a is in x0                                       
// Assume value of b is in x1
cmp x0, x1
ble 1f
// CODE BLOCK
1:

If a > b then x0 - x1 will be greater than zero.

If a == b then x0 - x1 will be equal to zero.

If a < b then x0 - x1 will be less than zero.

ble means branch (a jump or goto) if the previous computation shows less than or equal to zero

a rule of thumb

  • In the higher level language, you want to enter the following code block if the condition is true.

  • In assembly language, you want to avoid the following code block if the condition is false.
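
The rule of thumb can be made visible in C by writing the if with a goto, which mirrors the cmp / ble pair. This is only a sketch; the function block_runs and its return value are invented so that the behavior can be observed.

```c
/* Returns 1 if the "code block" ran, 0 if it was skipped. */
int block_runs(long a, long b)
{
    int ran = 0;
    if (a <= b)        /* cmp x0, x1 then ble 1f: the inverse of a > b */
        goto skip;
    ran = 1;           /* CODE BLOCK */
skip:                  /* plays the role of label 1: */
    return ran;
}
```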

temporary label

The target of the branch instruction is given as 1f. This is an example of a temporary label.

There are a lot of braces used in C and C++. Since labels frequently function as equivalents to { and }, there can be a lot of labels used in assembly language. But a label only marks a position; it does not create a scope.

A temporary label is a label made using just a number. Such labels can appear over and over again (i.e. they can be reused). They are made unique by virtue of their placement relative to where they are being used.

  • 1f looks forward in the code for the next label 1.
  • 1b looks in the backward direction for the most recent label 1.

if / else

if (a > b)                                                          
{
// CODE BLOCK IF TRUE
}
else
{
// CODE BLOCK IF FALSE
}

There are two branches built into this code!

in aarch64:

    // Assume value of a is in x0                                       
// Assume value of b is in x1
cmp x0, x1
ble 1f
// CODE BLOCK IF TRUE
b 2f
1:
// CODE BLOCK IF FALSE
2:

a complete example

    .global main                                                       
.text

main:
stp x29, x30, [sp, -16]!
mov x1, 10
mov x0, 5
cmp x0, x1
ble 1f
ldr x0, =T // pseudo instruction
bl puts
b 2f

1: ldr x0, =F
bl puts

2: ldp x29, x30, [sp], 16
mov x0, xzr
ret

.data
F: .asciz "FALSE"
T: .asciz "TRUE"
.end

The ldr x0, =T instruction is one way of loading the address represented by a label. In this case, the label T corresponds to the address of the first letter of the C string “TRUE”. The ldr x0, =F instruction loads the address of the C string containing “FALSE”.

The occurrences of .asciz in the .data section are invocations of an assembler directive that creates a C string. Recall that C strings are NULL terminated. The NULL termination is indicated by the z which ends .asciz.

There is a similar directive .ascii that does not NULL terminate the string.

loop

while loop

while loop

while (a >= b) {
// CODE BLOCK
}

aarch64:

    // Assume value of a is in x0                                       
// Assume value of b is in x1

1: cmp x0, x1
blt 2f
// CODE BLOCK
b 1b

2:

for loop

for (set up; decision; post step)                                   
{
// CODE BLOCK
}

for

for (long i = 0; i < 10; i++)                                     
{
// CODE BLOCK
}

aarch64 (the test is at the top of the loop)

    // Assume i is implemented using x0                                                                                        
mov x0, xzr

1: cmp x0, 10
bge 2f

// CODE BLOCK

add x0, x0, 1
b 1b

2:

aarch64 (the test is at the bottom of the loop)

    // Assume i is implemented using x0                                 

mov x0, xzr
b 2f

1:

// CODE BLOCK

add x0, x0, 1
2: cmp x0, 10
blt 1b

continue

for (long i = 0; i < 10; i++) {
// CODE BLOCK "A"
if (i == 5)
continue;
// CODE BLOCK "B"
}

in aarch64

    // Assume i is implemented using x0                                 

mov x0, xzr

1: cmp x0, 10
bge 3f
// CODE BLOCK "A".
// if (i == 5)
// continue

cmp x0, 5
beq 2f
// CODE BLOCK "B"

2: add x0, x0, 1
b 1b

3:

another one

    // Assume i is implemented using x0                                 

mov x0, xzr
b 3f

1:

// CODE BLOCK "A"

// if (i == 5)
// continue

cmp x0, 5
beq 2f

// CODE BLOCK "B"

2: add x0, x0, 1
3: cmp x0, 10
blt 1b

break

The implementation of break is very similar to that of continue.

for (long i = 0; i < 10; i++) {
// CODE BLOCK "A"
if (i == 5)
break;
// CODE BLOCK "B"
}

aarch64:

    // Assume i is implemented using x0                                 

mov x0, xzr
b 3f

1:

// CODE BLOCK "A"

// if (i == 5)
// break;

cmp x0, 5
beq 4f

// CODE BLOCK "B"

2: add x0, x0, 1
3: cmp x0, 10
blt 1b

4:

structs

alignment

Data members exhibit natural alignment.

That is:

  • a long will be found at addresses which are a multiple of 8.
  • an int will be found at addresses which are a multiple of 4.
  • a short will be found at addresses which are even.
  • a char can be found anywhere.

example

struct {
long a;
short b;
int c;
};

Layout:

Offset  Width    Member
0       8 bytes  a
8       2 bytes  b
10      2 bytes  (gap)
12      4 bytes  c

struct Foo {
long a;
short b;
int c;
};

struct Foo Bar = { 0xaaaaaaaaaaaaaaaa, 0xbbbb, 0xcccccccc };

A hex dump will show:

aaaa aaaa aaaa aaaa bbbb 0000 cccc cccc

Notice the gap filled with zeros. Note, if this were a local variable, the gap might contain garbage instead.
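
The offsets in the layout can be confirmed with offsetof. A small sketch, assuming an LP64 target (8 byte long, 4 byte int) such as AArch64 Linux; the accessor function names are made up for this example.

```c
#include <stddef.h>

struct Foo {
    long a;
    short b;
    int c;
};

/* Offsets and total size as the compiler actually laid them out. */
size_t foo_offset_b(void) { return offsetof(struct Foo, b); }
size_t foo_offset_c(void) { return offsetof(struct Foo, c); }
size_t foo_size(void)     { return sizeof(struct Foo); }
```

On such a target these report 8, 12 and 16: the two bytes between b and c are the gap.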

change the order:

struct Foo {
short a;
char b;
int c;
};

struct Foo Bar = { 0xaaaa, 0xbb, 0xcccccccc };

A hex dump will show:

aaaa 00bb cccc cccc

Notice there is only one byte of gap before the int c starts.

why are the zeros to the left of the b’s?

This ARM processor is running as a little endian machine.

defining structs

struct Foo {
short a;
char b;
int c;
};

struct Foo Bar = { 0xaaaa, 0xbb, 0xcccccccc };

Here is one way of defining and accessing the struct:

Hard-coded field offsets

    .section .rodata
fmt:
.asciz "%p a: 0x%lx b: %x c: %x\n"

.data
bar:
.short 0xaaaa // a: short 2 byte
.byte 0xbb // b: char 1 byte
.byte 0x00 // padding
.word 0xcccccccc // c: int 4 byte

.text
.global main
.align 2
main:
stp x29, x30, [sp, -16]! // save the frame pointer and link register
mov x29, sp

adrp x0, fmt
add x0, x0, :lo12:fmt // address of the printf format string

adrp x1, bar
add x1, x1, :lo12:bar // address of bar

ldrh w2, [x1, 0] // short a
ldrb w3, [x1, 2] // char b
ldr w4, [x1, 4] // int c

bl printf // call printf(fmt, &bar, a, b, c)

// exit explicitly with a system call
mov x8, #93 // syscall number for exit
mov x0, xzr // exit code 0
svc 0 // make syscall

:lo12:fmt is replaced with the low 12 bits of fmt's address.

adrp x0, fmt rounds the address of fmt down to the nearest 4KB boundary (it clears the low 12 bits) and loads this “page base address” into x0.

For example:
if fmt = 0x12345678, then:

  • adrp x0, fmt yields 0x12345000 (low 12 bits cleared)

another way to define a struct is

Using the .equ directive to define symbolic constants

    .global main                // make main visible to the linker
.text
.p2align 2

.equ foo_a, 0 // like #define foo_a 0
.equ foo_b, 2 // like #define foo_b 2
.equ foo_c, 4 // like #define foo_c 4

main:
stp x29, x30, [sp, -16]! // save x29 and x30 on the stack
mov x29, sp // set up the new frame pointer

// The loads below leave every printf argument in the right register,
// so no further register moves are needed before the call.
ldr x0, =fmt // 1st argument: address of the fmt string
ldr x1, =bar // 2nd argument: address of bar (printed by %p)
ldrh w2, [x1, foo_a] // 3rd argument: bar.a
ldrb w3, [x1, foo_b] // 4th argument: bar.b
ldr w4, [x1, foo_c] // 5th argument: bar.c
bl printf // call printf

// restore the stack and registers
ldp x29, x30, [sp], #16 // restore x29 and x30
ret // return

.data
fmt:
.asciz "%p a: 0x%lx b: %x c: %x\n" // printf format string
bar:
.short 0xaaaa // a
.byte 0xbb // b
.byte 0 // padding
.word 0xcccccccc // c

.end

the third way (Linux only):

Using .struct and field labels to derive the offsets automatically

    .section .rodata
fmt:
.asciz "%p a: 0x%lx b: %x c: %x\n"

// use .struct to describe the field offsets of struct Foo
.struct 0
Foo_a: .skip 2 // short a: 2 bytes, offset 0
Foo_b: .skip 1 // char b: 1 byte, offset 2
.skip 1 // padding: 1 byte
Foo_c: .skip 4 // int c: 4 bytes, offset 4

.data
bar:
.short 0xaaaa // a: short 2 byte
.byte 0xbb // b: char 1 byte
.byte 0x00 // padding
.word 0xcccccccc // c: int 4 byte

.text
.global main
.align 2
main:
stp x29, x30, [sp, -16]! // save the frame pointer and link register
mov x29, sp

adrp x0, fmt
add x0, x0, :lo12:fmt // address of the printf format string

adrp x1, bar
add x1, x1, :lo12:bar // address of bar

ldrh w2, [x1, Foo_a] // load bar.a (short)
ldrb w3, [x1, Foo_b] // load bar.b (char)
ldr w4, [x1, Foo_c] // load bar.c (int)

bl printf // printf(fmt, &bar, a, b, c)

// exit explicitly
mov x8, #93 // syscall number for exit
mov x0, xzr // exit code 0
svc 0 // syscall

using structs

To summarize using structs:

  • All structs have a base address
  • The base address corresponds to the beginning of the first data member
  • All subsequent data members are offsets relative to the first
  • In order to use a struct correctly, you must have first calculated the offsets of each data member
  • Sometimes there will be padding between data members due to the need to align all data members on natural boundaries.

this pointer in c++

  • Every non-static method call employs a hidden first parameter. That’s it. That’s the sleight of hand. The hidden argument is the this pointer.
TestClass tc;
tc.SetString(test_string);

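It looks as if we pass only one argument, test_string. In fact the compiler passes two:

  1. The first is the this pointer, i.e. the address of tc, passed in register x0

  2. The second is test_string, passed in register x1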
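
The same call can be spelled out in C, where the hidden parameter has to be written explicitly. This is a sketch under assumptions: the original does not show the contents of TestClass, so the text member and the function name TestClass_SetString are invented for illustration.

```c
#include <string.h>

/* Hypothetical layout; the real TestClass is not shown in these notes. */
struct TestClass {
    char text[32];
};

/* C spelling of a C++ method such as TestClass::SetString(char *):
   the object's address is passed explicitly as the first argument (x0),
   and the string as the second (x1). */
void TestClass_SetString(struct TestClass *self, const char *s)
{
    strncpy(self->text, s, sizeof self->text - 1);
    self->text[sizeof self->text - 1] = '\0';
}
```

The C++ call tc.SetString(test_string) then corresponds to TestClass_SetString(&tc, test_string).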

In the assembly we see:

adrp x1, _test_string
adrp x0, _tc // put the address of the tc object in x0: this is the this pointer
bl __ZN9TestClass9SetStringEPc

const

The meaning and function of const only partially translates to assembly language.

  • const local variables and const parameters are just like any other data to assembly language.
  • The constant nature of const local variables and parameters is implemented solely in the compiler.

  • const globals are made constant by the hardware. Attempting to modify a variable protected in this manner will be like poking a dragon. Best not to poke dragons.

switch and jump table

When the C++ optimizer is enabled, it will look at your cases and choose between three different constructs for implementing your switch.

And, it can use any combination of the following! Compiler writers are smart!

  1. It may emit a long string of if / else constructs.
  2. It may find the right case using a binary search.
  3. Finally, it might use a jump table.

Suppose our cases are largely consecutive. Given that all branch instructions are the same length in bytes, we can do math on the switch variable to somehow derive the address of the case we want.

#include <stdlib.h>                                              
#include <stdio.h>
#include <time.h>

int main()
{
int r;

srand(time(0));
r = rand() & 7;
switch (r)
{
case 0:
puts("0 returned");
break;

case 1:
puts("1 returned");
break;

case 2:
puts("2 returned");
break;

case 3:
puts("3 returned");
break;

case 4:
puts("4 returned");
break;

case 5:
puts("5 returned");
break;

case 6:
puts("6 returned");
break;

case 7:
puts("7 returned");
break;
}
return 0;
}

Notice that the case values are all, in this case, consecutive.

jt:     b       0f
b 1f
b 2f
b 3f
b 4f
b 5f
b 6f
b 7f

f means forward, b means backward

At address jt there is a sequence of branch statements… jumps if you will. Being in a sequence, this is an example of a jump table. We’ll compute the index into this array of instructions and then branch to it.

lsl     x0, x0, 2     
ldr x1, =jt
add x1, x1, x0
br x1
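  • The ldr x1, =jt instruction loads the base address of the “instruction array” starting at address jt. The preceding lsl multiplies the index by 4 because every branch instruction in the table is 4 bytes wide.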
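
A close C analogue of a jump table is an array of function pointers indexed by the switch value. The sketch below is only an analogy with invented names: the C table stores addresses, whereas the assembly table stores branch instructions.

```c
/* Each handler stands in for one `case` body. */
static int case0(void) { return 0; }
static int case1(void) { return 10; }
static int case2(void) { return 20; }
static int case3(void) { return 30; }

/* The table plays the role of `jt`: a dense array indexed by the switch value. */
static int (*const jump_table[4])(void) = { case0, case1, case2, case3 };

int dispatch(unsigned r)
{
    r &= 3;                  /* keep the index inside the table, like `and x0, x0, 7` */
    return jump_table[r]();  /* index the table, fetch the target, jump to it */
}
```

No comparisons against individual case values are made; the cost of dispatch is the same for every case.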

complete example

        .text
.align 4
.global main

main: str x30, [sp, -16]!
mov x0, xzr // set up call to time(nullptr)
bl time // call time setting up srand
bl srand // call srand setting up rand
bl rand // get a random number
and x0, x0, 7 // ensure its range is 0 to 7
// note use of x register is on purpose
lsl x0, x0, 2 // multiply by 4
ldr x1, =jt // load base address of jump table
add x1, x1, x0 // add offset to base address
br x1

// If, as in this case, all the "cases" have the same number of
// instructions then this intermediate jump table can be omitted saving
// some space and a tiny amount of time. To omit the intermediate jump
// table, you'd multiply by 12 above and not 4. Twelve because each
// "case" has 3 instructions (3 x 4 == 12).

// Question for you: If you did omit the jump table, relative to what
// would you jump (since "jt" would be gone).

jt: b 0f
b 1f
b 2f
b 3f
b 4f
b 5f
b 6f
b 7f

0: ldr x0, =ZR
bl puts
b 99f

1: ldr x0, =ON
bl puts
b 99f

2: ldr x0, =TW
bl puts
b 99f

3: ldr x0, =TH
bl puts
b 99f

4: ldr x0, =FR
bl puts
b 99f

5: ldr x0, =FV
bl puts
b 99f

6: ldr x0, =SX
bl puts
b 99f

7: ldr x0, =SV
bl puts
b 99f

99: mov w0, wzr
ldr x30, [sp], 16
ret

.data
.section .rodata

ZR: .asciz "0 returned"
ON: .asciz "1 returned"
TW: .asciz "2 returned"
TH: .asciz "3 returned"
FR: .asciz "4 returned"
FV: .asciz "5 returned"
SX: .asciz "6 returned"
SV: .asciz "7 returned"

.end

implement falling through

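If there is no break following the code for a case, control will simply fall through to the next case. In the assembly, each b 99f implements a break; omitting it lets execution continue into the code at the next label.

Here is a snippet from the program shown just above:
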
0:      ldr     x0, =ZR  
bl puts
b 99f

1: ldr x0, =ON
bl puts
b 99f

implementing gaps

The example above shows 8 consecutive cases. What if there was no code for case 4? In other words, what if case 4 didn’t exist?

Here is the result:

2:      ldr     x0, =TW
bl puts
b 99f

3: ldr x0, =TH
bl puts
b 99f

4: b 99f

5: ldr x0, =FV
bl puts
b 99f

other strategies for implementing switch

As indicated above, an optimizer has at least three tools available to it to implement complex switch statements. And, it can combine these tools.

  1. Suppose your cases boil down to two ranges of fairly consecutive values, for example cases 0 to 9 and cases 50 to 59. You can implement this as two jump tables with an if / else to select which one to use.

  2. Suppose you have a large switch statement with widely ranging case values, such as case 10, case 1000 and case 50000. You can implement a binary search to narrow down to a small range in which another technique (a jump table or a short run of comparisons) becomes viable to pick out a single case.

  3. You might need to implement hierarchical jump tables. These suit case values that are sparse over a very wide range (for example 0, 1000, 2000...) but dense locally (for example 1000 to 1009 and 2000 to 2009):

    • A first-level jump table, indexed by the high-order part of the value or by its range, selects a sub-table.
    • The sub-table performs the final jump.

    The result is a layered, tree-like dispatch.
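
The first strategy (two dense ranges, each with its own table, selected by an if / else) can be sketched in C as follows. The function `classify()` and the table contents are made up for illustration; they only stand in for "whatever each case does".

```c
/* Hypothetical per-case results for cases 0..9 and 50..59. */
static const int low_table[10]  = { 100, 101, 102, 103, 104,
                                    105, 106, 107, 108, 109 };
static const int high_table[10] = { 200, 201, 202, 203, 204,
                                    205, 206, 207, 208, 209 };

/* Pick a table with if / else, then index into it.
   Values outside both ranges take the "default" path (-1). */
int classify(int value)
{
    if (value >= 0 && value <= 9) {
        return low_table[value];          /* first jump table  */
    } else if (value >= 50 && value <= 59) {
        return high_table[value - 50];    /* second jump table */
    }
    return -1;                            /* default case      */
}
```

Note the subtraction of 50 before indexing the second table: each table is indexed relative to the start of its own range.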

strategies for implementing if-else

If you do choose to implement a long chain of if / else statements, consider how frequently a given case might be chosen. Put the most common cases at the top of the if / else sequence.

This is known as making the common case fast.

Making the common case fast is one of the Great Ideas in Computer Science. One, you would do well to remember no matter what language you’re working with.

functions

bottom line concept

The bl instruction stands for Branch with Link. The Link concept is what enables a function (or method) to return to the instruction after the call.

Branch-with-link computes the address of the instruction following it.

It places this address into register x30 and then branches to the label provided. It makes one link of a trail of breadcrumbs to follow to get back following a ret.

This is why it is absolutely essential to backup x30 inside your functions if they call other functions themselves.

an example

        .text                                      
.global main
.align 2

main: ldr x0, =hw
bl puts
ret

.data
hw: .asciz "Hello World!"

.end

The program hung and had to be killed with ^C.

Somebody called main() - it’s a function and someone called it with a bl instruction. At the moment main() entered, the address to which it needed to return was sitting in x30.

Then, main() called a function - in this case puts() but which function is called doesn’t matter - it called a function. In doing so, it overwrote the address to which main() needed to return with the address of line 7 in the code (the ret that follows bl puts). That is where puts() needs to return.

So, when line 7 executes it puts the contents of x30 into the program counter and branches to it. But x30 now holds the address of line 7 itself, so the ret branches to itself forever.

Here is a fixed version of the code:

        .text                                   
.global main
.align 2

main: str x30, [sp, -16]!
ldr x0, =hw
bl puts
ldr x30, [sp], 16
ret

.data
hw: .asciz "Hello World!"

.end

In the AARCH64 Linux style calling convention, values are returned in x0 and sometimes also in other scratch registers, though this is uncommon. (Note that x0 could also be w0, or the first floating point register if the function is returning a float or double.)

If your functions call any other functions, x30 must be backed up on the stack and then restored into x30 before returning.

A function with more than one return value is not supported by C or C++ but they can be written in assembly language where the rules are yours to break.
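
C can get close to "two return values" by returning a small struct. As a sketch: under the AArch64 Linux procedure call standard, a 16-byte struct of two longs like the one below is handed back in x0 and x1, which is much the same as what a hand-written two-result assembly function would do. The names `struct pair` and `divmod()` are hypothetical.

```c
/* Two results bundled into one 16-byte struct. */
struct pair {
    long quot;
    long rem;
};

/* Returns both the quotient and the remainder of a / b.
   The caller must make sure b is not zero. */
struct pair divmod(long a, long b)
{
    struct pair result = { a / b, a % b };
    return result;
}
```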

inline functions

Functions that are declared as inline may not actually make function calls. Instead, the compiler can type check the code from the function and insert it directly where the “call” is made, after adjusting for parameter names. inline is a request, so the compiler is free to emit an ordinary call anyway.

passing parameters to functions

How parameters are passed to functions can be different from OS to OS. This chapter is written to the standard implemented for Linux.

For the purposes of the present discussion, we assume all parameters are long int and are therefore stored in x registers.

  • Up to 8 parameters can be passed directly via scratch registers (these are x0 through x7). Each parameter can be up to the size of an address, long or double (8 bytes).

    • Scratch means the value of the register can be changed at will without any need to backup or restore their values across function calls.

    • This means that you cannot count on the contents of the scratch registers maintaining their value if your function makes any function calls.

an example

long func(long p1, long p2)              
{
return p1 + p2;
}

is implemented as:

func:   add x0, x0, x1  
ret

If you are the author of both the caller and the callee and both are in assembly language, you can play loosey goosey with how you return values. Specifically, you can return more than one value. But if you do so, you give up the possibility of calling these functions from C or C++.

const

long func(const long p1, const long p2)              
{
return p1 + p2;
}

How would the assembly language change?

Answer: no change at all!

const is an instruction to the compiler ordering it to prohibit changing the values of p1 and p2. We’re smart humans and realize that our assembly language makes no attempt to change p1 and p2 so no changes are warranted.

passing pointers

void func(long * p1, long * p2)               
{
*p1 = *p1 + *p2;
}

is implemented as:

func:   ldr x2, [x0]                     
ldr x3, [x1]
add x2, x2, x3
str x2, [x0]
ret

The value of x0 on return is, in the general sense, undefined because this is a void function.

passing reference

long func(long & p1, long & p2)                     
{
return p1 + p2;
}

is implemented as:

func:   ldr x0, [x0]                     
ldr x1, [x1]
add x0, x0, x1
ret

Passing by reference is also an instruction to the compiler to treat pointers a little differently. The differences don’t show up here, so the only change from our pointer passing version is how we return the answer.

more than eight parameters

#include <stdio.h>

void SillyFunction(long p1, long p2, long p3, long p4,
long p5, long p6, long p7, long p8,
long p9) {
printf("This example hurts: %ld %ld\n", p8, p9);
}

int main() {
SillyFunction(1, 2, 3, 4, 5, 6, 7, 8, 9);
}

The same program in assembly language:

        .text                                                            
.global main

/* Demonstration of using more than 8 arguments to a function. This
demo is LINUX only as APPLE will put all arguments beyond the first
one on the stack anyway.

On LINUX, all parameters to a function beyond the eight go on the
stack. The first 8 go in registers x0 through x7 as normal (for
LINUX).
*/

SillyFunction:
stp x29, x30, [sp, -16]! // Changes sp.
mov x29, sp // set new frame pointer
ldr x0, =fmt
mov x1, x7 // eighth parameter (p8)
ldr x2, [sp, 16] // ninth parameter (p9); this does not alter the sp
bl printf
ldp x29, x30, [sp], 16 // Undoes change to sp.
ret

main:
stp x29, x30, [sp, -16]! // sp down total of 16.
mov x29, sp
mov x0, 9
str x0, [sp, -16]! // sp down total of 32.
mov x0, 1
mov x1, 2
mov x2, 3
mov x3, 4
mov x4, 5
mov x5, 6
mov x6, 7
mov x7, 8
bl SillyFunction
add sp, sp, 16 // undoes change of sp by 16 due
// to function call.
ldp x29, x30, [sp], 16 // undoes change to sp of 16.
ret

.data
fmt: .asciz "This example hurts my brain: %ld %ld\n"

.end

After executing Line 24, the stack will have:

sp + 0    former contents of frame pointer
sp + 8 return address for main

After executing Line 27, the stack will have:

sp + 0    9
sp + 8 garbage
sp + 16 former contents of frame pointer
sp + 24 return address for main

After executing Line 14, the stack will have:

sp + 0    frame pointer of main (saved x29)
sp + 8 return address for SillyFunction (saved x30)
sp + 16 9
sp + 24 garbage
sp + 32 former contents of frame pointer
sp + 40 return address for main

This means that Line 18 (ldr x2, [sp, 16]) fetches p9 from memory and puts its value into x2 (where it becomes the third argument to printf()).

On AArch64, stack space is allocated in 16-byte units, but a function may write to only part of what it allocates. The rest is left uninitialized, which is why it is labeled “garbage” above.

The stack pointer in ARM V8 can only be manipulated in multiples of 16. More precisely, on Linux sp must be 16-byte aligned whenever it is used as the base address of a load or store.
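
The arithmetic behind "multiples of 16" is a simple round-up. As a small sketch (the helper name `stack_bytes_needed()` is made up for illustration), this computes how far sp has to move to hold a given number of bytes:

```c
#include <stddef.h>

/* Round a byte count up to the next multiple of 16.
   Adding 15 and clearing the low four bits does the rounding. */
size_t stack_bytes_needed(size_t payload_bytes)
{
    return (payload_bytes + 15u) & ~(size_t)15u;
}
```

This is why storing the single 8-byte value 9 in the example above still moved sp by 16, leaving 8 bytes of garbage.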

examples of calling some common C runtime functions

There are, by the way, two broad types of functions within the C runtime.

  • Some are implemented largely in the C runtime itself.

  • Others that exist in the C runtime act as wrappers for functions implemented within the OS itself. These are called “system calls”.

For the purposes of calling functions in the C runtime, there is no practical difference between these two types. Note however, there are ways of calling system calls directly using the svc instruction.

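The “C runtime” is the set of functions, variables and basic machinery that supports a program while it runs, chiefly the C standard library plus program start-up and shut-down. It is usually packaged as the C runtime library. Common implementations include:

  • glibc on GNU/Linux
  • MSVCRT on Windows
  • libSystem.dylib (which contains libc) on macOS

What does the C runtime do?

  1. Program initialization
    • Before main() runs, the C runtime sets up the stack, initializes global variables, calls constructors and so on.
    • The typical entry path is _start → __libc_start_main() → main().
  2. Standard library functions
    • printf(), malloc(), exit(), fopen() and so on are implemented or wrapped by the C runtime.
  3. Resource management
    • For example, the lifetime of memory allocations, file handles and threads.
  4. System call wrappers
    • When you call write(), you are really calling a wrapper provided by the C runtime, which ultimately reaches the kernel through syscall or the svc instruction.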

system calls

Many C runtime functions are just wrappers for system calls. For example if you call open() from the C runtime, the function will perform a few bookkeeping operations and then make the actual system call.

What IS a system call?

The short answer is a system call is a sort-of function call that is serviced by the operating system itself, within its own private region of memory and with access to internal features and data structures.

Our programs run in “userland”. The technical name for userland on the ARM64 processor is EL0 (Exception Level 0).

We can operate within the kernel’s space only through carefully controlled mechanisms - such as system calls. The technical name for where the kernel (or system) generally operates is called EL1.

There are two higher Exception Levels (EL2 and EL3) which are beyond the scope of this book.

Mechanism of making a system call

First, like any function call, parameters need to be set up. The first parameter goes in the first register, etc.

Second, a number associated with the specific system call we wish to make is loaded in a specific register (x8, which can also be written as w8).

Finally, a special instruction svc causes a trap which elevates us out of userland into kernel space. Said differently, svc causes a transition from EL0 to EL1. There, various checks are done and the actual code for the system call is run.

A description of returning from a system call is beyond the scope of this book. Hint: just as there’s a special instruction that escalates from EL0 to EL1, there is a special instruction that does the reverse.

the number associated with a particular system call

reference:

example getpid()

#include <stdio.h>                                                
#include <unistd.h>

int main() {
printf("Greetings from: %d\n", getpid());
return 0;
}

Written in assembly language using C runtime

        .global main                                              
.text
.align 2

main: stp x29, x30, [sp, -16]!
bl getpid
mov w1, w0
ldr x0, =fmt
bl printf
ldp x29, x30, [sp], 16
mov w0, wzr
ret

.data
fmt: .asciz "Greetings from: %d\n"

.end

And finally: calling the system call directly
        .global main                                              
.text
.align 2

main: stp x29, x30, [sp, -16]!
mov x8, 172 // getpid on ARM64
svc 0 // trap to EL1
mov w1, w0
ldr x0, =fmt
bl printf
ldp x29, x30, [sp], 16
mov w0, wzr
ret

.data
fmt: .asciz "Greetings from: %d\n"

.end

We chose getpid() because it doesn’t require any parameters. Using the C runtime, we simply bl to it. Calling the system call directly is different in that we must first load x8 with the number that corresponds to getpid() for the AARCH64 architecture.
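
The same contrast can be seen from C. The sketch below, which assumes Linux with glibc, calls getpid() once through its ordinary wrapper and once through the generic syscall() function with the constant SYS_getpid from <sys/syscall.h>. The two function names are hypothetical; syscall() loads the system call number and executes the trap for us.

```c
#define _GNU_SOURCE         /* makes sure syscall() is declared */
#include <sys/syscall.h>    /* SYS_getpid */
#include <sys/types.h>      /* pid_t */
#include <unistd.h>         /* getpid(), syscall() */

/* The usual route: the C runtime wrapper. */
pid_t pid_via_wrapper(void)
{
    return getpid();
}

/* The direct route: name the system call by number. */
pid_t pid_via_syscall(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Both functions should report the same process id.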

/*  Perry Kivolowitz
Example of file operations.
*/
.text
.global main
.align 2

/* This program will
* open() a file in the current directory,
* write() some text to it,
* seek back to the beginning of the file,
* read() each line, printing it
* close() the file
*/
// Use .req to give registers aliases for readability. For example, fd is really w28 and holds the file descriptor.
retval .req w27
fd .req w28

main: stp x29, x30, [sp, -16]!
stp x27, x28, [sp, -16]!
bl open_file

// w0 will contain either the file descriptor of the new
// file or -1 for a failure. Note that the value in w0
// has also been copied to "fd" - a register alias.
cmp w0, wzr
bge 1f

// If we get here, the open has failed. Use perror() to
// print a meaningful error and branch to exit. The return
// code of the program will be set to non-zero inside fail.
ldr x0, =fname
bl fail
b 99f

1: // When we get here, the file is open. Write some data to it.
// If write_file returns non-zero, it signifies an error. If
// so, branch to the file closing code since the file is open
// after printing an error message.
bl write_data
cbz w0, 10f

// If we get here, there was an error in write_data. Print
// a reasonable error message then branch to the clean up
// code.
ldr x0, =wf // load legend
bl fail // print error
b 50f // branch to clean up.

// Seek back to position zero preparing to read the file back.
// The return value in x0 (off_t) is the return value of
// lseek().
10: bl seek_zero
cbz x0, 20f

// If we get here, the seek failed. Cause a reasonable
// message to be printed then branch to the clean up code.
ldr x0, =sf
bl fail
b 50f

20: // When we get here, we have to read from the file and print
// the results. To ignore the complexity of memory allocation
// and buffer overrun potential, we'll read one character at a
// time looking for the end-of-file.

// ssize_t read(int fildes, void *buf, size_t nbyte);
mov w0, fd
ldr x1, =buffer
mov x2, 1
bl read
// Check the return value - should be 1.
cbz x0,50f // zero means EOF - that's OK.
// If x0 is negative, that IS a problem.
cmp x0, xzr
bge 25f
// The return value is negative - this is an error.
ldr x0, =rf
bl fail
b 99f

25: // Write the character sitting in buffer to the console.
mov w0, 1
ldr x1, =buffer
mov x2, 1
bl write
// We will ignore the return value for the sake of brevity.
// There are plenty of examples of handling a potential error
// elsewhere in this code.
// --
b 20b

// When we get here, we are done. Close the file.
50: mov w0, fd
bl close
mov retval, wzr

99: mov w0, retval // copy the result out before x27 is restored
ldp x27, x28, [sp], 16
ldp x29, x30, [sp], 16
ret

/* open_file()
This function attempts to open a file for both reading and
writing. Return values will be checked to ensure the file is
opened. If successful, the fd is returned (and is squirreled
away in register "fd"). If unsuccessful, the -1 returned by
open() is passed back to the caller.

Explanation of the magic numbers:

int open(const char *pathname, int flags, mode_t mode);

octal 102 for flags is O_RDWR | O_CREAT
octal 600 for mode is rw------- i.e. read and write for
the owner but no permissions for anyone else.

There is a version of open() that takes two parameters. However,
if O_CREAT is specified, the three parameter version is required.
*/

.equ O_FLAGS, 0102
.equ O_MODE, 0600

open_file:
stp x29, x30, [sp, -16]!
ldr x0, =fname
mov w1, O_FLAGS
mov w2, O_MODE
bl open
mov fd, w0
ldp x29, x30, [sp], 16
ret


/* This function uses perror() to print a meaningful error
message in the event of a failure. The string value
passed to perror() arrives to us as a pointer in x0.
*/

fail:
stp x29, x30, [sp, -16]!
bl perror
mov retval, 1
ldp x29, x30, [sp], 16
ret

/* ssize_t write(int fd, const void *buf, size_t count);

This function will write a string to the file descriptor contained
in "fd" (a register alias).
*/

write_data:
stp x29, x30, [sp, -16]!
str x20, [sp, -16]!
mov w0, fd // file descriptor
ldr x1, =txt // address to print from
ldr x2, =txt_s // load pointer to size
ldr w2, [x2] // dereference the pointer (txt_s is a 4 byte .word)
mov w20, w2 // need this value for error check.
bl write
cmp x0, x20 // Did we write the expected amount?
bne 90f
// successful write - return 0
mov x0, xzr
b 99f
90: // failure - ensure we return non-zero!
mov x0, 1
99: ldr x20, [sp], 16
ldp x29, x30, [sp], 16
ret

/* off_t lseek(int fd, off_t offset, int whence);
*/
seek_zero:
stp x29, x30, [sp, -16]!
mov w0, fd // file descriptor
mov x1, xzr // beginning of file
mov w2, wzr // SEEK_SET - absolute offset
bl lseek
ldp x29, x30, [sp], 16
ret

.data
prog: .asciz "file_ops"
wf: .asciz "write failed"
rf: .asciz "read failed"
sf: .asciz "lseek failed"
fname: .asciz "test.txt"
txt: .asciz "some data\n"
txt_s: .word txt_s - txt - 1 // strlen(txt): the length of "some data\n" without the terminating NUL
buffer: .word 0
.end

floating point

what are floating points numbers?

reference

IEEE 754

register

There are four highest level ideas relating to floating point operations on AARCH64.

  • There is another complete register set for floating point values.
  • There are alternative instructions just for floating point values.
  • There are exotic instructions that operate on sets of floating point values (SIMD).
  • There are instructions to go back and forth to and from the integer registers.

regs

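The figure above uses the SIMD (Single Instruction, Multiple Data) register V0 to show the different views of an ARM64 vector register: the arrangement specifiers for each element width and the lane indices.

Reading the figure

Taking V0 as the example, the figure shows how its contents can be accessed through different arrangements:

Level Form Meaning
Bottom V0 The whole 128-bit V0 register
Next up V0.2D, V0.4S, V0.8H, V0.16B V0 viewed as elements of one size: D = 64-bit (2 × 64 bit), S = 32-bit (4 × 32 bit), H = 16-bit (8 × 16 bit), B = 8-bit (16 × 8 bit)
Next up V0.2D[0], V0.4S[0] A lane index selects one element: V0.4S[2] is the third 32-bit element and V0.16B[15] is the sixteenth 8-bit byte
Top B0, H0, S0, D0 Aliases of V0 by width; each accesses only the lowest-order element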

truncation towards zero

truncate: to cut off the fractional part

In C and C++, truncation is what we get from:

integer_variable = int(floating_variable);  // C++
integer_variable = (int) floating_variable; // C

The instruction is fcvtz - convert towards zero. The choice as to whether to produce a signed or unsigned result is defined by the final letter: u or s.

Mnemonic Meaning
fcvtzu Truncate (always towards 0) producing an unsigned int
fcvtzs Truncate (always towards 0) producing a signed int

  • fcvtzu: Float Convert to Unsigned integer, with truncation toward zero
  • fcvtzs: Float Convert to Signed integer, with truncation toward zero

The ARM documentation describes this instruction, which completely discards the fractional value, as rounding (toward zero) rather than truncating.

The choice of source register defines whether you are converting a double or single precision floating point value. The choice of destination register defines the width of the integer produced.

Source Register Converts a
dX double to an integer
sX float to an integer

Destination Register Converts to a
xX 64 bit integer
wX 32 bit or smaller integer

Examples where d is a double and f is a float:

C++ Instruction
int32_t(d) fcvtzs w0, d0
uint32_t(d) fcvtzu w0, d0
int64_t(d) fcvtzs x0, d0
uint64_t(d) fcvtzu x0, d0
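
The casts in the table can be tried out in C. This is a minimal sketch with hypothetical function names; a compiler targeting AArch64 would typically implement each cast with the instruction named in the comment. Note that in C the result is only defined when the truncated value fits in the destination type.

```c
#include <stdint.h>

/* Each cast discards the fraction, moving towards zero. */
int32_t  to_i32(double d) { return (int32_t)d;  }  /* fcvtzs w0, d0 */
uint32_t to_u32(double d) { return (uint32_t)d; }  /* fcvtzu w0, d0 */
int64_t  to_i64(double d) { return (int64_t)d;  }  /* fcvtzs x0, d0 */
uint64_t to_u64(double d) { return (uint64_t)d; }  /* fcvtzu x0, d0 */
```

For example, both 5.49 and 5.99 become 5, and both -5.49 and -5.99 become -5.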

example

        .section .text
        .global main
        .type   main, @function
        // .type <symbol>, @<type> is a GAS (GNU Assembler) directive that
        // assigns a type to a symbol. Here it tells the assembler and the
        // linker that main is a function rather than a variable or a
        // plain label.

main:
        stp     x29, x30, [sp, -16]!    // save frame pointer and link register
        mov     x29, sp

        // Save the floating point registers we use. d8 through d15 are
        // preserved across calls; d16 through d31 are scratch and could
        // be overwritten by printf, so the values are kept in d8 - d11.
        stp     d8, d9, [sp, -16]!
        stp     d10, d11, [sp, -16]!

        // Print the legend
        ldr     x0, =leg
        bl      printf

        // Load the four test values into d8 - d11
        ldr     x0, =vless
        ldr     d8, [x0]                // dless  =  5.49
        ldr     d9, [x0, #8]            // dmore  =  5.51
        ldr     d10, [x0, #16]          // ndless = -5.49
        ldr     d11, [x0, #24]          // ndmore = -5.51

        // fcvtps: round towards +infinity (ceiling)
        fcvtps  x1, d8
        fcvtps  x2, d9
        ldr     x0, =fmt1
        bl      printf

        fcvtps  x1, d10
        fcvtps  x2, d11
        ldr     x0, =fmt1
        bl      printf

        // fcvtns: round to nearest, ties to even
        fcvtns  x1, d8
        fcvtns  x2, d9
        ldr     x0, =fmt2
        bl      printf

        fcvtns  x1, d10
        fcvtns  x2, d11
        ldr     x0, =fmt2
        bl      printf

        // fcvtzs: round towards zero (truncate)
        fcvtzs  x1, d8
        fcvtzs  x2, d9
        ldr     x0, =fmt4
        bl      printf

        fcvtzs  x1, d10
        fcvtzs  x2, d11
        ldr     x0, =fmt4
        bl      printf

        // fcvtas: round to nearest, ties away from zero
        fcvtas  x1, d8
        fcvtas  x2, d9
        ldr     x0, =fmt3
        bl      printf

        fcvtas  x1, d10
        fcvtas  x2, d11
        ldr     x0, =fmt3
        bl      printf

        // Restore the floating point registers and the return address
        ldp     d10, d11, [sp], #16
        ldp     d8, d9, [sp], #16
        ldp     x29, x30, [sp], #16
        mov     w0, wzr
        ret

        .section .rodata
vless:
        .double 5.49
        .double 5.51
        .double -5.49
        .double -5.51

fmt1:
        .asciz "fcvtps less: %ld more: %ld\n"
fmt2:
        .asciz "fcvtns less: %ld more: %ld\n"
fmt3:
        .asciz "fcvtas less: %ld more: %ld\n"
fmt4:
        .asciz "fcvtzs less: %ld more: %ld\n"
leg:
        .asciz "less values are +/- 5.49. more values are +/- 5.51.\n"

In the fcvtzs lines of the output, notice that every value (5.49, 5.51, -5.49 and -5.51) was truncated to the whole number that is closer to zero: 5 or -5.

Truncation Away From Zero

Truncation away from zero is not as easy. In fact, it cannot be performed with a single instruction.

In C (and C++):

iv = (int(fv) == fv) ? int(fv) : int(fv) + ((fv < 0) ? -1 : 1);

If fv is already equal to a whole number, the integer value will be that whole number. Otherwise iv is the whole number further away from zero.
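
The one-line expression above can be unpacked into plain C that needs no math library. This is a sketch with a made-up function name, and it assumes the value fits in a long:

```c
/* Truncate away from zero: whole numbers are returned unchanged,
   anything else moves to the next whole number further from zero. */
long truncate_away_from_zero(double fv)
{
    long toward_zero = (long)fv;          /* the cast truncates towards 0 */

    if ((double)toward_zero == fv) {
        return toward_zero;               /* already a whole number */
    }
    return toward_zero + ((fv < 0) ? -1 : 1);
}
```

So 5.49 becomes 6 and -5.49 becomes -6, while 5.0 stays 5.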

In C++, a more sophisticated version could look like this (it requires <cmath> for floor() and ceil()):

template <typename T>
int MyTruncate(T x) {
return int((x < 0) ? floor(x) : ceil(x));
}

floor() always truncates downward (towards more negative).
ceil() always truncates upwards (towards more positive).

RoundAwayFromZero:
fcmp d0, 0
ble 1f
// Value is positive, truncate towards positive infinity (ceil)
frintp d0, d0
b 2f
1: // Value is negative, truncate towards negative infinity (floor)
frintm d0, d0
2: fcvtzs x0, d0
ret
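
The frint family rounds a floating point value to a whole number but leaves the result in a floating point register:

  • frintp (Round toward +∞)

  • frintm (Round toward -∞)

  • frintz (Round toward 0)

  • frinta (Round to nearest, tie away from 0)

  • frintn (Round to nearest, tie to even)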

rounding conversion

An instruction which does what we normally think of as rounding is frinta. This is the conversion “to nearest with ties going away.” So, 5.5 goes to 6 as one would expect from “rounding.”

converting an integer to a floating point value

In C / C++:

double_var = double(integer_var); // C++
double_var = (double)integer_var; // C

This is handled by two instructions:

  • scvtf converts a signed integer to a floating point value
  • ucvtf converts an unsigned integer to a floating point value

The name of the destination register controls which kind of floating point value is made. For example, specifying dX makes a double etc.

floating point literals

Recall that all AARCH64 instructions are 4 bytes long. Recall also that this means that there are constraints on what can be specified as a literal since the literal must be encoded into the 4 byte instruction. If the literal is too large, an assembler error will result.

Given that floating point values are always at least 4 bytes long themselves, using floating point literals is extremely constrained. For example:

fmov    d0, 1     // 1
fmov d0, 1.1 // 2

Line 1 will pass muster but Line 2 will cause an error.

To load a float, you could translate the value to binary and do as the following:

        .text                                                   
.global main
.align 2

main: str x30, [sp, -16]!
ldr s0, =0x3fc00000
fcvt d0, s0
ldr x0, =fmt
bl printf
ldr x30, [sp], 16
mov w0, wzr
ret

.data
fmt: .asciz "%f\n"
.end

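printf() only knows how to print double precision values. In C, a float passed to printf() is promoted to a double by the compiler before the call. In assembly language we must do that conversion ourselves, which is why the fcvt d0, s0 is there.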

Translating floats and doubles by hand isn’t a common practice for humans, though compilers are happy to do so.
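
If you do want to check a hand translation such as 0x3fc00000, a few lines of C will do it. The sketch below assumes float is the 32-bit IEEE 754 single precision type (true on AArch64 Linux); the function names are made up. memcpy() is used to copy the raw bits without converting the value.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float. */
float float_from_bits(uint32_t bits)
{
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Reinterpret a float as its 32-bit pattern. */
uint32_t bits_from_float(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

With these, 0x3fc00000 turns out to be 1.5, the value loaded into s0 above.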

Instead for us humans, the assembler directives .float and .double are used more frequently to specify float and double values putting them into RAM.
An example:

        .global main                                            
.text
.align 2

counter .req x20
dptr .req x21
fptr .req x22
.equ max, 4

main: stp counter, x30, [sp, -16]!
stp dptr, fptr, [sp, -16]!
ldr dptr, =d
ldr fptr, =f
mov counter, xzr

1: cmp counter, max
beq 2f

ldr d0, [dptr, counter, lsl 3]
ldr s1, [fptr, counter, lsl 2]
fcvt d1, s1
ldr x0, =fmt
add counter, counter, 1
mov x1, counter
bl printf
b 1b

2: ldp dptr, fptr, [sp], 16
ldp counter, x30, [sp], 16
mov w0, wzr
ret

.data
fmt: .asciz "%d %f %f\n"
d: .double 1.111111, 2.222222, 3.333333, 4.444444
f: .float 1.111111, 2.222222, 3.333333, 4.444444

.end
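
Directive Full name Purpose Typical usage
.req “register require” (unofficial expansion) gives a register an alias foo .req x0 means foo can be written wherever x0 is meant
.equ equate defines a constant symbol .equ BUF_SIZE, 64 sets BUF_SIZE to 64

On Linux, just as w/x0 through w/x7 are scratch registers used to pass parameters, s/d0 through s/d7 are as well, beginning with the 0 register.

That is:

Integer parameters: x0 to x7 (or the 32-bit w0 to w7) carry the first 8 integer-class parameters (int, pointer, long and so on). Beyond 8, parameters are passed on the stack.

Floating point parameters: d0 to d7 (64-bit double) or s0 to s7 (32-bit float) carry the first 8 floating point parameters. Beyond 8, floating point parameters are also passed on the stack.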

Fitting 32 bits into a 32 bit bag

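ldr s0, =0x3fc00000  // a pseudo-instruction! It looks as if it loads 0x3fc00000 straight into s0

The assembler cannot hard-code an arbitrary 32-bit value into an instruction, because an ARM instruction is itself only 32 bits wide.

So what it actually does is:

  1. Write the literal value 0x3fc00000 somewhere in memory (usually near the end of the current function).
  2. Generate an ldr instruction that loads the value from that address using a PC-relative load.

This region is called a literal pool. It is a collection of constants.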

We expected line 6 to read:

ldr        s0, =0x3fc00000

Instead we find:
b+ 0x784 <main+4>          ldr     s0, 0x7a0 <main+32>

Scan downward to find 0x7a0:
0x7a0 <main+32>         .inst   0x3fc00000 ; undefined  

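Pseudo-instruction Effect What GDB actually shows
ldr s0, =0x3fc00000 loads the constant into s0 ldr s0, literal_addr, with literal_addr: .inst 0x3fc00000
ldr x0, =fmt loads the address of the string ldr x0, literal_addr, with literal_addr: holding the address value
.inst 0x3fc00000 places a 32-bit value by hand (not necessarily a valid instruction) storage for a constant (never executed)

What .inst means

.inst stands for “insert instruction”. It places the machine code of an instruction (usually a 32-bit hexadecimal value) directly into the output.

.inst 0xd65f03c0   // this is really the ret instruction

In this example the machine code 0xd65f03c0 that follows .inst is the 32-bit encoding of ret. In other words:

ret

is equivalent to:

.inst 0xd65f03c0

In the disassembly above, the literal pool word is shown as .inst 0x3fc00000: it is data placed at an address to be loaded from, not something to execute.

Why not use mov reg, #imm?

  • mov has immediate encoding limits and cannot load an arbitrary 32-bit value.
  • When the value is out of range, it is loaded from memory with ldr (or built up in pieces with movz / movk).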

fmov

The fmov instruction is used to move floating point values in and out of floating point registers and to some degree, moving data between integer and floating point registers.

loading floating point numbers as immediate values

Just as we saw with integer registers, some values can be used as immediate values and some cannot. It comes down to how many bits are necessary to encode the value. Too many bits… not enough room to fit in a 4 byte instruction plus the opcode.

For example, this works:

mov    x0, 65535

but this does not:
mov    x0, 65537

The constraints placed on immediate values for fmov are much tighter because floating point numbers are far more complex than integers.

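Whether fmov d0, #imm works depends on whether the floating point value can be represented exactly in an 8-bit encoding:

Field Bits Meaning
Sign 1 bit positive or negative
Exponent 3 bits controls the magnitude (multiplies by a power of 2)
Fraction 4 bits only combinations of 1/2, 1/4, 1/8 and 1/16

fmov d0, 1.0        // ✅ OK: 1 is 2⁰, the exponent can be encoded
fmov d0, 1.5        // ✅ OK: 1 + 0.5 = 2⁰ + 2⁻¹, exponent and fraction both fit
fmov d0, 1.75       // ✅ OK: 1 + 0.5 + 0.25 = 2⁰ + 2⁻¹ + 2⁻²
fmov d0, 1.875      // ✅ OK: + 2⁻³
fmov d0, 1.9375     // ✅ OK: + 2⁻⁴
fmov d0, 1.96875    // ❌ No: needs 2⁻⁵, which does not fit in a 4 bit fraction

Floating point values that do not fit cannot use an fmov immediate; load them with ldr instead.

fmov is a “bit copier”, not a “precision converter”. To change the precision of a value you must use the fcvt family.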
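
To see which values pass the 8-bit test, here is a brute-force sketch in C. It assumes the encodable set is ±n/16 × 2^r with 16 ≤ n ≤ 31 and -3 ≤ r ≤ 4, which matches the 1 + 3 + 4 bit layout described above; the function name `fmov_imm_encodable()` is hypothetical. Every candidate is exactly representable as a double, so the == comparison is safe here.

```c
#include <stdbool.h>

/* Returns true if value could be written as an fmov immediate. */
bool fmov_imm_encodable(double value)
{
    if (value < 0) {
        value = -value;                   /* the sign bit is free */
    }
    for (int r = -3; r <= 4; r++) {       /* 3 bit exponent: 8 choices */
        double scale = 1.0;
        int steps = (r < 0) ? -r : r;
        for (int i = 0; i < steps; i++) {
            scale = (r < 0) ? scale / 2.0 : scale * 2.0;
        }
        for (int n = 16; n <= 31; n++) {  /* 4 bit fraction: 16 choices */
            if (value == (n / 16.0) * scale) {
                return true;
            }
        }
    }
    return false;                         /* includes 0.0 and NaN */
}
```

Under these assumptions the smallest encodable magnitude is 0.125 and the largest is 31.0. Zero is not in the set at all.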

half precision

Support for half precision (16 bit) floating point values does exist but there is no complete agreement on how different compilers support them. Indeed, there are not one but two competing half precision formats out there. These are the IEEE and GOOGLE types. Further still, many open source developers have created their own implementations with potentially clashing naming conventions.

__fp16 Foo(__fp16 g, __fp16 f) {
return g + f;
}

compiles to:

fcvt    s1, h1
fcvt s0, h0
fadd s0, s0, s1
fcvt h0, s0
ret

Notice each half precision value is converted to single precision. So, from C and C++ working with half precision values can be inefficient.

On the other hand, if you are willing to use intrinsics and one of the SIMD instruction sets offered by ARM, then knock yourself out. Be aware that doing so ties your code to the ARM processor in ways which you might regret later.

bit manipulation

Bit fields are a feature of the C and C++ language which completely hide what is often called “bit bashing”.

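The ordering of bits in a bit field is not guaranteed to be the same on different platforms and even between different compilers on the same platform.

A bit field is syntax for controlling exactly how many bits a struct member occupies. It is typically used where space matters, such as hardware registers and protocol headers.

Syntax:

struct struct_name {
    type member_name : bit_width;
    ...
};

example:

struct BF {
    unsigned char a : 1;
    unsigned char b : 2;
    unsigned char c : 5;
};

  • a uses 1 bit and can hold 0 or 1
  • b uses 2 bits and can hold 0 to 3
  • c uses 5 bits and can hold 0 to 31

The three members occupy 1 + 2 + 5 = 8 bits in total, i.e. 1 byte.

  1. Although each member has its own bit width, the struct as a whole is usually sized to a whole unit of the declared type (here 1 byte, since 8 bits is exactly one byte).
  2. Compilers may differ slightly in how they align and pad bit fields.
  3. Members are accessed just like ordinary members:

struct BF bf;
bf.a = 1;
bf.b = 3;
bf.c = 31;

The compiler automatically generates the masking and shifting needed to reach each bit field.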
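
The masking and shifting can also be written out by hand, which is roughly the “bit bashing” that bit fields hide. In this sketch the layout (a in bit 0, b in bits 1 to 2, c in bits 3 to 7) is an assumption chosen for illustration, since the real ordering is up to the compiler; the helper names are hypothetical as well.

```c
#include <stdint.h>

/* Assumed layout of the three fields inside one byte. */
enum {
    A_SHIFT = 0, A_MASK = 0x01,   /* 1 bit  */
    B_SHIFT = 1, B_MASK = 0x03,   /* 2 bits */
    C_SHIFT = 3, C_MASK = 0x1f    /* 5 bits */
};

/* Clear the field, then OR in the new value (masked to its width). */
uint8_t set_field(uint8_t word, unsigned shift, unsigned mask, unsigned value)
{
    word = (uint8_t)(word & ~(mask << shift));
    return (uint8_t)(word | ((value & mask) << shift));
}

/* Shift the field down to bit 0 and mask off its neighbours. */
unsigned get_field(uint8_t word, unsigned shift, unsigned mask)
{
    return (word >> shift) & mask;
}
```

Setting a = 1, b = 3 and c = 31 this way fills all eight bits of the byte.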

Consider a data structure for which there will be potentially millions of instances in RAM. Or, perhaps billions of instances on disc. Suppose you need 8 boolean members in every instance. The C++ standard does not define the size of a bool instead leaving it to be implementation dependent. Some implementations equate bool to int, four bytes in length. Some implement bool with a char, or 1 byte in length.

Let’s assume the smallest case and equate a bool with char. Our struct, for which there may be millions or billions of instances requires 8 bool so therefore 8 bytes. Times millions or billions.

Bit fields can come to your aid here by using a single bit per boolean value. In the best case, 8 bytes collapse to 1 byte. In a worse case, 8 x 4 = 32 bytes collapsed into 1.

Assume the smallest case, where each bool is 1 byte:

struct S {
bool b0;
bool b1;
bool b2;
bool b3;
bool b4;
bool b5;
bool b6;
bool b7;
};

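This struct is 8 bytes (1 byte × 8 bools). A million instances occupy 8 MB; a billion instances occupy 8 GB. With a 4-byte bool the struct grows to 32 bytes, so every hundred million instances cost 3.2 GB.

The solution: compress the booleans with bit fields. Declare each of the 8 booleans as a 1-bit field:
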
struct S {
unsigned char b0 : 1;
unsigned char b1 : 1;
unsigned char b2 : 1;
unsigned char b3 : 1;
unsigned char b4 : 1;
unsigned char b5 : 1;
unsigned char b6 : 1;
unsigned char b7 : 1;
};

The eight 1-bit members together occupy exactly 1 byte.

8 bytes have been compressed into 1 byte, saving a great deal of space.

In Computer Science there is an eternal tension between space and time. The following is a law:

If you want something to go faster, it will cost more memory.

If you want to save memory, what you’re doing will take more time.

This law shows up here… recall the example of where we wanted to save memory by collapsing 8 bool into 1 byte? To save that memory we will slow down because accessing the right bits takes a couple of instructions where overwriting a bool implemented as an int takes just one instruction.

As for the assembly language that a bit field will produce, it depends upon the optimization level. Unoptimized, the code produced will be much longer and more cumbersome than the “sophisticated” assembly language.

endian

ARM swings both ways: it supports both little-endian and big-endian operation. But:

The standard toolchain emits little endian code. It is a big task to install the big-endian version of the toolchain.

Here is a quote from Wikipedia:

ARM, C-Sky, and RISC-V have no relevant big-endian deployments, and can be considered little-endian in practice.

The common Intel processors are also little-endian.
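
If you want to confirm the byte order of the machine you are on, a small probe is enough. This is a sketch under the assumption of a hosted C environment; the function name is_little_endian is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 when the least significant byte of a multi-byte integer
   is stored at the lowest address (little-endian), otherwise 0. */
int is_little_endian(void) {
    uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);   /* look at the byte at the lowest address */
    return first == 1;
}
```

On a standard AArch64 Linux or Apple Silicon toolchain this is expected to report little-endian, in line with the quote above.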

assembly macros

An early innovation in assemblers was the introduction of a macro capability. Given what could be considered a certain amount of tedium in coding in asm, macros provide a simple form of metaprogramming where a series of statements can be encapsulated by a single macro. Think of a macro as an early form of a C++ templated function (kinda but not really).

Here’s an example of an assembly language macro:

.macro LLD_ADDR xreg, label 
adrp \xreg, \label@PAGE
add \xreg, \xreg, \label@PAGEOFF
.endm

Here's how it might be used:

LLD_ADDR x0, fmt

This gets expanded to:
adrp    x0, fmt@PAGE
add x0, x0, fmt@PAGEOFF

gcc on Linux does not run assembly language files through the C preprocessor if the file name ends in .s, but it WILL if the file name ends in .S.

General Use

AASCIZ

AASCIZ label, string

This macro invokes .asciz with the string set to string and the label set to label. In addition, this macro ensures that the string begins on a 4-byte-aligned boundary.

PUSH_P, PUSH_R, POP_P and POP_R

These macros save some repetitive typing. For example:

PUSH_P  x29, x30

resolves to:
stp     x29, x30, [sp, -16]!

START_PROC and END_PROC

Place START_PROC after the label introducing a function.

Place END_PROC after the last ret of the function.

These resolve to: .cfi_startproc and .cfi_endproc respectively.

MIN and MAX

Handy, more readable macros for determining minima and maxima. Note that each macro performs a cmp, which subtracts src_b from src_a (discarding the result) in order to set the flags that the following csel interprets.

Signature:

MIN     src_a, src_b, dest

The smaller of src_a and src_b is put into dest.

Signature:

MAX     src_a, src_b, dest

The larger of src_a and src_b is put into dest.
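
In C terms, the cmp followed by csel pair behaves like a conditional expression. The sketch below assumes a signed comparison on 64-bit values, which the text above does not state; the names min_i64 and max_i64 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* x registers hold 64-bit values, so int64_t is the matching C type. */
static_assert(sizeof(int64_t) == 8, "int64_t must be 8 bytes wide");

/* Equivalent of MIN src_a, src_b, dest (signed comparison assumed). */
int64_t min_i64(int64_t a, int64_t b) {
    return (a < b) ? a : b;   /* compare, then select one operand */
}

/* Equivalent of MAX src_a, src_b, dest (signed comparison assumed). */
int64_t max_i64(int64_t a, int64_t b) {
    return (a > b) ? a : b;
}
```

A compiler targeting AArch64 will typically turn these into the same compare-and-select pattern, without a branch.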

MOD

The MOD macro used above is defined as:

.macro  MOD         src_a, src_b, dest, scratch
sdiv \scratch, \src_a, \src_b
msub \dest, \scratch, \src_b, \src_a
.endm
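
The two instructions compute the remainder as a - (a / b) * b. A C rendering of the same arithmetic, as a sketch with the hypothetical name mod_via_sdiv_msub, looks like this. Unlike the sdiv instruction, which yields 0 when dividing by zero, division by zero is undefined in C, so the sketch asserts against it.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the two instructions of the MOD macro:
     sdiv scratch, a, b        scratch = a / b  (truncates toward zero)
     msub dest, scratch, b, a  dest    = a - scratch * b              */
int64_t mod_via_sdiv_msub(int64_t a, int64_t b) {
    assert(b != 0);               /* C leaves division by zero undefined */
    int64_t scratch = a / b;      /* sdiv */
    return a - scratch * b;       /* msub */
}
```

Because the quotient truncates toward zero, the result has the sign of the dividend, matching C's % operator.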

GLABEL

Marks a label as global, making it available externally.

Signature:

GLABEL label

An underscore is prepended.

CRT

Calling CRT (C runtime) functions
If you create your own function without an underscore, just call it as usual.
If you need to call a function such as those found in the C runtime library, use this macro in this way:

CRT     strlen

MAIN

Declaring main()
Put MAIN on a line by itself. Notice there is no colon.

errno

The externally defined errno is accessed via a CRT function which isn’t seen when coding in C and C++. The function is named differently on Mac versus Linux. To get the address of errno use:

ERRNO_ADDR

This macro makes the correct CRT call and leaves the address of errno in x0.
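
For comparison, the same lookup can be written in C. This sketch assumes the accessor is __error() on macOS and __errno_location() on glibc, both of which are declared by including errno.h; the wrapper name errno_addr is hypothetical, and other C libraries fall back to taking the address of errno directly.

```c
#include <assert.h>
#include <errno.h>

/* Returns the address of the calling thread's errno. */
int *errno_addr(void) {
#if defined(__APPLE__)
    return __error();            /* accessor used by macOS */
#elif defined(__GLIBC__)
    return __errno_location();   /* accessor used by glibc on Linux */
#else
    return &errno;               /* fallback for other C libraries */
#endif
}
```

In C the errno macro hides this call, which is why the function is never seen in ordinary C or C++ source.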

Loads and Stores

GLD_PTR

Loads the address of a label and then dereferences it. On Apple the label is in the global space; on Linux it is a relatively close label.

Signature:

GLD_PTR     xreg, label

When this macro finishes, the specified x register contains the 64-bit value that lives at the specified label.

GLD_ADDR

Loads the address of the label into the specified x register. No dereferencing takes place. On Apple machines, the label will be found in the global space.

Signature:

GLD_ADDR    xreg, label

When this macro completes, the address of the label is in the x register.

LLD_ADDR

Similar to GLD_ADDR this macro loads the address of a “local” label.

Signature:

LLD_ADDR xreg, label

When this macro completes, the address of the label is in the x register.

LLD_DBL

Signature:

LLD_DBL xreg, dreg, label

When this macro completes, a double that lives at the specified local label will sit in the specified double register.

LLD_FLT

Signature:

LLD_FLT xreg, sreg, label

When this macro completes, a float that lives at the specified local label will sit in the specified single precision register.

performance

Undoing Stack Pointer Changes

A small tip concerning undoing changes to the stack pointer. You might think that changes to the stack made by str or stp and their cousins must be undone with ldr or ldp and their cousins.

This depends.

If you need to get back the original contents of a register pushed onto the stack, then an ldr or ldp is appropriate. However, if you don’t need to get the original contents of a register back, then it is faster to undo a change to the stack using addition.

Take for example the use of printf(). On Apple Silicon systems, you must pass the variadic arguments to printf() by pushing them onto the stack. However, when printf() completes, you have no need for the values that you pushed. As shown above, simply add the right amount (a multiple of 16) to the stack pointer. This is faster because the addition makes no reference to RAM (or caches), as the ldr would.

other stuff

let the assembler itself calculate the length for you

        .global main
        .align  2
        .text

main:   str     x30, [sp, -16]!
        mov     w0, 1               // stdout
        ldr     x1, =s              // pointer to string
        ldr     x2, =ssize          // pointer to computed length
        ldr     w2, [x2]            // actual length of string
        bl      write

        ldr     x0, =fmt
        ldr     x1, =s
        ldr     x2, =ssize
        ldr     w2, [x2]
        bl      printf

        ldr     x30, [sp], 16
        mov     w0, wzr
        ret

        .data

s:      .asciz  "Hello, World!\n"
ssize:  .word   ssize - s - 1       // accounts for null at end
fmt:    .asciz  "str: %slen: %d\n"  // accounts for newline

        .end
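
The C counterpart of this trick is to let the compiler compute the length with sizeof. The sketch below uses the hypothetical names hello, HELLO_LEN and hello_length; like the assembler expression ssize - s - 1, it subtracts 1 for the terminating null.

```c
#include <assert.h>
#include <stddef.h>

/* The same string as in the assembly example. */
const char hello[] = "Hello, World!\n";

/* sizeof counts the terminating null, so subtract 1,
   just as the assembler expression does. */
enum { HELLO_LEN = sizeof(hello) - 1 };

/* Length computed at compile time; no strlen call at run time. */
size_t hello_length(void) {
    return HELLO_LEN;
}
```

This only works when hello is an array; applied to a char pointer, sizeof would give the size of the pointer instead.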

atomic operations

Load Linked, Store Conditional

        .text
        .p2align 2

#if defined(__APPLE__)
        .global _LoadLinkedStoreConditional
_LoadLinkedStoreConditional:
#else
        .global LoadLinkedStoreConditional
LoadLinkedStoreConditional:
#endif
1:      ldaxr   w1, [x0]
        add     w1, w1, 1
        stlxr   w2, w1, [x0]
        cbnz    w2, 1b
        ret

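LL/SC is an optimistic concurrency control mechanism. Its logic is roughly:

  • Load-Linked (LDAXR): loads the value at an address and "watches" whether that address gets modified.
    You are then free to modify the loaded value (for example, add 1 to it).

  • Store-Conditional (STLXR): attempts to write the new value back. If nobody else has changed the contents of the address in the meantime, the store succeeds; otherwise it fails.
    Success or failure is reported through the status result of STLXR (0 means success, non-zero means failure).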
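
The same optimistic read, modify, try-to-store, retry pattern can be expressed in portable C11 with a compare-and-swap loop. This is an analogue of LL/SC rather than the same mechanism, and the function name increment_with_retry is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomically increments *p and returns the new value.
   Optimistic pattern: read, compute, try to publish, retry on conflict. */
int increment_with_retry(atomic_int *p) {
    int seen = atomic_load(p);
    /* On failure, compare_exchange_weak stores the current value of *p
       into `seen`, so the next attempt starts from fresh data. */
    while (!atomic_compare_exchange_weak(p, &seen, seen + 1)) {
        /* another thread changed *p in between: try again */
    }
    return seen + 1;
}
```

The weak variant is allowed to fail spuriously, which maps naturally onto a store-conditional that can fail even without real contention.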

llsc

Implementations of operations on atomic variables were improved in the second version of ARMv8, called ARMv8.1. The load linked and store conditional instructions are still available but several new instructions were added which perform certain operations such as addition, subtraction and various bitwise operations in a single atomic instruction.

For example:

        mov     w1, 1
        ldaddal w1, w0, [x0]

This does the same work of atomically adding one to the value in memory pointed to by x0.
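
In C11 the corresponding operation is an atomic fetch-and-add. The sketch below uses the hypothetical name fetch_and_increment; a compiler targeting ARMv8.1 or later may lower it to a single ldaddal, while on plain ARMv8.0 it would typically become a load linked / store conditional loop.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomically adds 1 to *p and returns the value it held beforehand. */
int fetch_and_increment(atomic_int *p) {
    /* acq_rel ordering corresponds to the "al" suffix on ldaddal */
    return atomic_fetch_add_explicit(p, 1, memory_order_acq_rel);
}
```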



spin-lock

Here is the source code for a spin-lock on ARMv8.

Lock

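Lock:
        START_PROC
        mov     w3, 1               // value to store: 1 means "locked"
1:      ldaxr   w1, [x0]            // exclusive load; marks the address for exclusive access
        cbnz    w1, 1b              // lock is non-zero (held by someone else): keep spinning
        stlxr   w2, w3, [x0]        // attempt the exclusive store; w2 = 0 on success
        cbnz    w2, 1b              // store failed (contention): keep spinning
        ret
        END_PROC

stlxr: if the exclusive tag is still valid (nobody else has taken the lock), the value of w3 is written to *x0 and the status is placed in w2 (0 means success).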

  1. ldaxr dereferences the lock itself (once again an int32_t) and marks the location of the lock as being, hopefully, exclusive.
  2. Having gotten the value of the lock, its value is inspected and if found to be non-zero, we branch back to attempting to get it again - this is the spin.
  3. If the contents of the lock are 0, the lock is free and we try to claim it by storing a non-zero value. The value 1 is placed in w3 before the loop so that it can be used directly by the stlxr.
  4. stlxr w2, w3, [x0] conditionally stores the new value back to the location of the lock. If the stlxr writes 0 to w2, we got the lock. If not, we start over - somebody else got in there ahead of us. Perhaps this happened because we were descheduled. Perhaps we lost the lock to another thread running on a different core.

unlock

Unlock:
        START_PROC
        str     wzr, [x0]           // write 0 to release the lock
        dmb     ish                 // memory barrier: synchronize across cores
        ret
        END_PROC

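  1. All it does is set the value of the lock to zero. The correct operation of the lock requires that no bad actor simply stomps on the lock by calling Unlock without first owning the lock. Just say no to lock stompers.

  2. dmb ish sets up a data memory barrier across the inner shareable domain - it makes sure threads running on different cores see the update correctly. This code seemed to work without this line but intuition suggests it could be important. In Lock() the ldaxr and stlxr instructions carry implied acquire and release ordering. Note that a barrier placed after the store does not order earlier writes from the critical section ahead of the store; placing the dmb before the str, or using stlr wzr, [x0], is the more conventional way to get release semantics here.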

Summary (from a pseudocode perspective)
🔒 Lock(x0):

for (;;) {
    w1 = load_exclusive(x0);                 // atomic exclusive load
    if (w1 != 0) continue;                   // lock is held: spin
    if (store_exclusive(x0, 1) == 0) break;  // 0 means the store succeeded
}                                            // otherwise someone else beat us

🔓 Unlock(x0):
*x0 = 0;       // unlock
dmb(ISH); // ensure all cores see the update
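
For reference, the same lock can be sketched in portable C11 atomics. This is an illustrative equivalent under stated assumptions, not the author's code: the type sketch_lock_t and the functions spin_trylock, spin_lock and spin_unlock are hypothetical names, and acquire ordering is used when taking the lock with release ordering when dropping it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_int sketch_lock_t;   /* 0 = free, 1 = held */

/* One attempt to take the lock. Returns true when we now own it. */
bool spin_trylock(sketch_lock_t *l) {
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        l, &expected, 1,
        memory_order_acquire,    /* on success: later accesses stay after the lock */
        memory_order_relaxed);   /* on failure: no ordering needed */
}

/* Spin until the lock is ours. */
void spin_lock(sketch_lock_t *l) {
    while (!spin_trylock(l)) {
        /* lock is held by someone else: keep spinning */
    }
}

/* Release the lock. Only the current owner should call this. */
void spin_unlock(sketch_lock_t *l) {
    /* release: writes made while holding the lock become visible first */
    atomic_store_explicit(l, 0, memory_order_release);
}
```

A caller would declare a sketch_lock_t initialized to 0, then bracket each critical section with spin_lock and spin_unlock.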

Reference