prerequisites

1 byte is 8 bits.

A char is 1 byte.
A short is 2 bytes.
An int is 4 bytes.

gcc -E hello.c -o hello.i
gcc -S hello.i -o hello.s
gcc -c hello.s -o hello.o
gcc hello.o -o hello

All AARCH64 instructions are 4 bytes in width.
All AARCH64 pointers are 8 bytes in width.

Although pointers are 8 bytes wide, Linux systems typically use only the lower 39, 42 or 48 bits of an address, i.e. the virtual address space of an ARM Linux process is smaller than 64 bits. The upper bits of a user-space address are zero when the address is viewed as an 8-byte value.

register

register access speed

Latency

If we liken accessing a register (which can be done at least once per CPU clock cycle) to waiting one second, then accessing RAM would be like waiting 3.5 to 5.5 minutes.

register type

rn means register “of some type” number n.

The kind of register is specified by a letter. Which register within a given type is specified by a number. There are some exceptions to this. Here is an introductory summary:

Letter	Type
x	64 bit integer or pointer
w	32 bit or smaller integer
d	64 bit floats (doubles)
s	32 bit floats

Some register types have been left out.

(See the Cortex-A Series Programmer’s Guide for ARMv8-A, Chapter 9.1.)

x29 is the frame pointer (FP).
x30 is the link register (LR, which holds the return address).

The registers used for floating point types (and vector operations) are coincident:

q registers are a massive 16 bytes wide - quad words. (qn is an alias of vn and is used mainly in SIMD/NEON instructions.)
v registers are also 16 bytes wide and are synonyms for the q registers.
d registers are for doubles, which are 8 bytes wide - double precision. 2 per v.
s registers are for floats, which are 4 bytes wide - single precision. 4 per v.
h registers are for half-precision floats, which are 2 bytes wide. 8 per v.
b registers are for byte operations. 16 per v.

register and C type

Integers

This declares an integer	This IS an integer
char	wn
short	wn
int	wn
long	xn

Pointers

This declares a pointer	This IS a pointer
type *	xn

All pointers are stored in x registers. X registers are 64 bits long, but many operating systems do not support a full 64 bit address space because the bookkeeping for an address space that large would itself use a lot of memory. Instead, operating systems typically provide 48 to 52 bit address spaces.

Floating Point

This declares a float	This IS a float
`float`	`sn`
`double`	`dn`
`__fp16` (half)	`hn`

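vn is the name of the actual physical register. It is the recommended name and supports the widest range of accesses (floating point and SIMD).

qn is an alias of vn and is used mainly in SIMD/NEON instructions (SIMD = Single Instruction, Multiple Data).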

instructions

prerequisites

EVERY AARCH64 instruction is 4 bytes wide. Everything the CPU needs to know about what the instruction is and what variation it might be plus what data it will use will be found in those 4 bytes.

Most (but not all) AARCH64 instructions have three operands. These are read in the following way:

op ra, rb, rc

means:

ra = rb op rc

examples:

sub x0, x0, x1    // means x0 = x0 - x1
mov x0, x1        // means x0 = x1

[ ]

The [ and ] serve the same purpose as the asterisk in C and C++: they indicate “dereference.” They mean: use what is inside the brackets as an address for going out to memory.

When a ! follows the closing bracket, for example:

stp     x21, x30, [sp, -16]!

stp     x29, x30, [sp, -16]!

the exclamation point means that the stack pointer is changed (i.e. the -16 is applied to it) before its value is used as the address in memory to which the registers are copied. This is a pre-decrement.

The second instruction means:

sp = sp - 16 (the stack pointer moves down by 16 bytes)
store x29 at [sp] and x30 at [sp + 8]

The matching restore is:

ldp x29, x30, [sp], 16

which means:

load 8 bytes from [sp] into x29 and 8 bytes from [sp + 8] into x30
sp = sp + 16 (release the stack frame space)

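On AArch64 the stack pointer must be 16-byte aligned whenever it is used to access memory, so in practice it is only adjusted in multiples of 16.

x29 is the frame pointer register, but a function is not always required to save it.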
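The pre-decrement store and post-increment load described above can be sketched in C. This is only a model under stated assumptions: the `Stack` type and the `stack_init`, `push_pair` and `pop_pair` helpers are hypothetical names invented for illustration, and the stack is a small array of 8-byte slots rather than real process memory.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a downward-growing stack made of 8-byte slots. */
typedef struct {
    uint64_t mem[8];   /* 64 bytes of pretend stack memory */
    uint64_t *sp;      /* pretend stack pointer */
} Stack;

/* An empty stack points one past the highest slot. */
static void stack_init(Stack *s) { s->sp = s->mem + 8; }

/* Models: stp a, b, [sp, -16]!   (pre-index: adjust sp, then store) */
static void push_pair(Stack *s, uint64_t a, uint64_t b)
{
    assert(s->sp - s->mem >= 2);  /* need room for 16 bytes */
    s->sp -= 2;                   /* sp = sp - 16 */
    s->sp[0] = a;                 /* [sp]     = a */
    s->sp[1] = b;                 /* [sp + 8] = b */
}

/* Models: ldp a, b, [sp], 16     (post-index: load, then adjust sp) */
static void pop_pair(Stack *s, uint64_t *a, uint64_t *b)
{
    *a = s->sp[0];                /* a = [sp]     */
    *b = s->sp[1];                /* b = [sp + 8] */
    s->sp += 2;                   /* sp = sp + 16 */
}
```

Moving two 8-byte slots at a time keeps the modeled stack pointer on a 16-byte boundary, which mirrors why the real code saves registers in pairs.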

memory access

ldr

load register

ldr    x0, [sp]   // load 8 bytes from address specified by sp
ldr    w0, [sp]   // load 4 bytes from address specified by sp
ldrh   w0, [sp]   // load 2 bytes from address specified by sp
ldrb   w0, [sp]   // load 1 byte  from address specified by sp

Misaligned accesses to RAM can be slower, because the processor may have to split them into several smaller accesses. This can be a significant performance hit, so properly aligned access matters for performance.

str

store register

str    x0, [sp]   // store 8 bytes to address specified by sp
str    w0, [sp]   // store 4 bytes to address specified by sp
strh   w0, [sp]   // store 2 bytes to address specified by sp
strb   w0, [sp]   // store 1 byte  to address specified by sp

Casting between integer types is in some cases accomplished by ANDing with 255 or 65535 (for char and short), or by relying on the following behavior:

Whenever a narrower portion of a register is written to, the remainder of the register is zeroed out. That is: ldrb overwrites the least significant byte of an x register and zeros out the upper 7 bytes.
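A minimal C sketch of the two ideas above: narrow loads that zero-extend, and narrowing casts done by masking. The helper names (`load_byte`, `load_half`, `to_uchar`, `to_ushort`) are made up for this illustration, and the sketch covers unsigned values only; signed types need sign extension instead (for example ldrsb or sxtb).

```c
#include <stdint.h>
#include <string.h>

/* Models: ldrb w0, [addr]. One byte is read and the other 7 bytes of the
   64-bit result are zero. */
static uint64_t load_byte(const void *addr)
{
    uint8_t b;
    memcpy(&b, addr, sizeof b);
    return b;                     /* implicit zero extension to 64 bits */
}

/* Models: ldrh w0, [addr]. Two bytes are read, the rest is zero. */
static uint64_t load_half(const void *addr)
{
    uint16_t h;
    memcpy(&h, addr, sizeof h);
    return h;
}

/* Models: and x0, x0, 255. Keeps only the low 8 bits. */
static uint64_t to_uchar(uint64_t x)  { return x & 255; }

/* Models: and x0, x0, 65535. Keeps only the low 16 bits. */
static uint64_t to_ushort(uint64_t x) { return x & 65535; }
```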

ldp

Load pair: the same as ldr, but loads a pair of values.

stp

Store pair: the same as str, but stores a pair of values.

offsets


1) LDR Xt, [Xn|SP{, #pimm}] ; 64-bit general registers
2) LDR Xt, [Xn|SP], #simm ; 64-bit general registers, Post-index
3) LDR Xt, [Xn|SP, #simm]! ; 64-bit general registers, Pre-index

simm can be in the range of -256 to 255 (a 9-bit signed value).
pimm can be in the range of 0 to 32760 in multiples of 8.

three patterns

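offset mode

LDR Xt, [Xn, #pimm]

Loads data into Xt from the address Xn + pimm; the address register Xn is unchanged.

pimm is an unsigned (non-negative) immediate; it must be a multiple of 8 and can be at most 32760.

post-index mode

LDR Xt, [Xn], #simm

First loads data into Xt using the original value of Xn as the address, then adds simm to Xn; the address register Xn changes.

pre-index mode

LDR Xt, [Xn, #simm]!

Loads data into Xt from the address Xn + simm and writes that updated address back to Xn; the address register Xn changes.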
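The three addressing patterns can be modeled in C as follows. This is a sketch, not how the hardware is implemented: the `Base` type stands in for the register Xn, and `ldr_offset`, `ldr_post` and `ldr_pre` are hypothetical helper names. Offsets are in bytes, as in the assembly, and are assumed to be multiples of 8.

```c
#include <stdint.h>

/* Hypothetical stand-in for a base register Xn that points at 64-bit data. */
typedef struct { const uint64_t *xn; } Base;

/* Adds a byte offset to a pointer, like address arithmetic on a register. */
static const uint64_t *add_bytes(const uint64_t *p, long bytes)
{
    return (const uint64_t *)((const char *)p + bytes);
}

/* LDR Xt, [Xn, #off]    offset mode: Xn is not modified */
static uint64_t ldr_offset(const Base *b, long off)
{
    return *add_bytes(b->xn, off);
}

/* LDR Xt, [Xn], #simm   post-index: load from Xn, then Xn += simm */
static uint64_t ldr_post(Base *b, long simm)
{
    uint64_t value = *b->xn;
    b->xn = add_bytes(b->xn, simm);
    return value;
}

/* LDR Xt, [Xn, #simm]!  pre-index: Xn += simm, then load from Xn */
static uint64_t ldr_pre(Base *b, long simm)
{
    b->xn = add_bytes(b->xn, simm);
    return *b->xn;
}
```

Post-index is the pattern behind `*(p++)` in C, and pre-index corresponds to `*(++p)`.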

pseudo instruction

ldr x1, =label

The assembler puts the address of the label into a special region of memory called a “literal pool.” What matters is that this region of memory is placed immediately after (and therefore near) your code.
Then the assembler computes the difference between the address of the current instruction (the ldr itself) and the address of the entry in the literal pool that holds the label's address.
The assembler generates a different ldr instruction, which uses that difference (or offset) relative to the program counter (pc). The pc is none other than the address of the current instruction.
Because the literal pool for your code is located near your code, the offset from the current instruction to the data in the pool is a relatively small number, small enough to fit inside a four-byte ldr instruction.

ldr x1, [pc, offset to data in literal pool]

A downside of this approach is that the literal pool, from which the address is loaded, resides in RAM. This means each of these ldr pseudo instructions incurs a memory reference.

literal pool

compare

ldr x1, =q
ldr x1, q

aarch64

        .global     main       // expose main to linker                                        
        .text                  // begin to write code                 
        .align      2          // align the code to a 4-byte boundary (2^2)
                                                                       
main:   str         x30, [sp, -16]!                                     
                                                                       
        ldr         x0, =fmt          
        ldr         x1, =q                     
        ldr         x2, [x1]                 
        bl          printf               
                                                                       
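        ldr         x0, =fmt
        ldr         x1, q             // x1 = the value stored at q, not its address
        ldr         x2, [x1]          // uses that value as an address, so this load will most likely fault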
        bl          printf            
                                    
        ldr         x30, [sp], 16           
        mov         w0, wzr                                             
        ret                                                             
                                                                       
        .data                                                           
q:      .quad       0x1122334455667788                                 
fmt:    .asciz      "address: %p value: %lx\n"                         
                                                                       
        .end

Disassembling the binary machine code:

0000000000007a0 <main>:
 7a0:   f81f0ffe   str  x30, [sp, #-16]!
 7a4:   58000160   ldr  x0, 7d0 <main+0x30>
 7a8:   58000181   ldr  x1, 7d8 <main+0x38>
 7ac:   f9400022   ldr  x2, [x1]
 7b0:   97ffffb4   bl   680 <printf@plt>
 7b4:   580000e0   ldr  x0, 7d0 <main+0x30>
 7b8:   580842c1   ldr  x1, 11010 <q>
 7bc:   f9400022   ldr  x2, [x1]
 7c0:   97ffffb0   bl   680 <printf@plt>
 7c4:   f84107fe   ldr  x30, [sp], #16
 7c8:   2a1f03e0   mov  w0, wzr
 7cc:   d65f03c0   ret

and


000000000011010 <q>:
   11010:   55667788
   11014:   11223344

It says 000000000011010 <q>:. This means that what comes next is the data corresponding to what is labeled q in our source code. Notice the relocatable address of 11010. We will explain “relocatable address” below.
Now, look at the disassembled code on the line beginning with 7b8. It reads ldr x1, 11010. So the disassembled executable is saying “go to address 11010 and fetch its contents” which are our 1122334455667788.

Instruction	Meaning
ldr r, =label	Load the address of the label into r
ldr r, label	Load the value found at the label into r
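In C terms, the difference between the two forms is the difference between `&q` and `q`. The sketch below is only an analogy: `load_address_of_q` and `load_value_of_q` are hypothetical names, and the variable mirrors the `q: .quad` definition from the listing.

```c
#include <stdint.h>

/* Mirrors: q: .quad 0x1122334455667788 */
static uint64_t q = UINT64_C(0x1122334455667788);

/* Analogy for: ldr x1, =q   (x1 receives the ADDRESS of q) */
static uint64_t *load_address_of_q(void) { return &q; }

/* Analogy for: ldr x1, q    (x1 receives the VALUE stored at q) */
static uint64_t load_value_of_q(void) { return q; }
```

After `ldr x1, =q` a further `ldr x2, [x1]` is needed to reach the data, just as the pointer returned by `load_address_of_q` has to be dereferenced.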

relocation of address when executing

None of the addresses we have seen so far are the final addresses that will be used once the program is actually running. All addresses will be relocated.

One reason for this is a guard against malware. A technique called Address Space Layout Randomization (ASLR) prevents malware writers from being able to know ahead where to modify your executable in order to accomplish their nefarious purposes.

64 bit ARM Linux kernels allocate 39, 42 or 48 bits for the size of a process’s virtual address space. Notice 42 and 48 bit values require 6 bytes to hold them. A virtual address space is all of the addresses a process can generate / use. Further, all addresses used by processes are virtual addresses.

Using adrp followed by add avoids the literal pool:

adrp x0, s
add  x0, x0, :lo12:s

examples

loading (storing) various sizes of integers

Instruction	Meaning
`ldr x0, [x1]`	Fetches a 64 bit value from the address specified by `x1` and places it in `x0`
`ldr w0, [x1]`	Fetches a 32 bit value from the address specified by `x1` and places it in `w0`
`ldrh w0, [x1]`	Fetches a 16 bit value from the address specified by `x1` and places it in `w0`
`ldrb w0, [x1]`	Fetches an 8 bit value from the address specified by `x1` and places it in `w0`

Pointers and longs use x registers.
All other integer sizes use w registers where the instruction itself specifies the size.

array indexing

long Sum(long * values, long length)   
{                                          
    long sum = 0;                          
    for (long i = 0; i < length; i++)            
    {                                           
        sum += values[i];                         
    }                                                   
    return sum;                                            
}

Notice we’re using the index variable i for nothing more than traipsing through the array. This is fantastically inefficient (in this case).

long Sum(long * values, long length)         
{                                     
    long sum = 0;                           
    long * end = values + length;                   
    while (values < end)                     
    {                                              
        sum += *(values++);                            
    }                                                
    return sum;                                           
}

Notice we don’t use an index variable any longer. Instead, we use the pointer itself for both the dereferencing and to tell us when to stop the loop.

    .global Sum                                           
    .text                                                
    .align  4                                           

//  x0 is the pointer to data                         
//  x1 is the length and is reused as `end`           
//  x2 is the sum                                  
//  x3 is the current dereferenced value                    

Sum:                                                     
    mov     x2, xzr              // x2 = 0                     
    add     x1, x0, x1, lsl 3    //  x1 = x0+x1*8              
    b       2f                                   

1:  ldr     x3, [x0], 8                          
    add     x2, x2, x3                             
2:  cmp     x0, x1                               
    blt     1b                                        

    mov     x0, x2                                  
    ret                                             

    .end

faster memory copy

Suppose you needed to copy 16 bytes of memory from one place to another. You might do it like this:

void SillyCopy16(uint8_t * dest, uint8_t * src)
{
    for (int i = 0; i < 16; i++)
        *(dest++) = *(src++);
}

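This is especially silly: why go through 16 loop iterations when you could simply write:

#include <stdint.h>

void SillyCopy16(uint64_t * dest, uint64_t * src)
{
    *(dest++) = *(src++); // copy the first 8 bytes
    *dest = *src;         // copy the second 8 bytes
}

in aarch64 (x0 holds dest and x1 holds src)

SillyCopy16:
    ldr    x2, [x1], 8    // load the first 8 bytes from src, then advance src
    str    x2, [x0], 8    // store them to dest, then advance dest
    ldr    x2, [x1]       // load the second 8 bytes from src
    str    x2, [x0]       // store them to dest
    ret

using ldp

SillyCopy16:
    ldp    x2, x3, [x1]   // load 16 bytes from src
    stp    x2, x3, [x0]   // store 16 bytes to dest
    ret

using q register

SillyCopy16:
    ldr    q2, [x1]       // load 16 bytes from src
    str    q2, [x0]       // store 16 bytes to dest
    ret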

indexing through an array of struct

#include <stdio.h>                                       

struct Person                                       
{                                                   
    char * fname;                                
    char * lname;                                      
    int age;                                        
};                                                         

extern int rand();                                          
extern struct Person * FindOldestPerson(struct Person *, int); 

struct Person * OriginalFindOldestPerson(struct Person * people, int length)
{                                                     
    int oldest_age = 0;                        
    struct Person * oldest_ptr = NULL;               

    if (people)                                        
    {                                                     
        struct Person * end_ptr = people + length;       
        while (people < end_ptr)                           
        {                                                   
            if (people->age > oldest_age)             
            {                                      
                oldest_age = people->age;            
                oldest_ptr = people;                   
            }                               
            people++;                          
        }                                         
    }                                             
    return oldest_ptr;                             
}                                                  

#define LENGTH  20                                

int main()                                            
{                                                   
    struct Person array[LENGTH];                          
    for (int i = 0; i < LENGTH; i++)                     
    {                                               
        array[i].age = rand() % 5000;                   
    }                                                       
    struct Person * oldest = FindOldestPerson(array, LENGTH);
    for (int i = 0; i < LENGTH; i++)                  
    {                                                   
        printf("%d", array[i].age);               
        if (oldest == &array[i])                
            printf("*");                           
        printf("\n");                                 
    }                                                  
}

The extern declaration of FindOldestPerson tells us that somewhere else there is a function with that name. That function must have a .global directive specifying the same name so that the linker can resolve the reference to FindOldestPerson.

gcc with -O2 or -O3 optimization rendered OriginalFindOldestPerson() into 18 lines of assembly language.

        .global FindOldestPerson
        .text
        .align  2

//  x0  has struct Person * people
//      will be used for oldest_ptr as this is the return value
//  w1  has int length
//  w2  used for oldest_age
//  x3  used for Person *
//  x4  used for end_ptr
//  w5  used for scratch

FindOldestPerson:
        cbz     x0, 99f             // short circuit
        mov     w2, wzr             // initial oldest age is 0
        mov     x3, x0              // initialize loop pointer
        mov     x0, xzr             // initialize return value
        mov     w5, 24              // struct is 24 bytes wide
        smaddl  x4, w1, w5, x3      // initialize end_ptr
        b       10f                 // enter loop

1:      ldr     w5, [x3, p.age]     // fetch loop ptr -> age
        cmp     w2, w5              // compare oldest_age to it
        csel    w2, w2, w5, ge      // keep oldest_age unless the new age is greater
        csel    x0, x0, x3, ge      // same for oldest_ptr (ge matches the strict > in the C code)
        add     x3, x3, 24          // increment loop ptr
10:     cmp     x3, x4              // has loop ptr reached end_ptr?
        blt     1b                  // no, not yet

99:     ret

        .data
        .struct 0
p.fn:   .skip   8
p.ln:   .skip   8
p.age:  .skip   4
p.pad:  .skip   4

        .end
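The offsets named p.fn, p.ln and p.age in the assembly, and the hard-coded stride of 24, come from how the C compiler lays out struct Person. A small sketch, assuming a typical 64-bit target with 8-byte pointers and a 4-byte int, shows how to ask the compiler for those numbers instead of counting by hand. The enum names and the `end_ptr` helper are hypothetical.

```c
#include <stddef.h>

struct Person {
    char *fname;
    char *lname;
    int   age;
};

/* Layout facts the assembly version relies on (hypothetical names). */
enum {
    P_FN   = offsetof(struct Person, fname),  /* p.fn,  0 on a typical 64-bit target */
    P_LN   = offsetof(struct Person, lname),  /* p.ln,  8 with 8-byte pointers       */
    P_AGE  = offsetof(struct Person, age),    /* p.age, 16 with 8-byte pointers      */
    P_SIZE = sizeof(struct Person)            /* 24 once trailing padding is added   */
};

/* What the smaddl computes: people + length * sizeof(struct Person).
   C pointer arithmetic applies the scaling by the struct size implicitly. */
static const struct Person *end_ptr(const struct Person *people, int length)
{
    return people + length;
}
```

The 4 bytes of p.pad exist so that every element of an array of struct Person starts on an 8-byte boundary, which keeps the two pointers aligned.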

control flow

cmp

compare

cmp subtracts its second operand from its first and discards the result of the subtraction, but keeps a record of whether the result was less than, equal to or greater than zero by setting the condition flags.

br

Branch to Register

br <register>

An unconditional jump, similar to

goto *(ptr)

ble

Branch if less than or equal

bl

Branch with Link

Jumps to the address of a function (subroutine) and saves the return address in register x30 (also called lr, the Link Register).

cbz

Compare and Branch if Zero

cbz <register>, <label>

If the value in <register> is 0, branch to <label>.

Otherwise, continue with the next instruction.

csel

Conditional Select

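csel <dest>, <src1>, <src2>, <condition>

If <condition> holds, <dest> receives the value of <src1>;

otherwise <dest> receives the value of <src2>.

examples:

cmp  w2, w5
csel w2, w2, w5, gt    // if w2 > w5, w2 keeps its value; otherwise it becomes w5

This is a branchless conditional assignment, which is often more efficient than an if-else built from branches.

this is equal to

w2 = (w2 > w5) ? w2 : w5;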

calculate

shift operations

lsl

Logical Shift Left

The LSL instruction performs multiplication by a power of 2.

lsr

Logical Shift Right

The LSR instruction performs unsigned division by a power of 2; the vacated high bits are filled with zeros.

asr

Arithmetic Shift Right

The ASR instruction performs signed division by a power of 2: the vacated high bits are filled with copies of the sign bit, so the result rounds toward negative infinity.

ror

rotate right

The ROR instruction performs a bitwise rotation to the right: the bits rotated out of the least significant bit (LSB) wrap around into the most significant bit (MSB).
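The four shift operations above can be modeled on 64-bit values in C. This is a sketch: the function names simply reuse the mnemonics, and `asr` is written so that it does not depend on how a particular compiler shifts negative numbers.

```c
#include <assert.h>
#include <stdint.h>

/* lsl: shift left, same as multiplying by 2^n (modulo 2^64) */
static uint64_t lsl(uint64_t x, unsigned n) { assert(n < 64); return x << n; }

/* lsr: shift right with zero fill, same as unsigned division by 2^n */
static uint64_t lsr(uint64_t x, unsigned n) { assert(n < 64); return x >> n; }

/* asr: shift right keeping the sign, same as division by 2^n rounded
   toward negative infinity. For negative x, ~x is non-negative, so the
   inner shift is well defined. */
static int64_t asr(int64_t x, unsigned n)
{
    assert(n < 64);
    return x < 0 ? ~(~x >> n) : x >> n;
}

/* ror: bits that fall off the right end re-enter at the left end */
static uint64_t ror(uint64_t x, unsigned n)
{
    n &= 63;                                   /* rotation amount is modulo 64 */
    return n ? (x >> n) | (x << (64 - n)) : x;
}
```

Note that `asr(-7, 1)` gives -4, while the C expression `-7 / 2` gives -3, because C division truncates toward zero.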

bit manipulation

mvn

mvn (Move Not) writes the bitwise NOT of its operand to the destination register.

orr

orr (bitwise inclusive OR) performs a bitwise OR of its two operands and writes the result to the destination register.

bfi

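bfi (Bitfield Insert) copies the low bits of a source register into a chosen bit range of the destination register and leaves the destination's other bits unchanged.

bfi <Xd>, <Xn>, #<lsb>, #<width>

<Xd>: destination register (the result is written here)

<Xn>: source register (its low bits supply the value)

<lsb>: the bit position in the destination register where insertion starts (the least significant bit of the field)

<width>: how many bits to insert

Suppose:

Xd = 0b1111 0000
Xn = 0b1011 (looking only at its low 4 bits)
lsb = 1
width = 3

Execute:

bfi Xd, Xn, #1, #3

Result:
the low 3 bits of Xn (011) are inserted into bits 1 through 3 of Xd, replacing what was there
so Xd = 0b1111 0110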
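The worked example above can be checked with a small C model of bfi. The function below is a sketch of the instruction's effect on 64-bit values, not a description of how the hardware does it; the parameter names follow the operands in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Models: bfi Xd, Xn, #lsb, #width
   Returns the new value of Xd: bits lsb..lsb+width-1 come from the low
   bits of xn, every other bit keeps its old value from xd. */
static uint64_t bfi(uint64_t xd, uint64_t xn, unsigned lsb, unsigned width)
{
    assert(width >= 1 && lsb + width <= 64);
    /* width ones in the low bits; special-case 64 to avoid shifting by 64 */
    uint64_t field = (width == 64) ? ~UINT64_C(0)
                                   : (UINT64_C(1) << width) - 1;
    uint64_t mask = field << lsb;              /* the bits to be replaced */
    return (xd & ~mask) | ((xn << lsb) & mask);
}
```

With the values from the text, `bfi(0xF0, 0xB, 1, 3)` yields 0xF6, which is 0b1111 0110.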

ubfm

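ubfm = Unsigned Bitfield Move

Basic format:

ubfm <dst>, <src>, #lsb, #msb

<dst>: destination register

<src>: source register
lsb: starting bit (low bit index); this is the instruction's immr field
msb: ending bit (high bit index); this is the instruction's imms field

When msb >= lsb, the instruction extracts an unsigned bit field (a run of consecutive bits) from src and places it in the low bits of dst, starting at bit 0, with all other bits cleared. This is the form that the ubfx alias produces. That is:

take the bits of src from bit lsb up to bit msb
right-align that field at bit 0 of dst
clear all other bits of dst

When msb < lsb, the same encoding behaves as ubfiz instead.

Example:

ubfm w1, w2, #8, #15

extracts bits 8 through 15 of w2 (8 bits in total)
and places them in bits 0 through 7 of w1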
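The extraction case (ending bit not below the starting bit) can be modeled in C as a shift followed by a mask. This is a sketch for 32-bit w registers only; `ubfx32` is a hypothetical helper name, chosen because this case corresponds to the ubfx alias.

```c
#include <assert.h>
#include <stdint.h>

/* Models: ubfm Wd, Wn, #lsb, #msb   for the case msb >= lsb.
   Returns bits lsb..msb of src, right-aligned, with all other bits zero. */
static uint32_t ubfx32(uint32_t src, unsigned lsb, unsigned msb)
{
    assert(lsb <= msb && msb < 32);
    unsigned width = msb - lsb + 1;
    /* width ones in the low bits; special-case 32 to avoid shifting by 32 */
    uint32_t field = (width == 32) ? ~UINT32_C(0)
                                   : (UINT32_C(1) << width) - 1;
    return (src >> lsb) & field;
}
```

For instance, with w2 = 0x12345678, extracting bits 8 through 15 gives 0x56.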

ubfiz

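ubfiz (Unsigned Bitfield Insert in Zeros) places the low bits of an unsigned value at a chosen position in the destination register, with every other bit of the destination cleared to zero.
It is an alias of ubfm (Unsigned Bitfield Move) and has similar semantics.

Instruction format:

ubfiz <dst>, <src>, #lsb, #width

Put simply: ubfiz inserts the low width bits of src into dst starting at bit lsb, and clears every other bit.

Where:

<src>: source register (e.g. w1)

<dst>: destination register (e.g. w2); the final result is placed here
lsb: the bit position in the destination where the field starts (counting from 0)
width: the number of bits to insert (counting from the lowest bit of <src>)

All other bits of the destination register are cleared.

Example:

ubfiz w1, w1, #3, #5

This means:

take the lowest 5 bits of w1 (bit 0 through bit 4)
insert them into bits 3 through 7 of the destination register (w1)
clear all other bits of w1 (bits 0 through 2 and bits 8 through 31)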
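A C model of ubfiz is a mask followed by a left shift. As with the other sketches, this only models the result for 32-bit w registers; `ubfiz32` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Models: ubfiz Wd, Wn, #lsb, #width
   Returns the low `width` bits of src moved up to bit `lsb`,
   with every other bit zero. */
static uint32_t ubfiz32(uint32_t src, unsigned lsb, unsigned width)
{
    assert(width >= 1 && lsb + width <= 32);
    /* width ones in the low bits; special-case 32 to avoid shifting by 32 */
    uint32_t field = (width == 32) ? ~UINT32_C(0)
                                   : (UINT32_C(1) << width) - 1;
    return (src & field) << lsb;
}
```

With every bit of the source set, `ubfiz32(0xFFFFFFFF, 3, 5)` leaves only bits 3 through 7 set, which is 0xF8.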

other

adr

Address

adrp

Address of page

    .section .rodata
fmt:
    .asciz "%p a: 0x%lx b: %x c: %x\n"

    .text

    adrp x0, fmt
    add  x0, x0, :lo12:fmt    // :lo12: takes the low 12 bits of fmt's address (the offset within the page) as the immediate

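Purpose: adrp x0, fmt loads into x0 the address of the 4KB-aligned page that contains the symbol fmt.
It ignores the low 12 bits of the symbol's address and keeps only the upper bits.
Example: if fmt is at address 0x400123, then adrp x0, fmt loads 0x400000 into x0.
In other words, adrp rounds the address of fmt down to the nearest 4KB boundary (it clears the low 12 bits).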
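The way adrp and the :lo12: add split an address can be written as two masks in C. This is only the arithmetic; the real adrp computes the page relative to the pc, which this sketch ignores. `page_of` and `lo12` are hypothetical helper names.

```c
#include <stdint.h>

/* What adrp leaves in the register: the address with its low 12 bits cleared. */
static uint64_t page_of(uint64_t addr) { return addr & ~UINT64_C(0xFFF); }

/* What :lo12: supplies to the add: the offset inside the 4KB page. */
static uint64_t lo12(uint64_t addr)    { return addr & UINT64_C(0xFFF); }
```

For the address 0x400123 from the example, `page_of` gives 0x400000 and `lo12` gives 0x123, and adding the two recovers the full address.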

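Why not simply use ldr x0, =fmt?

On ARM64, ldr x0, =fmt may implicitly introduce a literal pool, which is unfriendly to relocatable code, especially with dynamic linking or in a PIE (Position Independent Executable) environment.
adrp + add is the recommended way to write relocatable, PIC-compliant code.
This pattern also works better with the dynamic linker on Linux (ld.so).

Instruction	Meaning	Offset range	Typically used for
`adr`	Computes an address near the current instruction	±1MB	Local jump targets, nearby temporary data, and so on
`adrp`	Computes the 4KB-page-aligned upper part of an address	±4GB (page-aligned offset)	Addresses of global variables, strings, constant tables, and so on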

smaddl

Signed Multiply Add Long

Multiplies two signed 32-bit integers, adds a 64-bit integer, and stores the result in a 64-bit register.

smaddl <Xd>, <Wn>, <Wm>, <Xa>

It performs the following operation:

Xd = (int64_t)(int32_t)Wn * (int64_t)(int32_t)Wm + Xa;

programming

if statement

if

if (a > b)                                                              
{                                                                       
    // CODE BLOCK                                                       
}

in aarch64

    // Assume value of a is in x0                                       
    // Assume value of b is in x1                                       
    cmp     x0, x1                                                      
    ble     1f                                                          
    // CODE BLOCK                                                       
1:

If a > b then x0 - x1 will be greater than zero.

If a == b then x0 - x1 will be equal to zero.

If a < b then x0 - x1 will be less than zero.

ble means branch (a jump or goto) if the previous computation shows less than or equal to zero

a rule of thumb

In the higher level language, you want to enter the following code block if the condition is true.
In assembly language, you want to avoid the following code block if the condition is false.

temporary label

The target of the branch instruction is given as 1f. This is an example of a temporary label.

There are a lot of braces used in C and C++. Since labels frequently function as equivalents to { and }, there can be a lot of labels used in assembly language. Note, however, that a label only marks a position; it does not create a scope.

A temporary label is a label made using just a number. Such labels can appear over and over again (i.e. they can be reused). They are made unique by virtue of their placement relative to where they are being used.

1f looks forward in the code for the next label 1.
1b looks in the backward direction for the most recent label 1.

if / else

if (a > b)                                                          
{                                                                   
    // CODE BLOCK IF TRUE                                           
}                                                          
else 
{                                                                   
    // CODE BLOCK IF FALSE                                         
}

There are two branches built into this code!

in aarch64:

    // Assume value of a is in x0                                       
    // Assume value of b is in x1                                       
    cmp     x0, x1                                                      
    ble     1f                                                          
    // CODE BLOCK IF TRUE                                               
    b       2f                                                         
1:                                                                      
    // CODE BLOCK IF FALSE                                             
2:

a complete example

    .global main                                                       
    .text                                                               
                                                                       
main:                                                                   
    stp     x29, x30, [sp, -16]!                                       
    mov     x1, 10                                                     
    mov     x0, 5                                                       
    cmp     x0, x1                                                     
    ble     1f                                                         
    ldr     x0, =T                     // pseudo instruction
    bl      puts                                                       
    b       2f                                                         

1:  ldr     x0, =F                                                     
    bl      puts                                                       
                                                                       
2:  ldp     x29, x30, [sp], 16                                         
    mov     x0, xzr                                                     
    ret                                                                 
                                                                    
    .data                                                               
F:  .asciz  "FALSE"                                                     
T:  .asciz  "TRUE"                                                     
    .end

The instruction ldr x0, =T is one way of loading the address represented by a label. In this case, the label T corresponds to the address of the first letter of the C string “TRUE”. Likewise, ldr x0, =F loads the address of the C string containing “FALSE”.

The two occurrences of .asciz in the .data section are invocations of an assembler directive that creates a C string. Recall that C strings are NULL terminated. The NULL termination is indicated by the z which ends .asciz.

There is a similar directive .ascii that does not NULL terminate the string.

loop

while loop


while (a >= b) {
    // CODE BLOCK
}

aarch64:

    // Assume value of a is in x0                                       
    // Assume value of b is in x1                                       
                                                                        
 1: cmp     x0, x1                                                     
    blt     2f                                                          
    // CODE BLOCK                                                       
    b       1b                                                          

2:

for loop

for (set up; decision; post step)                                   
{                                                                    
    // CODE BLOCK                                                   
}

for

for (long i = 0; i < 10; i++)                                     
{                                                                  
    // CODE BLOCK                                                    
}

aarch64 (test at the top of the loop)

    // Assume i is implemented using x0                                                                                        
    mov     x0, xzr                                                     
                                                                      
1:  cmp     x0, 10                                                     
    bge     2f                                                         
                                                                       
    // CODE BLOCK                                                       
                                                                       
    add     x0, x0, 1                                                   
    b       1b                                                         
                                                                       
2:

aarch64 (test at the bottom of the loop)

    // Assume i is implemented using x0                                 
                                                                       
    mov     x0, xzr                                                     
    b       2f
                                                                       
1:                                                                     
                                                                       
    // CODE BLOCK                                                       
                                                                       
    add     x0, x0, 1                                                   
2:  cmp     x0, 10                                                     
    blt     1b

continue

for (long i = 0; i < 10; i++) {
    // CODE BLOCK "A"
    if (i == 5)
        continue;
    // CODE BLOCK "B"
}

in aarch64

    // Assume i is implemented using x0                                 
                                                                       
    mov x0, xzr                                                         

1:  cmp x0, 10                                                         
    bge 3f                                                                                                                      
    // CODE BLOCK "A".                                                              
    // if (i == 5)                                                     
    //      continue                                                   
    
    cmp x0, 5                                                           
    beq 2f                                                                                                                      
    // CODE BLOCK "B"                                                   
                                                                       
2:  add x0, x0, 1                                                       
    b   1b                                                             

3:

another one

    // Assume i is implemented using x0                                 
                                                                       
    mov x0, xzr                                                         
    b   3f                                                             
                                                                       
1:                                                                     
                                                                       
    // CODE BLOCK "A"                                                   
                                                                       
    // if (i == 5)                                                     
    //      continue                                                   
                                                                       
    cmp x0, 5                                                           
    beq 2f                                                             
                                                                       
    // CODE BLOCK "B"                                                   
                                                                       
2:  add x0, x0, 1                                                       
3:  cmp x0, 10                                                         
    blt 1b

break

The implementation of break is very similar to that of continue.

for (long i = 0; i < 10; i++) {
    // CODE BLOCK "A"
    if (i == 5)
        break;
    // CODE BLOCK "B"
}

aarch64:

    // Assume i is implemented using x0                                 
                                                                       
    mov x0, xzr                                                         
    b   3f                                                             
 
1:                                                                     
                                                                       
    // CODE BLOCK "A"                                                   
                                                                       
    // if (i == 5)                                                     
    //      break;                                                   
                                                                       
    cmp x0, 5                                                           
    beq 4f                                                             
                                                                       
    // CODE BLOCK "B"                                                   
                                                                       
2:  add x0, x0, 1                                                       
3:  cmp x0, 10                                                         
    blt 1b                                                             
                                                                       
4:

structs

alignment

Data members exhibit natural alignment.

That is:

a long will be found at addresses which are a multiple of 8.
an int will be found at addresses which are a multiple of 4.
a short will be found at addresses which are even.
a char can be found anywhere.

example

struct {
    long a;
    short b;
    int c;
};

Layout:

Offset	Width	Member
0	8 bytes	a
8	2 bytes	b
10	2 bytes	(padding)
12	4 bytes	c

struct Foo {
    long a;
    short b;
    int c;
};

struct Foo Bar = { 0xaaaaaaaaaaaaaaaa, 0xbbbb, 0xcccccccc };

A hex dump will show:

aaaa aaaa aaaa aaaa bbbb 0000 cccc cccc

Notice the gap filled in with zeros. Note, if this were a local variable, the zeros might be garbage.

change the member types:

struct Foo {
    short a;
    char b;
    int c;
};

struct Foo Bar = { 0xaaaa, 0xbb, 0xcccccccc };

A hex dump will show:

aaaa 00bb cccc cccc

Notice there is only one byte of gap before the int c starts.

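Why are the zeros to the left of the b's?

This ARM processor is running as a little-endian machine. In memory the byte 0xbb comes first and the padding byte 0x00 second; the dump prints each pair of bytes as one 16-bit value, so the pair appears as 00bb.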

defining structs

struct Foo {
    short a;
    char b;
    int c;
};

struct Foo Bar = { 0xaaaa, 0xbb, 0xcccccccc };

Here is one way of defining and accessing the struct:

Hard-coded field offsets

    .section .rodata
fmt:
    .asciz "%p a: 0x%lx b: %x c: %x\n"

    .data
bar:
    .short 0xaaaa        // a: short 2 byte
    .byte  0xbb          // b: char  1 byte
    .byte  0x00          // padding
    .word  0xcccccccc    // c: int   4 byte

    .text
    .global main
    .align 2
main:
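    stp x29, x30, [sp, -16]!    // save frame pointer and return address
    mov x29, sp

    adrp x0, fmt
    add  x0, x0, :lo12:fmt      // x0 = address of the printf format string

    adrp x1, bar
    add  x1, x1, :lo12:bar      // x1 = address of bar

    ldrh w2, [x1, 0]            // short a
    ldrb w3, [x1, 2]            // char b
    ldr  w4, [x1, 4]            // int  c

    bl printf                   // printf(fmt, &bar, a, b, c)
    
    // Exit explicitly with a system call. Note: this bypasses the C
    // runtime's exit(), so buffered stdout (for example when output is
    // redirected to a file) may not be flushed.
    mov     x8, #93       // syscall number for exit
    mov     x0, xzr       // exit code 0
    svc     0             // make syscall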

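:lo12:fmt is replaced with the low 12 bits of the address of fmt, that is, the offset of fmt within its 4 KB page.

adrp x0, fmt loads the "page base" of fmt into x0: the address of fmt rounded down to the nearest 4 KB boundary (the low 12 bits cleared).

For example, if fmt = 0x12345678, then:

adrp x0, fmt yields 0x12345000 (low 12 bits cleared), and the following add with :lo12:fmt adds 0x678 to give the full address.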

another way to define a struct

Using the .equ directive to define symbolic constants

    .global main                // export main
    .text
    .p2align 2

    .equ foo_a, 0               // like #define foo_a 0
    .equ foo_b, 2               // like #define foo_b 2
    .equ foo_c, 4               // like #define foo_c 4

main:
    stp     x29, x30, [sp, -16]!  // save x29 and x30 on the stack
    mov     x29, sp               // set up the new frame pointer

    // Load the arguments for printf(fmt, &bar, a, b, c). Each load
    // puts its value directly in the register printf expects, so no
    // further moves are needed.
    ldr     x0, =fmt              // 1st argument: address of fmt
    ldr     x1, =bar              // 2nd argument: address of bar
    ldrh    w2, [x1, foo_a]       // 3rd argument: bar.a
    ldrb    w3, [x1, foo_b]       // 4th argument: bar.b
    ldr     w4, [x1, foo_c]       // 5th argument: bar.c
    bl      printf                // call printf

    // Restore the saved registers and return 0
    ldp     x29, x30, [sp], #16   // restore x29 and x30
    mov     w0, wzr               // return value 0
    ret

    .data
fmt:
    .asciz      "%p a: 0x%lx b: %x c: %x\n"   // printf format string
    .p2align 2                                // keep bar naturally aligned
bar:
    .short      0xaaaa                        // a
    .byte       0xbb                          // b
    .byte       0                             // padding
    .word       0xcccccccc                    // c

    .end

a third way (Linux only)

Using .struct and field labels to derive the offsets automatically

    .section .rodata
fmt:
    .asciz "%p a: 0x%lx b: %x c: %x\n"

    // Use .struct to model the field offsets of struct Foo
    .set  Foo, 0
    .struct 0
Foo_a:  .struct Foo_a + 2      // short a: 2 bytes at offset 0
Foo_b:  .struct Foo_b + 2      // char b: 1 byte at offset 2, plus 1 byte of padding
Foo_c:  .struct Foo_c + 4      // int c: 4 bytes at offset 4
Foo_size:                      // total size of struct Foo: 8
    // Foo_a = 0, Foo_b = 2, Foo_c = 4

    .data
bar:
    .short 0xaaaa              // a: short 2 byte
    .byte  0xbb                // b: char  1 byte
    .byte  0x00                // padding
    .word  0xcccccccc          // c: int   4 byte

    .text
    .global main
    .align 2
main:
    stp x29, x30, [sp, -16]!   // save frame pointer and return address
    mov x29, sp

    adrp x0, fmt
    add  x0, x0, :lo12:fmt     // x0 = address of the printf format string

    adrp x1, bar
    add  x1, x1, :lo12:bar     // x1 = address of bar

    ldrh w2, [x1, Foo_a]       // load bar.a (short)
    ldrb w3, [x1, Foo_b]       // load bar.b (char)
    ldr  w4, [x1, Foo_c]       // load bar.c (int)

    bl printf                  // printf(fmt, &bar, a, b, c)

    // Exit explicitly
    mov     x8, #93            // syscall number for exit
    mov     x0, xzr            // exit code 0
    svc     0                  // syscall

using structs

To summarize using structs:

All structs have a base address
The base address corresponds to the beginning of the first data member
All subsequent data members are offsets relative to the first
In order to use a struct correctly, you must have first calculated the offsets of each data member
Sometimes there will be padding between data members due to the need to align all data members on natural boundaries.

this pointer in c++

Every non-static method call employs a hidden first parameter. That’s it. That’s the sleight of hand. The hidden argument is the this pointer.

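TestClass tc;
tc.SetString(test_string);

It looks as though only one argument, test_string, is passed. In fact the compiler passes two:

The first is the this pointer, i.e. the address of tc, passed in register x0
The second is test_string, passed in register x1

In the assembly we see (abbreviated):

adrp x1, _test_string
adrp x0, _tc         // put the address of the tc object in x0: this is the this pointer
bl __ZN9TestClass9SetStringEPc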

const

The meaning and function of const only partially translates to assembly language.

const local variables and const parameters are just like any other data to assembly language.
The constant nature of const local variables and parameters is implemented solely in the compiler.
const globals are made constant by the hardware: they are typically placed in read-only memory, and a write to them faults. Attempting to modify a variable protected in this manner will be like poking a dragon. Best not to poke dragons.

switch and jump table

When the C++ optimizer is enabled, it will look at your cases and choose between three different constructs for implementing your switch.

And, it can use any combination of the following! Compiler writers are smart!

It may emit a long string of if / else constructs.
It may find the right case using a binary search.
Finally, it might use a jump table.

Suppose our cases are largely consecutive. Given that all branch instructions are the same length in bytes, we can do math on the switch variable to somehow derive the address of the case we want.

#include <stdlib.h>                                              
#include <stdio.h>                                                
#include <time.h>                                                 
                                                                   
int main()                                                        
{                                                                   
    int r;                                                         
                                                                    
    srand(time(0));                                                
    r = rand() & 7;                                                 
    switch (r)                                                      
    {                                                              
        case 0:                                                    
            puts("0 returned");                                    
            break;                                                 
                                                                 
        case 1:                                                  
            puts("1 returned");                                   
            break;                                                  
                                                                    
        case 2:                                                     
            puts("2 returned");                                     
            break;                                                 
                                                                    
        case 3:                                                   
            puts("3 returned");                                  
            break;                                                
                                                                   
        case 4:                                                    
            puts("4 returned");                                     
            break;                                                 
                                                                   
        case 5:                                                     
            puts("5 returned");                                  
            break;                                             
                                                                    
        case 6:                                                    
            puts("6 returned");                                    
            break;                                                  
                                                                  
        case 7:                                                     
            puts("7 returned");                                     
            break;                                                  
    }                                                               
    return 0;                                                     
}

Notice that the case values are all, in this case, consecutive.

jt:     b       0f
        b       1f
        b       2f
        b       3f
        b       4f
        b       5f
        b       6f
        b       7f

In a reference to a numeric local label, the suffix f means forward (the next definition of that label) and b means backward (the previous one).

At address jt there is a sequence of branch instructions… jumps if you will. Being in a sequence, this is an example of a jump table. We’ll compute the index into this array of instructions and then branch to it.

lsl     x0, x0, 2     
ldr     x1, =jt          
add     x1, x1, x0        
br      x1

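Line 1 multiplies the case index by 4, the size in bytes of one branch instruction. Line 2 loads the base address of the “instruction array” starting at address jt. Lines 3 and 4 add the two and branch to the result.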

complete example

        .text
        .align  4
        .global main

main:   str     x30, [sp, -16]!
        mov     x0, xzr             // set up call to time(nullptr)
        bl      time                // call time setting up srand
        bl      srand               // call srand setting up rand
        bl      rand                // get a random number
        and     x0, x0, 7           // ensure its range is 0 to 7
                                    // note use of x register is on purpose
        lsl     x0, x0, 2           // multiply by 4
        ldr     x1, =jt             // load base address of jump table
        add     x1, x1, x0          // add offset to base address
        br      x1

// If, as in this case, all the "cases" have the same number of 
// instructions then this intermediate jump table can be omitted saving
// some space and a tiny amount of time. To omit the intermediate jump
// table, you'd multiply by 12 above and not 4. Twelve because each 
// "case" has 3 instructions (3 x 4 == 12).

// Question for you: If you did omit the jump table, relative to what
// would you jump (since "jt" would be gone).

jt:     b       0f
        b       1f
        b       2f
        b       3f
        b       4f
        b       5f
        b       6f
        b       7f

0:      ldr     x0, =ZR
        bl      puts
        b       99f

1:      ldr     x0, =ON
        bl      puts
        b       99f

2:      ldr     x0, =TW
        bl      puts
        b       99f

3:      ldr     x0, =TH
        bl      puts
        b       99f

4:      ldr     x0, =FR
        bl      puts
        b       99f

5:      ldr     x0, =FV
        bl      puts
        b       99f

6:      ldr     x0, =SX
        bl      puts
        b       99f

7:      ldr     x0, =SV
        bl      puts
        b       99f

99:     mov     w0, wzr
        ldr     x30, [sp], 16
        ret

        .data
        .section    .rodata

ZR:     .asciz      "0 returned"
ON:     .asciz      "1 returned"
TW:     .asciz      "2 returned"
TH:     .asciz      "3 returned"
FR:     .asciz      "4 returned"
FV:     .asciz      "5 returned"
SX:     .asciz      "6 returned"
SV:     .asciz      "7 returned"

        .end

implement falling through

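If there is no break following the code for a case, control will simply fall through to the next case.

Here is a snippet from the complete example above. Each b 99f implements a break; deleting the one that ends case 0 would let control fall through into case 1.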

0:      ldr     x0, =ZR  
        bl      puts   
        b       99f 
                  
1:      ldr     x0, =ON 
        bl      puts    
        b       99f

implementing gaps

The example above shows 8 consecutive cases. What if there was no code for case 4? In other words, what if case 4 didn’t exist?

Here is the result:

2:      ldr     x0, =TW
        bl      puts
        b       99f

3:      ldr     x0, =TH
        bl      puts
        b       99f

4:      b       99f

5:      ldr     x0, =FV
        bl      puts
        b       99f

other strategies for implementing switch

As indicated above, an optimizer has at least three tools available to it to implement complex switch statements. And, it can combine these tools.

For example, suppose your cases boil down to two ranges of fairly consecutive values. For example, you have cases 0 to 9 and also cases 50 to 59. You can implement this as two jump tables with an if / else to select which one you use.


Suppose you have a large switch statement with widely ranging case values. In this case, you can implement a binary search to narrow down to a small range in which another technique becomes viable to narrow down to a single case.

For example, with case values such as 10, 1000 and 50000, a binary search first narrows the value down to a small range. Within that range, another technique (a jump table, a short run of comparisons, etc.) then picks the exact case.

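You might need to implement hierarchical jump tables, for example.

A hierarchical jump table is an optimization that suits the following situation:

The case values are very sparse and span a wide range (for example case 0, case 1000, case 2000...)
But they are dense within local ranges (for example 1000 to 1009 and 2000 to 2009)

You can:

First use a first-level jump table, indexed by the high bits or by the range, to jump to a sub-table (a sub-range).
Then make the specific jump within the sub-table.
This forms a layered structure: a tree-like sequence of jumps.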

strategies for implementing if-else

If you do choose to implement a long chain of if / else statements, consider how frequently a given case might be chosen. Put the most common cases at the top of the if / else sequence.

This is known as making the common case fast.

Making the common case fast is one of the Great Ideas in Computer Science, one you would do well to remember no matter what language you’re working with.

functions

bottom line concept

The bl instruction stands for Branch with Link. The Link concept is what enables a function (or method) to return to the instruction after the call.

Branch-with-link computes the address of the instruction following it.

It places this address into register x30 and then branches to the label provided. It makes one link of a trail of breadcrumbs to follow to get back following a ret.

This is why it is absolutely essential to backup x30 inside your functions if they call other functions themselves.

an example

        .text                                      
        .global main                       
        .align  2                           
                                 
main:   ldr     x0, =hw                
        bl      puts             
        ret
                      
        .data                        
hw:     .asciz  "Hello World!"               
                                                 
        .end

When run, this program hangs and has to be killed with ^C.

Somebody called main() - it’s a function and someone called it with a bl instruction. At the moment main() entered, the address to which it needed to return was sitting in x30.

Then, main() called a function - in this case puts() but which function is called doesn’t matter - it called a function. In doing so, it overwrote the address to which main() needed to return with the address of the ret instruction (line 7 of the listing). That is where puts() needs to return.

So, when the ret on line 7 executes, it puts the contents of x30 into the program counter and branches to it. But x30 still holds the address of that same ret, so the program branches to itself forever.

Here is a fixed version of the code:

        .text                                   
        .global main                           
        .align  2                         
                                          
main:   str     x30, [sp, -16]!            
        ldr     x0, =hw                     
        bl      puts                 
        ldr     x30, [sp], 16         
        ret                             
                        
        .data                       
hw:     .asciz  "Hello World!"                   
                                             
        .end

In the AARCH64 Linux style calling convention, values are returned in x0 and sometimes also in other scratch registers, though this is uncommon. (Note that x0 could also be w0, or the first floating point register if the function is returning a float or double.)

If your functions call any other functions, x30 must be backed up on the stack and then restored into x30 before returning.

A function with more than one return value is not supported by C or C++ but they can be written in assembly language where the rules are yours to break.

inline functions

Calls to functions that are declared as inline need not be actual function calls. When the compiler inlines a call, the code from the function is type checked and inserted directly where the “call” is made after adjusting for parameter names.

passing parameters to functions

How parameters are passed to functions can be different from OS to OS. This chapter is written to the standard implemented for Linux.

For the purposes of the present discussion, we assume all parameters are long int and are therefore stored in x registers.

Up to 8 parameters can be passed directly via scratch registers (these are x0 through x7). Each parameter can be up to the size of an address, long or double (8 bytes).
- Scratch means the value of the register can be changed at will without any need to backup or restore their values across function calls.
- This means that you cannot count on the contents of the scratch registers maintaining their value if your function makes any function calls.

an example

long func(long p1, long p2)              
{                                              
    return p1 + p2;                           
}

is implemented as:

func:   add x0, x0, x1
        ret

If you are the author of both the caller and the callee and both are in assembly language, you can play loosey goosey with how you return values. Specifically, you can return more than one value. But if you do so, you give up the possibility of calling these functions from C or C++.

const

long func(const long p1, const long p2)              
{                                  
    return p1 + p2;
}

how would the assembly language change?

Answer: no change at all!

const is an instruction to the compiler ordering it to prohibit changing the values of p1 and p2. We’re smart humans and realize that our assembly language makes no attempt to change p1 and p2 so no changes are warranted.

passing pointers

void func(long * p1, long * p2)               
{                                                
    *p1 = *p1 + *p2;                           
}

func:   ldr x2, [x0]                     
        ldr x3, [x1]                            
        add x2, x2, x3                       
        str x2, [x0]                            
        ret

The value of x0 on return is, in the general sense, undefined because this is a void function.

passing reference

long func(long & p1, long & p2)                     
{                                              
    return p1 + p2;                               
}

func:   ldr x0, [x0]                     
        ldr x1, [x1]                
        add x0, x0, x1      
        ret

Passing by reference is also an instruction to the compiler to treat pointers a little differently. The differences don’t show up here, so the only change from our pointer passing version is how we return the answer.

more than eight parameters

#include <stdio.h>

void SillyFunction(long p1, long p2, long p3, long p4, 
                   long p5, long p6, long p7, long p8, 
                   long p9) {
    printf("This example hurts: %ld %ld\n", p8, p9);
}

int main() {
    SillyFunction(1, 2, 3, 4, 5, 6, 7, 8, 9);
}

        .text                                                            
        .global    main                                                
                                                                        
/*  Demonstration of using  more than 8 arguments to  a function.  This  
    demo is LINUX only as APPLE will put all arguments beyond the first  
    one on the stack anyway.                                             
                                                                         
    On LINUX, all parameters to a function beyond  the  eight go on the 
    stack.  The first 8 go in registers  x0  through  x7 as normal (for 
    LINUX).                                                              
*/                                                                    
                                                                       
SillyFunction:                                                        
        stp        x29, x30, [sp, -16]!    // Changes sp.               
        mov        x29, sp                 // set new frame pointer
        ldr        x0, =fmt                                 
        mov        x1, x7                  // p8, the eighth parameter
        ldr        x2, [sp, 16]            // p9, the ninth parameter; does not alter sp
        bl         printf                                                
        ldp        x29, x30, [sp], 16      // Undoes change to sp.     
        ret                                                          
                                                                          
main:                                                                   
        stp        x29, x30, [sp, -16]!    // sp down total of 16.      
        mov        x29, sp                                                
        mov        x0, 9                                                
        str        x0, [sp, -16]!          // sp down total of 32.     
        mov        x0, 1                                                
        mov        x1, 2                                                  
        mov        x2, 3                                            
        mov        x3, 4                                               
        mov        x4, 5                                              
        mov        x5, 6                                                  
        mov        x6, 7                                                   
        mov        x7, 8                                                   
        bl         SillyFunction                                           
        add        sp, sp, 16           // undoes change of sp by 16 due   
                                        // to function call.              
        ldp        x29, x30, [sp], 16   // undoes change to sp of 16.    
        ret                                                             
                                                                        
        .data                                                            
fmt:    .asciz    "This example hurts my brain: %ld %ld\n"           
                                                                       
        .end

After executing Line 24, the stack will have:

sp + 0    former contents of frame pointer
sp + 8    return address for main

After executing Line 27, the stack will have:

sp + 0    9
sp + 8    garbage
sp + 16   former contents of frame pointer
sp + 24   return address for main

After executing Line 14, the stack will have:

sp + 0    main's frame pointer (saved x29)
sp + 8    return address for SillyFunction
sp + 16   9
sp + 24   garbage
sp + 32   former contents of frame pointer
sp + 40   return address for main

This means that Line 18 fetches p9 from memory and puts its value into x2 (where it becomes the third argument to printf()).

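In AArch64, stack space is usually allocated in 16-byte units. If only part of an allocation is written, the rest is left uninitialized; that is what “garbage” (undefined contents) means above.

The stack pointer in ARM V8 must be 16-byte aligned whenever it is used to access memory, so in practice it is only adjusted in multiples of 16.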

examples of calling some common C runtime functions

There are, by the way, two broad types of functions within the C runtime.

Some are implemented largely in the C runtime itself.
Others that exist in the C runtime act as wrappers for functions implemented within the OS itself. These are called “system calls”.

For the purposes of calling functions in the C runtime, there is no practical difference between these two types. Note however, there are ways of calling system calls directly using the svc instruction.

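The “C runtime” is the set of functions, variables and basic mechanisms that support a program while it runs, chiefly the C standard library and program startup and shutdown. It is usually called the C runtime library. Common implementations on different platforms are:

glibc on GNU/Linux
MSVCRT on Windows
libSystem.dylib (which contains libc) on macOS

What does the C runtime do?

Program initialization
- Before main() runs, the C runtime sets up the stack, initializes global variables, calls constructors, and so on.
- The typical entry path is _start → __libc_start_main() → main().
Provides the standard library functions
- Such as printf(), malloc(), exit(), fopen() etc. These are implemented or wrapped by the C runtime.
Manages resources
- For example, the lifetime of memory allocations, file handles and threads.
Provides system call wrappers
- For example, when you call write() you are really calling a wrapper provided by the C runtime, which finally enters the kernel with a syscall or svc instruction.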

system calls

Many C runtime functions are just wrappers for system calls. For example if you call open() from the C runtime, the function will perform a few bookkeeping operations and then make the actual system call.

What IS a system call?

The short answer is a system call is a sort-of function call that is serviced by the operating system itself, within its own private region of memory and with access to internal features and data structures.

Our programs run in “userland”. The technical name for userland on the ARM64 processor is EL0 (Exception Level 0).

We can operate within the kernel’s space only through carefully controlled mechanisms - such as system calls. The technical name for where the kernel (or system) generally operates is called EL1.

There are two higher Exception Levels (EL2 and EL3) which are beyond the scope of this book.

Mechanism of making a system call

First, like any function call, parameters need to be set up. The first parameter goes in the first register, etc.

Second, a number associated with the specific system call we wish to make is loaded in a specific register (x8, or equivalently w8 since the numbers are small).

Finally, a special instruction svc causes a trap which elevates us out of userland into kernel space. Said differently, svc causes a transition from EL0 to EL1. There, various checks are done and the actual code for the system call is run.

A description of returning from a system call is beyond the scope of this book. Hint: just as there’s a special instruction that escalates from EL0 to EL1, there is a special instruction that does the reverse.

the number associated with a particular system call

reference:

syscalls

https://gpages.juszkiewicz.com.pl/syscalls-table/syscalls.html

example getpid()

#include <stdio.h>                                                
#include <unistd.h>                                               
                                                                  
int main() {                                                      
    printf("Greetings from: %d\n", getpid());                     
    return 0;                                                     
}

Written in assembly language using C runtime

        .global main                                              
        .text                                                     
        .align  2                                                 
                                                                  
main:   stp     x29, x30, [sp, -16]!                              
        bl      getpid                                            
        mov     w1, w0                                            
        ldr     x0, =fmt                                          
        bl      printf                                            
        ldp     x29, x30, [sp], 16                                
        mov     w0, wzr                                           
        ret                                                       
                                                                  
        .data                                                     
fmt:    .asciz  "Greetings from: %d\n"                            
                                                                  
        .end

And finally: calling the system call directly

        .global main                                              
        .text                                                     
        .align  2                                                 
                                                                  
main:   stp     x29, x30, [sp, -16]!                              
        mov     x8, 172                 // getpid on ARM64        
        svc     0                       // trap to EL1            
        mov     w1, w0                                            
        ldr     x0, =fmt                                          
        bl      printf                                            
        ldp     x29, x30, [sp], 16                                
        mov     w0, wzr                                           
        ret                                                       
                                                                  
        .data                                                     
fmt:    .asciz  "Greetings from: %d\n"                            
                                                                  
        .end

We chose getpid() because it doesn’t require any parameters. Using the C runtime, we simply bl to it. Calling the system call directly is different in that we must first load x8 with the number that corresponds to getpid() for the AARCH64 architecture.

/*  Perry Kivolowitz
    Example of file operations.
*/
        .text
        .global main
        .align  2

/*  This program will
    * open() a file in the current directory,
    * write() some text to it, 
    * seek back to the beginning of the file,
    * read() each line, printing it
    * close() the file
*/
// Use .req to give registers aliases for readability. For example, fd is really w28 and holds the file descriptor.
retval  .req    w27
fd 		.req	w28

main:   stp     x29, x30, [sp, -16]!
        stp     x27, x28, [sp, -16]!
        bl      open_file

        // w0 will contain either the file descriptor of the new
        // file or -1 for a failure. Note that the value in w0
        // has also been copied to "fd" - a register alias.
        cmp 	w0, wzr
        bge 	1f

        // If we get here, the open has failed. Use perror() to
        // print a meaningful error and branch to exit. The return
        // code of the program will be set to non-zero inside fail.
        ldr     x0, =fname
        bl      fail
        b       99f

1:		// When we get here, the file is open. Write some data to it.
        // If write_data returns non-zero, it signifies an error. If
        // so, branch to the file closing code since the file is open
        // after printing an error message.
        bl		write_data
        cbz	    w0, 10f

        // If we get here, there was an error in write_data. Print
        // a reasonable error message then branch to the clean up
        // code.
        ldr     x0, =wf     // load legend
        bl      fail        // print error
        b       50f         // branch to clean up.

        // Seek back to position zero preparing to read the file back.
        // The return value in x0 (off_t) is the return value of
        // lseek(). 
10:     bl      seek_zero
        cbz     x0, 20f

        // If we get here, the seek failed. Cause a reasonable
        // message to be printed then branch to the clean up code.
        ldr     x0, =sf
        bl      fail
        b       50f

20:     // When we get here, we have to read from the file and print
        // the results. To ignore the complexity of memory allocation
        // and buffer overrun potential, we'll read one character at a 
        // time looking the end-of-file.

        // ssize_t read(int fildes, void *buf, size_t nbyte);
        mov     w0, fd
        ldr     x1, =buffer
        mov     x2, 1
        bl      read
        // Check the return value - should be 1.
        cbz     x0,50f      // zero means EOF - that's OK.
        // If x0 is negative, that IS a problem.
        cmp     x0, xzr
        bge     25f
        // The return value is negative - this is an error.
        ldr     x0, =rf
        bl      fail
        b       99f

25:     // Write the character sitting in buffer to the console.
        mov     w0, 1
        ldr     x1, =buffer
        mov     x2, 1
        bl      write
        // We will ignore the return value for the sake of brevity.
        // There are plenty of examples of handling a potential error
        // elsewhere in this code.
        // --
        b       20b

        // When we get here, we are done. Close the file.
50:		mov		w0, fd
        bl 		close
        mov 	retval, wzr

99:     mov     w0, retval          // copy the result before x27 (retval) is restored
        ldp     x27, x28, [sp], 16
        ldp     x29, x30, [sp], 16
        ret

/*	open_file()
    This function attempts to open a file for both reading and
    writing. Return values will be checked to ensure the file is
    opened. If successful, the fd is returned (and is squirreled
    away in register "fd"). If unsuccessful, the -1 returned by
    open() is passed back to the caller.

    Explanation of the magic numbers:

    int open(const char *pathname, int flags, mode_t mode);

    octal 102 for flags is O_RDWR | O_CREAT
    octal 600 for mode is rw------- i.e. read and write for
        the owner but no permissions for anyone else.

	There is a version of open() that takes two parameters. However,
	if O_CREAT is specified, the three parameter version is required.
*/

        .equ    O_FLAGS, 0102
        .equ    O_MODE, 0600

open_file:
        stp      x29, x30, [sp, -16]!
        ldr      x0, =fname
        mov      w1, O_FLAGS
        mov      w2, O_MODE
        bl       open
        mov      fd, w0
        ldp      x29, x30, [sp], 16
        ret


/*  This function uses perror() to print a meaningful error
    message in the event of a failure. The string value
    passed to perror() arrives to us as a pointer in x0.
*/

fail:
        stp     x29, x30, [sp, -16]!
        bl 		perror
        mov		retval, 1
        ldp     x29, x30, [sp], 16
        ret

/*  ssize_t write(int fd, const void *buf, size_t count);

	This function will write a string to the file descriptor contained
	in "fd" (a register alias).
*/

write_data:
        stp     x29, x30, [sp, -16]!
        str     x20, [sp, -16]!
        mov     w0, fd              // file descriptor
        ldr     x1, =txt            // address to print from
        ldr     x2, =txt_s          // load pointer to size
        ldr     w2, [x2]            // dereference the pointer (txt_s is a 32 bit .word)
        mov     w20, w2             // need this value for error check.
        bl      write
        cmp     x0, x20             // Did we write the expected amount?
        bne     90f
        // successful write - return 0
        mov     x0, xzr
        b       99f
90:     // failure - ensure we return non-zero!
        mov     x0, 1
99:     ldr     x20, [sp], 16
        ldp     x29, x30, [sp], 16
        ret

/*  off_t lseek(int fd, off_t offset, int whence);
*/
seek_zero:
        stp     x29, x30, [sp, -16]!
        mov     w0, fd          // file descriptor
        mov     x1, xzr         // beginning of file
        mov     w2, wzr         // SEEK_SET - absolute offset
        bl      lseek
        ldp     x29, x30, [sp], 16
        ret

        .data
prog:	.asciz	"file_ops"
wf:     .asciz  "write failed"
rf:     .asciz  "read failed"
sf:     .asciz  "lseek failed"
fname:	.asciz	"test.txt"
txt:	.asciz	"some data\n"
txt_s:	.word	txt_s - txt - 1		// strlen(txt): the length of "some data\n" without the terminating null
buffer: .word   0
        .end

floating point

what are floating points numbers?

reference

CSAPP DataLab

https://even629.com/posts/42856/

IEEE 754

register

There are four highest level ideas relating to floating point operations on AARCH64.

There is another complete register set for floating point values.
There are alternative instructions just for floating point values.
There are exotic instructions that operate on sets of floating point values (SIMD).
There are instructions to go back and forth to and from the integer registers.

regs

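The figure above shows the different views of, and ways to access, the AARCH64 SIMD (Single Instruction, Multiple Data) register V0, including the arrangement specifiers for each element width and the lane indices.

About the figure

Using V0 as the example, the figure shows how the same register contents can be accessed through different arrangements:

Level	Form	Description
Bottom	`V0`	The entire 128-bit V0 register
Next up	`V0.2D`, `V0.4S`, `V0.8H`, `V0.16B`	V0 viewed as elements of a given size: D = 64-bit (2 × 64 bits), S = 32-bit (4 × 32 bits), H = 16-bit (8 × 16 bits), B = 8-bit (16 × 8 bits)
Above that	`V0.2D[0]`, `V0.4S[0]`, etc.	Lane indices. For example, `V0.4S[2]` is the third 32-bit element and `V0.16B[15]` is the sixteenth 8-bit byte
Top	`B0`, `H0`, `S0`, `D0`	Aliases of `V0` that access it by width (only the lowest-order bits are accessed)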

truncation towards zero

In C and C++, truncation is what we get from:

integer_variable = int(floating_variable);   // C++
integer_variable = (int) floating_variable;  // C

The instruction is fcvtz - convert towards zero. Then, the choice as to whether to produce a signed or unsigned result is defined by the final letter: u or s.

Mnemonic	Meaning
fcvtzu	Truncate (always towards 0) producing an unsigned int
fcvtzs	Truncate (always towards 0) producing a signed int

fcvtzu: Float Convert to Unsigned integer, with truncation toward zero
fcvtzs: Float Convert to Signed integer, with truncation toward zero

Although this instruction completely discards the fractional part, the ARM documentation describes it as rounding (toward zero) rather than truncating.

The choice of source register defines whether you are converting a double or single precision floating point value.

Source Register	Converts a
dX	`double` to an integer
sX	`float` to an integer

Destination Register	Produces a
xX	64 bit integer
wX	32 bit or smaller integer

Examples where d is a double (for a float f, the source register would be s0 instead of d0):

C++	Instruction
`int32_t(d)`	`fcvtzs w0, d0`
`uint32_t(d)`	`fcvtzu w0, d0`
`int64_t(d)`	`fcvtzs x0, d0`
`uint64_t(d)`	`fcvtzu x0, d0`

example

    .section .text
    .global main
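    .type main, @function // tells the assembler and linker that main is a function symbol
    // .type <symbol>, @<type> is a GAS (GNU Assembler) directive that gives a symbol a type.
    // <symbol>: the symbol name, e.g. main
    // @<type>: the symbol type; @function marks a function rather than a variable or plain label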


main:
    stp     x29, x30, [sp, -16]!     // save frame pointer and link register
    mov     x29, sp

    // Save the floating point registers used below. d8-d15 are callee-saved,
    // so values kept in them survive the printf calls (d16-d31 might not).
    stp     d8, d9, [sp, -16]!
    stp     d10, d11, [sp, -16]!

    // print the legend
    ldr     x0, =leg
    bl      printf

    // load the values at vless into d8-d11
    ldr     x0, =vless
    ldr     d8, [x0]             // dless = 5.49
    ldr     d9, [x0, #8]         // dmore = 5.51
    ldr     d10, [x0, #16]       // ndless = -5.49
    ldr     d11, [x0, #24]       // ndmore = -5.51

    // fcvtps: round up (toward +infinity)
    fcvtps  x1, d8
    fcvtps  x2, d9
    ldr     x0, =fmt1
    bl      printf

    fcvtps  x1, d10
    fcvtps  x2, d11
    ldr     x0, =fmt1
    bl      printf

    // fcvtns: round to nearest (ties to even)
    fcvtns  x1, d8
    fcvtns  x2, d9
    ldr     x0, =fmt2
    bl      printf

    fcvtns  x1, d10
    fcvtns  x2, d11
    ldr     x0, =fmt2
    bl      printf

    // fcvtzs: round toward zero
    fcvtzs  x1, d8
    fcvtzs  x2, d9
    ldr     x0, =fmt4
    bl      printf

    fcvtzs  x1, d10
    fcvtzs  x2, d11
    ldr     x0, =fmt4
    bl      printf

    // fcvtas: round to nearest (ties away from zero)
    fcvtas  x1, d8
    fcvtas  x2, d9
    ldr     x0, =fmt3
    bl      printf

    fcvtas  x1, d10
    fcvtas  x2, d11
    ldr     x0, =fmt3
    bl      printf

    // restore the floating point registers, frame pointer and return address
    ldp     d10, d11, [sp], #16
    ldp     d8, d9, [sp], #16
    ldp     x29, x30, [sp], #16
    mov     w0, wzr
    ret

    .section .rodata
vless:
    .double 5.49
    .double 5.51
    .double -5.49
    .double -5.51

fmt1:
    .asciz "fcvtps less: %ld more: %ld\n"
fmt2:
    .asciz "fcvtns less: %ld more: %ld\n"
fmt3:
    .asciz "fcvtas less: %ld more: %ld\n"
fmt4:
    .asciz "fcvtzs less: %ld more: %ld\n"
leg:
    .asciz "less values are +/- 5.49. more values are +/- 5.51.\n"

In the fcvtzs lines of the output, notice that all the values were truncated to the whole number that is closer to zero.

Truncation Away From Zero

Truncation away from zero is not as easy. In fact, it cannot be performed with a single instruction.

In C (and C++):

iv = (int(fv) == fv) ? int(fv) : int(fv) + ((fv < 0) ? -1 : 1);

If fv is already equal to a whole number, the integer value will be that whole number. Otherwise iv is the whole number further away from zero.

In C++, a more sophisticated version would require <cmath> and could look like:

template <typename T>
int MyTruncate(T x) {
    return int((x < 0) ? floor(x) : ceil(x));
}

floor() always truncates downward (towards more negative).
ceil() always truncates upwards (towards more positive).

RoundAwayFromZero:
        fcmp    d0, 0
        ble     1f
        // Value is positive, truncate towards positive infinity (ceil)
        frintp  d0, d0
        b       2f
1:      // Value is negative, truncate towards negative infinity (floor)
        frintm  d0, d0
2:      fcvtzs  x0, d0
        ret

frintp (round toward +∞)
frintm (round toward -∞)
frintz (round toward 0)
frinta (round to nearest, ties away from 0)
frintn (round to nearest, ties to even)

rounding conversion

An instruction which does what we normally think of as rounding is frinta (or fcvtas when the result should land in an integer register). This rounds to nearest with ties going away from zero, so 5.5 goes to 6 as one would expect from “rounding.”

converting an integer to a floating point value

In C / C++:

double_var = double(integer_var);   // C++
double_var = (double)integer_var;   // C

Is handled by two instructions:

scvtf converts a signed integer to a floating point value
ucvtf converts an unsigned integer to a floating point value
The name of the destination register controls which kind of floating point value is made. For example, specifying dX makes a double etc.


floating point literals

Recall that all AARCH64 instructions are 4 bytes long. Recall also that this means that there are constraints on what can be specified as a literal since the literal must be encoded into the 4 byte instruction. If the literal is too large, an assembler error will result.

Given that floating point values are always at least 4 bytes long themselves, using floating point literals is extremely constrained. For example:

fmov d0, 1      // 1
fmov d0, 1.1    // 2

Line 1 will pass muster but Line 2 will cause an error.

To load a float, you could translate the value to binary and do as the following:

        .text                                                   
        .global main                                            
        .align    2                                             
                                                                
main:   str        x30, [sp, -16]!                              
        ldr        s0, =0x3fc00000                              
        fcvt       d0, s0                                       
        ldr        x0, =fmt                                     
        bl         printf                                       
        ldr        x30, [sp], 16                                
        mov        w0, wzr                                      
        ret                                                     
                                                                
        .data                                                   
fmt:    .asciz    "%f\n"                                        
        .end

printf() only knows how to print double precision values. In C, a float argument is promoted to double by the compiler before the call; in assembly language that conversion is our job, which is why the fcvt appears above.

Translating floats and doubles by hand isn’t a common practice for humans, though compilers are happy to do so.

Instead, for us humans, the assembler directives .float and .double are used more frequently to specify float and double values, putting them into RAM.
An example:

        .global main                                            
        .text                                                   
        .align  2                                               
                                                                
counter .req    x20                                             
dptr    .req    x21                                             
fptr    .req    x22                                             
        .equ    max, 4                                              
                                                                    
main:   stp     counter, x30, [sp, -16]!                            
        stp     dptr, fptr, [sp, -16]!                              
        ldr     dptr, =d                                            
        ldr     fptr, =f                                            
        mov     counter, xzr                                        
                                                                    
1:      cmp     counter, max                                        
        beq     2f                                                  
                                                                    
        ldr     d0, [dptr, counter, lsl 3]                          
        ldr     s1, [fptr, counter, lsl 2]                          
        fcvt    d1, s1                                              
        ldr     x0, =fmt                                            
        add     counter, counter, 1                                 
        mov     x1, counter                                         
        bl      printf                                              
        b       1b                                                  
                                                                    
2:      ldp     dptr, fptr, [sp], 16                                
        ldp     counter, x30, [sp], 16                              
        mov     w0, wzr                                             
        ret                                                         
                                                                    
        .data                                                       
fmt:    .asciz  "%d %f %f\n"                                       
d:      .double 1.111111, 2.222222, 3.333333, 4.444444              
f:      .float  1.111111, 2.222222, 3.333333, 4.444444             
                                                                    
        .end

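Directive	Name / abbreviation	Purpose	Typical usage
`.req`	register require (unofficial expansion)	Gives a register an alias	`foo .req x0` means that writing `foo` is the same as writing `x0`
`.equ`	equate	Defines a constant symbol	`.equ BUF_SIZE, 64` sets `BUF_SIZE` to 64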

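On Linux, just as w/x0 through w/x7 are scratch registers used to pass parameters, s/d0 through s/d7 are as well, beginning with register 0.

That is:

Integer parameter passing:
x0 through x7 (or their 32 bit forms w0 through w7) carry the first 8 integer-class parameters (int, pointer, long and so on).

Any beyond the first 8 are passed on the stack.

Floating point parameter passing:
d0 through d7 (64 bit double) or s0 through s7 (32 bit float) carry the first 8 floating point parameters.

Floating point parameters beyond the first 8 are likewise passed on the stack.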

Fitting 32 bits into a 32 bit bag

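ldr s0, =0x3fc00000 // a pseudo-instruction! It looks as if it loads 0x3fc00000 straight into s0

The assembler cannot hard-code an arbitrary 32 bit value into an instruction, because an AARCH64 instruction is itself only 32 bits wide.

What actually happens is:

The literal value 0x3fc00000 is written somewhere in memory (usually close to the end of the current function).
An ldr instruction is generated that loads the value from that address using a PC-relative load.
That area is called a literal pool: a collection of constants.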

We expected line 6 to read:

ldr s0, =0x3fc00000

Instead we find:

b+ 0x784 <main+4> ldr s0, 0x7a0 <main+32>

Scan downward to find 0x7a0:

0x7a0 <main+32> .inst 0x3fc00000 ; undefined

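Pseudo-instruction	Effect	What GDB actually shows
`ldr s0, =0x3fc00000`	Loads the constant into `s0`	`ldr s0, #literal_addr` `literal_addr: .inst 0x3fc00000`
`ldr x0, =fmt`	Loads the address of the string	`ldr x0, #literal_addr` `literal_addr: .inst <address value>`
`.inst 0x3fc00000`	Manually inserts a 32 bit word (not necessarily a valid instruction)	Holds a constant (it is not executed)

What .inst means
Name: .inst = insert instruction
Purpose: inserts the machine code of an instruction directly (usually as a 32 bit hexadecimal value)

.inst 0xd65f03c0 // this is actually the ret instruction

In this example, the machine code 0xd65f03c0 following .inst is the 32 bit encoding of ret. In other words:

ret

is equivalent to:

.inst 0xd65f03c0

In the example above, .inst can be used to define a word at an address, and the value is then loaded from that address.

Why not use mov reg, #imm?

mov has encoding limits on its immediate and cannot load an arbitrary 32 bit value.
When the value is out of range, it must be loaded from memory with ldr.

fmov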

The fmov instruction is used to move floating point values in and out of floating point registers and to some degree, moving data between integer and floating point registers.

loading floating point numbers as immediate values

Just as we saw with integer registers, some values can be used as immediate values and some cannot. It comes down to how many bits are necessary to encode the value. With too many bits, the value cannot fit alongside the opcode in a 4 byte instruction.

For example, this works:

mov x0, 65535

but this does not:

mov x0, 65537

The constraints placed on immediate values for fmov are much tighter because floating point numbers are far more complex than integers.

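Whether fmov d0, #imm works depends on whether the floating point value can be represented exactly in an 8 bit encoding:

Field	Bits	Description
Sign	1 bit	Positive or negative
Exponent	3 bits	Controls the magnitude (multiplication by a power of 2)
Fraction	4 bits	Can only be built from combinations of 1/2, 1/4, 1/8 and 1/16

fmov d0, 1.0        // ✅ OK: 1 is 2⁰, an encodable exponent
fmov d0, 1.5        // ✅ OK: 1 + 0.5 = 2⁰ + 2⁻¹, exponent and fraction both encodable
fmov d0, 1.75       // ✅ OK: 1 + 0.5 + 0.25 = 2⁰ + 2⁻¹ + 2⁻²
fmov d0, 1.875      // ✅ OK: + 2⁻³
fmov d0, 1.9375     // ✅ OK: + 2⁻⁴
fmov d0, 1.96875    // ❌ Error: needs 2⁻⁵, which exceeds the 4 fraction bits

Floating point values that do not fit this encoding cannot be used with fmov as immediates; load them with ldr instead.

fmov copies bits; it does not convert precision. To change the precision of a value you must use the fcvt family.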

half precision

Support for half precision (16 bit) floating point values does exist but there is no complete agreement on how different compilers support them. Indeed, there are not one but two competing half precision formats out there. These are the IEEE and GOOGLE types. Further still, many open source developers have created their own implementations with potentially clashing naming conventions.


__fp16 Foo(__fp16 g, __fp16 f) {
    return g + f;
}

compiles to:

fcvt    s1, h1
fcvt    s0, h0
fadd    s0, s0, s1
fcvt    h0, s0
ret

Notice each half precision value is converted to single precision. So, from C and C++ working with half precision values can be inefficient.

On the other hand, if you are willing to use intrinsics and one of the SIMD instruction sets offered by ARM, then knock yourself out. Be aware that doing so ties your code to the ARM processor in ways which you might regret later.

bit manipulation

Bit fields are a feature of the C and C++ language which completely hide what is often called “bit bashing”.

Note that the ordering of bits in a bit field is not guaranteed to be the same on different platforms, or even between different compilers on the same platform.

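A bit field is syntax for controlling exactly how many bits a member occupies within a struct. It is typically used where space matters, such as hardware registers and protocol headers.
Syntax

struct struct_name {
    type member_name : bit_width;
    ...
};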

example:

struct BF {
    unsigned char a : 1;
    unsigned char b : 2;
    unsigned char c : 5;
};

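a uses 1 bit and can hold 0 or 1
b uses 2 bits and can hold 0 through 3
c uses 5 bits and can hold 0 through 31
The three members occupy 1 + 2 + 5 = 8 bits in total, i.e. 1 byte

Although each member has its own bit width, the overall size is usually rounded up to the alignment of the declared integer type (1 byte here, since 8 bits fill exactly one byte).
Compilers may differ slightly in the details of bit field alignment and padding.
Members are accessed like ordinary members: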

struct BF bf;
bf.a = 1;
bf.b = 3;
bf.c = 31;

The compiler automatically generates the masking and shifting needed to access a bit field.

Consider a data structure for which there will be potentially millions of instances in RAM. Or, perhaps billions of instances on disc. Suppose you need 8 boolean members in every instance. The C++ standard does not define the size of a bool instead leaving it to be implementation dependent. Some implementations equate bool to int, four bytes in length. Some implement bool with a char, or 1 byte in length.

Let’s assume the smallest case and equate a bool with char. Our struct, for which there may be millions or billions of instances requires 8 bool so therefore 8 bytes. Times millions or billions.

Bit fields can come to your aid here by using a single bit per boolean value. In the best case, 8 bytes collapse to 1 byte. In a worse case, 8 x 4 = 32 bytes collapsed into 1.

Assume the smallest case, where each bool is 1 byte:

struct S {
    bool b0;
    bool b1;
    bool b2;
    bool b3;
    bool b4;
    bool b5;
    bool b6;
    bool b7;
};

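This struct is 8 bytes in size (1 byte × 8 bools).
With a million instances, that is 8 MB of memory; with a billion instances, 8 GB.
With a 4 byte bool, the size becomes 32 bytes, so every hundred million instances take 3.2 GB.

Solution: pack the booleans with bit fields
Using bit fields, define each of the 8 boolean values as 1 bit wide: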
struct S {
    unsigned char b0 : 1;
    unsigned char b1 : 1;
    unsigned char b2 : 1;
    unsigned char b3 : 1;
    unsigned char b4 : 1;
    unsigned char b5 : 1;
    unsigned char b6 : 1;
    unsigned char b7 : 1;
};

The 8 one-bit members together occupy exactly 1 byte.

In this way 8 bytes are compressed into 1 byte, saving a great deal of space.

In Computer Science there is an eternal tension between space and time. The following is a law:

If you want something to go faster, it will cost more memory.

If you want to save memory, what you’re doing will take more time.

This law shows up here… recall the example of where we wanted to save memory by collapsing 8 bool into 1 byte? To save that memory we will slow down because accessing the right bits takes a couple of instructions where overwriting a bool implemented as an int takes just one instruction.

As for the assembly language that bit field will produce, it depends upon optimization level. Unoptimized, the code produced will be much longer and more cumbersome than the “sophisticated” assembly language.

endian

ARM processors can swing both ways: little-endian and big-endian. But:

The standard toolchain emits little endian code. It is a big task to install the big-endian version of the toolchain.

Here is a quote from Wikipedia:

ARM, C-Sky, and RISC-V have no relevant big-endian deployments, and can be considered little-endian in practice.

The common Intel processors are also little-endian.

assembly macros

An early innovation in assemblers was the introduction of a macro capability. Given what could be considered a certain amount of tedium in coding in asm, macros provide a simple form of meta programming where a series of statements can be encapsulated by a single macro. Think of a macro as an early form of C++ templated function (kinda but not really).

Here’s an example of an assembly language macro (this is the macOS form; @PAGE and @PAGEOFF are Mach-O assembler syntax):

.macro LLD_ADDR xreg, label 
        adrp    \xreg, \label@PAGE
        add     \xreg, \xreg, \label@PAGEOFF
.endm

Here's how it might be used:

        LLD_ADDR x0, fmt

This gets expanded to:

        adrp    x0, fmt@PAGE
        add     x0, x0, fmt@PAGEOFF

gcc on Linux does not run assembly language files through the C pre-processor if the asm file ends in .s but WILL if the file ends in .S.

General Use

AASCIZ

AASCIZ label, string

This macro invokes .asciz with the string set to string and the label set to label. In addition, this macro ensures that the string begins on a 4-byte-aligned boundary.

PUSH_P, PUSH_R, POP_P and POP_R

These macros save some repetitive typing. For example:

PUSH_P x29, x30

resolves to:

stp x29, x30, [sp, -16]!

START_PROC and END_PROC

Place START_PROC after the label introducing a function.

Place END_PROC after the last ret of the function.

These resolve to: .cfi_startproc and .cfi_endproc respectively.

MIN and MAX

Handy more readable macros for determining minima and maxima. Note that the macro performs a cmp which subtracts src_b from src_a (discarding the results) in order to set the flags to be interpreted by the following csel.

Signature:

MIN src_a, src_b, dest

The smaller of src_a and src_b is put into dest.

Signature:

MAX src_a, src_b, dest

The larger of src_a and src_b is put into dest.

MOD

The MOD macro used above is defined as:

.macro  MOD         src_a, src_b, dest, scratch
        sdiv        \scratch, \src_a, \src_b
        msub        \dest, \scratch, \src_b, \src_a
.endm

GLABEL

Marks a label as global, making it available externally.

Signature:

GLABEL label

An underscore is prepended.

CRT

Calling CRT(C runtime) functions
If you create your own function without an underscore, just call it as usual.
If you need to call a function such as those found in the C runtime library, use this macro in this way:

CRT strlen

MAIN

Declaring main()
Put MAIN on a line by itself. Notice there is no colon.

errno

The externally defined errno is accessed via a CRT function which isn’t seen when coding in C and C++. The function is named differently on Mac versus Linux. To get the address of errno use:

ERRNO_ADDR

This macro makes the correct CRT call and leaves the address of errno in x0.

Loads and Stores

GLD_PTR

Loads the address of a label and then dereferences it. On Apple the label is in the global space; on Linux it is a relatively close label.

Signature:

GLD_PTR xreg, label

When this macro finishes, the specified x register contains what 64 bit value lives at the specified label.

GLD_ADDR

Loads the address of the label into the specified x register. No dereferencing takes place. On Apple machines, the label will be found in the global space.

Signature:

GLD_ADDR xreg, label

When this macro completes, the address of the label is in the x register.

LLD_ADDR

Similar to GLD_ADDR this macro loads the address of a “local” label.

Signature:

LLD_ADDR xreg, label

When this macro completes, the address of the label is in the x register.

LLD_DBL

Signature:

LLD_DBL xreg, dreg, label

When this macro completes, a double that lives at the specified local label will sit in the specified double register.

LLD_FLT

Signature:

LLD_FLT xreg, sreg, label

When this macro completes, a float that lives at the specified local label will sit in the specified single precision register.

performance

Undoing Stack Pointer Changes

A small tip concerning undoing changes to the stack pointer. You might think that changes to the stack made by str or stp and their cousins must be undone with ldr or ldp and their cousins.

This depends.

If you need to get back the original contents of a register pushed onto the stack, then an ldr or ldp is appropriate. However, if you don’t need to get the original contents of a register back, then it is faster to undo a change to the stack using addition.

Take for example the use of printf(). On Apple Silicon systems, you must send arguments to printf() by pushing them onto the stack. However, when printf() completes, you have no need for the values that you pushed. As shown above, simply add the right amount (a multiple of 16) to the stack pointer. This is faster as the addition makes no reference to RAM (or caches) as the ldr would.

other stuff

let the assembler itself calculate the length for you

        .global        main                                             
        .align         2                                                
        .text                                                           
                                                                        
main:   str            x30, [sp, -16]!                                  
        mov            w0, 1             // stdout                      
        ldr            x1, =s            // pointer to string           
        ldr            x2, =ssize        // pointer to computed length  
        ldr            w2, [x2]          // actual length of string     
        bl             write                                            
                                                                        
        ldr            x0, =fmt                                         
        ldr            x1, =s                                           
        ldr            x2, =ssize                                       
        ldr            w2, [x2]                                         
        bl             printf                                           
                                                                        
        ldr            x30, [sp], 16                                    
        mov            w0, wzr                                          
        ret                                                             
                                                                        
        .data                                                           
                                                                        
s:      .asciz         "Hello, World!\n"                                
ssize:  .word          ssize - s - 1        // accounts for null at end 
fmt:    .asciz         "str: %slen: %d\n"   // accounts for newline     
                                                                        
        .end

atomic operations

Load Linked, Store Conditional

        .text                                                 
        .p2align    2                                         
                                                              
#if defined(__APPLE__)                                        
        .global     _LoadLinkedStoreConditional               
_LoadLinkedStoreConditional:                                  
#else                                                         
        .global     LoadLinkedStoreConditional                
LoadLinkedStoreConditional:                                   
#endif                                                        
1:      ldaxr       w1, [x0]                                  
        add         w1, w1, 1                                 
        stlxr       w2, w1, [x0]                              
        cbnz        w2, 1b                                    
        ret

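LL/SC is an optimistic concurrency control mechanism. The rough logic is:

Load-Linked (LDAXR): load the value at an address and start monitoring that address for changes.
Modify the loaded value in a register (for example, add 1).
Store-Conditional (STLXR): try to write the new value back. If nobody else has written to the address in the meantime, the store succeeds; otherwise it fails.
STLXR reports the outcome in its status register (0 means success, non-zero means failure). On failure the code loops back and tries again.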
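
The load / modify / conditional-store / retry pattern above has a close C11 analogue in a compare-and-swap loop. The following is a minimal sketch, not the original author's code: the function name `increment_with_retry` is hypothetical, and how a compiler lowers it (an ldaxr/stlxr loop or something else) depends on the target and flags.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (not from the original notes): atomically add 1 to *p
// and return the value that was stored.
uint32_t increment_with_retry(_Atomic uint32_t *p) {
    assert(p != NULL);
    // Plays the role of the load-linked step: read the current value.
    uint32_t old = atomic_load_explicit(p, memory_order_relaxed);
    // Plays the role of store-conditional: store old + 1 only if *p still
    // equals old. On failure, old is refreshed with the current value.
    while (!atomic_compare_exchange_weak_explicit(
               p, &old, old + 1,
               memory_order_acq_rel, memory_order_relaxed)) {
        // Someone else changed *p (or the weak CAS failed spuriously): retry.
    }
    return old + 1;
}
```

The weak form of compare-exchange is allowed to fail spuriously, which is why it sits in a loop, just as the assembly loops on a non-zero stlxr status.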


Implementations of operations on atomic variables were improved in the second version of ARMv8, called ARMv8.1. The load linked and store conditional instructions are still available but several new instructions were added which perform certain operations such as addition, subtraction and various bitwise operations in a single atomic instruction.

For example:

    mov       w1, 1
    ldaddal   w1, w0, [x0]
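This does the same work as the loop above, atomically adding one to the value in memory pointed to by x0, but in a single instruction. The value that was in memory before the addition is returned in w0.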
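
From C, the same operation is usually written with `atomic_fetch_add`. This is a sketch under assumptions: the name `add_one` is hypothetical, and whether the compiler emits a single ldaddal, an LL/SC loop, or a call to a runtime helper depends on the compiler and on flags such as `-march=armv8.1-a`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (not from the original notes): atomically add 1 to *p.
// Like ldaddal, it hands back the value that was in memory BEFORE the add.
uint32_t add_one(_Atomic uint32_t *p) {
    assert(p != NULL);
    return atomic_fetch_add(p, 1);  // sequentially consistent by default
}
```

Compiling this with `-S` and inspecting the output is a quick way to see which instruction sequence a given toolchain picks.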



##### spin-lock

Here is the source code to the spin-lock for ARM V8.

Lock

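```asm
Lock:
        START_PROC
        mov         w3, 1           // value to store: 1 means "locked"
1:      ldaxr       w1, [x0]        // exclusive load-acquire; marks [x0] for exclusive access
        cbnz        w1, 1b          // lock is non-zero (held by someone else): keep spinning
        stlxr       w2, w3, [x0]    // try the exclusive store; w2 = 0 on success
        cbnz        w2, 1b          // store failed (contention): keep spinning
        ret
        END_PROC
```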

ldaxr dereferences the lock itself (once again an int32_t) and marks the location of the lock as being, hopefully, exclusive.
Having gotten the value of the lock, its value is inspected and if found to be non-zero, we branch back to attempting to get it again - this is the spin.
If the contents of the lock is 0, we try to store a non-zero value. The value 1 is kept in w3, loaded once before the loop, so the stlxr can use it directly.
stlxr w2, w3, [x0] conditionally stores w3 to the location of the lock: if the exclusive tag is still valid (nobody else got to the lock first), the value is written to *x0. Either way the status lands in w2. If w2 is 0, we got the lock. If not, we start over - somebody else got in there ahead of us. Perhaps this happened because we were descheduled. Perhaps we lost the lock to another thread running on a different core.

Unlock

Unlock:
        START_PROC
        dmb         ish             // barrier first: critical-section accesses are ordered before the release
        str         wzr, [x0]       // write 0 to release the lock
        ret
        END_PROC

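All it does is set the value of the lock to zero. The correct operation of the lock requires that no bad actor simply stomps on the lock by calling Unlock without first owning the lock. Just say no to lock stompers.
dmb ish is a data memory barrier covering the inner shareable domain, which in practice means the cores running the threads of this process. It orders the unlock against the memory accesses of the critical section as other cores see them. For release semantics the barrier has to come before the store that clears the lock; a single stlr wzr, [x0] would also give release ordering. This code seemed to work without the barrier, but on a weakly ordered machine like ARM that proves little. In Lock(), ldaxr and stlxr carry acquire and release ordering themselves, so no separate barrier is needed there.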
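
The Lock and Unlock routines can be sketched in portable C11 as follows. This is an illustration under assumptions, not the original author's code: the names `spinlock_t`, `spin_lock` and `spin_unlock` are hypothetical, and the acquire/release orderings are chosen to mirror what ldaxr and a release store provide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical type name: 0 means free, 1 means held.
typedef _Atomic int32_t spinlock_t;

void spin_lock(spinlock_t *lock) {
    assert(lock != NULL);
    for (;;) {
        // Spin while the lock looks held (the ldaxr / cbnz w1 part).
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0) {
        }
        int32_t expected = 0;
        // Try to change 0 -> 1 (the stlxr / cbnz w2 part). Acquire ordering
        // keeps the critical section from moving above the lock.
        if (atomic_compare_exchange_weak_explicit(
                lock, &expected, 1,
                memory_order_acquire, memory_order_relaxed)) {
            return;
        }
        // Lost the race or failed spuriously: start over.
    }
}

void spin_unlock(spinlock_t *lock) {
    assert(lock != NULL);
    // Release ordering: writes made while holding the lock become visible
    // to other threads before the lock is seen as free.
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

As with the assembly version, nothing stops a caller from invoking `spin_unlock` without holding the lock; the discipline is left to the caller.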

Summary (pseudocode view)
🔒 Lock(x0):

for (;;) {
    w1 = load_exclusive(x0);                 // ldaxr: atomic exclusive load
    if (w1 != 0) continue;                   // lock is held: keep spinning
    if (store_exclusive(x0, 1) == 0) break;  // stlxr: 0 means we got the lock
}                                            // otherwise someone else beat us: retry