x86-64 machine code (Linux system calls): 78 bytes
RDTSC spin-loop timing, Linux sys_write system call.
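x86-64 doesn't provide a convenient way to query the RDTSC "reference clock" frequency at run time. You can read an MSR (and do a calculation based on that), but that requires kernel mode, or root + opening /dev/cpu/%d/msr, so I decided to make the frequency a build-time constant. (Adjust FREQ_RDTSC as necessary: any 32-bit constant won't change the size of the machine code.)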
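Note that x86 CPUs have had a fixed RDTSC frequency for several years, so it is usable as a time source, not as a core clock cycle performance counter, unless you take steps to disable frequency changes. (There are actual perf counters for counting real CPU cycles.) Usually it ticks at the nominal sticker frequency, e.g. 4.0GHz for my i7-6700k, regardless of turbo or power saving. Anyway, this busy-wait timing doesn't depend on load average (as a calibrated delay loop would), and it also isn't sensitive to CPU power saving.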
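This code works for any x86 with a reference frequency below 2^32 Hz, i.e. up to ~4.29 GHz. Beyond that, the low 32 bits of the timestamp would wrap all the way around within 1 second, so I'd have to look at the high 32 bits of the result in edx, too.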
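To make the 32-bit limit concrete, here is a minimal C sketch of the comparison the spin loop performs, assuming only the low 32 bits of each timestamp are kept. The function name second_elapsed is made up for illustration; FREQ_RDTSC mirrors the constant in the listing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed tick rate; mirrors FREQ_RDTSC from the listing. */
#define FREQ_RDTSC 4000000000ULL

/* The 32-bit compare only works if one second of ticks fits in 32 bits. */
static_assert(FREQ_RDTSC <= UINT32_MAX, "reference frequency must be below 2^32 Hz");

/* Hypothetical helper: true once at least FREQ_RDTSC ticks separate the two
   timestamps. Unsigned subtraction wraps modulo 2^32, so the delta is still
   correct if the low half of the counter rolled over once in between. */
static bool second_elapsed(uint32_t start_lo, uint32_t now_lo)
{
    uint32_t delta = now_lo - start_lo;   /* may wrap; still the true delta */
    return delta >= FREQ_RDTSC;
}
```

With a frequency at or above 2^32 Hz, delta could never reach the threshold in 32 bits, which is why the high half in edx would then be needed.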
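Summary:
push 00:00:00\n on the stack. Then in a loop:
- sys_write system call
- ADC loop over the digits (starting with the last) to increment the time by 1. Wrapping / carry-out is handled with a cmp / cmov, with the CF result providing the carry-in for the next digit.
- rdtsc and save the start time.
- spin on rdtsc until the delta is >= ticks per second of the RDTSC frequency.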
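The digit-increment step can be modelled in C as follows. This is a sketch of the logic only, not a translation of the machine code: the function name increment_clock is invented here, and an int carry variable stands in for the CPU's carry flag.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of the ADC loop: add one second to an "HH:MM:SS"
   buffer in place, working backwards from the last digit. A units digit
   wraps after '9' and a tens digit after '5'; each wrap carries into the
   next digit to the left. */
static void increment_clock(char *buf)
{
    assert(strlen(buf) >= 8);
    int carry = 1;                        /* initial carry-in adds the second */
    int i = 7;                            /* index of the last seconds digit */
    for (int field = 0; field < 3; field++) {
        char limit = '9';                 /* wrap limit for the units digit */
        for (int digit = 0; digit < 2; digit++) {
            char c = (char)(buf[i] + carry);
            carry = (c > limit);          /* carry-out, like cmp setting CF */
            if (carry)
                c = '0';                  /* wrap, like cmovc eax, ebx */
            buf[i--] = c;
            limit = '5';                  /* tens digit wraps at 59 */
        }
        i--;                              /* skip the ':' separator */
    }
}
```

Calling increment_clock on a buffer holding 00:00:59 leaves 00:01:00 in it, and 59:59:59 rolls over to 00:00:00, matching the maximum count noted in the listing.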
NASM listing:
; listing columns: line number, address, machine code bytes, source
1 ; mov %1, %2 ; use this macro to copy 64-bit registers in 2 bytes (no REX prefix)
2 %macro MOVE 2
3 push %2
4 pop %1
5 %endmacro
6
7 ; frequency as a build-time constant because there's no easy way detect it without root + system calls, or kernel mode.
8 FREQ_RDTSC equ 4000000000
9 global _start
10 _start:
11 00000000 6A0A push 0xa ; newline
12 00000002 48BB30303A30303A3030 mov rbx, "00:00:00"
13 0000000C 53 push rbx
14 ; rsp points to `00:00:00\n`
20
21 ; rbp = 0 (Linux process startup. push imm8 / pop is as short as LEA for small constants)
22 ; low byte of rbx = '0'
23 .print:
24 ; edx potentially holds garbage (from rdtsc)
25
26 0000000D 8D4501 lea eax, [rbp+1] ; __NR_write = 1
27 00000010 89C7 mov edi, eax ; fd = 1 = stdout
28 MOVE rsi, rsp
28 00000012 54 <1> push %2
28 00000013 5E <1> pop %1
29 00000014 8D5008 lea edx, [rax-1 + 9] ; len = 9 bytes.
30 00000017 0F05 syscall ; sys_write(1, buf, 9)
31
32 ;; increment counter string: least-significant digits are at high addresses (in printing order)
33 00000019 FD std ; so loop backwards from the end, wrapping each digit manually
34 0000001A 488D7E07 lea rdi, [rsi+7]
35 MOVE rsi, rdi
35 0000001E 57 <1> push %2
35 0000001F 5E <1> pop %1
36
37 ;; edx=9 from the system call
38 00000020 83C2FA add edx, -9 + 3 ; edx=3 and set CF (so the low digit of seconds will be incremented by the carry-in)
39 ;stc
40 .string_increment_60: ; do {
41 00000023 66B93902 mov cx, 0x0200 + '9' ; saves 1 byte vs. ecx.
42 ; cl = '9' = wrap limit for manual carry of low digit. ch = 2 = digit counter
43 .digitpair:
44 00000027 AC lodsb
45 00000028 1400 adc al, 0 ; carry-in = cmp from previous iteration; other instructions preserve CF
46 0000002A 38C1 cmp cl, al ; manual carry-out + wrapping at '9' or '5'
47 0000002C 0F42C3 cmovc eax, ebx ; bl = '0'. 1B shorter than JNC over a MOV al, '0'
48 0000002F AA stosb
49
50 00000030 8D49FC lea ecx, [rcx-4] ; '9' -> '5' for the tens digit, so we wrap at 59
51 00000033 FECD dec ch
52 00000035 75F0 jnz .digitpair
53 ; hours wrap from 59 to 00, so the max count is 59:59:59
54
55 00000037 AC lodsb ; skip the ":" separator
56 00000038 AA stosb ; and increment rdi by storing the byte back again. scasb would clobber CF
57
58 00000039 FFCA dec edx
59 0000003B 75E6 jnz .string_increment_60
60
61 ; busy-wait for 1 second. Note that time spent printing isn't counted, so error accumulates with a bias in one direction
62 0000003D 0F31 rdtsc ; looking only at the 32-bit low halves works as long as RDTSC freq < 2^32 = ~4.29GHz
63 0000003F 89C1 mov ecx, eax ; ecx = start
64 .spinwait:
65 ; pause
66 00000041 0F31 rdtsc ; edx:eax = reference cycles since boot
67 00000043 29C8 sub eax, ecx ; delta = now - start. This may wrap, but now we have the delta ready for a normal compare
68 00000045 3D00286BEE cmp eax, FREQ_RDTSC ; } while(delta < counts_per_second)
69 ; cmp eax, 40 ; fast count to test printing
70 0000004A 72F5 jb .spinwait
71
72 0000004C EBBF jmp .print
next address = 0x4E = size = 78 bytes.
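Uncomment the pause instruction to save significant power: without pause this heats one core up by ~15 degrees C, but only by ~9 with pause. (That is on Skylake, where pause sleeps for ~100 cycles instead of ~5. I think it would save more if rdtsc weren't also slow-ish, so the CPU isn't doing much a lot of the time.)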
A 32-bit version would be a few bytes shorter, e.g. by using a 32-bit version of this to push the initial 00:00:00\n string:
16 ; mov ebx, "00:0"
17 ; push rbx
18 ; bswap ebx
19 ; mov dword [rsp+4], ebx ; in 32-bit mode, mov-imm / push / bswap / push would be 9 bytes vs. 11
It would also use the 1-byte dec edx encoding. The int 0x80 system call ABI wouldn't use esi / edi, so the register setup for the system call vs. lodsb / stosb might be simpler.
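The commented-out bswap sequence relies on the second half of 00:00:00 being the byte-reversed first half. A small C sketch of that observation follows; the helper names bswap32, store_le32 and build_clock are invented for illustration, and little-endian stores are written out explicitly to mirror what x86 does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 32-bit value, like the x86 BSWAP instruction. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Store a 32-bit value in little-endian byte order, as an x86 store would. */
static void store_le32(char *dst, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        dst[i] = (char)((v >> (8 * i)) & 0xFFu);
}

/* Hypothetical helper: build "00:00:00" from the single dword "00:0". */
static void build_clock(char out[9])
{
    assert(out != NULL);
    /* "00:0" as a little-endian dword: the first character is the low byte */
    uint32_t half = ((uint32_t)'0') | ((uint32_t)'0' << 8)
                  | ((uint32_t)':' << 16) | ((uint32_t)'0' << 24);
    store_le32(out, half);                /* bytes 0..3: "00:0" */
    store_le32(out + 4, bswap32(half));   /* bytes 4..7: "0:00" */
    out[8] = '\0';
    assert(strlen(out) == 8);
}
```

Here half has the value 0x303A3030, and its byte-swapped form stored at offset 4 supplies the remaining 0:00, so one 4-byte immediate is enough to produce the whole time string.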