x86 machine code (SIMD 4x float using 128-bit SSE1 & AVX), 94 bytes
x86 machine code (SIMD 4x double using 256-bit AVX), 123 bytes
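The float version passes the test cases in the question, but only with a loop-exit threshold small enough that it easily gets stuck in an infinite loop on random inputs.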
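SSE1 packed-single instructions are 3 bytes long, but SSE2 and simple AVX instructions are 4 bytes long. (Scalar-single instructions such as sqrtss are also 4 bytes long, which is why I use sqrtps even though I only care about the low element. It is not even slower than sqrtss on modern hardware.) I used AVX where a non-destructive destination saves 2 bytes compared with movaps + op.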
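In the double version we can still use a couple of movlhps instructions to copy 64-bit chunks (because we often only care about the low element of a horizontal sum). The horizontal sum of a 256-bit SIMD vector also needs an extra vextractf128 to get the high half, whereas float can use the slow but small 2x haddps strategy. The double version also needs 2x 8-byte constants instead of 2x 4-byte constants. Overall it comes out at close to 4/3 the size of the float version.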
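mean(a,b) = mean(a,a,b,b) holds for all 4 of these means, so we can simply duplicate the input up to 4 elements and never have to implement length = 2. That lets us hard-code the geometric mean as, for example, 4th-root = sqrt(sqrt). And we only need one FP constant, 4.0.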
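We keep all 4 values in a single SIMD vector [a_i, b_i, c_i, d_i]. From that we compute the 4 means as scalars in separate registers, then shuffle them back together for the next iteration. (Horizontal operations on SIMD vectors are inconvenient, but we need to do the same thing to all 4 elements in enough cases that it balances out. I started on an x87 version, but it was getting very long and wasn't fun.)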
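The loop-exit condition }while(quadratic - harmonic > 4e-5) (or a smaller constant for double) is based on @RobinRyder's R answer and Kevin Cruijssen's Java answer: the quadratic mean is always the largest and the harmonic mean is always the smallest (ignoring rounding errors), so we can check the difference between those two to detect convergence. We return the arithmetic mean as the scalar result. It usually lies between those two and is probably the least susceptible to rounding errors.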
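Float version: callable as float meanmean_float_avx(__m128); with the arg and the return value in xmm0. (So x86-64 System V or Windows x64 vectorcall, but not x64 fastcall.) Or declare the return type as __m128 so you can get at the quadratic and harmonic means for testing.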
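Letting it take 2 separate float args in xmm0 and xmm1 would cost 1 extra byte: we would need a shufps with an imm8 (instead of just unpcklps xmm0,xmm0) to shuffle together and duplicate the 2 inputs.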
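The two shuffles being compared can be sketched with SSE1 intrinsics, assuming an x86 target with xmmintrin.h available. The function names dup_pair, dup_two_args and check_dup are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <xmmintrin.h>   /* SSE1 intrinsics */

/* One vector arg holding [a, b, ?, ?] (lane 0 first).
   unpcklps with itself interleaves the low two lanes: [a, a, b, b]. */
__m128 dup_pair(__m128 v) {
    return _mm_unpacklo_ps(v, v);
}

/* Two separate scalar args, a in lane 0 of one register and b in lane 0
   of another. shufps needs an immediate to pick the lanes:
   low two lanes come from the first source, high two from the second. */
__m128 dup_two_args(__m128 a, __m128 b) {
    return _mm_shuffle_ps(a, b, _MM_SHUFFLE(0, 0, 0, 0));   /* [a, a, b, b] */
}

/* Small self-check that both routes produce the same lane layout. */
void check_dup(void) {
    float r1[4], r2[4];
    _mm_storeu_ps(r1, dup_pair(_mm_set_ps(0.0f, 0.0f, 2.0f, 1.0f)));
    _mm_storeu_ps(r2, dup_two_args(_mm_set_ss(1.0f), _mm_set_ss(2.0f)));
    for (int i = 0; i < 4; i++)
        assert(r1[i] == r2[i]);
}
```

Either layout works for the means, since they do not depend on element order; the only difference that matters for golfing is the extra immediate byte in the shufps encoding.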
40 address align 32
41 code bytes global meanmean_float_avx
42 meanmean_float_avx:
43 00000000 B9[52000000] mov ecx, .arith_mean ; allows 2-byte call reg, and a base for loading constants
44 00000005 C4E2791861FC vbroadcastss xmm4, [rcx-4] ; float 4.0
45
46 ;; mean(a,b) = mean(a,b,a,b) for all 4 types of mean
47 ;; so we only ever have to do the length=4 case
48 0000000B 0F14C0 unpcklps xmm0,xmm0 ; [b,a] => [b,b,a,a]
49
50 ; do{ ... } while(quadratic - harmonic > threshold);
51 .loop:
52 ;;; XMM3 = geometric mean: not based on addition. (Transform to log would be hard. AVX512ER has exp with 23-bit accuracy, but not log. vgetexp = floor(log2(x)), so that's no good.)
53 ;; sqrt once *first*, making magnitudes closer to 1.0 to reduce rounding error. Numbers are all positive so this is safe.
54 ;; both sqrts first was better behaved, I think.
55 0000000E 0F51D8 sqrtps xmm3, xmm0 ; xmm3 = 4th root(x)
56 00000011 F30F16EB movshdup xmm5, xmm3 ; bring odd elements down to even
57 00000015 0F59EB mulps xmm5, xmm3
58 00000018 0F12DD movhlps xmm3, xmm5 ; high half -> low
59 0000001B 0F59DD mulps xmm3, xmm5 ; xmm3[0] = hproduct(sqrt(xmm))
60 ; sqrtps xmm3, xmm3 ; sqrt(hprod(sqrt)) = 4th root(hprod)
61 ; common final step done after interleaving with quadratic mean
62
63 ;;; XMM2 = quadratic mean = max of the means
64 0000001E C5F859E8 vmulps xmm5, xmm0,xmm0
65 00000022 FFD1 call rcx ; arith mean of squares
66 00000024 0F14EB unpcklps xmm5, xmm3 ; [quad^2, geo^2, ?, ?]
67 00000027 0F51D5 sqrtps xmm2, xmm5 ; [quad, geo, ?, ?]
68
69 ;;; XMM1 = harmonic mean = min of the means
70 0000002A C5D85EE8 vdivps xmm5, xmm4, xmm0 ; 4/x
71 0000002E FFD1 call rcx ; arithmetic mean (under inversion)
72 00000030 C5D85ECD vdivps xmm1, xmm4, xmm5 ; 4/. (the factor of 4 cancels out)
73
74 ;;; XMM5 = arithmetic mean
75 00000034 0F28E8 movaps xmm5, xmm0
76 00000037 FFD1 call rcx
77
78 00000039 0F14E9 unpcklps xmm5, xmm1 ; [arith, harm, ?,?]
79 0000003C C5D014C2 vunpcklps xmm0, xmm5,xmm2 ; x = [arith, harm, quad, geo]
80
81 00000040 0F5CD1 subps xmm2, xmm1 ; largest - smallest mean: guaranteed non-negative
82 00000043 0F2E51F8 ucomiss xmm2, [rcx-8] ; quad-harm > convergence_threshold
83 00000047 73C5 jae .loop
84
85 ; return with the arithmetic mean in the low element of xmm0 = scalar return value
86 00000049 C3 ret
87
88 ;;; "constant pool" between the main function and the helper, like ARM literal pools
89 0000004A ACC52738 .fpconst_threshold: dd 4e-5 ; 4.3e-5 is the highest we can go and still pass the main test cases
90 0000004E 00008040 .fpconst_4: dd 4.0
91 .arith_mean: ; returns XMM5 = hsum(xmm5)/4.
92 00000052 C5D37CED vhaddps xmm5, xmm5 ; slow but small
93 00000056 C5D37CED vhaddps xmm5, xmm5
94 0000005A 0F5EEC divps xmm5, xmm4 ; divide before/after summing doesn't matter mathematically or numerically; divisor is a power of 2
95 0000005D C3 ret
96 0000005E 5E000000 .size: dd $ - meanmean_float_avx
0x5e = 94 bytes
(NASM listing created with nasm -felf64 mean-mean.asm -l/dev/stdout | cut -b -34,$((34+6))-. Strip the listing part and recover the source with cut -b 34- > mean-mean.asm.)
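The SIMD horizontal sum and divide by 4 (i.e. the arithmetic mean) is implemented in a separate function that we call (through a function pointer, to amortize the cost of the address). With 4/x before and after, or x^2 before and sqrt after, the same helper gives us the harmonic mean and the quadratic mean. (It was painful to write these div instructions instead of multiplying by an exactly representable 0.25.)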
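The geometric mean is implemented separately, with multiplies and chained square roots, or with one sqrt done first to reduce the exponent magnitude, which may help numerical precision. log is not available; only floor(log2(x)) is, via AVX512 vgetexpps/pd. exp is sort of available via AVX512ER (Xeon Phi only), but with only 2^-23 precision.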
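Mixing 128-bit AVX instructions with legacy SSE is not a performance problem. Mixing 256-bit AVX with legacy SSE can be a problem on Haswell, but on Skylake it only potentially creates false dependencies for the SSE instructions. I think my double version avoids any unnecessary loop-carried dep chains, so it should bottleneck only on div/sqrt latency and throughput.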
Double version:
108 global meanmean_double_avx
109 meanmean_double_avx:
110 00000080 B9[E8000000] mov ecx, .arith_mean
111 00000085 C4E27D1961F8 vbroadcastsd ymm4, [rcx-8] ; double 4.0
112
113 ;; mean(a,b) = mean(a,b,a,b) for all 4 types of mean
114 ;; so we only ever have to do the length=4 case
115 0000008B C4E37D18C001 vinsertf128 ymm0, ymm0, xmm0, 1 ; [b,a] => [b,a,b,a]
116
117 .loop:
118 ;;; XMM3 = geometric mean: not based on addition.
119 00000091 C5FD51D8 vsqrtpd ymm3, ymm0 ; sqrt first to get magnitude closer to 1.0 for better(?) numerical precision
120 00000095 C4E37D19DD01 vextractf128 xmm5, ymm3, 1 ; extract high lane
121 0000009B C5D159EB vmulpd xmm5, xmm3
122 0000009F 0F12DD movhlps xmm3, xmm5 ; extract high half
123 000000A2 F20F59DD mulsd xmm3, xmm5 ; xmm3 = hproduct(sqrt(xmm0))
124 ; sqrtsd xmm3, xmm3 ; xmm3 = 4th root = geomean(xmm0) ;deferred until quadratic
125
126 ;;; XMM2 = quadratic mean = max of the means
127 000000A6 C5FD59E8 vmulpd ymm5, ymm0,ymm0
128 000000AA FFD1 call rcx ; arith mean of squares
129 000000AC 0F16EB movlhps xmm5, xmm3 ; [quad^2, geo^2]
130 000000AF 660F51D5 sqrtpd xmm2, xmm5 ; [quad , geo]
131
132 ;;; XMM1 = harmonic mean = min of the means
133 000000B3 C5DD5EE8 vdivpd ymm5, ymm4, ymm0 ; 4/x
134 000000B7 FFD1 call rcx ; arithmetic mean under inversion
135 000000B9 C5DB5ECD vdivsd xmm1, xmm4, xmm5 ; 4/. (the factor of 4 cancels out)
136
137 ;;; XMM5 = arithmetic mean
138 000000BD C5FC28E8 vmovaps ymm5, ymm0
139 000000C1 FFD1 call rcx
140
141 000000C3 0F16E9 movlhps xmm5, xmm1 ; [arith, harm]
142 000000C6 C4E35518C201 vinsertf128 ymm0, ymm5, xmm2, 1 ; x = [arith, harm, quad, geo]
143
144 000000CC C5EB5CD1 vsubsd xmm2, xmm1 ; largest - smallest mean: guaranteed non-negative
145 000000D0 660F2E51F0 ucomisd xmm2, [rcx-16] ; quad - harm > threshold
146 000000D5 77BA ja .loop
147
148 ; vzeroupper ; not needed for correctness, only performance
149 ; return with the arithmetic mean in the low element of xmm0 = scalar return value
150 000000D7 C3 ret
151
152 ; "literal pool" between the main function and the helper
153 000000D8 95D626E80B2E113E .fpconst_threshold: dq 1e-9
154 000000E0 0000000000001040 .fpconst_4: dq 4.0 ; TODO: golf these zeros? vpbroadcastb and convert?
155 .arith_mean: ; returns YMM5 = hsum(ymm5)/4.
156 000000E8 C4E37D19EF01 vextractf128 xmm7, ymm5, 1
157 000000EE C5D158EF vaddpd xmm5, xmm7
158 000000F2 C5D17CED vhaddpd xmm5, xmm5 ; slow but small
159 000000F6 C5D35EEC vdivsd xmm5, xmm4 ; only low element matters
160 000000FA C3 ret
161 000000FB 7B000000 .size: dd $ - meanmean_double_avx
0x7b = 123 bytes
C test harness
#include <immintrin.h>
#include <stdio.h>
#include <math.h>
#include <stdlib.h> // drand48
static const struct ab_avg {
double a,b;
double mean;
} testcases[] = {
{1, 1, 1},
{1, 2, 1.45568889},
{100, 200, 145.568889},
{2.71, 3.14, 2.92103713},
{0.57, 1.78, 1.0848205},
{1.61, 2.41, 1.98965438},
{0.01, 100, 6.7483058},
};
// see asm comments for order of arith, harm, quad, geo
__m128 meanmean_float_avx(__m128); // or float ...
__m256d meanmean_double_avx(__m128d); // or double ...
int main(void) {
int len = sizeof(testcases) / sizeof(testcases[0]);
for(int i=0 ; i<len ; i++) {
const struct ab_avg *p = &testcases[i];
#if 1
__m128 arg = _mm_set_ps(0,0, p->b, p->a);
double res = meanmean_float_avx(arg)[0];
#else
__m128d arg = _mm_loadu_pd(&p->a);
double res = meanmean_double_avx(arg)[0];
#endif
double allowed_diff = (p->b - p->a) / 100000.0;
double delta = fabs(p->mean - res);
if (delta > 1e-3 || delta > allowed_diff) {
printf("%f %f => %.9f but we got %.9f. delta = %g allowed=%g\n",
p->a, p->b, p->mean, res, p->mean - res, allowed_diff);
}
}
while(1) {
double a = drand48(), b = drand48(); // range= [0..1)
if (a>b) {
double tmp=a;
a=b;
b=tmp; // sorted
}
// a *= 0.00000001;
// b *= 123156;
// a += 1<<11; b += (1<<12)+1; // float version gets stuck inflooping on 2048.04, 4097.18 at fpthreshold = 4e-5
// a *= 1<<11 ; b *= 1<<11; // scaling to large magnitude makes sum of squares lose more precision
//a += 1<<11; b+= 1<<11; // adding to large magnitude is hard for everything, catastrophic cancellation
#if 1
printf("testing float %g, %g\n", a, b);
__m128 arg = _mm_set_ps(0,0, b, a);
__m128 res = meanmean_float_avx(arg);
double quad = res[2], harm = res[1]; // same order as double... for now
#else
printf("testing double %g, %g\n", a, b);
__m128d arg = _mm_set_pd(b, a);
__m256d res = meanmean_double_avx(arg);
double quad = res[2], harm = res[1];
#endif
double delta = fabs(quad - harm);
double allowed_diff = (b - a) / 100000.0; // calculated in double even for the float case.
// TODO: use the double res as a reference for float res
// instead of just checking quadratic vs. harmonic mean
if (delta > 1e-3 || delta > allowed_diff) {
printf("%g %g we got q=%g, h=%g, a=%g. delta = %g, allowed=%g\n",
a, b, quad, harm, res[0], quad-harm, allowed_diff);
}
}
}
Build with:
nasm -felf64 mean-mean.asm &&
gcc -no-pie -fno-pie -g -O2 -march=native mean-mean.c mean-mean.o
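Obviously you need a CPU with AVX support, or an emulator such as Intel SDE. To compile on a host without native AVX support, use -march=sandybridge or -mavx.
Run: it passes the hard-coded test cases, but for the float version, random test cases often fail the (b-a)/10000 threshold set in the question.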
$ ./a.out
(note: empty output before the first "testing float" means clean pass on the constant test cases)
testing float 3.90799e-14, 0.000985395
3.90799e-14 0.000985395 we got q=3.20062e-10, h=3.58723e-05, a=2.50934e-05. delta = -3.5872e-05, allowed=9.85395e-09
testing float 0.041631, 0.176643
testing float 0.0913306, 0.364602
testing float 0.0922976, 0.487217
testing float 0.454433, 0.52675
0.454433 0.52675 we got q=0.48992, h=0.489927, a=0.489925. delta = -6.79493e-06, allowed=7.23169e-07
testing float 0.233178, 0.831292
testing float 0.56806, 0.931731
testing float 0.0508319, 0.556094
testing float 0.0189148, 0.767051
0.0189148 0.767051 we got q=0.210471, h=0.210484, a=0.21048. delta = -1.37389e-05, allowed=7.48136e-06
testing float 0.25236, 0.298197
0.25236 0.298197 we got q=0.274796, h=0.274803, a=0.274801. delta = -6.19888e-06, allowed=4.58374e-07
testing float 0.531557, 0.875981
testing float 0.515431, 0.920261
testing float 0.18842, 0.810429
testing float 0.570614, 0.886314
testing float 0.0767746, 0.815274
testing float 0.118352, 0.984891
0.118352 0.984891 we got q=0.427845, h=0.427872, a=0.427863. delta = -2.66135e-05, allowed=8.66539e-06
testing float 0.784484, 0.893906
0.784484 0.893906 we got q=0.838297, h=0.838304, a=0.838302. delta = -7.09295e-06, allowed=1.09422e-06
FP error is enough that quad - harm comes out less than zero for some inputs.
Or with a += 1<<11; b += (1<<12)+1; uncommented:
testing float 2048, 4097
testing float 2048.04, 4097.18
^C (stuck in an infinite loop).
None of these problems happen with double. Comment out the printf before each test to see that the output is empty (nothing from the if(delta too high) block).
TODO: use the double version as a reference for the float version, instead of just looking at how they converge via quad - harm.