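We've been testing a server with 2x Xeon Gold 6154 CPUs on a Supermicro X11DPH-I motherboard with 96GB RAM, and found some very strange memory performance issues compared to running with only 1 CPU (one socket empty), to a similar dual-CPU Haswell Xeon E5-2687Wv3 system (used for this series of tests, though other Broadwells perform similarly), and to Broadwell-E i7 and Skylake-X i9 machines (for comparison).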
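One would expect the Skylake Xeon processors, with their faster memory, to outperform the Haswell across the various memcpy functions and even in memory allocation (not covered in the tests below, since we found a workaround). Instead, with both CPUs installed, the Skylake Xeons run at almost half the speed of the Haswell Xeons, and even less when compared to an i7-6800k. Stranger still, when using Windows VirtualAllocExNuma to assign the NUMA node for memory allocation, the plain memory copy functions perform worse on the remote node than on the local node, as expected, but the copy functions that use the SSE, MMX and AVX registers run much faster on the remote NUMA node than on the local node (what?). As noted above, with the Skylake Xeons this shows up only with both CPUs installed; the single-CPU results below are in line with the other machines.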
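I'm not sure whether this is a bug in the motherboard or the CPU, something to do with UPI vs. QPI, or none of the above, but no combination of BIOS settings seems to help. Disabling NUMA in the BIOS (not included in the test results) does improve the performance of all copy functions that use the SSE, MMX and AVX registers, but all the other plain memory copy functions then suffer large losses as well.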
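For our test program, we tested both inline assembly functions and `_mm` intrinsics. We used Windows 10 with Visual Studio 2017 for everything except the assembly functions: since msvc++ won't compile asm for x64, we used gcc from mingw/msys to compile them to an obj file with the `-c -O2` flags, which we then passed to the msvc++ linker.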
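To make the intrinsic variants concrete, here is a minimal sketch of what an SSE2 copy with aligned loads and non-temporal stores might look like. The name `sse2_copy_sketch` is hypothetical and this is not the `sse2_memcpy` from the test program; it assumes 16-byte aligned pointers and a size that is a multiple of 64, and falls back to `std::memcpy` on targets without SSE2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in _mm_sfence)
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Hypothetical helper, not taken from the linked repository.
// Assumes dst and src are 16-byte aligned and size is a multiple of 64.
inline void sse2_copy_sketch(void* dst, const void* src, std::size_t size) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    assert(size % 64 == 0);
#if SKETCH_HAVE_SSE2
    auto* d = static_cast<__m128i*>(dst);
    const auto* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0; i < size / 16; i += 4) {
        // Four aligned 16-byte loads cover one 64-byte cache line.
        __m128i a = _mm_load_si128(s + i + 0);
        __m128i b = _mm_load_si128(s + i + 1);
        __m128i c = _mm_load_si128(s + i + 2);
        __m128i e = _mm_load_si128(s + i + 3);
        // Non-temporal stores: written with a hint to avoid polluting the cache.
        _mm_stream_si128(d + i + 0, a);
        _mm_stream_si128(d + i + 1, b);
        _mm_stream_si128(d + i + 2, c);
        _mm_stream_si128(d + i + 3, e);
    }
    _mm_sfence();  // order the NT stores before any later stores
#else
    std::memcpy(dst, src, size);  // portable fallback so the sketch still builds
#endif
}
```

The assembly versions would do the same work with explicit `movdqa`/`movntdq` instructions; comparing the two mostly shows how well the compiler schedules the intrinsic form.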
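If the system uses NUMA nodes, we test both operator new and VirtualAllocExNuma for each NUMA node as the memory allocation method, and compute a cumulative average over copies of 100 16MB memory buffers for each memory copy function, rotating which memory allocation we use between each set of tests.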
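For readers unfamiliar with the call, a minimal sketch of node-preferred allocation might look like the following. `alloc_on_node` and `free_on_node` are hypothetical names, not functions from the test program, and the non-Windows branch is only a stand-in (plain `malloc`, no node preference) so the snippet builds elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: reserve and commit memory with a preferred NUMA node.
inline void* alloc_on_node(std::size_t bytes, unsigned long node) {
#ifdef _WIN32
    return VirtualAllocExNuma(GetCurrentProcess(), nullptr, bytes,
                              MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                              static_cast<DWORD>(node));
#else
    (void)node;                  // assumption: no node preference in the fallback
    return std::malloc(bytes);
#endif
}

// Hypothetical helper: release memory obtained from alloc_on_node.
inline void free_on_node(void* p) {
#ifdef _WIN32
    if (p) VirtualFreeEx(GetCurrentProcess(), p, 0, MEM_RELEASE);
#else
    std::free(p);
#endif
}
```

On Windows the returned address is page-aligned, so the 64-byte alignment needed by the streaming functions comes for free. As far as we understand, physical pages are only assigned when the buffer is first touched, so an initialization pass over the buffer is what actually places it on the node.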
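All 100 source and 100 destination buffers are 64-byte aligned (for compatibility up to AVX512 with the streaming functions) and initialized to incremental data for the source buffers and 0xff for the destination buffers.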
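The buffer setup described above can be sketched roughly as follows. `AlignedBuffer`, `make_aligned_buffer` and `init_pair` are hypothetical names, and the exact "incremental" pattern (a byte counter that wraps at 256) is an assumption; the test program may do this differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>

// Hypothetical helper type, not taken from the linked repository.
struct AlignedBuffer {
    std::unique_ptr<unsigned char[]> raw;  // owns the over-allocated block
    unsigned char* data = nullptr;         // aligned view into raw
    std::size_t size = 0;
};

// Over-allocates by `alignment` bytes and uses std::align to find an aligned start.
inline AlignedBuffer make_aligned_buffer(std::size_t size, std::size_t alignment = 64) {
    AlignedBuffer b;
    std::size_t space = size + alignment;  // slack so an aligned start always fits
    b.raw.reset(new unsigned char[space]);
    void* p = b.raw.get();
    void* aligned = std::align(alignment, size, p, space);
    assert(aligned != nullptr);
    b.data = static_cast<unsigned char*>(aligned);
    b.size = size;
    return b;
}

// Source gets incrementing bytes (assumed pattern), destination gets 0xff.
inline void init_pair(AlignedBuffer& src, AlignedBuffer& dst) {
    for (std::size_t i = 0; i < src.size; ++i)
        src.data[i] = static_cast<unsigned char>(i);
    std::memset(dst.data, 0xff, dst.size);
}
```

Filling the destination with 0xff means a copy that silently skips bytes is easy to spot when the buffers are compared afterwards.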
The number of copies averaged on each machine and configuration varied, since the test ran much faster on some and much slower on others.
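The averaging itself can be sketched as below. `CopyFn` and `average_copy_us` are hypothetical names, and timing each copy individually with `std::chrono::steady_clock` is an assumption about the method, not a description of the actual test program.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical signature shared by every copy routine under test.
using CopyFn = void* (*)(void* dst, const void* src, std::size_t size);

// Returns the mean time per copy in microseconds, rotating through the buffer pairs
// so that consecutive copies do not reuse the same memory.
inline double average_copy_us(CopyFn copy,
                              const std::vector<unsigned char*>& dst,
                              const std::vector<const unsigned char*>& src,
                              std::size_t bytes, std::size_t copies) {
    assert(!src.empty() && src.size() == dst.size() && copies > 0);
    using clock = std::chrono::steady_clock;
    double total_us = 0.0;
    for (std::size_t i = 0; i < copies; ++i) {
        std::size_t k = i % src.size();  // pick the next buffer pair
        auto t0 = clock::now();
        copy(dst[k], src[k], bytes);
        auto t1 = clock::now();
        total_us += std::chrono::duration<double, std::micro>(t1 - t0).count();
    }
    return total_us / static_cast<double>(copies);
}
```

With 100 pairs of 16MB buffers, each pair is revisited only after about 3.2GB of other traffic, which is why cache reuse between copies is unlikely.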
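The results were as follows:
Haswell Xeon E5-2687Wv3, 1 CPU (1 empty socket), on a Supermicro X10DAi with 32GB DDR4-2400 (10c/20t, 25 MB of L3 cache). Keep in mind that the benchmark rotates through 100 pairs of 16MB buffers, so we are probably not getting L3 cache hits.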
---------------------------------------------------------------------------
Averaging 7000 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy averaging 2264.48 microseconds
asm_memcpy (asm) averaging 2322.71 microseconds
sse_memcpy (intrinsic) averaging 1569.67 microseconds
sse_memcpy (asm) averaging 1589.31 microseconds
sse2_memcpy (intrinsic) averaging 1561.19 microseconds
sse2_memcpy (asm) averaging 1664.18 microseconds
mmx_memcpy (asm) averaging 2497.73 microseconds
mmx2_memcpy (asm) averaging 1626.68 microseconds
avx_memcpy (intrinsic) averaging 1625.12 microseconds
avx_memcpy (asm) averaging 1592.58 microseconds
avx512_memcpy (intrinsic) unsupported on this CPU
rep movsb (asm) averaging 2260.6 microseconds
Haswell dual Xeon E5-2687Wv3, 2 CPUs, on a Supermicro X10DAi with 64GB RAM
---------------------------------------------------------------------------
Averaging 6900 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 0(local)
---------------------------------------------------------------------------
std::memcpy averaging 3179.8 microseconds
asm_memcpy (asm) averaging 3177.15 microseconds
sse_memcpy (intrinsic) averaging 1633.87 microseconds
sse_memcpy (asm) averaging 1663.8 microseconds
sse2_memcpy (intrinsic) averaging 1620.86 microseconds
sse2_memcpy (asm) averaging 1727.36 microseconds
mmx_memcpy (asm) averaging 2623.07 microseconds
mmx2_memcpy (asm) averaging 1691.1 microseconds
avx_memcpy (intrinsic) averaging 1704.33 microseconds
avx_memcpy (asm) averaging 1692.69 microseconds
avx512_memcpy (intrinsic) unsupported on this CPU
rep movsb (asm) averaging 3185.84 microseconds
---------------------------------------------------------------------------
Averaging 6900 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 1
---------------------------------------------------------------------------
std::memcpy averaging 3992.46 microseconds
asm_memcpy (asm) averaging 4039.11 microseconds
sse_memcpy (intrinsic) averaging 3174.69 microseconds
sse_memcpy (asm) averaging 3129.18 microseconds
sse2_memcpy (intrinsic) averaging 3161.9 microseconds
sse2_memcpy (asm) averaging 3141.33 microseconds
mmx_memcpy (asm) averaging 4010.17 microseconds
mmx2_memcpy (asm) averaging 3211.75 microseconds
avx_memcpy (intrinsic) averaging 3003.14 microseconds
avx_memcpy (asm) averaging 2980.97 microseconds
avx512_memcpy (intrinsic) unsupported on this CPU
rep movsb (asm) averaging 3987.91 microseconds
---------------------------------------------------------------------------
Averaging 6900 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy averaging 3172.95 microseconds
asm_memcpy (asm) averaging 3173.5 microseconds
sse_memcpy (intrinsic) averaging 1623.84 microseconds
sse_memcpy (asm) averaging 1657.07 microseconds
sse2_memcpy (intrinsic) averaging 1616.95 microseconds
sse2_memcpy (asm) averaging 1739.05 microseconds
mmx_memcpy (asm) averaging 2623.71 microseconds
mmx2_memcpy (asm) averaging 1699.33 microseconds
avx_memcpy (intrinsic) averaging 1710.09 microseconds
avx_memcpy (asm) averaging 1688.34 microseconds
avx512_memcpy (intrinsic) unsupported on this CPU
rep movsb (asm) averaging 3175.14 microseconds
Skylake Xeon Gold 6154, 1 CPU (1 empty socket), on a Supermicro X11DPH-I with 48GB DDR4-2666 (18c/36t, 24.75 MB of L3 cache)
---------------------------------------------------------------------------
Averaging 5000 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy averaging 1832.42 microseconds
asm_memcpy (asm) averaging 1837.62 microseconds
sse_memcpy (intrinsic) averaging 1647.84 microseconds
sse_memcpy (asm) averaging 1710.53 microseconds
sse2_memcpy (intrinsic) averaging 1645.54 microseconds
sse2_memcpy (asm) averaging 1794.36 microseconds
mmx_memcpy (asm) averaging 2030.51 microseconds
mmx2_memcpy (asm) averaging 1816.82 microseconds
avx_memcpy (intrinsic) averaging 1686.49 microseconds
avx_memcpy (asm) averaging 1716.15 microseconds
avx512_memcpy (intrinsic) averaging 1761.6 microseconds
rep movsb (asm) averaging 1977.6 microseconds
Skylake Xeon Gold 6154, 2 CPUs, on a Supermicro X11DPH-I with 96GB DDR4-2666
---------------------------------------------------------------------------
Averaging 4100 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 0(local)
---------------------------------------------------------------------------
std::memcpy averaging 3131.6 microseconds
asm_memcpy (asm) averaging 3070.57 microseconds
sse_memcpy (intrinsic) averaging 3297.72 microseconds
sse_memcpy (asm) averaging 3423.38 microseconds
sse2_memcpy (intrinsic) averaging 3274.31 microseconds
sse2_memcpy (asm) averaging 3413.48 microseconds
mmx_memcpy (asm) averaging 2069.53 microseconds
mmx2_memcpy (asm) averaging 3694.91 microseconds
avx_memcpy (intrinsic) averaging 3118.75 microseconds
avx_memcpy (asm) averaging 3224.36 microseconds
avx512_memcpy (intrinsic) averaging 3156.56 microseconds
rep movsb (asm) averaging 3155.36 microseconds
---------------------------------------------------------------------------
Averaging 4100 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 1
---------------------------------------------------------------------------
std::memcpy averaging 5309.77 microseconds
asm_memcpy (asm) averaging 5330.78 microseconds
sse_memcpy (intrinsic) averaging 2350.61 microseconds
sse_memcpy (asm) averaging 2402.57 microseconds
sse2_memcpy (intrinsic) averaging 2338.61 microseconds
sse2_memcpy (asm) averaging 2475.51 microseconds
mmx_memcpy (asm) averaging 2883.97 microseconds
mmx2_memcpy (asm) averaging 2517.69 microseconds
avx_memcpy (intrinsic) averaging 2356.07 microseconds
avx_memcpy (asm) averaging 2415.22 microseconds
avx512_memcpy (intrinsic) averaging 2487.01 microseconds
rep movsb (asm) averaging 5372.98 microseconds
---------------------------------------------------------------------------
Averaging 4100 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy averaging 3075.1 microseconds
asm_memcpy (asm) averaging 3061.97 microseconds
sse_memcpy (intrinsic) averaging 3281.17 microseconds
sse_memcpy (asm) averaging 3421.38 microseconds
sse2_memcpy (intrinsic) averaging 3268.79 microseconds
sse2_memcpy (asm) averaging 3435.76 microseconds
mmx_memcpy (asm) averaging 2061.27 microseconds
mmx2_memcpy (asm) averaging 3694.48 microseconds
avx_memcpy (intrinsic) averaging 3111.16 microseconds
avx_memcpy (asm) averaging 3227.45 microseconds
avx512_memcpy (intrinsic) averaging 3148.65 microseconds
rep movsb (asm) averaging 2967.45 microseconds
Skylake-X i9-7940X on an ASUS ROG Rampage VI Extreme with 32GB DDR4-4266 (14c/28t, 19.25 MB of L3 cache) (overclocked to 3.8GHz / 4.4GHz turbo, DDR at 4040MHz, target AVX frequency 3737MHz, target AVX-512 frequency 3535MHz, target cache frequency 2424MHz)
---------------------------------------------------------------------------
Averaging 6500 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy averaging 1750.87 microseconds
asm_memcpy (asm) averaging 1748.22 microseconds
sse_memcpy (intrinsic) averaging 1743.39 microseconds
sse_memcpy (asm) averaging 3120.18 microseconds
sse2_memcpy (intrinsic) averaging 1743.37 microseconds
sse2_memcpy (asm) averaging 2868.52 microseconds
mmx_memcpy (asm) averaging 2255.17 microseconds
mmx2_memcpy (asm) averaging 3434.58 microseconds
avx_memcpy (intrinsic) averaging 1698.49 microseconds
avx_memcpy (asm) averaging 2840.65 microseconds
avx512_memcpy (intrinsic) averaging 1670.05 microseconds
rep movsb (asm) averaging 1718.77 microseconds
Broadwell i7-6800k on an ASUS X99 with 24GB DDR4-2400 (6c/12t, 15 MB of L3 cache)
---------------------------------------------------------------------------
Averaging 64900 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy averaging 2522.1 microseconds
asm_memcpy (asm) averaging 2615.92 microseconds
sse_memcpy (intrinsic) averaging 1621.81 microseconds
sse_memcpy (asm) averaging 1669.39 microseconds
sse2_memcpy (intrinsic) averaging 1617.04 microseconds
sse2_memcpy (asm) averaging 1719.06 microseconds
mmx_memcpy (asm) averaging 3021.02 microseconds
mmx2_memcpy (asm) averaging 1691.68 microseconds
avx_memcpy (intrinsic) averaging 1654.41 microseconds
avx_memcpy (asm) averaging 1666.84 microseconds
avx512_memcpy (intrinsic) unsupported on this CPU
rep movsb (asm) averaging 2520.13 microseconds
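To compare rows across machines it can help to convert the average times into copy throughput. The helper below is a sketch under one assumption: that "16MB" in the output means 16 MiB (16 × 1024 × 1024 bytes). `copy_throughput_gbps` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Converts an average copy time into copy throughput in GB/s (10^9 bytes per second).
// Assumes the buffer is 16 MiB unless a different size is passed in.
inline double copy_throughput_gbps(double avg_microseconds,
                                   std::size_t bytes = 16u * 1024u * 1024u) {
    assert(avg_microseconds > 0.0);
    // Bytes per microsecond equals decimal MB/s; divide by 1000 to get GB/s.
    return static_cast<double>(bytes) / avg_microseconds / 1000.0;
}
```

Under that assumption, the single-socket Haswell `sse2_memcpy (intrinsic)` time of 1561.19 microseconds works out to about 10.7 GB/s of data copied, while the dual-socket Skylake local-node time of 3274.31 microseconds is about 5.1 GB/s. The memory traffic is roughly double those figures, since every byte is both read and written.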
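The assembly functions are derived from fast_memcpy in xine-libs, and are mostly there for comparison with msvc++'s optimizer.
The source code for the test is available at https://github.com/marcmicalizzi/memcpy_test (it is too long to post here).
Has anyone else run into this, or does anyone have any insight into why this might be happening?
Update 2018-05-15 13:40EST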
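So, as suggested by Peter Cordes, I've updated the test to compare prefetched vs. non-prefetched copies and NT stores vs. regular stores, and tuned the prefetching done in each function. (I don't have any meaningful experience writing prefetching, so if I've made any mistakes here, please let me know and I'll adjust the tests accordingly. The prefetching does have an impact, so at the very least it is doing something.) These changes are reflected in the latest revision at the GitHub link above, for anyone looking for the source code.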
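As an illustration of the variants being compared, a sketch along these lines generates each combination of store type and prefetch hint from one template. `copy_variant` and `Prefetch` are hypothetical names, and the default `prefetch_distance` of 512 bytes is a tunable placeholder, not a recommendation and not the value used in the actual test.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>  // SSE2 intrinsics, plus _mm_prefetch and _mm_sfence
#define VARIANT_HAVE_SSE2 1
#else
#define VARIANT_HAVE_SSE2 0
#endif

enum class Prefetch { None, T0, NTA };

// Hypothetical variant generator. Assumes 16-byte aligned pointers and a size
// that is a multiple of 64 bytes.
template <bool UseNtStores, Prefetch Hint>
void copy_variant(void* dst, const void* src, std::size_t size,
                  std::size_t prefetch_distance = 512) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    assert(size % 64 == 0);
#if VARIANT_HAVE_SSE2
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t off = 0; off < size; off += 64) {
        // Only prefetch addresses that are still inside the source buffer.
        if (Hint != Prefetch::None && off + prefetch_distance < size) {
            const char* ahead = reinterpret_cast<const char*>(s + off + prefetch_distance);
            if (Hint == Prefetch::T0)  _mm_prefetch(ahead, _MM_HINT_T0);
            if (Hint == Prefetch::NTA) _mm_prefetch(ahead, _MM_HINT_NTA);
        }
        for (std::size_t j = 0; j < 64; j += 16) {
            __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(s + off + j));
            if (UseNtStores) _mm_stream_si128(reinterpret_cast<__m128i*>(d + off + j), v);
            else             _mm_store_si128(reinterpret_cast<__m128i*>(d + off + j), v);
        }
    }
    if (UseNtStores) _mm_sfence();  // order the NT stores before later stores
#else
    (void)prefetch_distance;
    std::memcpy(dst, src, size);    // portable fallback so the sketch still builds
#endif
}
```

Instantiating `copy_variant<true, Prefetch::NTA>` and `copy_variant<false, Prefetch::None>` side by side keeps everything except the store type and the hint identical, which makes the timing differences easier to attribute.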
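I've also added an SSE4.1 memcpy, since before SSE4.1 I can't find any `_mm_stream_load` SSE functions (I specifically used `_mm_stream_load_si128`), so `sse_memcpy` and `sse2_memcpy` can't be fully NT. Likewise, the `avx_memcpy` function uses AVX2 functions for its streaming loads.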
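I chose not to test pure-store and pure-load access patterns, because I'm not sure a pure store would be meaningful: without a load into the registers being stored, the data would be meaningless and unverifiable.
The interesting result of the new test is that on the dual-socket Xeon Skylake setup, and only on that setup, the regular store functions were actually significantly faster than the NT streaming functions for 16MB memory copies. Also only on that setup (and only with LLC prefetch enabled in the BIOS), `prefetchnta` outperforms both `prefetcht0` and no prefetch in some tests (SSE, SSE4.1).
The raw results of this new test are too long to add to the post, so they are posted in the same git repository as the source code, under results-2018-05-15.
I still don't understand why, for streaming NT stores, the remote NUMA node is faster under the Skylake SMP setup, although using regular stores on the local NUMA node is still faster than that.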
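... `prefetchnta` and NT stores! That's a major fact that you left out! See "Enhanced REP MOVSB for memcpy" for more discussion of ERMSB `rep movsb` vs. NT vector stores vs. regular vector stores. Experimenting with that would be more useful than MMX vs. SSE. Probably just use AVX and/or AVX512 and compare NT vs. regular stores, and/or leave out the SW prefetch.
`prefetchnta` bypasses L3 as well as L2 (because L3 is non-inclusive), so it is more sensitive to prefetch distance (too late, and the data has to come all the way from DRAM again, not just from L3), which makes it more "brittle" (sensitive to tuning the right distance). Your prefetch distances look fairly low, though: under 500 bytes, if I'm reading the asm correctly. @Mysticial's testing on SKX found that `prefetchnta` can be a big slowdown on that uarch, and he doesn't recommend it.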