Unexpected and unexplained slow (and unusual) memory performance with Xeon Skylake SMP



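We've been testing a server using 2x Xeon Gold 6154 CPUs with a Supermicro X11DPH-I motherboard and 96GB RAM, and found some very strange memory performance issues when compared to running with only 1 CPU (one socket empty), to a similar dual-CPU Haswell Xeon E5-2687Wv3 system (used for this series of tests, but other Broadwells perform similarly), and to Broadwell-E i7 and Skylake-X i9 machines (for comparison).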

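It would be expected that the Skylake Xeon processors, with their faster memory, would outperform the Haswell in the various memcpy functions and even in memory allocation (not covered in the tests below, as we found a workaround). Instead, with both CPUs installed, the Skylake Xeons run at almost half the speed of the Haswell Xeons, and slower still compared to an i7-6800k. What's even stranger is that when using Windows VirtualAllocExNuma to assign the NUMA node for memory allocation, plain memory copy functions perform worse on the remote node than on the local node, as expected, but memory copy functions using the SSE, MMX and AVX registers perform much faster on the remote NUMA node than on the local node (what?). As noted above, with the Skylake Xeons this only shows up with both CPUs installed; with one CPU removed, performance is roughly as expected.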

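I'm not sure whether this is a bug in the motherboard or the CPU, something to do with UPI vs QPI, or none of the above, but no combination of BIOS settings seems to help. Disabling NUMA in the BIOS (not included in the test results) does improve the performance of all copy functions using the SSE, MMX and AVX registers, but all the other plain memory copy functions then suffer large losses as well.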

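For our test program, we tested both inline assembly functions and _mm intrinsics. We used Windows 10 with Visual Studio 2017 for everything except the assembly functions: because msvc++ won't compile inline asm for x64, we compiled those with gcc from mingw/msys to an obj file using the -c -O2 flags, and passed that obj file to the msvc++ linker.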

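If the system is using NUMA nodes, we test both operator new and VirtualAllocExNuma (once for each NUMA node) for memory allocation. For each memory copy function we take a cumulative average over copies of 100 16MB memory buffers, and we rotate which memory allocation we are on between each set of tests.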

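All 100 source and 100 destination buffers are 64-byte aligned (for compatibility up to AVX512 using streaming functions) and initialized once: incremental data for the source buffers and 0xff for the destination buffers.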
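For readers who want to reproduce the buffer setup, here is a minimal sketch. It is not taken from the linked repository: the helper names are hypothetical, it uses C++17 aligned `operator new` rather than whatever the test program uses, and it assumes "incremental data" means byte values that count up and wrap at 256.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>

// Assumption: 64-byte alignment, enough for 512-bit (AVX-512) aligned accesses.
constexpr std::size_t kAlignment = 64;
// Assumption: "16MB" is taken to mean 16 MiB.
constexpr std::size_t kBufferSize = 16u * 1024u * 1024u;

// Allocate `size` bytes aligned to kAlignment (hypothetical helper).
inline unsigned char* alloc_aligned(std::size_t size) {
    void* p = ::operator new(size, std::align_val_t(kAlignment));
    assert(reinterpret_cast<std::uintptr_t>(p) % kAlignment == 0);
    return static_cast<unsigned char*>(p);
}

// Release a buffer obtained from alloc_aligned.
inline void free_aligned(unsigned char* p) {
    ::operator delete(p, std::align_val_t(kAlignment));
}

// Source buffers: incrementing byte pattern, wrapping at 256.
inline void fill_source(unsigned char* buf, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i)
        buf[i] = static_cast<unsigned char>(i);
}

// Destination buffers: 0xff everywhere, so an incomplete copy is detectable.
inline void fill_destination(unsigned char* buf, std::size_t size) {
    std::memset(buf, 0xff, size);
}
```

A benchmark would call `alloc_aligned(kBufferSize)` 100 times for the sources and 100 times for the destinations, then fill each buffer once before timing starts.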

The number of copies being averaged varied between machines and configurations, as the copies ran much faster on some and much slower on others.
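The averaging described above can be sketched roughly as follows. This is an illustration only, not the code from the test program: `CopyFn` and `average_copy_us` are hypothetical names, and timing each copy individually with `std::chrono::steady_clock` is an assumption about how the measurement could be done.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical signature shared by every copy routine under test.
using CopyFn = void* (*)(void* dst, const void* src, std::size_t n);

// Average time in microseconds per copy, cycling through all buffer pairs
// `rounds` times so consecutive copies never reuse the same buffers.
inline double average_copy_us(CopyFn copy,
                              const std::vector<unsigned char*>& dst,
                              const std::vector<unsigned char*>& src,
                              std::size_t bytes, int rounds) {
    assert(dst.size() == src.size());
    using clock = std::chrono::steady_clock;
    double total_us = 0.0;
    std::size_t copies = 0;
    for (int r = 0; r < rounds; ++r) {
        for (std::size_t i = 0; i < src.size(); ++i) {
            const auto t0 = clock::now();
            copy(dst[i], src[i], bytes);
            const auto t1 = clock::now();
            total_us +=
                std::chrono::duration<double, std::micro>(t1 - t0).count();
            ++copies;
        }
    }
    return copies ? total_us / static_cast<double>(copies) : 0.0;
}
```

With 100 buffer pairs, `rounds = 70` would correspond to the "Averaging 7000 copies" header in the first table below.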

Results were as follows:

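Haswell Xeon E5-2687Wv3, 1 CPU (1 empty socket), on Supermicro X10DAi with 32GB DDR4-2400 (10c/20t, 25 MB of L3 cache). Keep in mind that the benchmark rotates through 100 pairs of 16MB buffers, so we are probably not getting L3 cache hits.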

---------------------------------------------------------------------------
Averaging 7000 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 2264.48 microseconds
asm_memcpy (asm)                 averaging 2322.71 microseconds
sse_memcpy (intrinsic)           averaging 1569.67 microseconds
sse_memcpy (asm)                 averaging 1589.31 microseconds
sse2_memcpy (intrinsic)          averaging 1561.19 microseconds
sse2_memcpy (asm)                averaging 1664.18 microseconds
mmx_memcpy (asm)                 averaging 2497.73 microseconds
mmx2_memcpy (asm)                averaging 1626.68 microseconds
avx_memcpy (intrinsic)           averaging 1625.12 microseconds
avx_memcpy (asm)                 averaging 1592.58 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 2260.6 microseconds

Haswell dual Xeon E5-2687Wv3, 2 CPUs, on Supermicro X10DAi with 64GB RAM

---------------------------------------------------------------------------
Averaging 6900 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 0(local)
---------------------------------------------------------------------------
std::memcpy                      averaging 3179.8 microseconds
asm_memcpy (asm)                 averaging 3177.15 microseconds
sse_memcpy (intrinsic)           averaging 1633.87 microseconds
sse_memcpy (asm)                 averaging 1663.8 microseconds
sse2_memcpy (intrinsic)          averaging 1620.86 microseconds
sse2_memcpy (asm)                averaging 1727.36 microseconds
mmx_memcpy (asm)                 averaging 2623.07 microseconds
mmx2_memcpy (asm)                averaging 1691.1 microseconds
avx_memcpy (intrinsic)           averaging 1704.33 microseconds
avx_memcpy (asm)                 averaging 1692.69 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 3185.84 microseconds
---------------------------------------------------------------------------
Averaging 6900 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 1
---------------------------------------------------------------------------
std::memcpy                      averaging 3992.46 microseconds
asm_memcpy (asm)                 averaging 4039.11 microseconds
sse_memcpy (intrinsic)           averaging 3174.69 microseconds
sse_memcpy (asm)                 averaging 3129.18 microseconds
sse2_memcpy (intrinsic)          averaging 3161.9 microseconds
sse2_memcpy (asm)                averaging 3141.33 microseconds
mmx_memcpy (asm)                 averaging 4010.17 microseconds
mmx2_memcpy (asm)                averaging 3211.75 microseconds
avx_memcpy (intrinsic)           averaging 3003.14 microseconds
avx_memcpy (asm)                 averaging 2980.97 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 3987.91 microseconds
---------------------------------------------------------------------------
Averaging 6900 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 3172.95 microseconds
asm_memcpy (asm)                 averaging 3173.5 microseconds
sse_memcpy (intrinsic)           averaging 1623.84 microseconds
sse_memcpy (asm)                 averaging 1657.07 microseconds
sse2_memcpy (intrinsic)          averaging 1616.95 microseconds
sse2_memcpy (asm)                averaging 1739.05 microseconds
mmx_memcpy (asm)                 averaging 2623.71 microseconds
mmx2_memcpy (asm)                averaging 1699.33 microseconds
avx_memcpy (intrinsic)           averaging 1710.09 microseconds
avx_memcpy (asm)                 averaging 1688.34 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 3175.14 microseconds

Skylake Xeon Gold 6154, 1 CPU (1 empty socket), on Supermicro X11DPH-I with 48GB DDR4-2666 (18c/36t, 24.75 MB of L3 cache)

---------------------------------------------------------------------------
Averaging 5000 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 1832.42 microseconds
asm_memcpy (asm)                 averaging 1837.62 microseconds
sse_memcpy (intrinsic)           averaging 1647.84 microseconds
sse_memcpy (asm)                 averaging 1710.53 microseconds
sse2_memcpy (intrinsic)          averaging 1645.54 microseconds
sse2_memcpy (asm)                averaging 1794.36 microseconds
mmx_memcpy (asm)                 averaging 2030.51 microseconds
mmx2_memcpy (asm)                averaging 1816.82 microseconds
avx_memcpy (intrinsic)           averaging 1686.49 microseconds
avx_memcpy (asm)                 averaging 1716.15 microseconds
avx512_memcpy (intrinsic)        averaging 1761.6 microseconds
rep movsb (asm)                  averaging 1977.6 microseconds

Skylake Xeon Gold 6154, 2 CPUs, on Supermicro X11DPH-I with 96GB DDR4-2666

---------------------------------------------------------------------------
Averaging 4100 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 0(local)
---------------------------------------------------------------------------
std::memcpy                      averaging 3131.6 microseconds
asm_memcpy (asm)                 averaging 3070.57 microseconds
sse_memcpy (intrinsic)           averaging 3297.72 microseconds
sse_memcpy (asm)                 averaging 3423.38 microseconds
sse2_memcpy (intrinsic)          averaging 3274.31 microseconds
sse2_memcpy (asm)                averaging 3413.48 microseconds
mmx_memcpy (asm)                 averaging 2069.53 microseconds
mmx2_memcpy (asm)                averaging 3694.91 microseconds
avx_memcpy (intrinsic)           averaging 3118.75 microseconds
avx_memcpy (asm)                 averaging 3224.36 microseconds
avx512_memcpy (intrinsic)        averaging 3156.56 microseconds
rep movsb (asm)                  averaging 3155.36 microseconds
---------------------------------------------------------------------------
Averaging 4100 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 1
---------------------------------------------------------------------------
std::memcpy                      averaging 5309.77 microseconds
asm_memcpy (asm)                 averaging 5330.78 microseconds
sse_memcpy (intrinsic)           averaging 2350.61 microseconds
sse_memcpy (asm)                 averaging 2402.57 microseconds
sse2_memcpy (intrinsic)          averaging 2338.61 microseconds
sse2_memcpy (asm)                averaging 2475.51 microseconds
mmx_memcpy (asm)                 averaging 2883.97 microseconds
mmx2_memcpy (asm)                averaging 2517.69 microseconds
avx_memcpy (intrinsic)           averaging 2356.07 microseconds
avx_memcpy (asm)                 averaging 2415.22 microseconds
avx512_memcpy (intrinsic)        averaging 2487.01 microseconds
rep movsb (asm)                  averaging 5372.98 microseconds
---------------------------------------------------------------------------
Averaging 4100 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 3075.1 microseconds
asm_memcpy (asm)                 averaging 3061.97 microseconds
sse_memcpy (intrinsic)           averaging 3281.17 microseconds
sse_memcpy (asm)                 averaging 3421.38 microseconds
sse2_memcpy (intrinsic)          averaging 3268.79 microseconds
sse2_memcpy (asm)                averaging 3435.76 microseconds
mmx_memcpy (asm)                 averaging 2061.27 microseconds
mmx2_memcpy (asm)                averaging 3694.48 microseconds
avx_memcpy (intrinsic)           averaging 3111.16 microseconds
avx_memcpy (asm)                 averaging 3227.45 microseconds
avx512_memcpy (intrinsic)        averaging 3148.65 microseconds
rep movsb (asm)                  averaging 2967.45 microseconds

Skylake-X i9-7940X on ASUS ROG Rampage VI Extreme with 32GB DDR4-4266 (14c/28t, 19.25 MB of L3 cache) (overclocked to 3.8GHz/4.4GHz Turbo, DDR at 4040MHz, target AVX frequency 3737MHz, target AVX-512 frequency 3535MHz, target cache frequency 2424MHz)

---------------------------------------------------------------------------
Averaging 6500 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 1750.87 microseconds
asm_memcpy (asm)                 averaging 1748.22 microseconds
sse_memcpy (intrinsic)           averaging 1743.39 microseconds
sse_memcpy (asm)                 averaging 3120.18 microseconds
sse2_memcpy (intrinsic)          averaging 1743.37 microseconds
sse2_memcpy (asm)                averaging 2868.52 microseconds
mmx_memcpy (asm)                 averaging 2255.17 microseconds
mmx2_memcpy (asm)                averaging 3434.58 microseconds
avx_memcpy (intrinsic)           averaging 1698.49 microseconds
avx_memcpy (asm)                 averaging 2840.65 microseconds
avx512_memcpy (intrinsic)        averaging 1670.05 microseconds
rep movsb (asm)                  averaging 1718.77 microseconds

Broadwell i7-6800k on ASUS X99 with 24GB DDR4-2400 (6c/12t, 15 MB of L3 cache)

---------------------------------------------------------------------------
Averaging 64900 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 2522.1 microseconds
asm_memcpy (asm)                 averaging 2615.92 microseconds
sse_memcpy (intrinsic)           averaging 1621.81 microseconds
sse_memcpy (asm)                 averaging 1669.39 microseconds
sse2_memcpy (intrinsic)          averaging 1617.04 microseconds
sse2_memcpy (asm)                averaging 1719.06 microseconds
mmx_memcpy (asm)                 averaging 3021.02 microseconds
mmx2_memcpy (asm)                averaging 1691.68 microseconds
avx_memcpy (intrinsic)           averaging 1654.41 microseconds
avx_memcpy (asm)                 averaging 1666.84 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 2520.13 microseconds
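
The tables above report time per copy. To compare them in terms of throughput, the conversion is simple arithmetic, sketched below. The function name is hypothetical, "16MB" is assumed to mean 16 MiB (16,777,216 bytes), and the result counts only bytes copied; the actual memory traffic is at least twice that, since every byte is both read and written.

```cpp
#include <cassert>
#include <cstddef>

// Copy throughput in GB/s (1 GB = 1e9 bytes) for `bytes` copied in
// `microseconds`. Bytes per microsecond equals MB/s divided by 1e6 * 1e6,
// i.e. bytes/us is numerically MB/s; dividing by 1000 gives GB/s.
inline double copy_gb_per_s(std::size_t bytes, double microseconds) {
    assert(microseconds > 0.0);
    return static_cast<double>(bytes) / microseconds / 1000.0;
}
```

Under those assumptions, the 1569.67 microseconds of sse_memcpy (intrinsic) on the single-CPU Haswell works out to roughly 10.7 GB/s, while the 3297.72 microseconds of the same function on NUMA node 0 of the dual Skylake works out to roughly 5.1 GB/s.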

The assembly functions are derived from fast_memcpy in xine-libs, and are mostly there to compare against msvc++'s optimizer.

Source code for the test is available at https://github.com/marcmicalizzi/memcpy_test (it's too long to include in the post)

Has anyone else run into this, or does anyone have any insight into why this might be happening?


Update 2018-05-15 13:40EST

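So, as suggested by Peter Cordes, I've updated the test to compare prefetching against no prefetching and NT stores against regular stores, and I've tuned the prefetching done in each function. (I don't have any meaningful experience writing prefetching code, so if I'm making any mistakes here, please let me know and I'll adjust the tests accordingly. The prefetching does have an impact, so at the very least it's doing something.) These changes are reflected in the latest revision at the GitHub link above, for anyone looking for the source code.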

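I've also added an SSE4.1 memcpy, since prior to SSE4.1 I can't find any _mm_stream_load SSE functions (I specifically used _mm_stream_load_si128), so sse_memcpy and sse2_memcpy can't be entirely non-temporal; they use NT stores but regular loads. Likewise, the avx_memcpy function uses AVX2 functions for stream loading.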
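To make the regular-store vs NT-store distinction concrete, here is a minimal SSE2 sketch. It is not the code from the repository: the function names are hypothetical, it requires an x86 target, and it assumes 16-byte-aligned buffers whose size is a multiple of 16. Both variants use regular loads, since SSE2 has no streaming load.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Copy with regular (cached) 128-bit stores.
inline void copy_sse2_regular(void* dst, const void* src, std::size_t bytes) {
    assert(bytes % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    __m128i* d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0; i < bytes / 16; ++i)
        _mm_store_si128(d + i, _mm_load_si128(s + i));
}

// Copy with non-temporal (streaming) 128-bit stores, which write around
// the cache hierarchy instead of allocating lines in it.
inline void copy_sse2_nt(void* dst, const void* src, std::size_t bytes) {
    assert(bytes % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    __m128i* d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0; i < bytes / 16; ++i)
        _mm_stream_si128(d + i, _mm_load_si128(s + i));
    _mm_sfence();  // order the weakly-ordered NT stores before later stores
}
```

The two functions produce identical destination contents; only the path the stores take through the memory hierarchy differs, which is the variable the updated test is trying to isolate.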

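I opted not to test pure-store and pure-load access patterns yet, as I'm not sure a pure-store test would be meaningful: without a load into the registers being stored, the data would be meaningless and unverifiable.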

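The interesting result of the new test is that on the dual-socket Xeon Skylake setup, and only on that setup, the regular store functions were actually significantly faster than the NT streaming functions for 16MB memory copies. Also only on that setup (and only with LLC prefetch enabled in the BIOS), prefetchnta outperforms both prefetcht0 and no prefetch in some tests (SSE, SSE4.1).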
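For reference, the three prefetch variants being compared can be sketched with a single templated loop. This is an illustration under stated assumptions rather than the test program's code: the names are hypothetical, the buffers are assumed to be 16-byte aligned with a size that is a multiple of 64, the copy uses regular SSE2 loads and stores, and the prefetch distance is left as a parameter because the right value has to be tuned per machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics; also provides _mm_prefetch

enum class Prefetch { None, T0, NTA };  // hypothetical selector

// Copy 64 bytes per iteration, optionally prefetching the source line
// `distance` bytes ahead of the current offset.
template <Prefetch P>
void copy_sse2_prefetch(void* dst, const void* src, std::size_t bytes,
                        std::size_t distance) {
    assert(bytes % 64 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    const char* s = static_cast<const char*>(src);
    char* d = static_cast<char*>(dst);
    for (std::size_t off = 0; off < bytes; off += 64) {
        // Only prefetch addresses that are still inside the source buffer.
        if (off + distance < bytes) {
            if constexpr (P == Prefetch::T0)
                _mm_prefetch(s + off + distance, _MM_HINT_T0);
            else if constexpr (P == Prefetch::NTA)
                _mm_prefetch(s + off + distance, _MM_HINT_NTA);
        }
        for (std::size_t k = 0; k < 64; k += 16) {
            const __m128i v = _mm_load_si128(
                reinterpret_cast<const __m128i*>(s + off + k));
            _mm_store_si128(reinterpret_cast<__m128i*>(d + off + k), v);
        }
    }
}
```

Calling `copy_sse2_prefetch<Prefetch::NTA>(dst, src, size, distance)` with several values of `distance` is one way to check how sensitive a given machine is to the prefetch distance.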

The raw results of this new test are too long to add to the post, so they are posted in the same git repository as the source code, under results-2018-05-15

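I still don't understand why, for streaming NT stores, the remote NUMA node is faster on the Skylake SMP setup, although using regular stores on the local NUMA node is still faster than that.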


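Haven't had a chance to digest your data yet, but see also "Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?" (comparing a quad-core Skylake against a many-core Broadwell, and seeing the downside of higher memory / L3 latency in many-core systems, where single-core bandwidth is limited by the maximum memory concurrency in one core, not by the DRAM controllers). SKX has high latency / low bandwidth per core to L3 / memory in general, according to Mysticial's testing and other results. You're probably seeing that.
Peter Cordes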

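Are any of your copies using NT stores? I just checked, and all of your copies except MMX are using prefetchnta and NT stores! That's a hugely important fact you left out of your question! See "Enhanced REP MOVSB for memcpy" for more discussion of ERMSB rep movsb vs. NT vector stores vs. regular vector stores. Experimenting with that would be more useful than MMX vs. SSE. Probably just use AVX and/or AVX512 and try NT vs. regular stores, and/or leaving out the SW prefetch.
Peter Cordes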

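Did you tune the prefetch distance for your SKX machines? On SKX, prefetchnta bypasses L3 as well as L2 (because L3 is non-inclusive), so it is more sensitive to prefetch distance (too late, and the data has to come all the way from DRAM again, not just from L3), which makes it more "brittle" (sensitive to tuning the right distance). Your prefetch distances look fairly low, though: under 500 bytes, if I'm reading the asm correctly. @Mysticial's testing on SKX found that prefetchnta can be a major slowdown on that uarch, and he doesn't recommend it.
Peter Cordes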

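You definitely have some interesting results here, but we need to untangle them from various effects. Having numbers both with and without NT stores may tell us something useful about NUMA behaviour. Populating the 2nd socket forces even local L3 misses to snoop the remote CPU, at least on Broadwell/Haswell. Dual-socket E5 Xeons don't have a snoop filter. I think Gold Xeons do have snoop filters, because they're capable of operating in more than dual-socket systems. But I'm not sure how big it is, or what that really means :P I haven't done memory performance tuning on multi-socket systems.
Peter Cordes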

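SKX has a fundamentally different interconnect: a mesh instead of a ring. It's an interesting result, but not unbelievable, and it may not be a sign of a misconfiguration. IDK, hopefully someone else with more experience with the hardware can shed more light.
Peter Cordes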

Answers:



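Is your memory the incorrect rank? Perhaps your board does something odd with memory ranking when you add the second CPU? I know that Quad CPU machines do all kinds of weird things to make the memory work properly, and if you have incorrectly ranked memory it will sometimes work but clock back to something like 1/4 or 1/2 of the speed. Perhaps SuperMicro did something in that board to turn the DDR4 and Dual CPU into Quad Channel, and it's using similar math. Incorrect rank == 1/2 speed.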


That doesn't appear to be the case: all memory is 1R8 and matches the rank from the supermicro qvl for the motherboard. Was worth a check, though!
Marc Micalizzi

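I know it's an entirely different system, but this is what I was referring to as well: qrl.dell.com/Files/en-us/Html/Manuals/R920/… You'll note that the rank requirements change as the number of sticks/CPUs increases.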
thelanranger
Licensed under cc by-sa 3.0 with attribution required.