C ++：将一个操作数保留在寄存器中的速度大大提高

Question 1

我一直在尝试通过定时使用以下代码对数组元素进行缩放和求和的例程来了解在L1缓存中存储数组对内存的影响（我知道我应该将结果按'最后是a'；关键是要在循环内进行乘法和加法-到目前为止，编译器尚未弄清楚要分解出'a'）：

double sum(double a,double* X,int size)
{
    double total = 0.0;
    for(int i = 0;  i < size; ++i)
    {
        total += a*X[i];
    }
    return total;
}

#define KB 1024
int main()
{
    //Approximately half the L1 cache size of my machine
    int operand_size = (32*KB)/(sizeof(double)*2);
    printf("Operand size: %d\n", operand_size);
    double* X = new double[operand_size];
    fill(X,operand_size);

    double seconds = timer();
    double result;
    int n_iterations = 100000;
    for(int i = 0; i < n_iterations; ++i)
    {
        result = sum(3.5,X,operand_size);
        //result += rand();  
    }
    seconds = timer() - seconds; 

    double mflops = 2e-6*double(n_iterations*operand_size)/seconds;
    printf("Vector size %d: mflops=%.1f, result=%.1f\n",operand_size,mflops,result);
    return 0;
}

注意，为简洁起见，不包含timer（）和fill（）例程。如果要运行代码，可以在这里找到其完整源代码：

http://codepad.org/agPWItZS

现在，这里变得有趣了。这是输出：

Operand size: 2048
Vector size 2048: mflops=588.8, result=-67.8

尽管X的所有元素都应在循环迭代之间保留在缓存中，但这完全是未缓存的性能。查看由以下程序生成的汇编代码：

g++ -O3 -S -fno-asynchronous-unwind-tables register_opt_example.cpp

我注意到求和函数循环中有一个奇怪的地方：

L55:
    movsd   (%r12,%rax,8), %xmm0
    mulsd   %xmm1, %xmm0
    addsd   -72(%rbp), %xmm0
    movsd   %xmm0, -72(%rbp)
    incq    %rax
    cmpq    $2048, %rax
    jne L55

说明：

    addsd   -72(%rbp), %xmm0
    movsd   %xmm0, -72(%rbp)

表示它正在堆栈中的sum（）中存储“ total”的值，并在每次循环迭代时对其进行读写。我修改了程序集，以便将此操作数保存在一个寄存器中：

...
addsd   %xmm0, %xmm3
...

这个小小的变化可以极大地提高性能：

Operand size: 2048
Vector size 2048: mflops=1958.9, result=-67.8

tl; dr 我的问题是：为什么要用寄存器替换单个内存位置访问，从而使代码加速这么快，因为单个位置应该存储在L1缓存中？哪些架构因素使之成为可能？重复写入一个堆栈位置会完全破坏缓存的有效性，这似乎很奇怪。

附录

我的gcc版本是：

Target: i686-apple-darwin10
Configured with: /var/tmp/gcc/gcc-5646.1~2/src/configure --disable-checking --enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --with-gxx-include-dir=/include/c++/4.2.1 --program-prefix=i686-apple-darwin10- --host=x86_64-apple-darwin10 --target=i686-apple-darwin10
Thread model: posix
gcc version 4.2.1 (Apple Inc. build 5646) (dot 1)

我的CPU是：

英特尔至强X5650

Question 2

这可能是更长的依赖项链与负载错误预测*的组合。

较长的依赖链：

首先，我们确定关键的依赖路径。然后，我们看一下以下提供的指令等待时间：http : //www.agner.org/optimize/instruction_tables.pdf（第117页）

在未优化的版本中，关键的依赖路径为：

addsd -72(%rbp), %xmm0
movsd %xmm0, -72(%rbp)

在内部，它可能分解为：

负载（2个周期）
已添加（3个周期）
储存（3个周期）

如果我们看一下优化版本，那就是：

已添加（3个周期）

因此，您有8个周期与3个周期。几乎是三分之一。

我不确定Nehalem处理器产品线对存储负载依赖项的敏感程度以及转发性能如何。但是有理由相信它不为零。

负载存储错误：

现代处理器以您可以想象的更多方式使用预测。其中最著名的可能是分支预测。鲜为人知的一种是负载预测。

当处理器看到负载时，它将立即加载所有未完成的写入操作。将假定这些写操作不会与加载的值冲突。

如果发现较早的写入与负载冲突，则必须重新执行该负载，并且计算将回滚到负载点。（与分支错误预测回滚的方式几乎相同）

这里的相关性：

不用说，现代处理器将能够同时执行此循环的多个迭代。因此，处理器将尝试执行加载（addsd -72(%rbp), %xmm0)在完成movsd %xmm0, -72(%rbp)之前迭代中的store（）之前）。

结果？先前的存储与负载冲突-因此发生了错误的预测并回滚。

_{*请注意，我不确定名称“ Load Prediction”。我只是在Intel文档中读到了它，他们似乎并没有给它起个名字。}

Question 3

我想问题不在于缓存/内存访问，而在于处理器（代码的执行）。这里有几个明显的瓶颈。

这里的性能数字是基于我使用的盒子（沙桥或韦斯特米尔）

由于处理器可以同时进行加法和乘法运算，因此标量数学的最高性能为2.7Ghz x2 FLOPS / Clock x2。代码的理论效率为0.6 /（2.7 * 2）= 11％

所需带宽：每（+）和（x）2倍-> 4bytes / Flop 4字节* 5.4GFLOPS = 21.6GB / s

如果您知道它最近被读取过，可能是L1（89GB / s），L2（42GB / s）或L3（24GB / s），所以我们可以排除缓存B / W

内存暂存系统为18.9 GB / s，因此即使在主内存中，峰值性能也应接近18.9 / 21.6GB / s = 87.5％

可能希望尽早整理请求（通过展开）

即使进行推测性执行，tot + = a * X [i]也将被序列化，因为必须先评估tot（n）才能启动tot（n + 1）

第一个展开循环
将i移动8并执行

{//your func
    for( int i = 0; i < size; i += 8 ){
        tot += a * X[i];
        tot += a * X[i+1];
        ...
        tot += a * X[i+7];
    }
    return tot
}

使用多个累加器
这将打破依赖关系，并让我们避免加法管道停顿

{//your func//
    int tot,tot2,tot3,tot4;
    tot = tot2 = tot3 = tot4 = 0
    for( int i = 0; i < size; i += 8 ) 
        tot  += a * X[i];
        tot2 += a * X[i+1];
        tot3 += a * X[i+2];
        tot4 += a * X[i+3];
        tot  += a * X[i+4];
        tot2 += a * X[i+5];
        tot3 += a * X[i+6];
        tot4 += a * X[i+7];
    }
    return tot + tot2 + tot3 + tot4;
}

更新在SandyBridge框上运行此文件后，我可以访问：（2.7GHZ SandyBridge，带有-O2 -march = native -mtune = native

原始代码：

Operand size: 2048  
Vector size 2048: mflops=2206.2, result=61.8  
2.206 / 5.4 = 40.8%

改进的代码：

Operand size: 2048  
Vector size 2048: mflops=5313.7, result=61.8  
5.3137 / 5.4 = 98.4%

Question 4

我实际上无法重现此内容，因为我的编译器（gcc 4.7.2）保留total在寄存器中。

我怀疑速度缓慢的主要原因与L1缓存无关，而是由于存储在

movsd   %xmm0, -72(%rbp)

以及后续迭代的负载：

addsd   -72(%rbp), %xmm0