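I have two tab-separated columns of integers: the first column contains random integers, and the second contains integers identifying a group. The data can be generated by this program (generate_groups.cc):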

#include <cstdlib>
#include <iostream>
#include <ctime>

int main(int argc, char* argv[]) {
  int num_values = atoi(argv[1]);
  int num_groups = atoi(argv[2]);

  int group_size = num_values / num_groups;
  int group = -1;

  std::srand(42);

  for (int i = 0; i < num_values; ++i) {
    if (i % group_size == 0) {
      ++group;
    }
    std::cout << std::rand() << '\t' << group << '\n';
  }

  return 0;
}
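The generator above assumes that both arguments are present and that num_values is at least num_groups; otherwise group_size becomes 0 and `i % group_size` is undefined. Below is a minimal sketch of a guarded variant. The helper name `make_rows`, and returning the rows in a vector instead of printing them, are assumptions made for illustration only.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical helper: returns (value, group) rows instead of printing them.
std::vector<std::pair<int, int>> make_rows(int num_values, int num_groups) {
  if (num_groups <= 0 || num_values < num_groups) {
    throw std::invalid_argument("need 0 < num_groups <= num_values");
  }
  const int group_size = num_values / num_groups;  // guaranteed >= 1 here
  std::vector<std::pair<int, int>> rows;
  rows.reserve(static_cast<std::size_t>(num_values));
  std::srand(42);  // same fixed seed as the original program
  int group = -1;
  for (int i = 0; i < num_values; ++i) {
    if (i % group_size == 0) {
      ++group;  // start a new group every group_size rows
    }
    rows.emplace_back(std::rand(), group);
  }
  return rows;
}
```

As in the original, the rows come out already ordered by group, which is the "sorted" case of the benchmark.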

I then use a second program (sum_groups.cc) to compute the sum of each group.

#include <iostream>
#include <chrono>
#include <vector>

// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
  for (size_t i = 0; i < n; ++i) {
    p_out[p_g[i]] += p_x[i];
  }
}

int main() {
  std::vector<int> values;
  std::vector<int> groups;
  std::vector<int> sums;

  int n_groups = 0;

  // Read in the values and calculate the max number of groups
  while(std::cin) {
    int value, group;
    std::cin >> value >> group;
    values.push_back(value);
    groups.push_back(group);
    if (group > n_groups) {
      n_groups = group;
    }
  }
  sums.resize(n_groups);

  // Time grouped sums
  std::chrono::system_clock::time_point start = std::chrono::system_clock::now();
  for (int i = 0; i < 10; ++i) {
    grouped_sum(values.data(), groups.data(), values.size(), sums.data());
  }
  std::chrono::system_clock::time_point end = std::chrono::system_clock::now();

  std::cout << (end - start).count() << std::endl;

  return 0;
}

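If I run these programs on a dataset of a given size, and then shuffle the order of the rows of that same dataset, the shuffled data computes the sums about 2x faster than the ordered data.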

g++ -O3 generate_groups.cc -o generate_groups
g++ -O3 sum_groups.cc -o sum_groups
generate_groups 1000000 100 > groups
shuf groups > groups2
sum_groups < groups
sum_groups < groups2
sum_groups < groups2
sum_groups < groups
20784
8854
8220
21006

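I would have expected the original data, which is sorted by group, to have better data locality and be faster, but I observe the opposite behavior. Can anyone hypothesize a reason?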

c++ performance

— Jim

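I don't know, but you are writing to out-of-range elements of the sums vector. If you did the normal thing and passed references to the vectors instead of pointers to their data, and then used .at() or a debug-mode operator[] that does bounds checking, you would see it.

— Shawn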
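A minimal sketch of what this comment suggests, assuming the same three vectors as in sum_groups.cc; the name `grouped_sum_checked` is made up for illustration. With `.at()`, a group id beyond the end of the sums vector throws `std::out_of_range` instead of silently corrupting memory.

```cpp
#include <cstddef>
#include <stdexcept>  // std::out_of_range is what .at() throws
#include <vector>

// Same loop as grouped_sum(), but taking the vectors by reference so that
// .at() can range-check both the group lookup and the write into out.
void grouped_sum_checked(const std::vector<int>& x,
                         const std::vector<int>& g,
                         std::vector<int>& out) {
  for (std::size_t i = 0; i < x.size(); ++i) {
    out.at(g.at(i)) += x[i];
  }
}
```

This is meant as a debugging aid; the range checks cost time, so it is probably not the version to benchmark.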

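Have you verified that the "groups2" file contains all of your data, and that all of it is being read and processed? Could there be an EOF character somewhere in the middle?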

— 1201ProgramAlarm

This program has undefined behavior because you never resize sums. Instead of sums.reserve(n_groups); you must call sums.resize(n_groups); - that is what @Shawn was hinting at.

— Eugene

Note (see e.g. here or here) that a vector of pairs, instead of two vectors (values and groups), behaves as expected.

— Bob__
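One possible shape of the layout this comment describes, as a sketch only; the linked code is not reproduced here, and the choice of `first` for the value and `second` for the group id is an assumption.

```cpp
#include <utility>
#include <vector>

// Each row keeps its value and its group id next to each other in memory.
// Assumed layout: first = value, second = group id.
void grouped_sum_pairs(const std::vector<std::pair<int, int>>& rows,
                       std::vector<int>& out) {
  for (const auto& row : rows) {
    out[row.second] += row.first;
  }
}
```

The arithmetic is identical to grouped_sum(); only the input layout changes, from two parallel arrays to one array of pairs.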

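You sorted the data, right? But that also sorts the groups, and that has an impact on the expression p_out[p_g[i]] += p_x[i];. Maybe in the original scrambled order, the groups actually exhibit good clustering with regard to access to the p_out array. Sorting the values might cause a poor group-indexed access pattern into p_out.

— Kaz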

Setup / make it slow

First of all, the program runs in about the same time regardless of the input order:

sumspeed$ time ./sum_groups < groups_shuffled 
11558358

real    0m0.705s
user    0m0.692s
sys 0m0.013s

sumspeed$ time ./sum_groups < groups_sorted
24986825

real    0m0.722s
user    0m0.711s
sys 0m0.012s

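Most of the time is spent in the input loop. But since we are interested in grouped_sum(), let's ignore that.

Changing the benchmark loop from 10 to 1000 iterations, grouped_sum() starts dominating the run time: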

sumspeed$ time ./sum_groups < groups_shuffled 
1131838420

real    0m1.828s
user    0m1.811s
sys 0m0.016s

sumspeed$ time ./sum_groups < groups_sorted
2494032110

real    0m3.189s
user    0m3.169s
sys 0m0.016s

perf diff

Now we can use perf to find the hottest spots in our program.

sumspeed$ perf record ./sum_groups < groups_shuffled
1166805982
[ perf record: Woken up 1 times to write data ]
[kernel.kallsyms] with build id 3a2171019937a2070663f3b6419330223bd64e96 not found, continuing without symbols
Warning:
Processed 4636 samples and lost 6.95% samples!

[ perf record: Captured and wrote 0.176 MB perf.data (4314 samples) ]

sumspeed$ perf record ./sum_groups < groups_sorted
2571547832
[ perf record: Woken up 2 times to write data ]
[kernel.kallsyms] with build id 3a2171019937a2070663f3b6419330223bd64e96 not found, continuing without symbols
[ perf record: Captured and wrote 0.420 MB perf.data (10775 samples) ]

And the difference between them:

sumspeed$ perf diff
[...]
# Event 'cycles:uppp'
#
# Baseline  Delta Abs  Shared Object        Symbol                                                                  
# ........  .........  ...................  ........................................................................
#
    57.99%    +26.33%  sum_groups           [.] main
    12.10%     -7.41%  libc-2.23.so         [.] _IO_getc
     9.82%     -6.40%  libstdc++.so.6.0.21  [.] std::num_get<char, std::istreambuf_iterator<char, std::char_traits<c
     6.45%     -4.00%  libc-2.23.so         [.] _IO_ungetc
     2.40%     -1.32%  libc-2.23.so         [.] _IO_sputbackc
     1.65%     -1.21%  libstdc++.so.6.0.21  [.] 0x00000000000dc4a4
     1.57%     -1.20%  libc-2.23.so         [.] _IO_fflush
     1.71%     -1.07%  libstdc++.so.6.0.21  [.] std::istream::sentry::sentry
     1.22%     -0.77%  libstdc++.so.6.0.21  [.] std::istream::operator>>
     0.79%     -0.47%  libstdc++.so.6.0.21  [.] __gnu_cxx::stdio_sync_filebuf<char, std::char_traits<char> >::uflow
[...]

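More time is spent in main(), which probably has grouped_sum() inlined. Great, thanks a lot.

perf annotate

Is there a difference in where the time is spent inside main()?

Shuffled: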

sumspeed$ perf annotate -i perf.data.old
[...]
       │     // This is the function whose performance I am interested in
       │     void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
       │       for (size_t i = 0; i < n; ++i) {
       │180:   xor    %eax,%eax
       │       test   %rdi,%rdi
       │     ↓ je     1a4
       │       nop
       │         p_out[p_g[i]] += p_x[i];
  6,88 │190:   movslq (%r9,%rax,4),%rdx
 58,54 │       mov    (%r8,%rax,4),%esi
       │     #include <chrono>
       │     #include <vector>
       │
       │     // This is the function whose performance I am interested in
       │     void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
       │       for (size_t i = 0; i < n; ++i) {
  3,86 │       add    $0x1,%rax
       │         p_out[p_g[i]] += p_x[i];
 29,61 │       add    %esi,(%rcx,%rdx,4)
[...]

Sorted:

sumspeed$ perf annotate -i perf.data
[...]
       │     // This is the function whose performance I am interested in
       │     void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
       │       for (size_t i = 0; i < n; ++i) {
       │180:   xor    %eax,%eax
       │       test   %rdi,%rdi
       │     ↓ je     1a4
       │       nop
       │         p_out[p_g[i]] += p_x[i];
  1,00 │190:   movslq (%r9,%rax,4),%rdx
 55,12 │       mov    (%r8,%rax,4),%esi
       │     #include <chrono>
       │     #include <vector>
       │
       │     // This is the function whose performance I am interested in
       │     void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
       │       for (size_t i = 0; i < n; ++i) {
  0,07 │       add    $0x1,%rax
       │         p_out[p_g[i]] += p_x[i];
 43,28 │       add    %esi,(%rcx,%rdx,4)
[...]

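No, it is the same two instructions dominating. So they take a long time in both cases, but are even worse when the data is sorted.

perf stat

Okay. But we should be running them the same number of times, so each instruction must be getting slower for some reason. Let's see what perf stat says.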

sumspeed$ perf stat ./sum_groups < groups_shuffled 
1138880176

 Performance counter stats for './sum_groups':

       1826,232278      task-clock (msec)         #    0,999 CPUs utilized          
                72      context-switches          #    0,039 K/sec                  
                 1      cpu-migrations            #    0,001 K/sec                  
             4 076      page-faults               #    0,002 M/sec                  
     5 403 949 695      cycles                    #    2,959 GHz                    
       930 473 671      stalled-cycles-frontend   #   17,22% frontend cycles idle   
     9 827 685 690      instructions              #    1,82  insn per cycle         
                                                  #    0,09  stalled cycles per insn
     2 086 725 079      branches                  # 1142,639 M/sec                  
         2 069 655      branch-misses             #    0,10% of all branches        

       1,828334373 seconds time elapsed

sumspeed$ perf stat ./sum_groups < groups_sorted
2496546045

 Performance counter stats for './sum_groups':

       3186,100661      task-clock (msec)         #    1,000 CPUs utilized          
                 5      context-switches          #    0,002 K/sec                  
                 0      cpu-migrations            #    0,000 K/sec                  
             4 079      page-faults               #    0,001 M/sec                  
     9 424 565 623      cycles                    #    2,958 GHz                    
     4 955 937 177      stalled-cycles-frontend   #   52,59% frontend cycles idle   
     9 829 009 511      instructions              #    1,04  insn per cycle         
                                                  #    0,50  stalled cycles per insn
     2 086 942 109      branches                  #  655,014 M/sec                  
         2 078 204      branch-misses             #    0,10% of all branches        

       3,186768174 seconds time elapsed

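Only one thing stands out: stalled-cycles-frontend.

Okay, the instruction pipeline is stalling. In the frontend. Exactly what that means probably varies between microarchitectures.

I have a guess, though. If you're generous, you might even call it a hypothesis.

Hypothesis

By sorting the input, you are increasing the locality of the writes. In fact, they will be very local; almost all additions you perform will write to the same location as the previous one.

That is great for the cache, but not great for the pipeline. You are introducing data dependencies, which prevent the next addition instruction from proceeding until the previous addition has completed (or has otherwise made its result available to succeeding instructions).

That's your problem.

I think.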
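To put a number on "almost all additions write to the same location as the previous one", here is a small sketch that counts such back-to-back writes for a given group column. The function name `count_repeated_targets` is made up for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Counts the iterations whose destination slot in p_out is the same as the
// destination of the iteration immediately before it.
std::size_t count_repeated_targets(const std::vector<int>& groups) {
  std::size_t repeats = 0;
  for (std::size_t i = 1; i < groups.size(); ++i) {
    if (groups[i] == groups[i - 1]) {
      ++repeats;
    }
  }
  return repeats;
}
```

For the sorted file from the question (1,000,000 rows in 100 equal groups) this gives 999,900, since only the first row of each group escapes the dependency. For a uniformly shuffled file, roughly 1% of adjacent rows would be expected to share a group.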

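Fixing it

Multiple sum vectors

Actually, let's try something. What if we used multiple sum vectors, switching between them for each addition, and then summed those up at the end? It costs us a little locality, but should remove the data dependencies.

(The code is not pretty; don't judge me, internet!)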

#include <iostream>
#include <chrono>
#include <vector>

#ifndef NSUMS
#define NSUMS (4) // must be power of 2 (for masking to work)
#endif

// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int** p_out) {
  for (size_t i = 0; i < n; ++i) {
    p_out[i & (NSUMS-1)][p_g[i]] += p_x[i];
  }
}

int main() {
  std::vector<int> values;
  std::vector<int> groups;
  std::vector<int> sums[NSUMS];

  int n_groups = 0;

  // Read in the values and calculate the max number of groups
  while(std::cin) {
    int value, group;
    std::cin >> value >> group;
    values.push_back(value);
    groups.push_back(group);
    if (group >= n_groups) {
      n_groups = group+1;
    }
  }
  for (int i=0; i<NSUMS; ++i) {
    sums[i].resize(n_groups);
  }

  // Time grouped sums
  std::chrono::system_clock::time_point start = std::chrono::system_clock::now();
  int* sumdata[NSUMS];
  for (int i = 0; i < NSUMS; ++i) {
    sumdata[i] = sums[i].data();
  }
  for (int i = 0; i < 1000; ++i) {
    grouped_sum(values.data(), groups.data(), values.size(), sumdata);
  }
  for (int i = 1; i < NSUMS; ++i) {
    for (int j = 0; j < n_groups; ++j) {
      sumdata[0][j] += sumdata[i][j];
    }
  }
  std::chrono::system_clock::time_point end = std::chrono::system_clock::now();

  std::cout << (end - start).count() << " with NSUMS=" << NSUMS << std::endl;

  return 0;
}

(Oh, and I also fixed the n_groups calculation; it was off by one.)
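The same idea can also be sketched with the number of sum vectors chosen at run time and all partial sums stored in one flat buffer. This is an illustrative variant, not the code that was benchmarked: the name `grouped_sum_multi` and the flat layout are assumptions, and the run-time `%` is likely slower than the compile-time bit mask used above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical variant: nsums rows of partial sums stored back to back in
// one buffer; iteration i writes into row (i % nsums).
std::vector<int> grouped_sum_multi(const std::vector<int>& x,
                                   const std::vector<int>& g,
                                   std::size_t n_groups,
                                   std::size_t nsums) {
  assert(nsums > 0 && x.size() == g.size());
  std::vector<int> partial(nsums * n_groups, 0);
  for (std::size_t i = 0; i < x.size(); ++i) {
    // Consecutive iterations hit different rows, so they do not depend on
    // each other even when g[i] repeats.
    partial[(i % nsums) * n_groups + static_cast<std::size_t>(g[i])] += x[i];
  }
  // Start from row 0 and fold rows 1..nsums-1 into it.
  std::vector<int> out(partial.begin(),
                       partial.begin() + static_cast<std::ptrdiff_t>(n_groups));
  for (std::size_t k = 1; k < nsums; ++k) {
    for (std::size_t j = 0; j < n_groups; ++j) {
      out[j] += partial[k * n_groups + j];
    }
  }
  return out;
}
```

Whatever nsums is, the returned totals should match the single-vector version; only the order of the additions changes.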

Results

After configuring my makefile to pass a -DNSUMS=... argument to the compiler, I could do this:

sumspeed$ for n in 1 2 4 8 128; do make -s clean && make -s NSUMS=$n && (perf stat ./sum_groups < groups_shuffled && perf stat ./sum_groups < groups_sorted)  2>&1 | egrep '^[0-9]|frontend'; done
1134557008 with NSUMS=1
       924 611 882      stalled-cycles-frontend   #   17,13% frontend cycles idle   
2513696351 with NSUMS=1
     4 998 203 130      stalled-cycles-frontend   #   52,79% frontend cycles idle   
1116188582 with NSUMS=2
       899 339 154      stalled-cycles-frontend   #   16,83% frontend cycles idle   
1365673326 with NSUMS=2
     1 845 914 269      stalled-cycles-frontend   #   29,97% frontend cycles idle   
1127172852 with NSUMS=4
       902 964 410      stalled-cycles-frontend   #   16,79% frontend cycles idle   
1171849032 with NSUMS=4
     1 007 807 580      stalled-cycles-frontend   #   18,29% frontend cycles idle   
1118732934 with NSUMS=8
       881 371 176      stalled-cycles-frontend   #   16,46% frontend cycles idle   
1129842892 with NSUMS=8
       905 473 182      stalled-cycles-frontend   #   16,80% frontend cycles idle   
1497803734 with NSUMS=128
     1 982 652 954      stalled-cycles-frontend   #   30,63% frontend cycles idle   
1180742299 with NSUMS=128
     1 075 507 514      stalled-cycles-frontend   #   19,39% frontend cycles idle

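The optimal number of sum vectors will probably depend on your CPU's pipeline depth. My 7-year-old ultrabook CPU can probably max out the pipeline with fewer vectors than a fancy new desktop CPU would need.

Clearly, more is not necessarily better. When I went crazy with 128 sum vectors, we started suffering more from cache misses, as evidenced by the shuffled input becoming slower than the sorted input, just as you had originally expected. We've come full circle! :)

Per-group sum in register

(This was added in an edit.)

Agh, nerd sniped! If you know your input will be sorted and are looking for even more performance, the following rewrite of the function (with no extra sum arrays) is even faster, at least on my computer.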

// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
  int i = n-1;
  while (i >= 0) {
    int g = p_g[i];
    int gsum = 0;
    do {
      gsum += p_x[i--];
    } while (i >= 0 && p_g[i] == g);
    p_out[g] += gsum;
  }
}

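The trick in this one is that it allows the compiler to keep the gsum variable, the sum of the current group, in a register. I'm guessing (but may be very wrong) that this is faster because the feedback loop in the pipeline can be shorter here, and/or because there are fewer memory accesses. A good branch predictor will make the extra check for group equality cheap.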
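Because this rewrite changes the control flow, it is worth checking that it still produces the right sums for unsorted input too. Below is a forward-scanning sketch of the same idea that is convenient to test; it is an illustration under the assumption that both vectors have equal length, not the exact function measured here, and the name `grouped_sum_runs` is made up.

```cpp
#include <cstddef>
#include <vector>

// Accumulates each run of equal group ids in a local variable and writes
// to memory once per run. Assumes x.size() == g.size().
void grouped_sum_runs(const std::vector<int>& x, const std::vector<int>& g,
                      std::vector<int>& out) {
  const std::size_t n = x.size();
  std::size_t i = 0;
  while (i < n) {
    const int group = g[i];
    int run_sum = 0;  // local accumulator, a candidate for a register
    do {
      run_sum += x[i++];
    } while (i < n && g[i] == group);
    out[group] += run_sum;  // one memory update per run
  }
}
```

On shuffled input every run tends to have length one, so the function degenerates to the original loop plus an extra comparison per element.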

Results

It is terrible for shuffled input...

sumspeed$ time ./sum_groups < groups_shuffled
2236354315

real    0m2.932s
user    0m2.923s
sys 0m0.009s

...but it is around 40% faster than my "many sums" solution for sorted input.

sumspeed$ time ./sum_groups < groups_sorted
809694018

real    0m1.501s
user    0m1.496s
sys 0m0.005s

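Lots of small groups will be slower than a few big ones, so whether or not this is the faster implementation really depends on your data. And, as always, on your CPU model.

Multiple sum vectors, with offset instead of bit masking

Sopel suggested four unrolled additions as an alternative to my bit masking approach. I have implemented a generalized version of their suggestion, which can handle different values of NSUMS. I am counting on the compiler to unroll the inner loop for us (which it did, at least for NSUMS=4).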

#include <iostream>
#include <chrono>
#include <vector>

#ifndef NSUMS
#define NSUMS (4) // must be power of 2 (for masking to work)
#endif

#ifndef INNER
#define INNER (0)
#endif
#if INNER
// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int** p_out) {
  size_t i = 0;
  int quadend = n & ~(NSUMS-1);
  for (; i < quadend; i += NSUMS) {
    for (int k=0; k<NSUMS; ++k) {
      p_out[k][p_g[i+k]] += p_x[i+k];
    }
  }
  for (; i < n; ++i) {
    p_out[0][p_g[i]] += p_x[i];
  }
}
#else
// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int** p_out) {
  for (size_t i = 0; i < n; ++i) {
    p_out[i & (NSUMS-1)][p_g[i]] += p_x[i];
  }
}
#endif


int main() {
  std::vector<int> values;
  std::vector<int> groups;
  std::vector<int> sums[NSUMS];

  int n_groups = 0;

  // Read in the values and calculate the max number of groups
  while(std::cin) {
    int value, group;
    std::cin >> value >> group;
    values.push_back(value);
    groups.push_back(group);
    if (group >= n_groups) {
      n_groups = group+1;
    }
  }
  for (int i=0; i<NSUMS; ++i) {
    sums[i].resize(n_groups);
  }

  // Time grouped sums
  std::chrono::system_clock::time_point start = std::chrono::system_clock::now();
  int* sumdata[NSUMS];
  for (int i = 0; i < NSUMS; ++i) {
    sumdata[i] = sums[i].data();
  }
  for (int i = 0; i < 1000; ++i) {
    grouped_sum(values.data(), groups.data(), values.size(), sumdata);
  }
  for (int i = 1; i < NSUMS; ++i) {
    for (int j = 0; j < n_groups; ++j) {
      sumdata[0][j] += sumdata[i][j];
    }
  }
  std::chrono::system_clock::time_point end = std::chrono::system_clock::now();

  std::cout << (end - start).count() << " with NSUMS=" << NSUMS << ", INNER=" << INNER << std::endl;

  return 0;
}

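Results

Time to measure. Note that since I was working in /tmp yesterday, I don't have the exact same input data. Hence, these results are not directly comparable to the previous ones (but probably close enough).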

sumspeed$ for n in 2 4 8 16; do for inner in 0 1; do make -s clean && make -s NSUMS=$n INNER=$inner && (perf stat ./sum_groups < groups_shuffled && perf stat ./sum_groups < groups_sorted)  2>&1 | egrep '^[0-9]|frontend'; done; done
1130558787 with NSUMS=2, INNER=0
       915 158 411      stalled-cycles-frontend   #   16,96% frontend cycles idle   
1351420957 with NSUMS=2, INNER=0
     1 589 408 901      stalled-cycles-frontend   #   26,21% frontend cycles idle   
840071512 with NSUMS=2, INNER=1
     1 053 982 259      stalled-cycles-frontend   #   23,26% frontend cycles idle   
1391591981 with NSUMS=2, INNER=1
     2 830 348 854      stalled-cycles-frontend   #   45,35% frontend cycles idle   
1110302654 with NSUMS=4, INNER=0
       890 869 892      stalled-cycles-frontend   #   16,68% frontend cycles idle   
1145175062 with NSUMS=4, INNER=0
       948 879 882      stalled-cycles-frontend   #   17,40% frontend cycles idle   
822954895 with NSUMS=4, INNER=1
     1 253 110 503      stalled-cycles-frontend   #   28,01% frontend cycles idle   
929548505 with NSUMS=4, INNER=1
     1 422 753 793      stalled-cycles-frontend   #   30,32% frontend cycles idle   
1128735412 with NSUMS=8, INNER=0
       921 158 397      stalled-cycles-frontend   #   17,13% frontend cycles idle   
1120606464 with NSUMS=8, INNER=0
       891 960 711      stalled-cycles-frontend   #   16,59% frontend cycles idle   
800789776 with NSUMS=8, INNER=1
     1 204 516 303      stalled-cycles-frontend   #   27,25% frontend cycles idle   
805223528 with NSUMS=8, INNER=1
     1 222 383 317      stalled-cycles-frontend   #   27,52% frontend cycles idle   
1121644613 with NSUMS=16, INNER=0
       886 781 824      stalled-cycles-frontend   #   16,54% frontend cycles idle   
1108977946 with NSUMS=16, INNER=0
       860 600 975      stalled-cycles-frontend   #   16,13% frontend cycles idle   
911365998 with NSUMS=16, INNER=1
     1 494 671 476      stalled-cycles-frontend   #   31,54% frontend cycles idle   
898729229 with NSUMS=16, INNER=1
     1 474 745 548      stalled-cycles-frontend   #   31,24% frontend cycles idle

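Yep, the inner loop with NSUMS=8 is the fastest on my computer. Compared to my "local gsum" approach, it also has the added benefit of not becoming terrible for shuffled input.

Interesting to note: NSUMS=16 becomes worse than NSUMS=8. This could be because we are starting to see more cache misses, or because we don't have enough registers to unroll the inner loop properly.

— Snild Dolkow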

This was fun. :)

— Snild Dolkow

Awesome! I didn't know about perf.

— Tanveer Badar

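I wonder whether, in your first approach, manually unrolling 4x with 4 different accumulators would yield even better performance. Something like godbolt.org/z/S-PhFm

— Sopel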

Thanks for the suggestion. Yes, it did improve performance, and I have added it to the answer.

— Snild Dolkow

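Thanks! I had considered that something like this might be going on, but didn't know how to go about confirming it. Thank you for your detailed answer!

— Jim

Here is why sorted groups are slower than unsorted groups.

First, here is the assembly code for the summing loop: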

008512C3  mov         ecx,dword ptr [eax+ebx]
008512C6  lea         eax,[eax+4]
008512C9  lea         edx,[esi+ecx*4] // &sums[groups[i]]
008512CC  mov         ecx,dword ptr [eax-4] // values[i]
008512CF  add         dword ptr [edx],ecx // sums[groups[i]]+=values[i]
008512D1  sub         edi,1
008512D4  jne         main+163h (08512C3h)

Let's look at the add instruction, which is the main cause of this issue:

008512CF  add         dword ptr [edx],ecx // sums[groups[i]]+=values[i]

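When the processor executes this instruction, it first issues a memory read (load) request to the address in edx, then adds the value of ecx, and then issues a write (store) request to the same address.

There is a feature in the processor called memory reordering:

To allow performance optimization of instruction execution, the IA-32 architecture allows departures from the strong-ordering model called processor ordering in Pentium 4, Intel Xeon, and P6 family processors. These processor-ordering variations (called here the memory-ordering model) allow performance enhancing operations such as allowing reads to go ahead of buffered writes. The goal of any of these variations is to increase instruction execution speeds, while maintaining memory coherency, even in multiple-processor systems.

And there is a rule:

Reads may be reordered with older writes to different locations but not with older writes to the same location.

So suppose the next iteration reaches the add instruction before the previous write request has completed. If the address in edx differs from the previous one, the add does not wait: it issues its read request, which is reordered ahead of the older write, and carries on. But if the address is the same, the add instruction waits until the older write is done.

Note that the loop is short, and the processor can execute it faster than the memory controller can complete the write request.

So for sorted groups, you read and write the same address many times in a row, and the performance benefit of memory reordering is lost. With random groups, each iteration will probably have a different address, so the read does not wait for the older write and is reordered ahead of it; the add instruction does not have to wait for the previous one to finish.

— Ahmed Anter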

Why is grouped summation slower with sorted groups than with unsorted groups?