133

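I like some features of D, but I would like to know whether they come with a runtime penalty.

To compare, I implemented a simple program that computes the scalar products of many short vectors, both in C++ and in D. The result is surprising:

D: 18.9 s [see below for the final runtime]
C++: 3.8 s

Is C++ really almost five times as fast, or did I make a mistake in the D program?

I compiled the C++ with g++ -O3 (gcc-snapshot 2011-02-19) and the D with dmd -O (dmd 2.052) on a moderately recent Linux desktop. The results are reproducible over several runs, and the standard deviations are negligible.

Here is the C++ program: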

#include <iostream>
#include <random>
#include <chrono>
#include <string>

#include <vector>
#include <array>

typedef std::chrono::duration<long, std::ratio<1, 1000>> millisecs;
template <typename _T>
long time_since(std::chrono::time_point<_T>& time) {
  long tm = std::chrono::duration_cast<millisecs>(std::chrono::system_clock::now() - time).count();
  time = std::chrono::system_clock::now();
  return tm;
}

const long N = 20000;
const int size = 10;

typedef int value_type;
typedef long long result_type;
typedef std::vector<value_type> vector_t;
typedef typename vector_t::size_type size_type;

inline value_type scalar_product(const vector_t& x, const vector_t& y) {
  value_type res = 0;
  size_type siz = x.size();
  for (size_type i = 0; i < siz; ++i)
    res += x[i] * y[i];
  return res;
}

int main() {
  auto tm_before = std::chrono::system_clock::now();

  // 1. allocate and fill randomly many short vectors
  vector_t* xs = new vector_t [N];
  for (int i = 0; i < N; ++i) {
    xs[i] = vector_t(size);
  }
  std::cerr << "allocation: " << time_since(tm_before) << " ms" << std::endl;

  std::mt19937 rnd_engine;
  std::uniform_int_distribution<value_type> runif_gen(-1000, 1000);
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < size; ++j)
      xs[i][j] = runif_gen(rnd_engine);
  std::cerr << "random generation: " << time_since(tm_before) << " ms" << std::endl;

  // 2. compute all pairwise scalar products:
  time_since(tm_before);
  result_type avg = 0;
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) 
      avg += scalar_product(xs[i], xs[j]);
  avg = avg / N*N;
  auto time = time_since(tm_before);
  std::cout << "result: " << avg << std::endl;
  std::cout << "time: " << time << " ms" << std::endl;
}

And here is the D version:

import std.stdio;
import std.datetime;
import std.random;

const long N = 20000;
const int size = 10;

alias int value_type;
alias long result_type;
alias value_type[] vector_t;
alias uint size_type;

value_type scalar_product(const ref vector_t x, const ref vector_t y) {
  value_type res = 0;
  size_type siz = x.length;
  for (size_type i = 0; i < siz; ++i)
    res += x[i] * y[i];
  return res;
}

int main() {   
  auto tm_before = Clock.currTime();

  // 1. allocate and fill randomly many short vectors
  vector_t[] xs;
  xs.length = N;
  for (int i = 0; i < N; ++i) {
    xs[i].length = size;
  }
  writefln("allocation: %i ", (Clock.currTime() - tm_before));
  tm_before = Clock.currTime();

  for (int i = 0; i < N; ++i)
    for (int j = 0; j < size; ++j)
      xs[i][j] = uniform(-1000, 1000);
  writefln("random: %i ", (Clock.currTime() - tm_before));
  tm_before = Clock.currTime();

  // 2. compute all pairwise scalar products:
  result_type avg = cast(result_type) 0;
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) 
      avg += scalar_product(xs[i], xs[j]);
  avg = avg / N*N;
  writefln("result: %d", avg);
  auto time = Clock.currTime() - tm_before;
  writefln("scalar products: %i ", time);

  return 0;
}

— Lars

3

By the way, your program has a bug on this line: avg = avg / N*N (order of operations).

— Vladimir Panteleev, 2011
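
To make the precedence point concrete, here is a minimal C++ sketch. N matches the constant in the question; the two helper names are made up for illustration. Because / and * have equal precedence and associate left to right, avg / N*N parses as (avg / N) * N.

```cpp
#include <cassert>

// N matches the constant in the question; the helper names are hypothetical.
const long N = 20000;

// What the posted line computes: (sum / N) * N, which only rounds sum
// toward zero to a multiple of N.
long long average_as_posted(long long sum) {
  return sum / N * N;
}

// What was presumably intended: the mean over all N * N pairs.
long long average_intended(long long sum) {
  assert(N > 0);
  return sum / (N * N);
}
```

With a sum of 1000000123, the posted form returns 1000000000 while the parenthesized form returns 2, which is why the printed "result" values in this thread differ so much between versions.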

4

You could try rewriting the code using array/vector operations: digitalmars.com/d/2.0/arrays.html

— Michal Minich, 2011
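
The suggestion above is about D's built-in array operations. On the C++ side of the comparison, the closest standard-library route is probably std::inner_product; the sketch below is only an illustration, and the function name is an assumption, not something from the question.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: computes the same value as the hand-written loop in
// the question's scalar_product, using the standard algorithm instead.
int scalar_product_stl(const std::vector<int>& x, const std::vector<int>& y) {
  assert(x.size() == y.size());
  // The last argument is the initial value of the accumulator.
  return std::inner_product(x.begin(), x.end(), y.begin(), 0);
}
```

Whether this is faster than the explicit loop depends on the compiler; it mainly states the intent more directly.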

10

To provide a better comparison, you should use the same compiler back end: either DMD and DMC++, or GDC and G++.

— he_the_great

1

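@Sion Sheevok Unfortunately, dmd profiling does not seem to be available on Linux? (Please correct me if I am wrong, but if I say dmd ... trace.def I get an error: unrecognized file extension def. And the dmd docs for optlink mention only Windows.)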

— Lars

1

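Ah, I never cared about the .def file it spits out. The timings are in the .log file. "It contains the list of functions in the order that the linker should link them" - maybe that helps optlink optimize something? Also note that "In addition, ld fully supports the standard "*.def" files, which may be specified on the linker command line like an object file" - so you could try passing trace.def via -L if you really want to.

— Trass3r, 2011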

64

To enable all optimizations and disable all safety checks, compile your D program with the following DMD flags:

-O -inline -release -noboundscheck

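EDIT: I've tried your programs with g++, dmd and gdc. dmd does lag behind, but gdc achieves performance very close to g++. The command line I used was gdmd -O -release -inline (gdmd is a wrapper around gdc that accepts dmd options).

Looking at the assembler listing, it looks like neither dmd nor gdc inlined scalar_product, but g++/gdc did emit MMX instructions, so they might be auto-vectorizing the loop.

— Vladimir Panteleev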

3

@CyberShadow: but if you remove the safety checks... aren't you losing some important features of D?

— Matthieu M.

33

You're losing features that C++ never had. Most languages don't give you a choice.

— Vladimir Panteleev, 2011

6

@CyberShadow: can we think of this as a kind of debug vs. release build?

— Francesco

7

@Bernard: with -release, bounds checking is turned off for all code except safe functions. To really turn off bounds checking, use both -release and -noboundscheck.

— Michal Minich, 2011

5

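@CyberShadow Thanks! With these flags the runtime improves considerably. D is now at 12.9 s, but that is still more than 3 times as long. @Matthieu M. I would not mind testing a program in slow motion with bounds checking and, once it is debugged, letting it do its computations without bounds checking. (I do the same with C++ now.)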

— Lars
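
For the C++ side of that debug-versus-release trade-off, a minimal sketch, assuming std::vector storage as in the question: at() is the bounds-checked accessor and operator[] is not. All three function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Checked access: at() throws std::out_of_range on a bad index.
int sum_checked(const std::vector<int>& v, std::size_t count) {
  int total = 0;
  for (std::size_t i = 0; i < count; ++i)
    total += v.at(i);
  return total;
}

// Unchecked access: operator[] with an out-of-range index is undefined
// behavior, so the only guard is an assert that disappears under NDEBUG.
int sum_unchecked(const std::vector<int>& v, std::size_t count) {
  assert(count <= v.size());
  int total = 0;
  for (std::size_t i = 0; i < count; ++i)
    total += v[i];
  return total;
}

// Returns true if the checked version reports a one-past-the-end read.
bool checked_version_catches_overrun(const std::vector<int>& v) {
  try {
    sum_checked(v, v.size() + 1);
  } catch (const std::out_of_range&) {
    return true;
  }
  return false;
}
```

The difference from D is that C++ makes this choice per call site, while the dmd flags discussed above switch it for the whole build.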

32

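One big thing that slows D down is a subpar garbage collection implementation. Benchmarks that don't heavily stress the GC will show performance very similar to C and C++ code compiled with the same compiler back end. Benchmarks that heavily stress the GC will show that D performs abysmally. Rest assured, though, this is a single (albeit severe) quality-of-implementation issue, not a baked-in guarantee of slowness. Also, D gives you the ability to opt out of the GC and tune memory management in performance-critical bits, while still using it in the less performance-critical 95% of your code.

I've put some effort into improving GC performance lately, and the results have been rather dramatic, at least on synthetic benchmarks. Hopefully these changes will be integrated into one of the next few releases and will mitigate the issue.

— dsimcha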

1

I noticed one of your changes was a switch from division to a bit shift. Shouldn't that be something the compiler does?

— GManNickG, 2011

3

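@GMan: Yes, if the value you're dividing by is known at compile time. No, if the value is only known at run time, which was the case where I made that optimization.

— dsimcha

@dsimcha: Hm. I figure if you know to make it, the compiler can too. Is it a quality-of-implementation issue, or am I missing some condition that has to be satisfied that the compiler can't prove, but you know? (I'm learning D now, so these little things about the compiler are suddenly interesting to me. :))

— GManNickG, 2011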

13

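@GMan: Bit shifting only works if the number you're dividing by is a power of two. The compiler can't prove this if the number is only known at run time, and testing and branching would be slower than just using the div instruction. My case is unusual because the value is only known at run time, but I know at compile time that it's going to be some power of two.

— dsimcha, 2011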
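
A small C++ sketch of the situation described above, not dsimcha's actual GC code: the divisor is only known at run time, but the programmer promises it is a power of two and therefore passes its base-2 logarithm. All names here are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// True if n has exactly one bit set.
bool is_power_of_two(std::uint64_t n) {
  return n != 0 && (n & (n - 1)) == 0;
}

// General case: the compiler has to emit a real division, because it
// cannot prove anything about a run-time divisor.
std::uint64_t divide_generic(std::uint64_t value, std::uint64_t divisor) {
  assert(divisor != 0);
  return value / divisor;
}

// Special case: the caller knows the divisor is 2^log2_divisor, so a shift
// gives the same quotient for unsigned values.
std::uint64_t divide_by_pow2(std::uint64_t value, unsigned log2_divisor) {
  assert(log2_divisor < 64);
  return value >> log2_divisor;
}
```

The sketch sticks to unsigned integers on purpose: for negative signed values, an arithmetic shift rounds toward negative infinity while division rounds toward zero, so the two are not interchangeable there.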

7

Note that the program posted in this example does not allocate in the time-consuming part.

— Vladimir Panteleev, 2011

27

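This is a very instructive thread; thanks to the OP and the helpers for all the work.

One note: this test does not assess the general question of abstraction/feature penalty, or even that of back-end quality. It focuses on virtually one optimization (loop optimization). I think it's fair to say that gcc's back end is somewhat more refined than dmd's, but it would be a mistake to assume that the gap between them is as large for all tasks.

— Andrei Alexandrescu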

4

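I completely agree. As I added later, I'm mainly interested in performance for numerical computations, where loop optimization is probably the most important one. Which other optimizations do you think would be important for numerical computing? And which computations would test them? I'd be interested in complementing my test and implementing some more tests (if they are roughly as simple). But perhaps that is a separate question?

— Lars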

11

As an engineer who cut his teeth on C++, I consider you a hero of mine. Respectfully, though, this should be a comment, not an answer.

— Alan

14

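Definitely seems like a quality-of-implementation issue.

I ran some tests with the OP's code and made some changes. I actually got D running faster than C++ with LDC/clang++, operating on the assumption that the arrays must be allocated dynamically (xs and the associated scalars). See below for some numbers.

Questions for the OP

Is it intentional that the same seed is used on every run of the C++ version, but not of the D version?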
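
For context on that question: a default-constructed std::mt19937, as in the OP's C++ program, always starts from the fixed seed std::mt19937::default_seed, so every run sees the same "random" vectors. The sketch below illustrates this; the function names are invented for the example, and seeding from std::random_device is just one common way to get run-to-run variation.

```cpp
#include <cassert>
#include <random>

// First output of a default-constructed engine; identical on every run
// because the seed is the fixed constant std::mt19937::default_seed.
std::mt19937::result_type first_draw_default_seed() {
  std::mt19937 engine;
  return engine();
}

// First output of an engine constructed with an explicit seed.
std::mt19937::result_type first_draw_with_seed(std::mt19937::result_type seed) {
  std::mt19937 engine(seed);
  return engine();
}

// An engine seeded from std::random_device, which typically differs
// between runs (the standard does not strictly guarantee that).
std::mt19937 make_nondeterministic_engine() {
  std::random_device device;
  return std::mt19937(device());
}
```

For a benchmark, a fixed seed is arguably the better choice, since both programs can then be fed reproducible input.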

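Setup

I tweaked the original D source (dubbed scalar.d) to make it portable between platforms. This only involved changing the integer types used to access and modify the sizes of arrays.

After this, I made the following changes:

Used uninitializedArray to avoid default initialization of the scalars in xs (this probably made the biggest difference). This matters because D normally default-initializes everything silently, which C++ does not.
Factored out the printing code and replaced writefln with writeln
Changed the imports to be selective
Used the pow operator (^^) instead of manual multiplication for the final step of calculating the average
Removed size_type and replaced it where appropriate with the new index_type alias

...thus resulting in scalar2.d (pastebin):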

    import std.stdio : writeln;
    import std.datetime : Clock, Duration;
    import std.array : uninitializedArray;
    import std.random : uniform;

    alias result_type = long;
    alias value_type = int;
    alias vector_t = value_type[];
    alias index_type = typeof(vector_t.init.length);// Make index integrals portable - Linux is ulong, Win8.1 is uint

    immutable long N = 20000;
    immutable int size = 10;

    // Replaced for loops with appropriate foreach versions
    value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
      value_type res = 0;
      for(index_type i = 0; i < size; ++i)
        res += x[i] * y[i];
      return res;
    }

    int main() {
      auto tm_before = Clock.currTime;
      auto countElapsed(in string taskName) { // Factor out printing code
        writeln(taskName, ": ", Clock.currTime - tm_before);
        tm_before = Clock.currTime;
      }

      // 1. allocate and fill randomly many short vectors
      vector_t[] xs = uninitializedArray!(vector_t[])(N);// Avoid default inits of inner arrays
      for(index_type i = 0; i < N; ++i)
        xs[i] = uninitializedArray!(vector_t)(size);// Avoid more default inits of values
      countElapsed("allocation");

      for(index_type i = 0; i < N; ++i)
        for(index_type j = 0; j < size; ++j)
          xs[i][j] = uniform(-1000, 1000);
      countElapsed("random");

      // 2. compute all pairwise scalar products:
      result_type avg = 0;
      for(index_type i = 0; i < N; ++i)
        for(index_type j = 0; j < N; ++j)
          avg += scalar_product(xs[i], xs[j]);
      avg /= N ^^ 2;// Replace manual multiplication with pow operator
      writeln("result: ", avg);
      countElapsed("scalar products");

      return 0;
    }

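After testing scalar2.d (which prioritized optimizing for speed), out of curiosity I replaced the loops in main with their foreach equivalents and called the result scalar3.d (pastebin):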

    import std.stdio : writeln;
    import std.datetime : Clock, Duration;
    import std.array : uninitializedArray;
    import std.random : uniform;

    alias result_type = long;
    alias value_type = int;
    alias vector_t = value_type[];
    alias index_type = typeof(vector_t.init.length);// Make index integrals portable - Linux is ulong, Win8.1 is uint

    immutable long N = 20000;
    immutable int size = 10;

    // Replaced for loops with appropriate foreach versions
    value_type scalar_product(in ref vector_t x, in ref vector_t y) { // "in" is the same as "const" here
      value_type res = 0;
      for(index_type i = 0; i < size; ++i)
        res += x[i] * y[i];
      return res;
    }

    int main() {
      auto tm_before = Clock.currTime;
      auto countElapsed(in string taskName) { // Factor out printing code
        writeln(taskName, ": ", Clock.currTime - tm_before);
        tm_before = Clock.currTime;
      }

      // 1. allocate and fill randomly many short vectors
      vector_t[] xs = uninitializedArray!(vector_t[])(N);// Avoid default inits of inner arrays
      foreach(ref x; xs)
        x = uninitializedArray!(vector_t)(size);// Avoid more default inits of values
      countElapsed("allocation");

      foreach(ref x; xs)
        foreach(ref val; x)
          val = uniform(-1000, 1000);
      countElapsed("random");

      // 2. compute all pairwise scalar products:
      result_type avg = 0;
      foreach(const ref x; xs)
        foreach(const ref y; xs)
          avg += scalar_product(x, y);
      avg /= N ^^ 2;// Replace manual multiplication with pow operator
      writeln("result: ", avg);
      countElapsed("scalar products");

      return 0;
    }

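I compiled each of these tests with an LLVM-based compiler, since LDC seems to be the best option for D compilation in terms of performance. On my x86_64 Arch Linux installation I used the following packages: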

clang 3.6.0-3
ldc 1:0.15.1-4
dtools 2.067.0-2

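I used the following commands to compile each:

C++: clang++ scalar.cpp -o"scalar.cpp.exe" -std=c++11 -O3
D: rdmd --compiler=ldc2 -O3 -boundscheck=off <sourcefile>

Results

The results (screenshot of raw console output) for each version of the source are as follows:

scalar.cpp (original C++):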

allocation: 2 ms

random generation: 12 ms

result: 29248300000

time: 2582 ms

C++ sets the standard at 2582 ms.

scalar.d (modified OP source):

allocation: 5 ms, 293 μs, and 5 hnsecs 

random: 10 ms, 866 μs, and 4 hnsecs 

result: 53237080000

scalar products: 2 secs, 956 ms, 513 μs, and 7 hnsecs

This ran for ~2957 ms. Slower than the C++ implementation, but not by much.

scalar2.d (index/length type change and uninitializedArray optimization):

allocation: 2 ms, 464 μs, and 2 hnsecs

random: 5 ms, 792 μs, and 6 hnsecs

result: 59

scalar products: 1 sec, 859 ms, 942 μs, and 9 hnsecs

In other words, ~1860 ms. So far this is in the lead.

scalar3.d (foreach loops):

allocation: 2 ms, 911 μs, and 3 hnsecs

random: 7 ms, 567 μs, and 8 hnsecs

result: 189

scalar products: 2 secs, 182 ms, and 366 μs

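~2182 ms is slower than scalar2.d, but faster than the C++ version.

Conclusion

With the right optimizations, the D implementation actually ran faster than its equivalent C++ implementation using the available LLVM-based compilers. For most applications, the current gap between D and C++ seems to stem only from limitations of the current implementations.

— Erich Gubler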

8

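dmd is the reference implementation of the language, so most work goes into the front end to fix bugs rather than into optimizing the back end.

"in" is faster in your case because you are using dynamic arrays, which are reference types. With ref you introduce another level of indirection (which is normally used to alter the array itself and not only its contents).

Vectors are usually implemented with structs, where const ref makes perfect sense. See smallptD vs. smallpt for a real-world example featuring loads of vector operations and randomness.

Note that 64-bit can also make a difference. I once missed that on x64 gcc compiles 64-bit code while dmd still defaults to 32-bit (this will change when the 64-bit code generator matures). There was a remarkable speedup with "dmd -m64 ...".

— Trass3r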
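
The "in" versus ref point can be modeled in C++, with the caveat that this is only an analogy. A D dynamic array is essentially a (pointer, length) pair; IntSlice below is a hypothetical stand-in for it, and the function names are assumptions as well.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a D dynamic array: a (pointer, length) pair.
struct IntSlice {
  const int* data;
  std::size_t length;
};

// Builds a slice that views the vector's storage without copying it.
IntSlice slice_of(const std::vector<int>& v) {
  return IntSlice{v.data(), v.size()};
}

// Like D's "in vector_t": the pair itself is passed by value, so the
// callee reads the element pointer and the length directly.
int dot_by_value(IntSlice x, IntSlice y) {
  assert(x.length == y.length);
  int res = 0;
  for (std::size_t i = 0; i < x.length; ++i)
    res += x.data[i] * y.data[i];
  return res;
}

// Like D's "ref vector_t": the callee gets the address of the pair and
// has to load the pointer and length through it, one extra indirection.
int dot_by_reference(const IntSlice& x, const IntSlice& y) {
  assert(x.length == y.length);
  int res = 0;
  for (std::size_t i = 0; i < x.length; ++i)
    res += x.data[i] * y.data[i];
  return res;
}
```

Both functions return the same value; any speed difference depends on whether the compiler inlines the call and keeps the pair in registers, which is exactly the kind of back-end detail this thread keeps running into.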

7

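Whether C++ or D is likely to be faster is going to depend heavily on what you're doing. I would think that when comparing well-written C++ to well-written D code, they would generally either be of similar speed or C++ would be faster, but what the particular compiler manages to optimize could have a big effect completely aside from the language itself.

However, there are a few cases where D stands a good chance of beating C++ for speed. The main one that comes to mind is string processing. Thanks to D's array slicing capabilities, strings (and arrays in general) can be processed much faster than you can readily do in C++. For D1, Tango's XML processor is extremely fast, thanks primarily to D's array slicing capabilities (and hopefully D2 will have a similarly fast XML parser once the one currently being worked on for Phobos has been completed). So, ultimately, whether D or C++ is going to be faster depends very much on what you're doing.

Now, I am surprised that you're seeing such a difference in speed in this particular case, but it is the sort of thing that I would expect to improve as dmd improves. Using gdc might yield better results and would likely be a closer comparison of the languages themselves (rather than the back ends), given that it's gcc-based. But it wouldn't surprise me at all if there are a number of things that could be done to speed up the code that dmd generates. I don't think there's much question that gcc is more mature than dmd at this point, and code optimizations are one of the prime fruits of code maturity.

Ultimately, what matters is how well dmd performs for your particular application, but I do agree that it would definitely be nice to know how well C++ and D compare in general. In theory, they should be pretty much the same, but it really depends on the implementation. I think a comprehensive set of benchmarks would be required to really test how well the two presently compare, however.

— Jonathan M Davis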
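
To illustrate the slicing argument from the C++ side: since C++17 (which postdates this answer), std::string_view offers a comparable non-owning view, so tokens can be cut out of a buffer without copying any characters. The splitter below is a minimal sketch with an invented name, not a tuned parser.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: splits text on spaces and returns views into the
// original buffer. No characters are copied; each token is just a
// (pointer, length) pair, much like a D array slice.
std::vector<std::string_view> split_on_spaces(std::string_view text) {
  std::vector<std::string_view> words;
  std::size_t start = 0;
  while (start < text.size()) {
    std::size_t end = text.find(' ', start);
    if (end == std::string_view::npos)
      end = text.size();
    assert(end >= start);
    if (end > start)  // skip empty tokens produced by repeated spaces
      words.push_back(text.substr(start, end - start));
    start = end + 1;
  }
  return words;
}
```

The returned views are only valid while the original buffer is alive. In D the garbage collector keeps a sliced array alive, whereas in C++ that lifetime is the caller's responsibility.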

4

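Yes, I'd be surprised if input/output were significantly faster in either language, or if pure math were significantly faster in either language, but string operations, memory management, and a few other things could easily let one language shine.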

— Max Lybbert

1

It's easy to do better (faster) than C++ iostreams. But that's primarily a library implementation issue (on all known versions from the most popular vendors).

— Ben Voigt

4

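You can write C code in D, so as for which is faster, it will depend on a lot of things:

What compiler you use
What features you use
How aggressively you optimize

Differences in the first aren't fair to drag in. The second might give C++ an advantage, as it, if anything, has fewer heavy features. The third is the fun one: D code is in some ways easier to optimize because in general it is easier to understand. It also has the ability to do a large degree of generative programming, allowing things like verbose and repetitive but fast code to be written in a shorter form.

— BCS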
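
As a rough C++ counterpart to the generative-programming remark, a template can take the vector length as a compile-time parameter, so each length gets its own specialized function with a fixed trip count that the optimizer is free to unroll. This is a sketch under that assumption; scalar_product_fixed is an invented name, and D's own facilities (templates, mixins, CTFE) go further than this.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-size scalar product: Size is known at compile time,
// so the loop bound is a constant in every instantiation.
template <std::size_t Size>
long long scalar_product_fixed(const std::array<int, Size>& x,
                               const std::array<int, Size>& y) {
  static_assert(Size > 0, "vectors must not be empty");
  long long res = 0;
  for (std::size_t i = 0; i < Size; ++i)
    res += static_cast<long long>(x[i]) * y[i];  // widen before multiplying
  return res;
}
```

For the benchmark in the question this would mean std::array<int, 10> instead of a heap-allocated vector per row, which changes the memory layout as well, so it is not a like-for-like replacement.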

3

Seems like a quality-of-implementation issue. For example, here's what I've been testing with:

import std.datetime, std.stdio, std.random;

version = ManualInline;

immutable N = 20000;
immutable Size = 10;

alias int value_type;
alias long result_type;
alias value_type[] vector_type;

result_type scalar_product(in vector_type x, in vector_type y)
in
{
    assert(x.length == y.length);
}
body
{
    result_type result = 0;

    foreach(i; 0 .. x.length)
        result += x[i] * y[i];

    return result;
}

void main()
{   
    auto startTime = Clock.currTime();

    // 1. allocate vectors
    vector_type[] vectors = new vector_type[N];
    foreach(ref vec; vectors)
        vec = new value_type[Size];

    auto time = Clock.currTime() - startTime;
    writefln("allocation: %s ", time);
    startTime = Clock.currTime();

    // 2. randomize vectors
    foreach(ref vec; vectors)
        foreach(ref e; vec)
            e = uniform(-1000, 1000);

    time = Clock.currTime() - startTime;
    writefln("random: %s ", time);
    startTime = Clock.currTime();

    // 3. compute all pairwise scalar products
    result_type avg = 0;

    foreach(vecA; vectors)
        foreach(vecB; vectors)
        {
            version(ManualInline)
            {
                result_type result = 0;

                foreach(i; 0 .. vecA.length)
                    result += vecA[i] * vecB[i];

                avg += result;
            }
            else
            {
                avg += scalar_product(vecA, vecB);
            }
        }

    avg = avg / (N * N);

    time = Clock.currTime() - startTime;
    writefln("scalar products: %s ", time);
    writefln("result: %s", avg);
}

With ManualInline defined I get 28 seconds, but without it I get 32. So the compiler isn't even inlining this simple function, which I think it clearly should.

(My command line is dmd -O -noboundscheck -inline -release ....)

— GManNickG

1

Your timings are meaningless unless you also compare them to your C++ timings.

— deceleratedcaviar

3

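@Daniel: You're missing the point. It was to demonstrate the D optimizations in isolation, namely for the conclusion I stated: "So the compiler isn't even inlining this simple function, which I think it clearly should." I'm not even attempting to compare it to C++, as I clearly stated in my first sentence: "Seems like a quality-of-implementation issue."

— GManNickG, 2011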

Ah, true, sorry :). You'll also find that the DMD compiler doesn't vectorize the loops at all either.

— deceleratedcaviar

How fast is D compared to C++?
