Why is zipped faster than zip in Scala?

38

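I have written some Scala code to perform an element-wise operation on a collection. Here I define two methods that perform the same task. One method uses zip and the other uses zipped.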

def ES (arr :Array[Double], arr1 :Array[Double]) :Array[Double] = arr.zip(arr1).map(x => x._1 + x._2)

def ES1(arr :Array[Double], arr1 :Array[Double]) :Array[Double] = (arr,arr1).zipped.map((x,y) => x + y)

To compare the speed of these two methods, I wrote the following code:

def fun(arr: Array[Double], arr1: Array[Double], f: (Array[Double], Array[Double]) => Array[Double], itr: Int) = {
  val t0 = System.nanoTime()
  for (i <- 1 to itr) {
    f(arr, arr1)
  }
  val t1 = System.nanoTime()
  println("Total Time Consumed:" + ((t1 - t0).toDouble / 1000000000) + "Seconds")
}

I call the fun method and pass it ES and ES1 as shown below:

fun(Array.fill(10000)(math.random), Array.fill(10000)(math.random), ES , 100000)
fun(Array.fill(10000)(math.random), Array.fill(10000)(math.random), ES1, 100000)

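The results show that ES1, the method that uses zipped, is faster than ES, the method that uses zip. Based on these observations, I have two questions.

Why is zipped faster than zip?

Is there an even faster way to do element-wise operations on a collection in Scala?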

— username
source

2

Related question: stackoverflow.com/questions/59125910/...

— Mario Galic

8

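Because the JIT decided to optimize more aggressively the second time it saw `fun`. Or because the GC decided to clean something up while ES was running. Or because your operating system decided it had better things to do while your ES test was running. It could be anything; this microbenchmark just isn't conclusive.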

— Andrey Tyukin

1

What are the results on your machine? How much faster is it?

— Peeyush Kushwaha

For the same population size and configuration, zipped takes 32 seconds while zip takes 44 seconds.

— user12140540

3

Your results are meaningless. Use JMH if you must do microbenchmarks.

— OrangeDog

17

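To answer your second question:

Is there an even faster way to do element-wise operations on a collection in Scala?

The sad truth is that, despite their conciseness, improved productivity, and resilience to bugs, functional languages aren't necessarily the most performant. Using higher-order functions to define a projection to be executed against collections isn't free, and your tight loop highlights this. As others have pointed out, the additional storage allocation for intermediate and final results also has overhead.

If performance is critical, then in cases like yours (although this is by no means universal) you can unwind Scala's operations back into their imperative equivalents, in order to regain more direct control over memory usage and to eliminate function calls.

In your specific example, the zipped sum can be performed imperatively by pre-allocating a fixed, mutable array of the correct size (since zip stops when one of the collections runs out of elements), and then adding the elements at each index together (since accessing array elements by ordinal index is a very fast operation).

Adding a third function, ES3, to your test suite: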

def ES3(arr: Array[Double], arr1: Array[Double]): Array[Double] = {
  // Like zip, stop at the length of the shorter input
  val minSize = math.min(arr.length, arr1.length)
  val array = Array.ofDim[Double](minSize)
  for (i <- 0 to minSize - 1) {
    array(i) = arr(i) + arr1(i)
  }
  array
}

On my i7 I get the following response times:

OP ES Total Time Consumed:23.3747857Seconds
OP ES1 Total Time Consumed:11.7506995Seconds
--
ES3 Total Time Consumed:1.0255231Seconds

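Even more heinous would be to mutate the shorter of the two arrays directly in place. This obviously corrupts the contents of one of the arrays, so it should only be done if the original array is no longer needed: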

def ES4(arr: Array[Double], arr1: Array[Double]): Array[Double] = {
  val minSize = math.min(arr.length, arr1.length)
  // Reuse the shorter input as the output buffer; this overwrites its contents
  val array = if (arr.length < arr1.length) arr else arr1
  for (i <- 0 to minSize - 1) {
    array(i) = arr(i) + arr1(i)
  }
  array
}

Total Time Consumed:0.3542098Seconds

But obviously, direct mutation of array elements isn't in the spirit of Scala.

— StuartLC
source

2

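There's nothing parallelized in my code above. Although this specific problem is parallelizable (since multiple threads could work on different sections of the arrays), there wouldn't be much point for such a simple operation on only 10k elements: the overhead of creating and synchronizing new threads would likely outweigh any benefit. To be honest, if you need this level of performance optimization, you're likely better off writing these kinds of algorithms in Rust, Go or C.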

— StuartLC

3

It would be more Scala-like, and faster, to create the array with Array.tabulate(minSize)(i => arr(i) + arr1(i)).

— Sarvesh Kumar Singh

1

@SarveshKumarSingh This is much slower. It takes almost 9 seconds.

— user12140540

1

Array.tabulate should be much faster than either zip or zipped here (and it is in my benchmarks).

— Travis Brown

1

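@StuartLC "Performance would only be equivalent if the higher-order function is somehow unwrapped and inlined." This isn't really accurate. Even your for is desugared to a higher-order function call (foreach). The lambda will only be instantiated once in both cases.

— Travis Brown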

50

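None of the other answers mention the primary reason for the difference in speed, which is that the zipped version avoids 10,000 tuple allocations. As a couple of the other answers do note, the zip version involves an intermediate array while the zipped version doesn't, but allocating an array for 10,000 elements isn't what makes the zip version so much worse. It's the 10,000 short-lived tuples that are being put into that array. These are represented by objects on the JVM, so you're doing a bunch of object allocations for things that you're immediately going to throw away.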
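To make the allocation difference concrete, here is a rough hand-written sketch of what the two pipelines amount to for Array[Double]. This is not the standard library's implementation; AllocationSketch, zipThenMap and zippedMap are hypothetical names used only for illustration.

```scala
object AllocationSketch {
  // Roughly what arr.zip(arr1).map(...) amounts to: one pass that allocates a
  // Tuple2 per element pair, then a second pass over those tuples.
  def zipThenMap(a: Array[Double], b: Array[Double]): Array[Double] = {
    val n = math.min(a.length, b.length)
    val pairs = new Array[(Double, Double)](n) // intermediate array of tuples
    var i = 0
    while (i < n) {
      pairs(i) = (a(i), b(i)) // one short-lived object per element
      i += 1
    }
    val out = new Array[Double](n)
    i = 0
    while (i < n) {
      val p = pairs(i)
      out(i) = p._1 + p._2
      i += 1
    }
    out
  }

  // Roughly what (arr, arr1).zipped.map(...) amounts to: a single pass that
  // hands both elements straight to the function, with no tuple in between.
  def zippedMap(a: Array[Double], b: Array[Double])(f: (Double, Double) => Double): Array[Double] = {
    val n = math.min(a.length, b.length)
    val out = new Array[Double](n)
    var i = 0
    while (i < n) {
      out(i) = f(a(i), b(i))
      i += 1
    }
    out
  }
}
```

Both functions return the same values; the only difference is how many objects are created along the way.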

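The rest of this answer just goes into a little more detail about how you can confirm this.

Better benchmarking

You really want to use a framework like jmh to do any kind of benchmarking responsibly on the JVM, and even then the "responsibly" part is hard, although setting up jmh itself isn't too bad. If you have a project/plugins.sbt like this: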

addSbtPlugin("pl.project13.scala" % "sbt-jmh" % "0.3.7")

And a build.sbt like this (I'm using 2.11.8 since you mention that's what you're using):

scalaVersion := "2.11.8"

enablePlugins(JmhPlugin)

Then you can write your benchmark like this:

package zipped_bench

import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.Throughput))
class ZippedBench {
  val arr1 = Array.fill(10000)(math.random)
  val arr2 = Array.fill(10000)(math.random)

  def ES(arr: Array[Double], arr1: Array[Double]): Array[Double] =
    arr.zip(arr1).map(x => x._1 + x._2)

  def ES1(arr: Array[Double], arr1: Array[Double]): Array[Double] =
    (arr, arr1).zipped.map((x, y) => x + y)

  @Benchmark def withZip: Array[Double] = ES(arr1, arr2)
  @Benchmark def withZipped: Array[Double] = ES1(arr1, arr2)
}

And run it with sbt "jmh:run -i 10 -wi 10 -f 2 -t 1 zipped_bench.ZippedBench":

Benchmark                Mode  Cnt     Score    Error  Units
ZippedBench.withZip     thrpt   20  4902.519 ± 41.733  ops/s
ZippedBench.withZipped  thrpt   20  8736.251 ± 36.730  ops/s

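This shows that the zipped version gets about 80% more throughput, which is probably more or less the same as your measurements.

Measuring allocations

You can also ask jmh to measure allocations with -prof gc: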

Benchmark                                                 Mode  Cnt        Score       Error   Units
ZippedBench.withZip                                      thrpt    5     4894.197 ±   119.519   ops/s
ZippedBench.withZip:·gc.alloc.rate                       thrpt    5     4801.158 ±   117.157  MB/sec
ZippedBench.withZip:·gc.alloc.rate.norm                  thrpt    5  1080120.009 ±     0.001    B/op
ZippedBench.withZip:·gc.churn.PS_Eden_Space              thrpt    5     4808.028 ±    87.804  MB/sec
ZippedBench.withZip:·gc.churn.PS_Eden_Space.norm         thrpt    5  1081677.156 ± 12639.416    B/op
ZippedBench.withZip:·gc.churn.PS_Survivor_Space          thrpt    5        2.129 ±     0.794  MB/sec
ZippedBench.withZip:·gc.churn.PS_Survivor_Space.norm     thrpt    5      479.009 ±   179.575    B/op
ZippedBench.withZip:·gc.count                            thrpt    5      714.000              counts
ZippedBench.withZip:·gc.time                             thrpt    5      476.000                  ms
ZippedBench.withZipped                                   thrpt    5    11248.964 ±    43.728   ops/s
ZippedBench.withZipped:·gc.alloc.rate                    thrpt    5     3270.856 ±    12.729  MB/sec
ZippedBench.withZipped:·gc.alloc.rate.norm               thrpt    5   320152.004 ±     0.001    B/op
ZippedBench.withZipped:·gc.churn.PS_Eden_Space           thrpt    5     3277.158 ±    32.327  MB/sec
ZippedBench.withZipped:·gc.churn.PS_Eden_Space.norm      thrpt    5   320769.044 ±  3216.092    B/op
ZippedBench.withZipped:·gc.churn.PS_Survivor_Space       thrpt    5        0.360 ±     0.166  MB/sec
ZippedBench.withZipped:·gc.churn.PS_Survivor_Space.norm  thrpt    5       35.245 ±    16.365    B/op
ZippedBench.withZipped:·gc.count                         thrpt    5      863.000              counts
ZippedBench.withZipped:·gc.time                          thrpt    5      447.000                  ms

…where gc.alloc.rate.norm is probably the most interesting part, showing that the zip version allocates over three times as much as zipped.
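As a quick sanity check on that ratio, the arithmetic can be spelled out from the two gc.alloc.rate.norm figures reported above. The object and value names here are made up for illustration; only the two byte counts and the element count come from the benchmark.

```scala
object AllocationArithmetic {
  // Bytes allocated per operation, copied from the gc.alloc.rate.norm rows.
  val zipBytesPerOp: Double = 1080120.009
  val zippedBytesPerOp: Double = 320152.004
  // Both benchmark arrays hold 10,000 elements.
  val elements: Int = 10000

  // How many times more the zip version allocates (a little under 3.4).
  val ratio: Double = zipBytesPerOp / zippedBytesPerOp
  // Extra bytes the zip version allocates per call (about 760,000).
  val extraBytesPerOp: Double = zipBytesPerOp - zippedBytesPerOp
  // The same figure spread over the elements (about 76 bytes each).
  val extraBytesPerElement: Double = extraBytesPerOp / elements

  def main(args: Array[String]): Unit = {
    println(f"ratio: $ratio%.2f")
    println(f"extra bytes per op: $extraBytesPerOp%.0f")
    println(f"extra bytes per element: $extraBytesPerElement%.1f")
  }
}
```

The per-element figure is only a rough average over everything the zip version allocates beyond the zipped version; it does not say how those bytes are split between the tuples and the intermediate array.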

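Imperative implementations

If I knew that this method was going to be called in extremely performance-sensitive contexts, I'd probably implement it like this: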

  def ES3(arr: Array[Double], arr1: Array[Double]): Array[Double] = {
    val minSize = math.min(arr.length, arr1.length)
    val newArr = new Array[Double](minSize)
    var i = 0
    while (i < minSize) {
      newArr(i) = arr(i) + arr1(i)
      i += 1
    }
    newArr
  }

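Note that unlike the optimized version in one of the other answers, this uses while instead of for, since the for will still desugar into Scala collections operations. We can compare this implementation (withWhile), the other answer's optimized (but not in-place) implementation (withFor), and the two original implementations: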

Benchmark                Mode  Cnt       Score      Error  Units
ZippedBench.withFor     thrpt   20  118426.044 ± 2173.310  ops/s
ZippedBench.withWhile   thrpt   20  119834.409 ±  527.589  ops/s
ZippedBench.withZip     thrpt   20    4886.624 ±   75.567  ops/s
ZippedBench.withZipped  thrpt   20    9961.668 ± 1104.937  ops/s
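The remark about for desugaring can be illustrated with a small sketch. The two methods below are meant to be equivalent: the second is roughly what the compiler produces from the first. ForDesugaring, sumWithFor and sumWithForeach are hypothetical names, not part of the benchmark.

```scala
object ForDesugaring {
  // The for-loop form, written the same way as in the other answer.
  def sumWithFor(a: Array[Double], b: Array[Double]): Array[Double] = {
    val n = math.min(a.length, b.length)
    val out = new Array[Double](n)
    for (i <- 0 to n - 1) {
      out(i) = a(i) + b(i)
    }
    out
  }

  // Roughly what that loop desugars to: a foreach call on a Range,
  // with the loop body passed in as a function value.
  def sumWithForeach(a: Array[Double], b: Array[Double]): Array[Double] = {
    val n = math.min(a.length, b.length)
    val out = new Array[Double](n)
    (0 to n - 1).foreach { i =>
      out(i) = a(i) + b(i)
    }
    out
  }
}
```

In the figures above the for and while variants come out close to each other, so the foreach call does not appear to cost much in this particular case.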

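That's a really huge difference between the imperative and functional versions, even though all of these methods have exactly the same signature and the implementations have the same semantics. It's not as though the imperative implementations are using global state, etc. While the zip and zipped versions are more readable, I personally don't think there's any sense in which the imperative versions are against the "spirit of Scala", and I wouldn't hesitate to use them myself.

With tabulate

Update: I added a tabulate implementation to the benchmark, based on a comment on another answer: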

def ES4(arr: Array[Double], arr1: Array[Double]): Array[Double] = {
  val minSize = math.min(arr.length, arr1.length)
  Array.tabulate(minSize)(i => arr(i) + arr1(i))
}

It's much faster than the zip versions, although still much slower than the imperative ones:

Benchmark                  Mode  Cnt      Score     Error  Units
ZippedBench.withTabulate  thrpt   20  32326.051 ± 535.677  ops/s
ZippedBench.withZip       thrpt   20   4902.027 ±  47.931  ops/s

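This is what I'd expect, since there's nothing inherently expensive about calling a function, and because accessing array elements by index is very cheap.

— Travis Brown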
source

8

Consider lazyZip

(as lazyZip bs) map { case (a, b) => a + b }

instead of zip

(as zip bs) map { case (a, b) => a + b }

Scala 2.13 added lazyZip in favour of .zipped

Together with .zip on views, this replaces .zipped (now deprecated). (scala/collection-strawman#223)

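zipped (and hence lazyZip) is faster than zip because, as explained by Tim and Mike Allen, zip followed by map results in two separate transformations due to strictness, whereas zipped followed by map results in a single transformation executed in one go due to laziness.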
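For reference, the three spellings mentioned above can be put side by side. This is a minimal sketch that assumes Scala 2.13 (lazyZip is not available in earlier versions); StrictVsLazyZip and its method names are made up for illustration.

```scala
object StrictVsLazyZip {
  // Strict: zip builds a full array of tuples, then map walks it again.
  def strictZip(as: Array[Double], bs: Array[Double]): Array[Double] =
    as.zip(bs).map { case (a, b) => a + b }

  // Lazy: lazyZip only records the two collections; map does the single pass
  // and receives the two elements as separate arguments.
  def lazyZipped(as: Array[Double], bs: Array[Double]): Array[Double] =
    as.lazyZip(bs).map(_ + _)

  // View: zip on a view is also deferred, so no intermediate array is built,
  // although each element pair still passes through a tuple.
  def viewZip(as: Array[Double], bs: Array[Double]): Array[Double] =
    as.view.zip(bs).map { case (a, b) => a + b }.toArray
}
```

All three return the same result; how they compare in speed is something to measure with jmh rather than assume.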

zipped gives Tuple2Zipped, and analysing Tuple2Zipped.map,

class Tuple2Zipped[...](val colls: (It1, It2)) extends ... {
  private def coll1 = colls._1
  private def coll2 = colls._2

  def map[...](f: (El1, El2) => B)(...) = {
    val b = bf.newBuilder(coll1)
    ...
    val elems1 = coll1.iterator
    val elems2 = coll2.iterator

    while (elems1.hasNext && elems2.hasNext) {
      b += f(elems1.next(), elems2.next())
    }

    b.result()
  }

we see that the two collections coll1 and coll2 are iterated over, and on each iteration the function f passed to map is applied along the way

b += f(elems1.next(), elems2.next())

without having to allocate and transform intermediary structures.
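The excerpt above elides the type parameters and the builder machinery. A simplified, self-contained version of the same two-iterator loop might look like the following. It is a sketch rather than the library code: it always builds a Vector instead of going through CanBuildFrom, and ZippedMapSketch.zippedMap is a hypothetical name.

```scala
object ZippedMapSketch {
  // Walks both collections in lockstep and applies f to each pair of
  // elements, stopping as soon as either side runs out.
  def zippedMap[A, B, C](as: Iterable[A], bs: Iterable[B])(f: (A, B) => C): Vector[C] = {
    val builder = Vector.newBuilder[C]
    val elems1 = as.iterator
    val elems2 = bs.iterator
    while (elems1.hasNext && elems2.hasNext) {
      // f takes two arguments, so no tuple is created here
      builder += f(elems1.next(), elems2.next())
    }
    builder.result()
  }
}
```

For example, ZippedMapSketch.zippedMap(List(1, 2, 3), List(10, 20))(_ + _) yields Vector(11, 22), since the loop stops at the shorter input.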

Applying Travis's benchmarking method, here is a comparison between the new lazyZip and the deprecated zipped, where

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.Throughput))
class ZippedBench {
  import scala.collection.mutable._
  val as = ArraySeq.fill(10000)(math.random)
  val bs = ArraySeq.fill(10000)(math.random)

  def lazyZip(as: ArraySeq[Double], bs: ArraySeq[Double]): ArraySeq[Double] =
    as.lazyZip(bs).map{ case (a, b) => a + b }

  def zipped(as: ArraySeq[Double], bs: ArraySeq[Double]): ArraySeq[Double] =
    (as, bs).zipped.map { case (a, b) => a + b }

  def lazyZipJavaArray(as: Array[Double], bs: Array[Double]): Array[Double] =
    as.lazyZip(bs).map{ case (a, b) => a + b }

  @Benchmark def withZipped: ArraySeq[Double] = zipped(as, bs)
  @Benchmark def withLazyZip: ArraySeq[Double] = lazyZip(as, bs)
  @Benchmark def withLazyZipJavaArray: ArraySeq[Double] = lazyZipJavaArray(as.toArray, bs.toArray)
}

gives

[info] Benchmark                          Mode  Cnt      Score      Error  Units
[info] ZippedBench.withZipped            thrpt   20  20197.344 ± 1282.414  ops/s
[info] ZippedBench.withLazyZip           thrpt   20  25468.458 ± 2720.860  ops/s
[info] ZippedBench.withLazyZipJavaArray  thrpt   20   5215.621 ±  233.270  ops/s

lazyZip seems to perform a bit better than zipped on ArraySeq. Interestingly, performance degrades significantly when using lazyZip on Array.

— Mario Galic
source

lazyZip is available in Scala 2.13.1. Currently I am using Scala 2.11.8.

— user12140540

5

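You should always be cautious with performance measurements because of JIT compilation, but a likely reason is that zipped is lazy and extracts elements from the original Array values during the map call, whereas zip creates a new Array object and then calls map on that new object.

— Tim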
source