How to write unit tests in Spark 2.0+?

Question 1

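I've been trying to find a reasonable way to test SparkSession with the JUnit testing framework. While there seem to be good examples for SparkContext, I couldn't figure out how to get a corresponding example working for SparkSession, even though it is used in several places internally in spark-testing-base. I'd be happy to try a solution that doesn't use spark-testing-base as well, if it isn't really the right way to go here.

Simple test case (complete MWE project with build.sbt):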

import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.junit.Test
import org.scalatest.FunSuite

import org.apache.spark.sql.SparkSession


class SessionTest extends FunSuite with DataFrameSuiteBase {

  implicit val sparkImpl: SparkSession = spark

  @Test
  def simpleLookupTest {

    val homeDir = System.getProperty("user.home")
    val training = spark.read.format("libsvm")
      .load(s"$homeDir\\Documents\\GitHub\\sample_linear_regression_data.txt")
    println("completed simple lookup test")
  }

}

Running this with JUnit results in an NPE at the load line:

java.lang.NullPointerException
    at SessionTest.simpleLookupTest(SessionTest.scala:16)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
    at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
    at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:237)
    at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)

Note that it shouldn't matter whether the file being loaded exists; with a properly configured SparkSession, a more sensible error would be thrown.

Question 2

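Thank you for putting this outstanding question out there. For some reason, when it comes to Spark, everyone gets so caught up in the analytics that they forget about the great software engineering practices that emerged over the last 15 years or so. This is why we make it a point to discuss testing and continuous integration (among other things like DevOps) in our course.

A Quick Aside on Terminology

A true unit test means you have complete control over every component in the test. There can be no interaction with databases, REST calls, file systems, or even the system clock; everything has to be "doubled" (e.g. mocked, stubbed, etc.), as Gerard Meszaros puts it in xUnit Test Patterns. I know this seems like semantics, but it really matters. Failing to understand this is one major reason why you see intermittent test failures in continuous integration.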
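To make the point about doubling concrete, here is a minimal sketch (not part of the original answer) of stubbing out the system clock. The Greeter class and its greeting method are hypothetical names invented for illustration; java.time.Clock.fixed is the standard JDK way to get a clock that always reports the same instant.

```scala
import java.time.{Clock, Instant, ZoneOffset}

// Hypothetical component under test: it never reads the system clock directly,
// it only asks the Clock it was given.
class Greeter(clock: Clock) {
  def greeting(): String = {
    val hour = clock.instant().atZone(clock.getZone).getHour
    if (hour < 12) "Good morning" else "Good afternoon"
  }
}

object ClockDoubleExample {
  // Test doubles: clocks frozen at known instants (UTC).
  val morning: Clock = Clock.fixed(Instant.parse("2020-01-01T08:00:00Z"), ZoneOffset.UTC)
  val afternoon: Clock = Clock.fixed(Instant.parse("2020-01-01T15:00:00Z"), ZoneOffset.UTC)

  def main(args: Array[String]): Unit = {
    assert(new Greeter(morning).greeting() == "Good morning")
    assert(new Greeter(afternoon).greeting() == "Good afternoon")
    println("clock double checks passed")
  }
}
```

Because the clock is injected, the test gives the same answer at any time of day, which is exactly the kind of control a true unit test needs.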

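We Can Still Unit Test

So given this understanding, unit testing an RDD is impossible. However, there is still a place for unit testing when developing analytics.

Consider a simple operation: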

rdd.map(foo).map(bar)

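Here foo and bar are simple functions. They can be unit tested in the normal way, and they should be, with as many corner cases as you can muster. After all, why should they care where their inputs come from, whether that is a test fixture or an RDD?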
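As a minimal sketch of that idea, assume foo parses a line into an integer and bar doubles it. Both bodies are made up here for illustration, since the original leaves them unspecified. They can be checked with plain assertions, with an ordinary Scala collection standing in for the RDD, so no Spark is involved:

```scala
object Transformations {
  // Hypothetical stand-ins for the foo and bar of rdd.map(foo).map(bar).
  def foo(line: String): Int = line.trim.toInt
  def bar(n: Int): Int = n * 2
}

object TransformationsCheck {
  import Transformations._

  def main(args: Array[String]): Unit = {
    // Corner cases for each function in isolation.
    assert(foo(" 42 ") == 42)
    assert(foo("-7") == -7)
    assert(bar(0) == 0)
    assert(bar(-3) == -6)

    // The same composition as the RDD pipeline, run on a local Seq.
    val result = Seq("1", "2", "3").map(foo).map(bar)
    assert(result == Seq(2, 4, 6))
    println("foo/bar checks passed")
  }
}
```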

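Don't Forget the Spark Shell

This isn't testing per se, but in these early stages you should also be experimenting in the Spark shell to figure out your transformations and especially the consequences of your approach. For example, you can examine physical and logical query plans, partitioning strategy and preservation, and the state of your data with many different functions such as toDebugString, explain, glom, show, printSchema, and so on. I will let you explore those.

You can also set your master to local[2] in the Spark shell and in your tests to identify any problems that may only arise once you start to distribute work.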

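Integration Testing with Spark

Now for the fun stuff.

In order to integration test Spark after you feel confident in the quality of your helper functions and RDD/DataFrame transformation logic, it is critical to do a few things (regardless of build tool and test framework):

Increase JVM memory.
Enable forking but disable parallel execution.
Use your test framework to accumulate your Spark integration tests into suites, and initialize the SparkContext before all tests and stop it after all tests.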
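
With sbt, the first two points might be expressed roughly as follows. This is a sketch assuming sbt 1.x slash syntax; the heap sizes are arbitrary placeholders to tune for your project, not values from the original answer.

```scala
// build.sbt (fragment)

// Run tests in a forked JVM; javaOptions below only apply when forking.
Test / fork := true

// Run the test suites of this project one at a time.
Test / parallelExecution := false

// Placeholder heap sizes; adjust to what your tests actually need.
Test / javaOptions ++= Seq("-Xms512M", "-Xmx2048M")
```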

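With ScalaTest, you can mix in BeforeAndAfterAll (which I generally prefer) or BeforeAndAfterEach, as @ShankarKoirala does, to initialize and tear down Spark artifacts. I know this is a reasonable place to make an exception, but I really don't like those mutable vars you have to use.

The Loan Pattern

Another approach is to use the Loan Pattern.

For example (using ScalaTest):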

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{Matchers, WordSpec}

class MySpec extends WordSpec with Matchers with SparkContextSetup {
  "My analytics" should {
    "calculate the right thing" in withSparkContext { (sparkContext) =>
      val data = Seq(...)
      val rdd = sparkContext.parallelize(data)
      val total = rdd.map(...).filter(...).map(...).reduce(_ + _)

      total shouldBe 1000
    }
  }
}

trait SparkContextSetup {
  // Lends a fresh local SparkContext to the test body and always stops it afterwards.
  def withSparkContext(testMethod: (SparkContext) => Any): Unit = {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("Spark test")
    val sparkContext = new SparkContext(conf)
    try {
      testMethod(sparkContext)
    }
    finally sparkContext.stop()
  }
}

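As you can see, the Loan Pattern makes use of higher-order functions to "loan" the SparkContext to the test and then dispose of it after the test is done.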
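The same shape can be written generically. The following is a sketch with no Spark dependency, so it runs anywhere; Loan.using and FakeSession are hypothetical names introduced only for illustration. It also shows the property that matters most in tests: cleanup runs even when the body throws.

```scala
object Loan {
  // Creates a resource, lends it to `body`, and always runs `cleanup` afterwards,
  // even if `body` throws.
  def using[R, A](create: => R)(cleanup: R => Unit)(body: R => A): A = {
    val resource = create
    try body(resource)
    finally cleanup(resource)
  }
}

// Stand-in for a SparkContext or SparkSession that records whether it was stopped.
final class FakeSession {
  var stopped: Boolean = false
  def stop(): Unit = { stopped = true }
}

object LoanExample {
  def main(args: Array[String]): Unit = {
    // Normal completion: the result is returned and the resource is stopped.
    val session = new FakeSession
    val total = Loan.using(session)(_.stop())(_ => Seq(1, 2, 3).sum)
    assert(total == 6)
    assert(session.stopped)

    // Failure inside the body: the exception propagates, cleanup still runs.
    val failing = new FakeSession
    val outcome = scala.util.Try {
      Loan.using(failing)(_.stop())(_ => throw new IllegalStateException("boom"))
    }
    assert(outcome.isFailure)
    assert(failing.stopped)
    println("loan pattern checks passed")
  }
}
```

For a SparkSession-based test, one would presumably pass SparkSession.builder().master("local[2]").appName("test").getOrCreate() as the create argument and _.stop() as the cleanup.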

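Pain-Oriented Programming (Thanks, Nathan)

It is totally a matter of preference, but I prefer to use the Loan Pattern and wire things up myself for as long as I can before bringing in another framework. Aside from just trying to stay lightweight, frameworks sometimes add a lot of "magic" that makes debugging test failures hard to reason about. So I take a Pain-Oriented Programming approach, where I avoid adding a new framework until the pain of not having it is too much to bear. But again, this is up to you.

The best choice for that alternate framework is of course spark-testing-base, as @ShankarKoirala mentioned. In that case, the test above would look like this: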

import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.{Matchers, WordSpec}

class MySpec extends WordSpec with Matchers with SharedSparkContext {
  "My analytics" should {
    "calculate the right thing" in {
      val data = Seq(...)
      val rdd = sc.parallelize(data)
      val total = rdd.map(...).filter(...).map(...).reduce(_ + _)

      total shouldBe 1000
    }
  }
}

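Note how I didn't have to do anything to deal with the SparkContext. SharedSparkContext gave me all that, with sc as the SparkContext, for free. Personally though, I wouldn't bring in this dependency for just this purpose, since the Loan Pattern does exactly what I need. Also, with so much unpredictability in distributed systems, it can be a real pain to have to trace through the magic in the source code of a third-party library when things go wrong in continuous integration.

Now, where spark-testing-base really shines is with the Hadoop-based helpers like HDFSClusterLike and YARNClusterLike. Mixing in those traits can really save you a lot of setup pain. Another place where it shines is with the Scalacheck-like properties and generators, assuming of course that you understand how property-based testing works and why it is useful. But again, I would personally hold off on using it until my analytics and my tests reach that level of sophistication.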
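For readers who haven't met property-based testing, the idea is to assert a property over many generated inputs rather than a handful of hand-picked ones. The following is a hand-rolled sketch using only scala.util.Random; it is not ScalaCheck and not the spark-testing-base generators, and wordCount is a hypothetical local stand-in for an analytics step.

```scala
import scala.util.Random

object PropertySketch {
  // Hypothetical function under test: counts occurrences of each word.
  def wordCount(words: Seq[String]): Map[String, Int] =
    words.groupBy(identity).map { case (w, ws) => w -> ws.size }

  // Generates a random list of 0 to 49 words drawn from a small vocabulary.
  def genWords(rng: Random): Seq[String] = {
    val vocab = Vector("a", "b", "c", "d")
    Seq.fill(rng.nextInt(50))(vocab(rng.nextInt(vocab.size)))
  }

  // Property: the counts always add up to the number of input words.
  def countsSumToInputSize(words: Seq[String]): Boolean =
    wordCount(words).values.sum == words.size

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L) // fixed seed keeps the run reproducible
    val inputs = Seq.fill(100)(genWords(rng))
    assert(inputs.forall(countsSumToInputSize))
    println("property held for 100 generated inputs")
  }
}
```

A real property-testing library adds things this sketch lacks, such as shrinking a failing input down to a minimal counterexample.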

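"Only a Sith deals in absolutes." -Obi-Wan Kenobi

Of course, you don't have to choose one or the other. Perhaps you could use the Loan Pattern approach for most of your tests and spark-testing-base only for a few, more rigorous tests. The choice isn't binary; you can do both.

Integration Testing with Spark Streaming

Finally, I would just like to present a snippet of what a Spark Streaming integration test setup with in-memory values might look like without spark-testing-base: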

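import scala.collection.mutable
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream

val sparkContext: SparkContext = ...
val data: Seq[(String, String)] = Seq(("a", "1"), ("b", "2"), ("c", "3"))
val rdd: RDD[(String, String)] = sparkContext.parallelize(data)
val strings: mutable.Queue[RDD[(String, String)]] = mutable.Queue.empty[RDD[(String, String)]]
val streamingContext = new StreamingContext(sparkContext, Seconds(1))
val dStream: InputDStream[(String, String)] = streamingContext.queueStream(strings)
strings += rdd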

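This is simpler than it looks. It really just turns a sequence of data into a queue to feed to the DStream. Most of it is really just boilerplate setup that works with the Spark APIs. Regardless, you can compare this with StreamingSuiteBase as found in spark-testing-base to decide which you prefer.

This might be my longest post ever, so I will leave it here. I hope others chime in with other ideas to help improve the quality of our analytics with the same agile software engineering practices that have improved all other application development.

And with apologies for the shameless plug, you can check out our course Analytics with Apache Spark, where we address a lot of these ideas and more. We hope to have an online version soon.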

Question 3

You can write a simple test with FunSuite and BeforeAndAfterEach like below:

import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FunSuite}

class Tests extends FunSuite with BeforeAndAfterEach {

  var sparkSession: SparkSession = _

  // Create a local session before every test.
  override def beforeEach(): Unit = {
    sparkSession = SparkSession.builder().appName("udf testings")
      .master("local")
      .config("", "")
      .getOrCreate()
  }

  test("your test name here") {
    // your unit test assert here like below
    assert("True".toLowerCase == "true")
  }

  // Stop the session after every test.
  override def afterEach(): Unit = {
    sparkSession.stop()
  }
}

You don't need to create a function inside the test; you can simply write it as

test ("test name") {//implementation and assert}

Holden Karau has written a really nice testing library, spark-testing-base.

Check it out; below is a simple example:

class TestSharedSparkContext extends FunSuite with SharedSparkContext {

  val expectedResult = List(("a", 3),("b", 2),("c", 4))

  test("Word counts should be equal to expected") {
    verifyWordCount(Seq("c a a b a c b c c"))
  }

  def verifyWordCount(seq: Seq[String]): Unit = {
    assertResult(expectedResult)(new WordCount().transform(sc.makeRDD(seq)).collect().toList)
  }
}

Hope this helps!

Question 4

Since Spark 1.6 you can use SharedSparkContext or SharedSQLContext, which Spark uses for its own unit tests:

class YourAppTest extends SharedSQLContext {

  var app: YourApp = _

  protected override def beforeAll(): Unit = {
    super.beforeAll()

    app = new YourApp
  }

  protected override def afterAll(): Unit = {
    super.afterAll()
  }

  test("Your test") {
    val df = sqlContext.read.json("examples/src/main/resources/people.json")

    app.run(df)
  }
}

Since Spark 2.3, SharedSparkSession is available:

class YourAppTest extends SharedSparkSession {

  var app: YourApp = _

  protected override def beforeAll(): Unit = {
    super.beforeAll()

    app = new YourApp
  }

  protected override def afterAll(): Unit = {
    super.afterAll()
  }

  test("Your test") {
    val df = spark.read.json("examples/src/main/resources/people.json")

    app.run(df)
  }
}

Update:

Maven dependencies:

<dependency>
  <groupId>org.scalactic</groupId>
  <artifactId>scalactic</artifactId>
  <version>SCALATEST_VERSION</version>
</dependency>
<dependency>
  <groupId>org.scalatest</groupId>
  <artifactId>scalatest</artifactId>
  <version>SCALATEST_VERSION</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core</artifactId>
  <version>SPARK_VERSION</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql</artifactId>
  <version>SPARK_VERSION</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>

SBT dependencies:

"org.scalactic" %% "scalactic" % SCALATEST_VERSION
"org.scalatest" %% "scalatest" % SCALATEST_VERSION % "test"
"org.apache.spark" %% "spark-core" % SPARK_VERSION % Test classifier "tests"
"org.apache.spark" %% "spark-sql" % SPARK_VERSION % Test classifier "tests"

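In addition, you can check the test sources of Spark, where there is a huge set of various test suites.

Update 2:

Apache Spark Unit Testing Part 1 - Core Components

Apache Spark Unit Testing Part 2 - Spark SQL

Apache Spark Unit Testing Part 3 - Streaming

Apache Spark Integration Testing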

Question 5

I like to create a SparkSessionTestWrapper trait that can be mixed in to test classes. Shankar's approach works, but it's prohibitively slow for test suites with multiple files.

import org.apache.spark.sql.SparkSession

trait SparkSessionTestWrapper {

  lazy val spark: SparkSession = {
    SparkSession.builder().master("local").appName("spark session").getOrCreate()
  }

}

The trait can be used as follows:

class DatasetSpec extends FunSpec with SparkSessionTestWrapper {

  import spark.implicits._

  describe("#count") {

    it("returns a count of all the rows in a DataFrame") {

      val sourceDF = Seq(
        ("jets"),
        ("barcelona")
      ).toDF("team")

      assert(sourceDF.count === 2)

    }

  }

}

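Check out the spark-spec project for a real-life example that uses the SparkSessionTestWrapper approach.

Update

The spark-testing-base library automatically adds the SparkSession when certain traits are mixed in to the test class (e.g. when DataFrameSuiteBase is mixed in, you have access to the SparkSession via the spark variable).

I created a separate testing library called spark-fast-tests to give users full control of the SparkSession when running their tests. I don't think a test helper library should set the SparkSession. Users should be able to start and stop their SparkSession as they see fit (I like to create one SparkSession and use it throughout the test suite run).

Here's an example of the spark-fast-tests assertSmallDatasetEquality method in action: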

import com.github.mrpowers.spark.fast.tests.DatasetComparer
import org.apache.spark.sql.functions.col
import org.scalatest.FunSpec

class DatasetSpec extends FunSpec with SparkSessionTestWrapper with DatasetComparer {

  import spark.implicits._

  it("aliases a DataFrame") {

    val sourceDF = Seq(
      ("jose"),
      ("li"),
      ("luisa")
    ).toDF("name")

    val actualDF = sourceDF.select(col("name").alias("student"))

    val expectedDF = Seq(
      ("jose"),
      ("li"),
      ("luisa")
    ).toDF("student")

    assertSmallDatasetEquality(actualDF, expectedDF)

  }

}

Question 6

I was able to solve the problem with the code below.

The spark-hive dependency was added to the project pom:

class DataFrameTest extends FunSuite with DataFrameSuiteBase {
  test("test dataframe") {
    val sparkSession = spark
    import sparkSession.implicits._
    val df = sparkSession.read.format("csv").load("path/to/csv")
    // rest of the operations.
  }
}

Question 7

Another way to do unit testing using JUnit:

import org.apache.spark.sql.SparkSession
import org.junit.Assert._
import org.junit.{After, Before, _}

class SessionSparkTest {
  var spark: SparkSession = _

  @Before
  def beforeFunction(): Unit = {
    //spark = SessionSpark.getSparkSession()
    spark = SparkSession.builder().appName("App Name").master("local").getOrCreate()
    System.out.println("Before Function")
  }

  @After
  def afterFunction(): Unit = {
    spark.stop()
    System.out.println("After Function")
  }

  @Test
  def testRddCount() = {
    val rdd = spark.sparkContext.parallelize(List(1, 2, 3))
    val count = rdd.count()
    assertTrue(3 == count)
  }

  @Test
  def testDfNotEmpty() = {
    val sqlContext = spark.sqlContext
    import sqlContext.implicits._
    val numDf = spark.sparkContext.parallelize(List(1, 2, 3)).toDF("nums")
    assertFalse(numDf.head(1).isEmpty)
  }

  @Test
  def testDfEmpty() = {
    val sqlContext = spark.sqlContext
    import sqlContext.implicits._
    val emptyDf = spark.sqlContext.createDataset(spark.sparkContext.emptyRDD[Num])
    assertTrue(emptyDf.head(1).isEmpty)
  }
}

case class Num(id: Int)