What is the difference between map and flatMap, and what is a good use case for each?

249

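Can someone explain to me the difference between map and flatMap, and what is a good use case for each?

What does "flatten the results" mean? What is it good for?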

apache-spark

4

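Since you added the Spark tag, I'll assume that you're asking about RDD.map and RDD.flatMap in Apache Spark. In general, Spark's RDD operations are modeled after their corresponding Scala collection operations. The answers in stackoverflow.com/q/1059776/590203, which discuss the difference between map and flatMap in Scala, may be helpful to you.

— Josh Rosen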

1

Most of the examples here seem to assume that flatMap operates only on collections, which is not the case.

— Boon

stackoverflow.com/questions/26684562/…–

— gliptak

195

Here is an example of the difference, as a spark-shell session:

First, some data, consisting of two lines of text:

val rdd = sc.parallelize(Seq("Roses are red", "Violets are blue"))  // lines

rdd.collect

    res0: Array[String] = Array("Roses are red", "Violets are blue")

Now, map transforms an RDD of length N into another RDD of length N.

For example, it maps from two lines into two line lengths:

rdd.map(_.length).collect

    res1: Array[Int] = Array(13, 16)

But flatMap (loosely speaking) transforms an RDD of length N into a collection of N collections, then flattens these into a single RDD of results.

rdd.flatMap(_.split(" ")).collect

    res2: Array[String] = Array("Roses", "are", "red", "Violets", "are", "blue")

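We have multiple words per line, and multiple lines, but we end up with a single output array of words.

Just to illustrate that, flatMapping from a collection of lines to a collection of words looks like: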

["aa bb cc", "", "dd"] => [["aa","bb","cc"],[],["dd"]] => ["aa","bb","cc","dd"]

The input and output RDDs will therefore typically be of different sizes for flatMap.
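
For readers without a Spark shell at hand, the two steps in the informal notation above can be sketched in plain Python. This is not Spark code: `flat_map` is a hypothetical helper written only for illustration, and it merely mimics on a local list what RDD.flatMap does on a distributed dataset.

```python
from itertools import chain

def flat_map(func, items):
    """Apply func to every item, then concatenate the resulting iterables."""
    return list(chain.from_iterable(func(item) for item in items))

lines = ["aa bb cc", "", "dd"]

# Step 1: map produces exactly one result (here, a list of words) per line.
# str.split() without an argument is used so that an empty line yields no words.
nested = [line.split() for line in lines]

# Step 2: flattening removes the per-line grouping.
words = flat_map(str.split, lines)

print(nested)  # [['aa', 'bb', 'cc'], [], ['dd']]
print(words)   # ['aa', 'bb', 'cc', 'dd']
```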

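If we had tried to use map with our split function, we'd have ended up with nested structures (an RDD of arrays of words, with type RDD[Array[String]]) because we have to have exactly one result per input: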

rdd.map(_.split(" ")).collect

    res3: Array[Array[String]] = Array(
                                     Array(Roses, are, red), 
                                     Array(Violets, are, blue)
                                 )

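Finally, one useful special case is mapping with a function which might not return an answer, and so returns an Option. We can use flatMap to filter out the elements that return None and extract the values from those that return a Some: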

val rdd = sc.parallelize(Seq(1,2,3,4))

def myfn(x: Int): Option[Int] = if (x <= 2) Some(x * 10) else None

rdd.flatMap(myfn).collect

    res3: Array[Int] = Array(10,20)

(noting here that an Option behaves rather like a list that has either one element or zero elements)
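
Python has no built-in Option type, but the same idea can be sketched by returning an empty list for "no answer" and a one-element list otherwise. This is a plain-Python illustration on a local list; `flat_map` is a hypothetical stand-in for RDD.flatMap, and `my_fn` mirrors `myfn` above.

```python
def my_fn(x):
    """Return [x * 10] when x <= 2, and [] (no result) otherwise."""
    return [x * 10] if x <= 2 else []

def flat_map(func, items):
    """Local stand-in for RDD.flatMap: concatenate the lists func returns."""
    return [result for item in items for result in func(item)]

numbers = [1, 2, 3, 4]

wrapped = [my_fn(x) for x in numbers]  # map-like: one (possibly empty) list per input
kept = flat_map(my_fn, numbers)        # flatMap-like: empty lists vanish

print(wrapped)  # [[10], [20], [], []]
print(kept)     # [10, 20]
```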

— DNA

1

Would split within map give ["a b c", "", "d"] => [["a","b","c"],[],["d"]]?

— user2635088

1

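Yes (but note that my informal notation was just meant to indicate a collection of some kind; in fact, mapping split over a list of Strings would produce a list of Arrays)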

— DNA

2

Thanks for writing this up; this is the best explanation I have read of the difference between the two.

— Rajiv

97

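Generally we use the word count example in Hadoop. I will take the same use case, apply both map and flatMap, and we will see how they differ in processing the data.

Below is the sample data file.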

hadoop is fast
hive is sql on hdfs
spark is superfast
spark is awesome

The above file will be parsed using map and flatMap.

Using `map`

>>> wc = data.map(lambda line:line.split(" "));
>>> wc.collect()
[[u'hadoop', u'is', u'fast'], [u'hive', u'is', u'sql', u'on', u'hdfs'], [u'spark', u'is', u'superfast'], [u'spark', u'is', u'awesome']]

The input has 4 lines and the output size is 4 as well, i.e. N elements ==> N elements.

Using `flatMap`

>>> fm = data.flatMap(lambda line:line.split(" "));
>>> fm.collect()
[u'hadoop', u'is', u'fast', u'hive', u'is', u'sql', u'on', u'hdfs', u'spark', u'is', u'superfast', u'spark', u'is', u'awesome']

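The output is different from that of map.

Let's assign 1 as the value for each key to get the word count.

fm: the RDD created by using flatMap
wc: the RDD created by using map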

>>> fm.map(lambda word : (word,1)).collect()
[(u'hadoop', 1), (u'is', 1), (u'fast', 1), (u'hive', 1), (u'is', 1), (u'sql', 1), (u'on', 1), (u'hdfs', 1), (u'spark', 1), (u'is', 1), (u'superfast', 1), (u'spark', 1), (u'is', 1), (u'awesome', 1)]

Whereas flatMap on the RDD wc gives the undesired output below:

>>> wc.flatMap(lambda word : (word,1)).collect()
[[u'hadoop', u'is', u'fast'], 1, [u'hive', u'is', u'sql', u'on', u'hdfs'], 1, [u'spark', u'is', u'superfast'], 1, [u'spark', u'is', u'awesome'], 1]

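You can't get the word count if map is used instead of flatMap.

By definition, the difference between map and flatMap is:

map: It returns a new RDD by applying the given function to each element of the RDD. The function in map returns exactly one item.

flatMap: Similar to map, it returns a new RDD by applying a function to each element of the RDD, but the output is flattened.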
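
To round off the word count, here is a minimal sketch of the whole pipeline on the sample lines above, in plain Python. It runs on a local list rather than an RDD, and `collections.Counter` is my stand-in for the per-key summing step that the session above does not show (in Spark that step would typically be a reduceByKey), so treat it as an illustration only.

```python
from collections import Counter
from itertools import chain

lines = [
    "hadoop is fast",
    "hive is sql on hdfs",
    "spark is superfast",
    "spark is awesome",
]

# flatMap-like step: one flat list of words instead of one list per line.
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map-like step: pair every word with the value 1.
pairs = [(word, 1) for word in words]

# Summing step (stand-in for a per-key reduction): add up the 1s for each word.
counts = Counter()
for word, one in pairs:
    counts[word] += one

print(len(words))       # 14
print(counts["is"])     # 4
print(counts["spark"])  # 2
```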

— yoga

14

I feel this answer is better than the accepted answer.

— Krishna

15

Why would you create blurry screenshots when you could just copy and paste the output text?

— nbubis

So flatMap() is map() + "flatten". I know it doesn't make much sense, but is there any kind of "flatten" function that we can use after map()?

— burakongun'9

2

Your code has a misleading typo. The result of .map(lambda line:line.split(" ")) is not an array of strings. You should change data.collect() to wc.collect() and you will see an array of arrays.

— swdev

1

Yes, but the result of the command is still wrong. Did you run wc.collect()?

— swdev

18

If you are asking about the difference between RDD.map and RDD.flatMap in Spark: map transforms an RDD of size N into another one of size N. For example:

myRDD.map(x => x*2)

for example, if myRDD is composed of Doubles.

While flatMap can transform the RDD into another one of a different size, for example:

myRDD.flatMap(x => Seq(2*x, 3*x))

which will return an RDD of size 2*N, or

myRDD.flatMap(x => if (x < 10) Seq(2*x, 3*x) else Seq(x))
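
A quick way to convince yourself of the size claims is to mimic the three calls on a local Python list. This is only a sketch: the input values are made up, and list comprehensions stand in for the RDD operations.

```python
values = [1.0, 5.0, 20.0]  # hypothetical contents of myRDD, so N = 3

# map-like: exactly one output per input, so the size stays N.
doubled = [x * 2 for x in values]

# flatMap-like with two outputs per input: the size becomes 2 * N.
pairs = [y for x in values for y in (2 * x, 3 * x)]

# flatMap-like with a condition: the size ends up between N and 2 * N.
mixed = [y for x in values for y in ((2 * x, 3 * x) if x < 10 else (x,))]

print(len(doubled), len(pairs), len(mixed))  # 3 6 5
print(mixed)  # [2.0, 3.0, 10.0, 15.0, 20.0]
```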

— Oussama

17

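It boils down to your initial question: what do you mean by flattening?

When you use flatMap, a "multi-dimensional" collection becomes a "one-dimensional" collection.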

val array1d = Array ("1,2,3", "4,5,6", "7,8,9")  
//array1d is an array of strings

val array2d = array1d.map(x => x.split(","))
//array2d will be : Array( Array(1,2,3), Array(4,5,6), Array(7,8,9) )

val flatArray = array1d.flatMap(x => x.split(","))
//flatArray will be : Array (1,2,3,4,5,6,7,8,9)

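You want to use flatMap when:

your map function results in creating multi-layered structures,
but all you want is a simple, flat, one-dimensional structure, with all the internal groupings removed.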
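
One caveat worth sketching: flatMap concatenates the collections returned by your function, which removes one level of grouping. If the function itself returns nested collections, the inner level survives and needs a second pass. The plain-Python snippet below illustrates this on a local list; `flat_map` is a hypothetical stand-in for the RDD method, and the semicolon-separated sample strings are made up.

```python
def flat_map(func, items):
    """Local stand-in for flatMap: concatenate the iterables func returns."""
    return [result for item in items for result in func(item)]

def groups(row):
    """Split a row into ';'-separated groups, each a list of ','-separated values."""
    return [group.split(",") for group in row.split(";")]

rows = ["1,2;3", "4;5,6"]

# One flatMap removes the per-row grouping, but the per-group lists remain.
once = flat_map(groups, rows)

# A second flattening pass removes the remaining level.
twice = flat_map(lambda group: group, once)

print(once)   # [['1', '2'], ['3'], ['4'], ['5', '6']]
print(twice)  # ['1', '2', '3', '4', '5', '6']
```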

— ramu

15

Using test.md as an example:

➜  spark-1.6.1 cat test.md
This is the first line;
This is the second line;
This is the last line.

scala> val textFile = sc.textFile("test.md")
scala> textFile.map(line => line.split(" ")).count()
res2: Long = 3

scala> textFile.flatMap(line => line.split(" ")).count()
res3: Long = 15

scala> textFile.map(line => line.split(" ")).collect()
res0: Array[Array[String]] = Array(Array(This, is, the, first, line;), Array(This, is, the, second, line;), Array(This, is, the, last, line.))

scala> textFile.flatMap(line => line.split(" ")).collect()
res1: Array[String] = Array(This, is, the, first, line;, This, is, the, second, line;, This, is, the, last, line.)

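If you use the map method, count() gives you the number of lines in test.md; with the flatMap method, it gives you the number of words.

The map method is similar to flatMap in that both return a new RDD. map is typically used to transform each element into exactly one new element, while flatMap is often used to split lines into words.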

— 邦邦

9

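map returns an RDD with the same number of elements as its input, while flatMap may not.

An example use case for flatMap: filtering out missing or incorrect data.

An example use case for map: any of the many cases where the number of input and output elements is the same.

number.csv

map.py adds up all the numbers in add.csv.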

from operator import *

def f(row):
  try:
    return float(row)
  except Exception:
    return 0

rdd = sc.textFile('a.csv').map(f)

print(rdd.count())      # 7
print(rdd.reduce(add))  # 15.0

flatMap.py uses flatMap to filter out the missing data before the addition. Fewer numbers are added than in the previous version.

from operator import *

def f(row):
  try:
    return [float(row)]
  except Exception:
    return []

rdd = sc.textFile('a.csv').flatMap(f)

print(rdd.count())      # 5
print(rdd.reduce(add))  # 15.0
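
Since the contents of the CSV file are not shown, here is a self-contained sketch with made-up rows that reproduces the same counts. It uses a plain Python list instead of an RDD, so the built-in `map` and the flattening comprehension only imitate what Spark does; the row values are assumptions chosen to give 7 rows, 5 of them valid numbers.

```python
from functools import reduce
from operator import add

rows = ["1", "2", "", "3", "oops", "4", "5"]  # hypothetical file contents

def to_float_or_zero(row):
    """map-style function: always returns exactly one value."""
    try:
        return float(row)
    except ValueError:
        return 0.0

def to_float_or_nothing(row):
    """flatMap-style function: returns a list with zero or one value."""
    try:
        return [float(row)]
    except ValueError:
        return []

mapped = list(map(to_float_or_zero, rows))
flat = [value for row in rows for value in to_float_or_nothing(row)]

print(len(mapped), reduce(add, mapped))  # 7 15.0
print(len(flat), reduce(add, flat))      # 5 15.0
```

The sums agree only because the placeholder for bad rows is 0. Anything that depends on the count, such as an average, would differ between the two versions (15/7 versus 15/5 here).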

— wannik

8

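map and flatMap are similar in the sense that they take a line from the input RDD and apply a function to it. They differ in that the function in map returns only one element, while the function in flatMap can return a list of elements (0 or more) as an iterator.

Also, the output of flatMap is flattened. Although the function in flatMap returns a list of elements, flatMap returns an RDD that contains all the elements from those lists in a flat way (not as lists).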
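
The "as an iterator" part can be sketched as follows. As far as I know, the function passed to flatMap in PySpark may return any iterable, which includes a generator. The snippet imitates this on a local list with `itertools.chain`; `words_of` is a made-up helper name and no Spark is involved.

```python
from itertools import chain

def words_of(line):
    """Yield the words of a line one at a time (zero or more per line)."""
    for word in line.split():
        yield word

lines = ["to be", "", "or not"]

# map-like: one result object per input line (here, one generator each).
per_line = [words_of(line) for line in lines]

# flatMap-like: the individual results are chained into one flat sequence.
flat = list(chain.from_iterable(words_of(line) for line in lines))

print(len(per_line))  # 3
print(flat)           # ['to', 'be', 'or', 'not']
```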

— Bhasker

7

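All the examples are good... Here is a nice visual illustration... Source: DataFlair Spark training

Map: map is a transformation operation in Apache Spark. It applies to each element of an RDD and returns the result as a new RDD. In a map operation, the developer can define custom business logic, and the same logic is applied to all the elements of the RDD.

The Spark RDD map function takes one element as input, processes it according to custom code (specified by the developer), and returns one element at a time. map transforms an RDD of length N into another RDD of length N, so the input and output RDDs have the same number of records.

Example of map using Scala: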

val x = spark.sparkContext.parallelize(List("spark", "map", "example",  "sample", "example"), 3)
val y = x.map(x => (x, 1))
y.collect
// res0: Array[(String, Int)] = 
//    Array((spark,1), (map,1), (example,1), (sample,1), (example,1))

// RDD y can be rewritten with shorter syntax in Scala as
val y = x.map((_, 1))
y.collect
// res1: Array[(String, Int)] = 
//    Array((spark,1), (map,1), (example,1), (sample,1), (example,1))

// Another example: making a tuple of each string and its length
val y = x.map(x => (x, x.length))
y.collect
// res3: Array[(String, Int)] = 
//    Array((spark,5), (map,3), (example,7), (sample,6), (example,7))

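FlatMap:

flatMap is a transformation operation. It applies to each element of an RDD and returns the result as a new RDD. It is similar to map, but flatMap allows the function to return 0, 1 or more elements. In a flatMap operation, the developer can define custom business logic, and the same logic is applied to all the elements of the RDD.

What does "flatten the results" mean?

The flatMap function takes one element as input, processes it according to custom code (specified by the developer), and returns 0 or more elements at a time. flatMap() transforms an RDD of length N into another RDD of length M.

Example of flatMap using Scala: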

val x = spark.sparkContext.parallelize(List("spark flatmap example",  "sample example"), 2)

// map operation will return Array of Arrays in following case : check type of res0
val y = x.map(x => x.split(" ")) // split(" ") returns an array of words
y.collect
// res0: Array[Array[String]] = 
//  Array(Array(spark, flatmap, example), Array(sample, example))

// flatMap operation will return Array of words in following case : Check type of res1
val y = x.flatMap(x => x.split(" "))
y.collect
//res1: Array[String] = 
//  Array(spark, flatmap, example, sample, example)

// RDD y can be rewritten with shorter syntax in Scala as
val y = x.flatMap(_.split(" "))
y.collect
//res2: Array[String] = 
//  Array(spark, flatmap, example, sample, example)

— Ram Ghadiyaram

5

The difference can be seen in the sample pyspark code below:

rdd = sc.parallelize([2, 3, 4])
rdd.flatMap(lambda x: range(1, x)).collect()
Output:
[1, 1, 2, 1, 2, 3]


rdd.map(lambda x: range(1, x)).collect()
Output:
[[1], [1, 2], [1, 2, 3]]
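
One thing to watch for when trying this on Python 3: `range(1, x)` is a lazy range object rather than a list, so I would expect the map version to show range objects unless they are converted explicitly. The plain-Python sketch below shows the difference; no Spark is involved, and the built-in `map` stands in for the RDD method.

```python
nums = [2, 3, 4]

# map-like, as written above: on Python 3 each result is a range object.
as_ranges = list(map(lambda x: range(1, x), nums))

# Converting each range to a list gives the nested-list output shown above.
as_lists = list(map(lambda x: list(range(1, x)), nums))

# flatMap-like: iterating over each range flattens it, so no conversion is needed.
flattened = [y for x in nums for y in range(1, x)]

print(as_ranges)  # [range(1, 2), range(1, 3), range(1, 4)]
print(as_lists)   # [[1], [1, 2], [1, 2, 3]]
print(flattened)  # [1, 1, 2, 1, 2, 3]
```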

— awadhesh pathak

3

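flatMap and map both transform a collection.

Difference:

map(func)
Returns a new distributed dataset formed by passing each element of the source through a function func.

flatMap(func)
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

Transformation function:
map: one element in -> one element out.
flatMap: one element in -> 0 or more elements out (a collection).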

— Ajit K'sagar

3

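RDD.map returns the elements as an array of arrays (one array per input element)

RDD.flatMap returns all the elements in a single flat array

Let's suppose we have the following text in the text.txt file: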

Spark is an expressive framework
This text is to understand map and faltMap functions of Spark RDD

Using map

val text=sc.textFile("text.txt").map(_.split(" ")).collect

Output:

text: Array[Array[String]] = Array(Array(Spark, is, an, expressive, framework), Array(This, text, is, to, understand, map, and, faltMap, functions, of, Spark, RDD))

Using flatMap

val text=sc.textFile("text.txt").flatMap(_.split(" ")).collect

Output:

text: Array[String] = Array(Spark, is, an, expressive, framework, This, text, is, to, understand, map, and, faltMap, functions, of, Spark, RDD)

— veera

2

For all those who want a PySpark-related example:

Example transformation: flatMap

>>> a="hello what are you doing"
>>> a.split()

['hello', 'what', 'are', 'you', 'doing']

>>> b=["hello what are you doing","this is rak"]
>>> b.split()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'split'

>>> rline=sc.parallelize(b)
>>> type(rline)

>>> def fwords(x):
...     return x.split()


>>> rword=rline.map(fwords)
>>> rword.collect()

[['hello', 'what', 'are', 'you', 'doing'], ['this', 'is', 'rak']]

>>> rwordflat=rline.flatMap(fwords)
>>> rwordflat.collect()

['hello', 'what', 'are', 'you', 'doing', 'this', 'is', 'rak']

Hope it helps :)

— Rakshith N Gowda

2

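map: It returns a new RDD by applying a function to each element of the RDD. The function in map can return only one item.

flatMap: Similar to map, it returns a new RDD by applying a function to each element of the RDD, but the output is flattened.

Also, the function in flatMap can return a list of elements (0 or more).

For example: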

sc.parallelize([3,4,5]).map(lambda x: range(1,x)).collect()

Output: [[1, 2], [1, 2, 3], [1, 2, 3, 4]]

sc.parallelize([3,4,5]).flatMap(lambda x: range(1,x)).collect()

Output (notice the o/p is flattened out into a single list): [1, 2, 1, 2, 3, 1, 2, 3, 4]

Source: https://www.linkedin.com/pulse/difference-between-map-flatmap-transformations-spark-pyspark-pandey/

— Pushkar Deshpande

0

map:

A higher-order method that takes a function as input and applies it to each element in the source RDD.

http://commandstech.com/difference-between-map-and-flatmap-in-spark-what-is-map-and-flatmap-with-examples/

flatMap:

A higher-order method and transformation operation that takes a function as input.

— Spandana r

-1

Difference in the output of map and flatMap:

1. flatMap

val a = sc.parallelize(1 to 10, 5)

a.flatMap(1 to _).collect()

Output:

 1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

2. map

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)

val b = a.map(_.length).collect()

Output:

3 6 6 3 8

— Ashutosh

-1

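map(func) returns a new distributed dataset formed by passing each element of the source through the declared function func, so map() yields a single item per input,

whereas

flatMap(func) is similar to map, but each input item can be mapped to 0 or more output items, so func should return a sequence rather than a single item.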

— Kondas Lamar Jnr
