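Creating an RDD (parallelize & makeRDD)
- Both methods create a new RDD from a local collection on the driver
- makeRDD(seq, numSlices) simply calls parallelize, so the two are equivalent; makeRDD also has a second overload that accepts preferred locations for each element
- parallelize(seq, numSlices) takes two arguments: seq is the local collection to distribute, and numSlices is the number of partitions, which is also the number of tasks that process the data (one task per partition). When numSlices is omitted, sc.defaultParallelism is used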
scala> sc.parallelize(List(1, 2, 3, 4, 5))
res2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:25
scala> sc.makeRDD(List(1, 2, 3, 4, 5))
res4: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at makeRDD at <console>:25
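To make the role of numSlices concrete, here is a minimal sketch in plain Scala (no Spark involved) of how a local sequence can be cut into contiguous slices. The object and method names are hypothetical, and the integer arithmetic is an assumption about how parallelized collections are split evenly, not a copy of Spark's code.

```scala
object SliceSketch {
  // Split `seq` into `numSlices` contiguous chunks using integer arithmetic:
  // chunk i covers the indices [i * n / numSlices, (i + 1) * n / numSlices).
  def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "numSlices must be positive")
    val n = seq.length.toLong
    (0 until numSlices).map { i =>
      val start = ((i * n) / numSlices).toInt
      val end = (((i + 1) * n) / numSlices).toInt
      seq.slice(start, end)
    }
  }
}
```

Under this assumption, five elements in two slices split as (1, 2) and (3, 4, 5), and 30 elements in five slices give six elements each.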
map & mapPartitions & mapPartitionsWithIndex
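- map applies a function to each element of the RDD; mapPartitions applies a function once per partition, receiving an iterator over that partition's elements
- Example: writing an RDD to MySQL. With 100 elements in 10 partitions, opening a Connection inside map opens 100 connections, whereas opening it inside mapPartitions opens only 10
- mapPartitionsWithIndex is like mapPartitions, but the function also receives the partition index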
scala> val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24
scala> rdd1.map(_ * 2).collect
res6: Array[Int] = Array(2, 4, 6, 8, 10)
scala> rdd1.mapPartitions(partition => partition.map(_ * 2)).collect
res8: Array[Int] = Array(2, 4, 6, 8, 10)
scala> rdd1.mapPartitionsWithIndex((index, partition) => {
| partition.map(x => s"partition $index element $x")
| }).collect
res1: Array[String] = Array(partition 0 element 1, partition 0 element 2, partition 1 element 3, partition 1 element 4, partition 1 element 5)
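The connection-count argument from the MySQL bullet can be checked with a small simulation. This is only a sketch in plain Scala: partitions are modelled as nested sequences, and FakeConnection is a hypothetical stand-in for a real JDBC connection.

```scala
object ConnectionCountSketch {
  // Hypothetical stand-in for an expensive resource such as a JDBC connection.
  final class FakeConnection {
    def write(x: Int): Unit = () // pretend to insert one row
    def close(): Unit = ()
  }

  // map-style: the function runs once per element, so one connection per element.
  def openedPerElement(partitions: Seq[Seq[Int]]): Int = {
    var opened = 0
    for (p <- partitions; x <- p) {
      val conn = new FakeConnection
      opened += 1
      conn.write(x)
      conn.close()
    }
    opened
  }

  // mapPartitions-style: the function runs once per partition, so the
  // connection is opened once and reused for every element in that partition.
  def openedPerPartition(partitions: Seq[Seq[Int]]): Int = {
    var opened = 0
    for (p <- partitions) {
      val conn = new FakeConnection
      opened += 1
      p.foreach(x => conn.write(x))
      conn.close()
    }
    opened
  }
}
```

With 100 elements grouped into 10 partitions, the first method opens 100 connections and the second opens 10.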
mapValues
scala> val rdd2 = sc.parallelize(List(("aa", 18), ("bb", 23), ("cc", 5)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2] at parallelize at <console>:24
scala> rdd2.mapValues(_ + 1).collect
res2: Array[(String, Int)] = Array((aa,19), (bb,24), (cc,6))
flatMap
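- flatMap = map + flatten; it behaves the same way on Scala collections and on Spark RDDs
- flatten is a Scala collection method (for example on Array), not a Spark operator. It concatenates the inner collections of a nested collection, turning a two-dimensional array into a flat one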
scala> val f = Array(Array(1, 2), Array(3, 4), Array(5, 6))
f: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4), Array(5, 6))
scala> val m = f.map(_.map(_ * 2))
m: Array[Array[Int]] = Array(Array(2, 4), Array(6, 8), Array(10, 12))
scala> m.flatten
res0: Array[Int] = Array(2, 4, 6, 8, 10, 12)
scala> val f = Array(Array(1, 2), Array(3, 4), Array(5, 6))
f: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4), Array(5, 6))
scala> f.flatMap(_.map(_ * 2))
res8: Array[Int] = Array(2, 4, 6, 8, 10, 12)
scala> sc.parallelize(List(List(1, 2), List(3, 4))).flatMap(_.map(_ * 2)).collect
res7: Array[Int] = Array(2, 4, 6, 8)
scala> sc.parallelize(1 to 5).flatMap(1 to _).collect
res9: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5)
glom
- Collects the elements of each partition into an array, returning an RDD[Array[T]] with one array per partition
scala> sc.parallelize(1 to 30, 5).glom().collect
res11: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5, 6), Array(7, 8, 9, 10, 11, 12), Array(13, 14, 15, 16, 17, 18), Array(19, 20, 21, 22, 23, 24), Array(25, 26, 27, 28, 29, 30))
sample
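- Randomly selects a fraction of the records in an RDD, using the given random seed
- sample(withReplacement: Boolean, fraction: Double, seed: Long = Utils.random.nextLong)
- withReplacement: whether to sample with replacement, i.e. whether the same element may be drawn more than once
- fraction: the expected proportion of the RDD's elements to draw (between 0 and 1 when sampling without replacement); the exact number of results is not guaranteed
- seed: the seed for the random number generator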
scala> sc.parallelize(1 to 30).sample(false, 0.1, 6).collect
res23: Array[Int] = Array(25)
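Why the result size is not exact becomes clearer with a sketch of sampling without replacement as an independent coin flip per element. This is plain Scala using scala.util.Random, with hypothetical names; Spark uses its own random number generator per partition, so this will not reproduce the element picked above.

```scala
import scala.util.Random

object SampleSketch {
  // Without replacement: keep each element independently with probability
  // `fraction`, so the result size only averages fraction * data.size.
  def bernoulliSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }
}
```

Calling SampleSketch.bernoulliSample((1 to 30).toList, 0.1, 6L) twice returns the same elements both times because the seed is fixed, but a different seed may return more or fewer than three elements.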
filter
scala> sc.parallelize(1 to 30).filter(x => x > 20 && x % 2 == 0).collect
res25: Array[Int] = Array(22, 24, 26, 28, 30)
zipWithIndex
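- Converts an RDD[T] into an RDD[(T, Long)], where the Long is the element's index. Indices are assigned by partition order first, then by element order within each partition, so the first element of partition 0 gets index 0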
scala> val rdd1 = sc.parallelize(List(6, 10, 8, 9, 20, 15))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> rdd1.zipWithIndex.collect
res1: Array[(Int, Long)] = Array((6,0), (10,1), (8,2), (9,3), (20,4), (15,5))
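The numbering rule can be sketched in plain Scala by modelling the RDD as a sequence of partitions (the names here are hypothetical). Each partition's starting index is the total size of all earlier partitions, which is why Spark has to count partition sizes first when an RDD has more than one partition.

```scala
object ZipWithIndexSketch {
  def zipWithIndex[T](partitions: Seq[Seq[T]]): Seq[(T, Long)] = {
    // Starting index of each partition = total size of all earlier partitions.
    val startIndices = partitions.map(_.size.toLong).scanLeft(0L)(_ + _)
    partitions.zip(startIndices).flatMap { case (part, start) =>
      // Within a partition, number elements in their existing order.
      part.zipWithIndex.map { case (x, i) => (x, start + i) }
    }
  }
}
```

For the partitions Seq(6, 10, 8) and Seq(9, 20, 15), the second partition starts at index 3, giving the same pairs as the REPL output above.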
union & intersection & subtract
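- union concatenates the elements of two RDDs without removing duplicates
- intersection returns the elements present in both RDDs, with duplicates removed
- subtract returns the elements of the first RDD that do not appear in the second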
scala> val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[49] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(List(3, 4, 5, 6, 7, 8, 8))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[50] at parallelize at <console>:24
scala> rdd1.union(rdd2).collect
res26: Array[Int] = Array(1, 2, 3, 4, 5, 6, 3, 4, 5, 6, 7, 8, 8)
scala> rdd1.intersection(rdd2).collect
res27: Array[Int] = Array(4, 6, 3, 5)
scala> rdd1.subtract(rdd2).collect
res28: Array[Int] = Array(2, 1)
distinct
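- Removes duplicates; an optional number of partitions can be passed in
- Each element is assigned to a partition by hash: its hashCode modulo the number of partitions. For an Int the hash code is the value itself, so this is value % number of partitions
- 4 % 4 = 0 and 8 % 4 = 0, so both land in partition 0; the rest follow the same rule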
scala> val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 8, 9, 9, 10))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[71] at parallelize at <console>:24
scala> rdd1.distinct(4).mapPartitionsWithIndex((index, partition) => {
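| partition.map(x => println(s"partition $index element $x"))
| }).collect
partition 1 element 1
partition 1 element 9
partition 1 element 5
partition 0 element 4
partition 0 element 8
partition 2 element 6
partition 2 element 10
partition 2 element 2
partition 3 element 3
partition 3 element 7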
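The partition assignment in the distinct(4) example can be reproduced with a small sketch of hash partitioning in plain Scala. The names are hypothetical; the logic assumes the default hash-based partitioner, which takes a non-negative modulo of the key's hashCode.

```scala
object HashPartitionSketch {
  // Non-negative modulo, so negative hash codes still map to a valid partition.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // A null key goes to partition 0; every other key is placed by its hashCode.
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0 else nonNegativeMod(key.hashCode, numPartitions)
}
```

Grouping 1 to 10 by HashPartitionSketch.partitionFor(_, 4) puts 4 and 8 in partition 0, 1, 5 and 9 in partition 1, 2, 6 and 10 in partition 2, and 3 and 7 in partition 3, matching the printed output.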
groupByKey & reduceByKey
- def groupByKey(): RDD[(K, scala.Iterable[V])]
- This overload takes no arguments and gathers all values that share a key into one collection
scala> val rdd1 = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3), ("a", 99)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[85] at parallelize at <console>:24
scala> rdd1.groupByKey().collect
res37: Array[(String, Iterable[Int])] = Array((b,CompactBuffer(2)), (a,CompactBuffer(1, 99)), (c,CompactBuffer(3)))
scala> rdd1.groupByKey().mapValues(_.sum).collect
res38: Array[(String, Int)] = Array((b,2), (a,100), (c,3))
- def reduceByKey(func: (V, V) => V): RDD[(K, V)]
scala> rdd1.reduceByKey(_ + _).collect
res39: Array[(String, Int)] = Array((b,2), (a,100), (c,3))
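groupByKey().mapValues(_.sum) and reduceByKey(_ + _) return the same sums, but they differ in how much data crosses the shuffle: reduceByKey merges values inside each partition first. The following is a sketch in plain Scala with partitions modelled as nested sequences; all names and the sample data are hypothetical.

```scala
object ShuffleVolumeSketch {
  type Partitions = Seq[Seq[(String, Int)]]

  // groupByKey-style: every (key, value) record crosses the shuffle unchanged.
  def recordsShuffledByGroup(parts: Partitions): Int = parts.map(_.size).sum

  // reduceByKey-style, step 1: merge values that share a key inside each
  // partition, leaving at most one record per key per partition to shuffle.
  def locallyCombined(parts: Partitions, f: (Int, Int) => Int): Partitions =
    parts.map { p =>
      p.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }.toSeq
    }

  // reduceByKey-style, step 2: merge the partial results across partitions.
  def reduceByKey(parts: Partitions, f: (Int, Int) => Int): Map[String, Int] =
    locallyCombined(parts, f).flatten
      .groupBy(_._1)
      .map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }
}
```

For the partitions Seq(("a", 1), ("a", 99), ("b", 2)) and Seq(("c", 3), ("a", 5)), the groupByKey style moves five records while the locally combined version moves four; the gap grows as keys repeat more often within a partition.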
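- How can you deduplicate without distinct?
- Map each element x to (x, null), use reduceByKey to keep a single null per key, then use map to take _1 of each tuple. The distinct source quoted after the example does exactly this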
scala> val rdd1 = sc.parallelize(List(3, 4, 5, 15, 4, 10, 5, 6, 7, 1, 3, 4))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[90] at parallelize at <console>:24
scala> rdd1.map((_, null)).reduceByKey((x, y) => x).map(_._1).collect
res41: Array[Int] = Array(4, 6, 10, 15, 1, 3, 7, 5)
/**
* Return a new RDD containing the distinct elements in this RDD.
*/
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}
groupBy
scala> sc.parallelize(List("a", "a", "a", "b", "b", "c")).groupBy(x => x).collect
res42: Array[(String, Iterable[String])] = Array((b,CompactBuffer(b, b)), (a,CompactBuffer(a, a, a)), (c,CompactBuffer(c)))
sortBy & sortByKey
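- sortBy sorts by a user-supplied key function, ascending by default. For descending order, pass false as the second argument (ascending) or, for numeric keys, negate the key. Internally it calls sortByKey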
scala> val rdd1 = sc.parallelize(List(("a", 18), ("b", 23), ("c", 5)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[153] at parallelize at <console>:24
scala> rdd1.sortBy(_._2).collect
res53: Array[(String, Int)] = Array((c,5), (a,18), (b,23))
scala> rdd1.sortBy(_._2, false).collect
res54: Array[(String, Int)] = Array((b,23), (a,18), (c,5))
scala> rdd1.sortBy(-_._2).collect
res55: Array[(String, Int)] = Array((b,23), (a,18), (c,5))
scala> rdd1.map(x => (x._2, x._1)).sortByKey().map(x => (x._1, x._2)).collect
res56: Array[(Int, String)] = Array((5,c), (18,a), (23,b))
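The relationship between sortBy and sortByKey can be sketched on a plain Scala Seq: attach the key to each element, sort by that key, then drop the key again. The names are hypothetical, and this models the idea rather than Spark's distributed sort.

```scala
object SortBySketch {
  def sortBy[T, K](data: Seq[T], ascending: Boolean = true)(f: T => K)(implicit ord: Ordering[K]): Seq[T] = {
    val keyed = data.map(x => (f(x), x))            // pair each element with its sort key
    val order = if (ascending) ord else ord.reverse // descending = reversed key ordering
    keyed.sortBy(_._1)(order).map(_._2)             // sort by key, then keep only the elements
  }
}
```

With the data from the example, SortBySketch.sortBy(data)(_._2) yields (c,5), (a,18), (b,23), and passing ascending = false or negating the key both reverse that order.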
join & leftOuterJoin & rightOuterJoin & fullOuterJoin
- Inner join: join
- def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
scala> val rdd1 = sc.parallelize(List(("a", "上海"), ("b", "北京"), ("c", "深圳")))
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[176] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(List(("a", "20"), ("b", "22"), ("d", "32")))
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[177] at parallelize at <console>:24
scala> rdd1.join(rdd2).collect
res58: Array[(String, (String, String))] = Array((b,(北京,22)), (a,(上海,20)))
- Left outer join: leftOuterJoin
- def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
scala> rdd1.leftOuterJoin(rdd2).collect
res59: Array[(String, (String, Option[String]))] = Array((b,(北京,Some(22))), (a,(上海,Some(20))), (c,(深圳,None)))
- Right outer join: rightOuterJoin
- def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))]
scala> rdd1.rightOuterJoin(rdd2).collect
res61: Array[(String, (Option[String], String))] = Array((d,(None,32)), (b,(Some(北京),22)), (a,(Some(上海),20)))
- Full outer join: fullOuterJoin
- def fullOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], Option[W]))]
scala> rdd1.fullOuterJoin(rdd2).collect
res62: Array[(String, (Option[String], Option[String]))] = Array((d,(None,Some(32))), (b,(Some(北京),Some(22))), (a,(Some(上海),Some(20))), (c,(Some(深圳),None)))
cogroup & xxxJoin
- cogroup matches records by key and returns the values from both RDDs for every key; a side with no match gets an empty collection
- def cogroup[W](other: RDD[(K, W)]): RDD[(K, (scala.Iterable[V], scala.Iterable[W]))]
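- The value type of the RDD returned by cogroup is (scala.Iterable[V], scala.Iterable[W]). The xxxJoin methods return one pair per match instead: (V, W) for join, with Option wrapping the side that may be missing in the outer joins, up to (Option[V], Option[W]) for fullOuterJoin
- The xxxJoin methods are implemented on top of cogroup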
scala> rdd1.cogroup(rdd2).collect
res63: Array[(String, (Iterable[String], Iterable[String]))] = Array((d,(CompactBuffer(),CompactBuffer(32))), (b,(CompactBuffer(北京),CompactBuffer(22))), (a,(CompactBuffer(上海),CompactBuffer(20))), (c,(CompactBuffer(深圳),CompactBuffer())))
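How the joins can be derived from cogroup is easier to see on plain Scala collections. The sketch below uses hypothetical names and Seq/Map in place of RDDs; it illustrates the idea and is not Spark's implementation.

```scala
object CogroupSketch {
  // Group both sides by key; a key missing on one side gets an empty Seq there.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val l: Map[K, Seq[V]] = left.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    val r: Map[K, Seq[W]] = right.groupBy(_._1).map { case (k, kws) => (k, kws.map(_._2)) }
    (l.keySet ++ r.keySet).toSeq.map { k =>
      (k, (l.getOrElse(k, Seq.empty[V]), r.getOrElse(k, Seq.empty[W])))
    }.toMap
  }

  // Inner join: keys with an empty side produce no pairs at all.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    cogroup(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w))
    }

  // Left outer join: keep left values even when the right side is empty.
  def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    cogroup(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
      if (ws.isEmpty) vs.map(v => (k, (v, None: Option[W])))
      else for (v <- vs; w <- ws) yield (k, (v, Some(w): Option[W]))
    }

  // Full outer join: keep whichever side is present, wrapping both in Option.
  def fullOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (Option[V], Option[W]))] =
    cogroup(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
      if (ws.isEmpty) vs.map(v => (k, (Some(v): Option[V], None: Option[W])))
      else if (vs.isEmpty) ws.map(w => (k, (None: Option[V], Some(w): Option[W])))
      else for (v <- vs; w <- ws) yield (k, (Some(v): Option[V], Some(w): Option[W]))
    }
}
```

The order of the results is unspecified here, just as it is in the collect output above, so compare results as sets rather than as sequences.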