Spark Operators [08]: combineByKey in Detail

This article takes a detailed look at Spark's combineByKey operation, the foundation of aggregation over distributed datasets. Working from the source code, it explains how an RDD[K,V] is transformed into an RDD[K,C], and gives hands-on Scala and Java examples that compute students' average scores.

combineByKey

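Aggregating data is usually straightforward when everything lives in one place, but how do you do it over a distributed dataset? This section introduces combineByKey, the ancestor of Spark's key-based aggregation operators (reduceByKey, foldByKey, aggregateByKey, and groupByKey are all built on the same underlying routine), so it is worth understanding well. See the Spark API for reference.
Better still, load the Spark source package into IntelliJ IDEA and read the code directly (Spark source package download).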

Source code

  /**
   * @see [[combineByKeyWithClassTag]]
   * 
   * The actual implementation lives in combineByKeyWithClassTag.
   */
  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
      partitioner, mapSideCombine, serializer)(null)
  }
def combineByKey[C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

def combineByKey[C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]

def combineByKey[C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializer: Serializer = null): RDD[(K, C)]

This function transforms an RDD[K,V] into an RDD[K,C]. The value type V and the combined type C may be the same or different.

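Parameters:

  • createCombiner: the combiner-creation function. It converts a V into a C and is invoked the first time a key is encountered within a partition. Input: a V from RDD[K,V]; output: a C.
  • mergeValue: the value-merging function. It folds one more V into an existing C for the same key within a partition. Input: (C, V); output: C.
  • mergeCombiners: the combiner-merging function. It merges two C values for the same key that were built in different partitions. Input: (C, C); output: C.
  • numPartitions: the number of partitions of the result RDD. When omitted, Spark picks a default partitioner, which generally keeps the parent RDD's partition count unless spark.default.parallelism is set.
  • partitioner: the partitioner that assigns keys to result partitions. HashPartitioner is used by default.
  • mapSideCombine: whether to combine on the map side before the shuffle, analogous to the combiner in MapReduce. Defaults to true.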
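To make the roles of the three functions concrete, here is a minimal plain-Java sketch that imitates the combine logic on in-memory lists standing in for partitions. It is not Spark code: the class name CombineByKeySketch, the helper methods, and the demo data are all hypothetical, and the shuffle is reduced to a simple loop over partitions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Function;

/** Plain-Java illustration of combineByKey semantics; an assumption-laden sketch, not Spark's implementation. */
final class CombineByKeySketch {

    /** Combines one partition: createCombiner on a key's first value, mergeValue on every later value. */
    static <K, V, C> Map<K, C> combinePartition(
            List<Map.Entry<K, V>> partition,
            Function<V, C> createCombiner,
            BiFunction<C, V, C> mergeValue) {
        Map<K, C> combined = new HashMap<>();
        for (Map.Entry<K, V> record : partition) {
            K key = record.getKey();
            if (combined.containsKey(key)) {
                combined.put(key, mergeValue.apply(combined.get(key), record.getValue()));
            } else {
                combined.put(key, createCombiner.apply(record.getValue()));
            }
        }
        return combined;
    }

    /** Merges the per-partition combiners key by key with mergeCombiners. */
    static <K, V, C> Map<K, C> combineByKey(
            List<List<Map.Entry<K, V>>> partitions,
            Function<V, C> createCombiner,
            BiFunction<C, V, C> mergeValue,
            BinaryOperator<C> mergeCombiners) {
        Map<K, C> result = new HashMap<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            combinePartition(partition, createCombiner, mergeValue)
                .forEach((key, combiner) -> result.merge(key, combiner, mergeCombiners));
        }
        return result;
    }

    /** Hypothetical demo: V is Integer, C is an int[]{sum, count} accumulator. */
    static Map<String, int[]> demo() {
        List<List<Map.Entry<String, Integer>>> partitions = List.of(
            List.of(Map.entry("A", 98), Map.entry("B", 75), Map.entry("A", 88)),
            List.of(Map.entry("B", 78), Map.entry("A", 60)));
        return combineByKey(
            partitions,
            v -> new int[] {v, 1},                           // createCombiner
            (acc, v) -> new int[] {acc[0] + v, acc[1] + 1},  // mergeValue
            (a, b) -> new int[] {a[0] + b[0], a[1] + b[1]}); // mergeCombiners
    }
}
```

In the demo, key "A" appears twice in the first list and once in the second, so createCombiner runs once per list for "A", mergeValue runs once, and mergeCombiners joins the two partial accumulators into {246, 3}.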

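Examples

Scala example

As an example, let's compute each student's average score (Scala version, reference link).

The ScoreDetail class stores a student's name and the score for one subject.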

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// 1. ScoreDetail stores a student's name, subject, and score.
case class ScoreDetail(studentName: String, subject: String, score: Float)


/**
 * Computes each student's average score.
 */
def avgScore(): Unit = {
  val conf = new SparkConf().setAppName("avgScore").setMaster("local[2]")
  val sc = new SparkContext(conf)
  // 2.1 Build the list of score records
  val scoreDetail = List(
    ScoreDetail("A", "Math", 98),
    ScoreDetail("A", "English", 88),
    ScoreDetail("B", "Math", 75),
    ScoreDetail("B", "English", 78),
    ScoreDetail("C", "Math", 90),
    ScoreDetail("C", "English", 80),
    ScoreDetail("D", "Math", 91),
    ScoreDetail("D", "English", 80)
  )
  // 2.2 Turn each record into a Tuple2 of (student name, record)
  val studentDetail = for {x <- scoreDetail} yield (x.studentName, x)
  // 2.3 Parallelize the records, hash-partition them into 3 partitions, and cache the result
  val studentDetailRdd = sc.parallelize(studentDetail).partitionBy(new HashPartitioner(3)).cache()

  val avgscoreRdd = studentDetailRdd.combineByKey(
    // 1. createCombiner: takes the V of RDD[K,V] (a ScoreDetail) and returns the tuple (score, 1)
    (x: ScoreDetail) => (x.score, 1),
    // 2. mergeValue: takes (C, V), i.e. ((score sum, count), ScoreDetail), and returns (score sum + score, count + 1)
    (acc: (Float, Int), x: ScoreDetail) => (acc._1 + x.score, acc._2 + 1),
    // 3. mergeCombiners: merges partial results from different partitions; takes (C, C) and returns C
    (acc1: (Float, Int), acc2: (Float, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
  ).map(x => (x._1, x._2._1 / x._2._2)) // the combined output is (student name, (score sum, score count)); divide to get the average

  avgscoreRdd.foreach(println)
  sc.stop()
}

Output (the order of the lines may vary between runs):
(C,85.0)
(B,76.5)
(A,93.0)
(D,85.5)
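The figures above can be checked by hand: each average is the score sum divided by the score count, for example (75 + 78) / 2 = 76.5 for student B. The plain-Java sketch below reproduces that arithmetic and also shows why the examples keep the sum as a floating-point number. The class and method names are hypothetical and exist only for this illustration.

```java
/** Worked arithmetic for the (sum, count) accumulator; illustrative names, no Spark involved. */
final class AverageArithmetic {

    /** Average with a float sum, as in the (Float, Int) accumulator of the examples. */
    static float floatAverage(int... scores) {
        float sum = 0f;
        int count = 0;
        for (int score : scores) {
            sum += score;
            count++;
        }
        return sum / count;
    }

    /** Same computation with an int sum: integer division drops the fractional part. */
    static int intAverage(int... scores) {
        int sum = 0;
        for (int score : scores) {
            sum += score;
        }
        return sum / scores.length;
    }
}
```

Calling floatAverage(75, 78) gives 76.5, while intAverage(75, 78) gives 76. This is the reason the Java example casts the int score to float in createCombiner before accumulating.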


Java example

1. The ScoreDetail object, which stores a student's name, subject, and score.

import java.io.Serializable;

public class ScoreDetail003 implements Serializable {
    String name ;
    String subject ;
    int score ;

    public ScoreDetail003(String name, String subject, int score) {
        this.name = name;
        this.subject = subject;
        this.score = score;
    }

}

2. Compute the average score

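// Assumes these imports at the top of the enclosing class file:
//   java.util.ArrayList, scala.Tuple2, org.apache.spark.SparkConf,
//   org.apache.spark.api.java.JavaPairRDD, org.apache.spark.api.java.JavaRDD,
//   org.apache.spark.api.java.JavaSparkContext,
//   org.apache.spark.api.java.function.Function, org.apache.spark.api.java.function.Function2
public static void avgScore() {
    SparkConf conf = new SparkConf().setAppName("avgScore").setMaster("local");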
    JavaSparkContext sc = new JavaSparkContext(conf);
    ArrayList<ScoreDetail003> scoreDetail = new ArrayList<ScoreDetail003>();
    scoreDetail.add(new ScoreDetail003("A", "Math", 98));
    scoreDetail.add(new ScoreDetail003("A", "English", 88));
    scoreDetail.add(new ScoreDetail003("B", "Math", 75));
    scoreDetail.add(new ScoreDetail003("B", "English", 78));
    scoreDetail.add(new ScoreDetail003("C", "Math", 90));
    scoreDetail.add(new ScoreDetail003("C", "English", 80));
    scoreDetail.add(new ScoreDetail003("D", "Math", 91));
    scoreDetail.add(new ScoreDetail003("D", "English", 80));

    JavaRDD<ScoreDetail003> scoreDetailRdd = sc.parallelize(scoreDetail);

    JavaPairRDD<String,ScoreDetail003> pairRDD = scoreDetailRdd.mapToPair(detail -> new Tuple2<String, ScoreDetail003>(detail.name,detail));

    // 1. createCombiner: takes the V of RDD[K,V] (a ScoreDetail003) and returns the tuple (score, 1)
    Function<ScoreDetail003, Tuple2<Float,Integer>> createCombiner = new Function<ScoreDetail003, Tuple2<Float, Integer>>() {
        @Override
        public Tuple2<Float, Integer> call(ScoreDetail003 v1) throws Exception {
            return new Tuple2<Float, Integer>((float) v1.score,1);
        }
    };

    // 2. mergeValue: takes (C, V), i.e. ((score sum, count), ScoreDetail003), and returns (score sum + score, count + 1)
    Function2<Tuple2<Float,Integer>,ScoreDetail003,Tuple2<Float,Integer>> mergeValue = new Function2<Tuple2<Float, Integer>, ScoreDetail003, Tuple2<Float, Integer>>() {
        @Override
        public Tuple2<Float, Integer> call(Tuple2<Float, Integer> v1, ScoreDetail003 v2) throws Exception {
            return new Tuple2<Float, Integer>(v1._1()+v2.score,v1._2()+1);
        }
    };

    // 3. mergeCombiners: merges partial results from different partitions; takes (C, C) and returns C
    Function2<Tuple2<Float,Integer>,Tuple2<Float,Integer>,Tuple2<Float,Integer>> mergeCombiners = new Function2<Tuple2<Float, Integer>, Tuple2<Float, Integer>, Tuple2<Float, Integer>>() {
        @Override
        public Tuple2<Float, Integer> call(Tuple2<Float, Integer> v1, Tuple2<Float, Integer> v2) throws Exception {
            return new Tuple2<Float, Integer>(v1._1()+v2._1(),v1._2()+v2._2());
        }
    };

    // 4. Run combineByKey with 2 result partitions, then divide sum by count to get the average
    JavaPairRDD<String,Float> res = pairRDD.combineByKey(createCombiner, mergeValue, mergeCombiners, 2)
            .mapToPair(x -> new Tuple2<String, Float>(x._1(),x._2()._1()/x._2()._2()));

    // 5. Print the result
    res.foreach(x -> System.out.println(x));


    sc.close();
}
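The Java call above passes 2 as numPartitions, and the Scala example uses HashPartitioner(3). As a rough sketch of how hash partitioning decides where a key goes, the plain-Java code below takes the key's hashCode modulo the partition count and shifts negative remainders into range. This mirrors the behavior of HashPartitioner as I understand it; the class name HashPartitionSketch and its methods are hypothetical and not part of Spark's API.

```java
/** Illustrative hash partitioning; hypothetical helper, not Spark's own class. */
final class HashPartitionSketch {

    /** Remainder shifted into [0, mod), so negative hash codes still map to a valid partition. */
    static int nonNegativeMod(int x, int mod) {
        int rawMod = x % mod;
        return rawMod < 0 ? rawMod + mod : rawMod;
    }

    /** Partition index for a key; a null key is assumed to go to partition 0. */
    static int partitionFor(Object key, int numPartitions) {
        return key == null ? 0 : nonNegativeMod(key.hashCode(), numPartitions);
    }
}
```

A one-character String has a hashCode equal to its character code, so with 2 partitions the keys "A" (65) and "C" (67) map to partition 1 while "B" (66) and "D" (68) map to partition 0. With 3 partitions, "A" and "D" share partition 2, "B" goes to 0, and "C" goes to 1.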