Bulk Reading and Writing HBase with Spark BulkLoad

This content is licensed under CC BY-SA 4.0.

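This article shows how to read and write HBase in bulk from Spark. It advises against inserting rows one at a time with put and recommends BulkLoad instead. Dependencies and steps are given separately for HBase 1.x and 2.x. On 1.x, data is read with sparkContext.newAPIHadoopRDD and written by generating HFiles and importing them with LoadIncrementalHFiles. On 2.x, the official hbase-spark connector supports column pruning and makes reads and writes more efficient; the recommended approach is to generate HFiles with bulkLoadThinRows or hbaseBulkLoad and then import them.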

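When reading and writing HBase from Spark, do not insert rows one at a time with put: it is far too slow. Use bulk import instead. The procedure differs by HBase version: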

HBase 1.x

Dependencies:

<!-- Spark 2.x dependencies omitted -->
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.2.7</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>1.2.7</version>
</dependency>

Reading data

Use spark.sparkContext.newAPIHadoopRDD. This performs a full table scan and reads every column.

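import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.mutable
import spark.implicits._ // `spark` is the active SparkSession

val inputHbase: Configuration = HBaseConfiguration.create()
inputHbase.set("hbase.zookeeper.quorum", "zk01,zk02,zk03")
inputHbase.set("hbase.zookeeper.property.clientPort", "2181")
inputHbase.set("fs.defaultFS", "hdfs://cluster/")
// Table to write to
inputHbase.set("hbase.mapred.outputtable", "sp_001")
// Table to read from
inputHbase.set("hbase.mapreduce.inputtable", "sp_001")
inputHbase.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 2048)
// Full table scan: each record is a (row key, Result) pair
val hbaseRDD = spark.sparkContext.newAPIHadoopRDD(inputHbase, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
// Convert each Result to an Evt. Kakou and Evt are case classes defined elsewhere in the project;
// the declared fields of Kakou are used as the column names to fetch from family "f".
val data = hbaseRDD.map { case (_, result) =>
  val rowkey = Bytes.toString(result.getRow)
  val fields = classOf[Kakou].getDeclaredFields
  val map = mutable.HashMap[String, String]()
  for (field <- fields) {
    val column = field.getName
    val value = Bytes.toString(result.getValue(Bytes.toBytes("f"), Bytes.toBytes(column)))
    // Skip columns that are absent from this row
    if (value != null) map += (column -> value)
  }
  Evt(rowkey, map.getOrElse("sno", ""), map.getOrElse("passid", ""), map.getOrElse("lat", ""), map.getOrElse("lon", ""))
}
  // Repartition after the map: Result and ImmutableBytesWritable are not java.io.Serializable
  .repartition(100)
// An RDD has no show(); convert it to a DataFrame first
data.toDF().show(false)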

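Writing data

Generate HFiles by calling saveAsNewAPIHadoopFile on the RDD. Within a single HFile the row keys must be in lexicographic order, and so must the column names. No ordering is required across HFiles, so the data can be generated and loaded in several batches. Then import the HFiles into the target table with new LoadIncrementalHFiles(hbaseConf).doBulkLoad.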
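The ordering requirement is easy to get wrong, because HBase compares keys as raw bytes rather than as numbers. The following is a minimal sketch in plain Scala (no HBase dependency) of that ordering, assuming UTF-8 encoded keys; `HFileOrdering`, `SimpleCell`, and `sortCells` are hypothetical names used only for illustration and are not part of any HBase API.

```scala
object HFileOrdering {
  // A hypothetical stand-in for an HBase cell: row key, column qualifier, value.
  final case class SimpleCell(rowKey: String, qualifier: String, value: String)

  // Compare two byte arrays as unsigned bytes, left to right.
  // If one array is a prefix of the other, the shorter one sorts first.
  def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val diff = (a(i) & 0xff) - (b(i) & 0xff)
      if (diff != 0) return diff
      i += 1
    }
    a.length - b.length
  }

  private def utf8(s: String): Array[Byte] = s.getBytes("UTF-8")

  // Order cells by row key first, then by qualifier, both in byte order.
  def sortCells(cells: Seq[SimpleCell]): Seq[SimpleCell] =
    cells.sortWith { (x, y) =>
      val byRow = compareBytes(utf8(x.rowKey), utf8(y.rowKey))
      if (byRow != 0) byRow < 0
      else compareBytes(utf8(x.qualifier), utf8(y.qualifier)) < 0
    }

  def main(args: Array[String]): Unit = {
    val cells = Seq(
      SimpleCell("2", "name", "b"),
      SimpleCell("10", "name", "a"),
      SimpleCell("10", "age", "7"))
    // Row "10" sorts before row "2" because the byte for '1' is smaller than the byte for '2'.
    sortCells(cells).foreach(println)
  }
}
```

If row keys are numeric, zero-padding them to a fixed width is one common way to make byte order agree with numeric order.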

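import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// Convert the data into (row key, KeyValue) pairs
protected def transform(rdd: RDD[Evt]): RDD[(ImmutableBytesWritable, KeyValue)] = {
  rdd
    .filter(_.rowkey.nonEmpty)
    // Sort on the String row key before building HBase objects: ImmutableBytesWritable and
    // KeyValue are not java.io.Serializable, so shuffling them fails unless Kryo is configured.
    // For ASCII row keys, String order matches HBase byte order.
    .sortBy(_.rowkey, ascending = true)
    .map { x =>
      val rowKey = Bytes.toBytes(x.rowkey)
      // Left-pad a single-character sno with a zero
      val sn = if (x.sno.length == 1) s"0${x.sno}" else x.sno
      // Fuse the fields into one CSV-formatted value stored in column f:v
      val value = new KeyValue(rowKey, Bytes.toBytes("f"), Bytes.toBytes("v"),
        Bytes.toBytes(s"${sn},,,${x.sp},${x.lon},${x.lat},${x.evt}"))
      (new ImmutableBytesWritable(rowKey), value)
    }
}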
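import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.mapreduce.Job

// Load the data into the new table
protected def load(data: RDD[(ImmutableBytesWritable, KeyValue)], hbaseConf: Configuration, evtTableName: String, hFilePath: String) = {
  val conn = ConnectionFactory.createConnection(hbaseConf)
  val admin = conn.getAdmin
  val tableName = TableName.valueOf(evtTableName)
  val table = conn.getTable(tableName)
  val regionLocator = conn.getRegionLocator(tableName)
  // Configure a MapReduce job so that HFileOutputFormat2 picks up the table's settings
  val job = Job.getInstance(hbaseConf)
  job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
  job.setMapOutputValueClass(classOf[KeyValue])
  job.setOutputFormatClass(classOf[HFileOutputFormat2])
  HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)

  // Delete any HFiles left over from a previous run.
  // HdfsUtil is the author's helper and ss is the SparkSession; both are defined elsewhere.
  if (HdfsUtil.exist(hFilePath, ss)) {
    println("Deleting stale HFile staging data ...")
    HdfsUtil.delete(hFilePath, ss)
  }
  println("Start generating HFiles ...")
  // The input must be an RDD; a Dataset is not compatible
  data.saveAsNewAPIHadoopFile(hFilePath, classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat2], job.getConfiguration)
  println("HFile write complete! If the target is a different HDFS cluster, move the data there with distcp first ...")

  val bulkLoader = new LoadIncrementalHFiles(hbaseConf)
  bulkLoader.doBulkLoad(new Path(hFilePath), admin, table, regionLocator)
  table.close()
  conn.close()
  // Clean up the staging directory once the load has finished
  if (HdfsUtil.exist(hFilePath, ss)) {
    println("Deleting HFile staging data ...")
    HdfsUtil.delete(hFilePath, ss)
  }
}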
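The configuration used for reading sets hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily to 2048. As far as I understand it, the loader refuses to proceed when a single region would receive more than that many HFiles for one column family. Below is a rough pre-check in plain Scala, assuming the staging directory follows the usual `<stagingDir>/<family>/<hfile>` layout and that the list of file paths has already been obtained somehow. `HFileLimitCheck` and its methods are hypothetical helpers, not HBase APIs. The check is conservative: it counts files per family across all regions, so a family that passes here cannot exceed the per-region limit.

```scala
object HFileLimitCheck {
  // Count files per column-family directory, given paths shaped like
  // <stagingDir>/<family>/<hfile>. Marker files such as _SUCCESS are ignored.
  def filesPerFamily(stagingDir: String, paths: Seq[String]): Map[String, Int] = {
    val prefix = if (stagingDir.endsWith("/")) stagingDir else stagingDir + "/"
    paths
      .filter(_.startsWith(prefix))
      .map(_.stripPrefix(prefix).split("/"))
      .collect { case Array(family, _) if !family.startsWith("_") => family }
      .groupBy(identity)
      .map { case (family, files) => family -> files.size }
  }

  // Families whose total file count exceeds the configured limit.
  def familiesOverLimit(counts: Map[String, Int], limit: Int): Set[String] =
    counts.collect { case (family, n) if n > limit => family }.toSet

  def main(args: Array[String]): Unit = {
    val paths = Seq("/tmp/hfile/f/a1", "/tmp/hfile/f/a2", "/tmp/hfile/_SUCCESS")
    val counts = filesPerFamily("/tmp/hfile", paths)
    println(familiesOverLimit(counts, limit = 32))
  }
}
```

When a family is flagged, the options are to raise the limit in the configuration or to reduce the number of output partitions before writing the HFiles.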

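HBase 2.x

For HBase 2.x, the HBase project provides the hbase-spark connector. It lets you read an HBase table much as you would read a Hive table, with column pruning. For writes it offers new HBaseContext(sc, config).bulkLoadThinRows (more efficient when rows have few columns) and rdd.hbaseBulkLoad (for rows with more than 10k columns) to generate HFiles, after which doBulkLoad loads them into the target table. In most cases, where a row has fewer than 10k columns, use bulkLoadThinRows.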
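In the thin-row path, all columns of a row travel together as one record, so the input has to be grouped by row key and the rows ordered before HFiles are written. Here is a minimal sketch of that grouping in plain Scala, without Spark or HBase, assuming string keys and values for readability. `ThinRowGrouping` and `toThinRows` are hypothetical names; in a real job the same shape would come from groupByKey followed by a sort within partitions.

```scala
import scala.collection.immutable.TreeMap

object ThinRowGrouping {
  // One input cell: (rowKey, (family, qualifier, value)).
  type Cell = (String, (String, String, String))

  // Group cells by row key and return the rows in sorted row-key order.
  // Within a row, columns are keyed by (family, qualifier) and kept sorted;
  // if a column appears more than once, the last value in input order wins.
  def toThinRows(cells: Seq[Cell]): Seq[(String, TreeMap[(String, String), String])] =
    cells
      .groupBy(_._1)
      .toSeq
      .sortBy(_._1)
      .map { case (rowKey, rowCells) =>
        val columns = rowCells.foldLeft(TreeMap.empty[(String, String), String]) {
          case (acc, (_, (family, qualifier, value))) =>
            acc.updated((family, qualifier), value)
        }
        (rowKey, columns)
      }

  def main(args: Array[String]): Unit = {
    val cells: Seq[Cell] = Seq(
      ("2", ("af", "name", "foo2")),
      ("1", ("af", "name", "foo1")),
      ("2", ("af", "age", "12")))
    toThinRows(cells).foreach(println)
  }
}
```

The sketch sorts with Scala's String ordering, which agrees with HBase byte order only for ASCII keys.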

Dependencies:

<!-- Spark 2.x dependencies omitted -->
<dependency>
    <groupId>org.apache.hbase.connectors.spark</groupId>
    <artifactId>hbase-spark</artifactId>
    <version>1.0.0</version>
</dependency>

Reading data

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext
// Package used by the hbase-connectors artifact; verify it against your connector version
import org.apache.hadoop.hbase.spark.datasources.HBaseTableCatalog

val inputHbase: Configuration = HBaseConfiguration.create()
inputHbase.set("hbase.zookeeper.quorum", "zk01,zk02,zk03")
inputHbase.set("hbase.zookeeper.property.clientPort", "2181")
inputHbase.set("fs.defaultFS", "hdfs://cluster/")
// Table to write to
inputHbase.set("hbase.mapred.outputtable", "sp_001")
// Table to read from
inputHbase.set("hbase.mapreduce.inputtable", "sp_001")
inputHbase.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 2048)
// Create the HBaseContext before reading; `spark` is the active SparkSession
val hbaseContext = new HBaseContext(spark.sparkContext, inputHbase, null)

// Maps DataFrame columns to HBase columns; evtInputTableName holds the name of the table to read
def catalog =
  s"""{
     |"table":{"namespace":"default", "name":"${evtInputTableName}"},
     |"rowkey":"key",
     |"columns":{
     |"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
     |"sno":{"cf":"f", "col":"sno", "type":"string"},
     |"lat":{"cf":"f", "col":"lat", "type":"string"},
     |"lon":{"cf":"f", "col":"lon", "type":"string"},
     |"sp":{"cf":"f", "col":"sp", "type":"string"},
     |"evt":{"cf":"f", "col":"evt", "type":"string"}
     |}
     |}""".stripMargin

println("Reading HBase into a DataFrame ----->>>")
val df = spark
  .read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.hadoop.hbase.spark")
  .load()
// df.show(false)
val rowTable = df.select("rowkey", "sno", "lat", "lon", "sp", "evt")
rowTable.show(false)
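Writing the catalog JSON by hand gets tedious as the column list grows. As a small sketch, the same structure can be generated from a list of column names, assuming every column is a string in a single column family and that names need no JSON escaping. `CatalogBuilder` and `build` are hypothetical helpers, not part of the hbase-spark connector.

```scala
object CatalogBuilder {
  // Build a catalog string in the same shape as the hand-written one:
  // a row key entry plus one string-typed entry per column, all in one family.
  def build(namespace: String, table: String, family: String, columns: Seq[String]): String = {
    val rowKeyEntry = """"rowkey":{"cf":"rowkey", "col":"key", "type":"string"}"""
    val columnEntries = columns.map { c =>
      s""""$c":{"cf":"$family", "col":"$c", "type":"string"}"""
    }
    val body = (rowKeyEntry +: columnEntries).mkString(",\n")
    s"""{
       |"table":{"namespace":"$namespace", "name":"$table"},
       |"rowkey":"key",
       |"columns":{
       |$body
       |}
       |}""".stripMargin
  }

  def main(args: Array[String]): Unit =
    println(build("default", "sp_001", "f", Seq("sno", "lat", "lon", "sp", "evt")))
}
```

The resulting string can be passed as the value for HBaseTableCatalog.tableCatalog in the read options.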

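Writing data

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.Path
      import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants, TableName}
      import org.apache.hadoop.hbase.client.ConnectionFactory
      import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
      import org.apache.hadoop.hbase.spark.{ByteArrayWrapper, FamiliesQualifiersValues, HBaseContext}
      import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles
      import org.apache.hadoop.hbase.util.Bytes
      import org.apache.spark.{HashPartitioner, SparkConf}
      import org.apache.spark.sql.SparkSession

      val ss = SparkSession.builder().master("local").appName("bulk-load").config(new SparkConf()).getOrCreate()
      val sc = ss.sparkContext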
      val zookeeperQuorum = "huawei:2181"

      val hdfsRootPath = "hdfs://huawei:9000"
      val hFilePath = "hdfs://huawei:9000/tmp/hfile/"

      val tableName = "test"
      val columnFamily1 = "af"

      val hadoopConf = new Configuration()
      hadoopConf.set("fs.defaultFS", hdfsRootPath)
      hadoopConf.set("dfs.client.use.datanode.hostname", "true")


      val config = HBaseConfiguration.create(hadoopConf)
      config.set(HConstants.ZOOKEEPER_QUORUM, zookeeperQuorum)
      config.set(TableOutputFormat.OUTPUT_TABLE, tableName)
      // If a large import produces many HFiles, raise this limit (the default is 32)
      config.set("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily","32")


      val rdd = sc.parallelize(Array(
        ("1",
          (Bytes.toBytes(columnFamily1), Bytes.toBytes("name"), Bytes.toBytes("foo1"))),
        ("3",
          (Bytes.toBytes(columnFamily1), Bytes.toBytes("age"), Bytes.toBytes("20"))),
        ("3",
          (Bytes.toBytes(columnFamily1), Bytes.toBytes("name"), Bytes.toBytes("foo3"))),
        ("3",
          (Bytes.toBytes(columnFamily1), Bytes.toBytes("age"), Bytes.toBytes("33"))),
        ("5",
          (Bytes.toBytes(columnFamily1), Bytes.toBytes("name"), Bytes.toBytes("foo5"))),
        ("5",
          (Bytes.toBytes(columnFamily1), Bytes.toBytes("age"), Bytes.toBytes("45"))),
        ("4",
          (Bytes.toBytes(columnFamily1), Bytes.toBytes("age"), Bytes.toBytes("46"))),
        ("2",
          (Bytes.toBytes(columnFamily1), Bytes.toBytes("name"), Bytes.toBytes("foo2"))),
        ("2",
          (Bytes.toBytes(columnFamily1), Bytes.toBytes("age"), Bytes.toBytes("12")))))
        .groupByKey()
        .repartitionAndSortWithinPartitions(new HashPartitioner(1))

      val hbaseContext = new HBaseContext(sc, config)

      // Write the HFiles
      hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte], Array[Byte])])](rdd,
        TableName.valueOf(tableName),
        t => {
          val rowKey = Bytes.toBytes(t._1)

          val familyQualifiersValues = new FamiliesQualifiersValues
          t._2.foreach(f => {
            val family:Array[Byte] = f._1
            val qualifier = f._2
            val value:Array[Byte] = f._3

            familyQualifiersValues +=(family, qualifier, value)
          })
          (new ByteArrayWrapper(rowKey), familyQualifiersValues)
        },
        hFilePath)

      // Load the generated HFiles into the table
      val conn = ConnectionFactory.createConnection(config)
      try {
        val load = new LoadIncrementalHFiles(config)
        val table = conn.getTable(TableName.valueOf(tableName))
        load.doBulkLoad(new Path(hFilePath), conn.getAdmin, table,
          conn.getRegionLocator(TableName.valueOf(tableName)))
        println("Write to HBase complete!")
        table.close()
      } catch {
        case e: Exception =>
          println(s"Bulk load into HBase failed! msg: ${e}")
      } finally {
        // Always release the connection, even when the load fails
        conn.close()
        // After the load, the HDFS staging directory should be deleted
        // HdfsUtil.delete(hFilePath, ss)
      }