Bulk Reading and Writing HBase with Spark BulkLoad

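This article describes how to batch read from and write to HBase using Spark. It stresses avoiding row-by-row put inserts and recommends BulkLoad instead. Dependencies and steps are given separately for HBase 1.x and 2.x. On 1.x, data is read with sparkContext.newAPIHadoopRDD and written by generating HFiles and importing them with LoadIncrementalHFiles. On 2.x, the official hbase-spark connector supports column pruning and makes reads and writes more efficient; the recommended approach is to generate HFiles with bulkLoadThinRows or hbaseBulkLoad and then import them.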

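When reading and writing HBase from Spark, do not insert rows one at a time with put: it is far too slow. Use bulk import instead. The approach differs by HBase version: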

HBase 1.x

Dependencies:

        <!-- Spark 2.x dependencies omitted -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>1.2.7</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>1.2.7</version>
        </dependency>
  • Reading data

    Use spark.sparkContext.newAPIHadoopRDD. This performs a full table scan and reads every column.

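    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import scala.collection.mutable

    // `spark` is an existing SparkSession; Kakou and Evt are the author's case classes (not shown)
    val inputHbase: Configuration = HBaseConfiguration.create()
    inputHbase.set("hbase.zookeeper.quorum", "zk01,zk02,zk03")
    inputHbase.set("hbase.zookeeper.property.clientPort", "2181")
    inputHbase.set("fs.defaultFS", "hdfs://cluster/")
    // table to write to
    inputHbase.set("hbase.mapred.outputtable", "sp_001")
    // table to read from
    inputHbase.set("hbase.mapreduce.inputtable", "sp_001")
    inputHbase.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 2048)
    // full table scan, read as an RDD of (rowkey, Result) pairs
    val hbaseRDD = spark.sparkContext.newAPIHadoopRDD(inputHbase, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    val data = hbaseRDD.map { case (_, result) =>
      val rowkey = Bytes.toString(result.getRow)
      // one column per declared field of Kakou, all in column family "f"
      val fields = classOf[Kakou].getDeclaredFields
      val map = new mutable.HashMap[String, String]()
      for (field <- fields) {
        val column = field.getName
        // Bytes.toString returns null when the column is absent; skip those columns
        val value = Bytes.toString(result.getValue(Bytes.toBytes("f"), Bytes.toBytes(column)))
        if (value != null) map += (column -> value)
      }
      Evt(rowkey, map.getOrElse("sno", ""), map.getOrElse("passid", ""),
        map.getOrElse("lat", ""), map.getOrElse("lon", ""))
    }
      // repartition after the map: ImmutableBytesWritable and Result are not java.io.Serializable,
      // so shuffling the raw pairs fails under the default Java serializer
      .repartition(100)
    // data is an RDD, which has no show(); print a sample instead
    data.take(20).foreach(println)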
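
Because the read above scans every column, it may be worth restricting the scan when only a few columns are needed. A minimal sketch in plain Scala, assuming the `hbase.mapreduce.scan.columns` key read by TableInputFormat (a space-delimited list of family:qualifier entries); `ScanColumns` and `format` are illustrative names, and the column list is only an example:

```scala
object ScanColumns {
  // Configuration key read by TableInputFormat (TableInputFormat.SCAN_COLUMNS)
  val ScanColumnsKey = "hbase.mapreduce.scan.columns"

  // Formats a space-delimited list of "family:qualifier" entries
  def format(family: String, qualifiers: Seq[String]): String =
    qualifiers.map(q => s"$family:$q").mkString(" ")

  def main(args: Array[String]): Unit = {
    val value = format("f", Seq("lat", "lon", "passid", "sno"))
    // In the read job this would be applied as: inputHbase.set(ScanColumnsKey, value)
    println(s"$ScanColumnsKey=$value")
  }
}
```

The resulting string would be set on the same Configuration object before newAPIHadoopRDD is called.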
    
  • Writing data

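    Generate HFiles by calling saveAsNewAPIHadoopFile on the RDD. Within a single HFile, rowkeys must be in lexicographic order, and column names must be in lexicographic order as well; no ordering is required across HFiles (they can be generated and imported in several batches). Then import the HFiles into the target table with new LoadIncrementalHFiles(hbaseConf).doBulkLoad.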
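
The ordering requirement can be illustrated without HBase. The following is a minimal sketch in plain Scala, assuming a single column family and ignoring timestamps; `Cell`, `compareBytes`, and `sortCells` are illustrative names, not HBase APIs. It sorts cells by rowkey and then by qualifier using unsigned byte-wise comparison, which is how HBase orders rowkeys.

```scala
object HFileCellOrder {
  // One cell to be written: (rowkey, qualifier, value); a single column family is assumed
  final case class Cell(rowkey: String, qualifier: String, value: String)

  // Unsigned lexicographic comparison of two byte arrays
  def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val diff = (a(i) & 0xff) - (b(i) & 0xff)
      if (diff != 0) return diff
      i += 1
    }
    // equal prefix: the shorter array sorts first
    a.length - b.length
  }

  // Order cells by rowkey first, then by qualifier
  val cellOrdering: Ordering[Cell] = new Ordering[Cell] {
    def compare(x: Cell, y: Cell): Int = {
      val byRow = compareBytes(x.rowkey.getBytes("UTF-8"), y.rowkey.getBytes("UTF-8"))
      if (byRow != 0) byRow
      else compareBytes(x.qualifier.getBytes("UTF-8"), y.qualifier.getBytes("UTF-8"))
    }
  }

  def sortCells(cells: Seq[Cell]): Seq[Cell] = cells.sorted(cellOrdering)

  def main(args: Array[String]): Unit = {
    val cells = Seq(
      Cell("2", "lon", "b"), Cell("10", "lon", "d"),
      Cell("10", "lat", "c"), Cell("2", "lat", "a"))
    sortCells(cells).foreach(println)
  }
}
```

Note that rowkey "10" sorts before "2" here, because the comparison is byte by byte rather than numeric. Numeric rowkeys therefore usually need zero-padding if numeric order matters.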

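    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hbase.{KeyValue, TableName}
    import org.apache.hadoop.hbase.client.ConnectionFactory
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.rdd.RDD

    // transform the data into (rowkey, KeyValue) pairs sorted by rowkey
    protected def transform(rdd: RDD[Evt]): RDD[(ImmutableBytesWritable, KeyValue)] = {
        rdd
          .map(x => {
            val ik = new ImmutableBytesWritable(Bytes.toBytes(x.rowkey))
            // left-pad a single-character sno with "0"
            val sn = if (x.sno.length == 1) s"0${x.sno}" else x.sno
            // fuse the fields into one CSV-formatted value stored in column f:v
            val value: KeyValue = new KeyValue(
              Bytes.toBytes(x.rowkey),
              Bytes.toBytes("f"),
              Bytes.toBytes("v"),
              Bytes.toBytes(s"${sn},,,${x.sp},${x.lon},${x.lat},${x.evt}"))
            (ik, value)
          })
          .filter(_._1.getLength != 0)
          // HFiles require rowkeys in ascending order; these pairs are not java.io.Serializable,
          // so this shuffle typically needs Kryo (spark.serializer)
          .sortBy(x => x._1, true)
    }

    // load the data into the new table
    protected def load(data: RDD[(ImmutableBytesWritable, KeyValue)], hbaseConf: Configuration, evtTableName: String, hFilePath: String) = {
        val conn = ConnectionFactory.createConnection(hbaseConf)
        val admin = conn.getAdmin
        val tableName = TableName.valueOf(evtTableName)
        val table = conn.getTable(tableName)
        val regionLocator = conn.getRegionLocator(tableName)
        val job = Job.getInstance(hbaseConf)
        // MapReduce job settings used by HFileOutputFormat2
        job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
        job.setMapOutputValueClass(classOf[KeyValue])
        job.setOutputFormatClass(classOf[HFileOutputFormat2])
        HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)

        // delete HFiles left over from a previous run
        // (HdfsUtil is the author's helper and ss the SparkSession in scope; neither is shown)
        if (HdfsUtil.exist(hFilePath, ss)) {
          println("HFile cache data delete...")
          HdfsUtil.delete(hFilePath, ss)
        }
        println("start producing HFiles ...")
        // the input must be an RDD; a Dataset is not compatible
        data.saveAsNewAPIHadoopFile(hFilePath, classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat2], job.getConfiguration)
        println("hfile write complete! next, please use distcp to mv data to new hdfs cluster ...")

        val bulkLoader = new LoadIncrementalHFiles(hbaseConf)
        bulkLoader.doBulkLoad(new Path(hFilePath), admin, table, regionLocator)
        regionLocator.close()
        table.close()
        admin.close()
        conn.close()
        if (HdfsUtil.exist(hFilePath, ss)) {
          println("HFile cache data delete...")
          HdfsUtil.delete(hFilePath, ss)
        }
    }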
    
    

HBase 2.x

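For HBase 2.x, the HBase project provides the hbase-spark connector, which lets you read an HBase table much like a Hive table and supports column pruning. For writes, it provides new HBaseContext(sc, config).bulkLoadThinRows (more efficient when each row has few columns) and rdd.hbaseBulkLoad (for rows with more than 10k columns) to generate HFiles, which doBulkLoad then loads into the target table. In most cases, use bulkLoadThinRows (fewer than 10k columns per row).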
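
The two write paths expect differently shaped input: the thin-row path takes one record per row carrying all of that row's cells, while the wide-row path takes one record per cell. The sketch below shows both shapes in plain Scala without any HBase classes. `BulkLoadShapes`, `perCell`, and `perRow` are illustrative names, strings stand in for byte arrays, and the choice to let a later value overwrite an earlier one for the same column is an assumption of this sketch.

```scala
import scala.collection.immutable.TreeMap

object BulkLoadShapes {
  // (rowkey, family, qualifier, value) as plain strings for readability
  type Cell = (String, String, String, String)

  // Wide-row shape: one record per cell, keyed by (rowkey, family, qualifier)
  def perCell(cells: Seq[Cell]): Seq[((String, String, String), String)] =
    cells.map { case (row, cf, q, v) => ((row, cf, q), v) }

  // Thin-row shape: one record per row holding all of its cells,
  // kept sorted by (family, qualifier); a later value replaces an earlier one
  def perRow(cells: Seq[Cell]): Map[String, TreeMap[(String, String), String]] =
    cells.groupBy(_._1).map { case (row, rowCells) =>
      row -> rowCells.foldLeft(TreeMap.empty[(String, String), String]) {
        case (acc, (_, cf, q, v)) => acc.updated((cf, q), v)
      }
    }

  def main(args: Array[String]): Unit = {
    val cells = Seq(
      ("3", "af", "name", "foo3"), ("3", "af", "age", "20"),
      ("1", "af", "name", "foo1"), ("3", "af", "age", "33"))
    println(s"per-cell records: ${perCell(cells).size}")
    perRow(cells).toSeq.sortBy(_._1).foreach(println)
  }
}
```

With few columns per row, the per-row shape produces far fewer records to shuffle, which is the intuition behind preferring bulkLoadThinRows in that case.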

Dependencies:

        <!-- Spark 2.x dependencies omitted -->
        <dependency>
            <groupId>org.apache.hbase.connectors.spark</groupId>
            <artifactId>hbase-spark</artifactId>
            <version>1.0.0</version>
        </dependency>
  • Reading data

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.spark.HBaseContext
    import org.apache.spark.sql.datasources.hbase.HBaseTableCatalog

    // `spark` is an existing SparkSession; evtInputTableName holds the name of the table to read
    val inputHbase: Configuration = HBaseConfiguration.create()
    inputHbase.set("hbase.zookeeper.quorum", "zk01,zk02,zk03")
    inputHbase.set("hbase.zookeeper.property.clientPort", "2181")
    inputHbase.set("fs.defaultFS", "hdfs://cluster/")
    // table to write to
    inputHbase.set("hbase.mapred.outputtable", "sp_001")
    // table to read from
    inputHbase.set("hbase.mapreduce.inputtable", "sp_001")
    inputHbase.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 2048)
    val hbaseContext = new HBaseContext(spark.sparkContext, inputHbase, null)

    // maps DataFrame columns to the rowkey and to column family / qualifier pairs
    def catalog =
      s"""{
         |"table":{"namespace":"default", "name":"${evtInputTableName}"},
         |"rowkey":"key",
         |"columns":{
         |"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
         |"sno":{"cf":"f", "col":"sno", "type":"string"},
         |"lat":{"cf":"f", "col":"lat", "type":"string"},
         |"lon":{"cf":"f", "col":"lon", "type":"string"},
         |"sp":{"cf":"f", "col":"sp", "type":"string"},
         |"evt":{"cf":"f", "col":"evt", "type":"string"}
         |}
         |}""".stripMargin

    println("reading HBase into a DataFrame ----->>>")
    val df = spark
      .read
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
      .format("org.apache.hadoop.hbase.spark")
      .load()
    //    df.show(false)
    val rowTable = df.select("rowkey", "sno", "lat", "lon", "sp", "evt")
    rowTable.show(false)
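
When a table has many columns, writing the catalog JSON by hand gets repetitive. The following is a small sketch in plain Scala that generates a catalog string of the same shape from a column list. `CatalogBuilder` and `build` are hypothetical helpers, not part of hbase-spark; the sketch assumes every column is of type string and that names contain no characters needing JSON escaping.

```scala
object CatalogBuilder {
  // One "columns" entry: DataFrame column name -> column family and qualifier
  private def entry(name: String, cf: String, col: String): String =
    s""""$name":{"cf":"$cf", "col":"$col", "type":"string"}"""

  // columns: (DataFrame column name, column family, qualifier)
  def build(namespace: String, table: String, columns: Seq[(String, String, String)]): String = {
    // the rowkey entry uses the special "rowkey" family and the "key" column
    val entries = entry("rowkey", "rowkey", "key") +:
      columns.map { case (name, cf, col) => entry(name, cf, col) }
    val body = entries.mkString(",\n")
    s"""{
       |"table":{"namespace":"$namespace", "name":"$table"},
       |"rowkey":"key",
       |"columns":{
       |$body
       |}
       |}""".stripMargin
  }

  def main(args: Array[String]): Unit = {
    val columns = Seq("sno", "lat", "lon", "sp", "evt").map(c => (c, "f", c))
    println(build("default", "sp_001", columns))
  }
}
```

The returned string could be passed as the value for HBaseTableCatalog.tableCatalog in place of a hand-written catalog.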
    
  • Writing data

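          import org.apache.hadoop.conf.Configuration
          import org.apache.hadoop.fs.Path
          import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants, TableName}
          import org.apache.hadoop.hbase.client.ConnectionFactory
          import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
          import org.apache.hadoop.hbase.spark.{ByteArrayWrapper, FamiliesQualifiersValues, HBaseContext}
          // in HBase 2.x this class lives in the tool package
          import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles
          import org.apache.hadoop.hbase.util.Bytes
          import org.apache.spark.{HashPartitioner, SparkConf}
          import org.apache.spark.sql.SparkSession

          val ss = SparkSession.builder().master("local").appName("bulk-load").config(new SparkConf()).getOrCreate()
          val sc = ss.sparkContext
          val zookeeperQuorum = "huawei:2181"

          val hdfsRootPath = "hdfs://huawei:9000"
          val hFilePath = "hdfs://huawei:9000/tmp/hfile/"

          val tableName = "test"
          val columnFamily1 = "af"

          val hadoopConf = new Configuration()
          hadoopConf.set("fs.defaultFS", hdfsRootPath)
          hadoopConf.set("dfs.client.use.datanode.hostname", "true")

          val config = HBaseConfiguration.create(hadoopConf)
          config.set(HConstants.ZOOKEEPER_QUORUM, zookeeperQuorum)
          config.set(TableOutputFormat.OUTPUT_TABLE, tableName)
          // if the amount of data to import is very large, raise this value from its default of 32 as appropriate
          config.set("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", "32")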
    
    
          val rdd = sc.parallelize(Array(
            ("1",
              (Bytes.toBytes(columnFamily1), Bytes.toBytes("name"), Bytes.toBytes("foo1"))),
            ("3",
              (Bytes.toBytes(columnFamily1), Bytes.toBytes("age"), Bytes.toBytes("20"))),
            ("3",
              (Bytes.toBytes(columnFamily1), Bytes.toBytes("name"), Bytes.toBytes("foo3"))),
            ("3",
              (Bytes.toBytes(columnFamily1), Bytes.toBytes("age"), Bytes.toBytes("33"))),
            ("5",
              (Bytes.toBytes(columnFamily1), Bytes.toBytes("name"), Bytes.toBytes("foo5"))),
            ("5",
              (Bytes.toBytes(columnFamily1), Bytes.toBytes("age"), Bytes.toBytes("45"))),
            ("4",
              (Bytes.toBytes(columnFamily1), Bytes.toBytes("age"), Bytes.toBytes("46"))),
            ("2",
              (Bytes.toBytes(columnFamily1), Bytes.toBytes("name"), Bytes.toBytes("foo2"))),
            ("2",
              (Bytes.toBytes(columnFamily1), Bytes.toBytes("age"), Bytes.toBytes("12")))))
            .groupByKey()
            .repartitionAndSortWithinPartitions(new HashPartitioner(1))
    
          val hbaseContext = new HBaseContext(sc, config)
    
          // write the HFiles
          hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte], Array[Byte])])](rdd,
            TableName.valueOf(tableName),
            t => {
              val rowKey = Bytes.toBytes(t._1)
    
              val familyQualifiersValues = new FamiliesQualifiersValues
              t._2.foreach(f => {
                val family:Array[Byte] = f._1
                val qualifier = f._2
                val value:Array[Byte] = f._3
    
                familyQualifiersValues +=(family, qualifier, value)
              })
              (new ByteArrayWrapper(rowKey), familyQualifiersValues)
            },
            hFilePath)
    
    val conn = ConnectionFactory.createConnection(config)
    val table = conn.getTable(TableName.valueOf(tableName))
    try {
      val load = new LoadIncrementalHFiles(config)
      load.doBulkLoad(new Path(hFilePath), conn.getAdmin, table,
        conn.getRegionLocator(TableName.valueOf(tableName)))
      println("Write to HBase complete!")
    } catch {
      case e: Exception =>
        println(s"Loading into HBase failed! msg: ${e}")
    } finally {
      // close the resources whether or not the load succeeded
      table.close()
      conn.close()
      // once the import is done, the HDFS staging directory should be deleted
      // HdfsUtil.delete(hFilePath, ss)
    }
    

References

Official HBase documentation: http://hbase.apache.org/book.html#_bulk_load
