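Bulk reading and writing HBase with Spark BulkLoad
When reading from or writing to HBase with Spark, do not insert rows one at a time with put: it is far too slow. Use bulk import instead. The approach differs by HBase version:
HBase 1.x
Dependencies:
<!-- Spark 2.x dependencies omitted -->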
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>1.2.7</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>1.2.7</version>
</dependency>
Reading data
Use spark.sparkContext.newAPIHadoopRDD, which performs a full table scan and reads every column.
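import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

val inputHbase: Configuration = HBaseConfiguration.create()
inputHbase.set("hbase.zookeeper.quorum", "zk01,zk02,zk03")
inputHbase.set("hbase.zookeeper.property.clientPort", "2181")
inputHbase.set("fs.defaultFS", "hdfs://cluster/")
// table to write to
inputHbase.set("hbase.mapred.outputtable", "sp_001")
// table to read from
inputHbase.set("hbase.mapreduce.inputtable", "sp_001")
inputHbase.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 2048)

// Full table scan into an RDD.
// Note: repartition shuffles (ImmutableBytesWritable, Result) pairs, which are not
// java.io.Serializable, so this typically requires spark.serializer to be set to Kryo.
val hbaseRDD = spark.sparkContext
  .newAPIHadoopRDD(inputHbase, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
  .repartition(100)

val data = hbaseRDD.map { case (_, result) =>
  val rowkey = Bytes.toString(result.getRow)
  // Kakou and Evt are the author's case classes (definitions not shown)
  val fields = classOf[Kakou].getDeclaredFields
  // One entry per declared field of Kakou, read from column family "f";
  // columns missing from the row are dropped
  val map = fields.flatMap { field =>
    val column = field.getName
    Option(result.getValue(Bytes.toBytes("f"), Bytes.toBytes(column)))
      .map(v => column -> Bytes.toString(v))
  }.toMap
  val lat = map.getOrElse("lat", "")
  val lon = map.getOrElse("lon", "")
  val passid = map.getOrElse("passid", "")
  val sno = map.getOrElse("sno", "")
  Evt(rowkey, sno, passid, lat, lon)
}
// data is an RDD, which has no show(); print a sample instead
data.take(20).foreach(println)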
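The scan above reads every column of every row. If only a few columns or a row-key range are needed, TableInputFormat can be told so through the same Configuration object. The fragment below is a minimal sketch: the column names reuse family f from the example, and the row keys row-0001 and row-9999 are made-up placeholders, not values from the original job.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val scanConf = HBaseConfiguration.create()
scanConf.set(TableInputFormat.INPUT_TABLE, "sp_001")
// Space-separated list of family:qualifier entries; only these columns are scanned
scanConf.set(TableInputFormat.SCAN_COLUMNS, "f:lat f:lon")
// Optional row-key range (start inclusive, stop exclusive); placeholder keys
scanConf.set(TableInputFormat.SCAN_ROW_START, "row-0001")
scanConf.set(TableInputFormat.SCAN_ROW_STOP, "row-9999")
```

Passing scanConf to newAPIHadoopRDD in place of inputHbase should then return only the selected cells, which reduces the data shipped to Spark.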
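Writing data
Generate HFiles by calling saveAsNewAPIHadoopFile on the RDD. Within a single HFile the row keys must be in lexicographic order, and so must the column names. No ordering is required across HFiles (they can be loaded in several batches). Then load the HFiles into the target table with new LoadIncrementalHFiles(hbaseConf).doBulkLoad.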
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD

// Transform the data
protected def transform(rdd: RDD[Evt]): RDD[(ImmutableBytesWritable, KeyValue)] = {
  rdd
    .map { x =>
      val ik = new ImmutableBytesWritable(Bytes.toBytes(x.rowkey))
      // Left-pad a single-character sno with "0"
      val sn = if (x.sno.length == 1) s"0${x.sno}" else x.sno
      // Fuse the fields into one CSV-formatted value stored in column f:v
      val value = new KeyValue(
        Bytes.toBytes(x.rowkey),
        Bytes.toBytes("f"),
        Bytes.toBytes("v"),
        Bytes.toBytes(s"$sn,,,${x.sp},${x.lon},${x.lat},${x.evt}"))
      (ik, value)
    }
    .filter(_._1.getLength != 0)
    // HFileOutputFormat2 requires row keys in ascending order
    .sortBy(_._1, ascending = true)
}

// Load the data into the new table
protected def load(data: RDD[(ImmutableBytesWritable, KeyValue)],
                   hbaseConf: Configuration,
                   evtTableName: String,
                   hFilePath: String): Unit = {
  val conn = ConnectionFactory.createConnection(hbaseConf)
  val admin = conn.getAdmin
  val tableName = TableName.valueOf(evtTableName)
  val table = conn.getTable(tableName)
  val regionLocator = conn.getRegionLocator(tableName)

  // MapReduce job that carries the HFile output settings
  val job = Job.getInstance(hbaseConf)
  job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
  job.setMapOutputValueClass(classOf[KeyValue])
  job.setOutputFormatClass(classOf[HFileOutputFormat2])
  HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)

  // Delete any leftover HFile staging directory.
  // HdfsUtil is the author's helper and ss the enclosing SparkSession; neither is shown.
  if (HdfsUtil.exist(hFilePath, ss)) {
    println("HFile cache data delete...")
    HdfsUtil.delete(hFilePath, ss)
  }
  println("start producing HFiles ...")
  // The input must be an RDD; a Dataset is not supported
  data.saveAsNewAPIHadoopFile(hFilePath,
    classOf[ImmutableBytesWritable],
    classOf[KeyValue],
    classOf[HFileOutputFormat2],
    job.getConfiguration)
  println("hfile write complete! next, please use distcp to mv data to new hdfs cluster ...")

  val bulkLoader = new LoadIncrementalHFiles(hbaseConf)
  bulkLoader.doBulkLoad(new Path(hFilePath), admin, table, regionLocator)
  table.close()
  conn.close()

  if (HdfsUtil.exist(hFilePath, ss)) {
    println("HFile cache data delete...")
    HdfsUtil.delete(hFilePath, ss)
  }
}
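The ordering requirement stated above (row keys in lexicographic order, then column names in lexicographic order) refers to unsigned byte order, not locale-aware string order. The following standalone sketch illustrates that ordering in plain Scala without any HBase dependency. HFileOrdering, Cell and sortForHFile are hypothetical names introduced for illustration, and a single fixed column family is assumed.

```scala
import scala.annotation.tailrec

object HFileOrdering {
  // Unsigned lexicographic comparison of two byte arrays
  def compareBytes(a: Array[Byte], b: Array[Byte]): Int = {
    @tailrec
    def loop(i: Int): Int =
      if (i >= a.length || i >= b.length) a.length - b.length
      else {
        val diff = (a(i) & 0xff) - (b(i) & 0xff)
        if (diff != 0) diff else loop(i + 1)
      }
    loop(0)
  }

  // Hypothetical cell: the column family is assumed to be fixed
  final case class Cell(rowKey: String, qualifier: String, value: String)

  // Order by row key first, then by qualifier, both as UTF-8 bytes
  implicit val cellOrdering: Ordering[Cell] = new Ordering[Cell] {
    def compare(x: Cell, y: Cell): Int = {
      val byRow = compareBytes(x.rowKey.getBytes("UTF-8"), y.rowKey.getBytes("UTF-8"))
      if (byRow != 0) byRow
      else compareBytes(x.qualifier.getBytes("UTF-8"), y.qualifier.getBytes("UTF-8"))
    }
  }

  def sortForHFile(cells: Seq[Cell]): Seq[Cell] = cells.sorted
}
```

Note that row key "10" sorts before "2" under this ordering, which is why numeric row keys are often zero-padded.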
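HBase 2.x
For HBase 2.x, the HBase project provides the hbase-spark connector, which lets you read an HBase table much like a Hive table and supports column pruning. For writes, use either new HBaseContext(sc, config).bulkLoadThinRows (more efficient when rows have few columns) or rdd.hbaseBulkLoad (for rows with more than 10k columns) to generate HFiles, then call doBulkLoad to load them into the new table. In most cases bulkLoadThinRows (fewer than 10k columns per row) is the right choice.
Dependencies:
<!-- Spark 2.x dependencies omitted -->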
<dependency>
<groupId>org.apache.hbase.connectors.spark</groupId>
<artifactId>hbase-spark</artifactId>
<version>1.0.0</version>
</dependency>
Reading data
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.spark.datasources.HBaseTableCatalog

val inputHbase: Configuration = HBaseConfiguration.create()
inputHbase.set("hbase.zookeeper.quorum", "zk01,zk02,zk03")
inputHbase.set("hbase.zookeeper.property.clientPort", "2181")
inputHbase.set("fs.defaultFS", "hdfs://cluster/")
// table to write to
inputHbase.set("hbase.mapred.outputtable", "sp_001")
// table to read from
inputHbase.set("hbase.mapreduce.inputtable", "sp_001")
inputHbase.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 2048)

val hbaseContext = new HBaseContext(spark.sparkContext, inputHbase, null)

// evtInputTableName is the name of the table to read (defined elsewhere)
def catalog = s"""{
  |"table":{"namespace":"default", "name":"${evtInputTableName}"},
  |"rowkey":"key",
  |"columns":{
  |"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
  |"sno":{"cf":"f", "col":"sno", "type":"string"},
  |"lat":{"cf":"f", "col":"lat", "type":"string"},
  |"lon":{"cf":"f", "col":"lon", "type":"string"},
  |"sp":{"cf":"f", "col":"sp", "type":"string"},
  |"evt":{"cf":"f", "col":"evt", "type":"string"}
  |}
  |}""".stripMargin

println("reading HBase into a DataFrame ----->>>")
val df = spark
  .read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.hadoop.hbase.spark")
  .load()
// df.show(false)
val rowTable = df.select("rowkey", "sno", "lat", "lon", "sp", "evt")
rowTable.show(false)
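The catalog JSON above repeats the same entry shape for every column, so it can be generated from a list of column names. The sketch below is one possible helper, assuming every column is a string in a single column family. CatalogBuilder and buildCatalog are hypothetical names and are not part of hbase-spark.

```scala
object CatalogBuilder {
  // Builds a catalog JSON string in the same shape as the hand-written one,
  // assuming all columns are strings stored in one column family.
  def buildCatalog(namespace: String, table: String, family: String, columns: Seq[String]): String = {
    val rowkeyEntry = """"rowkey":{"cf":"rowkey", "col":"key", "type":"string"}"""
    val columnEntries = columns.map { c =>
      s""""$c":{"cf":"$family", "col":"$c", "type":"string"}"""
    }
    val body = (rowkeyEntry +: columnEntries).mkString(",\n    ")
    s"""{
       |  "table":{"namespace":"$namespace", "name":"$table"},
       |  "rowkey":"key",
       |  "columns":{
       |    $body
       |  }
       |}""".stripMargin
  }
}
```

For example, CatalogBuilder.buildCatalog("default", "sp_001", "f", Seq("sno", "lat", "lon", "sp", "evt")) produces a catalog with the same entries as the one written by hand.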
Writing data
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.spark.{ByteArrayWrapper, FamiliesQualifiersValues, HBaseContext}
// In HBase 2.x the class lives in the tool package (the mapreduce one is deprecated)
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.sql.SparkSession

val ss = SparkSession.builder().master("local").appName("bulk-load").config(new SparkConf()).getOrCreate()
val sc = ss.sparkContext
val zookeeperQuorum = "huawei:2181"
val hdfsRootPath = "hdfs://huawei:9000"
val hFilePath = "hdfs://huawei:9000/tmp/hfile/"
val tableName = "test"
val columnFamily1 = "af"

val hadoopConf = new Configuration()
hadoopConf.set("fs.defaultFS", hdfsRootPath)
hadoopConf.set("dfs.client.use.datanode.hostname", "true")
val config = HBaseConfiguration.create(hadoopConf)
config.set(HConstants.ZOOKEEPER_QUORUM, zookeeperQuorum)
config.set(TableOutputFormat.OUTPUT_TABLE, tableName)
// If the import is very large, raise this from its default of 32
config.set("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", "32")

// Sample cells as (rowKey, (family, qualifier, value)), grouped by row key and sorted
val rdd = sc.parallelize(Array(
    ("1", (Bytes.toBytes(columnFamily1), Bytes.toBytes("name"), Bytes.toBytes("foo1"))),
    ("3", (Bytes.toBytes(columnFamily1), Bytes.toBytes("age"), Bytes.toBytes("20"))),
    ("3", (Bytes.toBytes(columnFamily1), Bytes.toBytes("name"), Bytes.toBytes("foo3"))),
    ("3", (Bytes.toBytes(columnFamily1), Bytes.toBytes("age"), Bytes.toBytes("33"))),
    ("5", (Bytes.toBytes(columnFamily1), Bytes.toBytes("name"), Bytes.toBytes("foo5"))),
    ("5", (Bytes.toBytes(columnFamily1), Bytes.toBytes("age"), Bytes.toBytes("45"))),
    ("4", (Bytes.toBytes(columnFamily1), Bytes.toBytes("age"), Bytes.toBytes("46"))),
    ("2", (Bytes.toBytes(columnFamily1), Bytes.toBytes("name"), Bytes.toBytes("foo2"))),
    ("2", (Bytes.toBytes(columnFamily1), Bytes.toBytes("age"), Bytes.toBytes("12")))))
  .groupByKey()
  .repartitionAndSortWithinPartitions(new HashPartitioner(1))

val hbaseContext = new HBaseContext(sc, config)
// Write the HFiles
hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte], Array[Byte])])](
  rdd,
  TableName.valueOf(tableName),
  t => {
    val rowKey = Bytes.toBytes(t._1)
    val familyQualifiersValues = new FamiliesQualifiersValues
    t._2.foreach { case (family, qualifier, value) =>
      familyQualifiersValues += (family, qualifier, value)
    }
    (new ByteArrayWrapper(rowKey), familyQualifiersValues)
  },
  hFilePath)

try {
  val conn = ConnectionFactory.createConnection(config)
  val load = new LoadIncrementalHFiles(config)
  val table = conn.getTable(TableName.valueOf(tableName))
  load.doBulkLoad(new Path(hFilePath), conn.getAdmin, table,
    conn.getRegionLocator(TableName.valueOf(tableName)))
  println("write to HBase complete!")
  table.close()
  conn.close()
} catch {
  case e: Exception =>
    println(s"loading into HBase failed! msg: ${e}")
} finally {
  // Once the load is done, delete the HDFS staging directory
  // HdfsUtil.delete(hFilePath, ss)
}
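The examples call HdfsUtil.exist and HdfsUtil.delete to clean up the HFile staging directory, but that helper is not shown. A minimal stand-in using the Hadoop FileSystem API could look like the sketch below. StagingDir and deleteIfExists are hypothetical names, and this is not the author's HdfsUtil.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object StagingDir {
  // Recursively deletes dir if it exists.
  // Returns true only when something was actually deleted.
  def deleteIfExists(dir: String, conf: Configuration): Boolean = {
    val path = new Path(dir)
    // Resolves the file system from the path's scheme (hdfs://, file://, ...)
    val fs = path.getFileSystem(conf)
    if (fs.exists(path)) fs.delete(path, true) else false
  }
}
```

Calling StagingDir.deleteIfExists(hFilePath, config) before generating HFiles and again after doBulkLoad would mirror the cleanup steps in the examples.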
References
HBase official documentation: http://hbase.apache.org/book.html#_bulk_load





