一、现象
1、hbase操作偶尔有延迟,写入数据时需要4000+ms。
2、hbase master日志异常:
2021-09-06 08:30:43,140 INFO [PEWorker-7] procedure.MasterProcedureScheduler: Took xlock for pid=9516, state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure table=idm:t_user_behavior_api_log_202109, parent=1c9a9a43f5d1bc0581f66fa5ba39f17a, daughterA=81a2bf91af4ffc731e0e734b91be05e1, daughterB=3b7540c5ef2a5ec320e9376038120984
2021-09-06 08:30:43,305 INFO [PEWorker-7] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=9517, ppid=9516, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=idm:t_user_behavior_api_log_202109, region=1c9a9a43f5d1bc0581f66fa5ba39f17a, override=true, server=hbaseprodnode03.yummy.tech,16020,1630587697108}]
2021-09-06 08:30:43,347 INFO [PEWorker-14] procedure.MasterProcedureScheduler: Took xlock for pid=9517, ppid=9516, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=idm:t_user_behavior_api_log_202109, region=1c9a9a43f5d1bc0581f66fa5ba39f17a, override=true, server=hbaseprodnode03.yummy.tech,16020,1630587697108
2021-09-06 08:30:43,388 INFO [PEWorker-14] assignment.RegionStateStore: pid=9517 updating hbase:meta row=1c9a9a43f5d1bc0581f66fa5ba39f17a, regionState=CLOSING, regionLocation=hbaseprodnode03.yummy.tech,16020,1630587697108
2021-09-06 08:30:43,390 INFO [PEWorker-14] assignment.RegionTransitionProcedure: Dispatch pid=9517, ppid=9516, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure table=idm:t_user_behavior_api_log_202109, region=1c9a9a43f5d1bc0581f66fa5ba39f17a, override=true, server=hbaseprodnode03.yummy.tech,16020,1630587697108
2021-09-06 08:30:43,541 INFO [RSProcedureDispatcher-pool3-t627] procedure.RSProcedureDispatcher: submitting region close {ENCODED => 1c9a9a43f5d1bc0581f66fa5ba39f17a, NAME => 'idm:t_user_behavior_api_log_202109,,1624989656062.1c9a9a43f5d1bc0581f66fa5ba39f17a.', STARTKEY => '', ENDKEY => ''} to hbaseprodnode03.yummy.tech,16020,1630587697108
2021-09-06 08:30:43,662 INFO [PEWorker-15] assignment.RegionStateStore: pid=9517 updating hbase:meta row=1c9a9a43f5d1bc0581f66fa5ba39f17a, regionState=CLOSED
2021-09-06 08:30:43,830 INFO [PEWorker-15] procedure2.ProcedureExecutor: Finished subprocedure pid=9517, resume processing parent pid=9516, state=RUNNABLE:SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS, locked=true; SplitTableRegionProcedure table=idm:t_user_behavior_api_log_202109, parent=1c9a9a43f5d1bc0581f66fa5ba39f17a, daughterA=81a2bf91af4ffc731e0e734b91be05e1, daughterB=3b7540c5ef2a5ec320e9376038120984
2021-09-06 08:30:43,830 INFO [PEWorker-15] procedure2.ProcedureExecutor: Finished pid=9517, ppid=9516, state=SUCCESS; UnassignProcedure table=idm:t_user_behavior_api_log_202109, region=1c9a9a43f5d1bc0581f66fa5ba39f17a, override=true, server=hbaseprodnode03.yummy.tech,16020,1630587697108 in 359msec, unfinishedSiblingCount=0
2021-09-06 08:30:43,928 INFO [PEWorker-9] assignment.SplitTableRegionProcedure: pid=9516 splitting 3 storefiles, region=1c9a9a43f5d1bc0581f66fa5ba39f17a, threads=3
2021-09-06 08:30:43,928 INFO [PEWorker-9] util.ByteBufferArray: Allocating buffers total=16 GB, sizePerBuffer=4 MB, count=4096
2021-09-06 08:30:45,499 ERROR [PEWorker-9] util.ByteBufferArray: Buffer creation interrupted
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Direct buffer memory
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hbase.util.ByteBufferArray.createBuffers(ByteBufferArray.java:96)
at org.apache.hadoop.hbase.util.ByteBufferArray.<init>(ByteBufferArray.java:74)
at org.apache.hadoop.hbase.io.hfile.bucket.ByteBufferIOEngine.<init>(ByteBufferIOEngine.java:86)
at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.getIOEngineFromName(BucketCache.java:384)
at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.<init>(BucketCache.java:262)
at org.apache.hadoop.hbase.io.hfile.CacheConfig.getBucketCache(CacheConfig.java:655)
at org.apache.hadoop.hbase.io.hfile.CacheConfig.instantiateBlockCache(CacheConfig.java:685)
at org.apache.hadoop.hbase.io.hfile.CacheConfig.<init>(CacheConfig.java:218)
at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.splitStoreFiles(SplitTableRegionProcedure.java:653)
at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.createDaughterRegions(SplitTableRegionProcedure.java:574)
at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:267)
at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:92)
at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:184)
at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:958)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1836)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1596)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:80)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2141)
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:694)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at org.apache.hadoop.hbase.io.hfile.bucket.ByteBufferIOEngine$1.allocate(ByteBufferIOEngine.java:83)
at org.apache.hadoop.hbase.util.ByteBufferArray$BufferCreatorCallable.call(ByteBufferArray.java:136)
at org.apache.hadoop.hbase.util.ByteBufferArray$BufferCreatorCallable.call(ByteBufferArray.java:121)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2021-09-06 08:30:45,500 ERROR [PEWorker-9] hfile.CacheConfig: Can't instantiate bucket cache
java.io.IOException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Direct buffer memory
at org.apache.hadoop.hbase.util.ByteBufferArray.createBuffers(ByteBufferArray.java:102)
at org.apache.hadoop.hbase.util.ByteBufferArray.<init>(ByteBufferArray.java:74)
at org.apache.hadoop.hbase.io.hfile.bucket.ByteBufferIOEngine.<init>(ByteBufferIOEngine.java:86)
at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.getIOEngineFromName(BucketCache.java:384)
at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.<init>(BucketCache.java:262)
at org.apache.hadoop.hbase.io.hfile.CacheConfig.getBucketCache(CacheConfig.java:655)
at org.apache.hadoop.hbase.io.hfile.CacheConfig.instantiateBlockCache(CacheConfig.java:685)
at org.apache.hadoop.hbase.io.hfile.CacheConfig.<init>(CacheConfig.java:218)
at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.splitStoreFiles(SplitTableRegionProcedure.java:653)
at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.createDaughterRegions(SplitTableRegionProcedure.java:574)
at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:267)
at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:92)
at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:184)
at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:958)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1836)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1596)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:80)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2141)
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Direct buffer memory
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hbase.util.ByteBufferArray.createBuffers(ByteBufferArray.java:96)
... 17 more
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:694)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at org.apache.hadoop.hbase.io.hfile.bucket.ByteBufferIOEngine$1.allocate(ByteBufferIOEngine.java:83)
at org.apache.hadoop.hbase.util.ByteBufferArray$BufferCreatorCallable.call(ByteBufferArray.java:136)
at org.apache.hadoop.hbase.util.ByteBufferArray$BufferCreatorCallable.call(ByteBufferArray.java:121)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2021-09-06 08:30:45,500 ERROR [PEWorker-9] procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception: pid=9516, state=RUNNABLE:SPLIT_TABLE_REGION_CREATE_DAUGHTER_REGIONS, locked=true; SplitTableRegionProcedure table=idm:t_user_behavior_api_log_202109, parent=1c9a9a43f5d1bc0581f66fa5ba39f17a, daughterA=81a2bf91af4ffc731e0e734b91be05e1, daughterB=3b7540c5ef2a5ec320e9376038120984
java.lang.RuntimeException: java.io.IOException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Direct buffer memory
at org.apache.hadoop.hbase.io.hfile.CacheConfig.getBucketCache(CacheConfig.java:659)
at org.apache.hadoop.hbase.io.hfile.CacheConfig.instantiateBlockCache(CacheConfig.java:685)
at org.apache.hadoop.hbase.io.hfile.CacheConfig.<init>(CacheConfig.java:218)
at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.splitStoreFiles(SplitTableRegionProcedure.java:653)
at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.createDaughterRegions(SplitTableRegionProcedure.java:574)
at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:267)
at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:92)
at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:184)
at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:958)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1836)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1596)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1200(ProcedureExecutor.java:80)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2141)
Caused by: java.io.IOException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Direct buffer memory
at org.apache.hadoop.hbase.util.ByteBufferArray.createBuffers(ByteBufferArray.java:102)
at org.apache.hadoop.hbase.util.ByteBufferArray.<init>(ByteBufferArray.java:74)
at org.apache.hadoop.hbase.io.hfile.bucket.ByteBufferIOEngine.<init>(ByteBufferIOEngine.java:86)
at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.getIOEngineFromName(BucketCache.java:384)
at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.<init>(BucketCache.java:262)
at org.apache.hadoop.hbase.io.hfile.CacheConfig.getBucketCache(CacheConfig.java:655)
... 12 more
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Direct buffer memory
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hbase.util.ByteBufferArray.createBuffers(ByteBufferArray.java:96)
... 17 more
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:694)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at org.apache.hadoop.hbase.io.hfile.bucket.ByteBufferIOEngine$1.allocate(ByteBufferIOEngine.java:83)
at org.apache.hadoop.hbase.util.ByteBufferArray$BufferCreatorCallable.call(ByteBufferArray.java:136)
at org.apache.hadoop.hbase.util.ByteBufferArray$BufferCreatorCallable.call(ByteBufferArray.java:121)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2021-09-06 08:30:45,838 INFO [PEWorker-5] procedure.MasterProcedureScheduler: Waiting on xlock for pid=9518, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=idm:t_user_behavior_api_log_202109, region=1c9a9a43f5d1bc0581f66fa5ba39f17a, target=hbaseprodnode03.yummy.tech,16020,1630587697108 held by pid=9516
2021-09-06 08:30:46,105 INFO [PEWorker-9] procedure2.ProcedureExecutor: Rolled back pid=9516, state=ROLLEDBACK, exception=java.lang.RuntimeException via CODE-BUG: Uncaught runtime exception: pid=9516, state=RUNNABLE:SPLIT_TABLE_REGION_CREATE_DAUGHTER_REGIONS, locked=true; SplitTableRegionProcedure table=idm:t_user_behavior_api_log_202109, parent=1c9a9a43f5d1bc0581f66fa5ba39f17a, daughterA=81a2bf91af4ffc731e0e734b91be05e1, daughterB=3b7540c5ef2a5ec320e9376038120984:java.lang.RuntimeException: java.io.IOException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Direct buffer memory; SplitTableRegionProcedure table=idm:t_user_behavior_api_log_202109, parent=1c9a9a43f5d1bc0581f66fa5ba39f17a, daughterA=81a2bf91af4ffc731e0e734b91be05e1, daughterB=3b7540c5ef2a5ec320e9376038120984 exec-time=3.3590sec
2021-09-06 08:30:46,105 INFO [PEWorker-10] procedure.MasterProcedureScheduler: Took xlock for pid=9518, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=idm:t_user_behavior_api_log_202109, region=1c9a9a43f5d1bc0581f66fa5ba39f17a, target=hbaseprodnode03.yummy.tech,16020,1630587697108
2021-09-06 08:30:46,146 INFO [PEWorker-10] assignment.AssignProcedure: Starting pid=9518, state=RUNNABLE:REGION_TRANSITION_QUEUE, locked=true; AssignProcedure table=idm:t_user_behavior_api_log_202109, region=1c9a9a43f5d1bc0581f66fa5ba39f17a, target=hbaseprodnode03.yummy.tech,16020,1630587697108; rit=OFFLINE, location=hbaseprodnode03.yummy.tech,16020,1630587697108; forceNewPlan=false, retain=false target svr=hbaseprodnode03.yummy.tech,16020,1630587697108
2021-09-06 08:30:46,297 INFO [master/hbaseprodnode01:16000] balancer.BaseLoadBalancer: Reassigned 1 regions. 1 retained the pre-restart assignment.
2021-09-06 08:30:46,297 INFO [PEWorker-4] assignment.RegionStateStore: pid=9518 updating hbase:meta row=1c9a9a43f5d1bc0581f66fa5ba39f17a, regionState=OPENING, regionLocation=hbaseprodnode03.yummy.tech,16020,1630587697108
2021-09-06 08:30:46,299 INFO [PEWorker-4] assignment.RegionTransitionProcedure: Dispatch pid=9518, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; AssignProcedure table=idm:t_user_behavior_api_log_202109, region=1c9a9a43f5d1bc0581f66fa5ba39f17a, target=hbaseprodnode03.yummy.tech,16020,1630587697108
2021-09-06 08:30:46,571 INFO [PEWorker-13] assignment.RegionStateStore: pid=9518 updating hbase:meta row=1c9a9a43f5d1bc0581f66fa5ba39f17a, regionState=OPEN, openSeqNum=1917411, regionLocation=hbaseprodnode03.yummy.tech,16020,1630587697108
2021-09-06 08:30:46,669 INFO [PEWorker-13] procedure2.ProcedureExecutor: Finished pid=9518, state=SUCCESS; AssignProcedure table=idm:t_user_behavior_api_log_202109, region=1c9a9a43f5d1bc0581f66fa5ba39f17a, target=hbaseprodnode03.yummy.tech,16020,1630587697108 in 776msec
二、调查思路
1、调查异常“java.lang.OutOfMemoryError: Direct buffer memory“,
参考文章HBase Direct Memory OutOfMemory
调整了-XX:MaxDirectMemorySize=20480m
因为ambari里master和regionserver是同一个配置,所以第一次统一修改了这个值。
hbase服务重启后,master服务异常终止,
把master的配置还原,去掉-XX:MaxDirectMemorySize后,hmaster服务恢复正常。
2、异常依然存在,查找hbase内存配置相关资料,都是讲region server如何设置内存,
当前遇到的这个异常是hmaster报的,所以没有从网上找到有参考价值的内容。
3、通过异常日志的堆栈,查看源码。
SplitTableRegionProcedure.splitStoreFiles时,要获取缓存,缓存的大小是二级缓存配置的:hbase.bucketcache.size的大小,配置的是16G,master本身只有16G,所以申请时内存不够导致异常。
升级master内存到32G以后,还需要配置-XX:MaxDirectMemorySize=18G。
4、导致操作有延迟的原因就是当读写的表正在做split时,要先lock该表,等结束时unlock,在split有异常的情况下,会有3秒中左右的时间这个表是不能用的,跟延迟时间相符。
三、结论
hbase.bucketcache.size不止是L2级缓存,在splitTableRegion时也会被HMaster当作缓存来用,所以HMaster的配置也不能太低(为了节省成本,HMaster机器的配置是普通机器,16G内存,而Region server都是高配64G内存)。

本文记录了一次HBase性能调优的过程,针对HMaster在进行表分裂时出现的内存溢出错误进行了深入分析,并提出了相应的解决方案。

7423

被折叠的 条评论
为什么被折叠?



