I have a number of tables (with 100 million-ish rows) that are stored as external Hive tables using Parquet format. The Spark job needs to join several of them together, using a single column, with almost no filtering. The join column has unique values about 2/3X fewer than the number of rows.
I can see that there are shuffles happening by the join key; and I have been trying to utilize bucketing/partitioning to improve join performance. My thought is that if Spark can be made aware that each of these tables has been bucketed using the same column, it can load the dataframes and join them without shuffling. I have tried using Hive bucketing, but the shuffles don't go away. (From Spark's documentation it looks like Hive bucketing is not supported as of Spark 2.3.0 at least, which I found out later.) Can I use Spark's bucketing feature to do this? If yes, would I have to disable Hive support and just read the files directly? Or could I rewrite the tables once using Spark's bucketing scheme and still be able to read them as Hive tables?
EDIT: For writing out the Hive bucketed tables I was using something like:
customerDF
.write
.option("path", "/some/path")
.mode("overwrite")
.format("parquet")
.bucketBy(200, "customer_key")
.sortBy("customer_key")
.saveAsTable("table_name")
giftDF.write
.partitionBy("plat","date")
.mode(SaveMode.Append)
.bucketBy(20,"room_id")
.parquet(path)
本文探讨了在处理大规模数据集时,如何通过合理利用Spark的bucketing特性来优化多个大型表格(约1亿行)的Join操作性能。作者尝试了使用Hive的bucketing策略,但未见成效,转而寻求Spark自身bucketing机制的应用,旨在减少shuffle过程,提高处理效率。

1024

被折叠的 条评论
为什么被折叠?



