We trace the flow starting from a spark-sql invocation:
[hadoop@10 ~]$ spark-sql --master local \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hadoop \
--conf spark.sql.catalog.spark_catalog.warehouse=hdfs://ns1/user/wanghongbing/db
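To make the flags above concrete: the key spark.sql.catalog.spark_catalog names the implementation class for the catalog called spark_catalog, and every key of the form spark.sql.catalog.spark_catalog.<key> is handed to that catalog as an option (here, type and warehouse). The following is a minimal plain-Java sketch of that key convention. CatalogConfSketch and its methods are hypothetical names used only for illustration; they are not part of Spark or Iceberg, and the real lookup lives inside Spark.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that mimics how catalog settings are grouped by key prefix.
// Illustration only; this is not Spark's actual implementation.
final class CatalogConfSketch {
    static final String PREFIX = "spark.sql.catalog.";

    // Returns the implementation class configured for the catalog, or null if absent.
    static String implementationClass(Map<String, String> conf, String catalogName) {
        return conf.get(PREFIX + catalogName);
    }

    // Collects "spark.sql.catalog.<name>.<key>" entries into a <key> -> value map.
    static Map<String, String> options(Map<String, String> conf, String catalogName) {
        String optionPrefix = PREFIX + catalogName + ".";
        Map<String, String> options = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : conf.entrySet()) {
            if (entry.getKey().startsWith(optionPrefix)) {
                options.put(entry.getKey().substring(optionPrefix.length()), entry.getValue());
            }
        }
        return options;
    }

    // The --conf values from the spark-sql command above.
    static Map<String, String> exampleConf() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions");
        conf.put("spark.sql.catalog.spark_catalog",
                "org.apache.iceberg.spark.SparkSessionCatalog");
        conf.put("spark.sql.catalog.spark_catalog.type", "hadoop");
        conf.put("spark.sql.catalog.spark_catalog.warehouse",
                "hdfs://ns1/user/wanghongbing/db");
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = exampleConf();
        System.out.println(implementationClass(conf, "spark_catalog"));
        System.out.println(options(conf, "spark_catalog"));
    }
}
```

Note that spark.sql.extensions is not picked up as a catalog option, because it does not share the spark.sql.catalog.spark_catalog. prefix.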
The flow is as follows:

Key classes in the Spark package:
- org.apache.spark.sql.connector.catalog.CatalogManager
- org.apache.spark.sql.connector.catalog.Catalogs
Corresponding classes in the Iceberg package:
- org.apache.iceberg.spark.SparkSessionCatalog
- org.apache.iceberg.spark.SparkCatalog

# CatalogManager
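def catalog(name: String): CatalogPlugin = synchronized {
  if (name.equalsIgnoreCase(SESSION_CATALOG_NAME)) {
    // The reserved name "spark_catalog" always resolves to the session catalog.
    v2SessionCatalog
  } else {
    // Any other catalog is loaded from the configuration once, then cached by name.
    catalogs.getOrElseUpdate(name, Catalogs.load(name, conf))
  }
}

private[sql] object CatalogManager {
  val SESSION_CATALOG_NAME: String = "spark_catalog"
}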
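The lookup above has two branches: the reserved name spark_catalog (compared case-insensitively) goes to the session catalog, and any other name is loaded once and then served from a cache. Below is a plain-Java sketch of that dispatch. CatalogLookupSketch is a hypothetical class, plain strings stand in for real CatalogPlugin instances, and the loader function stands in for Catalogs.load(name, conf); none of this is Spark's actual code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the lookup logic; strings replace CatalogPlugin instances.
final class CatalogLookupSketch {
    static final String SESSION_CATALOG_NAME = "spark_catalog";

    private final Map<String, String> catalogs = new HashMap<>();
    private final String sessionCatalog;            // stands in for v2SessionCatalog
    private final Function<String, String> loader;  // stands in for Catalogs.load(name, conf)
    int loadCount = 0;                              // number of times the loader actually ran

    CatalogLookupSketch(String sessionCatalog, Function<String, String> loader) {
        this.sessionCatalog = sessionCatalog;
        this.loader = loader;
    }

    synchronized String catalog(String name) {
        if (name.equalsIgnoreCase(SESSION_CATALOG_NAME)) {
            return sessionCatalog;
        }
        // Load on first use, then reuse the cached instance for this exact name.
        return catalogs.computeIfAbsent(name, n -> {
            loadCount++;
            return loader.apply(n);
        });
    }

    public static void main(String[] args) {
        CatalogLookupSketch manager =
                new CatalogLookupSketch("session", name -> "loaded:" + name);
        System.out.println(manager.catalog("SPARK_CATALOG"));
        System.out.println(manager.catalog("my_catalog"));
        System.out.println(manager.catalog("my_catalog"));
        System.out.println("loader invocations: " + manager.loadCount);
    }
}
```

In the command at the top, the configured catalog is named spark_catalog, so it takes the first branch: it is reached through the session catalog path rather than through the generic cache.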
# SparkSessionCatalog
/**
 * A Spark catalog that can also load non-Iceberg tables.
 *
 * @param <T> CatalogPlugin class to avoid casting to TableCatalog and SupportsNamespaces.
 */
public class SparkSessionCatalog<T extends TableCatalog & SupportsNamespaces>
    extends BaseCatalog implements CatalogExtension {
Here, org.apache.iceberg.spark.SparkSessionCatalog implements org.apache.spark.sql.connector.catalog.CatalogExtension and, through it, CatalogPlugin.
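The class comment says this catalog "can also load non-Iceberg tables". One way to picture that is a catalog that first asks the Iceberg catalog and, when the table is not found there, falls back to the built-in session catalog that Spark hands to it as a delegate. The sketch below models only that idea in plain Java. SessionCatalogSketch, TableLoader and MapTableLoader are hypothetical simplifications, table names and descriptions are made-up sample data, and the real Spark and Iceberg interfaces are considerably richer.

```java
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical wrapper: try the Iceberg catalog first, then fall back to the delegate.
final class SessionCatalogSketch implements TableLoader {
    private final TableLoader icebergCatalog;
    private TableLoader delegate; // injected after construction, in the spirit of a delegate catalog

    SessionCatalogSketch(TableLoader icebergCatalog) {
        this.icebergCatalog = icebergCatalog;
    }

    void setDelegate(TableLoader delegate) {
        this.delegate = delegate;
    }

    @Override
    public String loadTable(String name) {
        try {
            return icebergCatalog.loadTable(name);
        } catch (NoSuchElementException notAnIcebergTable) {
            // Not known to Iceberg: let the built-in session catalog try.
            return delegate.loadTable(name);
        }
    }

    public static void main(String[] args) {
        SessionCatalogSketch catalog = new SessionCatalogSketch(
                new MapTableLoader(Map.of("db.events", "iceberg table")));
        catalog.setDelegate(new MapTableLoader(Map.of("db.legacy", "non-Iceberg table")));
        System.out.println(catalog.loadTable("db.events"));
        System.out.println(catalog.loadTable("db.legacy"));
    }
}

// Hypothetical, simplified stand-in for a catalog that can load a table by name.
interface TableLoader {
    String loadTable(String name); // throws NoSuchElementException when the table is unknown
}

// Map-backed loader, present only to make the sketch runnable.
final class MapTableLoader implements TableLoader {
    private final Map<String, String> tables;

    MapTableLoader(Map<String, String> tables) {
        this.tables = tables;
    }

    @Override
    public String loadTable(String name) {
        String table = tables.get(name);
        if (table == null) {
            throw new NoSuchElementException("Table not found: " + name);
        }
        return table;
    }
}
```

The design point is that the wrapper keeps the default catalog reachable: tables that Iceberg does not know about remain queryable under the same catalog name.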

Summary: the Spark package defines the catalog interfaces, and Iceberg provides the implementations.
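This article describes how Spark SQL works with the Iceberg library, covering the key classes CatalogManager and SparkSessionCatalog and how they interact, with particular attention to configuring and accessing non-Iceberg tables.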




