【Spark数仓项目】需求八:MySQL的DataX全量导入和增量导入Hive
文章目录
一、mysql全量导入hive[分区表]
需求介绍:
本需求将模拟从MySQL中向Hive数仓中导入数据,数据以时间分区。测试两种导入场景,一种是将数据全量导入,即包含所有时间分区;另一种是每天运行调度,仅导入当天时间分区中的用户数据。
- mysql表建表语句:
create table t_order(
id int primary key auto_increment,
amt decimal(10,2),
`status` int default 0,
user_id int,
create_time timestamp DEFAULT CURRENT_TIMESTAMP,
modify_time timestamp DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
)
- hive
create table t_order(
id int,
amt decimal(10,2),
`status` int,
user_id int,
create_time date,
modify_time date
)partitioned by (dt string)
row format delimited
fields terminated by '\t'
注意字段时间戳,我们将从以上MySQL向Hive导入数据。
- 编写datax的json脚本
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"connection": [
{
"jdbcUrl": ["jdbc:mysql://hadoop10:3306/spark-dw"],
"querySql": [
"select id,amt,status,user_id,create_time,modify_time from t_order"
]
}
],
"password": "0000",
"username": "root",
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"column": [
{
"name": "id","type": "int"},
{
"name": "amt","type": "double"},
{
"name": "status","type": "int"},
{
"name": "user_id","type": "int"},
{
"name": "create_time","type": "string"},
{
"name": "modify_time","type": "string"}
],
"defaultFS": "hdfs://hadoop10:8020",
"fieldDelimiter": "\t",
"fileName": "t_order"

本文介绍了如何使用DataX从MySQL全量导入和增量导入数据至Hive分区表,包括通过时间字段实现增量同步,并展示了Python脚本自动生成DataXjson配置文件的方法。

4020

被折叠的 条评论
为什么被折叠?



