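Maxwell (1.44.0): Testing How to Pause Capture for Specific Tables
Before reading this document, it helps to read the author's earlier Maxwell deployment write-up, "Maxwell (1.44.0) Deployment Guide", which provides the background for what follows.
I. Preparation
1.1 Write the configuration file (config.properties)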
# MySQL settings
host=xx.xx.xx.xxx
user=maxwell
password=xxxxxx
# Kafka producer settings
producer=kafka
kafka.bootstrap.servers=xx.xx.xx.xx1:9092,xx.xx.xx.xx2:9092,xx.xx.xx.xx3:9092
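# ✅ Dynamic topic naming rule: maxwell_t_{generator}
# (this value is only the default topic; filter.js sets row.kafka_topic per row)
kafka_topic=maxwell
# ✅ Regex filter + rename rules that map database names to generator names
# (commented out; the routing is now done in filter.js)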
# filter=exclude:*.*,include:db_enterprise*.t_enterprise_*:maxwell_t_enterprise,include:db_business*.t_general_taxpayer_*:maxwell_t_taxpayer,include:db_business*.t_creditimportexport_data_*:maxwell_t_creditimportexport,include:db_code*.t_econ_kind_code_*:maxwell_t_econ,include:db_sub_enterprises*.t_last_industry_*:maxwell_t_last_industry,include:db_code*.t_industry_code_*:maxwell_t_industry,include:db_enterprise*.t_address_*:maxwell_t_address,include:db_enterprise*.t_history_names_*:maxwell_t_history_names,include:db_enterprise*.t_emails_*:maxwell_t_emails,include:db_enterprise_reports*.t_report_details_*:maxwell_t_report_details
javascript=/home/bigdata/filter.js
# Kafka tuning parameters
producer_kafka_batch_size=163840
producer_kafka_linger_ms=20
producer_kafka_buffer_memory=67108864
producer_kafka_compression_type=lz4
# Maxwell core tuning
producer_async=true
worker_threads=4
maxwell.buffer.size=10000
output_nulls=false
producer_partition_by=primary_key
# Recommended settings for high availability
producer_kafka_max_in_flight_requests_per_connection=5
producer_kafka_acks=all
producer_kafka_retries=10
producer_kafka_enable_idempotence=true
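| Option | Value | Purpose |
|---|---|---|
| producer_kafka_batch_size | 163,840 B = 160 KiB | Accumulate up to 160 KiB per batch before sending to Kafka, which raises throughput; works together with linger_ms to bound latency. |
| producer_kafka_linger_ms | 20 ms | Send after at most 20 ms even if the batch is not full, balancing latency against throughput. |
| producer_kafka_buffer_memory | 67,108,864 B = 64 MiB | Upper limit on the Kafka producer client's total buffer for messages that have not been sent yet. |
| producer_kafka_compression_type | lz4 | Compress message bodies with LZ4, which noticeably reduces network I/O and disk usage; compression and decompression are faster than GZIP. |
| producer_async | true | Maxwell produces messages asynchronously and submits them in batches for better performance; when false, each message waits synchronously. |
| worker_threads | 4 | Maxwell uses 4 internal threads to read and convert binlog events in parallel, speeding up processing. |
| maxwell.buffer.size | 10,000 | The internal ring queue buffers at most 10,000 events, so a sudden spike cannot exhaust memory. |
| output_nulls | false | Fields whose value is NULL are omitted from the JSON, shrinking messages; when true, they are written explicitly as "col": null. |
| producer_partition_by | primary_key | Choose the partition by hashing the primary key, so the same primary key always lands in the same partition and downstream consumers see its changes in order. |
| producer_kafka_max_in_flight_requests_per_connection | 5 | Up to 5 unacknowledged requests may be in flight on a single connection, which raises throughput; with idempotence enabled the value must be ≤ 5. |
| producer_kafka_acks | all | A send counts as successful only after all in-sync replicas (ISR) acknowledge it, the strongest durability level. |
| producer_kafka_retries | 10 | Retriable errors (such as transient network jitter) are retried automatically up to 10 times to avoid data loss. |
| producer_kafka_enable_idempotence | true | Enables the idempotent producer, so retries do not write duplicate messages to a partition. This is one building block of exactly-once semantics (EOS), not an end-to-end guarantee by itself. |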
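To make the interplay between the batch size and the linger time concrete, here is a rough back-of-the-envelope sketch. It assumes a single partition, uncompressed messages, and an average message size of 1 KiB (the message size and the function name `first_trigger` are illustrative assumptions, not values from the configuration above).

```python
def first_trigger(batch_size_bytes, linger_ms, msg_bytes, msgs_per_sec):
    """Return which limit closes a batch first, and after how many milliseconds.

    Simplified model: one partition, no compression, constant message rate.
    """
    # Time needed to fill the batch at the given rate, in milliseconds.
    fill_ms = batch_size_bytes / (msg_bytes * msgs_per_sec) * 1000
    if fill_ms <= linger_ms:
        return ("batch_size", fill_ms)
    return ("linger_ms", float(linger_ms))


# At 100 messages/s the 160 KiB batch would need about 1.6 s to fill,
# so the 20 ms linger limit is what actually sends the batch.
print(first_trigger(163840, 20, 1024, 100))

# At 100,000 messages/s the batch fills in under 2 ms, so batch size wins.
print(first_trigger(163840, 20, 1024, 100_000))
```

Under these assumptions, a low-rate workload is governed almost entirely by the linger time, and the batch size only matters once the write rate is high.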
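1.2 Write filter.js (route many databases and tables to a single topic)
Background: why use filter.js for dynamic routing in Maxwell
1.2.1 Current state: sharded databases and tables (Data Sharding)
To handle large data volumes and highly concurrent writes, the business databases in the current big-data architecture are sharded across databases and tables.
As the filter rule in the configuration file shows, the business data is spread over several logical databases and physical tables. For example:
- Enterprise data: spread over databases such as db_enterprise_01, db_enterprise_02, and so on, with table names such as t_enterprise_2023 and t_enterprise_2024.
- Tax data: spread over the db_business family of databases.
- Code-table data: spread over the db_code family of databases.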
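1.2.2 Pain point: the limits of static configuration (Configuration Complexity)
Without a script, relying only on the filter parameter in the Maxwell configuration file (the long commented-out line in the configuration above) raises the following problems:
- Hard to maintain: as the business grows and the number of databases and tables increases, the filter rule becomes extremely long and hard to read (the commented-out line above already packs ten include patterns into a single line).
- Poor flexibility: static configuration struggles with anything beyond simple matching, for example "send the data of every t_enterprise* table under every db_enterprise* database to the single Kafka topic maxwell_t_enterprise".
- Topic explosion: without a rename mapping, Maxwell may by default produce a different identifier for each physical table, so the number of downstream Kafka topics balloons and the consumers (Flink/Spark) become more complex.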
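1.2.3 Solution: a JavaScript dynamic filter (Programmable Routing)
Setting javascript=/home/bigdata/filter.js enables Maxwell's programmable hook. It lets us intercept and process binlog data row by row before it reaches Kafka.
The core benefits are:
- Many-to-one mapping (Data Aggregation): data spread over hundreds or thousands of physical shard tables (such as db_enterprise_01.t_enterprise_001) can be normalized in the transport layer to one logical table name (such as maxwell_t_enterprise) and sent to the same Kafka topic.
- Dynamic topic routing (Dynamic Topic Routing): topic names do not have to be hard-coded in the configuration file. filter.js decides which topic each row goes to based on regex features of the database or table name, which matches the maxwell_t_{generator} naming rule mentioned in the configuration.
- Cleaning and masking at the source (ETL at Source): although routing is the main goal, the JS file can also drop unneeded fields at the source (for example create_time or large columns) or mask sensitive data, reducing the load on downstream processing.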
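1.2.4 Summary
filter.js hides the physical differences of the upstream sharding and gives the downstream data warehouse a unified, logical view of the data. It moves the regex matching and renaming logic out of the configuration file, which makes the synchronization pipeline considerably easier to maintain and extend.
The main change here is a blacklist of tables that should no longer be captured (the example blacklists db_test.t_bidding_content):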
/**
 * Maxwell row-level filter
 * Delivery rules:
 * 1. Old databases + old tables
 * 2. New tables under db_test (including user)
 */
function process_row(row) {
    // [New] Blacklist control section
    // Tables to stop capturing (format: database.table)
    var blacklist = [
        "db_test.t_bidding_content"
    ];
    // Full table name of the current row
    var fullTableName = row.database + "." + row.table;
    // If the table is blacklisted, drop the row without further processing
    if (blacklist.indexOf(fullTableName) !== -1) {
        row.suppress();
        return; // return immediately and end the function
    }
    /* ---------- 1. Old databases and tables ---------- */
    var dbRegOld = /^db_(enterprise|business|code|sub_enterprises|enterprise_reports)\d+$/;
    var tblRegOld = /^t_(enterprise|general_taxpayer|creditimportexport_data|econ_kind_code|last_industry|industry_code|address|history_names|emails|report_details)_\d+$/;
    /* ---------- 2. New database and tables (including user) ---------- */
    var dbRegNew = /^db_test$/;
    var tblRegNew = /^t_(bidding_content|bidding_info|bidding_related|user)$/;
    var baseTable = null;
    /* Match the old rule */
    if (dbRegOld.test(row.database) && tblRegOld.test(row.table)) {
        baseTable = row.table
            .replace(/_\d+$/, '')   // strip the shard suffix
            .replace(/^t_/, '')     // strip the t_ prefix
            .replace('_data', '');  // strip the _data suffix
    }
    /* Match the new rule */
    else if (dbRegNew.test(row.database) && tblRegNew.test(row.table)) {
        baseTable = row.table
            .replace(/^t_/, '')     // strip the t_ prefix
            .replace(/_\d+$/, '');  // strip the shard suffix (if any)
    }
    /* 3. Set the topic or drop the row */
    if (baseTable) {
        row.kafka_topic = 'maxwell_t_' + baseTable;
    } else {
        row.suppress(); // no match: drop the row
    }
}
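Because a mistake in these regexes silently drops data, it can be useful to check the routing offline before restarting Maxwell. The following is a minimal Python port of the logic above for that purpose; it is a sketch and not part of Maxwell, and the function name `route` and the sample database and table names in the usage lines are illustrative assumptions.

```python
import re

# Same blacklist and patterns as filter.js.
BLACKLIST = {"db_test.t_bidding_content"}
DB_OLD = re.compile(r"^db_(enterprise|business|code|sub_enterprises|enterprise_reports)\d+$")
TBL_OLD = re.compile(
    r"^t_(enterprise|general_taxpayer|creditimportexport_data|econ_kind_code|"
    r"last_industry|industry_code|address|history_names|emails|report_details)_\d+$"
)
DB_NEW = re.compile(r"^db_test$")
TBL_NEW = re.compile(r"^t_(bidding_content|bidding_info|bidding_related|user)$")


def route(database, table):
    """Return the Kafka topic for a row, or None if the row would be suppressed."""
    if f"{database}.{table}" in BLACKLIST:
        return None
    if DB_OLD.match(database) and TBL_OLD.match(table):
        base = re.sub(r"_\d+$", "", table)   # strip the shard suffix
        base = re.sub(r"^t_", "", base)      # strip the t_ prefix
        base = base.replace("_data", "", 1)  # first occurrence only, like JS String.replace
        return "maxwell_t_" + base
    if DB_NEW.match(database) and TBL_NEW.match(table):
        base = re.sub(r"^t_", "", table)
        base = re.sub(r"_\d+$", "", base)
        return "maxwell_t_" + base
    return None


print(route("db_enterprise01", "t_enterprise_2023"))   # old rule
print(route("db_test", "t_bidding_info"))              # new rule
print(route("db_test", "t_bidding_content"))           # blacklisted
print(route("db_enterprise_01", "t_enterprise_2023"))  # underscore before the digits
```

One thing this port makes visible: dbRegOld requires the digits to follow the database name directly, so a name written with an underscore before the number (db_enterprise_01) does not match and its rows would be suppressed. If the real database names use that form, the pattern needs an optional underscore.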
1.3 Delete the Kafka (3.9.0) topics
# Enter the Kafka directory
cd kafka_2.13-3.9.0
# List the current topics
./bin/kafka-topics.sh --bootstrap-server xx.xx.xx.xx1:9092 --list

Delete the topics of the three tables: maxwell_t_bidding_content, maxwell_t_bidding_info, and maxwell_t_bidding_related
bin/kafka-topics.sh --bootstrap-server xx.xx.xx.xx1:9092 --delete --topic maxwell_t_bidding_content
bin/kafka-topics.sh --bootstrap-server xx.xx.xx.xx1:9092 --delete --topic maxwell_t_bidding_info
bin/kafka-topics.sh --bootstrap-server xx.xx.xx.xx1:9092 --delete --topic maxwell_t_bidding_related
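When the list of topics grows, the three commands above can be generated instead of typed. This is a small sketch that only builds the argument lists, using the same flags as the commands above; the helper name `build_delete_commands` is hypothetical, and actually running the commands (for example with `subprocess.run`) is left to the reader.

```python
def build_delete_commands(bootstrap, topics, script="bin/kafka-topics.sh"):
    """Build one kafka-topics.sh --delete argument list per topic."""
    return [
        [script, "--bootstrap-server", bootstrap, "--delete", "--topic", topic]
        for topic in topics
    ]


commands = build_delete_commands(
    "xx.xx.xx.xx1:9092",
    ["maxwell_t_bidding_content", "maxwell_t_bidding_info", "maxwell_t_bidding_related"],
)
for cmd in commands:
    # Print the command line; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(" ".join(cmd))
```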

1.4 Restart Maxwell
# Enter the Maxwell directory
cd /opt/module/maxwell-1.44.0
# Start with JDK 11
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.23.0.9-2.el7_9.x86_64
export PATH=$JAVA_HOME/bin:$PATH
# Start Maxwell
./bin/maxwell --config config.properties

II. Running the test
2.1 Start the script that inserts data into MySQL
Runtime environment: Python 3.8.20
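# -*- coding: utf8 -*-
"""
Concurrent inserts into three tables + background batch updates/deletes per round
+ final Kafka totals.
Changes:
1. [New] Print the initial Kafka offsets before starting.
2. Record the final Kafka offsets after finishing.
3. The 'Kafka Recv' column shows the number of messages added during this run.
4. Finally, show the current total end offset of each Kafka topic.
"""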
import datetime as dt
from config import Config
import random
import time
import pymysql
import threading
import signal
import re
from typing import Dict, Any, List, Deque
from collections import deque, Counter
from kafka import KafkaAdminClient, KafkaConsumer, TopicPartition
from kafka.admin import NewTopic
from kafka.errors import TopicAlreadyExistsError
# -------------- Kafka settings --------------
BOOTSTRAP_SERVERS = ["xx.xx.xx.xx1:9092", "xx.xx.xx.xx2:9092", "xx.xx.xx.xx3:9092"]
KAFKA_TOPICS = ["maxwell_t_bidding_content", "maxwell_t_bidding_info", "maxwell_t_bidding_related"]
# -------------- Global statistics (MySQL side) --------------
STATS: Dict[str, Counter] = {
"t_bidding_content": Counter({"insert": 0, "update": 0, "delete": 0}),
"t_bidding_info": Counter({"insert": 0, "update": 0, "delete": 0}),
"t_bidding_related": Counter({"insert": 0, "update": 0, "delete": 0}),
}
def create_topics_once():
    admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP_SERVERS)
    topics = [NewTopic(name=kt, num_partitions=6, replication_factor=2) for kt in KAFKA_TOPICS]
    try:
        admin.create_topics(new_topics=topics, validate_only=False)
        print(">>> Kafka topics created OK:", KAFKA_TOPICS)
    except TopicAlreadyExistsError:
        print(">>> Topics already exist, skip creation.")
    except Exception as e:
        print(f">>> Create topics warning: {e}")
    finally:
        admin.close()
# -------------- Current Kafka totals (log end offset) --------------
def get_current_topic_offsets() -> Dict[str, int]:
    """
    Query the log end offset (high watermark) of every partition from the
    brokers without consuming any data. The sum equals the topic's message
    count only if the topic started empty and nothing has been removed by
    retention or compaction, which holds for the freshly created test topics.
    """
    total_offsets = {t: 0 for t in KAFKA_TOPICS}
    consumer = None
    try:
        consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP_SERVERS)
        for topic in KAFKA_TOPICS:
            partitions = consumer.partitions_for_topic(topic)
            if not partitions:
                continue
            # Build the list of partition objects
            tp_list = [TopicPartition(topic, p) for p in partitions]
            # Fetch the latest offset of every partition
            end_offsets = consumer.end_offsets(tp_list)
            # Sum them to get the topic total
            total_offsets[topic] = sum(end_offsets.values())
    except Exception as e:
        print(f">>> Failed to fetch Kafka offsets: {e}")
    finally:
        if consumer:
            consumer.close()
    return total_offsets
# -------------- Snowflake ID --------------
EPOCH = 1_600_000_000_000
WORKER_ID_BITS = 10
SEQUENCE_BITS = 12
MAX_WORKER_ID = -1 ^ (-1 << WORKER_ID_BITS)
MAX_SEQUENCE = -1 ^ (-1 << SEQUENCE_BITS)
WORKER_ID_SHIFT = SEQUENCE_BITS
TIMESTAMP_SHIFT = SEQUENCE_BITS + WORKER_ID_BITS
class Snowflake:
    _lock = threading.Lock()

    def __init__(self, worker_id: int = 1):
        self.worker_id = worker_id
        self.sequence = 0
        self.last_timestamp = -1

    def _gen_timestamp(self):
        return int(time.time() * 1000)

    def get_id(self) -> int:
        with Snowflake._lock:
            timestamp = self._gen_timestamp()
            if timestamp < self.last_timestamp:
                time.sleep(0.001)
                timestamp = self._gen_timestamp()
            if timestamp == self.last_timestamp:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    while timestamp <= self.last_timestamp:
                        timestamp = self._gen_timestamp()
            else:
                self.sequence = 0
            self.last_timestamp = timestamp
            return ((timestamp - EPOCH) << TIMESTAMP_SHIFT |
                    self.worker_id << WORKER_ID_SHIFT |
                    self.sequence)


_snow = Snowflake(worker_id=1)


def gen_new_id() -> int:
    return _snow.get_id()
# -------------- Database settings --------------
DB_CONFIG = {
"host": "xx.xx.xx.xx",
"user": "root",
"password": "xxxx",
"port": 3306,
"charset": "utf8mb4",
"autocommit": False,
}
DATABASE = "db_test"
# ========== Rate settings ==========
BATCH_SIZE = 100
TARGET_TPS = 100
CRUD_ROUND = 20
TABLE_META = {
"t_bidding_info": [
"id", "title", "publish_time", "area_code", "notice_type_main",
"notice_type_sub", "industry_code", "project_name", "project_number",
"project_budget_money", "project_time_limit", "entry_start_time",
"entry_end_time", "bid_open_time", "proprietor_company",
"agency_company", "winner_company", "winner_candidate",
"related_construction", "project_fund_source", "attachment_oss",
"url", "crawl_time", "s_id", "u_id", "u_tags", "qds", "create_time",
"row_update_time", "local_row_update_time", "cdc_sync_date",
"products", "partition_date", "insert_mysql"
],
"t_bidding_content": [
"id", "content_text", "content_html", "content_swf", "u_id",
"u_tags", "qds", "create_time", "row_update_time",
"local_row_update_time", "cdc_sync_date", "partition_date",
"insert_mysql"
],
"t_bidding_related": [
"id", "u_id", "eid", "role", "title", "publish_time", "area_code",
"notice_type_main", "notice_type_sub", "industry_code",
"project_number", "project_bid_money", "qds", "u_tags",
"create_time", "row_update_time", "local_row_update_time",
"cdc_sync_date", "products", "partition_date", "insert_mysql"
],
}
STOP_EVENT = threading.Event()
# -------------- Token bucket --------------
class TokenBucket:
    def __init__(self, rate: int):
        self.rate = rate
        self.tokens = rate
        self.lock = threading.Lock()
        self.last = time.time()

    def consume(self, n: int = 1):
        # Block until n tokens are available (n must not exceed rate) or a stop is requested.
        while not STOP_EVENT.is_set():
            with self.lock:
                now = time.time()
                delta = now - self.last
                self.tokens = min(self.rate, self.tokens + delta * self.rate)
                self.last = now
                if self.tokens >= n:
                    self.tokens -= n
                    return
            time.sleep(0.001)


TOKEN_BUCKET = TokenBucket(TARGET_TPS)
# -------------- Round counter & CRUD logic --------------
_INSERT_ROUNDS = 0
_CRUD_LOCK = threading.Lock()
_CRUD_SEQ = 0
_CRUD_SEQ_LOCK = threading.Lock()
_BUFFER_LOCK = threading.Lock()
_BUFFERS: Dict[str, Deque[Dict[str, Any]]] = {
    tbl: deque(maxlen=CRUD_ROUND * BATCH_SIZE * 2) for tbl in TABLE_META
}


def _trigger_crud_if_needed():
    # Start one background CRUD pass after every CRUD_ROUND insert batches.
    global _INSERT_ROUNDS
    with _CRUD_LOCK:
        _INSERT_ROUNDS += 1
        if _INSERT_ROUNDS >= CRUD_ROUND:
            _INSERT_ROUNDS = 0
            threading.Thread(target=_crud_once, daemon=True).start()


def _crud_once():
    global _CRUD_SEQ
    conn = pymysql.connect(**DB_CONFIG)
    try:
        # Take a snapshot of the buffered ids and clear the buffers.
        with _BUFFER_LOCK:
            if not any(_BUFFERS.values()):
                return
            snapshot = {k: list(v) for k, v in _BUFFERS.items()}
            for q in _BUFFERS.values():
                q.clear()
        with _CRUD_SEQ_LOCK:
            _CRUD_SEQ += 1
            counter = _CRUD_SEQ
        print(f"\n>>> CRUD round {counter} started")
        for tbl, buf in snapshot.items():
            if not buf:
                continue
            ids = [r["id"] for r in buf]
            mod_cnt = random.randint(50, 100)
            del_cnt = random.randint(50, 100)
            mod_ids = random.sample(ids, min(mod_cnt, len(ids)))
            # Rows chosen for update are never deleted in the same round.
            candidates = [i for i in ids if i not in mod_ids]
            del_cnt = min(del_cnt, len(candidates))
            del_ids = random.sample(candidates, del_cnt) if del_cnt else []
            with conn.cursor() as cur:
                if mod_ids:
                    col_map = {"t_bidding_info": "title", "t_bidding_related": "title",
                               "t_bidding_content": "content_text"}
                    col = col_map[tbl]
                    new_val = f"[RAND-{counter}]"
                    sql_mod = f"UPDATE `{DATABASE}`.`{tbl}` SET {col} = %s WHERE id = %s"
                    cur.executemany(sql_mod, [(new_val, _id) for _id in mod_ids])
                    STATS[tbl]["update"] += len(mod_ids)
                    print(f" [{tbl}] updated {len(mod_ids)} rows")
                if del_ids:
                    sql_del = f"DELETE FROM `{DATABASE}`.`{tbl}` WHERE id = %s"
                    cur.executemany(sql_del, [(_id,) for _id in del_ids])
                    STATS[tbl]["delete"] += len(del_ids)
                    print(f" [{tbl}] deleted {len(del_ids)} rows")
        conn.commit()
        print(f">>> CRUD round {counter} finished\n")
    finally:
        conn.close()


def add_to_buffer(table: str, rows: List[Dict[str, Any]]):
    with _BUFFER_LOCK:
        for r in rows:
            _BUFFERS[table].append({"id": r["id"], "table": table})
    _trigger_crud_if_needed()
# -------------- Data generation --------------
_SHARED_UIDS = deque(maxlen=2000)
_UID_REUSE_RATE = 0.8
_AREA_CODE_POOL: List[str] = []


def _clean_uid(raw: str) -> str:
    if not raw:
        return 'default_uid'
    raw = str(raw).strip()
    cleaned = re.sub(r'[^a-zA-Z0-9_-]', '_', raw)
    return cleaned[:120] if cleaned else 'default_uid'


def _load_area_code_pool():
    global _AREA_CODE_POOL
    conn = pymysql.connect(**DB_CONFIG)
    try:
        with conn.cursor() as cur:
            cur.execute(
                f"SELECT DISTINCT area_code FROM `{DATABASE}`.`t_bidding_info` WHERE area_code IS NOT NULL LIMIT 1000")
            _AREA_CODE_POOL = [row[0] for row in cur.fetchall()]
        print(f">>> area_code pool loaded, {len(_AREA_CODE_POOL)} entries")
    finally:
        conn.close()


def transform_one(table: str, row: Dict[str, Any]) -> Dict[str, Any]:
    new_row = row.copy()
    new_row["id"] = gen_new_id()
    new_row["u_id"] = _clean_uid(row.get("u_id"))
    if random.random() < _UID_REUSE_RATE and _SHARED_UIDS:
        new_row["u_id"] = random.choice(_SHARED_UIDS)
    else:
        if new_row["u_id"]:
            _SHARED_UIDS.append(new_row["u_id"])
    if table == "t_bidding_info":
        if _AREA_CODE_POOL:
            new_row["area_code"] = random.choice(_AREA_CODE_POOL)
        if new_row.get("title"):
            new_row["title"] = "[UPD]" + new_row["title"]
        if new_row.get("project_budget_money") is not None:
            new_row["project_budget_money"] = round(
                float(new_row["project_budget_money"]) * 1.05, 6)
    elif table == "t_bidding_content":
        if new_row.get("content_text"):
            new_row["content_text"] = "[UPD]" + new_row["content_text"]
    elif table == "t_bidding_related":
        if new_row.get("title"):
            new_row["title"] = "[UPD]" + new_row["title"]
        if new_row.get("project_bid_money") is not None:
            new_row["project_bid_money"] = round(
                float(new_row["project_bid_money"]) * 1.05, 6)
    return new_row


def build_insert_sql(table: str, cols: List[str]) -> str:
    back_quote_cols = [f"`{c}`" for c in cols]
    placeholders = ["%s"] * len(cols)
    return f"INSERT INTO `{DATABASE}`.`{table}` ({','.join(back_quote_cols)}) VALUES ({','.join(placeholders)})"
def worker(table: str):
    cols = TABLE_META[table]
    select_cols = ",".join([f"`{c}`" for c in cols])
    sql_select = f"SELECT {select_cols} FROM `{DATABASE}`.`{table}` WHERE id > %s ORDER BY id ASC LIMIT %s"
    sql_insert = build_insert_sql(table, cols)
    conn = pymysql.connect(**DB_CONFIG)
    try:
        lower_id = 0
        total = 0
        while not STOP_EVENT.is_set():
            with conn.cursor() as cur:
                # Read the next batch of existing rows to use as templates.
                cur.execute(sql_select, (lower_id, BATCH_SIZE))
                rows = cur.fetchall()
                if not rows:
                    # Reached the end of the table: start again from the beginning.
                    lower_id = 0
                    continue
                dict_rows = [dict(zip(cols, r)) for r in rows]
                new_rows = [transform_one(table, r) for r in dict_rows]
                TOKEN_BUCKET.consume(len(new_rows))
                try:
                    cur.executemany(sql_insert, [tuple(r[c] for c in cols) for r in new_rows])
                    conn.commit()
                    total += len(new_rows)
                    print(f"[{table}] +{len(new_rows)} rows, total inserted {total}")
                    STATS[table]["insert"] += len(new_rows)
                    add_to_buffer(table, new_rows)
                except Exception as e:
                    conn.rollback()
                    print(f"[{table}] insert error: {e}")
                lower_id = max(r["id"] for r in dict_rows)
    finally:
        conn.close()
# -------------- Core statistics functions --------------
def print_final_stats(initial_offsets: Dict[str, int], final_offsets: Dict[str, int]):
    """
    initial_offsets: offsets when the script started
    final_offsets: offsets when the script finished
    """
    print("\n=================== Statistics for this run ===================")
    print(
        f"{'Table / Topic':<30} | {'MySQL Insert':<12} | {'MySQL Update':<12} | {'MySQL Delete':<12} | {'Kafka Recv':<10}")
    print("-" * 90)
    for tbl in TABLE_META.keys():
        # MySQL-side details
        ins = STATS[tbl]['insert']
        upd = STATS[tbl]['update']
        dlt = STATS[tbl]['delete']
        # Messages received by Kafka (delta for this run = final - initial)
        topic_name = f"maxwell_{tbl}"
        start_off = initial_offsets.get(topic_name, 0)
        end_off = final_offsets.get(topic_name, 0)
        # 'Kafka Recv' is the number of messages added to Kafka while this script ran
        kafka_recv = max(0, end_off - start_off)
        print(f"{tbl:<30} | {ins:<12} | {upd:<12} | {dlt:<12} | {kafka_recv:<10}")
    print("===============================================")
    # ------ Print each topic's total end offset after the stop ------
    print("\n>>> Current total end offset per Kafka topic (after stop):")
    for topic in KAFKA_TOPICS:
        count = final_offsets.get(topic, 0)
        print(f" [Topic] {topic:<35} : {count} messages")
    print("===============================================")


def wait_for_schedule_window():
    fmt = "%H:%M"
    start = dt.datetime.strptime(Config.SCHEDULE_START_TIME, fmt).time()
    end = dt.datetime.strptime(Config.SCHEDULE_END_TIME, fmt).time()
    while not STOP_EVENT.is_set():
        now = dt.datetime.now().time()
        if start <= end:
            in_window = start <= now <= end
        else:
            # The window wraps past midnight, e.g. 22:00~06:00.
            in_window = now >= start or now <= end
        if in_window:
            print(f"[schedule] Inside the allowed window {Config.SCHEDULE_START_TIME}~{Config.SCHEDULE_END_TIME}, starting work")
            return
        else:
            print(f"[schedule] Current time {now.strftime('%H:%M')} is outside the allowed window, waiting 60 s ...")
            STOP_EVENT.wait(60)
# -------------- Main control --------------
def main():
    wait_for_schedule_window()
    if STOP_EVENT.is_set():
        return
    _load_area_code_pool()
    create_topics_once()
    # --- Fetch and print the initial totals ---
    print(">>> Fetching the initial Kafka totals (please wait)...")
    initial_offsets = get_current_topic_offsets()
    print("\n>>> [Initial] Current total per Kafka topic:")
    for topic in KAFKA_TOPICS:
        count = initial_offsets.get(topic, 0)
        print(f" [Topic] {topic:<35} : {count} messages")
    print("-----------------------------------------------\n")
    print(">>> Initial offsets fetched, starting threads...")

    def _sig_handler(sig, frame):
        print("\n>>> Ctrl+C caught, stopping all threads ...")
        STOP_EVENT.set()

    signal.signal(signal.SIGINT, _sig_handler)
    threads = []
    for tbl in TABLE_META.keys():
        t = threading.Thread(target=worker, args=(tbl,), daemon=False)
        t.start()
        threads.append(t)
    try:
        while not STOP_EVENT.is_set():
            time.sleep(1)
    finally:
        print(">>> Waiting for worker threads to finish...")
        for t in threads:
            if not t.daemon:
                t.join()
        print(">>> Waiting for Maxwell to sync the last changes (wait 5 s) ...")
        time.sleep(5)
        print(">>> Reading the final Kafka totals ...")
        final_offsets = get_current_topic_offsets()
        # Pass both the initial and the final offsets for comparison
        print_final_stats(initial_offsets, final_offsets)
        print(">>> Exited safely")


if __name__ == "__main__":
    main()
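2.2 Test results
Initially, all three topics contain 0 messages.

Press Ctrl+C to stop the program; it prints the run statistics:

The topic maxwell_t_bidding_content contains no data, even though 2,500 rows were actually inserted into db_test.t_bidding_content, 256 rows were updated, and 231 rows were deleted. This shows that Maxwell successfully stopped capturing the table db_test.t_bidding_content.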
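The reasoning behind this conclusion can be written down as a small check. The sketch below assumes that each inserted, updated, or deleted row corresponds to one Kafka message when the table is captured, and that nothing else writes to the topic during the run; the function names are illustrative and the counts are the ones reported above.

```python
def expected_messages(stats):
    """Row changes made in MySQL: one per inserted, updated, or deleted row."""
    return stats["insert"] + stats["update"] + stats["delete"]


def classify(stats, kafka_delta):
    """Compare MySQL-side changes with the number of new Kafka messages."""
    expected = expected_messages(stats)
    if expected > 0 and kafka_delta == 0:
        return "paused"      # changes happened, nothing reached Kafka
    if kafka_delta == expected:
        return "in sync"     # every change produced a message
    return "mismatch"        # partial delivery, lag, or other writers


# Counts reported for db_test.t_bidding_content in this run.
content_stats = {"insert": 2500, "update": 256, "delete": 231}
print(expected_messages(content_stats))  # 2987
print(classify(content_stats, 0))        # paused
```

For the two tables that are still captured, a result of "mismatch" right after the run does not necessarily mean data loss: the script waits only 5 seconds before reading the final offsets, so Maxwell may still be catching up.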
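Confirm the actual number of messages in each topic:
Write a shell script that counts the messages in a topic:
vi count_maxwell_total.sh
Script contents:
#!/bin/bash
# count_maxwell_total.sh, for Kafka 3.9.0
BROKERS="xx.xx.xx.xx1:9092,xx.xx.xx.xx2:9092,xx.xx.xx.xx3:9092"
# Change this to the topic you want to count
TOPIC="maxwell_t_bidding_info"
KAFKA_HOME=/home/bigdata/kafka_2.13-3.9.0
# Offset tool shipped with Kafka 3.9.0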
"$KAFKA_HOME/bin/kafka-get-offsets.sh" \
--bootstrap-server "$BROKERS" \
--topic "$TOPIC" 2>/dev/null | \
awk -F: '{sum+=$NF} END{print "总条数:", sum}'
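The awk step relies on kafka-get-offsets.sh printing one `topic:partition:offset` line per partition and sums the last field. For readers less familiar with awk, the same aggregation is sketched here in Python; the three sample lines and their offsets are made up for illustration and are not output from this test.

```python
def total_offset(lines):
    """Sum the last colon-separated field of each non-empty line."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # rsplit keeps working even if the topic name itself contains a colon.
        total += int(line.rsplit(":", 1)[-1])
    return total


# Hypothetical tool output for a topic with three partitions.
sample = [
    "maxwell_t_bidding_info:0:410",
    "maxwell_t_bidding_info:1:395",
    "maxwell_t_bidding_info:2:402",
]
print("total:", total_offset(sample))  # total: 1207
```

Unlike awk, which treats a non-numeric field as 0, this version raises a ValueError on an unexpected line, which makes malformed output easier to notice.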

Change the topic to maxwell_t_bidding_content:
TOPIC="maxwell_t_bidding_content"

Change the topic to maxwell_t_bidding_related:
TOPIC="maxwell_t_bidding_related"
