Maxwell (1.44.0): Testing How to Pause Listening on Specific Tables


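Before reading this document, it is recommended to read the author's earlier deployment guide, "Maxwell (1.44.0) Deployment Guide", which makes this one easier to follow.

1. Preparation

1.1 Write the configuration file (config.properties)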

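# MySQL settings
host=xx.xx.xx.xxx
user=maxwell
password=xxxxxx

# Kafka producer settings
producer=kafka
kafka.bootstrap.servers=xx.xx.xx.xx1:9092,xx.xx.xx.xx2:9092,xx.xx.xx.xx3:9092

# ✅ Default topic; filter.js overrides it per row using the naming rule maxwell_t_{generator}
kafka_topic=maxwell

# ✅ Regex filter plus renaming rules that map sharded names to a generator name (kept commented out; filter.js takes over this job)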
# filter=exclude:*.*,include:db_enterprise*.t_enterprise_*:maxwell_t_enterprise,include:db_business*.t_general_taxpayer_*:maxwell_t_taxpayer,include:db_business*.t_creditimportexport_data_*:maxwell_t_creditimportexport,include:db_code*.t_econ_kind_code_*:maxwell_t_econ,include:db_sub_enterprises*.t_last_industry_*:maxwell_t_last_industry,include:db_code*.t_industry_code_*:maxwell_t_industry,include:db_enterprise*.t_address_*:maxwell_t_address,include:db_enterprise*.t_history_names_*:maxwell_t_history_names,include:db_enterprise*.t_emails_*:maxwell_t_emails,include:db_enterprise_reports*.t_report_details_*:maxwell_t_report_details
javascript=/home/bigdata/filter.js


# Kafka tuning parameters
producer_kafka_batch_size=163840
producer_kafka_linger_ms=20
producer_kafka_buffer_memory=67108864
producer_kafka_compression_type=lz4

# Maxwell core tuning
producer_async=true
worker_threads=4
maxwell.buffer.size=10000
output_nulls=false
producer_partition_by=primary_key

# Recommended settings for high availability
producer_kafka_max_in_flight_requests_per_connection=5
producer_kafka_acks=all
producer_kafka_retries=10
producer_kafka_enable_idempotence=true
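
Setting | Value | Description
producer_kafka_batch_size | 163840 B (160 KiB) | Accumulate up to 160 KiB per batch before sending to Kafka, which raises throughput; works together with linger_ms to bound latency.
producer_kafka_linger_ms | 20 ms | Send after at most 20 ms even if the batch is not full, balancing latency against throughput.
producer_kafka_buffer_memory | 67108864 B (64 MiB) | Upper limit on the memory the Kafka producer client uses to hold messages that have not been sent yet.
producer_kafka_compression_type | lz4 | Compress message batches with LZ4, which lowers network I/O and disk usage; compression and decompression are faster than GZIP.
producer_async | true | Maxwell produces messages asynchronously and in batches for better performance; when false, each message is sent synchronously.
worker_threads | 4 | Maxwell uses 4 threads to read and convert binlog events in parallel, which speeds up processing.
maxwell.buffer.size | 10000 | The internal ring queue buffers at most 10,000 events, so a sudden peak cannot exhaust memory.
output_nulls | false | Fields whose value is NULL are omitted from the JSON, which reduces message size; when true they are written explicitly as "col": null.
producer_partition_by | primary_key | Choose the partition by hashing the primary key, so the same key always lands in the same partition and downstream consumers see its changes in order.
producer_kafka_max_in_flight_requests_per_connection | 5 | Allow up to 5 unacknowledged requests per connection for higher throughput; must be 5 or lower when idempotence is enabled.
producer_kafka_acks | all | A send counts as successful only after all in-sync replicas (ISR) have acknowledged it; this is the strongest durability setting.
producer_kafka_retries | 10 | Retry retriable errors (such as a brief network glitch) up to 10 times to avoid losing data.
producer_kafka_enable_idempotence | true | Enable the idempotent producer so that retries do not write duplicate messages. This removes duplicates caused by producer retries within one partition; it does not by itself give end-to-end exactly-once semantics (EOS).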
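
The last four rows depend on each other: Kafka's idempotent producer requires acks=all, retries greater than 0, and at most 5 in-flight requests per connection. The sketch below checks a settings dictionary against those three rules. The function name idempotence_problems and the shortened keys (the producer_kafka_ prefix is stripped) are illustrative assumptions and not part of Maxwell or Kafka.

```python
def idempotence_problems(settings):
    """Return the reasons these settings conflict with an idempotent producer."""
    problems = []
    if settings.get("enable_idempotence", "false").lower() != "true":
        return problems  # nothing to check when idempotence is switched off
    if settings.get("acks") not in ("all", "-1"):
        problems.append("acks must be 'all'")
    if int(settings.get("retries", "0")) <= 0:
        problems.append("retries must be greater than 0")
    if int(settings.get("max_in_flight_requests_per_connection", "5")) > 5:
        problems.append("max_in_flight_requests_per_connection must be <= 5")
    return problems


# Values copied from the high-availability block of config.properties,
# with the producer_kafka_ prefix removed (hypothetical key names).
ha_settings = {
    "max_in_flight_requests_per_connection": "5",
    "acks": "all",
    "retries": "10",
    "enable_idempotence": "true",
}

print(idempotence_problems(ha_settings))  # []
```

An empty list means the four values from the configuration above are mutually consistent.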

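1.2 Write filter.js (route many sharded databases and tables to a single topic)

Background: why use filter.js for dynamic routing and cleanup in Maxwell

1.2.1 Current situation: sharded databases and tables (Data Sharding)

To cope with very large data volumes and highly concurrent writes, the business databases in the current big data architecture are sharded across databases and tables.
As the filter rule in the configuration file shows, the business data is spread over multiple logical databases and physical tables. For example:

  • Enterprise data: spread over several databases such as db_enterprise01, db_enterprise02, …, with table names such as t_enterprise_2023 and t_enterprise_2024.
  • Tax data: spread over the db_business series of databases.
  • Code-table data: spread over the db_code series of databases.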
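
1.2.2 Pain point: limits of static configuration (Configuration Complexity)

Without a script, relying only on the filter parameter in Maxwell's configuration file (the long commented-out line in config.properties) leads to the following problems:

  • Hard to maintain: as the business grows and databases and tables are added, the filter rule becomes extremely long and hard to read (the commented-out line already packs ten include rules into a single line).
  • Inflexible: static configuration handles complex logic poorly, for example "send the data of every t_enterprise* table under every db_enterprise* database to the Kafka topic maxwell_t_enterprise".
  • Topic explosion: without a renaming step, each physical table may get its own identifier, so the number of downstream Kafka topics balloons and the consumers (Flink/Spark) become more complex.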
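
1.2.3 Solution: a JavaScript dynamic filter (Programmable Routing)

Setting javascript=/home/bigdata/filter.js enables Maxwell's scripting interface. It lets us intercept and process binlog data row by row before it reaches Kafka.

Its core value:

  1. Many-to-one mapping (Data Aggregation)
    Data spread over hundreds or thousands of physical shard tables (such as db_enterprise01.t_enterprise_001) is given one logical name (such as maxwell_t_enterprise) at the transport layer and sent to the same Kafka topic.
  2. Dynamic topic routing (Dynamic Topic Routing)
    Topic names do not have to be hard-coded in the configuration file. filter.js decides, from the regex features of the database or table name, which topic each row goes to. This matches the maxwell_t_{generator} naming rule mentioned in the configuration.
  3. Data cleanup and masking (ETL at Source)
    Although routing is the main goal, the JS file can also drop unneeded fields at the source (such as create_time or large columns) or mask sensitive data, which reduces the downstream computation load.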
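
1.2.4 Summary

filter.js hides the physical differences of the upstream sharding and gives the downstream data warehouse a unified, logical view of the data. It moves the regex matching and renaming logic out of the configuration file, which makes the synchronization pipeline much easier to maintain and extend.

The main change here is a blacklist of tables that Maxwell should stop listening to (this example adds db_test.t_bidding_content to blacklist):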

/**
 * Maxwell row-level filter
 * Delivery rules:
 *   1. Old databases + old tables
 *   2. New tables under db_test (including user)
 */
function process_row(row) {
    // [New] Blacklist control section
    // Tables to stop listening to (format: database.table)
    var blacklist = [
        "db_test.t_bidding_content"
    ];

    // Fully qualified table name of the current row
    var fullTableName = row.database + "." + row.table;

    // If the table is blacklisted, drop the row without further processing
    if (blacklist.indexOf(fullTableName) !== -1) {
        row.suppress();
        return; // return early and end the function
    }

    /* ---------- 1. Old databases and old tables ---------- */
    var dbRegOld = /^db_(enterprise|business|code|sub_enterprises|enterprise_reports)\d+$/;
    var tblRegOld = /^t_(enterprise|general_taxpayer|creditimportexport_data|econ_kind_code|last_industry|industry_code|address|history_names|emails|report_details)_\d+$/;

    /* ---------- 2. New database and new tables (including user) ---------- */
    var dbRegNew = /^db_test$/;
    var tblRegNew = /^t_(bidding_content|bidding_info|bidding_related|user)$/;

    var baseTable = null;

    /* Match the old rule */
    if (dbRegOld.test(row.database) && tblRegOld.test(row.table)) {
        baseTable = row.table
            .replace(/_\d+$/, '')      // strip the shard suffix
            .replace(/^t_/, '')        // strip the t_ prefix
            .replace(/_data$/, '');    // strip the _data suffix
    }
    /* Match the new rule */
    else if (dbRegNew.test(row.database) && tblRegNew.test(row.table)) {
        baseTable = row.table
            .replace(/^t_/, '')        // strip the t_ prefix
            .replace(/_\d+$/, '');     // strip the shard suffix (if any)
    }

    /* 3. Set the topic, or drop the row */
    if (baseTable) {
        row.kafka_topic = 'maxwell_t_' + baseTable;
    } else {
        row.suppress();   // drop rows that match no rule
    }
}
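
Because a wrong regex silently drops or misroutes rows, it can help to check the routing rules offline before restarting Maxwell. The following is a minimal Python mirror of the decision logic in filter.js, intended only for testing the patterns; Maxwell itself runs the JavaScript file. The function name route and the sample shard names (db_enterprise01, db_business2) are assumptions made for illustration.

```python
import re

# Same blacklist and patterns as filter.js
BLACKLIST = {"db_test.t_bidding_content"}
DB_OLD = re.compile(r"^db_(enterprise|business|code|sub_enterprises|enterprise_reports)\d+$")
TBL_OLD = re.compile(
    r"^t_(enterprise|general_taxpayer|creditimportexport_data|econ_kind_code|"
    r"last_industry|industry_code|address|history_names|emails|report_details)_\d+$"
)
DB_NEW = re.compile(r"^db_test$")
TBL_NEW = re.compile(r"^t_(bidding_content|bidding_info|bidding_related|user)$")


def route(database, table):
    """Return the Kafka topic for a row, or None if the row would be suppressed."""
    if f"{database}.{table}" in BLACKLIST:
        return None
    if DB_OLD.search(database) and TBL_OLD.search(table):
        base = re.sub(r"_\d+$", "", table)   # strip the shard suffix
        base = re.sub(r"^t_", "", base)      # strip the t_ prefix
        base = re.sub(r"_data$", "", base)   # strip the _data suffix
    elif DB_NEW.search(database) and TBL_NEW.search(table):
        base = re.sub(r"^t_", "", table)
        base = re.sub(r"_\d+$", "", base)
    else:
        return None
    return "maxwell_t_" + base


print(route("db_test", "t_bidding_content"))                  # None (blacklisted)
print(route("db_test", "t_bidding_info"))                     # maxwell_t_bidding_info
print(route("db_enterprise01", "t_enterprise_2023"))          # maxwell_t_enterprise
print(route("db_business2", "t_creditimportexport_data_7"))   # maxwell_t_creditimportexport
```

Note that the old-database pattern expects the digits directly after the name, so a database called db_enterprise_01 (with an underscore) would not match and its rows would be suppressed.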

1.3 Delete the Kafka (3.9.0) topics

# Go to the Kafka directory
cd kafka_2.13-3.9.0
# List the current topics
./bin/kafka-topics.sh --bootstrap-server xx.xx.xx.xx1:9092 --list


Delete the topics of these three tables: maxwell_t_bidding_content, maxwell_t_bidding_info, and maxwell_t_bidding_related.

bin/kafka-topics.sh --bootstrap-server xx.xx.xx.xx1:9092 --delete --topic maxwell_t_bidding_content

bin/kafka-topics.sh --bootstrap-server xx.xx.xx.xx1:9092 --delete --topic maxwell_t_bidding_info

bin/kafka-topics.sh --bootstrap-server xx.xx.xx.xx1:9092 --delete --topic maxwell_t_bidding_related
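
When more topics need to be removed, repeating the command by hand gets error-prone. As a small sketch, the helper below builds the same kafka-topics.sh invocations from a list of topic names; build_delete_commands is a hypothetical helper name, and the flags are exactly the ones used in the commands above.

```python
import shlex


def build_delete_commands(bootstrap_server, topics, script="bin/kafka-topics.sh"):
    """Build one kafka-topics.sh --delete argument list per topic."""
    return [
        [script, "--bootstrap-server", bootstrap_server, "--delete", "--topic", topic]
        for topic in topics
    ]


commands = build_delete_commands(
    "xx.xx.xx.xx1:9092",
    ["maxwell_t_bidding_content", "maxwell_t_bidding_info", "maxwell_t_bidding_related"],
)
for argv in commands:
    # Print the command line; pass argv to subprocess.run(argv, check=True) to execute it
    print(shlex.join(argv))
```

Printing first and executing only after a review is a deliberate choice here, since topic deletion cannot be undone.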


1.4 Restart Maxwell

# Go to the Maxwell directory
cd /opt/module/maxwell-1.44.0

# Start with JDK 11
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.23.0.9-2.el7_9.x86_64
export PATH=$JAVA_HOME/bin:$PATH

# Start Maxwell
./bin/maxwell --config config.properties


2. Running the Test

2.1 Run the MySQL data insertion script

Runtime environment: Python 3.8.20

# -*- coding: utf8 -*-
"""
Concurrent inserts into three tables + background batch update/delete per round + final Kafka totals
Changes:
1. [New] Print the initial Kafka offsets before starting.
2. Record the final Kafka offsets after stopping.
3. 'Kafka Recv' in the statistics table shows the messages added during this run.
4. Finally show the current total end offset of each Kafka topic.
"""
import datetime as dt
from config import Config  # local config.py; must define SCHEDULE_START_TIME and SCHEDULE_END_TIME as "HH:MM"
import random
import time
import pymysql
import threading
import signal
import re
from typing import Dict, Any, List, Deque
from collections import deque, Counter
from kafka import KafkaAdminClient, KafkaConsumer, TopicPartition
from kafka.admin import NewTopic
from kafka.errors import TopicAlreadyExistsError

# -------------- Kafka settings --------------
BOOTSTRAP_SERVERS = ["xx.xx.xx.xx1:9092", "xx.xx.xx.xx2:9092", "xx.xx.xx.xx3:9092"]
KAFKA_TOPICS = ["maxwell_t_bidding_content", "maxwell_t_bidding_info", "maxwell_t_bidding_related"]

# -------------- Global statistics (MySQL side) --------------
STATS: Dict[str, Counter] = {
    "t_bidding_content": Counter({"insert": 0, "update": 0, "delete": 0}),
    "t_bidding_info": Counter({"insert": 0, "update": 0, "delete": 0}),
    "t_bidding_related": Counter({"insert": 0, "update": 0, "delete": 0}),
}


def create_topics_once():
    admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP_SERVERS)
    topics = [NewTopic(name=kt, num_partitions=6, replication_factor=2) for kt in KAFKA_TOPICS]
    try:
        admin.create_topics(new_topics=topics, validate_only=False)
        print(">>> Kafka topics created OK:", KAFKA_TOPICS)
    except TopicAlreadyExistsError:
        print(">>> Topics already exist, skip creation.")
    except Exception as e:
        print(f">>> Create topics warning: {e}")
    finally:
        admin.close()


# -------------- Current Kafka totals (end offsets) --------------
def get_current_topic_offsets() -> Dict[str, int]:
    """
    Query the end offset of every partition on the brokers without consuming data.
    The per-topic sum equals the message count only for a topic that started
    empty and has had nothing removed by retention or compaction.
    """
    total_offsets = {t: 0 for t in KAFKA_TOPICS}
    consumer = None
    try:
        consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP_SERVERS)
        for topic in KAFKA_TOPICS:
            partitions = consumer.partitions_for_topic(topic)
            if not partitions:
                continue
            # Build the list of partition objects
            tp_list = [TopicPartition(topic, p) for p in partitions]
            # Fetch the latest offset of every partition
            end_offsets = consumer.end_offsets(tp_list)
            # Sum them to get the topic total
            total_offsets[topic] = sum(end_offsets.values())
    except Exception as e:
        print(f">>> Failed to fetch Kafka offsets: {e}")
    finally:
        if consumer:
            consumer.close()
    return total_offsets


# -------------- Snowflake ID --------------
EPOCH = 1_600_000_000_000
WORKER_ID_BITS = 10
SEQUENCE_BITS = 12
MAX_WORKER_ID = -1 ^ (-1 << WORKER_ID_BITS)
MAX_SEQUENCE = -1 ^ (-1 << SEQUENCE_BITS)
WORKER_ID_SHIFT = SEQUENCE_BITS
TIMESTAMP_SHIFT = SEQUENCE_BITS + WORKER_ID_BITS


class Snowflake:
    _lock = threading.Lock()

    def __init__(self, worker_id: int = 1):
        self.worker_id = worker_id
        self.sequence = 0
        self.last_timestamp = -1

    def _gen_timestamp(self):
        return int(time.time() * 1000)

    def get_id(self) -> int:
        with Snowflake._lock:
            timestamp = self._gen_timestamp()
            if timestamp < self.last_timestamp:
                time.sleep(0.001)
                timestamp = self._gen_timestamp()
            if timestamp == self.last_timestamp:
                self.sequence = (self.sequence + 1) & MAX_SEQUENCE
                if self.sequence == 0:
                    while timestamp <= self.last_timestamp:
                        timestamp = self._gen_timestamp()
            else:
                self.sequence = 0
            self.last_timestamp = timestamp
            return ((timestamp - EPOCH) << TIMESTAMP_SHIFT |
                    self.worker_id << WORKER_ID_SHIFT |
                    self.sequence)


_snow = Snowflake(worker_id=1)


def gen_new_id() -> int: return _snow.get_id()


# -------------- Database settings --------------
DB_CONFIG = {
    "host": "xx.xx.xx.xx",
    "user": "root",
    "password": "xxxx",
    "port": 3306,
    "charset": "utf8mb4",
    "autocommit": False,
}
DATABASE = "db_test"

# ========== Rate settings ==========
BATCH_SIZE = 100
TARGET_TPS = 100
CRUD_ROUND = 20

TABLE_META = {
    "t_bidding_info": [
        "id", "title", "publish_time", "area_code", "notice_type_main",
        "notice_type_sub", "industry_code", "project_name", "project_number",
        "project_budget_money", "project_time_limit", "entry_start_time",
        "entry_end_time", "bid_open_time", "proprietor_company",
        "agency_company", "winner_company", "winner_candidate",
        "related_construction", "project_fund_source", "attachment_oss",
        "url", "crawl_time", "s_id", "u_id", "u_tags", "qds", "create_time",
        "row_update_time", "local_row_update_time", "cdc_sync_date",
        "products", "partition_date", "insert_mysql"
    ],
    "t_bidding_content": [
        "id", "content_text", "content_html", "content_swf", "u_id",
        "u_tags", "qds", "create_time", "row_update_time",
        "local_row_update_time", "cdc_sync_date", "partition_date",
        "insert_mysql"
    ],
    "t_bidding_related": [
        "id", "u_id", "eid", "role", "title", "publish_time", "area_code",
        "notice_type_main", "notice_type_sub", "industry_code",
        "project_number", "project_bid_money", "qds", "u_tags",
        "create_time", "row_update_time", "local_row_update_time",
        "cdc_sync_date", "products", "partition_date", "insert_mysql"
    ],
}

STOP_EVENT = threading.Event()


# -------------- Token bucket --------------
class TokenBucket:
    def __init__(self, rate: int):
        self.rate = rate
        self.tokens = rate
        self.lock = threading.Lock()
        self.last = time.time()

    def consume(self, n: int = 1):
        while not STOP_EVENT.is_set():
            with self.lock:
                now = time.time()
                delta = now - self.last
                self.tokens = min(self.rate, self.tokens + delta * self.rate)
                self.last = now
                if self.tokens >= n:
                    self.tokens -= n
                    return
            time.sleep(0.001)


TOKEN_BUCKET = TokenBucket(TARGET_TPS)

# -------------- Round counting & CRUD logic --------------
_INSERT_ROUNDS = 0
_CRUD_LOCK = threading.Lock()
_CRUD_SEQ = 0
_CRUD_SEQ_LOCK = threading.Lock()
_BUFFER_LOCK = threading.Lock()
_BUFFERS: Dict[str, Deque[Dict[str, Any]]] = {
    tbl: deque(maxlen=CRUD_ROUND * BATCH_SIZE * 2) for tbl in TABLE_META
}


def _trigger_crud_if_needed():
    global _INSERT_ROUNDS
    with _CRUD_LOCK:
        _INSERT_ROUNDS += 1
        if _INSERT_ROUNDS >= CRUD_ROUND:
            _INSERT_ROUNDS = 0
            threading.Thread(target=_crud_once, daemon=True).start()


def _crud_once():
    conn = pymysql.connect(**DB_CONFIG)
    try:
        with _BUFFER_LOCK:
            if not any(_BUFFERS.values()): return
            snapshot = {k: list(v) for k, v in _BUFFERS.items()}
            for q in _BUFFERS.values(): q.clear()

        with _CRUD_SEQ_LOCK:
            global _CRUD_SEQ
            _CRUD_SEQ += 1
            counter = _CRUD_SEQ

        print(f"\n>>> CRUD round {counter} started")
        for tbl, buf in snapshot.items():
            if not buf: continue
            ids = [r["id"] for r in buf]
            mod_cnt = random.randint(50, 100)
            del_cnt = random.randint(50, 100)
            mod_ids = random.sample(ids, min(mod_cnt, len(ids)))
            candidates = [i for i in ids if i not in mod_ids]
            del_cnt = min(del_cnt, len(candidates))
            del_ids = random.sample(candidates, del_cnt) if del_cnt else []

            with conn.cursor() as cur:
                if mod_ids:
                    col_map = {"t_bidding_info": "title", "t_bidding_related": "title",
                               "t_bidding_content": "content_text"}
                    col = col_map[tbl]
                    new_val = f"[RAND-{counter}]"
                    sql_mod = f"UPDATE `{DATABASE}`.`{tbl}` SET {col} = %s WHERE id = %s"
                    cur.executemany(sql_mod, [(new_val, _id) for _id in mod_ids])
                    STATS[tbl]["update"] += len(mod_ids)
                    print(f"  [{tbl}] updated {len(mod_ids)} rows")
                if del_ids:
                    sql_del = f"DELETE FROM `{DATABASE}`.`{tbl}` WHERE id = %s"
                    cur.executemany(sql_del, [(_id,) for _id in del_ids])
                    STATS[tbl]["delete"] += len(del_ids)
                    print(f"  [{tbl}] deleted {len(del_ids)} rows")
            conn.commit()
        print(f">>> CRUD round {counter} finished\n")
    finally:
        conn.close()


def add_to_buffer(table: str, rows: List[Dict[str, Any]]):
    with _BUFFER_LOCK:
        for r in rows:
            _BUFFERS[table].append({"id": r["id"], "table": table})
    _trigger_crud_if_needed()


# -------------- Data generation --------------
_SHARED_UIDS = deque(maxlen=2000)
_UID_REUSE_RATE = 0.8
_AREA_CODE_POOL: List[str] = []


def _clean_uid(raw: str) -> str:
    if not raw: return 'default_uid'
    raw = str(raw).strip()
    cleaned = re.sub(r'[^a-zA-Z0-9_-]', '_', raw)
    return cleaned[:120] if cleaned else 'default_uid'


def _load_area_code_pool():
    global _AREA_CODE_POOL
    conn = pymysql.connect(**DB_CONFIG)
    try:
        with conn.cursor() as cur:
            cur.execute(
                f"SELECT DISTINCT area_code FROM `{DATABASE}`.`t_bidding_info` WHERE area_code IS NOT NULL LIMIT 1000")
            _AREA_CODE_POOL = [row[0] for row in cur.fetchall()]
            print(f">>> area_code pool loaded, {len(_AREA_CODE_POOL)} entries")
    finally:
        conn.close()


def transform_one(table: str, row: Dict[str, Any]) -> Dict[str, Any]:
    new_row = row.copy()
    new_row["id"] = gen_new_id()
    new_row["u_id"] = _clean_uid(row.get("u_id"))
    if random.random() < _UID_REUSE_RATE and _SHARED_UIDS:
        new_row["u_id"] = random.choice(_SHARED_UIDS)
    else:
        if new_row["u_id"]: _SHARED_UIDS.append(new_row["u_id"])
    if table == "t_bidding_info":
        if _AREA_CODE_POOL: new_row["area_code"] = random.choice(_AREA_CODE_POOL)
        if new_row.get("title"): new_row["title"] = "[UPD]" + new_row["title"]
        if new_row.get("project_budget_money") is not None: new_row["project_budget_money"] = round(
            float(new_row["project_budget_money"]) * 1.05, 6)
    elif table == "t_bidding_content":
        if new_row.get("content_text"): new_row["content_text"] = "[UPD]" + new_row["content_text"]
    elif table == "t_bidding_related":
        if new_row.get("title"): new_row["title"] = "[UPD]" + new_row["title"]
        if new_row.get("project_bid_money") is not None: new_row["project_bid_money"] = round(
            float(new_row["project_bid_money"]) * 1.05, 6)
    return new_row


def build_insert_sql(table: str, cols: List[str]) -> str:
    back_quote_cols = [f"`{c}`" for c in cols]
    placeholders = ["%s"] * len(cols)
    return f"INSERT INTO `{DATABASE}`.`{table}` ({','.join(back_quote_cols)}) VALUES ({','.join(placeholders)})"


def worker(table: str):
    cols = TABLE_META[table]
    select_cols = ",".join([f"`{c}`" for c in cols])
    sql_select = f"SELECT {select_cols} FROM `{DATABASE}`.`{table}` WHERE id > %s ORDER BY id ASC LIMIT %s"
    sql_insert = build_insert_sql(table, cols)
    conn = pymysql.connect(**DB_CONFIG)
    try:
        lower_id = 0
        total = 0
        while not STOP_EVENT.is_set():
            with conn.cursor() as cur:
                cur.execute(sql_select, (lower_id, BATCH_SIZE))
                rows = cur.fetchall()
                if not rows:
                    lower_id = 0
                    continue
                dict_rows = [dict(zip(cols, r)) for r in rows]
                new_rows = [transform_one(table, r) for r in dict_rows]
                TOKEN_BUCKET.consume(len(new_rows))
                try:
                    cur.executemany(sql_insert, [tuple(r[c] for c in cols) for r in new_rows])
                    conn.commit()
                    total += len(new_rows)
                    print(f"[{table}] +{len(new_rows)} rows, total inserted {total}")
                    STATS[table]["insert"] += len(new_rows)
                    add_to_buffer(table, new_rows)
                except Exception as e:
                    conn.rollback()
                    print(f"[{table}] insert failed: {e}")
                lower_id = max(r["id"] for r in dict_rows)
    finally:
        conn.close()


# -------------- Core statistics function --------------
def print_final_stats(initial_offsets: Dict[str, int], final_offsets: Dict[str, int]):
    """
    initial_offsets: offsets when the script started
    final_offsets: offsets when the script stopped
    """
    print("\n=================== Statistics for this run ===================")
    print(
        f"{'Table / Topic':<30} | {'MySQL Insert':<12} | {'MySQL Update':<12} | {'MySQL Delete':<12} | {'Kafka Recv':<10}")
    print("-" * 90)

    for tbl in TABLE_META.keys():
        # MySQL-side counts
        ins = STATS[tbl]['insert']
        upd = STATS[tbl]['update']
        dlt = STATS[tbl]['delete']

        # Kafka messages received (increment for this run = final - initial)
        topic_name = f"maxwell_{tbl}"
        start_off = initial_offsets.get(topic_name, 0)
        end_off = final_offsets.get(topic_name, 0)

        # 'Kafka Recv' is the number of messages added to Kafka while the script ran
        kafka_recv = max(0, end_off - start_off)

        print(f"{tbl:<30} | {ins:<12} | {upd:<12} | {dlt:<12} | {kafka_recv:<10}")

    print("===============================================")

    # ------ Print the current total of each topic after stopping ------
    print("\n>>> Total end offset of each Kafka topic (after stopping):")
    for topic in KAFKA_TOPICS:
        count = final_offsets.get(topic, 0)
        print(f"  [Topic] {topic:<35} : {count} messages")
    print("===============================================")


def wait_for_schedule_window():
    fmt = "%H:%M"
    start = dt.datetime.strptime(Config.SCHEDULE_START_TIME, fmt).time()
    end = dt.datetime.strptime(Config.SCHEDULE_END_TIME, fmt).time()
    while not STOP_EVENT.is_set():
        now = dt.datetime.now().time()
        if start <= end:
            in_window = start <= now <= end
        else:
            in_window = now >= start or now <= end
        if in_window:
            print(f"[schedule] Inside the allowed window {Config.SCHEDULE_START_TIME}~{Config.SCHEDULE_END_TIME}, starting work")
            return
        else:
            print(f"[schedule] Current time {now.strftime('%H:%M')} is outside the allowed window, waiting 60 s …")
            STOP_EVENT.wait(60)


# -------------- Main control --------------
def main():
    wait_for_schedule_window()
    if STOP_EVENT.is_set(): return

    _load_area_code_pool()
    create_topics_once()

    # --- Fetch and print the initial totals ---
    print(">>> Fetching initial Kafka totals (please wait)...")
    initial_offsets = get_current_topic_offsets()

    # Newly added print section
    print("\n>>> [Initial] Current total of each Kafka topic:")
    for topic in KAFKA_TOPICS:
        count = initial_offsets.get(topic, 0)
        print(f"  [Topic] {topic:<35} : {count} messages")
    print("-----------------------------------------------\n")
    # End of newly added section

    print(">>> Initial offsets fetched, starting threads...")

    def _sig_handler(sig, frame):
        print("\n>>> Ctrl+C caught, stopping all threads ...")
        STOP_EVENT.set()

    signal.signal(signal.SIGINT, _sig_handler)

    threads = []
    for tbl in TABLE_META.keys():
        t = threading.Thread(target=worker, args=(tbl,), daemon=False)
        t.start()
        threads.append(t)

    try:
        while not STOP_EVENT.is_set():
            time.sleep(1)
    finally:
        print(">>> Waiting for worker threads to finish...")
        for t in threads:
            if not t.daemon: t.join()

        print(">>> Waiting for Maxwell to sync the last rows (wait 5 s) ...")
        time.sleep(5)

        print(">>> Reading the final Kafka totals ...")
        final_offsets = get_current_topic_offsets()

        # Pass both the initial and the final offsets for comparison
        print_final_stats(initial_offsets, final_offsets)
        print(">>> Exited cleanly")


if __name__ == "__main__":
    main()
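
The Snowflake class in the script packs three fields into one integer: milliseconds since EPOCH in the high bits, then a 10-bit worker ID, then a 12-bit sequence number. As a sanity check of that layout, the sketch below reverses the packing. decode_snowflake is a hypothetical helper that is not part of the script, and the sample timestamp is made up for illustration.

```python
# Same constants as in the script
EPOCH = 1_600_000_000_000
WORKER_ID_BITS = 10
SEQUENCE_BITS = 12


def decode_snowflake(snowflake_id):
    """Split an ID produced by Snowflake.get_id() into (timestamp_ms, worker_id, sequence)."""
    sequence = snowflake_id & ((1 << SEQUENCE_BITS) - 1)
    worker_id = (snowflake_id >> SEQUENCE_BITS) & ((1 << WORKER_ID_BITS) - 1)
    timestamp_ms = (snowflake_id >> (SEQUENCE_BITS + WORKER_ID_BITS)) + EPOCH
    return timestamp_ms, worker_id, sequence


# Build an ID by hand with the same formula as get_id(): made-up timestamp,
# worker_id=1, sequence=5.
example_id = ((1_700_000_000_000 - EPOCH) << 22) | (1 << 12) | 5
print(decode_snowflake(example_id))  # (1700000000000, 1, 5)
```

Because the timestamp occupies the high bits, newly generated IDs sort after older ones, which is why the worker's "WHERE id > %s ORDER BY id" scan also picks up rows the script inserted itself.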

2.2 Test results

Before the run, all three topics contain 0 messages.


Press Ctrl+C to stop the program; it then prints the final statistics.

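The topic maxwell_t_bidding_content received no data, even though the table db_test.t_bidding_content actually had 2500 rows inserted, 256 updates, and 231 rows deleted. This shows that Maxwell successfully stopped listening to db_test.t_bidding_content.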
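
The check behind this conclusion is simple arithmetic: Maxwell publishes one message per changed row, so a table that is still being listened to should produce roughly inserts + updates + deletes messages. A minimal sketch with the numbers reported above follows; expected_events is a hypothetical helper, and the one-message-per-row-change assumption ignores updates that leave a row unchanged.

```python
def expected_events(inserts, updates, deletes):
    """Messages Maxwell would publish if the table were still being listened to."""
    return inserts + updates + deletes


# MySQL-side counts for db_test.t_bidding_content from this test run
expected = expected_events(inserts=2500, updates=256, deletes=231)
kafka_recv = 0  # messages that arrived in maxwell_t_bidding_content

print(expected)                           # 2987
print(kafka_recv == 0 and expected > 0)   # True: changes happened, none were published
```

For the two tables that are not blacklisted, the same comparison should show the Kafka count close to the expected number instead of 0.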

Confirm the actual number of messages in each topic:

Create a shell script that counts the messages in a topic:

vi count_maxwell_total.sh

Script contents:

#!/bin/bash
# count_maxwell_total.sh  for Kafka 3.9.0
BROKERS="xx.xx.xx.xx1:9092,xx.xx.xx.xx2:9092,xx.xx.xx.xx3:9092"
# Change this to the topic you want to count
TOPIC="maxwell_t_bidding_info"
KAFKA_HOME=/home/bigdata/kafka_2.13-3.9.0

# Offset tool shipped with Kafka 3.9.0
"$KAFKA_HOME/bin/kafka-get-offsets.sh" \
  --bootstrap-server "$BROKERS" \
  --topic "$TOPIC" 2>/dev/null | \
awk -F: '{sum+=$NF} END{print "总条数:", sum}'
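
kafka-get-offsets.sh prints one line per partition in the form topic:partition:offset, and the awk step sums the last field. The same summation can be sketched in Python, which is handy when the count feeds into a larger check. sum_end_offsets is a hypothetical helper and the sample offsets are invented for illustration; they are not the results of this test.

```python
def sum_end_offsets(output):
    """Sum the last colon-separated field of every non-empty line."""
    total = 0
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # rsplit keeps working even if a future field order adds more prefixes
        total += int(line.rsplit(":", 1)[1])
    return total


# Made-up sample in the topic:partition:offset format
sample_output = """\
maxwell_t_bidding_info:0:410
maxwell_t_bidding_info:1:395
maxwell_t_bidding_info:2:402
"""
print("Total messages:", sum_end_offsets(sample_output))  # Total messages: 1207
```

As with the shell version, the sum of end offsets equals the message count only for a topic that was freshly created and has not been trimmed by retention.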


Change TOPIC to maxwell_t_bidding_content and run the script again:

TOPIC="maxwell_t_bidding_content"


Change TOPIC to maxwell_t_bidding_related and run the script again:

TOPIC="maxwell_t_bidding_related"
