MediaPipe: Key Data Structures and Methods Explained

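Original article. First published 2025-11-23 22:35:34; last modified 2025-11-30 22:07:27.

This content is licensed under CC 4.0 BY-SA.

Tags: #computer vision #artificial intelligence #gesture recognition

Part of the Gesture Recognition column.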

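As the application example in the previous article, "MediaPipe入门" (Getting Started with MediaPipe), shows, the core code for detecting hand landmarks with MediaPipe comes down to three statements:

mp_hands = mp.solutions.hands
hands = mp_hands.Hands(static_image_mode=True,
                       max_num_hands=2,
                       min_detection_confidence=0.5,
                       min_tracking_confidence=0.5)
results = hands.process(img)

Below, purely from an application point of view, we look in detail at what these parameters mean and at what the detection result contains.
Everything that follows is based on MediaPipe 0.10.21, the version I have installed.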

I. The mp_hands.Hands() parameters in depth

1.1 Core parameters in detail

import mediapipe as mp
import cv2

class MediaPipeHandsConfig:
    """
    Annotated configuration helper for the MediaPipe Hands parameters
    """
    
    @staticmethod
    def create_hands_model(
        static_image_mode: bool = False,
        max_num_hands: int = 2, 
        model_complexity: int = 1,
        min_detection_confidence: float = 0.5,
        min_tracking_confidence: float = 0.5
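    ):
        """
        Create a MediaPipe Hands model instance.

        Parameters:

        static_image_mode (bool):
        - False: video-stream mode; once a hand is detected it is tracked
          across frames, and detection only re-runs when tracking is lost
        - True:  static-image mode; detection runs on every input image
        - Recommended: False for real-time use, True for single images or
          batches of unrelated images

        max_num_hands (int):
        - Maximum number of hands to detect (any positive integer, default 2)
        - Affects performance: the more hands, the more computation
        - Recommended: choose per scenario; 2 is usually enough

        model_complexity (int):
        - 0: lightweight model, fastest, lower accuracy
        - 1: full model, more accurate but slower (default)
        - Hands accepts only 0 or 1; there is no level 2
        - Recommended: 0 when latency matters most, otherwise 1

        min_detection_confidence (float) [0, 1]:
        - Minimum confidence for a hand detection to count as successful
        - Higher values are stricter and may miss more hands
        - Lower values are more permissive and may produce more false positives
        - Recommended: 0.5-0.7

        min_tracking_confidence (float) [0, 1]:
        - Minimum confidence for the hand landmarks to count as tracked
        - Ignored when static_image_mode=True
        - Higher values make tracking more robust, but detection is re-run
          more often, which costs latency
        - Recommended: 0.5-0.7
        """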
        
        mp_hands = mp.solutions.hands
        
        hands = mp_hands.Hands(
            static_image_mode=static_image_mode,
            max_num_hands=max_num_hands,
            model_complexity=model_complexity,
            min_detection_confidence=min_detection_confidence,
            min_tracking_confidence=min_tracking_confidence
        )
        
        return hands
    
    @staticmethod
    def get_recommended_configs():
        """
        Return recommended configurations for different scenarios.

        Each entry carries an extra "description" key that must be removed
        before the dict is passed to Hands().
        """
        configs = {
            "real_time_fast": {
                "static_image_mode": False,
                "max_num_hands": 2,
                "model_complexity": 0,
                "min_detection_confidence": 0.5,
                "min_tracking_confidence": 0.5,
                "description": "Fast real-time detection, suited to mobile devices"
            },
            "real_time_balanced": {
                "static_image_mode": False,
                "max_num_hands": 2,
                "model_complexity": 1,
                "min_detection_confidence": 0.6,
                "min_tracking_confidence": 0.6,
                "description": "Balanced real-time mode, trading speed against accuracy"
            },
            "high_accuracy": {
                "static_image_mode": True,
                "max_num_hands": 2,
                "model_complexity": 1,  # Hands supports only 0 or 1
                "min_detection_confidence": 0.7,
                "min_tracking_confidence": 0.7,  # ignored in static-image mode
                "description": "High-accuracy mode, suited to static image analysis"
            },
            "single_hand_focus": {
                "static_image_mode": False,
                "max_num_hands": 1,
                "model_complexity": 1,
                "min_detection_confidence": 0.7,
                "min_tracking_confidence": 0.7,
                "description": "Single-hand detection, reducing interference"
            }
        }
        return configs
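
The dictionaries returned by get_recommended_configs() carry a description key that Hands() does not accept, so they cannot be unpacked into the constructor as they are. The following is a minimal sketch of one way to deal with that. The helper name to_hands_kwargs is hypothetical (it is not part of MediaPipe), and the range checks assume that the Hands solution accepts model_complexity values of 0 or 1 only:

```python
def to_hands_kwargs(config: dict) -> dict:
    """Hypothetical helper: turn a recommended-config dict into Hands() kwargs."""
    # Drop the human-readable description; Hands() has no such parameter.
    kwargs = {k: v for k, v in config.items() if k != "description"}

    # Basic sanity checks on the values (assumed valid ranges, see lead-in).
    if kwargs.get("model_complexity", 1) not in (0, 1):
        raise ValueError("model_complexity must be 0 or 1 for Hands")
    if kwargs.get("max_num_hands", 2) < 1:
        raise ValueError("max_num_hands must be at least 1")
    for name in ("min_detection_confidence", "min_tracking_confidence"):
        value = kwargs.get(name, 0.5)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within [0, 1], got {value}")
    return kwargs


example_config = {
    "static_image_mode": False,
    "max_num_hands": 1,
    "model_complexity": 1,
    "min_detection_confidence": 0.7,
    "min_tracking_confidence": 0.7,
    "description": "single-hand focus",
}
hands_kwargs = to_hands_kwargs(example_config)
print(sorted(hands_kwargs))
```

With MediaPipe installed, the cleaned dictionary could then be passed on as mp.solutions.hands.Hands(**hands_kwargs).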

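II. The MediaPipe Hands.process() method in detail

2.1 What the method does

solutions.hands.Hands.process() is the core method of the MediaPipe hand detection solution. It takes an RGB image and performs the following:

- Hand detection: locates hand regions in the input image
- Hand landmark recognition: detects 21 anatomical landmarks for each hand
- Handedness classification: decides whether each hand is a left or a right hand
- 3D coordinate estimation: provides 3D hand landmarks in world coordinates

The method wraps the whole machine-learning inference pipeline, from image preprocessing through postprocessing, and returns a structured hand analysis result.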

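2.2 The return value in detail

Return type

The method returns a NamedTuple with three fields: multi_hand_landmarks, multi_hand_world_landmarks, and multi_handedness. It is not an mp.tasks.vision.HandLandmarkerResult; that type, with its hand_landmarks, hand_world_landmarks, and handedness fields, belongs to the separate Tasks API (HandLandmarker).

Core attributes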

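1. `multi_handedness` - handedness information

Type: List[ClassificationList]
Description: a list holding the handedness information for each detected hand. MediaPipe determines handedness on the assumption that the input image is mirrored (a selfie-style, horizontally flipped frame).
Structure:
- Outer list: its length equals the number of detected hands. If two hands are detected, the length is 2.
- ClassificationList object: a protobuf message whose classification field is a repeated field, normally of length 1, containing one Classification object.
- Each Classification object has three attributes:
  - index: class index (usually 0 means "Left" and 1 means "Right", but it is safer to confirm the mapping through the label).
  - score: classification confidence in [0, 1], indicating how sure the model is of its decision.
  - label: class label string (for example "Left" or "Right"). This is the most useful attribute.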

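# Example structure (schematic, not literal Python)
[
    ClassificationList(  # detected hand 0
        classification=[
            Classification(
                index=0,           # class index (0=Left, 1=Right)
                score=0.98,        # confidence in [0, 1]
                label='Left',      # class label ('Left'/'Right')
            )
        ]
    ),
    ClassificationList(  # detected hand 1
        classification=[
            Classification(
                index=1,
                score=0.95,
                label='Right'
            )
        ]
    )
]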

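Access example:

if results.multi_handedness:  # None when no hand is detected
    for i, hand_classifications in enumerate(results.multi_handedness):
        classification = hand_classifications.classification[0]  # always take the first element
        print(f"Hand {i}: {classification.label}, score: {classification.score:.2f}")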
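
MediaPipe's documentation states that handedness is determined on the assumption that the input image is mirrored, as with a horizontally flipped selfie frame. If frames are fed in unflipped, the reported label is therefore the opposite of the physical hand. The sketch below corrects for that; physical_hand is a hypothetical helper name, and whether the input is mirrored is something the caller has to know about its own pipeline:

```python
def physical_hand(label: str, input_is_mirrored: bool) -> str:
    """Map a MediaPipe handedness label to the physical hand (hypothetical helper)."""
    if input_is_mirrored:
        # The label already matches the physical hand.
        return label
    # Unmirrored input: swap the two labels, leave anything else untouched.
    swap = {"Left": "Right", "Right": "Left"}
    return swap.get(label, label)


print(physical_hand("Left", input_is_mirrored=False))
print(physical_hand("Left", input_is_mirrored=True))
```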

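2. `multi_hand_landmarks` - normalized hand landmarks

Type: List[NormalizedLandmarkList]
Description: the most important attribute. It contains the normalized coordinates of all 21 landmarks for each detected hand.
Structure:
- Outer list: its length equals the number of detected hands, and it lines up one-to-one with the multi_handedness list.
- NormalizedLandmarkList object: has a landmark attribute, a list of 21 NormalizedLandmark objects.
- Each NormalizedLandmark object has the following attributes:
  - x: normalized x coordinate in [0, 1], relative to the image width.
  - y: normalized y coordinate in [0, 1], relative to the image height.
  - z: depth coordinate with the wrist landmark as the origin; the smaller the value, the closer the point is to the camera. Its scale is roughly the same as that of x.
  - visibility: visibility confidence of the landmark (the field exists on the message, but may not be available for some models).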

# Example structure (schematic, not literal Python)
[
    NormalizedLandmarkList(
        landmark=[  # 21 landmarks in a fixed order
            NormalizedLandmark(x=0.1, y=0.2, z=0.01),  # 0: wrist
            NormalizedLandmark(x=0.12, y=0.18, z=0.02), # 1: thumb CMC (base of the thumb)
            # ... the remaining 19 landmarks
            NormalizedLandmark(x=0.08, y=0.05, z=0.03)  # 20: pinky tip
        ]
    )
]

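The joints are numbered as follows:
[Figure: hand landmark numbering diagram]

Mapping of the 21 landmarks:

0: wrist
1-4: thumb (CMC, MCP, IP, TIP)
5-8: index finger (MCP, PIP, DIP, TIP)
9-12: middle finger (MCP, PIP, DIP, TIP)
13-16: ring finger (MCP, PIP, DIP, TIP)
17-20: pinky (MCP, PIP, DIP, TIP)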
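
The index layout above is regular enough to generate in code. The sketch below rebuilds the 21 names in index order. The names follow the members of the mp.solutions.hands.HandLandmark enum (for example WRIST, THUMB_TIP, INDEX_FINGER_TIP), while FINGER_JOINTS, LANDMARK_NAMES, and FINGERTIP_IDS are illustrative variable names, not MediaPipe identifiers:

```python
# Joint names per finger, in the order MediaPipe numbers them.
FINGER_JOINTS = {
    "THUMB": ("CMC", "MCP", "IP", "TIP"),
    "INDEX_FINGER": ("MCP", "PIP", "DIP", "TIP"),
    "MIDDLE_FINGER": ("MCP", "PIP", "DIP", "TIP"),
    "RING_FINGER": ("MCP", "PIP", "DIP", "TIP"),
    "PINKY": ("MCP", "PIP", "DIP", "TIP"),
}

# Index 0 is the wrist, followed by four joints for each finger.
LANDMARK_NAMES = ["WRIST"]
for finger, joints in FINGER_JOINTS.items():
    LANDMARK_NAMES.extend(f"{finger}_{joint}" for joint in joints)

# Indices of the five fingertips, handy for gesture rules.
FINGERTIP_IDS = [i for i, name in enumerate(LANDMARK_NAMES) if name.endswith("_TIP")]

print(len(LANDMARK_NAMES))   # 21
print(LANDMARK_NAMES[8])     # INDEX_FINGER_TIP
print(FINGERTIP_IDS)         # [4, 8, 12, 16, 20]
```

In real code the enum itself is usually the better choice, since hand_landmarks.landmark[mp_hands.HandLandmark.INDEX_FINGER_TIP] avoids hard-coded numbers.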

Access example:

if results.multi_hand_landmarks:
    for hand_idx, hand_landmarks in enumerate(results.multi_hand_landmarks):
        # Wrist position
        wrist = hand_landmarks.landmark[0]
        print(f"Wrist: ({wrist.x:.3f}, {wrist.y:.3f}, {wrist.z:.3f})")
        
        # Index fingertip
        index_tip = hand_landmarks.landmark[8]
        print(f"Index fingertip: ({index_tip.x:.3f}, {index_tip.y:.3f})")
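
Because x and y are normalized, they have to be scaled by the image size before they can be used as pixel positions, and a landmark lying slightly outside the frame can produce an out-of-range value. A minimal sketch of that conversion follows. Point is a stand-in for NormalizedLandmark so that the example runs without MediaPipe, and to_pixel is a hypothetical helper name:

```python
from collections import namedtuple

# Stand-in for mediapipe's NormalizedLandmark (same attribute names).
Point = namedtuple("Point", "x y z")


def to_pixel(landmark, width: int, height: int) -> tuple:
    """Scale a normalized landmark to pixel coordinates, clamped to the image."""
    x_px = min(max(int(landmark.x * width), 0), width - 1)
    y_px = min(max(int(landmark.y * height), 0), height - 1)
    return x_px, y_px


index_tip = Point(x=0.5, y=0.25, z=-0.02)
print(to_pixel(index_tip, 640, 480))               # (320, 120)
print(to_pixel(Point(1.2, -0.1, 0.0), 640, 480))   # (639, 0), clamped to the frame
```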

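3. `multi_hand_world_landmarks` - hand landmarks in world coordinates

Type: List[LandmarkList]

Structure:

- Similar to multi_hand_landmarks, but the coordinates are real-world 3D coordinates
- Unit: meters
- The origin is at the approximate geometric center of the hand
- Well suited to 3D spatial analysis and gesture recognition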

Access example:

if results.multi_hand_world_landmarks:
    for world_landmarks in results.multi_hand_world_landmarks:
        wrist_3d = world_landmarks.landmark[0]
        print(f"3D wrist position: ({wrist_3d.x:.3f}m, {wrist_3d.y:.3f}m, {wrist_3d.z:.3f}m)")
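
Since world landmarks are metric, distances between them can be read directly as physical lengths, which is what makes them convenient for gesture rules. Below is a sketch of a thumb-to-index "pinch" distance. It uses SimpleNamespace objects as stand-ins for real Landmark messages, and the 3 cm threshold is an arbitrary assumption for illustration, not a MediaPipe value:

```python
import math
from types import SimpleNamespace

PINCH_THRESHOLD_M = 0.03  # assumed threshold: 3 cm, chosen only for this example


def distance_m(a, b) -> float:
    """Euclidean distance between two landmarks with x, y, z in meters."""
    return math.sqrt((a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2)


# Stand-ins for world_landmarks.landmark[4] (thumb tip) and [8] (index tip).
thumb_tip = SimpleNamespace(x=0.00, y=0.00, z=0.00)
index_tip = SimpleNamespace(x=0.03, y=0.04, z=0.00)

d = distance_m(thumb_tip, index_tip)
is_pinch = d < PINCH_THRESHOLD_M
print(f"distance: {d * 100:.1f} cm, pinch: {is_pinch}")
```

The same rule applied to the normalized landmarks would depend on how far the hand is from the camera, which is why the world coordinates are the more natural input here.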

Complete usage example

import cv2
import mediapipe as mp

# Initialize MediaPipe Hands
mp_hands = mp.solutions.hands
hands = mp_hands.Hands(
    static_image_mode=True,       # a single image is processed, so run detection on it
    max_num_hands=2,              # maximum number of hands
    min_detection_confidence=0.5, # detection confidence threshold
    min_tracking_confidence=0.5   # tracking confidence threshold (ignored in static mode)
)

# Load the image; cv2.imread returns None if the file cannot be read
image = cv2.imread('hand_image.jpg')
if image is None:
    raise FileNotFoundError('hand_image.jpg could not be read')
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # process() expects RGB
results = hands.process(image_rgb)

# Parse the results (the fields are None when no hand is detected)
if results.multi_hand_landmarks:
    print(f"Detected {len(results.multi_hand_landmarks)} hand(s)")
    
    for i, (hand_landmarks, handedness_list) in enumerate(
            zip(results.multi_hand_landmarks, results.multi_handedness)):
        # Handedness information
        handedness = handedness_list.classification[0]
        print(f"\n--- Hand {i+1} ---")
        print(f"  Handedness: {handedness.label}")
        print(f"  Score: {handedness.score:.3f}")
        
        # Landmark information
        print("  Landmark coordinates:")
        for j, landmark in enumerate(hand_landmarks.landmark):
            print(f"    Point {j}: ({landmark.x:.3f}, {landmark.y:.3f}, {landmark.z:.3f})")
        
        # Convert normalized coordinates to pixels for drawing
        h, w, _ = image.shape
        landmark_px = []
        for landmark in hand_landmarks.landmark:
            x_px = int(landmark.x * w)
            y_px = int(landmark.y * h)
            landmark_px.append((x_px, y_px))
        
        # Draw the landmarks on the image
        for point in landmark_px:
            cv2.circle(image, point, 5, (0, 255, 0), -1)

cv2.imshow('Hand Detection', image)
cv2.waitKey(0)
cv2.destroyAllWindows()
hands.close()

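Important notes

Handling empty results: when no hand is detected, the result fields still exist, but their value is None rather than an empty list, so check them before iterating.
Coordinate systems:
- multi_hand_landmarks: normalized coordinates (0-1), suited to drawing on the image
- multi_hand_world_landmarks: real-world 3D coordinates in meters, suited to spatial analysis
API compatibility:
- The solutions API used in this article (mp.solutions.hands) names its result fields multi_hand_landmarks, multi_hand_world_landmarks, and multi_handedness
- The newer Tasks API (mp.tasks.vision.HandLandmarker) returns a HandLandmarkerResult with hand_landmarks, hand_world_landmarks, and handedness
Performance:
- For video streams, set static_image_mode=False to improve performance
- Tune the confidence thresholds to balance precision against recall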
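
Iterating over a result field without checking it first fails if the field is None, which is what the solutions API returns in practice when no hand is found. Normalizing with "or []" covers both None and an empty list. The sketch below uses SimpleNamespace as a stand-in for the result object so that it runs without MediaPipe; iter_hands is a hypothetical helper name:

```python
from types import SimpleNamespace


def iter_hands(results) -> list:
    """Pair each hand's landmarks with its handedness; empty list if no hands."""
    landmarks = getattr(results, "multi_hand_landmarks", None) or []
    handedness = getattr(results, "multi_handedness", None) or []
    return list(zip(landmarks, handedness))


# Stand-in for a result with no detected hands.
no_hands = SimpleNamespace(multi_hand_landmarks=None, multi_handedness=None)
print(len(iter_hands(no_hands)))   # 0

# Stand-in for a result with one detected hand (placeholder strings as payload).
one_hand = SimpleNamespace(multi_hand_landmarks=["lm0"], multi_handedness=["h0"])
print(iter_hands(one_hand))        # [('lm0', 'h0')]
```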

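A point worth explaining

To read a classification result you write something like results.multi_handedness[1].classification[0].score. The classification field only ever holds one element, yet it still has to be indexed with [0]. So what does that 0 stand for?
The nested structure is easy to get confused by, so let us look at what the index 0 in classification[0] actually means.

The 0 in classification[0] does not mean hand number 0. It selects the first (and only) classification result the model produced for that one hand. Which hand is meant is decided by the index on multi_handedness.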

# Assume results already holds the return value of hands.process()

if results.multi_handedness:
    print(f"Detected {len(results.multi_handedness)} hand(s)")
    for hand_index, classification_list in enumerate(results.multi_handedness):
        print(f"\n=== Hand {hand_index} ===")
        print(f"classification_list type: {type(classification_list).__name__}")
        print(f"classification length: {len(classification_list.classification)}")
        
        # This is the classification[0] in question
        classification_result = classification_list.classification[0]  # always the first (index 0)
        print(f"  label: {classification_result.label}")
        print(f"  score: {classification_result.score:.3f}")
        print(f"  index: {classification_result.index}")

The output may look something like this:

Detected 2 hand(s)

=== Hand 0 ===
classification_list type: ClassificationList
classification length: 1
  label: Left
  score: 0.950
  index: 0

=== Hand 1 ===
classification_list type: ClassificationList
classification length: 1
  label: Right
  score: 0.980
  index: 1
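
To make the two levels of indexing concrete without needing a camera or an image, the sketch below builds stand-in objects with the same shape as multi_handedness (a list of objects whose classification attribute is a one-element list) and flattens them into (label, score) pairs. make_hand and summarize_handedness are hypothetical names, and the scores are made-up sample values:

```python
from types import SimpleNamespace


def make_hand(label: str, score: float, index: int):
    """Build a stand-in shaped like one ClassificationList entry."""
    entry = SimpleNamespace(label=label, score=score, index=index)
    return SimpleNamespace(classification=[entry])


def summarize_handedness(multi_handedness) -> list:
    """Return one (label, score) pair per detected hand."""
    summary = []
    for hand in multi_handedness or []:   # outer index: which hand
        top = hand.classification[0]      # inner index: the single result for that hand
        summary.append((top.label, round(top.score, 3)))
    return summary


fake_results = [make_hand("Left", 0.95, 0), make_hand("Right", 0.98, 1)]
print(summarize_handedness(fake_results))   # [('Left', 0.95), ('Right', 0.98)]
print(summarize_handedness(None))           # []
```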