MediaPipe: Tracking Faces, Hands, and Pose


Installation and Setup of MediaPipe

Basic Installation

pip install mediapipe opencv-python

Additional Dependencies for Extended Functionality

pip install mediapipe opencv-python numpy matplotlib

Verify Installation

import mediapipe as mp
import cv2
print(f"MediaPipe version: {mp.__version__}")
print(f"OpenCV version: {cv2.__version__}")

Architecture and Core Components of MediaPipe

MediaPipe is built on a modular architecture where each module specializes in a specific computer‑vision task. The library uses optimized neural networks trained on massive datasets, delivering high accuracy and performance.

Key MediaPipe Modules

Module | Purpose | Number of Points | Use Cases
Hands | Hand tracking | 21 points per hand | Gesture control, virtual keyboards, games
FaceMesh | Detailed face mapping | 468 points | AR filters, emotion analysis, recognition
Pose | Body pose tracking | 33 key points | Fitness, sports, rehabilitation, animation
Holistic | Comprehensive analysis | Face + hands + body | Full-body motion capture
FaceDetection | Fast face detection | 6 key points | Security systems, crowd counting
SelfieSegmentation | Human segmentation | Segmentation mask | Background replacement, effects
Objectron | 3D object detection | 3D bounding box | AR, robotics
HandLandmarker (Tasks API) | Enhanced hand tracking | 21 points + extra metrics | Precise gesture recognition

Detailed Module Descriptions

Hands Module – Hand Tracking

The Hands module provides high‑precision tracking of up to two hands simultaneously, detecting 21 key points on each hand.
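The 21 indices follow a fixed layout defined by MediaPipe: the wrist is index 0 and the fingertips are 4, 8, 12, 16, and 20, matching the `mp_hands.HandLandmark` enum. A small lookup table keeps downstream code readable; the names below are illustrative helpers, not part of the library:

```python
# Fingertip indices in MediaPipe's fixed 21-point hand layout:
# 0 = wrist, then 4 points per finger, with tips at 4, 8, 12, 16, 20.
HAND_LANDMARK_NAMES = {
    0: "WRIST",
    4: "THUMB_TIP",
    8: "INDEX_FINGER_TIP",
    12: "MIDDLE_FINGER_TIP",
    16: "RING_FINGER_TIP",
    20: "PINKY_TIP",
}

def tip_indices():
    """Return the five fingertip indices in thumb-to-pinky order."""
    return [4, 8, 12, 16, 20]
```

These are the same indices the finger-counting example below relies on.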

Core Initialization Parameters

import mediapipe as mp

mp_hands = mp.solutions.hands
hands = mp_hands.Hands(
    static_image_mode=False,      # Video mode (False) or static image mode (True)
    max_num_hands=2,              # Maximum number of hands to detect
    min_detection_confidence=0.5, # Minimum detection confidence
    min_tracking_confidence=0.5   # Minimum tracking confidence
)

Practical Example with Coordinate Processing

import cv2
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands
hands = mp_hands.Hands(min_detection_confidence=0.7, min_tracking_confidence=0.5)
mp_draw = mp.solutions.drawing_utils

def count_fingers(landmarks):
    """Count raised fingers"""
    tip_ids = [4, 8, 12, 16, 20]  # Fingertip landmark indices
    fingers = []
    
    # Thumb: compare x-coordinates (works for the right hand in a mirrored image)
    if landmarks[tip_ids[0]][0] > landmarks[tip_ids[0] - 1][0]:
        fingers.append(1)
    else:
        fingers.append(0)
    
    # Other fingers: the tip is above the PIP joint when the finger is raised
    for idx in range(1, 5):
        if landmarks[tip_ids[idx]][1] < landmarks[tip_ids[idx] - 2][1]:
            fingers.append(1)
        else:
            fingers.append(0)
    
    return sum(fingers)

cap = cv2.VideoCapture(0)

while cap.isOpened():
    success, frame = cap.read()
    if not success:
        continue
    
    # Prepare image
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    frame_rgb.flags.writeable = False
    results = hands.process(frame_rgb)
    frame_rgb.flags.writeable = True
    frame = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2BGR)
    
    # Process results
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            # Draw connections
            mp_draw.draw_landmarks(
                frame, hand_landmarks, mp_hands.HAND_CONNECTIONS,
                mp_draw.DrawingSpec(color=(0, 0, 255), thickness=2, circle_radius=2),
                mp_draw.DrawingSpec(color=(0, 255, 0), thickness=2)
            )
            
            # Extract pixel coordinates of each point
            h, w, c = frame.shape
            landmarks = []
            for lm in hand_landmarks.landmark:
                cx, cy = int(lm.x * w), int(lm.y * h)
                landmarks.append([cx, cy])
            
            # Example: count raised fingers
            fingers_up = count_fingers(landmarks)
            cv2.putText(frame, f'Fingers: {fingers_up}', (10, 30), 
                       cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 0, 0), 2)
    
    cv2.imshow("Hand Tracking", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # ESC to exit
        break

cap.release()
cv2.destroyAllWindows()

FaceMesh Module – Detailed Face Mapping

FaceMesh delivers exceptionally accurate real‑time tracking of 468 3‑D facial landmarks (478 when refine_landmarks=True, which adds iris points).

Initialization and Parameters

mp_face_mesh = mp.solutions.face_mesh
face_mesh = mp_face_mesh.FaceMesh(
    static_image_mode=False,
    max_num_faces=1,
    refine_landmarks=True,        # Enhanced accuracy for lips and eyes
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5
)

Working with Specific Facial Regions

import cv2
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh
face_mesh = mp_face_mesh.FaceMesh(refine_landmarks=True)
mp_draw = mp.solutions.drawing_utils
draw_spec = mp_draw.DrawingSpec(thickness=1, circle_radius=1)

# Index lists for specific facial parts
LEFT_EYE = [362, 382, 381, 380, 374, 373, 390, 249, 263, 466, 388, 387, 386, 385, 384, 398]
RIGHT_EYE = [33, 7, 163, 144, 145, 153, 154, 155, 133, 173, 157, 158, 159, 160, 161, 246]
LIPS = [61, 146, 91, 181, 84, 17, 314, 405, 320, 307, 375, 308, 324, 318]

cap = cv2.VideoCapture(0)

while cap.isOpened():
    success, frame = cap.read()
    if not success:
        continue
    
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = face_mesh.process(frame_rgb)
    
    if results.multi_face_landmarks:
        for face_landmarks in results.multi_face_landmarks:
            # Draw full face mesh
            mp_draw.draw_landmarks(
                frame, face_landmarks, mp_face_mesh.FACEMESH_CONTOURS,
                None, draw_spec
            )
            
            # Highlight eyes
            h, w, c = frame.shape
            for eye_idx in LEFT_EYE + RIGHT_EYE:
                x = int(face_landmarks.landmark[eye_idx].x * w)
                y = int(face_landmarks.landmark[eye_idx].y * h)
                cv2.circle(frame, (x, y), 2, (0, 255, 0), -1)
    
    cv2.imshow("Face Mesh", frame)
    if cv2.waitKey(1) & 0xFF == 27:
        break

cap.release()
cv2.destroyAllWindows()

Pose Module – Body Pose Tracking

The Pose module identifies 33 key points of the human body, covering joints, limbs, and central landmarks.
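The 33 pose points likewise use a fixed index order matching the `mp_pose.PoseLandmark` enum. A few left-side indices that appear in the examples in this article are listed below; `POSE_INDEX` and `arm_triplet` are illustrative names, not library symbols:

```python
# Commonly used indices from MediaPipe's fixed 33-point pose layout
# (these match the mp.solutions.pose.PoseLandmark enum values):
POSE_INDEX = {
    "LEFT_SHOULDER": 11,
    "LEFT_ELBOW": 13,
    "LEFT_WRIST": 15,
    "LEFT_HIP": 23,
    "LEFT_KNEE": 25,
    "LEFT_ANKLE": 27,
}

def arm_triplet():
    """Indices needed for a left elbow-angle calculation (shoulder, elbow, wrist)."""
    return (POSE_INDEX["LEFT_SHOULDER"],
            POSE_INDEX["LEFT_ELBOW"],
            POSE_INDEX["LEFT_WRIST"])
```

Using the enum (`mp_pose.PoseLandmark.LEFT_HIP.value`) is safer than raw integers, but the mapping above explains the bare indices used later in the fitness-tracker example.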

Extended Example with Pose Analysis

import cv2
import mediapipe as mp
import numpy as np

mp_pose = mp.solutions.pose
pose = mp_pose.Pose(
    static_image_mode=False,
    model_complexity=2,           # 0, 1 or 2 (higher = more accurate)
    smooth_landmarks=True,
    enable_segmentation=True,     # Enable segmentation
    smooth_segmentation=True,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5
)

mp_draw = mp.solutions.drawing_utils

def calculate_angle(a, b, c):
    """Calculate angle between three points"""
    a = np.array(a)
    b = np.array(b)
    c = np.array(c)
    
    radians = np.arctan2(c[1] - b[1], c[0] - b[0]) - np.arctan2(a[1] - b[1], a[0] - b[0])
    angle = np.abs(radians * 180.0 / np.pi)
    
    if angle > 180.0:
        angle = 360 - angle
    
    return angle

stage = "UNKNOWN"  # Initialized before the loop so it always has a value
cap = cv2.VideoCapture(0)

while cap.isOpened():
    success, frame = cap.read()
    if not success:
        continue
    
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = pose.process(frame_rgb)
    
    if results.pose_landmarks:
        # Draw skeleton
        mp_draw.draw_landmarks(
            frame, results.pose_landmarks, mp_pose.POSE_CONNECTIONS,
            mp_draw.DrawingSpec(color=(245, 117, 66), thickness=2, circle_radius=2),
            mp_draw.DrawingSpec(color=(245, 66, 230), thickness=2, circle_radius=2)
        )
        
        # Extract coordinates for analysis
        landmarks = results.pose_landmarks.landmark
        h, w, c = frame.shape
        
        # Coordinates for squat analysis
        hip = [landmarks[mp_pose.PoseLandmark.LEFT_HIP.value].x * w,
               landmarks[mp_pose.PoseLandmark.LEFT_HIP.value].y * h]
        knee = [landmarks[mp_pose.PoseLandmark.LEFT_KNEE.value].x * w,
                landmarks[mp_pose.PoseLandmark.LEFT_KNEE.value].y * h]
        ankle = [landmarks[mp_pose.PoseLandmark.LEFT_ANKLE.value].x * w,
                 landmarks[mp_pose.PoseLandmark.LEFT_ANKLE.value].y * h]
        
        # Compute knee angle
        angle = calculate_angle(hip, knee, ankle)
        
        # Visualize angle
        cv2.putText(frame, f'Knee Angle: {int(angle)}', 
                   (int(knee[0]), int(knee[1]) - 20),
                   cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 2)
        
        # Determine squat phase
        if angle > 160:
            stage = "UP"
        elif angle < 90:
            stage = "DOWN"
        
        cv2.putText(frame, f'Stage: {stage}', (10, 30),
                   cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    
    cv2.imshow("Pose Analysis", frame)
    if cv2.waitKey(1) & 0xFF == 27:
        break

cap.release()
cv2.destroyAllWindows()

Holistic Module – Comprehensive Analysis

Holistic combines the capabilities of all primary modules, delivering simultaneous tracking of face, hands, and body pose.

import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic
holistic = mp_holistic.Holistic(
    static_image_mode=False,
    model_complexity=2,
    smooth_landmarks=True,
    enable_segmentation=True,
    smooth_segmentation=True,
    refine_face_landmarks=True,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5
)

mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)

while cap.isOpened():
    success, frame = cap.read()
    if not success:
        continue
    
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = holistic.process(frame_rgb)
    
    # Draw all components
    if results.face_landmarks:
        mp_draw.draw_landmarks(frame, results.face_landmarks, mp_holistic.FACEMESH_CONTOURS)
    
    if results.pose_landmarks:
        mp_draw.draw_landmarks(frame, results.pose_landmarks, mp_holistic.POSE_CONNECTIONS)
    
    if results.left_hand_landmarks:
        mp_draw.draw_landmarks(frame, results.left_hand_landmarks, mp_holistic.HAND_CONNECTIONS)
    
    if results.right_hand_landmarks:
        mp_draw.draw_landmarks(frame, results.right_hand_landmarks, mp_holistic.HAND_CONNECTIONS)
    
    cv2.imshow("Holistic Analysis", frame)
    if cv2.waitKey(1) & 0xFF == 27:
        break

cap.release()
cv2.destroyAllWindows()

MediaPipe Methods and Functions Table

Class / Function | Module | Description | Key Parameters
Hands() | mp.solutions.hands | Initializes hand detector | static_image_mode, max_num_hands, min_detection_confidence
FaceMesh() | mp.solutions.face_mesh | Creates face-mesh detector | refine_landmarks, max_num_faces, min_detection_confidence
Pose() | mp.solutions.pose | Initializes pose detector | model_complexity, smooth_landmarks, enable_segmentation
Holistic() | mp.solutions.holistic | Comprehensive detector | model_complexity, refine_face_landmarks, enable_segmentation
SelfieSegmentation() | mp.solutions.selfie_segmentation | Selfie segmentation | model_selection (0 or 1)
FaceDetection() | mp.solutions.face_detection | Fast face detection | model_selection, min_detection_confidence
.process(image) | All modules | Processes an image | RGB image as a NumPy array
draw_landmarks() | mp.solutions.drawing_utils | Draws landmarks | image, landmarks, connections, landmark_drawing_spec
DrawingSpec() | mp.solutions.drawing_utils | Configures drawing style | color, thickness, circle_radius
.landmark[i] | Processing results | Accesses a specific point | Returns an object with x, y, z coordinates
.multi_hand_landmarks | Hands result | List of detected hands | Each hand contains 21 points
.pose_landmarks | Pose result | Body key points | 33 skeletal points
.face_landmarks | FaceMesh result | Facial points | 468 face-mesh points

SelfieSegmentation Module – Background Segmentation

import cv2
import mediapipe as mp
import numpy as np

mp_selfie = mp.solutions.selfie_segmentation
segmenter = mp_selfie.SelfieSegmentation(model_selection=1)  # 0 – general model, 1 – landscape

# Load background image (fall back to a solid color if the file is missing)
background = cv2.imread('background.jpg')
if background is None:
    background = np.full((480, 640, 3), (0, 120, 0), dtype=np.uint8)

cap = cv2.VideoCapture(0)

while cap.isOpened():
    success, frame = cap.read()
    if not success:
        continue
    
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = segmenter.process(frame_rgb)
    
    # Get segmentation mask
    mask = results.segmentation_mask
    
    # Apply mask to replace background
    condition = np.stack((mask,) * 3, axis=-1) > 0.1
    
    # Resize background to match frame size
    h, w, c = frame.shape
    background_resized = cv2.resize(background, (w, h))
    
    # Replace background
    output_image = np.where(condition, frame, background_resized)
    
    cv2.imshow("Selfie Segmentation", output_image)
    if cv2.waitKey(1) & 0xFF == 27:
        break

cap.release()
cv2.destroyAllWindows()
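The np.where compositing step in this example is plain NumPy and can be checked in isolation with a tiny synthetic mask, no camera or model required:

```python
import numpy as np

def composite(frame, background, mask, threshold=0.1):
    """Keep frame pixels where the mask exceeds the threshold, else use background."""
    condition = np.stack((mask,) * 3, axis=-1) > threshold
    return np.where(condition, frame, background)

# Synthetic 2x2 example: left column is "person", right column is "background"
frame = np.full((2, 2, 3), 255, dtype=np.uint8)    # white "person" pixels
background = np.zeros((2, 2, 3), dtype=np.uint8)   # black background
mask = np.array([[0.9, 0.0],
                 [0.9, 0.0]], dtype=np.float32)

out = composite(frame, background, mask)
# Left column keeps the frame, right column takes the background
```

The same function drops straight into the camera loop above in place of the inline np.where call.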

Performance Optimization

Tips for Boosting Speed

# Disable writable flag for faster processing
frame_rgb.flags.writeable = False
results = hands.process(frame_rgb)
frame_rgb.flags.writeable = True

# Reduce resolution to speed up processing
frame_small = cv2.resize(frame, (320, 240))

# Skip frames to save resources (initialize the counter once, before the loop)
frame_count = 0
# inside the capture loop:
if frame_count % 2 == 0:  # Process every second frame
    results = hands.process(frame_rgb)
frame_count += 1
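The frame-skipping pattern is easier to reuse wrapped in a small helper; FrameSkipper below is a generic sketch, not part of MediaPipe:

```python
class FrameSkipper:
    """Run an expensive callable only on every n-th frame, reusing the last result otherwise."""
    def __init__(self, every_n=2):
        self.every_n = every_n
        self.counter = 0
        self.last_result = None

    def process(self, frame, fn):
        # Only call fn on every n-th frame; stale results are returned in between
        if self.counter % self.every_n == 0:
            self.last_result = fn(frame)
        self.counter += 1
        return self.last_result

# Usage sketch with a cheap stand-in for hands.process:
skipper = FrameSkipper(every_n=2)
results = [skipper.process(i, lambda x: x * 10) for i in range(4)]
# Frames 0 and 2 are processed; frames 1 and 3 reuse the previous result
```

In the real loop the call would be `skipper.process(frame_rgb, hands.process)`.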

Practical Applications of MediaPipe

Gesture‑Controlled Presentation

import cv2
import mediapipe as mp
import pyautogui

class GesturePresentation:
    def __init__(self):
        self.mp_hands = mp.solutions.hands
        self.hands = self.mp_hands.Hands(min_detection_confidence=0.7)
        self.mp_draw = mp.solutions.drawing_utils
        
    def detect_gesture(self, landmarks):
        # self.count_fingers is assumed to return a list of five 0/1 flags
        # (thumb..pinky), unlike the sum-returning helper shown earlier
        fingers = self.count_fingers(landmarks)
        
        if fingers == [0, 1, 0, 0, 0]:  # Index finger
            return "NEXT"
        elif fingers == [1, 0, 0, 0, 0]:  # Thumb
            return "PREVIOUS"
        elif fingers == [0, 1, 1, 0, 0]:  # Two fingers
            return "PAUSE"
        
        return "NONE"
    
    def control_presentation(self, gesture):
        if gesture == "NEXT":
            pyautogui.press('right')
        elif gesture == "PREVIOUS":
            pyautogui.press('left')
        elif gesture == "PAUSE":
            pyautogui.press('space')
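The gesture-to-action mapping inside detect_gesture is pure logic, so it can be factored out and tested without a camera; classify_gesture and GESTURE_MAP are illustrative names using the same hypothetical finger patterns as above:

```python
# Map a 5-element finger-up pattern (thumb..pinky) to a presentation action
GESTURE_MAP = {
    (0, 1, 0, 0, 0): "NEXT",      # index finger only
    (1, 0, 0, 0, 0): "PREVIOUS",  # thumb only
    (0, 1, 1, 0, 0): "PAUSE",     # index + middle
}

def classify_gesture(fingers):
    """Return the action for a finger pattern, or "NONE" for unmapped patterns."""
    return GESTURE_MAP.get(tuple(fingers), "NONE")
```

Keeping the mapping in a dict makes adding new gestures a one-line change instead of another elif branch.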

Fitness Tracker with Exercise Analysis

class FitnessTracker:
    def __init__(self):
        self.mp_pose = mp.solutions.pose
        self.pose = self.mp_pose.Pose()
        self.exercise_count = 0
        self.stage = None
        
    def analyze_pushup(self, landmarks):
        # Analyze push‑up via elbow angle
        shoulder = [landmarks[11].x, landmarks[11].y]
        elbow = [landmarks[13].x, landmarks[13].y]
        wrist = [landmarks[15].x, landmarks[15].y]
        
        angle = self.calculate_angle(shoulder, elbow, wrist)  # assumes calculate_angle() from the Pose example is attached as a method
        
        if angle > 160:
            self.stage = "UP"
        elif angle < 90 and self.stage == "UP":
            self.stage = "DOWN"
            self.exercise_count += 1
        
        return self.exercise_count, angle
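The rep-counting hysteresis in analyze_pushup (count only after a full extension followed by a deep bend) can be exercised with a synthetic angle sequence; RepCounter is an illustrative standalone version of that logic:

```python
class RepCounter:
    """Count reps with hysteresis: the arm must extend past 160 deg, then bend below 90 deg."""
    def __init__(self, up_threshold=160, down_threshold=90):
        self.up_threshold = up_threshold
        self.down_threshold = down_threshold
        self.stage = None
        self.count = 0

    def update(self, angle):
        if angle > self.up_threshold:
            self.stage = "UP"
        elif angle < self.down_threshold and self.stage == "UP":
            # A rep is counted only on the UP -> DOWN transition
            self.stage = "DOWN"
            self.count += 1
        return self.count

counter = RepCounter()
angles = [170, 120, 80, 100, 170, 85]  # two full push-ups
final = 0
for a in angles:
    final = counter.update(a)
```

The two-threshold design prevents jittery angle readings near a single cutoff from inflating the count.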

Frequently Asked Questions

Why does results return None?

This happens when MediaPipe cannot detect the target object (hand, face, pose) in the frame. Always check the result before using it:

if results.multi_hand_landmarks:
    # Process detected hands
    pass

How can I improve detection accuracy?

Increase min_detection_confidence and min_tracking_confidence, improve lighting, use a high‑quality camera, and avoid rapid movements.

Why are landmark coordinates in the range 0‑1?

MediaPipe uses normalized coordinates to be resolution‑independent. To get pixel coordinates, multiply by the image size:

x_pixel = landmark.x * image_width
y_pixel = landmark.y * image_height
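A small conversion helper that also clamps to the image bounds avoids out-of-range pixel indices, since MediaPipe can return normalized coordinates slightly outside [0, 1] when a landmark leaves the frame; to_pixel is an illustrative name:

```python
def to_pixel(x_norm, y_norm, width, height):
    """Convert normalized [0, 1] landmark coordinates to clamped integer pixel coordinates."""
    x = min(max(int(x_norm * width), 0), width - 1)
    y = min(max(int(y_norm * height), 0), height - 1)
    return x, y

# to_pixel(0.5, 0.5, 640, 480) gives the frame center
# Out-of-range inputs are clamped to valid pixel indices
```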

How do I handle multiple objects?

Use loops to process all detected entities:

if results.multi_hand_landmarks:
    for hand_landmarks in results.multi_hand_landmarks:
        # Process each hand
        pass

Can MediaPipe be used without OpenCV?

Yes. The process() method accepts any RGB NumPy array, regardless of where it came from; OpenCV is simply the most convenient option for camera access and visualization.

Error Handling and Debugging

Common Issues and Solutions

import cv2
import mediapipe as mp

def safe_mediapipe_processing():
    try:
        # Check camera availability
        cap = cv2.VideoCapture(0)
        if not cap.isOpened():
            raise Exception("Unable to open camera")
        
        mp_hands = mp.solutions.hands
        hands = mp_hands.Hands()
        
        while True:
            success, frame = cap.read()
            if not success:
                print("Failed to read frame from camera")
                continue
            
            # Verify frame size
            if frame.shape[0] == 0 or frame.shape[1] == 0:
                continue
                
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            results = hands.process(frame_rgb)
            
            # Safe result check
            if results.multi_hand_landmarks is not None:
                for hand_landmarks in results.multi_hand_landmarks:
                    # Process landmarks
                    pass
            
            cv2.imshow("MediaPipe", frame)
            if cv2.waitKey(1) & 0xFF == 27:
                break
                
    except Exception as e:
        print(f"Error: {e}")
    finally:
        cap.release()
        cv2.destroyAllWindows()

Integration with Other Libraries

Using MediaPipe with TensorFlow and Machine Learning

import tensorflow as tf
import numpy as np
import mediapipe as mp

class MediaPipeML:
    def __init__(self):
        self.mp_hands = mp.solutions.hands
        self.hands = self.mp_hands.Hands()
        self.model = tf.keras.models.load_model('gesture_model.h5')
    
    def extract_features(self, landmarks):
        # Extract features from landmark coordinates
        features = []
        for landmark in landmarks.landmark:
            features.extend([landmark.x, landmark.y, landmark.z])
        return np.array(features).reshape(1, -1)
    
    def predict_gesture(self, frame):
        results = self.hands.process(frame)
        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                features = self.extract_features(hand_landmarks)
                prediction = self.model.predict(features)
                return np.argmax(prediction)
        return None

Deployment and Production

Production‑Ready Optimization

import cv2
import mediapipe as mp

class ProductionMediaPipe:
    def __init__(self):
        # Settings for production
        self.mp_hands = mp.solutions.hands
        self.hands = self.mp_hands.Hands(
            static_image_mode=False,
            max_num_hands=1,  # Limit for performance
            min_detection_confidence=0.8,  # High accuracy
            min_tracking_confidence=0.7
        )
        
        # Caching for optimization
        self.last_results = None
        self.frame_skip_counter = 0
        
    def process_frame(self, frame):
        # Skip frames to conserve resources
        self.frame_skip_counter += 1
        if self.frame_skip_counter % 2 == 0:
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            self.last_results = self.hands.process(frame_rgb)
        
        return self.last_results

Conclusion

MediaPipe is one of the most powerful and accessible libraries for building computer‑vision applications. With highly accurate models, optimized performance, and a simple API, it lets developers add production‑quality tracking features with just a few lines of code.

The library fits a broad spectrum of use cases—from simple demos to complex commercial products in AR/VR, fitness, gaming, security systems, and human‑computer interaction. Cross‑platform support and active maintenance by Google make MediaPipe an excellent choice for modern machine‑learning and computer‑vision projects.

As MediaPipe continues to evolve and add new capabilities, it keeps setting the standard for real‑time computer vision, providing developers with robust tools to shape the future of interactive technologies.
