Project: General Hand Gesture Recognition

April 15, 2025 · 3 min read

Overview

This project aims to create a unified, semi-supervised contrastive-learning framework for hand gesture recognition. The framework is designed to adapt efficiently to various downstream tasks, such as human-computer interaction and sign language recognition, with minimal retraining or fine-tuning.

Scope and Applications

[!NOTE] This section is a summary generated from the report by Grok. The contents have been double-checked by the author.

Only this section covers the main content of the report; the remaining sections describe how to set up the project and the purpose of specific scripts in the repository.

Key Areas Explored

Static-Pose Representation Learning

Objective: Map hand landmark inputs (shape $21 \times 3$) into feature embeddings (size $128$).
Approach: Compared three encoder architectures:
- Multi-layer Perceptron (MLP)
- Graph Convolutional Network (GCN)
- Graph Attention Network (GAT)
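
To make the graph-based option concrete, here is a minimal NumPy sketch of a two-layer GCN encoder that maps a $21 \times 3$ landmark array to a $128$-dimensional embedding. The joint indexing (wrist as joint 0, four joints per finger), the tree-shaped edge list, the hidden width of 64, and the mean pooling are all assumptions made for illustration; the report's actual encoder may differ.

```python
import numpy as np

NUM_JOINTS = 21  # one wrist joint plus five fingers with four joints each


def hand_edges():
    """Assumed skeleton: wrist (0) links to each finger base, fingers are chains."""
    edges = []
    for base in (1, 5, 9, 13, 17):
        edges.append((0, base))
        edges.extend((j, j + 1) for j in range(base, base + 3))
    return edges


def normalized_adjacency(edges, n=NUM_JOINTS):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2."""
    a = np.eye(n)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(x, a_hat, w):
    """One graph convolution: aggregate neighbours, project, apply ReLU."""
    return np.maximum(a_hat @ x @ w, 0.0)


def encode(landmarks, a_hat, w1, w2):
    """Map (21, 3) landmarks to a single 128-d embedding via mean pooling."""
    h = gcn_layer(landmarks, a_hat, w1)  # (21, hidden)
    h = gcn_layer(h, a_hat, w2)          # (21, 128)
    return h.mean(axis=0)                # (128,)


# Usage with random weights and a random pose (illustrative only).
rng = np.random.default_rng(0)
a_hat = normalized_adjacency(hand_edges())
w1 = rng.normal(scale=0.1, size=(3, 64))
w2 = rng.normal(scale=0.1, size=(64, 128))
embedding = encode(rng.normal(size=(NUM_JOINTS, 3)), a_hat, w1, w2)
print(embedding.shape)  # (128,)
```

An MLP baseline would instead flatten the landmarks to 63 values and ignore `a_hat` entirely, which is the edge information the first hypothesis is about.
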
Hypotheses Tested:
1. Graph-based models (GCN and GAT), which leverage edge information, outperform MLP in accuracy and convergence speed. This was evaluated using supervised contrastive loss on the Lexset dataset.
2. Incorporating a large unlabelled dataset (synthetic MANO data) with curriculum-based augmentations enhances model generalization.
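
The supervised contrastive loss mentioned in the first hypothesis can be sketched as follows. This is a generic NumPy version of the standard formulation (each anchor is pulled towards same-label samples and pushed away from all others), not the project's training code; the function name `supcon_loss` and the temperature of 0.1 are assumptions.

```python
import numpy as np


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over a batch of embeddings, shape (N, D)."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = np.asarray(labels)
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)

    # Pairwise cosine similarities scaled by the temperature.
    logits = z @ z.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability

    # Log-softmax over every other sample in the batch (self excluded).
    exp_logits = np.exp(logits) * not_self
    log_prob = logits - np.log(exp_logits.sum(axis=1, keepdims=True))

    # Positives are other samples that share the anchor's label.
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        raise ValueError("batch needs at least one pair with the same label")

    mean_log_prob_pos = (log_prob * positives).sum(axis=1)[valid] / pos_count[valid]
    return float(-mean_log_prob_pos.mean())


# Usage: two tight clusters; labels that match the clusters give a lower loss.
points = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
print(supcon_loss(points, [0, 0, 1, 1]) < supcon_loss(points, [0, 1, 0, 1]))
```
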

Extension to Dynamic Gesture Recognition

Objective: Extend the contrastive learning approach to recognize dynamic gestures.
Approach: Utilize sequential architectures like Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) units to model temporal dependencies in gesture sequences.
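
As a rough illustration of the sequential approach, the sketch below runs a single LSTM layer over a sequence of per-frame hand poses and uses the final hidden state as the gesture embedding. Flattening each frame to 63 values, the hidden size of 128, the 30-frame sequence, and the gate ordering are assumptions for this example and do not describe the report's model.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_encode(sequence, w_x, w_h, b):
    """Encode a (T, D) sequence into the final hidden state of shape (H,).

    w_x: (D, 4H) input weights, w_h: (H, 4H) recurrent weights, b: (4H,) bias.
    Gates are packed in the order input, forget, candidate, output.
    """
    hidden = w_h.shape[0]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x_t in sequence:
        gates = x_t @ w_x + h @ w_h + b
        i, f, g, o = np.split(gates, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update cell memory
        h = sigmoid(o) * np.tanh(c)                   # expose gated state
    return h


# Usage: 30 frames of 21 landmarks with 3 coordinates, flattened per frame.
rng = np.random.default_rng(0)
frames = rng.normal(size=(30, 21, 3)).reshape(30, -1)  # (30, 63)
hidden_size = 128
w_x = rng.normal(scale=0.1, size=(63, 4 * hidden_size))
w_h = rng.normal(scale=0.1, size=(hidden_size, 4 * hidden_size))
b = np.zeros(4 * hidden_size)
gesture_embedding = lstm_encode(frames, w_x, w_h, b)
print(gesture_embedding.shape)  # (128,)
```

The resulting vector can be fed to the same kind of contrastive loss as the static embeddings, since both are fixed-size feature vectors.
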

Results

Although we did not fully achieve the original goals, our key findings include:

Static Gesture Recognition:
- Graph-based networks (e.g., GCN, GAT) are more effective, leveraging hand skeletal connections for improved accuracy and faster convergence.
- Using large unlabelled datasets with curriculum learning improves generalization to new datasets and unseen gesture classes, as assessed by the cosine similarities between output feature vectors.
- With the shallow model used, curriculum learning degraded performance once the augmentation magnitude exceeded a certain threshold.
Dynamic Gesture Recognition:
- Hierarchical and part-wise architectures improve understanding of gesture structures.
- Contrastive learning showed limited improvement over existing methods, indicating a need for more complex approaches.
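
The cosine-similarity check mentioned in the static results can be sketched as follows: embeddings of the same gesture class should be more similar to each other than to embeddings of other classes. The helper names and the intra/inter-class summary are assumptions for illustration, not the report's evaluation script.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of an (N, D) array."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return z @ z.T


def class_separation(embeddings, labels):
    """Return (mean intra-class similarity, mean inter-class similarity)."""
    sim = cosine_similarity_matrix(embeddings)
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)  # ignore self-similarity
    intra = sim[same & off_diag].mean()
    inter = sim[~same].mean()
    return float(intra), float(inter)


# Usage: a well-separated toy embedding space with two classes.
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
intra, inter = class_separation(toy, [0, 0, 1, 1])
print(intra > inter)  # True
```

A large gap between the two values on an unseen dataset or unseen gesture classes would suggest that the encoder generalizes; a small gap would suggest that it does not.
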

Future Work

Develop a general hand gesture encoder capturing rotation- and scale-invariant features for rapid adaptation to tasks like dynamic gesture recognition.
Investigate joint training of static and dynamic datasets using curriculum and contrastive learning to improve robustness.
