OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition

1Sun Yat-sen University
2Beijing University of Posts and Telecommunications
3Shanghai Jiao Tong University
4Nankai University
5Academy of Military Sciences
6Tianjin Artificial Intelligence Innovation Center

*Indicates Corresponding Author

Dataset and Application Demo

Abstract

Online micro gesture recognition from hand skeletons is critical for VR/AR interaction, yet progress is limited by the scarcity of public datasets and by task-specific algorithms. Micro gestures involve subtle motion patterns, which makes constructing datasets with precise skeletons and frame-level annotations difficult. To this end, we develop a multi-view self-supervised pipeline that automatically generates skeleton data, complemented by heuristic rules and expert refinement for semi-automatic annotation. Based on this pipeline, we introduce OMG-Bench, the first large-scale public benchmark for skeleton-based online micro gesture recognition. It features 40 fine-grained gesture classes with 13,948 instances across 1,272 sequences, characterized by subtle motions, rapid dynamics, and continuous execution. To tackle these challenges, we propose the Hierarchical Memory-Augmented Transformer (HMATr), an end-to-end framework that unifies gesture detection and classification. It maintains hierarchical memory banks that store frame-level details and window-level semantics to preserve historical context, and it employs learnable position-aware queries initialized from this memory to implicitly encode gesture positions and semantics. Experiments show that HMATr outperforms state-of-the-art methods by 7.6% in detection rate, establishing a strong baseline for online micro gesture recognition.

Data collection and annotation pipeline


Data collection and annotation pipeline of OMG-Bench, using a calibrated five-camera RGB-D system and self-supervised multi-view hand pose estimation to obtain high-quality skeletons, followed by semi-automatic frame-level gesture labeling.
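The paper's exact pose-estimation pipeline is its own contribution, but the multi-view geometry underneath it is standard: given calibrated cameras, per-view 2D keypoints can be lifted to a 3D skeleton by triangulation. The sketch below shows direct linear transform (DLT) triangulation in that spirit; the function names, array shapes, and use of NumPy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def triangulate_joint(points_2d, proj_mats):
    """Triangulate one 3D joint from N calibrated views via DLT.

    points_2d : (N, 2) pixel coordinates of the same joint in each view.
    proj_mats : (N, 3, 4) camera projection matrices P = K [R | t].
    Returns the 3D point minimizing the algebraic reprojection error.
    """
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        # Each view adds two linear constraints on the homogeneous point X:
        # u * (P[2] @ X) = P[0] @ X   and   v * (P[2] @ X) = P[1] @ X
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)            # (2N, 4) stacked constraints
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                    # null-space solution (smallest singular value)
    return X[:3] / X[3]           # dehomogenize

def triangulate_skeleton(kps_2d, proj_mats):
    """kps_2d: (N_views, N_joints, 2) -> (N_joints, 3) skeleton."""
    return np.stack([
        triangulate_joint(kps_2d[:, j], proj_mats)
        for j in range(kps_2d.shape[1])
    ])
```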

Comparison of Existing Datasets with Ours


Comparison between open-source skeleton-based gesture recognition datasets and the proposed OMG-Bench.

Dataset Properties


(a) Types and locations of defined micro gestures. TIP, PIP, and MCP denote the fingertip, proximal interphalangeal joint, and metacarpophalangeal joint. (b) Statistics of gesture types. (c) Distribution of sample counts per class.

Method


Overview of our proposed HMATr. (a) A lightweight backbone processes streaming skeleton inputs with a non-overlapping sliding window. (b) A hierarchical memory bank uses historical temporal information to enrich the features of the current window. (c) Position-aware queries implicitly capture potential hand movements, enabling unified detection and recognition. (d) Memory Interaction and Position-aware Interaction encode both the position and the semantics of gesture instances from the memory-enhanced features.
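To make the data flow in (a)-(d) concrete, here is a minimal, heavily simplified PyTorch sketch under stated assumptions: each non-overlapping window is encoded per frame, enriched by cross-attending to FIFO frame-level and window-level memory banks, and decoded by a fixed set of queries into class and segment predictions. All module choices, sizes, and names (e.g., HMATrSketch) are hypothetical; in particular, the queries here are plain learnable parameters, whereas the paper initializes them from the memory.

```python
import torch
import torch.nn as nn

class HMATrSketch(nn.Module):
    """Illustrative data flow only: window encoder -> memory cross-attention
    -> query decoder -> per-query classification / segment heads."""

    def __init__(self, n_joints=21, d=128, n_classes=40, n_queries=8, mem_windows=16):
        super().__init__()
        self.embed = nn.Linear(n_joints * 3, d)            # flatten (x, y, z) per frame
        enc = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.mem_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        dec = nn.TransformerDecoderLayer(d, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.queries = nn.Parameter(torch.randn(n_queries, d))  # simplified queries
        self.cls_head = nn.Linear(d, n_classes + 1)        # +1 for "no gesture"
        self.seg_head = nn.Linear(d, 2)                    # normalized (center, length)
        self.mem_windows = mem_windows
        self.frame_mem, self.window_mem = [], []           # hierarchical FIFO banks

    def forward(self, window):                             # window: (B, T, J, 3)
        B = window.shape[0]
        x = self.encoder(self.embed(window.flatten(2)))    # (B, T, d) frame features
        if self.frame_mem:                                 # enrich with history
            mem = torch.cat(self.frame_mem + self.window_mem, dim=1)
            x = x + self.mem_attn(x, mem, mem)[0]
        # Update banks: frame-level details + pooled window-level semantics.
        # (Per-stream state: banks must be cleared between sequences.)
        self.frame_mem.append(x.detach())
        self.window_mem.append(x.detach().mean(1, keepdim=True))
        self.frame_mem = self.frame_mem[-self.mem_windows:]
        self.window_mem = self.window_mem[-self.mem_windows:]
        q = self.decoder(self.queries.expand(B, -1, -1), x)
        return self.cls_head(q), self.seg_head(q).sigmoid()
```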

OMG-Bench Benchmark Evaluation


Benchmark of state-of-the-art (SOTA) methods under four metrics. Methods marked with * are our re-implementations, owing to the absence of open-source code.
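For context, online detection benchmarks commonly count a prediction as correct when its class matches the ground truth and its temporal segment overlaps a ground-truth instance above an IoU threshold. The snippet below is a generic sketch of such a detection-rate computation (with an assumed threshold of 0.5), not the exact OMG-Bench evaluation protocol.

```python
def temporal_iou(a, b):
    """IoU of two (start, end) frame segments."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def detection_rate(preds, gts, iou_thr=0.5):
    """preds / gts: lists of (label, start, end); each GT matched at most once."""
    matched = [False] * len(gts)
    hits = 0
    for label, s, e in preds:
        for i, (g_label, g_s, g_e) in enumerate(gts):
            if (not matched[i] and label == g_label
                    and temporal_iou((s, e), (g_s, g_e)) >= iou_thr):
                matched[i] = True
                hits += 1
                break
    return hits / len(gts) if gts else 0.0
```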

Qualitative Results


(a) Query Distribution. We show the distribution of query relative positions within the window over all test samples. Different queries tend to focus on different positions, thereby capturing diverse types of micro-gesture features and enhancing the feature representation capability of HMATr. (b) Comparison of Gesture Segment Detection Results. We visualize the micro gesture detection results of different methods. Our method accurately identifies the boundaries of consecutive same-class gestures, whereas the other two methods tend to either merge such gestures or over-segment them, highlighting the superiority of our approach.