AnyHand: A Large-Scale Synthetic Dataset
for RGB(-D) Hand Pose Estimation

Chen Si1  ·  Yulin Liu1  ·  Bo Ai1  ·  Jianwen Xie2
Rolandos Alexandros Potamias3  ·  Chuanxia Zheng4  ·  Hao Su1

1UC San Diego    2Lambda, Inc    3Imperial College London    4Nanyang Technological University

🚧 Beta Release — We are currently releasing fine-tuned checkpoints of HaMeR and WiLoR co-trained with AnyHand, ready for in-the-wild hand pose estimation. The full dataset, training code, and arXiv preprint are coming soon. Stay tuned!
AnyHand teaser figure

We propose AnyHand as a large-scale synthetic RGB-D dataset that substantially expands coverage of hand pose, hand-object interactions, occlusions, and viewpoint variations in the wild. When used to co-train state-of-the-art models such as HaMeR and WiLoR, it yields consistent improvements and supports robust 3D hand pose reconstruction across diverse real-world scenes. Predicted hand meshes from WiLoR co-trained with AnyHand are shown in pink.


Abstract

We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation from both RGB-only and RGB-D inputs. While recent foundation-style approaches have shown that increasing the quantity and diversity of training data can markedly improve performance and robustness, existing real-world datasets remain limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale.

To address this bottleneck, AnyHand contains 2.1M single-hand images and 4.2M hand-object interaction RGB-D images, all with rich geometric annotations. In the RGB-only setting, we show that augmenting the original training sets of existing baselines with AnyHand yields significant gains on multiple benchmarks, including FreiHAND and HO-3D, even when keeping the architecture and training scheme fixed.

More importantly, models trained with AnyHand generalize better to the out-of-domain HO-Cap dataset without any fine-tuning. We further introduce a lightweight depth fusion module that can be easily integrated into existing RGB-based models. Trained with AnyHand, the resulting RGB-D model achieves superior performance on HO-3D, demonstrating both the value of depth integration and the effectiveness of our synthetic data.


Dataset

AnyHand consists of two complementary splits. AnyHand-Single (2.1M images) covers isolated hand scenes across diverse backgrounds and viewpoints. AnyHand-Interact (4.2M images) adds hand-object interaction scenarios sourced from GraspXL's physics-based simulation with over 10M sequences and 500K+ objects. Both splits are rendered with full multi-modal annotations: RGB, depth, mask, bounding box, camera intrinsics, and 3D pose/shape.
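Since each image ships with camera intrinsics and 3D pose annotations, 3D joints can be projected into the image with a standard pinhole model. The sketch below is illustrative only: the 3x3 intrinsics layout `[[fx, 0, cx], [0, fy, cy], [0, 0, 1]]` is the usual convention, but the concrete values and the function name are assumptions, not part of the AnyHand release.

```python
# Minimal pinhole projection: map an annotated 3D point (camera frame, metres)
# to 2D pixel coordinates using per-image intrinsics. Values are made up
# for illustration; real intrinsics come from the dataset annotations.

def project_point(K, xyz):
    """Project a 3D point (x, y, z) in camera coordinates to (u, v) pixels."""
    x, y, z = xyz
    if z <= 0:
        raise ValueError("point is behind the camera")
    u = K[0][0] * x / z + K[0][2]  # u = fx * x/z + cx
    v = K[1][1] * y / z + K[1][2]  # v = fy * y/z + cy
    return u, v

K = [[600.0, 0.0, 320.0],   # fx, 0,  cx
     [0.0, 600.0, 240.0],   # 0,  fy, cy
     [0.0, 0.0, 1.0]]

u, v = project_point(K, (0.05, -0.02, 0.8))
print(u, v)  # 357.5 225.0
```

The same projection, applied to every annotated 3D joint, recovers the 2D joint annotations for a given view.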

Total images: 2.1M + 4.2M
Hand shapes: 47,438
Hand textures: 10,240
Interaction objects: >500K
Interaction sequences: 10M+
Backgrounds: 1,270
AnyHand Gen Pipeline figure

Generation pipeline. Hand shapes are sampled from real-dataset MANO statistics; poses from a DPoser-Hand diffusion prior; appearances via 10,240 Handy textures and 254 SMPLitex arm textures; scenes rendered in Blender with randomized lighting (1–5 lights), backgrounds, and camera parameters (FOV 30°–40°, distance 0.6–1.0 m).
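The per-scene randomization described in the caption can be sketched as a simple sampler. This is a minimal illustration of the stated ranges (1–5 lights, FOV 30°–40°, camera distance 0.6–1.0 m); the function name and returned structure are assumptions, not the actual AnyHand rendering code.

```python
import random

# Sample the randomized scene parameters named in the pipeline caption.
# Hypothetical helper: the real pipeline drives Blender with these values.
def sample_scene_params(rng=random):
    return {
        "num_lights": rng.randint(1, 5),          # randomized lighting, 1-5 lights
        "fov_deg": rng.uniform(30.0, 40.0),       # camera field of view in degrees
        "cam_distance_m": rng.uniform(0.6, 1.0),  # camera-to-hand distance in metres
    }

# Seeded generator for reproducible renders.
params = sample_scene_params(random.Random(0))
print(params)
```

Sampling one such parameter set per rendered image is what gives the dataset its viewpoint and lighting diversity.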


Qualitative Results

Dataset samples

AnyHand covers a wide range of hand poses, skin tones, viewpoints, lighting conditions, and interaction scenarios. Each rendered image is paired with a 3D hand mesh, 2D joint annotations, and a depth map.

AnyHand-Single Demo Figure

Qualitative results for AnyHand-Single. The dataset covers diverse single-hand poses, viewpoints, textures, and scene contexts.

AnyHand-Interact Demo Figure

Qualitative results for AnyHand-Interact. The dataset captures diverse hand-object interactions with varied grasps, objects, and occlusion patterns.

Comparisons

Co-training WiLoR with AnyHand substantially improves mesh-to-image alignment on real-world images — recovering more accurate hand scale, palm width, finger thickness, and articulation, especially under hand-object interaction and challenging viewpoints.

In-the-wild

In-the-wild qualitative comparisons. WiLoR w/ AnyHand (pink) vs. original WiLoR (blue) and HaMeR (white).



BibTeX

citation.bib
@misc{si2026anyhand,
  title         = {AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation},
  author        = {Si, Chen and Liu, Yulin and Ai, Bo and Xie, Jianwen and Potamias, Rolandos Alexandros and Zheng, Chuanxia and Su, Hao},
  year          = {2026},
  eprint        = {2603.25726},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}