AnyHand: A Large-Scale Synthetic Dataset
for RGB(-D) Hand Pose Estimation

Chen Si * 1  ·  Yulin Liu * 1  ·  Bo Ai 1  ·  Jianwen Xie 2
Rolandos Alexandros Potamias 3  ·  Chuanxia Zheng 4  ·  Hao Su 1

1 UC San Diego    2 Lambda, Inc    3 Imperial College London    4 Nanyang Technological University

🚧 Beta Release  — Fine-tuned HaMeR and WiLoR checkpoints co-trained with AnyHand are now available, along with a Colab demo you can run in your browser. The full dataset, generation pipeline, and AnyHandNet-D are still on the way. Please sign up here for a one-email-per-release heads-up.
AnyHand teaser figure

We introduce AnyHand, a large-scale synthetic RGB-D dataset that substantially expands coverage of hand poses, hand-object interactions, occlusions, and viewpoint variations in the wild. When used to co-train state-of-the-art models such as HaMeR and WiLoR, it yields consistent improvements and supports robust 3D hand pose reconstruction across diverse real-world scenes. Predicted hand meshes from WiLoR co-trained with AnyHand are shown in pink.


Abstract

We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation from both RGB-only and RGB-D inputs. While recent foundation-style approaches have shown that increasing the quantity and diversity of training data can markedly improve performance and robustness, existing real-world datasets remain limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale.

To address this bottleneck, AnyHand contains 2.1M single-hand images and 4.2M hand-object interaction RGB-D images, all with rich geometric annotations. In the RGB-only setting, we show that augmenting the original training sets of existing baselines with AnyHand yields significant gains on multiple benchmarks, including FreiHAND and HO-3D, even when keeping the architecture and training scheme fixed.

More importantly, models trained with AnyHand generalize better to the out-of-domain HO-Cap dataset without any fine-tuning. We further introduce a lightweight depth fusion module that can be easily integrated into existing RGB-based models. Trained with AnyHand, the resulting RGB-D model achieves superior performance on HO-3D, demonstrating both the value of depth integration and the effectiveness of our synthetic data.
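The depth fusion module is only described at a high level here. As a rough, non-authoritative sketch of what a lightweight fusion block could look like, the example below encodes the depth map with a small CNN and adds the result to the RGB backbone features before the pose regressor; all module and parameter names are hypothetical, not the actual AnyHand implementation.

```python
import torch
import torch.nn as nn

class DepthFusion(nn.Module):
    """Hypothetical lightweight depth-fusion block (illustrative only).

    Encodes a single-channel depth map to the same shape as the RGB backbone
    features and fuses the two streams by element-wise addition.
    """

    def __init__(self, feat_channels: int = 256, feat_size: int = 16):
        super().__init__()
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, feat_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(feat_size),  # match the RGB feature-map resolution
        )

    def forward(self, rgb_feat: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb_feat: (B, C, H, W) features from an RGB backbone
        # depth:    (B, 1, H_img, W_img) depth crop aligned with the RGB input
        return rgb_feat + self.depth_encoder(depth)

# Example: fuse 16x16 backbone features with a 256x256 depth crop.
fusion = DepthFusion(feat_channels=256, feat_size=16)
fused = fusion(torch.randn(2, 256, 16, 16), torch.randn(2, 1, 256, 256))
```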


Dataset

AnyHand consists of two complementary splits. AnyHand-Single (2.1M images) covers isolated hand scenes across diverse backgrounds and viewpoints. AnyHand-Interact (4.2M images) adds hand-object interaction scenarios sourced from GraspXL's physics-based simulation with over 10M sequences and 500K+ objects. Both splits are rendered with full multi-modal annotations: RGB, depth, mask, bounding box, camera intrinsics, and 3D pose/shape.
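The on-disk format has not been released yet; purely as an illustration, a per-sample record carrying the modalities listed above might look roughly like the sketch below (all field names and shapes are assumptions, not the released format).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AnyHandSample:
    """Hypothetical per-sample record mirroring the annotations listed above."""
    rgb: np.ndarray          # (H, W, 3) uint8 rendered image
    depth: np.ndarray        # (H, W) float32 depth in meters, aligned with rgb
    mask: np.ndarray         # (H, W) bool hand(+arm) segmentation mask
    bbox: np.ndarray         # (4,) hand bounding box, e.g. [x_min, y_min, x_max, y_max]
    intrinsics: np.ndarray   # (3, 3) camera intrinsic matrix K
    mano_pose: np.ndarray    # (48,) MANO pose parameters (global rotation + 15 joints)
    mano_shape: np.ndarray   # (10,) MANO shape (beta) parameters
    joints_3d: np.ndarray    # (21, 3) 3D hand joints in camera coordinates
    joints_2d: np.ndarray    # (21, 2) 2D joint projections in pixel coordinates
```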

Total images: 2.1M + 4.2M
Hand shapes: 47,438
Hand textures: 10,240
Interaction objects: >500K
Interaction sequences: 10M+
Backgrounds: 1,270
AnyHand Gen Pipeline figure

Generation pipeline. Hand shapes are sampled from real-dataset MANO statistics; poses from a DPoser-Hand diffusion prior; appearances via 10,240 Handy textures and 254 SMPLitex arm textures; scenes rendered in Blender with randomized lighting (1–5 lights), backgrounds, and camera parameters (FOV 30°–40°, distance 0.6–1.0 m).
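As a rough sketch of the randomization quoted above, the snippet below samples one render configuration per scene; uniform sampling and the helper name are assumptions, not the released pipeline code.

```python
import random

def sample_scene_config(rng: random.Random) -> dict:
    """Sample one randomized render configuration using the ranges quoted above."""
    return {
        "num_lights": rng.randint(1, 5),             # 1-5 randomized lights
        "fov_deg": rng.uniform(30.0, 40.0),          # camera field of view in degrees
        "camera_distance_m": rng.uniform(0.6, 1.0),  # camera-to-hand distance in meters
        "hand_texture_id": rng.randrange(10_240),    # one of the Handy hand textures
        "arm_texture_id": rng.randrange(254),        # one of the SMPLitex arm textures
        "background_id": rng.randrange(1_270),       # one of the background images
    }

rng = random.Random(0)
configs = [sample_scene_config(rng) for _ in range(4)]  # e.g. four render jobs
```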


Qualitative Results

Dataset samples

AnyHand covers a wide range of hand poses, skin tones, viewpoints, lighting conditions, and interaction scenarios. Each rendered image is paired with a 3D hand mesh, 2D joint annotations, and a depth map.

AnyHand-Single Demo Figure

Qualitative results for AnyHand-Single. The dataset covers diverse single-hand poses, viewpoints, textures, and scene contexts.

AnyHand-Interact Demo Figure

Qualitative results for AnyHand-Interact. The dataset captures diverse hand-object interactions with varied grasps, objects, and occlusion patterns.

Comparisons

Co-training WiLoR with AnyHand substantially improves mesh-to-image alignment on real-world images — recovering more accurate hand scale, palm width, finger thickness, and articulation, especially under hand-object interaction and challenging viewpoints.

In-the-wild

In-the-wild qualitative comparisons. WiLoR w/ AnyHand (pink) vs. original WiLoR (blue) and HaMeR (white).



BibTeX

citation.bib
@misc{si2026anyhand,
  title         = {AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation},
  author        = {Si, Chen and Liu, Yulin and Ai, Bo and Xie, Jianwen and Potamias, Rolandos Alexandros and Zheng, Chuanxia and Su, Hao},
  year          = {2026},
  eprint        = {2603.25726},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}