In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions.
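To make the masked cross-attention idea concrete, the sketch below shows one way such a layer could inject product reference features only into the spatial region covered by the product. This is a minimal PyTorch illustration under assumed names and shapes (e.g., product_tokens, product_mask); it is not the authors' implementation.

```python
# Hypothetical masked cross-attention for injecting product reference detail.
# All module and tensor names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, video_tokens, product_tokens, product_mask):
        # video_tokens:   (B, N, C) latent tokens of the generated video
        # product_tokens: (B, M, C) encoder tokens of the product reference image
        # product_mask:   (B, N)    1 where a video token lies inside the
        #                           product bounding box, 0 elsewhere
        B, N, C = video_tokens.shape
        h = self.num_heads
        q = self.q(video_tokens).view(B, N, h, C // h).transpose(1, 2)
        k, v = self.kv(product_tokens).chunk(2, dim=-1)
        k = k.view(B, -1, h, C // h).transpose(1, 2)
        v = v.view(B, -1, h, C // h).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)        # (B, h, N, C//h)
        out = out.transpose(1, 2).reshape(B, N, C)
        out = self.proj(out)
        # Restrict the injected product detail to the masked region, leaving
        # human and background tokens on the untouched residual path.
        return video_tokens + out * product_mask.unsqueeze(-1)
```

Restricting the residual update to the masked region is one simple way to carry fine product details (logos, textures) into the video tokens without disturbing the human appearance.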
Overview of DreamActor-H1. DreamActor-H1 can generate high-fidelity human-product demonstration videos from a human reference image and a product reference image. Our framework is built upon a DiT architecture, specifically leveraging Seaweed-7B, a foundational model for video generation with around 7 billion (7B) parameters. In the dataset preparation phase, we first use a Vision-Language Model (VLM) to describe the product and human images. Subsequently, pose estimation and bounding box detection are applied to the training human-product demonstration videos. During the training stage, we combine the human pose and product bounding box with the input video noise to serve as motion guidance. Additionally, we encode the input human and product images with a variational autoencoder (VAE) to serve as appearance guidance. The descriptions of the human and product are utilized as supplementary information, enhancing the visual quality of materials and the 3D consistency under small rotational changes across frames. In the DiT model, we implement stacks of full attention, reference attention, and object attention; notably, object attention incorporates the product latent as an extra input. During the inference stage, we perform automatic pose template selection based on the human and product information. Overall, our approach overcomes the challenges of identity preservation, motion realism, and spatial relationship modeling, and produces high-quality human-product demonstration videos given a human image and a product image as inputs.
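The stacked attention design described above can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration, not the released implementation: the block name (DemoDiTBlock), the token shapes, and the pre-norm residual layout are hypothetical, while the three stages (full attention over video tokens, reference attention over human appearance tokens, and object attention that takes the product latent as an extra input) follow the caption.

```python
# Hypothetical DiT block with full, reference, and object attention stages.
# Names, shapes, and the residual layout are illustrative assumptions.
import torch
import torch.nn as nn

class DemoDiTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.full_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ref_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.obj_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, human_ref, product_latent):
        # x:              (B, N, C) video tokens; upstream they would combine
        #                 noisy latents with pose/bounding-box motion guidance
        # human_ref:      (B, R, C) VAE tokens of the human reference image
        # product_latent: (B, P, C) VAE tokens of the product reference image
        h = self.norm1(x)
        x = x + self.full_attn(h, h, h, need_weights=False)[0]   # spatio-temporal self-attention
        h = self.norm2(x)
        x = x + self.ref_attn(h, human_ref, human_ref,
                              need_weights=False)[0]             # human appearance
        h = self.norm3(x)
        x = x + self.obj_attn(h, product_latent, product_latent,
                              need_weights=False)[0]             # product appearance
        return x + self.mlp(self.norm4(x))
```

Keeping human and product information in separate attention stages is one way to preserve both identities at once, and it lets the object-attention stage be ablated independently.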
Our method generalizes robustly across diverse humans and products.
Our method generates results with fine-grained preservation of product and human identity, temporal consistency, and high fidelity.
We compare our full model with two ablated variants: the full model w/o text input, and the full model w/o text input and w/o object attention (baseline).