In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions.
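To make the masked cross-attention idea concrete, the sketch below shows one way such a layer could inject product reference features only into the spatial region covered by the product. This is a minimal PyTorch illustration under assumed names and shapes (e.g., product_tokens, product_mask); it is not the authors' implementation.

```python
# Hypothetical masked cross-attention for injecting product reference detail.
# All module and tensor names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, video_tokens, product_tokens, product_mask):
        # video_tokens:   (B, N, C) latent tokens of the generated video
        # product_tokens: (B, M, C) encoder tokens of the product reference image
        # product_mask:   (B, N)    1 where a video token lies inside the
        #                           product bounding box, 0 elsewhere
        B, N, C = video_tokens.shape
        h = self.num_heads
        q = self.q(video_tokens).view(B, N, h, C // h).transpose(1, 2)
        k, v = self.kv(product_tokens).chunk(2, dim=-1)
        k = k.view(B, -1, h, C // h).transpose(1, 2)
        v = v.view(B, -1, h, C // h).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)        # (B, h, N, C//h)
        out = out.transpose(1, 2).reshape(B, N, C)
        out = self.proj(out)
        # Restrict the injected product detail to the masked region, leaving
        # human and background tokens on the untouched residual path.
        return video_tokens + out * product_mask.unsqueeze(-1)
```

Restricting the residual update to the masked region is one simple way to carry fine product details (logos, textures) into the video tokens without disturbing the human appearance.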
Overview of DreamActor-H1. DreamActor-H1 can generate high-fidelity human-product demonstration videos from a human reference image and a product reference image. Our framework is built upon a DiT architecture, specifically leveraging Seaweed-7B, a foundational model for video generation with around 7 billion (7B) parameters. In the dataset preparation phase, we first use a Vision-Language Model (VLM) to describe the product and human images. Subsequently, pose estimation and bounding box detection are applied to the training human-product demonstration videos. During the training stage, we combine the human pose and product bounding box with the input video noise to serve as motion guidance. Additionally, we encode the input human and product images with a variational autoencoder (VAE) to serve as appearance guidance. The descriptions of the human and product are utilized as supplementary information, enhancing the visual quality of materials and the 3D consistency under small rotational changes across frames. In the DiT model, we implement stacks of full attention, reference attention, and object attention; notably, object attention incorporates the product latent as an extra input. During the inference stage, we perform automatic pose template selection based on the human and product information. Overall, our approach overcomes the challenges of identity preservation, motion realism, and spatial relationship modeling, and produces high-quality human-product demonstration videos given a human image and a product image as inputs.
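The stacked attention design described above can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration, not the released implementation: the block name (DemoDiTBlock), the token shapes, and the pre-norm residual layout are hypothetical, while the three stages (full attention over video tokens, reference attention over human appearance tokens, and object attention that takes the product latent as an extra input) follow the caption.

```python
# Hypothetical DiT block with full, reference, and object attention stages.
# Names, shapes, and the residual layout are illustrative assumptions.
import torch
import torch.nn as nn

class DemoDiTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.full_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ref_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.obj_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, human_ref, product_latent):
        # x:              (B, N, C) video tokens; upstream they would combine
        #                 noisy latents with pose/bounding-box motion guidance
        # human_ref:      (B, R, C) VAE tokens of the human reference image
        # product_latent: (B, P, C) VAE tokens of the product reference image
        h = self.norm1(x)
        x = x + self.full_attn(h, h, h, need_weights=False)[0]   # spatio-temporal self-attention
        h = self.norm2(x)
        x = x + self.ref_attn(h, human_ref, human_ref,
                              need_weights=False)[0]             # human appearance
        h = self.norm3(x)
        x = x + self.obj_attn(h, product_latent, product_latent,
                              need_weights=False)[0]             # product appearance
        return x + self.mlp(self.norm4(x))
```

Keeping human and product information in separate attention stages is one way to preserve both identities at once, and it lets the object-attention stage be ablated independently.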
Our method generalizes robustly across diverse humans and products.
Our method generates results with fine-grained preservation of product and human identity, temporal consistency, and high fidelity.
We compare our full model with two ablated variants: the full model w/o text input, and the full model w/o text input and w/o object attention (baseline).