ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation

* equal contribution, † project lead
reference 1
reference 2
reference 3

Prompt

In a cherry blossom forest where pink-and-white cherry blossoms fall, the boy squats down to call a white cat softly, holding out a dried fish. The cat is wary at first, then approaches slowly—after taking the fish, it gently rubs its head against the boy’s palm, with petals landing on their shoulders.
ID-Crafter is a multi-subject video synthesis model that generates subject-consistent videos with multiple references.

🧩   Abstract   🧩

High-fidelity video synthesis has progressed significantly, yet current approaches often fail to integrate identity information from multiple subjects effectively. This leads to semantic conflicts and weak preservation of identities and interactions, which limits controllability and applicability. To tackle this issue, we introduce ID-Crafter, a framework for multi-subject video generation that achieves superior identity preservation and semantic coherence. ID-Crafter equips a video DiT with a hierarchical identity-preserving attention mechanism and a VLM that reasons over the multimodal input, enabling multi-subject video generation. An online RL stage further refines concept alignment.

🔮   Method   🔮

Architecture

ID-Crafter generates multi-subject videos from a text prompt and multiple reference images through three components: (1) Hierarchical Identity-Preserving Attention, which aggregates features both within and across subjects and modalities, ensuring identity consistency and faithful textual alignment; (2) Semantic Guidance via Pretrained Vision-Language Models (VLMs), which leverages the VLMs' rich semantic understanding to capture fine-grained interactions among multiple subjects and modalities; and (3) an online reinforcement learning phase, which further enhances video quality and preserves subject identities across time. Extensive experiments show that ID-Crafter outperforms previous methods in identity preservation, temporal consistency, and overall video quality.
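The idea of aggregating features first within each subject and then across subjects can be illustrated with a toy two-level attention. This is only a minimal sketch under stated assumptions, not ID-Crafter's implementation: the function names (`attend`, `hierarchical_attention`), the `image`/`text` token layout, and the use of plain scaled dot-product attention without learned projections are all hypothetical choices made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out

def hierarchical_attention(video_tokens, subjects):
    """Two-level aggregation (hypothetical layout, for illustration only).

    subjects: list of dicts with 'image' and 'text' token lists.
    """
    # Level 1 (within a subject): each subject's image tokens attend over
    # that subject's own image and text tokens.
    fused = []
    for s in subjects:
        context = s["image"] + s["text"]
        fused.append(attend(s["image"], context, context))
    # Level 2 (across subjects): video tokens attend over the fused tokens
    # of all subjects at once.
    all_tokens = [tok for subj in fused for tok in subj]
    return attend(video_tokens, all_tokens, all_tokens)

# Toy usage: two video tokens, two subjects, 2-dimensional features.
video = [[1.0, 0.0], [0.0, 1.0]]
subjects = [
    {"image": [[1.0, 0.0]], "text": [[0.0, 1.0]]},
    {"image": [[0.5, 0.5]], "text": [[1.0, 1.0]]},
]
result = hierarchical_attention(video, subjects)
```

In this sketch each output token is a convex combination of the subject tokens, so the result keeps the shape of `video`. A real model would add learned query, key, and value projections and would run inside the video DiT blocks.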


🎬   Results   🎬

Multi-Reference Subject-to-Video Generation

ID-Crafter can generate natural and dynamic details across multiple subjects from given text, such as fabric wrinkles induced by movement, wind-swept hair, and crumbs scattered from torn bread.

reference 1
reference 2
reference 3
In a cherry blossom forest where pink-and-white cherry blossoms fall, the boy squats down to call a white cat softly, holding out a dried fish. The cat is wary at first, then approaches slowly—after taking the fish, it gently rubs its head against the boy’s palm, with petals landing on their shoulders.
reference 1
reference 2
In the living room, the dog jumps onto the beige sofa with the rabbit doll in its mouth, then lies down beside the sofa armrest.

ID-preserving Generation

ID-Crafter faithfully preserves the identity of reference subjects (including people and items) while producing vivid videos aligned with the given prompt.

reference 1
reference 2
reference 3
On the pitch, just as the man mounts the broomstick to steady himself, the Golden Snitch suddenly zooms around his hair. The broom wobbles, he grabs the stick in a panic and leans sideways.
reference 1
reference 2
reference 3
In the European-style alley, the man in a gray sweatshirt just lifts a bread slice.

📝   Comparison Results   📝

reference 1
reference 2
reference 3
Under warm orange soft light, on a park lawn covered with ginkgo leaves, sunlight filters through branches, casting light spots. A chubby light brown hedgehog stops before a fuzzy acorn, stretches pink paws to nudge it gently, and the acorn rolls slowly once on the leaves, full of cozy autumn charm.

Ours
Phantom
reference 1
reference 2
A sulphur-crested cockatoo swoops to grab its corner. He freezes, bread crumbs cover his sweatshirt, and the cockatoo shakes the bread while tilting its head.

Ours
Phantom
reference 1
reference 2
reference 3
In the gravel area of the Martian desert, Trump, in a spacesuit and in boots, touches the texture of a reddish-brown Martian rock with his gloved hand. He leans over slightly, the helmet visor reflects the surrounding red sand, and several gravels of different sizes are beside him.

Ours
Phantom
reference 1
reference 2
In a room, a woman wearing pink headphones gently closes her eyes with a smile. Her face shows sheer enjoyment as she is lost in the music.

📝   Benchmark   📝

📊   Data   📊