ID-Crafter generates multi-subject videos
from a text prompt and multiple reference images using
(1) Hierarchical Identity-Preserving Attention, which aggregates features both within and across
subjects and modalities, ensuring identity consistency and faithful textual alignment;
(2) Semantic Guidance via Pretrained Vision-Language Models (VLMs), leveraging VLMs' rich semantic
understanding to capture fine-grained interactions among multiple subjects and modalities; and
(3) an online reinforcement learning phase that further enhances video quality and
preserves subject identities across time. Extensive experiments show that ID-Crafter outperforms previous methods in identity
preservation, temporal consistency, and overall video quality.