MuseV: Infinite-length and High Fidelity Virtual Human Video Generation with Visual Conditioned Parallel Denoising
Zhiqiang Xia*, Zhaokang Chen*, Bin Wu†, Chao Li, Kwok-Wai Hung, Chao Zhan, Yingjie He, Wenjiang Zhou (*co-first author, †Corresponding Author, [email protected])
github huggingface HuggingfaceSpace project Technical report (coming soon)
We have been pursuing the world-simulator vision since March 2023, believing that diffusion models can simulate the world. MuseV was a milestone we achieved around July 2023. Amazed by the progress of Sora, we decided to open-source MuseV, hoping it will benefit the community. Next we will move on to the promising diffusion+transformer scheme.
Update: We have released MuseTalk, a real-time, high-quality lip-sync model, which can be used together with MuseV as a complete virtual human generation solution.
MuseV is a diffusion-based virtual human video generation framework, which
- supports infinite-length generation using a novel Visual Conditioned Parallel Denoising scheme;
- is compatible with the Stable Diffusion ecosystem, including base_model, lora, controlnet, etc.;
- supports multiple reference-image techniques, including IPAdapter, ReferenceOnly, ReferenceNet, and IPAdapterFaceID.
Bug fix for musev_referencenet_pose: the model_name of unet and ip_adapter in the command was not correct; please use musev_referencenet_pose instead of musev_referencenet.
Released the MuseV project and the trained models musev and musev_referencenet.

All frames were generated directly from the text2video model, without any post-processing. More cases, including 1-2 minute videos, can be found on the project page.
The examples below can be accessed at configs/tasks/example.yaml.
| image | video | prompt |
| --- | --- | --- |
| | | (masterpiece, best quality, highres:1),(1boy, solo:1),(eye blinks:1.8),(head wave:1.3) |
| | | (masterpiece, best quality, highres:1), peaceful beautiful sea scene |
| | | (masterpiece, best quality, highres:1), peaceful beautiful sea scene |
| | | (masterpiece, best quality, highres:1), playing guitar |
| | | (masterpiece, best quality, highres:1), playing guitar |
| | | (masterpiece, best quality, highres:1),(1man, solo:1),(eye blinks:1.8),(head wave:1.3),Chinese ink painting style |
| | | (masterpiece, best quality, highres:1),(1girl, solo:1),(beautiful face, soft skin, costume:1),(eye blinks:{eye_blinks_factor}),(head wave:1.3) |
| image | video | prompt |
| --- | --- | --- |
| | | (masterpiece, best quality, highres:1), peaceful beautiful waterfall, an endless waterfall |
| | | (masterpiece, best quality, highres:1), peaceful beautiful sea scene |
pose2video
In the duffy example, the pose of the vision condition frame is not aligned with the first frame of the control video; posealign will solve the problem.
| image | video | prompt |
| --- | --- | --- |
| | | (masterpiece, best quality, highres:1), a girl is dancing, animation |
| | | (masterpiece, best quality, highres:1), is dancing, animation |
The character in the talk example, Sun Xinying, is a supermodel and KOL. You can follow her on Douyin.
| name | video |
| --- | --- |
| talk | |
| sing | |
Prepare the python environment and install extra packages such as diffusers, controlnet_aux, and mmcm.
Thanks to the third-party integrations, which make installation and use more convenient for everyone. Please note that we have not verified, maintained, or updated these third-party integrations; please refer to the corresponding projects for specific results.
netdisk: https://www.123pan.com/s/Pf5Yjv-Bb9W3.html
code: glut
We recommend using docker to prepare the python environment.

Attention: we have only tested with docker; there may be problems with conda or the requirements file. We will try to fix them. Please use docker.
docker pull anchorxia/musev:latest
docker run --gpus all -it --entrypoint /bin/bash anchorxia/musev:latest
The default conda env is musev.
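As a quick sanity check (a minimal sketch, assuming the image ships with PyTorch, which the inference scripts rely on), activate the env and confirm the GPU is visible:

```bash
# Inside the container started above: activate the pre-built env and check CUDA visibility.
conda activate musev
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```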
Create the conda environment from environment.yml:
conda env create --name musev --file ./environment.yml
pip install -r requirements.txt
If you are not using docker, you should additionally install the mmlab packages.
pip install --no-cache-dir -U openmim
mim install mmengine
mim install "mmcv>=2.0.1"
mim install "mmdet>=3.1.0"
mim install "mmpose>=1.1.0"
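A quick import check for the mmlab stack installed above (a hedged sketch; the exact versions depend on what mim resolves):

```bash
# All four packages should import without errors if the mim installs succeeded.
python -c "import mmengine, mmcv, mmdet, mmpose; print(mmcv.__version__)"
```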
git clone --recursive https://github.com/TMElyralab/MuseV.git
current_dir=$(pwd)
export PYTHONPATH=${PYTHONPATH}:${current_dir}/MuseV
export PYTHONPATH=${PYTHONPATH}:${current_dir}/MuseV/MMCM
export PYTHONPATH=${PYTHONPATH}:${current_dir}/MuseV/diffusers/src
export PYTHONPATH=${PYTHONPATH}:${current_dir}/MuseV/controlnet_aux/src
cd MuseV
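To confirm that the bundled packages are picked up from PYTHONPATH rather than any pip-installed copies, a minimal check (module names follow the directories added above; mmcm is assumed to be the package shipped in MMCM):

```bash
# Printing the file path shows which copy of diffusers is actually being imported.
python -c "import diffusers, controlnet_aux, mmcm; print(diffusers.__file__)"
```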
- MMCM: multimedia, cross-modal processing package.
- diffusers: modified diffusers package, based on diffusers.
- controlnet_aux: modified package, based on controlnet_aux.
git clone https://huggingface.co/TMElyralab/MuseV ./checkpoints
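After the download, the folder should roughly contain the sub-directories described below (a hedged check; the exact layout depends on the current Hugging Face repo):

```bash
# Expect to see at least motion/, t2i/ and IP-Adapter/ among the checkpoint folders.
ls ./checkpoints
```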
- motion: text2video model, trained on the tiny ucf101 and tiny webvid datasets, approximately 60K video-text pairs. GPU memory consumption was tested at resolution 512*512 with time_size=12.
  - musev/unet: contains and trains only the unet motion module. GPU memory consumption ≈ 8G.
  - musev_referencenet: trains the unet motion module, referencenet, and IPAdapter. GPU memory consumption ≈ 12G.
    - unet: motion module, whose to_k and to_v in the Attention layers refer to IPAdapter.
    - referencenet: similar to AnimateAnyone.
    - ip_adapter_image_proj.bin: image clip embedding projection layer, refer to IPAdapter.
  - musev_referencenet_pose: based on musev_referencenet, fixes referencenet and controlnet_pose, trains the unet motion module and IPAdapter. GPU memory consumption ≈ 12G.
- t2i/sd1.5: text2image model, whose parameters are frozen when training the motion module. Different t2i base_models have a significant impact and can be replaced with other t2i bases.
  - majicmixRealv6Fp16: example, download from majicmixRealv6Fp16.
  - fantasticmix_v10: example, download from fantasticmix_v10.
- IP-Adapter/models: download from IPAdapter.
  - image_encoder: vision clip model.
  - ip-adapter_sd15.bin: original IPAdapter model checkpoint.
  - ip-adapter-faceid_sd15.bin: original IPAdapterFaceID model checkpoint.

You can skip this step when running the example tasks with the example inference commands. Otherwise, set the model path and abbreviation in the config files below, so the abbreviations can be used in the inference scripts:
- musev/configs/model/T2I_all_model.py
- musev/configs/model/motion_model.py
- musev/configs/tasks/example.yaml
python scripts/inference/text2video.py --sd_model_name majicmixRealv6Fp16 --unet_model_name musev_referencenet --referencenet_model_name musev_referencenet --ip_adapter_model_name musev_referencenet -test_data_path ./configs/tasks/example.yaml --output_dir ./output --n_batch 1 --target_datas yongen --vision_clip_extractor_class_name ImageClipVisionFeatureExtractor --vision_clip_model_path ./checkpoints/IP-Adapter/models/image_encoder --time_size 12 --fps 12
common parameters:
- test_data_path: task path, in yaml format.
- target_datas: separated by ","; subtasks are sampled if their name in test_data_path is in target_datas.
- sd_model_cfg_path: T2I sd model path, either a model config path or a model path.
- sd_model_name: sd model name, used to choose the full model path in sd_model_cfg_path. Multiple model names can be given, separated by ",", or all (see the sketch after this list).
- unet_model_cfg_path: motion unet model config path or model path.
- unet_model_name: unet model name, used to get the model path in unet_model_cfg_path and to init the unet class instance in musev/models/unet_loader.py. Multiple model names can be given, separated by ",", or all. If unet_model_cfg_path is a model path, unet_name must be supported in musev/models/unet_loader.py.
- time_size: number of frames per diffusion denoise generation. Default=12.
- n_batch: number of generated shots; total_frames = n_batch * time_size + n_viscond. Default=1.
- context_frames: number of context frames. If time_size > context_frames, the time_size window is split into many sub-windows for parallel denoising. Default=12.
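For instance, building on the text2video command above, the "," separator and the all keyword can be combined as follows (a hedged sketch: the subtask names are illustrative and must exist in example.yaml):

```bash
# Run two subtasks from the yaml in one go and sweep every registered T2I base model.
# Subtask names after --target_datas are illustrative; they must match `name` entries in the yaml.
python scripts/inference/text2video.py \
  --sd_model_name all \
  --unet_model_name musev_referencenet --referencenet_model_name musev_referencenet \
  --ip_adapter_model_name musev_referencenet \
  -test_data_path ./configs/tasks/example.yaml \
  --vision_clip_extractor_class_name ImageClipVisionFeatureExtractor \
  --vision_clip_model_path ./checkpoints/IP-Adapter/models/image_encoder \
  --output_dir ./output --n_batch 1 --time_size 12 --fps 12 \
  --target_datas yongen,duffy
```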
To generate long videos, there are two ways (see the sketch below):
1. visual conditioned parallel denoising: set n_batch=1 and time_size = all the frames you want.
2. traditional end-to-end: set time_size = context_frames = frames of a shot (12) and context_overlap = 0.
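A hedged sketch of the two recipes, based on the text2video command above (it assumes context_frames and context_overlap are exposed as command-line flags like the other parameters; check scripts/inference/text2video.py for the exact argument names, and replace "..." with the remaining flags from the full command):

```bash
# Way 1: visual conditioned parallel denoising — one long shot, denoised in 12-frame sub-windows.
python scripts/inference/text2video.py ... --n_batch 1 --time_size 72 --fps 12

# Way 2: traditional end-to-end — several 12-frame shots generated one after another.
python scripts/inference/text2video.py ... --n_batch 6 --time_size 12 --context_frames 12 --context_overlap 0 --fps 12
```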
model parameters:
Supports referencenet, IPAdapter, IPAdapterFaceID, and Facein.
- referencenet_model_name: referencenet model name.
- vision_clip_extractor_class_name: ImageEmbExtractor name; the extractor of the vision clip embedding used in IPAdapter.
- vision_clip_model_path: ImageClipVisionFeatureExtractor model path.
- ip_adapter_model_name: from IPAdapter; it is the ImagePromptEmbProj, used together with ImageEmbExtractor.
- ip_adapter_face_model_name: IPAdapterFaceID, from IPAdapter, used to keep the face id; face_image_path should be set.
Some parameters that affect the motion range and generation results (see the sketch below):
- video_guidance_scale: similar to text2image, controls the influence between cond and uncond. Default=3.5.
- use_condition_image: whether to use the given first frame for video generation; if not, vision condition frames are generated first. Default=True.
- redraw_condition_image: whether to redraw the given first frame image.
- video_negative_prompt: abbreviation of the full negative_prompt in the config path. Default=V2.
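These could be added to the commands above roughly as follows (a hedged sketch: it assumes the parameters are exposed as command-line flags under these exact names; verify against scripts/inference/text2video.py before relying on it):

```bash
# Stronger guidance plus a redrawn first frame typically trades motion range for fidelity.
# Replace "..." with the remaining flags from the full command above.
python scripts/inference/text2video.py ... --video_guidance_scale 3.5 --redraw_condition_image True --video_negative_prompt V2
```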
video2video

The t2i base_model has a significant impact. In this case, fantasticmix_v10 performs better than majicmixRealv6Fp16.
python scripts/inference/video2video.py --sd_model_name fantasticmix_v10 --unet_model_name musev_referencenet --referencenet_model_name musev_referencenet --ip_adapter_model_name musev_referencenet -test_data_path ./configs/tasks/example.yaml --vision_clip_extractor_class_name ImageClipVisionFeatureExtractor --vision_clip_model_path ./checkpoints/IP-Adapter/models/image_encoder --output_dir ./output --n_batch 1 --controlnet_name dwpose_body_hand --which2video "video_middle" --target_datas dance1 --fps 12 --time_size 12
important parameters:
Most of the parameters are the same as for musev_text2video. The special parameters of video2video are:
- video_path: the reference video in test_data. The reference video currently supports rgb video and controlnet_middle_video.
- which2video: whether the rgb video influences the initial noise; the influence of rgb is stronger than that of the controlnet condition.
- controlnet_name: whether to use a controlnet condition, such as dwpose, depth.
- video_is_middle: whether video_path is an rgb video or a controlnet_middle_video. Can be set for every test_data in test_data_path.
- video_has_condition: whether condition_images is aligned with the first frame of video_path. If not, the condition of condition_images is extracted first and then aligned by concatenation. Set in test_data (see the sketch below).
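To make the per-task fields concrete, here is a purely hypothetical test_data entry written as a shell heredoc (the field names follow the parameters above; the task name, prompt, and paths are illustrative only, so compare with the real entries in configs/tasks/example.yaml before using it):

```bash
# Append a hypothetical video2video task; adjust keys to match the real example.yaml schema.
cat >> ./configs/tasks/my_tasks.yaml << 'EOF'
- name: my_dance
  prompt: "(masterpiece, best quality, highres:1), a girl is dancing"
  condition_images: ./data/images/my_character.png
  video_path: ./data/videos/my_dance.mp4
  video_is_middle: false       # video_path is an rgb video, not a controlnet_middle_video
  video_has_condition: true    # condition_images is aligned with the first frame of video_path
EOF
```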
All controlnet_names refer to mmcm:
['pose', 'pose_body', 'pose_hand', 'pose_face', 'pose_hand_body', 'pose_hand_face', 'dwpose', 'dwpose_face', 'dwpose_hand', 'dwpose_body', 'dwpose_body_hand', 'canny', 'tile', 'hed', 'hed_scribble', 'depth', 'pidi', 'normal_bae', 'lineart', 'lineart_anime', 'zoe', 'sam', 'mobile_sam', 'leres', 'content', 'face_detector']
Only used for pose2video. It is trained based on musev_referencenet, fixing referencenet, pose-controlnet, and T2I, while training the motion module and IPAdapter.
The t2i base_model has a significant impact. In this case, fantasticmix_v10 performs better than majicmixRealv6Fp16.
python scripts/inference/video2video.py --sd_model_name fantasticmix_v10 --unet_model_name musev_referencenet_pose --referencenet_model_name musev_referencenet --ip_adapter_model_name musev_referencenet_pose -test_data_path ./configs/tasks/example.yaml --vision_clip_extractor_class_name ImageClipVisionFeatureExtractor --vision_clip_model_path ./checkpoints/IP-Adapter/models/image_encoder --output_dir ./output --n_batch 1 --controlnet_name dwpose_body_hand --which2video "video_middle" --target_datas dance1 --fps 12 --time_size 12
This model only has the motion module, without referencenet, and requires less GPU memory.
python scripts/inference/text2video.py --sd_model_name majicmixRealv6Fp16 --unet_model_name musev -test_data_path ./configs/tasks/example.yaml --output_dir ./output --n_batch 1 --target_datas yongen --time_size 12 --fps 12
python scripts/inference/video2video.py --sd_model_name fantasticmix_v10 --unet_model_name musev -test_data_path ./configs/tasks/example.yaml --output_dir ./output --n_batch 1 --controlnet_name dwpose_body_hand --which2video "video_middle" --target_datas dance1 --fps 12 --time_size 12
MuseV provides a gradio script that launches a GUI on a local machine, so videos can be generated conveniently.
cd scripts/gradio
python app.py
MuseV was trained on the ucf101 and webvid datasets. Thanks for open-sourcing!
There are still many limitations, including:
- MuseV has been trained on approximately 60K human text-video pairs at resolution 512*320. MuseV has a greater motion range but lower video quality at lower resolution, and tends to generate a smaller motion range with high video quality. Training on a larger, higher-resolution, higher-quality text-video dataset may make MuseV better.
- Watermarks may appear because of webvid. A cleaner dataset without watermarks may solve this issue.
- MuseV supports rich and dynamic features, but the code is complex and unrefactored. It takes time to become familiar with it.

@article{musev,
title={MuseV: Infinite-length and High Fidelity Virtual Human Video Generation with Visual Conditioned Parallel Denoising},
author={Xia, Zhiqiang and Chen, Zhaokang and Wu, Bin and Li, Chao and Hung, Kwok-Wai and Zhan, Chao and He, Yingjie and Zhou, Wenjiang},
journal={arxiv},
year={2024}
}
- code: The code of MuseV is released under the MIT License. There is no limitation for either academic or commercial usage.
- model: The trained models are available for non-commercial research purposes only.
- other open-source models: Other open-source models used must comply with their licenses, such as insightface, IP-Adapter, ft-mse-vae, etc.
- AIGC: This project strives to impact the domain of AI-driven video generation positively. Users are granted the freedom to create videos using this tool, but they are expected to comply with local laws and use it responsibly. The developers do not assume any responsibility for potential misuse by users.