Transform a pretrained text-to-image model into a text-to-video model
This project involves training a video diffusion model based on Stable Diffusion XL image priors.
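A common way to reuse image priors for video (this is a general sketch of the technique, not the repo's actual code) is to fold the frame axis into the batch axis so the pretrained 2D layers process each frame as an independent image, then restore the frame axis for the temporal layers. The helper names below are hypothetical:

```python
# Hypothetical illustration of the shape bookkeeping used when adapting a
# text-to-image UNet to video: 2D layers see (batch * frames) images.

def fold_frames(shape):
    """(batch, channels, frames, height, width) -> (batch*frames, channels, height, width)."""
    b, c, f, h, w = shape
    return (b * f, c, h, w)

def unfold_frames(shape, frames):
    """Inverse of fold_frames, restoring the frame axis for temporal layers."""
    bf, c, h, w = shape
    return (bf // frames, c, frames, h, w)

video = (2, 4, 16, 64, 64)        # 2 clips, 4 latent channels, 16 frames, 64x64
as_images = fold_frames(video)    # (32, 4, 64, 64): 32 independent "images"
restored = unfold_frames(as_images, frames=16)
print(as_images, restored)
```

The round trip is lossless, which is what lets the pretrained spatial weights be reused unchanged while only the temporal layers are trained from scratch.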
git clone https://github.com/motexture/stable-diffusion-xl-video.git
cd stable-diffusion-xl-video
pip install deepspeed
pip install -r requirements.txt
On some systems, DeepSpeed requires the CUDA toolkit to be installed before it can build properly. If you do not have the CUDA toolkit, or DeepSpeed fails with an error, follow NVIDIA's instructions: https://developer.nvidia.com/cuda-downloads
or, on Linux systems:
sudo apt install build-essential
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda_12.2.0_535.54.03_linux.run
sudo sh cuda_12.2.0_535.54.03_linux.run
During the installation you only need to install the toolkit, not the drivers or the documentation.
Open the training.yaml file and modify the parameters according to your needs.
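The exact keys in training.yaml are defined by the repo; a hypothetical fragment, just to illustrate the kind of parameters such a config typically exposes (all names and values below are assumptions, not the repo's actual schema):

```yaml
# Hypothetical example -- consult the repo's training.yaml for the real keys.
pretrained_model_path: stabilityai/stable-diffusion-xl-base-1.0
output_dir: ./outputs
train_batch_size: 1
learning_rate: 1.0e-5
max_train_steps: 10000
num_frames: 16
resolution: 1024
mixed_precision: fp16
```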
deepspeed train.py --config training.yaml
The inference.py script can be used to render videos from trained checkpoints.
Example usage:
python inference.py \
--model sdxlvid \
--prompt "a fast moving fancy sports car" \
--num-frames 16 \
--width 1024 \
--height 1024 \
--sdp
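The flags shown above map naturally onto an argparse definition. The sketch below is a hypothetical mirror of that CLI surface (the actual inference.py may use different defaults or additional flags); it simply demonstrates how the example invocation would parse:

```python
import argparse

# Hypothetical reconstruction of the inference CLI shown above -- defaults and
# help strings are assumptions, not taken from the repo.
parser = argparse.ArgumentParser(description="Render a video from a trained checkpoint")
parser.add_argument("--model", required=True, help="checkpoint name or path")
parser.add_argument("--prompt", required=True, help="text prompt for the video")
parser.add_argument("--num-frames", type=int, default=16, help="number of frames to generate")
parser.add_argument("--width", type=int, default=1024)
parser.add_argument("--height", type=int, default=1024)
parser.add_argument("--sdp", action="store_true", help="enable scaled dot-product attention")

# Parse the example invocation from the README.
args = parser.parse_args([
    "--model", "sdxlvid",
    "--prompt", "a fast moving fancy sports car",
    "--num-frames", "16",
    "--width", "1024",
    "--height", "1024",
    "--sdp",
])
print(args.model, args.num_frames, args.sdp)
```

Note that argparse converts the dashed flag `--num-frames` to the attribute `args.num_frames`, and `--sdp` is a boolean switch rather than a valued option.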