Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt. This is done by training a model that takes a text prompt as input and returns the VQGAN latent space as output, which is then decoded into an RGB image. The model is trained on a dataset of text prompts and can be used on unseen text prompts. The loss function minimizes the distance between the CLIP features of the generated image and the CLIP features of the input text. Additionally, a diversity loss can be used to increase the diversity of the images generated for the same prompt.
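For illustration, the objective can be sketched as follows. This is a minimal sketch, assuming `model`, `vqgan`, and `clip_model` interfaces that stand in for the repo's actual modules; the diversity term shown is one plausible formulation, not necessarily the repo's exact implementation:

```python
# Hypothetical sketch of the training objective described above.
# `model`, `vqgan`, and `clip_model` are assumed interfaces, not this repo's exact API.
import torch
import torch.nn.functional as F

def training_loss(model, vqgan, clip_model, text_tokens, diversity_weight=0.0):
    text_feats = F.normalize(clip_model.encode_text(text_tokens), dim=-1)
    latents = model(text_feats)                  # text features -> VQGAN latents
    images = vqgan.decode(latents)               # VQGAN latents -> RGB images
    image_feats = F.normalize(clip_model.encode_image(images), dim=-1)

    # Main loss: cosine distance between CLIP image and text features
    loss = (1.0 - (image_feats * text_feats).sum(dim=-1)).mean()

    if diversity_weight > 0:
        # Assumed diversity term: penalize pairwise similarity between the
        # CLIP features of images generated for the same prompt
        sim = image_feats @ image_feats.t()
        off_diag = sim - torch.diag(torch.diagonal(sim))
        loss = loss + diversity_weight * off_diag.mean()
    return loss
```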
Install with conda:

```bash
conda create -n ff_vqgan_clip_env python=3.8
conda activate ff_vqgan_clip_env
# Install pytorch/torchvision - see https://pytorch.org/get-started/locally/ for more info.
(ff_vqgan_clip_env) conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
(ff_vqgan_clip_env) pip install -r requirements.txt
```
Or with venv:

```bash
conda deactivate # Make sure to use your global python3
# venv ships with Python 3.3+, so no separate install is needed
python3 -m venv ./ff_vqgan_clip_venv
source ./ff_vqgan_clip_venv/bin/activate
(ff_vqgan_clip_venv) python -m pip install -r requirements.txt
```
In either environment, install net2net (used by the Net2Net priors below):

```bash
pip install git+https://github.com/CompVis/net2net
```
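As a quick sanity check that the environment is set up and the GPU is visible:

```python
# Optional environment sanity check
import torch
print("torch", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```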
(Optional) Pre-tokenize the captions:

```bash
(ff_vqgan_clip_venv) python main.py tokenize data/list_of_captions.txt cembeds 128
```
Modify configs/example.yaml as needed, then start training:

```bash
(ff_vqgan_clip_venv) python main.py train configs/example.yaml
```

The training loss is logged for TensorBoard:
```bash
# in a new terminal/session
(ff_vqgan_clip_venv) pip install tensorboard
(ff_vqgan_clip_venv) tensorboard --logdir results
```
After downloading a model (see the pre-trained models available below) or training your own, you can test it on new prompts, e.g.:
```bash
wget https://github.com/mehdidc/feed_forward_vqgan_clip/releases/download/0.2/cc12m_32x1024_vitgan.th
python -u main.py test cc12m_32x1024_vitgan.th "Picture of a futuristic snowy city during the night, the tree is lit with a lantern"
```
You can also use the priors to generate multiple images for the same text prompt, e.g.:
```bash
wget https://github.com/mehdidc/feed_forward_vqgan_clip/releases/download/0.4/cc12m_32x1024_mlp_mixer_openclip_laion2b_ViTB32_256x256_v0.4.th
wget https://github.com/mehdidc/feed_forward_vqgan_clip/releases/download/0.4/prior_cc12m_2x1024_openclip_laion2b_ViTB32_v0.4.th
python main.py test cc12m_32x1024_mlp_mixer_openclip_laion2b_ViTB32_256x256_v0.4.th "bedroom from 1700" --prior-path=prior_cc12m_2x1024_openclip_laion2b_ViTB32_v0.4.th --nb-repeats=4 --images-per-row=4
```
You can also try all the models in the Colab Notebook and on Replicate. Using the notebook, you can generate images from pre-trained models and interpolate between text prompts to create videos; see for instance video 1, video 2, or video 3.
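Prompt interpolation boils down to blending two CLIP text embeddings and decoding each intermediate embedding to a frame. A minimal sketch, again assuming the hypothetical `model`, `vqgan`, `clip_model`, and `tokenizer` interfaces from above rather than this repo's exact API:

```python
import torch

@torch.no_grad()
def interpolate_prompts(model, vqgan, clip_model, tokenizer, prompt_a, prompt_b, steps=30):
    feats_a = clip_model.encode_text(tokenizer(prompt_a))
    feats_b = clip_model.encode_text(tokenizer(prompt_b))
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        feats = (1 - t) * feats_a + t * feats_b    # linear blend of text features
        frames.append(vqgan.decode(model(feats)))  # decode blended features to an image
    return frames  # e.g., stack and write out as video frames
```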
Name | Type | Size | Dataset | Link | Author |
---|---|---|---|---|---|
cc12m_32x1024_mlp_mixer_clip_ViTB32_pixelrecons_256x256 | MLPMixer | 1.2GB | Conceptual captions 12M | Download | @mehdidc |
cc12m_32x1024_mlp_mixer_openclip_laion2b_ViTB32_256x256 | MLPMixer | 1.2GB | Conceptual captions 12M | Download | @mehdidc |
cc12m_32x1024_mlp_mixer_openclip_laion2b_imgEmb_ViTB32_256x256 | MLPMixer | 1.2GB | Conceptual captions 12M | Download | @mehdidc |
cc12m_1x1024_mlp_mixer_openclip_laion2b_ViTB32_512x512 | MLPMixer | 580MB | Conceptual captions 12M | Download | @mehdidc |
prior_cc12m_2x1024_openclip_laion2b_ViTB32 | Net2Net | 964MB | Conceptual captions 12M | Download | @mehdidc |
prior_cc12m_2x1024_clip_ViTB32 | Net2Net | 964MB | Conceptual captions 12M | Download | @mehdidc |
Name | Type | Size | Dataset | Link | Author |
---|---|---|---|---|---|
cc12m_32x1024_mlp_mixer_clip_ViTB32_256x256 | MLPMixer | 1.19GB | Conceptual captions 12M | Download | @mehdidc |
cc12m_32x1024_mlp_mixer_cloob_rn50_256x256 | MLPMixer | 1.32GB | Conceptual captions 12M | Download | @mehdidc |
cc12m_256x16_xtransformer_clip_ViTB32_512x512 | Transformer | 571MB | Conceptual captions 12M | Download | @mehdidc |
Name | Type | Size | Dataset | Link | Author |
---|---|---|---|---|---|
cc12m_8x128 | MLPMixer | 12.1MB | Conceptual captions 12M | Download | @mehdidc |
cc12m_32x1024 | MLPMixer | 1.19GB | Conceptual captions 12M | Download | @mehdidc |
cc12m_32x1024 | VitGAN | 1.55GB | Conceptual captions 12M | Download | @mehdidc |
Name | Type | Size | Dataset | Link | Author |
---|---|---|---|---|---|
cc12m_8x128 | VitGAN | 12.1MB | Conceptual captions 12M | Download | @mehdidc |
cc12m_16x256 | VitGAN | 60.1MB | Conceptual captions 12M | Download | @mehdidc |
cc12m_32x512 | VitGAN | 408.4MB | Conceptual captions 12M | Download | @mehdidc |
cc12m_32x1024 | VitGAN | 1.55GB | Conceptual captions 12M | Download | @mehdidc |
cc12m_64x1024 | VitGAN | 3.05GB | Conceptual captions 12M | Download | @mehdidc |
bcaptmod_8x128 | VitGAN | 11.2MB | Modified blog captions | Download | @afiaka87 |
bcapt_16x128 | MLPMixer | 168.8MB | Blog captions | Download | @mehdidc |
NB: cc12m_AxB denotes a model trained on Conceptual Captions 12M with depth A and hidden dimension B; e.g., cc12m_32x1024 has depth 32 and hidden dimension 1024.
The MLP-Mixer implementation (mlp_mixer_pytorch.py) is from https://github.com/lucidrains/mlp-mixer-pytorch.
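For orientation, `depth` and `dim` are the two knobs behind the cc12m_AxB naming above; the upstream library exposes them like this (its stock image-classification example, not this repo's text-to-latent configuration):

```python
import torch
from mlp_mixer_pytorch import MLPMixer

# depth/dim correspond to A/B in the cc12m_AxB naming above
model = MLPMixer(
    image_size=256,
    channels=3,
    patch_size=16,
    dim=512,
    depth=12,
    num_classes=1000,
)

img = torch.randn(1, 3, 256, 256)
pred = model(img)  # shape: (1, 1000)
```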