Diffusion Models Beat GANs on Image Synthesis, Dhariwal & Nichol 2021.
Proposes architecture improvements over the 2021 state of the art (DDPM and DDIM) that could give some insight when writing models from scratch. In addition, it introduces classifier guidance to improve conditional image synthesis: the gradient of a noise-aware classifier steers sampling towards a target class. This was later superseded by classifier-free guidance, but using a classifier looks like the natural thing to do for conditional generation.
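A minimal sketch of the classifier-guidance step, assuming a hypothetical `classifier(x, t)` that was trained on noised images (the interface is made up for illustration; the mean-shift itself follows the paper's sampling algorithm):

```python
import torch

def classifier_grad(classifier, x, t, y):
    """Gradient of log p(y | x_t) w.r.t. the noised sample x_t."""
    with torch.enable_grad():
        x_in = x.detach().requires_grad_(True)
        logits = classifier(x_in, t)
        log_probs = torch.log_softmax(logits, dim=-1)
        selected = log_probs[range(len(y)), y].sum()
        return torch.autograd.grad(selected, x_in)[0]

def guided_mean(mean, variance, grad, guidance_scale=1.0):
    # Shift the reverse-process mean in the direction that increases
    # the classifier's log-probability of the target class y.
    return mean + guidance_scale * variance * grad
```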
DEIS Scheduler (Diffusion Exponential Integrator Sampler). The authors claim excellent sampling results with as few as 12 steps. I haven't read it yet.
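For a quick try, `diffusers` ships a DEIS implementation that can be swapped into an existing pipeline; a sketch assuming a Stable Diffusion v1.5 checkpoint:

```python
import torch
from diffusers import StableDiffusionPipeline, DEISMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# Swap the default scheduler for DEIS and sample in ~12 steps.
pipe.scheduler = DEISMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe("a photo of an astronaut riding a horse",
             num_inference_steps=12).images[0]
```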
Some of these tricks could be effective / didactic.
"Text Inversion": create new text embeddings from a few sample images. This effectively introduces new terms in the vocabulary that can be used in phrases for text to image generation.
Similar goal to the Textual Inversion paper, but a different approach, I think (I haven't read it yet).
Prompt-to-Prompt Image Editing with Cross Attention Control, Hertz et al. 2022.
Manipulates the cross-attention layers to edit text-to-image generations by replacing words, introducing new terms, or re-weighting the importance of existing terms.
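A crude sketch of just the re-weighting part (the paper additionally injects attention maps from the source prompt's generation, which this omits); it scales one prompt token's attention probabilities and renormalizes:

```python
import torch

def reweight_cross_attention(attn_probs, token_index, weight):
    """Scale one text token's cross-attention and renormalize.

    attn_probs: (batch, heads, image_tokens, text_tokens) attention
    probabilities from a cross-attention layer."""
    attn = attn_probs.clone()
    attn[..., token_index] = attn[..., token_index] * weight
    return attn / attn.sum(dim=-1, keepdim=True)
```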
High-quality, temporally coherent artistic portrait videos with flexible style controls.
Stable Diffusion fine-tuning (for specific styles or domains).
Image Variations. Demo, with links to code. Uses the CLIP image embeddings as conditioning for the generation, instead of the text embeddings. This requires fine-tuning the model because, as far as I understand, the text and image embeddings are not aligned in CLIP's embedding space. CLOOB doesn't have this limitation, but I heard from Boris Dayma (relaying a conversation with Katherine Crowson) that training a diffusion model with CLOOB conditioning instead of CLIP produced less variety in the results.
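A usage sketch, assuming the `diffusers` library and Lambda Labs' fine-tuned checkpoint (the input filename is made up):

```python
import torch
from diffusers import StableDiffusionImageVariationPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImageVariationPipeline.from_pretrained(
    "lambdalabs/sd-image-variations-diffusers", torch_dtype=torch.float16
).to("cuda")
init_image = load_image("input.jpg")
# CLIP image embeddings of `init_image` replace the usual text embeddings.
variations = pipe(init_image, guidance_scale=3.0).images
```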
Image-to-image generation. Demo: sketch -> image.
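A sketch of how the demo presumably works under the hood, assuming `diffusers` and a Stable Diffusion v1.5 checkpoint (the prompt and filename are made up):

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
sketch = load_image("sketch.png").resize((512, 512))
# `strength` controls how much noise is added to the init image:
# lower values stay closer to the sketch's structure.
image = pipe("a detailed oil painting of a castle",
             image=sketch, strength=0.75).images[0]
```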