An open-source implementation of "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", a multi-modal AI that uses a single decoder to generate both text and images.
CM3Leon is a transformer-based autoregressive model designed for multi-modal tasks, specifically text and image generation. The model is trained in two stages, using a large, diverse multimodal dataset and retrieval-augmented pretraining. It also implements contrastive decoding to enhance the quality of generated samples.
```bash
pip3 install cm3
```
To get started with CM3Leon in a PyTorch environment:

```python
import torch
from cm3.model import CM3

# Example inputs: a random image and random caption token ids
img = torch.randn(1, 3, 256, 256)             # (batch, channels, height, width)
caption = torch.randint(0, 20000, (1, 1024))  # (batch, seq_len) token ids

model = CM3()
output = model(img, caption)
print(output.shape)  # (1, 1024, 20000): per-position logits over the vocabulary
```
This repository hosts the open-source implementation of CM3Leon, a state-of-the-art autoregressive multi-modal model for text and image generation. The model is introduced in the paper "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning".
Key Features of CM3Leon:
CM3Leon sets a new benchmark in text-to-image generation, outperforming comparable models while requiring 5x less training compute.
The following sections provide a detailed analysis of the model architecture, the necessary resources, and the steps needed to replicate the CM3Leon model.
Replicating CM3Leon involves several critical components and requires proficiency in the following areas:
The CM3Leon implementation comprises:
Implementing these components involves challenges such as efficient utilization of large compute clusters, minimizing data loading and preprocessing bottlenecks, optimizing memory usage during training and inference, and ensuring low latency serving.
The architecture of CM3Leon includes:
A special `<break>` token to indicate modality transitions. Model sizes range from 350M to 7B parameters.
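As a minimal sketch of this layout, the snippet below interleaves text and image tokens around a break token; all token ids here are made up for illustration and are not CM3Leon's actual vocabulary:

```python
import torch

# Hypothetical special-token id (placeholder, not CM3Leon's real vocabulary)
BREAK_TOKEN = 3  # marks a transition between modalities

text_tokens = torch.tensor([101, 7592, 2088])          # tokenized caption (made-up ids)
image_tokens = torch.tensor([5001, 5002, 5003, 5004])  # tokenized image (made-up ids)

# A single decoder-only stream: caption tokens, a break marking the
# modality transition, then the image tokens.
sequence = torch.cat([text_tokens, torch.tensor([BREAK_TOKEN]), image_tokens])
print(sequence)  # tensor([ 101, 7592, 2088,    3, 5001, 5002, 5003, 5004])
```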
Here is a table of the datasets used in the paper, along with additional metadata and source links:
Dataset | Domain | Size | Source |
---|---|---|---|
Shutterstock | Images and captions | 3 billion text tokens, licensed image data | Proprietary dataset, described in the paper |
MS-COCO | Image captioning | 591K image-caption pairs | Microsoft COCO Captions |
Flickr30k | Image captioning | 144K image-caption pairs | Flickr30k Entities |
Image Paragraph | Dense image captioning | 14K images with paragraph captions | Image Paragraph dataset |
Localized Narratives | Image paragraph captioning | 164K images with localized narratives | Localized Narratives |
VQA2 | Visual question answering | 1.3M images with question-answer pairs | VQA2 dataset |
VizWiz | Visual question answering for blind users | 92K images with question-answer pairs | VizWiz dataset |
OKVQA | Knowledge-based VQA | 26K images with question-answer pairs | OK-VQA dataset |
ScienceQA | Scientific visual QA | 6K images with multi-choice QA pairs | ScienceQA |
The model was trained and evaluated on several datasets, including MS-COCO [...] (Chen et al., 2015) and Flickr30k [...] (Young et al., 2014), among others.
For successful implementation, CM3Leon requires:
CM3Leon's training process involves:
For efficient inference, consider:
Pretraining hyperparameters:

| Model | # Layers | d_model | Seq Length | Batch Size (tokens) | LR | Warm-up Steps | # GPUs | # Tokens |
|---|---|---|---|---|---|---|---|---|
| 350M | 24 | 1024 | 4096 | 8M | 6e-04 | 1500 | 256 | 1.4T |
| 760M | 24 | 1536 | 4096 | 8M | 5e-04 | 1500 | 256 | 1.9T |
| 7B | 32 | 4096 | 4096 | 8M | 1.2e-04 | 1500 | 512 | 2.4T |
Fine-tuning hyperparameters:

| Model | # GPUs | Seq Length | Batch Size (tokens) | LR | Warm-up Steps | # Tokens |
|---|---|---|---|---|---|---|
| CM3Leon-760m | 64 | 4096 | 2M | 5e-05 | 150 | 30B |
| CM3Leon-7b | 128 | 4096 | 2M | 5e-05 | 150 | 30B |
Conditional text + image generation with the CM3 objective function and contrastive top-k decoding.
Multi-modal models need to be dynamic: they can't just generate the types of data they were trained on. They need to adapt to user needs, so multi-modal models should be conditional: when prompted, the model generates text and/or images. This is the future.
This repository welcomes contributions. Feel free to submit pull requests, create issues, or suggest any enhancements.
If you encounter any issues or need further clarification, please create an issue in the GitHub issue tracker.
CM3Leon is open-sourced under the MIT license.
Implement the CM3 objective function, where multi-modal inputs are transformed into an infilling instance by masking specific spans and relocating them to the end of the sequence.
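A minimal sketch of that transformation, simplified to a single randomly chosen span; the `MASK_TOKEN` sentinel id is a placeholder, not the repository's real vocabulary:

```python
import random

MASK_TOKEN = 4  # hypothetical sentinel id marking a masked span

def to_infilling_instance(tokens: list[int]) -> list[int]:
    """Mask one random span and relocate it to the end of the sequence,
    in the spirit of the CM3 objective (simplified to a single span)."""
    n = len(tokens)
    start = random.randrange(0, n - 1)
    end = random.randrange(start + 1, n)
    span = tokens[start:end]
    # Replace the span with a mask sentinel, then append the sentinel
    # followed by the original span so the model learns to infill it.
    return tokens[:start] + [MASK_TOKEN] + tokens[end:] + [MASK_TOKEN] + span

example = [10, 11, 12, 13, 14, 15]
print(to_infilling_instance(example))
# e.g. [10, 11, 4, 14, 15, 4, 12, 13]
```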
Implement the next-token prediction loss, -log p(x_input).
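Concretely, this is standard cross-entropy over tokens shifted by one position; a sketch assuming logits of shape (batch, seq_len, vocab):

```python
import torch.nn.functional as F

def next_token_loss(logits, tokens):
    """-log p(x_input): cross-entropy between the model's predictions
    and the sequence shifted by one position."""
    # logits: (batch, seq_len, vocab); tokens: (batch, seq_len)
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = tokens[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```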
Implement top-p (nucleus) sampling.
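Top-p sampling keeps the smallest set of tokens whose cumulative probability exceeds p and samples from the renormalized distribution; a standard PyTorch sketch:

```python
import torch

def top_p_sample(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Sample one token id from the smallest set of tokens whose
    cumulative probability mass exceeds p. logits: (vocab,)."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens whose exclusive cumulative mass already reaches p
    # (the top-1 token is always kept).
    cutoff = cumulative - sorted_probs >= p
    sorted_probs[cutoff] = 0.0
    sorted_probs /= sorted_probs.sum()
    idx = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_ids[idx]
```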
Implement classifier-free guidance (CFG): directing an unconditional sample towards a conditional sample. For unconditional sampling, the input text is replaced with the mask token from the CM3 objective, so that during inference two concurrent token streams are generated: a conditional stream, contingent on the input text, and an unconditional stream, conditioned on a mask token, where
logits.cond = T(t_y | t_x),  logits.uncond = T(t_y | <mask>)
logits.cf = logits.uncond + α_c · (logits.cond − logits.uncond)

where:
- T = the transformer
- t_y = output tokens
- t_x = conditional input text
- <mask> = no input text, replaced with a mask token
- α_c = scaling factor
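A sketch of how the two streams could be combined at each decoding step; `model_logits` is a hypothetical interface returning next-token logits, not this repository's actual API:

```python
def cfg_logits(model_logits, cond_tokens, mask_tokens, generated, alpha_c=3.0):
    """Classifier-free guidance: run a conditional and an unconditional
    stream and push the unconditional logits toward the conditional ones.

    model_logits(context, generated) -> (vocab,) next-token logits
    (a hypothetical interface; adapt to the real model's forward pass).
    """
    logits_cond = model_logits(cond_tokens, generated)    # T(t_y | t_x)
    logits_uncond = model_logits(mask_tokens, generated)  # T(t_y | <mask>)
    # logits.cf = logits.uncond + α_c * (logits.cond - logits.uncond)
    return logits_uncond + alpha_c * (logits_cond - logits_uncond)
```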
Implement contrastive decoding TopK, which restricts sampling to the plausibility set:

V(t_{y<i}) = { t_{y_i} ∈ V : p_exp(t_{y_i} | t_{y<i}) ≥ α · max_w p_exp(w | t_{y<i}) }
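In code, this plausibility constraint reduces to masking out tokens whose probability under the expert model falls below an α-scaled fraction of the most probable token's; a sketch over a single step's logits:

```python
import torch

def plausible_token_mask(expert_logits: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Boolean mask of the plausibility set V(t_{y<i}): keeps tokens with
    p_exp(t | t_{y<i}) >= alpha * max_w p_exp(w | t_{y<i})."""
    probs = torch.softmax(expert_logits, dim=-1)
    return probs >= alpha * probs.max()
```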