Lumina-T2X is a model for Text to Any Modality Generation
For details on training and inference, please refer to the Lumina-T2I README.md.
We introduce the $\textbf{Lumina-T2X}$ family, a series of text-conditioned Diffusion Transformers (DiT) designed to convert noise into images, videos, multi-view images of 3D objects, and speech based on textual instructions. At the core of Lumina-T2X lies the Flow-based Large Diffusion Transformer (Flag-DiT), which supports scaling up to 7 billion parameters and extending sequence lengths up to 128,000 tokens. Inspired by Sora, Lumina-T2X integrates images, videos, multi-views of 3D objects, and speech spectrograms within a spatial-temporal latent token space.
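To make "flow-based" and "convert noise into images" more concrete, here is a minimal sketch of one common formulation: a straight-line path between a noise sample and a data sample, with the network trained to regress the path's velocity. This is an assumption for illustration; the text above does not spell out the exact objective Flag-DiT uses, and the function names `interpolate`, `target_velocity`, and `euler_sample` are hypothetical, not part of the Lumina-T2X codebase.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Constant velocity of the straight path (assumed regression target)."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy usage: with the exact velocity, sampling moves the noise onto the data.
noise = [0.0, 1.0]
data = [2.0, -1.0]
v = target_velocity(noise, data)
sample = euler_sample(noise, lambda x, t: v, num_steps=4)
```

In a real model, `velocity_fn` would be the trained transformer conditioned on the text prompt; the lambda here only stands in for it.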
$\textbf{Lumina-T2X}$ allows for the generation of outputs in any resolution, aspect ratio, and duration, facilitated by learnable `nextline` and `nextframe` tokens.
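The role of the learnable row and frame separator tokens can be sketched as follows. This is an illustrative assumption about the layout (one separator after each latent row, one after each frame), using hypothetical string placeholders where the model would use learnable embeddings; `flatten_latents` and `sequence_length` are names invented for this sketch.

```python
# Hypothetical placeholders; in the model these would be learnable embeddings.
NEXTLINE = "<nextline>"
NEXTFRAME = "<nextframe>"


def flatten_latents(frames):
    """Flatten a [frame][row][col] nested list of latent tokens into one sequence."""
    sequence = []
    for frame in frames:
        for row in frame:
            sequence.extend(row)
            sequence.append(NEXTLINE)   # marks the end of a latent row
        sequence.append(NEXTFRAME)      # marks the end of a frame
    return sequence


def sequence_length(num_frames, height, width):
    """Token count under the assumed layout, separators included."""
    return num_frames * (height * (width + 1) + 1)


# A single frame with 2 rows and 3 columns of latent tokens.
image = [[["a", "b", "c"], ["d", "e", "f"]]]
tokens = flatten_latents(image)
```

Because the separators encode where rows and frames end, the same 1D sequence format can describe any height, width, or frame count without a fixed grid size.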
Furthermore, training $\textbf{Lumina-T2X}$ is computationally efficient. The largest model, with 5 billion parameters, requires only 20% of the training time needed for Pixart-alpha, which has 600 million parameters.
Features:
With the introduction of the `nextline` and `nextframe` tokens, our model supports resolution extrapolation: it can generate outputs at out-of-domain resolutions that were unseen during training.
720P Videos:
Prompt: The majestic beauty of a waterfall cascading down a cliff into a serene lake.
https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/17187de8-7a07-49a8-92f9-fdb8e2f5e64c
https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/0a20bb39-f6f7-430f-aaa0-7193a71b256a
Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/7bf9ce7e-f454-4430-babe-b14264e0f194
360P Videos:
https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/d7fec32c-3655-4fd1-aa14-c0cb3ace3845
For more demos visit this website
We support diverse configurations, including text encoders, DiTs of different parameter sizes, inference methods, and VAE encoders. Additionally, we offer features such as 1D-RoPE, image enhancement, and more.
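For readers unfamiliar with 1D-RoPE, the following is a minimal sketch of the standard rotary position embedding applied along a single sequence axis. It is not taken from the Lumina-T2X code; `rope_1d` and `dot` are hypothetical helper names, and the base of 10000 is the commonly used default rather than a value confirmed by this README.

```python
import math


def rope_1d(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `position`."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector dimension must be even")
    out = []
    for i in range(0, dim, 2):
        # Pair i//2 uses frequency base^(-i/dim); lower pairs rotate faster.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Attention scores depend only on the relative offset between positions.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
score_near = dot(rope_1d(q, 5), rope_1d(k, 3))
score_far = dot(rope_1d(q, 12), rope_1d(k, 10))
```

Both scores use an offset of 2 between query and key positions, so they agree up to floating-point error. This dependence on relative position only is a property of rotary embeddings.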