Lumina-T2X

Lumina-T2X is a model for text-to-any-modality generation.


$\textbf{Lumina-T2X}$: Transform Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformer



📰 News

  • [2024-04-25] 🔥🔥🔥 Support 720p video generation with arbitrary resolution. Demo 🚀🚀🚀
  • [2024-04-19] 🔥🔥🔥 Demo released.
  • [2024-04-05] 😆😆😆 Code released for Lumina-T2I.
  • [2024-04-01] 🚀🚀🚀 We release the initial version of Lumina-T2I for text-to-image generation.

🚀 Quick Start

For more details about training and inference, please refer to the Lumina-T2I README.md

📑 Open-source Plan

  • Lumina-T2I (Training, Inference)
  • Lumina-T2V
  • Web Demo
  • CLI Demo


Introduction

We introduce the $\textbf{Lumina-T2X}$ family, a series of text-conditioned Diffusion Transformers (DiT) designed to transform noise into images, videos, multi-view images of 3D objects, and speech, guided by textual instructions. At the core of Lumina-T2X lies the Flow-based Large Diffusion Transformer (Flag-DiT), which scales up to 7 billion parameters and sequence lengths of up to 128,000 tokens. Inspired by Sora, Lumina-T2X integrates images, videos, multi-views of 3D objects, and speech spectrograms within a spatial-temporal latent token space.

$\textbf{Lumina-T2X}$ allows for the generation of outputs at any resolution, aspect ratio, and duration, facilitated by learnable nextline and nextframe tokens.
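To make this mechanism concrete, below is a minimal sketch of flattening a spatial-temporal latent grid into a 1-D token sequence with learnable placeholder tokens. The function name, tensor layout, and token handling are illustrative assumptions, not the released Lumina-T2X code.

```python
import torch

def flatten_latent(latent, nextline_tok, nextframe_tok):
    """Flatten a (frames, height, width, dim) latent grid into a 1-D token
    sequence, appending a learnable nextline token after each row and a
    nextframe token after each frame.

    Illustrative sketch only; names and layout are assumptions, not the
    released Lumina-T2X code."""
    f, h, w, d = latent.shape
    seq = []
    for frame in latent:                         # (h, w, d) per frame
        for row in frame:                        # (w, d) tokens per row
            seq.append(row)
            seq.append(nextline_tok.view(1, d))  # mark end of line
        seq.append(nextframe_tok.view(1, d))     # mark end of frame
    return torch.cat(seq, dim=0)                 # (f * (h*w + h + 1), d)

# Usage: a 2-frame, 4x4 latent with 16-dim tokens
latent = torch.randn(2, 4, 4, 16)
nextline = torch.nn.Parameter(torch.randn(16))
nextframe = torch.nn.Parameter(torch.randn(16))
print(flatten_latent(latent, nextline, nextframe).shape)  # torch.Size([42, 16])
```

Because resolution and duration are encoded by where the placeholder tokens fall rather than by a fixed grid shape, the same sequence format covers any aspect ratio or frame count.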

Furthermore, training $\textbf{Lumina-T2X}$ is computationally efficient. The largest model, with 5 billion parameters, requires only 20% of the training time needed for PixArt-$\alpha$, which has 600 million parameters.

🌟 Features:

  • Flow-based Large Diffusion Transformer (Flag-DiT): Lumina-T2X is trained with the flow matching objective. To improve training stability and model scalability, we incorporate techniques such as RoPE, RMSNorm, and KQ-norm, which yield faster training convergence, stable training dynamics, and a simplified pipeline (see the training-step sketch after this list).
  • Any Modalities, Res., and Duration within One Framework:
    1. Lumina-T2X tokenizes images, videos, multi-views of 3D objects, and spectrograms into one-dimensional sequences.
    2. Lumina-T2X can naturally encode any modality, regardless of resolution, aspect ratio, or temporal duration, into a unified 1-D token sequence, akin to LLMs. During inference, Flag-DiT with text conditioning iteratively transforms noise into outputs across any modality, resolution, and duration.
    3. Thanks to the nextline and nextframe tokens, our model supports resolution extrapolation, i.e., generating out-of-domain resolutions unseen during training.
  • Low Training Resources: increasing the token length in transformers lengthens each iteration but reduces the total number of iterations needed, lowering overall training cost. Moreover, our Lumina-T2X model can generate high-resolution images and coherent videos with minimal computational demands. Remarkably, the default Lumina-T2I configuration, equipped with a 5B Flag-DiT and a 7B LLaMA text encoder, requires only 20% of the computational resources needed by PixArt-$\alpha$.
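As referenced in the Flag-DiT item above, the sketch below shows one training step of a flow-matching objective in its common linear-interpolation (rectified-flow) form. The `model(xt, t, text_cond)` signature is an assumed placeholder for the Flag-DiT forward pass; this is a generic sketch, not the exact Lumina-T2X implementation.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, text_cond):
    """One training step of a flow-matching objective (linear-interpolation
    / rectified-flow form). `model` is assumed to predict the velocity
    field given noisy tokens, a timestep, and text conditioning; this is a
    generic sketch, not the exact Lumina-T2X implementation."""
    x0 = torch.randn_like(x1)                      # pure-noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over token dims
    xt = (1.0 - t_) * x0 + t_ * x1                 # point on the straight path
    target_v = x1 - x0                             # constant target velocity
    pred_v = model(xt, t, text_cond)               # velocity prediction
    return F.mse_loss(pred_v, target_v)            # regress predicted velocity
```

Regressing a straight-line velocity field is one reason flow matching tends to give stable dynamics and a simpler pipeline than score-based diffusion parameterizations.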


πŸ“½οΈ Demos

Text-to-Image Generation


Text-to-Video Generation

720P Videos:

Prompt: The majestic beauty of a waterfall cascading down a cliff into a serene lake.

https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/17187de8-7a07-49a8-92f9-fdb8e2f5e64c

https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/0a20bb39-f6f7-430f-aaa0-7193a71b256a

Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/7bf9ce7e-f454-4430-babe-b14264e0f194

360P Videos:

https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/d7fec32c-3655-4fd1-aa14-c0cb3ace3845

Text-to-Multiview Generation

More demos

For more demos, visit this website

βš™οΈ Diverse Configurations

We support diverse configurations, including text encoders, DiTs of different parameter sizes, inference methods, and VAE encoders. Additionally, we offer features such as 1D-RoPE, image enhancement, and more.
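For illustration only, a hypothetical configuration might expose these axes; the keys and values below are assumptions for readability, not the actual Lumina-T2X config schema.

```python
# Hypothetical configuration sketch; keys and values are illustrative
# assumptions, not the actual Lumina-T2X config schema.
config = {
    "text_encoder": "llama-7b",                   # e.g. a LLaMA-based text encoder
    "dit": {"arch": "flag-dit", "params": "5B"},  # Flag-DiT parameter size
    "vae": "sd-vae",                              # VAE used for the latent space
    "sampler": "euler-ode",                       # inference/solver method
    "resolution": (1024, 1024),                   # target output resolution
}
```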

