Lumina-T2X

Lumina-T2X is a model for text-to-any-modality generation.


$\textbf{Lumina-T2X}$: Transform Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformer



📰 News

  • [2024-04-25] 🔥🔥🔥 Support 720p video generation with arbitrary resolution. Demo 🚀🚀🚀
  • [2024-04-19] 🔥🔥🔥 Demo released.
  • [2024-04-05] 😆😆😆 Code released for Lumina-T2I.
  • [2024-04-01] 🚀🚀🚀 We release the initial version of Lumina-T2I for text-to-image generation.

🚀 Quick Start

For more details about training and inference, please refer to the Lumina-T2I README.md

📑 Open-source Plan

  • Lumina-T2I (Training, Inference)
  • Lumina-T2V
  • Web Demo
  • CLI Demo


Introduction

We introduce the $\textbf{Lumina-T2X}$ family, a series of text-conditioned Diffusion Transformers (DiT) designed to transform noise into images, videos, multi-view images of 3D objects, and speech, guided by textual instructions. At the core of Lumina-T2X lies the Flow-based Large Diffusion Transformer (Flag-DiT), which scales up to 7 billion parameters and sequence lengths of up to 128,000 tokens. Inspired by Sora, Lumina-T2X integrates images, videos, multi-views of 3D objects, and speech spectrograms within a spatial-temporal latent token space.

$\textbf{Lumina-T2X}$ allows for the generation of outputs at any resolution, aspect ratio, and duration, facilitated by learnable nextline and nextframe tokens.
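To make this mechanism concrete, below is a minimal sketch of flattening a spatial-temporal latent grid into a 1-D token sequence with learnable placeholder tokens. The function name, tensor layout, and token handling are illustrative assumptions, not the released Lumina-T2X code.

```python
import torch

def flatten_latent(latent, nextline_tok, nextframe_tok):
    """Flatten a (frames, height, width, dim) latent grid into a 1-D token
    sequence, appending a learnable nextline token after each row and a
    nextframe token after each frame.

    Illustrative sketch only; names and layout are assumptions, not the
    released Lumina-T2X code."""
    f, h, w, d = latent.shape
    seq = []
    for frame in latent:                         # (h, w, d) per frame
        for row in frame:                        # (w, d) tokens per row
            seq.append(row)
            seq.append(nextline_tok.view(1, d))  # mark end of line
        seq.append(nextframe_tok.view(1, d))     # mark end of frame
    return torch.cat(seq, dim=0)                 # (f * (h*w + h + 1), d)

# Usage: a 2-frame, 4x4 latent with 16-dim tokens
latent = torch.randn(2, 4, 4, 16)
nextline = torch.nn.Parameter(torch.randn(16))
nextframe = torch.nn.Parameter(torch.randn(16))
print(flatten_latent(latent, nextline, nextframe).shape)  # torch.Size([42, 16])
```

Because resolution and duration are encoded by where the placeholder tokens fall rather than by a fixed grid shape, the same sequence format covers any aspect ratio or frame count.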

Furthermore, training $\textbf{Lumina-T2X}$ is computationally efficient. The largest model, with 5 billion parameters, requires only 20% of the training time needed for PixArt-$\alpha$, which has 600 million parameters.

🌟 Features:

  • Flow-based Large Diffusion Transformer (Flag-DiT): Lumina-T2X is trained with the flow matching objective. To improve training stability and model scalability, we incorporate techniques such as RoPE, RMSNorm, and KQ-norm, which yield faster training convergence, stable training dynamics, and a simplified pipeline (see the training-step sketch after this list).
  • Any Modalities, Res., and Duration within One Framework:
    1. Lumina-T2X tokenizes images, videos, multi-views of 3D objects, and spectrograms into one-dimensional sequences.
    2. Lumina-T2X can naturally encode any modality, regardless of resolution, aspect ratio, or temporal duration, into a unified 1-D token sequence, akin to LLMs. During inference, Flag-DiT with text conditioning iteratively transforms noise into outputs across any modality, resolution, and duration.
    3. Thanks to the nextline and nextframe tokens, our model supports resolution extrapolation, i.e., generating out-of-domain resolutions unseen during training.
  • Low Training Resources: increasing the token length in transformers lengthens each iteration but reduces the total number of iterations needed, lowering overall training cost. Moreover, our Lumina-T2X model can generate high-resolution images and coherent videos with minimal computational demands. Remarkably, the default Lumina-T2I configuration, equipped with a 5B Flag-DiT and a 7B LLaMA text encoder, requires only 20% of the computational resources needed by PixArt-$\alpha$.
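As referenced in the Flag-DiT item above, the sketch below shows one training step of a flow-matching objective in its common linear-interpolation (rectified-flow) form. The `model(xt, t, text_cond)` signature is an assumed placeholder for the Flag-DiT forward pass; this is a generic sketch, not the exact Lumina-T2X implementation.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, text_cond):
    """One training step of a flow-matching objective (linear-interpolation
    / rectified-flow form). `model` is assumed to predict the velocity
    field given noisy tokens, a timestep, and text conditioning; this is a
    generic sketch, not the exact Lumina-T2X implementation."""
    x0 = torch.randn_like(x1)                      # pure-noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over token dims
    xt = (1.0 - t_) * x0 + t_ * x1                 # point on the straight path
    target_v = x1 - x0                             # constant target velocity
    pred_v = model(xt, t, text_cond)               # velocity prediction
    return F.mse_loss(pred_v, target_v)            # regress predicted velocity
```

Regressing a straight-line velocity field is one reason flow matching tends to give stable dynamics and a simpler pipeline than score-based diffusion parameterizations.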


πŸ“½οΈ Demos

Text-to-Image Generation


Text-to-Video Generation

720P Videos:

Prompt: The majestic beauty of a waterfall cascading down a cliff into a serene lake.

https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/17187de8-7a07-49a8-92f9-fdb8e2f5e64c

https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/0a20bb39-f6f7-430f-aaa0-7193a71b256a

Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/7bf9ce7e-f454-4430-babe-b14264e0f194

360P Videos:

https://github.com/Alpha-VLLM/Lumina-T2X/assets/54879512/d7fec32c-3655-4fd1-aa14-c0cb3ace3845

Text-to-Multiview Generation

More demos

For more demos, visit this website

βš™οΈ Diverse Configurations

We support diverse configurations, including text encoders, DiTs of different parameter sizes, inference methods, and VAE encoders. Additionally, we offer features such as 1D-RoPE, image enhancement, and more.
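For illustration only, a hypothetical configuration might expose these axes; the keys and values below are assumptions for readability, not the actual Lumina-T2X config schema.

```python
# Hypothetical configuration sketch; keys and values are illustrative
# assumptions, not the actual Lumina-T2X config schema.
config = {
    "text_encoder": "llama-7b",                   # e.g. a LLaMA-based text encoder
    "dit": {"arch": "flag-dit", "params": "5B"},  # Flag-DiT parameter size
    "vae": "sd-vae",                              # VAE used for the latent space
    "sampler": "euler-ode",                       # inference/solver method
    "resolution": (1024, 1024),                   # target output resolution
}
```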

