Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
Chenyang Lyu¹ ², Minghao Wu³, Longyue Wang¹ *, Xinting Huang¹,
Bingshuai Liu¹, Zefeng Du¹, Shuming Shi¹, Zhaopeng Tu¹
¹ Tencent AI Lab, ² Dublin City University, ³ Monash University
*Longyue Wang is the corresponding author: [email protected]
Macaw-LLM is an exploratory endeavor that pioneers multi-modal language modeling by seamlessly combining image🖼️, video📹, audio🎵, and text📝 data, built upon the foundations of CLIP, Whisper, and LLaMA.
In recent years, the field of language modeling has witnessed remarkable advancements. However, integrating multiple modalities, such as images, videos, audio, and text, has remained a challenging task. Macaw-LLM addresses this challenge by bringing together state-of-the-art models for processing visual, auditory, and textual information, namely CLIP, Whisper, and LLaMA.
Macaw-LLM boasts several unique features, most notably a simple and fast alignment strategy that bridges multi-modal features to textual features, described below.
Macaw-LLM is composed of three main components: CLIP, which encodes images and video frames; Whisper, which encodes audio; and LLaMA, which encodes the textual instruction and generates the response.
The integration of these models allows Macaw-LLM to process and analyze multi-modal data effectively.
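For orientation, here is a minimal, unofficial sketch of how these three building blocks could be wired together with the Hugging Face transformers library; the checkpoint names, class name, and module layout are illustrative assumptions, not the authors' released implementation.
import torch
from transformers import CLIPVisionModel, WhisperModel, LlamaForCausalLM

class MacawBackbone(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Visual encoder: CLIP handles images and sampled video frames (checkpoint name is a placeholder).
        self.visual_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        # Audio encoder: Whisper's encoder turns log-mel spectrograms into audio features.
        self.audio_encoder = WhisperModel.from_pretrained("openai/whisper-base").get_encoder()
        # Language model: LLaMA consumes the aligned multi-modal features together with the text prompt.
        self.llm = LlamaForCausalLM.from_pretrained("path/to/llama")  # hypothetical local path

    @torch.no_grad()
    def encode_modalities(self, pixel_values, audio_features):
        # pixel_values: (batch, 3, 224, 224) preprocessed images or video frames
        # audio_features: (batch, 80, 3000) log-mel spectrograms from the Whisper feature extractor
        visual = self.visual_encoder(pixel_values=pixel_values).last_hidden_state
        audio = self.audio_encoder(input_features=audio_features).last_hidden_state
        return visual, audio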
Our novel alignment strategy enables faster adaptation by efficiently bridging multi-modal features to textual features: each modality is first encoded by its dedicated encoder (CLIP or Whisper), the encoded features are aligned with LLaMA's textual embedding space, and the aligned features are then fed to LLaMA alongside the textual instruction.
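The sketch below illustrates one plausible reading of this alignment step, assuming a cross-attention layer in which projected modality features act as queries over the LLM's frozen token-embedding matrix; the class name, dimensions, and design details are assumptions for illustration rather than the released implementation.
import torch
import torch.nn as nn

class ModalityAligner(nn.Module):
    """Illustrative sketch (assumed design): map encoder features into the LLM's
    textual embedding space by attending over the LLM's token-embedding matrix."""

    def __init__(self, feature_dim: int, llm_embed: nn.Embedding, num_heads: int = 8):
        super().__init__()
        llm_dim = llm_embed.embedding_dim
        self.proj = nn.Linear(feature_dim, llm_dim)  # bring modality features to the LLM width
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        # The LLM's (frozen) token embeddings act as keys and values.
        self.register_buffer("token_embeds", llm_embed.weight.detach())

    def forward(self, modality_features: torch.Tensor) -> torch.Tensor:
        # modality_features: (batch, seq_len, feature_dim) from CLIP or Whisper
        query = self.proj(modality_features)
        keys_values = self.token_embeds.unsqueeze(0).expand(query.size(0), -1, -1)
        aligned, _ = self.attn(query, keys_values, keys_values)
        # `aligned` can be prepended to the text embeddings fed into LLaMA.
        return aligned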
To install Macaw-LLM, follow these steps:
# Clone the repository
git clone https://github.com/lyuchenyang/Macaw-LLM.git
# Change to the Macaw-LLM directory
cd Macaw-LLM
# Install required packages
pip install -r requirements.txt
# Install ffmpeg
yum install ffmpeg -y
# Install apex
git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install
cd ..
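If you want to verify the setup before moving on, the following optional Python check (not part of the repository) confirms that ffmpeg, CUDA, and apex are visible from your environment.
import shutil
import torch

# ffmpeg must be on PATH for video/audio extraction during preprocessing.
assert shutil.which("ffmpeg") is not None, "ffmpeg not found on PATH"
print("CUDA available:", torch.cuda.is_available())

try:
    import apex  # noqa: F401
    print("apex installed")
except ImportError:
    print("apex missing - rerun 'python setup.py install' inside the apex folder")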
Downloading dataset:
Dataset preprocessing:
Place the data for the three modalities into the corresponding folders: data/text/, data/image/, and data/video/, then run the preprocessing scripts below (an optional layout-check sketch follows them).
python preprocess_data.py
python preprocess_data_supervised.py
python preprocess_data_unsupervised.py
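As referenced above, here is an optional, illustrative helper (not part of the repository) that checks the expected folder layout and then runs the three preprocessing scripts in order.
import pathlib
import subprocess

# Verify the three modality folders exist before preprocessing.
for folder in ("data/text", "data/image", "data/video"):
    assert pathlib.Path(folder).is_dir(), f"missing folder: {folder}"

# Run the preprocessing scripts listed above, stopping on the first failure.
for script in ("preprocess_data.py",
               "preprocess_data_supervised.py",
               "preprocess_data_unsupervised.py"):
    subprocess.run(["python", script], check=True)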
Training:
./train.sh
Inference:
./inference.sh
We present several examples that highlight Macaw-LLM's proficiency in understanding and following multi-modal instructions. They show the system comprehending visual content from images and videos and producing fluent, contextually relevant, and informative answers to a variety of questions in natural-language conversation.
While our model is still in its early stages, we believe that Macaw-LLM paves the way for future research in the realm of multi-modal language modeling. The integration of diverse data modalities holds immense potential for pushing the boundaries of artificial intelligence and enhancing our understanding of complex real-world scenarios. By introducing Macaw-LLM, we hope to inspire further exploration and innovation in this exciting area of study.
We welcome contributions from the community to improve and expand Macaw-LLM's capabilities. 🤝
Evaluation: We show some examples showcasing the multi-modal ability of our Macaw-LLM. However, we acknowledge that these examples may not be adequate to accurately and comprehensively demonstrate the model's capabilities, and we plan to conduct a more extensive evaluation of our system.
More Language Models: We aim to extend Macaw-LLM by incorporating additional language models such as Dolly, BLOOM, and T5, enabling more robust and versatile processing and understanding of multi-modal data.
Multilingual Support: Our next step is to support multiple languages, moving towards true multi-modal and multilingual language models. We believe this will significantly broaden Macaw-LLM's applicability and enhance its understanding of diverse, global contexts.
We would like to express our gratitude to the open-source projects on which Macaw-LLM is built, including CLIP, Whisper, and LLaMA, for their valuable contributions.
We would also like to thank the developers and maintainers of these projects for their dedication and hard work in making their projects open-source and accessible to the community.
@article{lyu2023macaw,
title={Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration},
author={Lyu, Chenyang and Wu, Minghao and Wang, Longyue and Huang, Xinting and Liu, Bingshuai and Du, Zefeng and Shi, Shuming and Tu, Zhaopeng},
journal={arXiv preprint arXiv:2306.09093},
year={2023}
}