Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model

Demo | Website and Examples | Paper | Dataset (MuVi-Sync)

This repository contains the code and dataset accompanying the paper "Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model" by Dr. Jaeyong Kang, Prof. Soujanya Poria, and Prof. Dorien Herremans.

🔥 Live demo available on HuggingFace and Replicate.

Introduction

We propose a novel AI-powered multimodal music generation framework called Video2Music. This framework uniquely uses video features as conditioning input to generate matching music using a Transformer architecture. By employing cutting-edge technology, our system aims to provide video creators with a seamless and efficient solution for generating tailor-made background music.

Change Log

  • 2023-11-28: Added a new input method (YouTube URL) to the HuggingFace demo

Quickstart Guide

Generate music from video:

import IPython
from video2music import Video2music

input_video = "input.mp4"

input_primer = "C Am F G"
input_key = "C major"

video2music = Video2music()
output_filename = video2music.generate(input_video, input_primer, input_key)

IPython.display.Video(output_filename)
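
The same Video2music instance can be reused to score several clips in one session. The sketch below assumes only the generate() call shown above; the clip names, primers, and keys are placeholders.

# Reuse one Video2music instance for multiple clips (placeholder inputs).
clips = [
    ("beach.mp4", "C Am F G", "C major"),
    ("city.mp4", "Am F C G", "A minor"),
]
for video, primer, key in clips:
    output = video2music.generate(video, primer, key)
    print(video, "->", output)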

Installation

This repo was developed with Python 3.8.

apt-get update
apt-get install ffmpeg
apt-get install fluidsynth
git clone https://github.com/AMAAI-Lab/Video2Music
cd Video2Music
pip install -r requirements.txt
  • Download the processed training data AMT.zip from HERE, extract it, and place the two extracted files directly under this folder (saved_models/AMT/)

  • Download the soundfont file default_sound_font.sf2 from HERE and place it directly under this folder (soundfonts/)

  • Our code is built on PyTorch 1.12.1 (torch==1.12.1 in requirements.txt), but you may need to install the torch build that matches your CUDA version; a quick environment check follows this list
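
If you are unsure which build you have, the snippet below is a quick sanity check (an illustrative sketch, not part of this repo) using only standard torch and os calls: it prints the installed torch/CUDA versions and confirms that the downloaded model and soundfont files are in the expected locations.

import os
import torch

# Installed torch version, the CUDA version it was built against, and
# whether a GPU is visible at runtime.
print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda,
      "| CUDA available:", torch.cuda.is_available())

# Assets downloaded in the steps above.
for path in ("saved_models/AMT", "soundfonts/default_sound_font.sf2"):
    print(path, "found" if os.path.exists(path) else "MISSING")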

Dataset

  • Obtain the MuVi-Sync dataset:

  • Place all dataset directories whose names start with vevo directly under this folder (dataset/); a sanity-check sketch follows this list
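
After copying, a quick check like the sketch below (illustrative, not part of the repo) lists the vevo directories under dataset/ that the loader in dataset/vevo_dataset.py presumably reads.

import glob

# List the vevo* feature directories that should now sit under dataset/.
vevo_dirs = sorted(glob.glob("dataset/vevo*"))
print(f"{len(vevo_dirs)} vevo directories found:")
for d in vevo_dirs:
    print(" ", d)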

Directory Structure

  • saved_models/: saved model files
  • utilities/
    • run_model_vevo.py: code for running the AMT model
    • run_model_regression.py: code for running the bi-GRU regression model
  • model/
    • video_music_transformer.py: Affective Multimodal Transformer (AMT) model
    • video_regression.py: Bi-GRU regression model used for predicting note density/loudness (see the illustrative sketch after this list)
    • positional_encoding.py: code for positional encoding
    • rpr.py: code for RPR (Relative Positional Representation)
  • dataset/
    • vevo_dataset.py: Dataset loader
  • script/: code for extracting video/music features (semantic, motion, emotion, scene offset, loudness, and note density)
  • train.py: training script (AMT)
  • train_regression.py: training script (bi-GRU)
  • evaluate.py: evaluation script
  • generate.py: inference script
  • video2music.py: Video2Music module that outputs a video with generated background music from an input video
  • demo.ipynb: Jupyter notebook for Quickstart Guide
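
For orientation, the sketch below shows what a bi-GRU regressor for per-frame note density and loudness can look like in PyTorch. It is an illustrative stand-in rather than the code in model/video_regression.py; the feature dimension, hidden size, and layer count are placeholder values.

import torch
import torch.nn as nn

class BiGRURegression(nn.Module):
    # Maps a sequence of per-frame video feature vectors to two regression
    # targets per frame: note density and loudness.
    def __init__(self, feature_dim=768, hidden_dim=256, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(
            input_size=feature_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
        )
        self.head = nn.Linear(2 * hidden_dim, 2)

    def forward(self, video_features):
        # video_features: (batch, time, feature_dim)
        out, _ = self.gru(video_features)   # (batch, time, 2 * hidden_dim)
        return self.head(out)               # (batch, time, 2)

model = BiGRURegression()
print(model(torch.randn(8, 300, 768)).shape)  # torch.Size([8, 300, 2])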

Training

python train.py

Inference

python generate.py

Subjective Evaluation by Listeners

Model             | Overall Music Quality ↑ | Music-Video Correspondence ↑ | Harmonic Matching ↑ | Rhythmic Matching ↑ | Loudness Matching ↑
Music Transformer | 3.4905                  | 2.7476                       | 2.6333              | 2.8476              | 3.1286
Video2Music       | 4.2095                  | 3.6667                       | 3.4143              | 3.8714              | 3.8143

TODO

  • Add other instruments (e.g., drum) for live demo

Citation

If you find this resource useful, please cite the original work:

@article{KANG2024123640,
  title = {Video2Music: Suitable music generation from videos using an Affective Multimodal Transformer model},
  author = {Jaeyong Kang and Soujanya Poria and Dorien Herremans},
  journal = {Expert Systems with Applications},
  pages = {123640},
  year = {2024},
  issn = {0957-4174},
  doi = {10.1016/j.eswa.2024.123640},
}

Kang, J., Poria, S. & Herremans, D. (2024). Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model, Expert Systems with Applications (in press).

Acknowledgements

Our code is based on Music Transformer.
