AudioCLIP

Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)

Text-embedding vocabulary and PyTorch state_dicts containing the weights of the AudioCLIP model trained on AudioSet:

bpe_simple_vocab_16e6.txt.gz – CLIP's vocabulary (origin)
CLIP.pt – vanilla CLIP (text Transformer & ResNet-50 image-head, origin)
ESRNXFBSP.pt – ESResNeXt trained on AudioSet (standalone)
AudioCLIP trained on AudioSet (+ video frames):
- AudioCLIP-Full-Training.pt – training of all three heads (text, image and audio)
- AudioCLIP-Partial-Training.pt – training of the audio-head only
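
A minimal sketch of how the downloaded files might be inspected before wiring them into a model. The helper names `read_bpe_merges` and `summarize_state_dict` are hypothetical and not part of the AudioCLIP code base. The sketch assumes that the vocabulary file follows the upstream CLIP layout (an optional `#version` header line followed by one space-separated merge pair per line) and that each `.pt` file is a plain state_dict that `torch.load` can read. If a file turns out to be stored differently, the second helper will need adjusting.

```python
import gzip
from pathlib import Path


def read_bpe_merges(path, limit=None):
    """Return BPE merge pairs from a gzipped vocabulary file (hypothetical helper)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        lines = handle.read().split("\n")
    # Assumption: the first line may be a "#version" header, as in upstream CLIP.
    if lines and lines[0].startswith("#version"):
        lines = lines[1:]
    # Keep only well-formed lines that contain exactly two tokens.
    merges = [tuple(line.split()) for line in lines if len(line.split()) == 2]
    return merges if limit is None else merges[:limit]


def summarize_state_dict(path):
    """Map each parameter name to its tensor shape (hypothetical helper)."""
    # Imported lazily so the vocabulary helper works without PyTorch installed.
    import torch

    # Assumption: the file holds a plain state_dict; load it onto the CPU.
    state_dict = torch.load(path, map_location="cpu")
    return {
        name: tuple(value.shape)
        for name, value in state_dict.items()
        if hasattr(value, "shape")
    }


if __name__ == "__main__":
    vocab = Path("bpe_simple_vocab_16e6.txt.gz")
    if vocab.exists():
        print(read_bpe_merges(vocab, limit=5))

    checkpoint = Path("AudioCLIP-Full-Training.pt")
    if checkpoint.exists():
        shapes = summarize_state_dict(checkpoint)
        for name, shape in list(shapes.items())[:5]:
            print(name, shape)
```

Listing parameter names and shapes first is a cheap way to check that a checkpoint matches the model definition before calling `load_state_dict` on it.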