AudioCLIP

Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)

v0.1

2 years ago

Text-embedding vocabulary and PyTorch state_dicts containing the weights of the AudioCLIP model trained on AudioSet:

  • bpe_simple_vocab_16e6.txt.gz – CLIP's vocabulary (origin)
  • CLIP.pt – vanilla CLIP (text Transformer & ResNet-50 image-head, origin)
  • ESRNXFBSP.pt – ESResNeXt trained on AudioSet (standalone)
  • AudioCLIP trained on AudioSet (+ video frames)
    • AudioCLIP-Full-Training.pt – training of all three heads (text, image and audio)
    • AudioCLIP-Partial-Training.pt – training of the audio-head only
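
As a quick sanity check on the downloaded vocabulary file, the sketch below reads the BPE merge rules using only the Python standard library. It is a minimal illustration, not the loader used by this repository: the helper name `read_bpe_merges` and its `limit` parameter are hypothetical, and the code assumes the file is gzip-compressed UTF-8 text with one whitespace-separated merge rule per line, preceded by a single header line.

```python
import gzip


def read_bpe_merges(path, limit=None):
    """Return BPE merge rules from a gzipped vocabulary file as tuples of tokens.

    Hypothetical helper for illustration only.
    Assumptions: UTF-8 text, one whitespace-separated rule per line,
    and a single header line at the top of the file that is skipped.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        lines = fh.read().split("\n")

    # Skip the assumed header line, then drop blank lines (e.g. the trailing one).
    rules = [tuple(line.split()) for line in lines[1:] if line.strip()]

    # Optionally keep only the first `limit` rules.
    return rules if limit is None else rules[:limit]
```

With the file saved in the working directory, `read_bpe_merges("bpe_simple_vocab_16e6.txt.gz", limit=5)` would return the first five merge rules, which is usually enough to confirm that the download is intact and decodes cleanly.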