Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)
Text embeddings' vocabulary and PyTorch state_dicts containing weights of the AudioCLIP model trained on AudioSet: