[NAACL 2022] Mobile Text-to-Image Search powered by multimodal semantic representation models (e.g., OpenAI's CLIP)
MoTIS is a minimal demo of semantic multimodal text-to-image search using pretrained vision-language models. Semantic search represents each sample (text or image) as a vector in a shared semantic embedding space; the relevance score between a query and an image can then be measured as the similarity (e.g., cosine similarity or distance) between their vectors.
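As a rough illustration of this retrieval setup (not the exact MoTIS pipeline), the sketch below uses the openai/clip Python package with its public ViT-B/32 checkpoint to embed a text query and a few images, then ranks the images by cosine similarity; the image file names are placeholders.

```python
# Minimal sketch of semantic text-to-image search with the openai/clip package.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode a small image corpus and a text query into the shared embedding space.
image_paths = ["cat.jpg", "dog.jpg", "car.jpg"]  # hypothetical files
images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(clip.tokenize(["a photo of a cat"]).to(device))

# Relevance = cosine similarity between L2-normalized vectors.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
scores = (text_features @ image_features.T).squeeze(0)
ranking = scores.argsort(descending=True)
print([image_paths[i] for i in ranking])
```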
Performance: The compressed image and text encoders combined achieve 40.4/68.5/78.4 R@1/R@5/R@10 on the MS COCO 2014 5K test set, matching the CLIP model (40.9/67.6/77.9) fine-tuned with a contrastive loss. On the 1K test split, our current best compressed dual encoder achieves 61.2/87.6/94.2 R@1/R@5/R@10, while CLIP obtains 61.0/87.9/94.7.
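For reference, R@K is the fraction of text queries whose ground-truth image appears among the top-K retrieved images. A minimal sketch, assuming a precomputed text-to-image similarity matrix with one matching image per query:

```python
# Recall@K for text-to-image retrieval over a [num_queries x num_images]
# similarity matrix; the data below is illustrative, not MS COCO itself.
import torch

def recall_at_k(sim: torch.Tensor, gt_index: torch.Tensor, k: int) -> float:
    topk = sim.topk(k, dim=1).indices                  # top-k image ids per text query
    hits = (topk == gt_index.unsqueeze(1)).any(dim=1)  # is the ground truth among them?
    return hits.float().mean().item()

sim = torch.randn(5000, 5000)   # e.g., a 5K test split
gt = torch.arange(5000)         # query i matches image i (illustrative)
print(recall_at_k(sim, gt, 1), recall_at_k(sim, gt, 5), recall_at_k(sim, gt, 10))
```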
Inference Speed: The image encoder is approximately 1.6 times faster than CLIP's ViT-B/32 image encoder, and the text encoder is about 2.9 times faster than CLIP's text encoder.
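Speed-ups like these are typically measured by averaging forward-pass latency over repeated runs. A minimal benchmarking sketch, assuming a TorchScript image encoder such as the checkpoints below (the file name is hypothetical, and absolute numbers depend on hardware):

```python
# Rough latency-measurement sketch for comparing encoders.
import time
import torch

def avg_latency_ms(encode_fn, example, runs: int = 100) -> float:
    """Average wall-clock latency of encode_fn(example) over `runs` calls."""
    with torch.no_grad():
        for _ in range(10):              # warm-up iterations
            encode_fn(example)
        start = time.perf_counter()
        for _ in range(runs):
            encode_fn(example)
    return (time.perf_counter() - start) / runs * 1000

image_encoder = torch.jit.load("image_encoder.pt").eval()  # hypothetical path
dummy_image = torch.randn(1, 3, 224, 224)                  # ViT-style input resolution
print(f"image encoder: {avg_latency_ms(image_encoder, dummy_image):.1f} ms / image")
```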
Model | Disk Space | Google Drive | R@10 on MS COCO 2014 5K test set |
---|---|---|---|
original CLIP | 224MB | https://drive.google.com/file/d/1583IT_K9cCkeHfrmuTpMbImbS5qB8SA1/view?usp=sharing | 64.5 |
fine-tuned CLIP | 224MB | - | 77.9 |
6-Layer Transformer with hard negatives | 170MB | https://drive.google.com/file/d/1isMy64zuWnggd9K63RMHG4fx6U4O-izE/view?usp=sharing | 79.4 |
4-Layer Transformer with hard negatives | 146MB | https://drive.google.com/file/d/1c83gD8NGT8v8RcE_E_rCrkqWN2RIzHEg/view?usp=sharing | 79.0 |
2-Layer Transformer with hard negatives | 121MB | https://drive.google.com/file/d/1QdWJw_29MWQnb9SgClwbM_9iZquB9QKT/view?usp=sharing | 78.4 |
Model | Disk Space | Google Drive | R@10 on MS COCO 2014 5K test set |
---|---|---|---|
original CLIP | 336MB | https://drive.google.com/file/d/1K2wIyTuSWLTKBXzUlyTEsa4xXLNDuI7P/view?usp=sharing | 64.5 |
fine-tuned CLIP | 336MB | - | 77.9 |
ViT-small-patch16-224 | 85MB | https://drive.google.com/file/d/1s_oX0-HIELpjjrBXsjlofIbTGZ_Wllo0/view?usp=sharing | 68.9 |
ViT-small-patch16-224 (larger batch size) | 85MB | https://drive.google.com/file/d/1h_w9msJMB4F-dR6uNwp-BHeguS5QIrnE/view?usp=sharing | 68.3 |
ViT-small-patch16-224 (larger batch size and hard negatives sampled from the training set) | 85MB | https://drive.google.com/file/d/14AqCaORjxePrscdwUTGprII8siJ7ik8X/view?usp=sharing | 69.4 |
ViT-small-patch16-224 (larger batch size, bigger image corpus, and hard negatives sampled from the training set) | 85MB | https://drive.google.com/file/d/1q3dllreyVTofWh5JZywzWYHQlNgcRacq/view?usp=sharing | 69.9 |
ViT-small-patch16-224-ImageNet21K (larger batch size, bigger image corpus, and hard negatives sampled from the training set) | 85MB | https://drive.google.com/file/d/1Whacd4qeFuP_sair3yNGUeQTm4bshDYh/view?usp=sharing | 75.3 |
Note that these checkpoints are not saved via `state_dict()`; they are exported with `torch.jit.script`. The same original CLIP text encoder is used with all of the image encoders above.
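Because the checkpoints are TorchScript archives rather than `state_dict()` files, they can be loaded with `torch.jit.load` instead of `Module.load_state_dict`. A minimal sketch (the file name is a placeholder for a downloaded checkpoint):

```python
# Load a TorchScript image-encoder checkpoint and run a dummy forward pass.
import torch

image_encoder = torch.jit.load("ViT-small-patch16-224.pt", map_location="cpu").eval()
with torch.no_grad():
    emb = image_encoder(torch.randn(1, 3, 224, 224))  # one 224x224 RGB image
print(emb.shape)  # embedding in the shared text-image space
```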
Run `pod install` to install the iOS demo's CocoaPods dependencies.
Type any keyword to search for relevant images. Type "reset" to return to the default view.