Implementation of a Vision Transformer (ViT) from scratch, with performance compared against standard CNNs (ResNets) and a pre-trained ViT on CIFAR-10 and CIFAR-100.
The ViT model is implemented in PyTorch, following the paper 'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale' by Google Research.
In the paper, ViT is pre-trained on large datasets such as ImageNet and JFT-300M, then fine-tuned on benchmarks including CIFAR-10/100. It is compared against BiT (ResNet 152x4) and EfficientNet under the same conditions, as well as on the VTAB classification suite of 19 tasks divided into Natural, Specialized, and Structured groups. Due to the non-availability of powerful compute on Google Colab, we chose to train and test on these 2 datasets – CIFAR-10 and CIFAR-100.
Presentation can be accessed here.
| Name | ID |
|---|---|
| Akshit Khanna | 2017A7PS0023P |
| Vishal Mittal | 2017A7PS0080P |
| Raghav Bansal | 2017A3PS0196P |