Awesome Visual Representation Learning with Transformers

Awesome Transformers (self-attention) in Computer Vision

About transformers

Attention Is All You Need, NeurIPS 2017
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
- [paper] [official code] [pytorch implementation]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
- [paper] [official code] [huggingface/transformers]
Efficient Transformers: A Survey, arXiv 2020
- Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
- [paper]
A Survey on Visual Transformer, arXiv 2020
- Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, Dacheng Tao
- [paper]
Transformers in Vision: A Survey, arXiv 2021
- Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, Mubarak Shah
- [paper]

Combining CNN with self-attention

Attention augmented convolutional networks, ICCV 2019, image classification
- Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, Quoc V. Le
- [paper] [pytorch implementation]
Self-Attention Generative Adversarial Networks, ICML 2019, generative model (GANs)
- Han Zhang, Ian Goodfellow, Dimitris Metaxas, Augustus Odena
- [paper] [official code]
VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019, video processing
- Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid
- [paper]
Visual Transformers: Token-based Image Representation and Processing for Computer Vision, arXiv 2020, image classification
- Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Masayoshi Tomizuka, Kurt Keutzer, Peter Vajda
- [paper]
Feature Pyramid Transformer, ECCV 2020, detection and segmentation
- Dong Zhang, Hanwang Zhang, Jinhui Tang, Meng Wang, Xiansheng Hua, Qianru Sun
- [paper] [official code]
Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers, arXiv 2020, depth estimation
- Zhaoshuo Li, Xingtong Liu, Francis X. Creighton, Russell H. Taylor, Mathias Unberath
- [paper] [official code]
End-to-end Lane Shape Prediction with Transformers, arXiv 2020, lane detection
- Ruijin Liu, Zejian Yuan, Tie Liu, Zhiliang Xiong
- [paper] [official code]
Taming Transformers for High-Resolution Image Synthesis, arXiv 2020, image synthesis
- Patrick Esser, Robin Rombach, Bjorn Ommer
- [paper] [official code]
TransPose: Towards Explainable Human Pose Estimation by Transformer, arXiv 2020, pose estimation
- Sen Yang, Zhibin Quan, Mu Nie, Wankou Yang
- [paper]
End-to-End Video Instance Segmentation with Transformers, arXiv 2020, video instance segmentation
- Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, Huaxia Xia
- [paper]
TransTrack: Multiple-Object Tracking with Transformer, arXiv 2020, MOT
- Peize Sun, Yi Jiang, Rufeng Zhang, Enze Xie, Jinkun Cao, Xinting Hu, Tao Kong, Zehuan Yuan, Changhu Wang, Ping Luo
- [paper] [official code]
TrackFormer: Multi-Object Tracking with Transformers, arXiv 2021, MOT
- Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, Christoph Feichtenhofer
- [paper]
Line Segment Detection Using Transformers without Edges, arXiv 2021, line segment detection
- Yifan Xu, Weijian Xu, David Cheung, Zhuowen Tu
- [paper]
Segmenting Transparent Object in the Wild with Transformer, arXiv 2021, transparent object segmentation
- Enze Xie, Wenjia Wang, Wenhai Wang, Peize Sun, Hang Xu, Ding Liang, Ping Luo
- [paper] [official code]
Bottleneck Transformers for Visual Recognition, arXiv 2021, backbone design
- Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, Ashish Vaswani
- [paper]

DETR Family

End-to-end object detection with transformers, ECCV 2020, object detection
- Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko
- [paper] [official code] [detectron2 implementation]
Deformable DETR: Deformable Transformers for End-to-End Object Detection, ICLR 2021, object detection
- Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai
- [paper] [official code]
End-to-End Object Detection with Adaptive Clustering Transformer, arXiv 2020, object detection
- Minghang Zheng, Peng Gao, Xiaogang Wang, Hongsheng Li, Hao Dong
- [paper]
UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, arXiv 2020, object detection
- Zhigang Dai, Bolun Cai, Yugeng Lin, Junying Chen
- [paper]
DETR for Pedestrian Detection, arXiv 2020, pedestrian detection
- Matthieu Lin, Chuming Li, Xingyuan Bu, Ming Sun, Chen Lin, Junjie Yan, Wanli Ouyang, Zhidong Deng
- [paper]

Stand-alone transformers for Computer Vision

Self-attention only in local neighborhood

Image Transformer, ICML 2018
- Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, Dustin Tran
- [paper] [official code]
Stand-alone self-attention in vision models, NeurIPS 2019
- Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jonathon Shlens
- [paper] [official code (under construction)]
On the relationship between self-attention and convolutional layers, ICLR 2020
- Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi
- [paper] [official code]
Exploring self-attention for image recognition, CVPR 2020
- Hengshuang Zhao, Jiaya Jia, Vladlen Koltun
- [paper] [official code]

Scalable approximations to global self-attention

Generating long sequences with sparse transformers, arXiv 2019
- Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever
- [paper] [official code]
Scaling autoregressive video models, ICLR 2019
- Dirk Weissenborn, Oscar Täckström, Jakob Uszkoreit
- [paper]
Axial attention in multidimensional transformers, arXiv 2019
- Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, Tim Salimans
- [paper] [pytorch implementation]
Axial-deeplab: Stand-alone axial-attention for panoptic segmentation, ECCV 2020
- Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
- [paper] [pytorch implementation]
MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers, arXiv 2020
- Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
- [paper]

Global self-attention with image preprocessing

Generative pretraining from pixels, ICML 2020, iGPT
- Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Heewoo Jun, Prafulla Dhariwal, David Luan, Ilya Sutskever
- [paper] [official code]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021, ViT
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
- [paper] [pytorch implementation]
Pre-Trained Image Processing Transformer, arXiv 2020, IPT
- Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, Wen Gao
- [paper]
Training data-efficient image transformers & distillation through attention, arXiv 2020, DeiT
- Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Herve Jegou
- [paper] [official code]
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, arXiv 2020, SETR
- Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, Li Zhang
- [paper] [official code]
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, arXiv 2021, T2T-ViT
- Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis EH Tay, Jiashi Feng, Shuicheng Yan
- [paper] [official code]
TransReID: Transformer-based Object Re-Identification, arXiv 2021
- Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, Wei Jiang
- [paper]
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
- Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao
- [paper] [official code]

Global self-attention on 3D point clouds

Point Transformer, arXiv 2020, point cloud classification + part/semantic segmentation
- Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, Vladlen Koltun
- [paper]

Unified text-vision tasks

Focused on VQA

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019
- Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee
- [paper] [official code]
LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019
- Hao Tan, Mohit Bansal
- [paper] [official code]
VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019
- Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang
- [paper] [official code]
VL-BERT: Pre-training of Generic Visual-Linguistic Representations, ICLR 2020
- Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai
- [paper] [official code]
UNITER: UNiversal Image-TExt Representation Learning, ECCV 2020
- Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu
- [paper] [official code]
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, arXiv 2020
- Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu
- [paper]
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, arXiv 2021
- Wonjae Kim, Bokyung Son, Ildoo Kim
- [paper]

Focused on Image Retrieval

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, AAAI 2020
- Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, Ming Zhou
- [paper] [official code]
ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data, arXiv 2020
- Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, Arun Sacheti
- [paper]
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, ECCV 2020
- Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao
- [paper] [official code]
Training Vision Transformers for Image Retrieval, arXiv 2021
- Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, Herve Jegou
- [paper]

Focused on OCR

LayoutLM: Pre-training of Text and Layout for Document Image Understanding
- Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou
- [paper] [official code]

Focused on Image Captioning

CPTR: Full Transformer Network for Image Captioning, arXiv 2021
- Wei Liu, Sihan Chen, Longteng Guo, Xinxin Zhu, Jing Liu
- [paper]

Multi-Task

12-in-1: Multi-Task Vision and Language Representation Learning
- Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee
- [paper] [official code]

Open Source Agenda is not affiliated with "Awesome Visual Representation Learning With Transformers" Project. README Source: alohays/awesome-visual-representation-learning-with-transformers

Repository

alohays/awesome-visual-representation-learning-with-transformers

License

MIT