Implementing Stand-Alone Self-Attention in Vision Models using PyTorch
The row and column offsets $a-i$ and $b-j$ are associated with embeddings $r_{a-i}$ and $r_{b-j}$ respectively, each with dimension $\frac{1}{2}d_{out}$. The row and column offset embeddings are concatenated to form $r_{a-i,\,b-j}$. This spatial-relative attention is defined by the equation below.

Equation 2:

$$y_{ij} = \sum_{a,b \in \mathcal{N}_k(i,j)} \operatorname{softmax}_{ab}\left(q_{ij}^{\top} k_{ab} + q_{ij}^{\top} r_{a-i,\,b-j}\right) v_{ab}$$
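The equation above can be sketched in PyTorch as a minimal local-attention module. This is an illustrative sketch, not the repository's actual implementation: the class name, the use of `F.unfold` to gather $k \times k$ neighborhoods, and the single-head layout are all my own simplifications. It assumes an odd kernel size and an even output channel count (so the row and column embeddings can each take half the dimension).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeLocalAttention(nn.Module):
    """Sketch of Equation 2: local self-attention with 2D relative
    position embeddings (single head, illustrative names)."""

    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        assert kernel_size % 2 == 1 and out_ch % 2 == 0
        self.k = kernel_size
        self.query = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.key = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.value = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        # Row/column offset embeddings, each of dimension out_ch / 2;
        # concatenated they form r_{a-i, b-j} of dimension out_ch.
        self.rel_rows = nn.Parameter(torch.randn(out_ch // 2, kernel_size, 1))
        self.rel_cols = nn.Parameter(torch.randn(out_ch // 2, 1, kernel_size))

    def forward(self, x):
        b, _, h, w = x.shape
        pad = self.k // 2
        q = self.query(x)                                    # (b, d, h, w)
        k = self.key(F.pad(x, [pad] * 4))
        v = self.value(F.pad(x, [pad] * 4))
        # Gather the k x k neighborhood N_k(i, j) around each pixel.
        k = F.unfold(k, self.k).view(b, -1, self.k * self.k, h * w)
        v = F.unfold(v, self.k).view(b, -1, self.k * self.k, h * w)
        # r_{a-i, b-j}: concatenate row and column offset embeddings.
        rel = torch.cat([self.rel_rows.expand(-1, self.k, self.k),
                         self.rel_cols.expand(-1, self.k, self.k)], dim=0)
        rel = rel.reshape(1, -1, self.k * self.k, 1)
        q = q.view(b, -1, 1, h * w)
        # Logits q^T k_ab + q^T r_{a-i,b-j}, written as q^T (k_ab + r).
        logits = (q * (k + rel)).sum(dim=1)                  # (b, k*k, h*w)
        attn = F.softmax(logits, dim=1)                      # softmax over N_k
        out = (attn.unsqueeze(1) * v).sum(dim=2)             # (b, d, h*w)
        return out.view(b, -1, h, w)
```

Adding the relative term inside the softmax (rather than using absolute positions) is what makes the attention translation-equivariant, which the paper argues is important for vision models.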
I referred to the following paper when implementing this part.
Datasets | Model | Accuracy | Parameters (My Model, Paper Model)
---|---|---|---
CIFAR-10 | ResNet 26 | 90.94% | 8.30M, -
CIFAR-10 | Naive ResNet 26 | 94.29% | 8.74M, -
CIFAR-10 | ResNet 26 + stem | 90.22% | 8.30M, -
CIFAR-10 | ResNet 38 (WORK IN PROGRESS) | 89.46% | 12.1M, -
CIFAR-10 | Naive ResNet 38 | 94.93% | 15.0M, -
CIFAR-10 | ResNet 50 (WORK IN PROGRESS) | | 16.0M, -
IMAGENET | ResNet 26 (WORK IN PROGRESS) | | 10.3M, 10.3M
IMAGENET | ResNet 38 (WORK IN PROGRESS) | | 14.1M, 14.1M
IMAGENET | ResNet 50 (WORK IN PROGRESS) | | 18.0M, 18.0M