ViT-B/32 is comparable to ResNet50 in inference compute
#memo
ViT-B/32 is comparable to ResNet50 in inference compute (139.6 vs 141.5 GFLOPs, respectively)
Source: Simple Open-Vocabulary Object Detection with Vision Transformers
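
A quick check of how close "comparable" is, using only the two quoted figures. This sketch assumes the numbers are listed in the same order as the models (ViT-B/32 = 139.6 GFLOPs, ResNet50 = 141.5 GFLOPs); the variable names are illustrative and not from the paper.

```python
# Quoted inference compute in GFLOPs (assumed order: ViT-B/32 first, ResNet50 second).
vit_b32_gflops = 139.6
resnet50_gflops = 141.5

# Absolute gap and gap relative to the ResNet50 figure.
abs_gap = resnet50_gflops - vit_b32_gflops
rel_gap = abs_gap / resnet50_gflops

print(f"absolute gap: {abs_gap:.1f} GFLOPs")
print(f"relative gap: {rel_gap:.2%} of ResNet50")
```

Under that assumption the two backbones differ by under 2 GFLOPs, a gap of roughly one percent.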