ViT-B/32 is comparable to ResNet50 in inference compute

ViT-B/32 is comparable to ResNet50 in inference compute (139.6 vs. 141.5 GFLOPs, respectively)

Source: Simple Open-Vocabulary Object Detection with Vision Transformers
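
As a quick check of what "comparable" means here, the sketch below computes the gap between the two figures quoted above. The helper name `relative_gap` and the choice to normalize by the larger value are illustrative assumptions, not something taken from the cited paper.

```python
# Inference compute figures quoted in the note above, in GFLOPs.
VIT_B32_GFLOPS = 139.6
RESNET50_GFLOPS = 141.5


def relative_gap(a: float, b: float) -> float:
    """Return |a - b| as a fraction of the larger of the two values."""
    return abs(a - b) / max(a, b)


gap = relative_gap(VIT_B32_GFLOPS, RESNET50_GFLOPS)
print(f"Absolute gap: {abs(VIT_B32_GFLOPS - RESNET50_GFLOPS):.1f} GFLOPs")
print(f"Relative gap: {gap:.1%}")
```

Under this normalization the two models differ by 1.9 GFLOPs, or roughly 1.3% of the larger figure, which is consistent with calling them comparable in inference compute.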