When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations