My research interest: Computer Vision

A Transformer helps your network capture long-range dependencies. It fuses long-distance features through a correlation (attention) matrix computed between all pairs of positions.
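To make the "correlation matrix" idea concrete, here is a minimal sketch of single-head scaled dot-product self-attention using only the Python standard library. The helper names (`matmul`, `softmax`, `self_attention`), the random weights, and the sizes `n = 6`, `d = 4` are assumptions chosen for illustration, not code from any particular library.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence x of shape (n, d)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    # (n, n) matrix: how strongly each position correlates with every other one
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted mix of the values from ALL positions
    return matmul(weights, v), weights

# Illustrative sizes and random weights (assumptions, not trained values)
random.seed(0)
n, d = 6, 4
x = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
w_q, w_k, w_v = ([[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
                 for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
print(len(attn), len(attn[0]))  # the attention matrix is n x n
```

Every row of `attn` sums to 1, so position 0 can draw on position 5 just as easily as on position 1. That is the long-range fusion described above.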

As you know, a Convolutional Neural Network (CNN) is characterized by local connectivity. This is a strong prior for low-rank image information: it reduces the number of parameters and suppresses long-distance noise. However, it fails to take long-distance information into account.
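As a rough illustration of how much local connection saves, the sketch below counts the parameters of a convolution and of a fully connected layer that maps the same input to an output of the same size. The function names and the layer sizes (3 input channels, 16 output channels, a 3x3 kernel, a 32x32 image) are assumptions picked for the example.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return out_ch * (in_ch * kernel * kernel + 1)

def dense_params(in_ch, out_ch, height, width):
    """Weights plus biases of a fully connected layer over the flattened map."""
    n_in = in_ch * height * width
    n_out = out_ch * height * width
    return n_out * (n_in + 1)

# Illustrative sizes (assumptions): RGB input, 16 output channels, 32x32 image
print(conv_params(3, 16, 3))        # 448
print(dense_params(3, 16, 32, 32))  # 50348032
```

The convolution's parameter count does not depend on the image size at all, while the fully connected layer's count grows with the square of the number of pixels.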

By the way, the Transformer's shortcoming is easy to see: it usually has more parameters than a CNN, because it needs to deal with more features. If your task needs long-range dependencies, use a Transformer; if it does not, a CNN is the better fit.