Hardware Accelerators

Hardware accelerator for Vision Transformer inference on the edge

Vision Transformer models, such as ViT, Swin Transformer, and Transformer-in-Transformer, have recently gained significant traction in computer vision tasks due to their superior performance and scalability. Several applications, such as drone navigation, require real-time inference of these models on the edge. However, these models are large and compute-heavy, making them difficult to deploy on resource-constrained edge devices.

There are two primary issues with deploying such models on edge devices:

  • limited compute resources, which cap the achievable throughput and increase latency
  • limited on-chip memory that cannot hold the full set of model weights, forcing off-chip memory accesses that carry a high energy cost (see the sizing sketch below)
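
To get a feel for the gap, the short sketch below compares the weight footprint of ViT-Base (using its commonly cited ~86M parameter count, an assumption on our part since the design targets several variants) against the roughly 4.9 Mb of block RAM available on a Zynq-7020:

```cpp
// Back-of-the-envelope sizing: why ViT weights cannot stay on-chip.
#include <cstdio>

int main() {
    const double vit_base_params = 86e6;        // commonly cited ViT-Base size
    const double bytes_fp32      = 4.0;         // 32-bit weights
    const double bytes_int8      = 1.0;         // 8-bit quantized weights
    const double zynq7020_bram   = 4.9e6 / 8.0; // ~4.9 Mb of BRAM, in bytes

    printf("ViT-Base weights (FP32): %.0f MB\n", vit_base_params * bytes_fp32 / 1e6);
    printf("ViT-Base weights (INT8): %.0f MB\n", vit_base_params * bytes_int8 / 1e6);
    printf("Zynq-7020 on-chip BRAM : %.2f MB\n", zynq7020_bram / 1e6);
    // Even aggressively quantized, the weights exceed on-chip storage by
    // roughly two orders of magnitude, so they must stream from DRAM.
    return 0;
}
```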

In our accelerator design, we address the first issue with a head-level pipeline and a configurable architecture, ensuring that all available resources are used at the best possible hardware utilization efficiency. We address the second issue by eliminating repeated off-chip memory accesses for any data through an optimal scheduling scheme: we adopt an input-stationary dataflow, perform the computation column-wise, and apply inter-layer MLP optimizations.
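
As a rough software model of what head-level pipelining buys, the sketch below steps attention heads through pipeline stages. The three-stage split (QKV projection, attention, output projection) and the head count are illustrative assumptions, not the exact partitioning from the paper; the point is that at steady state every stage holds a different head, so no compute unit sits idle.

```cpp
// Illustrative schedule for a head-level pipeline: head h enters stage s
// at step h + s, so several heads are in flight at once.
#include <cstdio>

int main() {
    const int num_heads  = 12;  // e.g., ViT-Base
    const int num_stages = 3;   // assumed split: QKV proj -> attention -> output proj

    for (int step = 0; step < num_heads + num_stages - 1; ++step) {
        printf("step %2d:", step);
        for (int s = 0; s < num_stages; ++s) {
            int h = step - s;   // head occupying stage s at this step
            if (h >= 0 && h < num_heads)
                printf("  stage%d=head%-2d", s, h);
        }
        printf("\n");
    }
    // The pipelined schedule finishes in (heads + stages - 1) steps rather
    // than heads * stages, which is where the utilization gain comes from.
    return 0;
}
```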

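The memory-side idea can be sketched the same way. Below is a minimal software model of an input-stationary, column-wise schedule for one linear layer Y = X * W; the buffer names and single-column granularity are our illustration, not the paper's exact buffering. The input X stays resident in on-chip buffers while each weight column streams in from DRAM exactly once, so no weight is ever re-fetched.

```cpp
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// X: (tokens x dim) input, held stationary in on-chip buffers.
// W: (dim x out)    weights, streamed column-by-column from DRAM.
void linear_input_stationary(const Matrix& X, const Matrix& W, Matrix& Y) {
    const size_t tokens = X.size();
    const size_t dim    = W.size();
    const size_t out    = W[0].size();

    for (size_t j = 0; j < out; ++j) {
        // "Fetch" column j of W once; in hardware this is a single DMA burst.
        std::vector<float> w_col(dim);
        for (size_t k = 0; k < dim; ++k) w_col[k] = W[k][j];

        // Reuse the stationary input against this column for every token,
        // finishing output column j in full before the next column arrives.
        for (size_t t = 0; t < tokens; ++t) {
            float acc = 0.0f;
            for (size_t k = 0; k < dim; ++k) acc += X[t][k] * w_col[k];
            Y[t][j] = acc;
        }
    }
}

int main() {
    Matrix X(4, std::vector<float>(8, 1.0f)); // 4 tokens, embedding dim 8
    Matrix W(8, std::vector<float>(3, 0.5f)); // dim 8 -> 3 output channels
    Matrix Y(4, std::vector<float>(3, 0.0f));
    linear_input_stationary(X, W, Y);         // every Y[t][j] == 4.0
    return 0;
}
```
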
These optimizations allow the design to achieve nearly 90% hardware utilization efficiency, drawing 0.88 W at a 150 MHz clock when implemented on a Zynq-7020 SoC. The design supports several vision transformer variants and delivers reasonable inference throughput in frames per second.


Figure: Head-level pipeline

Figure: Overall hardware accelerator design

Read the complete paper here