Project: Vision Transformer Analysis
· 4 min read
This is the final project for the course AIST4010; more details can be found in the project report. The project was completed in April 2024.
Overview
Project Goals
The project investigates the generalizability of Vision Transformers (ViTs) compared to Convolutional Neural Networks (CNNs) on small-scale computer vision tasks. While ViTs excel on large datasets, they often struggle on smaller ones. This work evaluates and compares the performance of ResNet, ViT, DeiT, and T2T-ViT on classification tasks using small subsets of the CIFAR-10 and STL-10 datasets.
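Evaluating on small subsets while keeping the class balance intact is typically done with stratified subsampling. The sketch below is illustrative only: the helper name, the per-class counts, and the seeding convention are my assumptions, not details from the report.

```python
import random
from collections import defaultdict

def stratified_subset(labels, per_class, seed=0):
    """Pick `per_class` example indices for every class, preserving the
    class balance of the full dataset (hypothetical helper; the report's
    exact subset sizes and sampling procedure may differ)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    subset = []
    for idxs in by_class.values():
        subset.extend(rng.sample(idxs, per_class))
    return sorted(subset)

# Toy example: 2 classes, 5 examples each; keep 2 per class
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
print(len(stratified_subset(labels, per_class=2)))  # 4
```

Applied to CIFAR-10 or STL-10, `labels` would be the dataset's target array and the returned indices would feed a subset sampler.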
Key Contributions
- Scalability Analysis: Demonstrated performance degradation of ViTs with reduced dataset sizes, showing CNNs are more effective for small datasets.
- Computational Efficiency: Analyzed training iterations and time-to-convergence, showing that although ViTs converge in fewer iterations, their lower accuracy on small datasets makes them less efficient overall.
- Comparison of Architectures: Implemented and trained models with similar parameter counts for fair performance evaluations.
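A time-to-convergence comparison like the one above can be reduced to a simple question: at which training step does each model first reach a target validation accuracy? The sketch below uses toy accuracy curves and a hypothetical threshold criterion; the report's actual convergence criterion and numbers may differ.

```python
def iterations_to_threshold(accuracy_curve, threshold):
    """Return the first step at which validation accuracy reaches
    `threshold`, or None if it never does (illustrative convergence
    criterion, not necessarily the one used in the report)."""
    for step, acc in enumerate(accuracy_curve, start=1):
        if acc >= threshold:
            return step
    return None

# Toy curves: the ViT-like model crosses the threshold sooner,
# but plateaus at a lower final accuracy than the CNN-like model.
vit_curve = [0.40, 0.62, 0.70, 0.72, 0.72]
cnn_curve = [0.30, 0.50, 0.65, 0.74, 0.80]
print(iterations_to_threshold(vit_curve, 0.70))  # 3
print(iterations_to_threshold(cnn_curve, 0.70))  # 4
```

This captures the trade-off in the efficiency analysis: fewer iterations to a modest accuracy does not imply a better final model.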