
Project: Vision Transformer Analysis

4 min read

View Project Report

Tags: Python · PyTorch · UNet · ResNet · ViT · DeiT · T2T-ViT · CIFAR-10 · STL-10 · Cityscapes

Last update: April 2024

This is the final project for the course AIST4010, completed in April 2024. More details can be found in the report.



Overview

Project Goals

The project investigates the generalizability of Vision Transformers (ViTs) compared to Convolutional Neural Networks (CNNs) on small-scale computer vision tasks. While ViTs excel on large datasets, they struggle on smaller ones. This work evaluates and compares the performance of ResNet, ViT, DeiT, and T2T-ViT on classification tasks using small subsets of the CIFAR-10 and STL-10 datasets.
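The small-dataset setup described above can be sketched as drawing a class-balanced subset from the full training set. This is a minimal illustration, not the report's actual code; the helper name, seed, and subset sizes are assumptions, and a toy `TensorDataset` stands in for CIFAR-10 to keep the example self-contained.

```python
import random
from collections import defaultdict

import torch
from torch.utils.data import Subset, TensorDataset

def balanced_subset(dataset, per_class, seed=0):
    """Return a Subset holding `per_class` examples from each label."""
    by_label = defaultdict(list)
    for idx in range(len(dataset)):
        _, label = dataset[idx]
        by_label[int(label)].append(idx)
    rng = random.Random(seed)
    indices = []
    for idxs in by_label.values():
        indices.extend(rng.sample(idxs, per_class))
    return Subset(dataset, indices)

# Toy stand-in for CIFAR-10: 100 fake 3x32x32 images over 10 classes.
images = torch.randn(100, 3, 32, 32)
labels = torch.arange(100) % 10
small_train = balanced_subset(TensorDataset(images, labels), per_class=5)
print(len(small_train))  # 50
```

For the real experiments one would pass `torchvision.datasets.CIFAR10` (or `STL10`) instead of the toy dataset; `Subset` works with any indexable dataset, so the sampling logic is unchanged.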

Key Contributions

  1. Scalability Analysis: Demonstrated that ViT performance degrades as dataset size shrinks, showing CNNs are more effective on small datasets.
  2. Computational Efficiency: Analyzed training iterations and time-to-convergence, finding that although ViTs converge in fewer iterations, their lower accuracy on small datasets offsets this advantage.
  3. Comparison of Architectures: Implemented and trained models with similar parameter counts to keep the performance comparison fair.
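The matched-parameter-count check behind contribution 3 can be sketched with a simple counting helper. The two models below are small placeholders, not the report's actual ResNet/ViT variants; only the `count_parameters` pattern reflects the comparison method.

```python
import torch.nn as nn

def count_parameters(model):
    """Total number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Placeholder architectures, sized only to demonstrate the check.
cnn_like = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),       # (3*3*3)*8 + 8 = 224 params
    nn.Flatten(),
    nn.Linear(8 * 30 * 30, 10),           # 7200*10 + 10 = 72,010 params
)
mlp_like = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 24),
    nn.ReLU(),
    nn.Linear(24, 10),
)

for name, model in [("cnn_like", cnn_like), ("mlp_like", mlp_like)]:
    print(name, count_parameters(model))
```

Before training, one would compare these totals across ResNet, ViT, DeiT, and T2T-ViT configurations and adjust widths or depths until the counts are close, so accuracy differences reflect architecture rather than capacity.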