This project uses a pre-trained vision transformer, google/vit-base-patch16-224, as a feature extractor for clothing images. The final classification head is skipped and the pooled image representation is used as an embedding, allowing uploaded images to be compared against a catalogue in vector space.
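As a rough illustration of the embedding step, here is a minimal sketch using the Hugging Face `transformers` classes `ViTImageProcessor` and `ViTModel`. The helper names `load_encoder` and `embed_image` are hypothetical, and taking the `[CLS]` token of the last hidden state as the pooled representation is an assumption on my part; the project may pool differently. The pooling layer is switched off because a classification checkpoint may not ship trained weights for it.

```python
MODEL_NAME = "google/vit-base-patch16-224"


def load_encoder(model_name=MODEL_NAME):
    """Load the image processor and the ViT backbone without a classification head."""
    # Imported here so the rest of the module can be used without transformers installed.
    from transformers import ViTImageProcessor, ViTModel

    processor = ViTImageProcessor.from_pretrained(model_name)
    # ViTModel is the bare encoder; add_pooling_layer=False skips the extra
    # dense+tanh pooler (assumption: we pool via the [CLS] token instead).
    model = ViTModel.from_pretrained(model_name, add_pooling_layer=False)
    model.eval()
    model.requires_grad_(False)  # inference only, so no autograd graph is built
    return processor, model


def embed_image(image, processor, model):
    """Return one embedding vector for a single PIL image."""
    inputs = processor(images=image, return_tensors="pt")
    outputs = model(**inputs)
    # last_hidden_state has shape (batch, tokens, hidden); token 0 is [CLS].
    return outputs.last_hidden_state[0, 0]
```

Calling `embed_image(Image.open(path).convert("RGB"), processor, model)` would then give one vector per image, which can be stored alongside the catalogue entry.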
The dataset started as 50,000 clothing images. The preprocessing pipeline removes unreadable files, 1x1-pixel images, and files with duplicate filenames, then generates embeddings for the remaining catalogue. This leaves around 30,000 usable images for search.
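The three filters could look something like the sketch below. This is not the project's actual pipeline: `filter_catalogue` and `read_size_with_pillow` are hypothetical names, Pillow is assumed as the image reader, and "duplicate filenames" is interpreted as two paths sharing the same base name, with the first readable one kept.

```python
from pathlib import Path


def read_size_with_pillow(path):
    """Return (width, height), raising if the file cannot be decoded."""
    from PIL import Image  # assumption: Pillow is the image library in use

    with Image.open(path) as img:
        img.load()  # force a full decode so truncated files fail here
        return img.size


def filter_catalogue(paths, read_size=read_size_with_pillow):
    """Drop unreadable files, 1x1 images, and repeated filenames."""
    seen_names = set()
    kept = []
    for path in map(Path, paths):
        if path.name in seen_names:
            continue  # duplicate filename
        try:
            width, height = read_size(path)
        except Exception:
            continue  # unreadable or corrupt file
        if (width, height) == (1, 1):
            continue  # 1x1 image
        seen_names.add(path.name)
        kept.append(path)
    return kept
```

Passing `read_size` as a parameter keeps the filtering logic independent of the image library, which makes it easy to check on a handful of fake paths.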
My first implementation loaded every embedding and computed cosine similarity against the query one at a time, so each search took around 20 seconds. Stacking the embeddings into a single tensor and computing all similarities in one vectorised pass cut this to around 2 seconds, and it also meant the catalogue embeddings could be stored as a single artifact.
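The vectorised version boils down to normalising every row once and taking a single matrix-vector product. The sketch below uses NumPy rather than the tensor library the app uses, purely to keep the example small; `top_k_cosine` is a hypothetical name, and the small epsilon guarding against zero-length vectors is my own addition.

```python
import numpy as np


def top_k_cosine(query, catalogue, k=5):
    """Return indices and cosine scores of the k catalogue rows closest to query."""
    catalogue = np.asarray(catalogue, dtype=np.float32)  # shape (n_images, dim)
    query = np.asarray(query, dtype=np.float32)          # shape (dim,)

    # Scale every vector to unit length so a dot product equals cosine similarity.
    eps = 1e-12  # assumption: avoids division by zero for all-zero vectors
    cat_unit = catalogue / np.maximum(
        np.linalg.norm(catalogue, axis=1, keepdims=True), eps
    )
    query_unit = query / max(float(np.linalg.norm(query)), eps)

    scores = cat_unit @ query_unit  # one pass over the whole catalogue
    k = min(k, scores.shape[0])
    top = np.argsort(-scores)[:k]   # highest similarity first
    return top, scores[top]
```

Because the catalogue is a single array, it can be saved and loaded as one file, which is the storage simplification mentioned above.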
The app also includes an approximate nearest-neighbour mode using Hierarchical Navigable Small World (HNSW) graphs through nmslib. The frontend lets users switch between vectorised cosine search and HNSW, adjust the neighbour count, and compare query times directly in the UI.
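For reference, building and querying an HNSW index with nmslib, and timing a search so the two modes can be compared, might look roughly like this. The function names and the `M` and `efConstruction` values are illustrative assumptions, not the settings the app uses.

```python
import time


def build_hnsw_index(embeddings, m=16, ef_construction=200):
    """Build an HNSW index over a 2-D float array of embeddings."""
    # Imported here so the cosine-only mode works without nmslib installed.
    import nmslib

    index = nmslib.init(method="hnsw", space="cosinesimil")
    index.addDataPointBatch(embeddings)
    # M and efConstruction trade build time and memory for recall (values assumed).
    index.createIndex({"M": m, "efConstruction": ef_construction}, print_progress=False)
    return index


def hnsw_search(index, query, k=10):
    """Return (ids, distances); in the cosinesimil space, distance is 1 - cosine similarity."""
    ids, distances = index.knnQuery(query, k=k)
    return ids, distances


def timed(search_fn, *args, **kwargs):
    """Run a search function and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = search_fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Wrapping both search modes in the same `timed` helper gives directly comparable numbers, for example `timed(hnsw_search, index, query, k=10)` against the same call on the vectorised cosine function.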
Useful next steps would be increasing the dataset size, comparing more ANN libraries or vector databases, testing domain-specific image encoders, and moving large generated artifacts to object storage or Git LFS.