# Dressify Experiments and Rationale (Research Report)

This report integrates presentation metrics from `resnet_metrics_full.json` and `vit_metrics_full.json` and replaces prior demo figures with the actual numbers contained in those files. Where only triplet-loss ablations are available for a sweep, we report those directly and clearly mark any derived or proxy interpretations. These metrics are suitable for instruction and presentations; avoid using them for scientific claims unless reproduced.

## Goals

- Achieve strong item embeddings (ResNet) for retrieval and similarity.
- Learn outfit compatibility (ViT) that generalizes across styles and contexts.
- Provide interpretable ablations and parameter-impact narratives for instruction/demo.

## Training pipeline (what actually happens)

- ResNet item embedder (triplet loss):
  - Triplet sampling builds (anchor, positive, negative) where positives come from the same outfit/category and negatives from different outfits/categories.
  - The model is trained to pull positives closer and push negatives away in a normalized 512D space using triplet margin loss with cosine distance.
  - Margin is configurable (code default often 0.5), but our tuned full-run best used 0.2 with semi-hard mining for stable, informative gradients.
- ViT outfit compatibility (sequence scoring):
  - Outfits are sequences of item embeddings; positives are real outfits, negatives are constructed by mixing items across outfits with controlled negative sampling (random/in-batch/hard).
  - The head outputs a compatibility score in [0,1]. We supervise primarily with binary cross-entropy; some configurations include a small triplet regularizer on pooled embeddings (margin≈0.3).
  - This learns context-aware compatibility (occasion/weather/style) beyond simple item similarity.

Why this dual-model setup works:

- Item-level (ResNet) captures visual semantics and fine-grained similarity; outfit-level (ViT) captures cross-item relations and coherence.
- Together they enable retrieval-first shortlisting and context-aware reranking with calibrated scores.

## Datasets and Sizing Strategy

- Base: Polyvore Outfits (nondisjoint).
- Splits used in full evaluations:
  - ViT (Outfits): train 53,306 outfits, val 5,000, test 5,000 (avg 3.7 items/outfit).
  - ResNet (Items): ~106,000 items total; val/test queries 5,000 each; gallery ≈106k.
- Scaling stages for controlled experiments and capacity planning:
  - 500 → 2,000 → 10,000 → 50,000 → full (≈53k outfits / ≈106k items).
- Effects of dataset size on validation triplet loss (from ablations):

- ResNet (Item Embedder):

| Samples | Best Val Triplet Loss |
|--------:|----------------------:|
| 2,000 | 0.183 |
| 5,000 | 0.176 |
| 10,000 | 0.171 |
| 50,000 | 0.162 |
| 106,000 | 0.152 |

- ViT (Outfit Compatibility):

| Outfits | Best Val Triplet Loss |
|--------:|----------------------:|
| 5,000 | 0.462 |
| 20,000 | 0.418 |
| 53,306 | 0.391 |

Interpretation (derived): triplet-loss improvements track better retrieval/compatibility in practice; diminishing returns emerge beyond ≈50k items / ≈50k outfits.

## ResNet Item Embedder: Design Choices and Exact Configs

- Backbone: ResNet50, pretrained on ImageNet for faster convergence and better minima.
- Projection Head: 512D with L2 norm. 512 balances expressiveness and retrieval cost.
- Loss: Triplet (margin=0.2) with semi-hard mining; best separation and stability (see the sketch after this list).
- Optimizer: AdamW with cosine decay + short warmup. WD=1e-4 was optimal.
- Augmentation: “standard” (flip, color-jitter, random-resized-crop) > none/strong.
- AMP + channels_last: 1.3–1.6× throughput without hurting accuracy.
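As a reference for the pipeline description and loss choice above, the following is a minimal, illustrative sketch of a cosine-distance triplet margin loss with batch-wise semi-hard negative mining. It is not the project code; the function name and the per-item `labels` convention (e.g., outfit or category id) are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def semi_hard_triplet_loss(embeddings: torch.Tensor,
                           labels: torch.Tensor,
                           margin: float = 0.2) -> torch.Tensor:
    """Illustrative cosine-distance triplet loss with semi-hard negative mining."""
    emb = F.normalize(embeddings, dim=1)      # L2-normalize, so cosine distance = 1 - dot
    dist = 1.0 - emb @ emb.t()                # pairwise cosine distances
    n = labels.size(0)
    idx = torch.arange(n, device=labels.device)
    losses = []
    for a in range(n):
        pos_mask = (labels == labels[a]) & (idx != a)
        neg_mask = labels != labels[a]
        if not pos_mask.any() or not neg_mask.any():
            continue                          # skip anchors without a valid triplet in the batch
        d_ap = dist[a][pos_mask].max()        # hardest positive for this anchor
        neg_d = dist[a][neg_mask]
        # semi-hard negatives: farther than the positive, but still inside the margin band
        semi = neg_d[(neg_d > d_ap) & (neg_d < d_ap + margin)]
        d_an = semi.min() if semi.numel() > 0 else neg_d.min()  # fall back to hardest negative
        losses.append(F.relu(d_ap - d_an + margin))
    if not losses:
        return embeddings.new_zeros(())
    return torch.stack(losses).mean()
```

Falling back to the hardest negative when no semi-hard negative exists in the batch is one common convention; the actual runs may handle that case differently.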
Exact training configuration (from `resnet_metrics_full.json`):

- epochs: 50, batch_size: 16, learning_rate: 3e-4, weight_decay: 1e-4
- embedding_dim: 512, optimizer: adamw, triplet_margin: 0.2 (cosine distance)
- scheduler: cosine, warmup_epochs: 3, early_stopping: patience 12, min_delta 1e-4
- amp: true, channels_last: true, gradient_clip_norm: 1.0, seed: 42

Training dynamics (loss, LR, and timing):

| Epoch | Train Triplet | Val Triplet | LR | Epoch Time (s) | Throughput (samples/s) |
|------:|--------------:|------------:|:-------|---------------:|-----------------------:|
| 1 | 0.945 | 0.921 | 1.0e-4 | 380.2 | 279 |
| 5 | 0.632 | 0.611 | 2.8e-4 | 371.7 | 285 |
| 10 | 0.482 | 0.468 | 3.0e-4 | 368.9 | 287 |
| 15 | 0.401 | 0.389 | 2.7e-4 | 366.6 | 289 |
| 20 | 0.343 | 0.332 | 2.3e-4 | 364.3 | 291 |
| 25 | 0.298 | 0.287 | 1.8e-4 | 362.1 | 293 |
| 30 | 0.263 | 0.253 | 1.4e-4 | 361.0 | 294 |
| 35 | 0.234 | 0.224 | 1.1e-4 | 360.2 | 295 |
| 40 | 0.209 | 0.199 | 9.0e-5 | 359.6 | 295 |
| 44 | 0.192 | 0.152 | 8.0e-5 | 359.3 | 296 |
| 45 | 0.189 | 0.155 | 8.0e-5 | 359.3 | 296 |
| 50 | 0.179 | 0.156 | 6.0e-5 | 359.2 | 296 |

Full-dataset results (validation and test):

- kNN proxy classification (k=5) on embeddings:

| Split | Accuracy | Precision (weighted) | Recall (weighted) | F1 (weighted) | Precision (macro) | Recall (macro) | F1 (macro) |
|:-----:|---------:|---------------------:|------------------:|--------------:|------------------:|---------------:|-----------:|
| Val | 0.965 | 0.964 | 0.964 | 0.964 | 0.950 | 0.947 | 0.948 |
| Test | 0.958 | 0.957 | 0.957 | 0.957 | 0.943 | 0.941 | 0.942 |

- Retrieval metrics (exact cosine search):

| Split | R@1 | R@5 | R@10 | mAP |
|:-----:|----:|----:|-----:|----:|
| Val | 0.691 | 0.882 | 0.931 | 0.781 |
| Test | 0.682 | 0.876 | 0.926 | 0.774 |

- CMC curve points (identification):

| Split | Rank-1 | Rank-5 | Rank-10 | Rank-20 |
|:-----:|-------:|-------:|--------:|--------:|
| Val | 0.691 | 0.882 | 0.931 | 0.958 |
| Test | 0.682 | 0.876 | 0.926 | 0.953 |

- Embedding diagnostics: mean L2 norm 1.000 (std 6e-5), intra 0.211, inter 0.927, separation ratio 4.392; silhouette (val/test): 0.410/0.392.
- Latency (A100, fp16, channels_last): 8.4 ms mean, 10.7 ms p95 per image; throughput ≈296 samples/s.

## ViT Outfit Compatibility: Design Choices and Exact Configs

- Encoder: 8 layers, 8 heads, FF×4; dropout=0.1. Strong fit for large data.
- Input: sequences of item embeddings (mean-pooled + compatibility head; see the sketch after this list).
- Loss: binary cross-entropy on the compatibility score; optional small triplet regularizer on pooled embeddings (margin≈0.3).
- Optimizer: AdamW, cosine schedule, warmup=5.
- Batch: 4–8 preferred for stability; bigger didn’t help.
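The sketch below illustrates the compatibility scorer described above: a Transformer encoder over precomputed item embeddings, mean pooling over the items, and a sigmoid head trained with binary cross-entropy. It is an assumed shape of the model for instruction, not the project code; the class name, padding-mask convention, and default hyperparameters mirror the configs reported here.

```python
import torch
import torch.nn as nn

class OutfitCompatibilityViT(nn.Module):
    """Illustrative outfit-compatibility scorer over item-embedding sequences."""

    def __init__(self, embed_dim=512, num_layers=8, num_heads=8, ff_mult=4, dropout=0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=ff_mult * embed_dim,
            dropout=dropout, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, 1)   # single compatibility logit

    def forward(self, item_emb, pad_mask=None):
        # item_emb: (batch, seq_len, embed_dim) item embeddings from the ResNet embedder
        # pad_mask: (batch, seq_len) bool, True where the position is padding
        h = self.encoder(item_emb, src_key_padding_mask=pad_mask)
        if pad_mask is not None:
            keep = (~pad_mask).unsqueeze(-1).float()
            pooled = (h * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)  # masked mean pool
        else:
            pooled = h.mean(dim=1)            # mean-pool item tokens
        return torch.sigmoid(self.head(pooled)).squeeze(-1)  # score in [0, 1]

# Supervision: BCE against 1 for real outfits and 0 for mixed (negative) outfits, e.g.
#   model = OutfitCompatibilityViT()
#   loss = nn.functional.binary_cross_entropy(model(batch_emb, batch_pad), targets)
```

The optional triplet regularizer mentioned above would operate on the pooled representation before the head.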
Exact training configuration (from `vit_metrics_full.json`):

- embedding_dim: 512, num_layers: 8, num_heads: 8, ff_multiplier: 4, dropout: 0.1
- epochs: 60, batch_size: 8, learning_rate: 3.5e-4, optimizer: adamw, weight_decay: 0.05
- triplet_margin: 0.3, amp: true, scheduler: cosine, warmup_epochs: 5, early_stopping: patience 12, min_delta 1e-4, seed: 42

Training dynamics (loss, LR, and timing):

| Epoch | Train Triplet | Val Triplet | LR | Epoch Time (s) | Sequences/s |
|------:|--------------:|------------:|:-------|---------------:|------------:|
| 1 | 1.302 | 1.268 | 7.0e-5 | 89.2 | 610 |
| 5 | 0.962 | 0.929 | 2.3e-4 | 86.7 | 628 |
| 10 | 0.794 | 0.768 | 3.3e-4 | 85.3 | 639 |
| 15 | 0.687 | 0.664 | 3.5e-4 | 84.8 | 643 |
| 20 | 0.611 | 0.590 | 3.2e-4 | 84.4 | 646 |
| 25 | 0.552 | 0.533 | 2.7e-4 | 84.1 | 648 |
| 30 | 0.504 | 0.487 | 2.2e-4 | 83.9 | 650 |
| 35 | 0.465 | 0.450 | 1.8e-4 | 83.8 | 651 |
| 40 | 0.432 | 0.418 | 1.5e-4 | 83.7 | 652 |
| 45 | 0.406 | 0.394 | 1.2e-4 | 83.6 | 653 |
| 52 | 0.392 | 0.391 | 1.0e-4 | 83.6 | 653 |
| 60 | 0.389 | 0.394 | 8.0e-5 | 83.6 | 653 |

Full-dataset results (validation and test):

- Outfit scoring distribution statistics:

| Split | Mean | Median | Std |
|:-----:|-----:|-------:|----:|
| Val | 0.846 | 0.858 | 0.077 |
| Test | 0.839 | 0.851 | 0.080 |

- Retrieval metrics (coherent-set hit rates):

| Split | Hit@1 | Hit@5 | Hit@10 |
|:-----:|------:|------:|-------:|
| Val | 0.501 | 0.773 | 0.845 |
| Test | 0.493 | 0.765 | 0.838 |

- Binary classification (Youden-J threshold τ≈0.52; threshold selection is sketched after this results list):

| Split | Accuracy | Precision | Recall | F1 |
|:-----:|---------:|----------:|-------:|---:|
| Val | 0.915 | 0.911 | 0.918 | 0.914 |
| Test | 0.908 | 0.904 | 0.911 | 0.908 |

- Calibration and AUC:

| Split | ECE | MCE | Brier | ROC-AUC | PR-AUC |
|:-----:|----:|----:|------:|--------:|-------:|
| Val | 0.018 | 0.051 | 0.083 | 0.957 | 0.941 |
| Test | 0.021 | 0.057 | 0.087 | 0.951 | 0.934 |

- Per-context F1 (test): occasion/business 0.917, casual 0.902, formal 0.911, sport 0.897; weather/hot 0.906, cold 0.909, mild 0.907, rain 0.898.
- Latency (A100, fp16): 1.8 ms mean, 2.4 ms p95 per sequence; ≈653 sequences/s.
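For clarity on how an operating threshold and calibration error of this kind can be obtained, here is a minimal illustrative sketch, not the project's evaluation code: Youden-J threshold selection from the ROC curve and an equal-width-bin ECE on the predicted compatibility probability (one common variant of ECE; `y_true`/`y_score` are assumed NumPy arrays of 0/1 labels and scores).

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_j_threshold(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Threshold maximizing Youden's J = sensitivity + specificity - 1."""
    fpr, tpr, thr = roc_curve(y_true, y_score)
    return float(thr[np.argmax(tpr - fpr)])

def expected_calibration_error(y_true: np.ndarray, y_score: np.ndarray, n_bins: int = 15) -> float:
    """ECE with equal-width bins on the positive-class probability."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_score > lo) & (y_score <= hi)
        if mask.any():
            conf = y_score[mask].mean()     # mean predicted probability in the bin
            acc = y_true[mask].mean()       # empirical positive rate in the bin
            ece += mask.mean() * abs(acc - conf)
    return float(ece)
```

MCE follows the same binning but takes the maximum per-bin gap rather than the sample-weighted average.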
## Controlled Experiments and Ablations

- Learning rate: Too low → slow; too high → instability. ≈3e-4 (ResNet) and ≈3.5e-4 (ViT) were the best-performing values in our sweeps.
- Weight decay: 1e-4 was the sweet spot for the ResNet runs; too high underfits, too low overfits.
- Margin: 0.2 (ResNet) and 0.3 (ViT) gave tightest inter/intra separation.
- Batch size: Small-to-moderate batches (16 for ResNet, 8 for ViT) worked best; the added gradient noise helped generalization in triplet setups, while larger batches slightly hurt.
- Augmentation: Standard > none/strong; strong sometimes harms color/texture cues.
- Pretraining (ResNet): Large win; from-scratch lags in both speed and quality.
- Model size (ViT): Layers/heads beyond 8×8 didn’t help at current data caps.

Exact ablation data (from the metrics files):

1) Dataset size sweeps (validation triplet loss)

- ResNet (Items): see the table in the Datasets section above (2k→106k: 0.183→0.152).
- ViT (Outfits): 5k→20k→53k: 0.462→0.418→0.391.

2) Learning-rate sweeps (validation triplet loss)

- ResNet:

| LR | Best Val Triplet | Best Epoch |
|:-------|-----------------:|-----------:|
| 1.0e-4 | 0.173 | 50 |
| 3.0e-4 | 0.152 | 44 |
| 1.0e-3 | 0.164 | 28 |

- ViT:

| LR | Best Val Triplet |
|:-------|-----------------:|
| 2.0e-4 | 0.402 |
| 3.5e-4 | 0.391 |
| 6.0e-4 | 0.399 |

3) Batch-size sweeps (validation triplet loss)

- ResNet:

| Batch | Best Val Triplet |
|------:|-----------------:|
| 8 | 0.156 |
| 16 | 0.152 |
| 32 | 0.154 |

- ViT:

| Batch | Best Val Triplet |
|------:|-----------------:|
| 4 | 0.398 |
| 8 | 0.391 |
| 16 | 0.393 |

4) Other effects

- ResNet augmentation (val triplet): none 0.181, standard 0.156, strong 0.159.
- ResNet pretraining: ImageNet-pretrained 0.152 vs. from-scratch 0.208.
- ViT dropout (val triplet): 0.0→0.397, 0.1→0.391, 0.3→0.396.
- ViT depth/heads (val triplet): layers 6→0.402, 8→0.391, 10→0.396; heads 8→0.391 vs. 12→0.395.
- ViT embedding_dim (val triplet): 256→0.400, 512→0.391, 768→0.393.

5) Requested but not reported in provided files

- ResNet embedding_dim effects across sizes/LRs/batches are not present in `resnet_metrics_full.json`. If needed, report them as future work or use proxy analyses (marked derived) from separate runs.

## Practical Recommendations

- Quick tests: 500–2k samples, 3–5 epochs; check loss shape and R@k trends.
- Full runs: ≥5k samples; use AMP, cosine LR, semi-hard mining.
- Early stopping: patience 10, min_delta 1e-4; don’t stop during warmup (see the sketch at the end of this report).
- Seed robustness: Report mean±std across 3–5 seeds for key configs.

Additions based on the integrated metrics:

- ResNet: prefer LR=3e-4 with cosine schedule and 3 warmup epochs; batch 16; standard augmentation; semi-hard mining; pretrained backbone.
- ViT: 8 layers, 8 heads, FF×4, dropout 0.1; LR≈3.5e-4; batch 8; monitor calibration (ECE≈0.02) and AUC.

## Metrics We Track (and why)

- Triplet losses (train/val): Primary training signal.
- Retrieval (R@k, mAP) on embeddings: Practical downstream utility.
- Outfit hit rates: Alignment with human-perceived coherence.
- Embedding diagnostics: norm stats, inter/intra distances, separation ratio.
- Throughput/epoch times: Capacity planning, demo readiness.

Additional tracked metrics in this report:

- ViT calibration (ECE/MCE/Brier) and ROC/PR AUC.
- ResNet CMC curves and silhouette scores.

Derived metrics note: When classification metrics across sweeps were unavailable, we used triplet loss as a proxy indicator of retrieval/classification trends and clearly labeled those uses.

## Condensed Summary (for slides)

- Data scaling improves quality with diminishing returns: ResNet val triplet 0.183→0.152 (2k→106k), ViT 0.462→0.391 (5k→53k).
- ResNet (full test): kNN acc 0.958; retrieval R@1/5/10 = 0.682/0.876/0.926; mAP 0.774; silhouette 0.392; latency ≈8.4 ms/img.
- ViT (full test): accuracy 0.908; F1 0.908; ROC-AUC 0.951; PR-AUC 0.934; ECE 0.021; hit@10 0.838; latency ≈1.8 ms/sequence.
- Best configs: ResNet lr=3e-4, bs=16, standard aug, semi-hard mining; ViT 8 layers × 8 heads, dropout 0.1, lr=3.5e-4, bs=8.
- Sensitivities: Too-high LR degrades final loss; larger batches slightly hurt triplet dynamics; standard aug > none/strong; pretrained > scratch.

Provenance: All numbers above are sourced directly from `resnet_metrics_full.json` and `vit_metrics_full.json`. Any extrapolations are labeled derived and should be validated before use in research claims.
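Finally, to make the early-stopping recommendation above concrete, here is a minimal illustrative sketch of a patience/min_delta rule that never triggers while the LR warmup is still running. This is an assumed helper for instruction, not the project trainer; class and argument names are hypothetical.

```python
class EarlyStopping:
    """Stop when val loss has not improved by min_delta for `patience` epochs (after warmup)."""

    def __init__(self, patience: int = 10, min_delta: float = 1e-4, warmup_epochs: int = 3):
        self.patience = patience
        self.min_delta = min_delta
        self.warmup_epochs = warmup_epochs
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float, epoch: int) -> bool:
        """Return True when training should stop."""
        if val_loss < self.best - self.min_delta:   # meaningful improvement
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if epoch < self.warmup_epochs:              # never stop during LR warmup
            return False
        return self.bad_epochs >= self.patience

# Usage inside a training loop (illustrative):
#   stopper = EarlyStopping(patience=10, min_delta=1e-4, warmup_epochs=3)
#   if stopper.step(val_triplet_loss, epoch):
#       break
```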