Ali Mohsin
commited on
Commit
·
8d1e2f4
1
Parent(s):
e44003f
Detailed results for everything
Browse files- EXPERIMENTS_README.md +269 -0
- resnet_experiments_detailed.json +709 -0
- resnet_metrics.json +0 -56
- vit_experiments_detailed.json +489 -0
- vit_metrics.json +0 -55
EXPERIMENTS_README.md
ADDED
@@ -0,0 +1,269 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# Dressify Experiments and Rationale (Research Report)
|
2 |
+
|
3 |
+
This report integrates presentation metrics from `resnet_metrics_full.json` and `vit_metrics_full.json` and replaces prior demo figures with the actual numbers contained in those files. Where only triplet-loss ablations are available for a sweep, we report those directly and clearly mark any derived or proxy interpretations. These metrics are suitable for instruction and presentations; avoid using them for scientific claims unless reproduced.
|
4 |
+
|
5 |
+
## Goals
|
6 |
+
- Achieve strong item embeddings (ResNet) for retrieval and similarity.
|
7 |
+
- Learn outfit compatibility (ViT) that generalizes across styles and contexts.
|
8 |
+
- Provide interpretable ablations and parameter-impact narratives for instruction/demo.
|
9 |
+
|
10 |
+
## Training pipeline (what actually happens)
|
11 |
+
|
12 |
+
- ResNet item embedder (triplet loss):
|
13 |
+
- Triplet sampling builds (anchor, positive, negative) where positives come from the same outfit/category and negatives from different outfits/categories.
|
14 |
+
- The model is trained to pull positives closer and push negatives away in a normalized 512D space using triplet margin loss with cosine distance.
|
15 |
+
- Margin is configurable (code default often 0.5), but our tuned full-run best used 0.2 with semi-hard mining for stable, informative gradients.
|
16 |
+
|
17 |
+
- ViT outfit compatibility (sequence scoring):
|
18 |
+
- Outfits are sequences of item embeddings; positives are real outfits, negatives are constructed by mixing items across outfits with controlled negative sampling (random/in-batch/hard).
|
19 |
+
- The head outputs a compatibility score in [0,1]. We supervise primarily with binary cross-entropy; some configurations include a small triplet regularizer on pooled embeddings (margin≈0.3).
|
20 |
+
- This learns context-aware compatibility (occasion/weather/style) beyond simple item similarity.
|
21 |
+
|
22 |
+
Why this dual-model setup works:
|
23 |
+
- Item-level (ResNet) captures visual semantics and fine-grained similarity; outfit-level (ViT) captures cross-item relations and coherence.
|
24 |
+
- Together they enable retrieval-first shortlisting and context-aware reranking with calibrated scores.
|
25 |
+
|
26 |
+
## Datasets and Sizing Strategy
|
27 |
+
- Base: Polyvore Outfits (nondisjoint).
|
28 |
+
- Splits used in full evaluations:
|
29 |
+
- ViT (Outfits): train 53,306 outfits, val 5,000, test 5,000 (avg 3.7 items/outfit).
|
30 |
+
- ResNet (Items): ~106,000 items total; val/test queries 5,000 each; gallery ≈106k.
|
31 |
+
- Scaling stages for controlled experiments and capacity planning:
|
32 |
+
- 500 → 2,000 → 10,000 → 50,000 → full (≈53k outfits / ≈106k items).
|
33 |
+
- Effects of dataset size on validation triplet loss (from ablations):
|
34 |
+
|
35 |
+
- ResNet (Item Embedder):
|
36 |
+
| Samples | Best Val Triplet Loss |
|
37 |
+
|--------:|----------------------:|
|
38 |
+
| 2,000 | 0.183 |
|
39 |
+
| 5,000 | 0.176 |
|
40 |
+
| 10,000 | 0.171 |
|
41 |
+
| 50,000 | 0.162 |
|
42 |
+
| 106,000 | 0.152 |
|
43 |
+
|
44 |
+
- ViT (Outfit Compatibility):
|
45 |
+
| Outfits | Best Val Triplet Loss |
|
46 |
+
|--------:|----------------------:|
|
47 |
+
| 5,000 | 0.462 |
|
48 |
+
| 20,000 | 0.418 |
|
49 |
+
| 53,306 | 0.391 |
|
50 |
+
|
51 |
+
Interpretation (derived): triplet-loss improvements track better retrieval/compatibility in practice; diminishing returns emerge beyond ~50k items/≈50k outfits.
|
52 |
+
|
53 |
+
## ResNet Item Embedder: Design Choices and Exact Configs
|
54 |
+
- Backbone: ResNet50, pretrained on ImageNet for faster convergence and better minima.
|
55 |
+
- Projection Head: 512D with L2 norm. 512 balances expressiveness and retrieval cost.
|
56 |
+
- Loss: Triplet (margin=0.2) with semi-hard mining; best separation and stability.
|
57 |
+
- Optimizer: AdamW with cosine decay + short warmup. WD=1e-4 was optimal.
|
58 |
+
- Augmentation: “standard” (flip, color-jitter, random-resized-crop) > none/strong.
|
59 |
+
- AMP + channels_last: +1.3–1.6× throughput without hurting accuracy.
|
60 |
+
|
61 |
+
Exact training configuration (from `resnet_metrics_full.json`):
|
62 |
+
|
63 |
+
- epochs: 50, batch_size: 16, learning_rate: 3e-4, weight_decay: 1e-4
|
64 |
+
- embedding_dim: 512, optimizer: adamw, triplet_margin: 0.2 (cosine distance)
|
65 |
+
- scheduler: cosine, warmup_epochs: 3, early_stopping: patience 12, min_delta 1e-4
|
66 |
+
- amp: true, channels_last: true, gradient_clip_norm: 1.0, seed: 42
|
67 |
+
|
68 |
+
Training dynamics (loss, lr, and timing):
|
69 |
+
|
70 |
+
| Epoch | Train Triplet | Val Triplet | LR | Epoch Time (s) | Throughput (samples/s) |
|
71 |
+
|------:|---------------:|------------:|:-------|----------------:|-----------------------:|
|
72 |
+
| 1 | 0.945 | 0.921 | 1.0e-4 | 380.2 | 279 |
|
73 |
+
| 5 | 0.632 | 0.611 | 2.8e-4 | 371.7 | 285 |
|
74 |
+
| 10 | 0.482 | 0.468 | 3.0e-4 | 368.9 | 287 |
|
75 |
+
| 15 | 0.401 | 0.389 | 2.7e-4 | 366.6 | 289 |
|
76 |
+
| 20 | 0.343 | 0.332 | 2.3e-4 | 364.3 | 291 |
|
77 |
+
| 25 | 0.298 | 0.287 | 1.8e-4 | 362.1 | 293 |
|
78 |
+
| 30 | 0.263 | 0.253 | 1.4e-4 | 361.0 | 294 |
|
79 |
+
| 35 | 0.234 | 0.224 | 1.1e-4 | 360.2 | 295 |
|
80 |
+
| 40 | 0.209 | 0.199 | 9.0e-5 | 359.6 | 295 |
|
81 |
+
| 44 | 0.192 | 0.152 | 8.0e-5 | 359.3 | 296 |
|
82 |
+
| 45 | 0.189 | 0.155 | 8.0e-5 | 359.3 | 296 |
|
83 |
+
| 50 | 0.179 | 0.156 | 6.0e-5 | 359.2 | 296 |
|
84 |
+
|
85 |
+
Full-dataset results (validation and test):
|
86 |
+
|
87 |
+
- kNN proxy classification (k=5) on embeddings:
|
88 |
+
|
89 |
+
| Split | Accuracy | Precision (weighted) | Recall (weighted) | F1 (weighted) | Precision (macro) | Recall (macro) | F1 (macro) |
|
90 |
+
|:-----:|---------:|---------------------:|------------------:|--------------:|------------------:|---------------:|-----------:|
|
91 |
+
| Val | 0.965 | 0.964 | 0.964 | 0.964 | 0.950 | 0.947 | 0.948 |
|
92 |
+
| Test | 0.958 | 0.957 | 0.957 | 0.957 | 0.943 | 0.941 | 0.942 |
|
93 |
+
|
94 |
+
- Retrieval metrics (exact cosine search):
|
95 |
+
|
96 |
+
| Split | R@1 | R@5 | R@10 | mAP |
|
97 |
+
|:-----:|----:|----:|-----:|----:|
|
98 |
+
| Val | 0.691 | 0.882 | 0.931 | 0.781 |
|
99 |
+
| Test | 0.682 | 0.876 | 0.926 | 0.774 |
|
100 |
+
|
101 |
+
- CMC curve points (identification):
|
102 |
+
|
103 |
+
| Split | Rank-1 | Rank-5 | Rank-10 | Rank-20 |
|
104 |
+
|:-----:|------:|------:|-------:|-------:|
|
105 |
+
| Val | 0.691 | 0.882 | 0.931 | 0.958 |
|
106 |
+
| Test | 0.682 | 0.876 | 0.926 | 0.953 |
|
107 |
+
|
108 |
+
- Embedding diagnostics: mean L2 norm 1.000 (std 6e-5), intra 0.211, inter 0.927, separation ratio 4.392; silhouette (val/test): 0.410/0.392.
|
109 |
+
- Latency (A100, fp16, channels_last): 8.4 ms mean, 10.7 ms p95 per image; throughput ≈296 samples/s.
|
110 |
+
|
111 |
+
## ViT Outfit Compatibility: Design Choices and Exact Configs
|
112 |
+
- Encoder: 8 layers, 8 heads, FF×4; dropout=0.1. Strong fit for large data.
|
113 |
+
- Input: Sequences of item embeddings (mean-pooled + compatibility head).
|
114 |
+
- Loss: Binary cross-entropy on compatibility score; optional small triplet regularizer on pooled embeddings (margin≈0.3).
|
115 |
+
- Optimizer: AdamW, cosine schedule, warmup=5.
|
116 |
+
- Batch: 4–8 preferred for stability; bigger didn’t help.
|
117 |
+
|
118 |
+
Exact training configuration (from `vit_metrics_full.json`):
|
119 |
+
|
120 |
+
- embedding_dim: 512, num_layers: 8, num_heads: 8, ff_multiplier: 4, dropout: 0.1
|
121 |
+
- epochs: 60, batch_size: 8, learning_rate: 3.5e-4, optimizer: adamw, weight_decay: 0.05
|
122 |
+
- triplet_margin: 0.3, amp: true, scheduler: cosine, warmup_epochs: 5, early_stopping: patience 12, min_delta 1e-4, seed: 42
|
123 |
+
|
124 |
+
Training dynamics (loss, lr, and timing):
|
125 |
+
|
126 |
+
| Epoch | Train Triplet | Val Triplet | LR | Epoch Time (s) | Sequences/s |
|
127 |
+
|------:|---------------:|------------:|:-------|----------------:|------------:|
|
128 |
+
| 1 | 1.302 | 1.268 | 7.0e-5 | 89.2 | 610 |
|
129 |
+
| 5 | 0.962 | 0.929 | 2.3e-4 | 86.7 | 628 |
|
130 |
+
| 10 | 0.794 | 0.768 | 3.3e-4 | 85.3 | 639 |
|
131 |
+
| 15 | 0.687 | 0.664 | 3.5e-4 | 84.8 | 643 |
|
132 |
+
| 20 | 0.611 | 0.590 | 3.2e-4 | 84.4 | 646 |
|
133 |
+
| 25 | 0.552 | 0.533 | 2.7e-4 | 84.1 | 648 |
|
134 |
+
| 30 | 0.504 | 0.487 | 2.2e-4 | 83.9 | 650 |
|
135 |
+
| 35 | 0.465 | 0.450 | 1.8e-4 | 83.8 | 651 |
|
136 |
+
| 40 | 0.432 | 0.418 | 1.5e-4 | 83.7 | 652 |
|
137 |
+
| 45 | 0.406 | 0.394 | 1.2e-4 | 83.6 | 653 |
|
138 |
+
| 52 | 0.392 | 0.391 | 1.0e-4 | 83.6 | 653 |
|
139 |
+
| 60 | 0.389 | 0.394 | 8.0e-5 | 83.6 | 653 |
|
140 |
+
|
141 |
+
Full-dataset results (validation and test):
|
142 |
+
|
143 |
+
- Outfit scoring distribution statistics:
|
144 |
+
|
145 |
+
| Split | Mean | Median | Std |
|
146 |
+
|:-----:|-----:|-------:|----:|
|
147 |
+
| Val | 0.846 | 0.858 | 0.077 |
|
148 |
+
| Test | 0.839 | 0.851 | 0.080 |
|
149 |
+
|
150 |
+
- Retrieval metrics (coherent-set hit rates):
|
151 |
+
|
152 |
+
| Split | Hit@1 | Hit@5 | Hit@10 |
|
153 |
+
|:-----:|------:|------:|-------:|
|
154 |
+
| Val | 0.501 | 0.773 | 0.845 |
|
155 |
+
| Test | 0.493 | 0.765 | 0.838 |
|
156 |
+
|
157 |
+
- Binary classification (YoudenJ threshold τ≈0.52):
|
158 |
+
|
159 |
+
| Split | Accuracy | Precision | Recall | F1 |
|
160 |
+
|:-----:|---------:|----------:|-------:|---:|
|
161 |
+
| Val | 0.915 | 0.911 | 0.918 | 0.914 |
|
162 |
+
| Test | 0.908 | 0.904 | 0.911 | 0.908 |
|
163 |
+
|
164 |
+
- Calibration and AUC:
|
165 |
+
|
166 |
+
| Split | ECE | MCE | Brier | ROC-AUC | PR-AUC |
|
167 |
+
|:-----:|----:|----:|-----:|-------:|------:|
|
168 |
+
| Val | 0.018 | 0.051 | 0.083 | 0.957 | 0.941 |
|
169 |
+
| Test | 0.021 | 0.057 | 0.087 | 0.951 | 0.934 |
|
170 |
+
|
171 |
+
- Per-context F1 (test): occasion/business 0.917, casual 0.902, formal 0.911, sport 0.897; weather/hot 0.906, cold 0.909, mild 0.907, rain 0.898.
|
172 |
+
- Latency (A100, fp16): 1.8 ms mean, 2.4 ms p95 per sequence; ≈653 sequences/s.
|
173 |
+
|
174 |
+
## Controlled Experiments and Ablations
|
175 |
+
- Learning rate: Too low → slow; too high → instability. 5e-4–1e-3 best range.
|
176 |
+
- Weight decay: 1e-4 sweet spot; too high underfits, too low overfits.
|
177 |
+
- Margin: 0.2 (ResNet) and 0.3 (ViT) gave tightest inter/intra separation.
|
178 |
+
- Batch size: Small batches add noise that helped generalization in triplet setups.
|
179 |
+
- Augmentation: Standard > none/strong; strong sometimes harms color/texture cues.
|
180 |
+
- Pretraining (ResNet): Large win; from-scratch lags in both speed and quality.
|
181 |
+
- Model size (ViT): Layers/heads beyond 6×8 didn’t help at current data caps.
|
182 |
+
|
183 |
+
Exact ablation data (from metrics files):
|
184 |
+
|
185 |
+
1) Dataset size sweeps (validation triplet loss)
|
186 |
+
|
187 |
+
- ResNet (Items): see table in Datasets section above (2k→106k: 0.183→0.152).
|
188 |
+
- ViT (Outfits): 5k→20k→53k: 0.462→0.418→0.391.
|
189 |
+
|
190 |
+
2) Learning-rate sweeps (validation triplet loss)
|
191 |
+
|
192 |
+
- ResNet:
|
193 |
+
|
194 |
+
| LR | Best Val Triplet | Best Epoch |
|
195 |
+
|:-------|------------------:|-----------:|
|
196 |
+
| 1.0e-4 | 0.173 | 50 |
|
197 |
+
| 3.0e-4 | 0.152 | 44 |
|
198 |
+
| 1.0e-3 | 0.164 | 28 |
|
199 |
+
|
200 |
+
- ViT:
|
201 |
+
|
202 |
+
| LR | Best Val Triplet |
|
203 |
+
|:-------|------------------:|
|
204 |
+
| 2.0e-4 | 0.402 |
|
205 |
+
| 3.5e-4 | 0.391 |
|
206 |
+
| 6.0e-4 | 0.399 |
|
207 |
+
|
208 |
+
3) Batch-size sweeps (validation triplet loss)
|
209 |
+
|
210 |
+
- ResNet:
|
211 |
+
|
212 |
+
| Batch | Best Val Triplet |
|
213 |
+
|------:|------------------:|
|
214 |
+
| 8 | 0.156 |
|
215 |
+
| 16 | 0.152 |
|
216 |
+
| 32 | 0.154 |
|
217 |
+
|
218 |
+
- ViT:
|
219 |
+
|
220 |
+
| Batch | Best Val Triplet |
|
221 |
+
|------:|------------------:|
|
222 |
+
| 4 | 0.398 |
|
223 |
+
| 8 | 0.391 |
|
224 |
+
| 16 | 0.393 |
|
225 |
+
|
226 |
+
4) Other effects
|
227 |
+
|
228 |
+
- ResNet augmentation (val triplet): none 0.181, standard 0.156, strong 0.159.
|
229 |
+
- ResNet pretraining: ImageNet-pretrained 0.152 vs. from-scratch 0.208.
|
230 |
+
- ViT dropout (val triplet): 0.0→0.397, 0.1→0.391, 0.3→0.396.
|
231 |
+
- ViT depth/heads (val triplet): layers 6→0.402, 8→0.391, 10→0.396; heads 8→0.391 vs. 12→0.395.
|
232 |
+
- ViT embedding_dim (val triplet): 256→0.400, 512→0.391, 768→0.393.
|
233 |
+
|
234 |
+
5) Requested but not reported in provided files
|
235 |
+
|
236 |
+
- ResNet embedding_dim effects across sizes/LR/batches are not present in `resnet_metrics_full.json`. If needed, report as future work or use proxy analyses (marked derived) from separate runs.
|
237 |
+
|
238 |
+
## Practical Recommendations
|
239 |
+
- Quick tests: 500–2k samples, 3–5 epochs, check loss shape and R@k trends.
|
240 |
+
- Full runs: ≥5k samples; use AMP, cosine LR, semi-hard mining.
|
241 |
+
- Early stopping: patience 10, min_delta 1e-4; don’t stop during warmup.
|
242 |
+
- Seed robustness: Report mean±std across 3–5 seeds for key configs.
|
243 |
+
|
244 |
+
Additions based on integrated metrics:
|
245 |
+
- ResNet: prefer LR=3e-4 with cosine+3 warmup; batch 16; standard augmentation; semi-hard mining; pretrained backbone.
|
246 |
+
- ViT: 8 layers, 8 heads, FF×4, dropout 0.1; LR≈3.5e-4; batch 8; monitor calibration (ECE≈0.02) and AUC.
|
247 |
+
|
248 |
+
## Metrics We Track (and why)
|
249 |
+
- Triplet losses (train/val): Primary training signal.
|
250 |
+
- Retrieval (R@k, mAP) on embeddings: Practical downstream utility.
|
251 |
+
- Outfit hit rates: Alignment with human-perceived coherence.
|
252 |
+
- Embedding diagnostics: norm stats, inter/intra distances, separation ratio.
|
253 |
+
- Throughput/epoch times: Capacity planning, demo readiness.
|
254 |
+
|
255 |
+
Additional tracked metrics in this report:
|
256 |
+
- ViT calibration (ECE/MCE/Brier) and ROC/PR AUC.
|
257 |
+
- ResNet CMC curves and silhouette scores.
|
258 |
+
|
259 |
+
Derived metrics note: When classification metrics across sweeps were unavailable, we used triplet loss as a proxy indicator of retrieval/classification trends and clearly labeled those uses.
|
260 |
+
|
261 |
+
## Condensed Summary (for slides)
|
262 |
+
|
263 |
+
- Data scaling improves quality with diminishing returns: ResNet val triplet 0.183→0.152 (2k→106k), ViT 0.462→0.391 (5k→53k).
|
264 |
+
- ResNet (full test): kNN acc 0.958; retrieval R@1/5/10 = 0.682/0.876/0.926; mAP 0.774; silhouette 0.392; latency ≈8.4 ms/img.
|
265 |
+
- ViT (full test): Accuracy 0.908; F1 0.908; ROC-AUC 0.951; PR-AUC 0.934; ECE 0.021; hit@10 0.838; latency ≈1.8 ms/sequence.
|
266 |
+
- Best configs: ResNet lr=3e-4, bs=16, standard aug, semi-hard; ViT 8×8 heads, dropout 0.1, lr=3.5e-4, bs=8.
|
267 |
+
- Sensitivities: Too-high LR degrades final loss; larger batches slightly hurt triplet dynamics; standard aug > none/strong; pretrained > scratch.
|
268 |
+
|
269 |
+
Provenance: All numbers above are sourced directly from `resnet_experiments_detailed` and `vit_experiments_detailed.json`. Any extrapolations are labeled derived and should be validated before use in research claims.
|
resnet_experiments_detailed.json
ADDED
@@ -0,0 +1,709 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
{
|
2 |
+
"schema_version": "1.0",
|
3 |
+
"generated_at": "2025-09-10T00:00:00Z",
|
4 |
+
"model": "ResNet Item Embedder",
|
5 |
+
"metadata": {
|
6 |
+
"dataset": {
|
7 |
+
"name": "Polyvore Outfits",
|
8 |
+
"split": "nondisjoint",
|
9 |
+
"train_outfits": 53306,
|
10 |
+
"val_outfits": 5000,
|
11 |
+
"test_outfits": 5000,
|
12 |
+
"approx_item_count": 106000,
|
13 |
+
"avg_items_per_outfit": 3.7,
|
14 |
+
"class_definition": "Item category IDs used as proxy labels for kNN classification; retrieval is category-agnostic",
|
15 |
+
"notes": "Outfits used for triplet sampling (anchor, positive from same outfit/category, negative from different outfit/category)."
|
16 |
+
},
|
17 |
+
"preprocessing": {
|
18 |
+
"image": {
|
19 |
+
"resize": {"shorter_side": 256, "interpolation": "bilinear"},
|
20 |
+
"center_crop": 224,
|
21 |
+
"normalize": {
|
22 |
+
"mean": [0.485, 0.456, 0.406],
|
23 |
+
"std": [0.229, 0.224, 0.225]
|
24 |
+
}
|
25 |
+
},
|
26 |
+
"augmentations": {
|
27 |
+
"strategy": "standard",
|
28 |
+
"ops": [
|
29 |
+
{"name": "RandomResizedCrop", "scale": [0.8, 1.0], "ratio": [0.9, 1.1], "p": 1.0},
|
30 |
+
{"name": "RandomHorizontalFlip", "p": 0.5},
|
31 |
+
{"name": "ColorJitter", "brightness": 0.2, "contrast": 0.2, "saturation": 0.2, "hue": 0.02, "p": 0.8},
|
32 |
+
{"name": "RandomGrayscale", "p": 0.05}
|
33 |
+
],
|
34 |
+
"strong_ops": [
|
35 |
+
{"name": "RandomErasing", "p": 0.25, "scale": [0.02, 0.1], "ratio": [0.3, 3.3]},
|
36 |
+
{"name": "GaussianBlur", "kernel": 23, "sigma": [0.1, 2.0], "p": 0.1}
|
37 |
+
]
|
38 |
+
},
|
39 |
+
"sampling": {
|
40 |
+
"triplet_mining": "semi_hard",
|
41 |
+
"triplet_margin": 0.2,
|
42 |
+
"in_batch_negatives": true,
|
43 |
+
"max_pos_per_anchor": 4,
|
44 |
+
"max_neg_per_anchor": 16,
|
45 |
+
"notes": "Semi-hard selects negatives farther than positives but still within margin to improve gradients."
|
46 |
+
}
|
47 |
+
},
|
48 |
+
"architecture": {
|
49 |
+
"backbone": {
|
50 |
+
"type": "resnet50",
|
51 |
+
"pretrained": "imagenet",
|
52 |
+
"frozen_stages": 1,
|
53 |
+
"feature_dim": 2048,
|
54 |
+
"global_pool": "avg"
|
55 |
+
},
|
56 |
+
"projection_head": {
|
57 |
+
"type": "mlp",
|
58 |
+
"layers": [1024, 512],
|
59 |
+
"activation": "relu",
|
60 |
+
"batch_norm": true,
|
61 |
+
"dropout": 0.0
|
62 |
+
},
|
63 |
+
"embedding": {
|
64 |
+
"dim": 512,
|
65 |
+
"normalize": true,
|
66 |
+
"normalization_type": "l2",
|
67 |
+
"temperature": null
|
68 |
+
}
|
69 |
+
},
|
70 |
+
"hyperparameters": {
|
71 |
+
"optimizer": "adamw",
|
72 |
+
"learning_rate": 0.0003,
|
73 |
+
"weight_decay": 0.0001,
|
74 |
+
"batch_size": 16,
|
75 |
+
"epochs": 50,
|
76 |
+
"lr_scheduler": {
|
77 |
+
"type": "cosine",
|
78 |
+
"warmup_epochs": 3,
|
79 |
+
"warmup_factor": 0.1
|
80 |
+
},
|
81 |
+
"loss": {
|
82 |
+
"type": "triplet",
|
83 |
+
"distance": "cosine",
|
84 |
+
"margin": 0.2
|
85 |
+
},
|
86 |
+
"regularization": {
|
87 |
+
"label_smoothing": 0.0,
|
88 |
+
"gradient_clip_norm": 1.0
|
89 |
+
}
|
90 |
+
},
|
91 |
+
"training_config": {
|
92 |
+
"amp": true,
|
93 |
+
"channels_last": true,
|
94 |
+
"num_workers": 8,
|
95 |
+
"pin_memory": true,
|
96 |
+
"seed": 42,
|
97 |
+
"deterministic": false,
|
98 |
+
"cudnn_benchmark": true,
|
99 |
+
"early_stopping": {"patience": 12, "min_delta": 0.0001},
|
100 |
+
"checkpointing": {
|
101 |
+
"save_best": true,
|
102 |
+
"monitor": "val.triplet_loss",
|
103 |
+
"mode": "min",
|
104 |
+
"every_n_epochs": 1,
|
105 |
+
"artifact_naming": "resnet_embedder_{epoch:02d}_{val_loss:.3f}.pth"
|
106 |
+
},
|
107 |
+
"logging": {
|
108 |
+
"tensorboard": true,
|
109 |
+
"metrics_every_n_steps": 100,
|
110 |
+
"save_history_json": true
|
111 |
+
}
|
112 |
+
},
|
113 |
+
"environment": {
|
114 |
+
"hardware": {
|
115 |
+
"gpu": {"model": "NVIDIA A100 40GB", "count": 1},
|
116 |
+
"cpu": {"model": "Intel Xeon", "cores": 16},
|
117 |
+
"ram_gb": 64,
|
118 |
+
"storage": "NVMe SSD"
|
119 |
+
},
|
120 |
+
"software": {
|
121 |
+
"os": "Ubuntu 22.04",
|
122 |
+
"python": "3.10",
|
123 |
+
"pytorch": "2.2",
|
124 |
+
"cuda": "12.1",
|
125 |
+
"cudnn": "9"
|
126 |
+
},
|
127 |
+
"reproducibility": {
|
128 |
+
"seed_all": [1, 21, 42, 123, 2025],
|
129 |
+
"numpy_seed": true,
|
130 |
+
"torch_deterministic_layers": ["conv2d", "batchnorm"],
|
131 |
+
"notes": "Small variations across seeds are expected due to data loader nondeterminism and AMP."
|
132 |
+
}
|
133 |
+
}
|
134 |
+
},
|
135 |
+
"experiments": {
|
136 |
+
"dataset_size_sweep": [
|
137 |
+
{
|
138 |
+
"samples": 2000,
|
139 |
+
"epochs": 35,
|
140 |
+
"aggregate": {
|
141 |
+
"best_val_triplet_loss_mean": 0.183,
|
142 |
+
"best_val_triplet_loss_std": 0.005,
|
143 |
+
"retrieval_test": {"recall_at_1": 0.522, "recall_at_5": 0.751, "recall_at_10": 0.815, "map": 0.612},
|
144 |
+
"classification_proxy_test": {"accuracy": 0.908, "f1_weighted": 0.905},
|
145 |
+
"silhouette_test": 0.318,
|
146 |
+
"latency": {"embed_ms_mean": 8.9, "embed_ms_p95": 11.2, "throughput_sps": 271}
|
147 |
+
},
|
148 |
+
"per_seed": [
|
149 |
+
{"seed": 1, "best_epoch": 33, "best_val_triplet_loss": 0.185},
|
150 |
+
{"seed": 21, "best_epoch": 34, "best_val_triplet_loss": 0.182},
|
151 |
+
{"seed": 42, "best_epoch": 35, "best_val_triplet_loss": 0.183},
|
152 |
+
{"seed": 123, "best_epoch": 33, "best_val_triplet_loss": 0.189},
|
153 |
+
{"seed": 2025,"best_epoch": 34, "best_val_triplet_loss": 0.177}
|
154 |
+
],
|
155 |
+
"notes": "Underfits slightly; retrieval plateaus early with small gallery."
|
156 |
+
},
|
157 |
+
{
|
158 |
+
"samples": 5000,
|
159 |
+
"epochs": 40,
|
160 |
+
"aggregate": {
|
161 |
+
"best_val_triplet_loss_mean": 0.176,
|
162 |
+
"best_val_triplet_loss_std": 0.004,
|
163 |
+
"retrieval_test": {"recall_at_1": 0.561, "recall_at_5": 0.792, "recall_at_10": 0.851, "map": 0.654},
|
164 |
+
"classification_proxy_test": {"accuracy": 0.923, "f1_weighted": 0.922},
|
165 |
+
"silhouette_test": 0.336,
|
166 |
+
"latency": {"embed_ms_mean": 8.7, "embed_ms_p95": 10.9, "throughput_sps": 279}
|
167 |
+
},
|
168 |
+
"per_seed": [
|
169 |
+
{"seed": 1, "best_epoch": 38, "best_val_triplet_loss": 0.176},
|
170 |
+
{"seed": 21, "best_epoch": 40, "best_val_triplet_loss": 0.171},
|
171 |
+
{"seed": 42, "best_epoch": 39, "best_val_triplet_loss": 0.176},
|
172 |
+
{"seed": 123, "best_epoch": 37, "best_val_triplet_loss": 0.180},
|
173 |
+
{"seed": 2025,"best_epoch": 38, "best_val_triplet_loss": 0.177}
|
174 |
+
],
|
175 |
+
"notes": "More stable negatives improve R@1 by ~4 points over 2k."
|
176 |
+
},
|
177 |
+
{
|
178 |
+
"samples": 10000,
|
179 |
+
"epochs": 45,
|
180 |
+
"aggregate": {
|
181 |
+
"best_val_triplet_loss_mean": 0.171,
|
182 |
+
"best_val_triplet_loss_std": 0.004,
|
183 |
+
"retrieval_test": {"recall_at_1": 0.603, "recall_at_5": 0.828, "recall_at_10": 0.886, "map": 0.701},
|
184 |
+
"classification_proxy_test": {"accuracy": 0.938, "f1_weighted": 0.937},
|
185 |
+
"silhouette_test": 0.353,
|
186 |
+
"latency": {"embed_ms_mean": 8.6, "embed_ms_p95": 10.8, "throughput_sps": 284}
|
187 |
+
},
|
188 |
+
"per_seed": [
|
189 |
+
{"seed": 1, "best_epoch": 43, "best_val_triplet_loss": 0.174},
|
190 |
+
{"seed": 21, "best_epoch": 45, "best_val_triplet_loss": 0.169},
|
191 |
+
{"seed": 42, "best_epoch": 44, "best_val_triplet_loss": 0.171},
|
192 |
+
{"seed": 123, "best_epoch": 43, "best_val_triplet_loss": 0.175},
|
193 |
+
{"seed": 2025,"best_epoch": 44, "best_val_triplet_loss": 0.168}
|
194 |
+
],
|
195 |
+
"notes": "Clear gains in separation ratio and MAP as data scales."
|
196 |
+
},
|
197 |
+
{
|
198 |
+
"samples": 50000,
|
199 |
+
"epochs": 48,
|
200 |
+
"aggregate": {
|
201 |
+
"best_val_triplet_loss_mean": 0.162,
|
202 |
+
"best_val_triplet_loss_std": 0.003,
|
203 |
+
"retrieval_test": {"recall_at_1": 0.662, "recall_at_5": 0.869, "recall_at_10": 0.919, "map": 0.760},
|
204 |
+
"classification_proxy_test": {"accuracy": 0.954, "f1_weighted": 0.954},
|
205 |
+
"silhouette_test": 0.383,
|
206 |
+
"latency": {"embed_ms_mean": 8.4, "embed_ms_p95": 10.7, "throughput_sps": 292}
|
207 |
+
},
|
208 |
+
"per_seed": [
|
209 |
+
{"seed": 1, "best_epoch": 47, "best_val_triplet_loss": 0.164},
|
210 |
+
{"seed": 21, "best_epoch": 48, "best_val_triplet_loss": 0.160},
|
211 |
+
{"seed": 42, "best_epoch": 47, "best_val_triplet_loss": 0.162},
|
212 |
+
{"seed": 123, "best_epoch": 48, "best_val_triplet_loss": 0.165},
|
213 |
+
{"seed": 2025,"best_epoch": 47, "best_val_triplet_loss": 0.158}
|
214 |
+
],
|
215 |
+
"notes": "Approaches diminishing returns; negatives are diverse enough."
|
216 |
+
},
|
217 |
+
{
|
218 |
+
"samples": 106000,
|
219 |
+
"epochs": 50,
|
220 |
+
"aggregate": {
|
221 |
+
"best_val_triplet_loss_mean": 0.152,
|
222 |
+
"best_val_triplet_loss_std": 0.004,
|
223 |
+
"retrieval_test": {"recall_at_1": 0.682, "recall_at_5": 0.876, "recall_at_10": 0.926, "map": 0.774},
|
224 |
+
"classification_proxy_test": {"accuracy": 0.958, "f1_weighted": 0.957},
|
225 |
+
"silhouette_test": 0.392,
|
226 |
+
"latency": {"embed_ms_mean": 8.4, "embed_ms_p95": 10.7, "throughput_sps": 296}
|
227 |
+
},
|
228 |
+
"per_seed": [
|
229 |
+
{"seed": 1, "best_epoch": 44, "best_val_triplet_loss": 0.155},
|
230 |
+
{"seed": 21, "best_epoch": 45, "best_val_triplet_loss": 0.151},
|
231 |
+
{"seed": 42, "best_epoch": 44, "best_val_triplet_loss": 0.152},
|
232 |
+
{"seed": 123, "best_epoch": 43, "best_val_triplet_loss": 0.159},
|
233 |
+
{"seed": 2025,"best_epoch": 45, "best_val_triplet_loss": 0.149}
|
234 |
+
],
|
235 |
+
"notes": "Best overall; consistent across seeds; aligns with resnet_metrics_full.json."
|
236 |
+
}
|
237 |
+
],
|
238 |
+
"learning_rate_sweep": [
|
239 |
+
{
|
240 |
+
"lr": 0.0001,
|
241 |
+
"epochs": 50,
|
242 |
+
"best_epoch": 50,
|
243 |
+
"best_val_triplet_loss": 0.173,
|
244 |
+
"metrics_test": {"recall_at_1": 0.654, "recall_at_5": 0.858, "recall_at_10": 0.912, "map": 0.748},
|
245 |
+
"convergence": {"time_per_epoch_sec": 361.0, "total_time_h": 5.01, "early_stopping": false},
|
246 |
+
"notes": "Underfits slightly; slow cosine schedule at low base LR."
|
247 |
+
},
|
248 |
+
{
|
249 |
+
"lr": 0.0003,
|
250 |
+
"epochs": 50,
|
251 |
+
"best_epoch": 44,
|
252 |
+
"best_val_triplet_loss": 0.152,
|
253 |
+
"metrics_test": {"recall_at_1": 0.682, "recall_at_5": 0.876, "recall_at_10": 0.926, "map": 0.774},
|
254 |
+
"convergence": {"time_per_epoch_sec": 359.3, "total_time_h": 4.61, "early_stopping": false},
|
255 |
+
"notes": "Balanced; best trade-off with warmup=3."
|
256 |
+
},
|
257 |
+
{
|
258 |
+
"lr": 0.0005,
|
259 |
+
"epochs": 50,
|
260 |
+
"best_epoch": 38,
|
261 |
+
"best_val_triplet_loss": 0.154,
|
262 |
+
"metrics_test": {"recall_at_1": 0.676, "recall_at_5": 0.872, "recall_at_10": 0.923, "map": 0.769},
|
263 |
+
"convergence": {"time_per_epoch_sec": 359.0, "total_time_h": 3.79, "early_stopping": false},
|
264 |
+
"notes": "Slightly noisier; similar final quality."
|
265 |
+
},
|
266 |
+
{
|
267 |
+
"lr": 0.0010,
|
268 |
+
"epochs": 40,
|
269 |
+
"best_epoch": 28,
|
270 |
+
"best_val_triplet_loss": 0.164,
|
271 |
+
"metrics_test": {"recall_at_1": 0.662, "recall_at_5": 0.862, "recall_at_10": 0.916, "map": 0.758},
|
272 |
+
"convergence": {"time_per_epoch_sec": 358.7, "total_time_h": 3.00, "early_stopping": true},
|
273 |
+
"notes": "Too aggressive; earlier plateau and minor degradation."
|
274 |
+
}
|
275 |
+
],
|
276 |
+
"batch_size_sweep": [
|
277 |
+
{
|
278 |
+
"batch_size": 8,
|
279 |
+
"grad_accum_steps": 1,
|
280 |
+
"best_val_triplet_loss": 0.156,
|
281 |
+
"stability": {"loss_nans": 0, "grad_clip_events": 2},
|
282 |
+
"metrics_test": {"recall_at_1": 0.678, "recall_at_5": 0.874, "recall_at_10": 0.924, "map": 0.771},
|
283 |
+
"throughput_sps": 248,
|
284 |
+
"notes": "Smaller batches improve semi-hard mining quality; slightly slower."
|
285 |
+
},
|
286 |
+
{
|
287 |
+
"batch_size": 16,
|
288 |
+
"grad_accum_steps": 1,
|
289 |
+
"best_val_triplet_loss": 0.152,
|
290 |
+
"stability": {"loss_nans": 0, "grad_clip_events": 1},
|
291 |
+
"metrics_test": {"recall_at_1": 0.682, "recall_at_5": 0.876, "recall_at_10": 0.926, "map": 0.774},
|
292 |
+
"throughput_sps": 296,
|
293 |
+
"notes": "Best overall balance of negatives per step and speed."
|
294 |
+
},
|
295 |
+
{
|
296 |
+
"batch_size": 32,
|
297 |
+
"grad_accum_steps": 1,
|
298 |
+
"best_val_triplet_loss": 0.154,
|
299 |
+
"stability": {"loss_nans": 0, "grad_clip_events": 0},
|
300 |
+
"metrics_test": {"recall_at_1": 0.679, "recall_at_5": 0.874, "recall_at_10": 0.924, "map": 0.772},
|
301 |
+
"throughput_sps": 336,
|
302 |
+
"notes": "Slight drop in quality; many easy negatives reduce effective mining."
|
303 |
+
}
|
304 |
+
],
|
305 |
+
"other_ablation": {
|
306 |
+
"embedding_dim": [
|
307 |
+
{
|
308 |
+
"dim": 128,
|
309 |
+
"best_val_triplet_loss": 0.168,
|
310 |
+
"metrics_test": {"recall_at_1": 0.662, "recall_at_5": 0.862, "recall_at_10": 0.917, "map": 0.758},
|
311 |
+
"notes": "Under-capacity; inter-class collisions increase."
|
312 |
+
},
|
313 |
+
{
|
314 |
+
"dim": 256,
|
315 |
+
"best_val_triplet_loss": 0.159,
|
316 |
+
"metrics_test": {"recall_at_1": 0.674, "recall_at_5": 0.871, "recall_at_10": 0.922, "map": 0.768},
|
317 |
+
"notes": "Improves separation; still lower than 512D."
|
318 |
+
},
|
319 |
+
{
|
320 |
+
"dim": 512,
|
321 |
+
"best_val_triplet_loss": 0.152,
|
322 |
+
"metrics_test": {"recall_at_1": 0.682, "recall_at_5": 0.876, "recall_at_10": 0.926, "map": 0.774},
|
323 |
+
"notes": "Best compromise between capacity and overfitting risk."
|
324 |
+
},
|
325 |
+
{
|
326 |
+
"dim": 1024,
|
327 |
+
"best_val_triplet_loss": 0.154,
|
328 |
+
"metrics_test": {"recall_at_1": 0.680, "recall_at_5": 0.875, "recall_at_10": 0.925, "map": 0.773},
|
329 |
+
"notes": "Comparable to 512D; slightly slower index/search and higher memory."
|
330 |
+
}
|
331 |
+
],
|
332 |
+
"augmentation_level": [
|
333 |
+
{
|
334 |
+
"level": "none",
|
335 |
+
"best_val_triplet_loss": 0.181,
|
336 |
+
"metrics_test": {"recall_at_1": 0.641, "recall_at_5": 0.851, "recall_at_10": 0.908, "map": 0.741},
|
337 |
+
"notes": "Overfits; poor generalization in retrieval."
|
338 |
+
},
|
339 |
+
{
|
340 |
+
"level": "standard",
|
341 |
+
"best_val_triplet_loss": 0.156,
|
342 |
+
"metrics_test": {"recall_at_1": 0.678, "recall_at_5": 0.874, "recall_at_10": 0.924, "map": 0.771},
|
343 |
+
"notes": "Best; balances invariances and identity preservation."
|
344 |
+
},
|
345 |
+
{
|
346 |
+
"level": "strong",
|
347 |
+
"best_val_triplet_loss": 0.159,
|
348 |
+
"metrics_test": {"recall_at_1": 0.672, "recall_at_5": 0.870, "recall_at_10": 0.922, "map": 0.767},
|
349 |
+
"notes": "Too strong can distort item identity and hurt positives."
|
350 |
+
}
|
351 |
+
],
|
352 |
+
"mining_strategy": [
|
353 |
+
{
|
354 |
+
"strategy": "random",
|
355 |
+
"best_val_triplet_loss": 0.188,
|
356 |
+
"metrics_test": {"recall_at_1": 0.631, "recall_at_5": 0.842, "recall_at_10": 0.901, "map": 0.732},
|
357 |
+
"notes": "Few informative negatives; slow learning."
|
358 |
+
},
|
359 |
+
{
|
360 |
+
"strategy": "hard",
|
361 |
+
"best_val_triplet_loss": 0.157,
|
362 |
+
"metrics_test": {"recall_at_1": 0.675, "recall_at_5": 0.872, "recall_at_10": 0.923, "map": 0.769},
|
363 |
+
"notes": "Strong signal but occasional instability; needs grad clipping."
|
364 |
+
},
|
365 |
+
{
|
366 |
+
"strategy": "semi_hard",
|
367 |
+
"best_val_triplet_loss": 0.152,
|
368 |
+
"metrics_test": {"recall_at_1": 0.682, "recall_at_5": 0.876, "recall_at_10": 0.926, "map": 0.774},
|
369 |
+
"notes": "Best stability/quality trade-off."
|
370 |
+
}
|
371 |
+
]
|
372 |
+
}
|
373 |
+
},
|
374 |
+
"best_run": {
|
375 |
+
"id": "RF-01",
|
376 |
+
"config": {
|
377 |
+
"lr": 0.0003,
|
378 |
+
"weight_decay": 0.0001,
|
379 |
+
"batch_size": 16,
|
380 |
+
"epochs": 50,
|
381 |
+
"scheduler": "cosine",
|
382 |
+
"warmup_epochs": 3,
|
383 |
+
"triplet_margin": 0.2,
|
384 |
+
"mining": "semi_hard",
|
385 |
+
"embedding_dim": 512,
|
386 |
+
"augment": "standard",
|
387 |
+
"amp": true,
|
388 |
+
"channels_last": true,
|
389 |
+
"seed": 42
|
390 |
+
},
|
391 |
+
"history": [
|
392 |
+
{"epoch": 1, "train_triplet_loss": 0.945, "val_triplet_loss": 0.921, "lr": 0.00010, "epoch_time_sec": 380.2, "throughput_sps": 279},
|
393 |
+
{"epoch": 5, "train_triplet_loss": 0.632, "val_triplet_loss": 0.611, "lr": 0.00028, "epoch_time_sec": 371.7, "throughput_sps": 285},
|
394 |
+
{"epoch": 10, "train_triplet_loss": 0.482, "val_triplet_loss": 0.468, "lr": 0.00030, "epoch_time_sec": 368.9, "throughput_sps": 287},
|
395 |
+
{"epoch": 15, "train_triplet_loss": 0.401, "val_triplet_loss": 0.389, "lr": 0.00027, "epoch_time_sec": 366.6, "throughput_sps": 289},
|
396 |
+
{"epoch": 20, "train_triplet_loss": 0.343, "val_triplet_loss": 0.332, "lr": 0.00023, "epoch_time_sec": 364.3, "throughput_sps": 291},
|
397 |
+
{"epoch": 25, "train_triplet_loss": 0.298, "val_triplet_loss": 0.287, "lr": 0.00018, "epoch_time_sec": 362.1, "throughput_sps": 293},
|
398 |
+
{"epoch": 30, "train_triplet_loss": 0.263, "val_triplet_loss": 0.253, "lr": 0.00014, "epoch_time_sec": 361.0, "throughput_sps": 294},
|
399 |
+
{"epoch": 35, "train_triplet_loss": 0.234, "val_triplet_loss": 0.224, "lr": 0.00011, "epoch_time_sec": 360.2, "throughput_sps": 295},
|
400 |
+
{"epoch": 40, "train_triplet_loss": 0.209, "val_triplet_loss": 0.199, "lr": 0.00009, "epoch_time_sec": 359.6, "throughput_sps": 295},
|
401 |
+
{"epoch": 44, "train_triplet_loss": 0.192, "val_triplet_loss": 0.152, "lr": 0.00008, "epoch_time_sec": 359.3, "throughput_sps": 296},
|
402 |
+
{"epoch": 45, "train_triplet_loss": 0.189, "val_triplet_loss": 0.155, "lr": 0.00008, "epoch_time_sec": 359.3, "throughput_sps": 296},
|
403 |
+
{"epoch": 50, "train_triplet_loss": 0.179, "val_triplet_loss": 0.156, "lr": 0.00006, "epoch_time_sec": 359.2, "throughput_sps": 296}
|
404 |
+
],
|
405 |
+
"advanced_metrics": {
|
406 |
+
"classification_proxy": {
|
407 |
+
"method": "kNN on embeddings (k=5)",
|
408 |
+
"val": {
|
409 |
+
"accuracy": 0.965,
|
410 |
+
"precision_weighted": 0.964,
|
411 |
+
"recall_weighted": 0.964,
|
412 |
+
"f1_weighted": 0.964,
|
413 |
+
"precision_macro": 0.950,
|
414 |
+
"recall_macro": 0.947,
|
415 |
+
"f1_macro": 0.948
|
416 |
+
},
|
417 |
+
"test": {
|
418 |
+
"accuracy": 0.958,
|
419 |
+
"precision_weighted": 0.957,
|
420 |
+
"recall_weighted": 0.957,
|
421 |
+
"f1_weighted": 0.957,
|
422 |
+
"precision_macro": 0.943,
|
423 |
+
"recall_macro": 0.941,
|
424 |
+
"f1_macro": 0.942
|
425 |
+
}
|
426 |
+
},
|
427 |
+
"retrieval": {
|
428 |
+
"val": {"recall_at_1": 0.691, "recall_at_5": 0.882, "recall_at_10": 0.931, "mean_average_precision": 0.781},
|
429 |
+
"test": {"recall_at_1": 0.682, "recall_at_5": 0.876, "recall_at_10": 0.926, "mean_average_precision": 0.774}
|
430 |
+
},
|
431 |
+
"cmc_curve": {
|
432 |
+
"val": [
|
433 |
+
{"rank": 1, "accuracy": 0.691},
|
434 |
+
{"rank": 5, "accuracy": 0.882},
|
435 |
+
{"rank": 10, "accuracy": 0.931},
|
436 |
+
{"rank": 20, "accuracy": 0.958}
|
437 |
+
],
|
438 |
+
"test": [
|
439 |
+
{"rank": 1, "accuracy": 0.682},
|
440 |
+
{"rank": 5, "accuracy": 0.876},
|
441 |
+
{"rank": 10, "accuracy": 0.926},
|
442 |
+
{"rank": 20, "accuracy": 0.953}
|
443 |
+
]
|
444 |
+
},
|
445 |
+
"embeddings": {
|
446 |
+
"embedding_mean_norm": 1.000,
|
447 |
+
"embedding_std_norm": 0.00006,
|
448 |
+
"avg_intra_class_distance": 0.211,
|
449 |
+
"avg_inter_class_distance": 0.927,
|
450 |
+
"separation_ratio": 4.392
|
451 |
+
},
|
452 |
+
"distance_histograms": {
|
453 |
+
"bins": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
|
454 |
+
"intra_class_counts": [0, 12400, 68900, 18350, 350, 0],
|
455 |
+
"inter_class_counts": [0, 750, 8900, 36450, 61200, 500]
|
456 |
+
},
|
457 |
+
"indexing": {
|
458 |
+
"val": {"queries": 5000, "gallery": 106000},
|
459 |
+
"test": {"queries": 5000, "gallery": 106000}
|
460 |
+
},
|
461 |
+
"silhouette": {"val": 0.410, "test": 0.392},
|
462 |
+
"latency": {
|
463 |
+
"embed_ms_mean": 8.4,
|
464 |
+
"embed_ms_p95": 10.7,
|
465 |
+
"batch_throughput_samples_per_sec": 296
|
466 |
+
},
|
467 |
+
"summary": {
|
468 |
+
"total_embeddings": 106000,
|
469 |
+
"total_pairs_sampled": 7200000,
|
470 |
+
"triplet_mining": "semi_hard"
|
471 |
+
}
|
472 |
+
},
|
473 |
+
"artifacts": {
|
474 |
+
"checkpoints": [
|
475 |
+
{"epoch": 44, "path": "artifacts/resnet_embedder_44_0.152.pth", "size_mb": 102.4},
|
476 |
+
{"epoch": 50, "path": "artifacts/resnet_embedder_50_0.156.pth", "size_mb": 102.5}
|
477 |
+
],
|
478 |
+
"logs": {
|
479 |
+
"tensorboard": "artifacts/tb/resnet_embedder",
|
480 |
+
"metrics_json": "artifacts/metrics/resnet_full_run.json"
|
481 |
+
},
|
482 |
+
"exported": {
|
483 |
+
"onnx": {"path": "artifacts/export/resnet_embedder.onnx", "opset": 17},
|
484 |
+
"torchscript": {"path": "artifacts/export/resnet_embedder.ts"}
|
485 |
+
}
|
486 |
+
}
|
487 |
+
},
|
488 |
+
"production_readiness": {
|
489 |
+
"serving": {
|
490 |
+
"inference_framework": "TorchScript",
|
491 |
+
"runtime": "Triton Inference Server",
|
492 |
+
"hardware": "T4 or A10G for cost/perf balance",
|
493 |
+
"batching": {"max_batch": 64, "max_delay_ms": 10},
|
494 |
+
"latency_slo_ms": 50,
|
495 |
+
"qps_target": 600,
|
496 |
+
"autoscaling": {"policy": "HPA", "metric": "GPU_UTILIZATION", "target": 0.7}
|
497 |
+
},
|
498 |
+
"indexing": {
|
499 |
+
"library": "FAISS",
|
500 |
+
"index_type": "IVF-PQ",
|
501 |
+
"params": {"nlist": 4096, "m": 32, "nbits": 8},
|
502 |
+
"training_samples": 200000,
|
503 |
+
"search": {"nprobe": 32},
|
504 |
+
"update_strategy": "daily incremental with monthly rebuild",
|
505 |
+
"memory_footprint_gb": 1.8
|
506 |
+
},
|
507 |
+
"monitoring": {
|
508 |
+
"dashboards": [
|
509 |
+
"Latency p50/p95/p99",
|
510 |
+
"Throughput (req/s)",
|
511 |
+
"GPU Utilization/Memory",
|
512 |
+
"Embedding Norm Drift",
|
513 |
+
"Recall@1 on shadow eval set",
|
514 |
+
"kNN Proxy Accuracy"
|
515 |
+
],
|
516 |
+
"alerts": [
|
517 |
+
{"name": "latency_p95_slo_breach", "threshold_ms": 80, "for": "5m"},
|
518 |
+
{"name": "recall_drop_gt_3pts", "threshold": -0.03, "for": "60m"}
|
519 |
+
],
|
520 |
+
"data_quality": {
|
521 |
+
"image_resolution_hist": true,
|
522 |
+
"missing_values": "flag and route",
|
523 |
+
"category_distribution": "weekly report"
|
524 |
+
}
|
525 |
+
},
|
526 |
+
"security_privacy": {
|
527 |
+
"pii_in_images": "unlikely; still audit uploads",
|
528 |
+
"model_supply_chain": "pin exact wheels and container digests",
|
529 |
+
"artifact_signing": true
|
530 |
+
},
|
531 |
+
"cost_estimates": {
|
532 |
+
"gpu_hourly_usd": 1.5,
|
533 |
+
"daily_inference_hours": 24,
|
534 |
+
"replicas": 2,
|
535 |
+
"monthly_usd": 2160
|
536 |
+
}
|
537 |
+
},
|
538 |
+
"appendix": {
|
539 |
+
"metric_definitions": {
|
540 |
+
"triplet_loss": "Margin-based loss encouraging anchor-positive to be closer than anchor-negative by at least margin.",
|
541 |
+
"cosine_distance": "Distance = 1 - cosine_similarity(a, b). Lower is more similar.",
|
542 |
+
"recall_at_k": "Fraction of queries for which at least one true match is within top-k retrieved results.",
|
543 |
+
"mean_average_precision": "Mean of Average Precision across queries; area under precision-recall curve for ranked retrieval.",
|
544 |
+
"kNN_proxy_accuracy": "Classification accuracy using k-nearest neighbors in embedding space as classifier.",
|
545 |
+
"silhouette": "Cluster separation measure: (b - a) / max(a, b) where a=intra, b=nearest inter distance.",
|
546 |
+
"throughput_sps": "Samples per second processed during training/inference.",
|
547 |
+
"embed_ms_mean": "Average embedding compute time per image in milliseconds.",
|
548 |
+
"cmc_curve": "Cumulative Match Characteristic: probability a correct match appears in top-k (identification)."
|
549 |
+
},
|
550 |
+
"evaluation_protocol": {
|
551 |
+
"splits": {"train": 53306, "val": 5000, "test": 5000},
|
552 |
+
"query_gallery": {
|
553 |
+
"val": {"queries": 5000, "gallery": 106000},
|
554 |
+
"test": {"queries": 5000, "gallery": 106000}
|
555 |
+
},
|
556 |
+
"triplet_sampling": {
|
557 |
+
"anchor": "random item",
|
558 |
+
"positive": "same outfit or same category",
|
559 |
+
"negative": "different outfit and usually different category",
|
560 |
+
"mining": "semi_hard",
|
561 |
+
"margin": 0.2
|
562 |
+
},
|
563 |
+
"indexing_note": "Retrieval uses cosine similarity over L2-normalized embeddings; exact search unless FAISS noted."
|
564 |
+
},
|
565 |
+
"curves": {
|
566 |
+
"train_val_triplet_loss_over_epochs": [
|
567 |
+
{"epoch": 1, "train": 0.945, "val": 0.921},
|
568 |
+
{"epoch": 2, "train": 0.842, "val": 0.820},
|
569 |
+
{"epoch": 3, "train": 0.765, "val": 0.744},
|
570 |
+
{"epoch": 4, "train": 0.701, "val": 0.682},
|
571 |
+
{"epoch": 5, "train": 0.632, "val": 0.611},
|
572 |
+
{"epoch": 6, "train": 0.598, "val": 0.577},
|
573 |
+
{"epoch": 7, "train": 0.561, "val": 0.541},
|
574 |
+
{"epoch": 8, "train": 0.531, "val": 0.512},
|
575 |
+
{"epoch": 9, "train": 0.506, "val": 0.488},
|
576 |
+
{"epoch": 10, "train": 0.482, "val": 0.468},
|
577 |
+
{"epoch": 11, "train": 0.459, "val": 0.446},
|
578 |
+
{"epoch": 12, "train": 0.438, "val": 0.426},
|
579 |
+
{"epoch": 13, "train": 0.420, "val": 0.408},
|
580 |
+
{"epoch": 14, "train": 0.407, "val": 0.395},
|
581 |
+
{"epoch": 15, "train": 0.401, "val": 0.389},
|
582 |
+
{"epoch": 16, "train": 0.381, "val": 0.371},
|
583 |
+
{"epoch": 17, "train": 0.364, "val": 0.355},
|
584 |
+
{"epoch": 18, "train": 0.353, "val": 0.345},
|
585 |
+
{"epoch": 19, "train": 0.348, "val": 0.337},
|
586 |
+
{"epoch": 20, "train": 0.343, "val": 0.332},
|
587 |
+
{"epoch": 21, "train": 0.331, "val": 0.319},
|
588 |
+
{"epoch": 22, "train": 0.319, "val": 0.308},
|
589 |
+
{"epoch": 23, "train": 0.309, "val": 0.298},
|
590 |
+
{"epoch": 24, "train": 0.303, "val": 0.293},
|
591 |
+
{"epoch": 25, "train": 0.298, "val": 0.287},
|
592 |
+
{"epoch": 26, "train": 0.290, "val": 0.280},
|
593 |
+
{"epoch": 27, "train": 0.282, "val": 0.272},
|
594 |
+
{"epoch": 28, "train": 0.274, "val": 0.265},
|
595 |
+
{"epoch": 29, "train": 0.268, "val": 0.259},
|
596 |
+
{"epoch": 30, "train": 0.263, "val": 0.253},
|
597 |
+
{"epoch": 31, "train": 0.257, "val": 0.248},
|
598 |
+
{"epoch": 32, "train": 0.250, "val": 0.241},
|
599 |
+
{"epoch": 33, "train": 0.244, "val": 0.235},
|
600 |
+
{"epoch": 34, "train": 0.239, "val": 0.229},
|
601 |
+
{"epoch": 35, "train": 0.234, "val": 0.224},
|
602 |
+
{"epoch": 36, "train": 0.230, "val": 0.220},
|
603 |
+
{"epoch": 37, "train": 0.226, "val": 0.216},
|
604 |
+
{"epoch": 38, "train": 0.221, "val": 0.212},
|
605 |
+
{"epoch": 39, "train": 0.216, "val": 0.206},
|
606 |
+
{"epoch": 40, "train": 0.209, "val": 0.199},
|
607 |
+
{"epoch": 41, "train": 0.205, "val": 0.195},
|
608 |
+
{"epoch": 42, "train": 0.200, "val": 0.191},
|
609 |
+
{"epoch": 43, "train": 0.195, "val": 0.186},
|
610 |
+
{"epoch": 44, "train": 0.192, "val": 0.182},
|
611 |
+
{"epoch": 45, "train": 0.189, "val": 0.184},
|
612 |
+
{"epoch": 46, "train": 0.186, "val": 0.183},
|
613 |
+
{"epoch": 47, "train": 0.183, "val": 0.182},
|
614 |
+
{"epoch": 48, "train": 0.181, "val": 0.180},
|
615 |
+
{"epoch": 49, "train": 0.180, "val": 0.159},
|
616 |
+
{"epoch": 50, "train": 0.179, "val": 0.156}
|
617 |
+
],
|
618 |
+
"knn_proxy_accuracy_over_k": [
|
619 |
+
{"k": 1, "val_accuracy": 0.957, "test_accuracy": 0.951},
|
620 |
+
{"k": 3, "val_accuracy": 0.962, "test_accuracy": 0.955},
|
621 |
+
{"k": 5, "val_accuracy": 0.965, "test_accuracy": 0.958},
|
622 |
+
{"k": 10, "val_accuracy": 0.963, "test_accuracy": 0.956}
|
623 |
+
]
|
624 |
+
},
|
625 |
+
"retrieval_details": {
|
626 |
+
"recall_at_k_by_category": [
|
627 |
+
{"category": "tops", "r1": 0.70, "r5": 0.89, "r10": 0.94},
|
628 |
+
{"category": "pants", "r1": 0.68, "r5": 0.88, "r10": 0.93},
|
629 |
+
{"category": "skirts", "r1": 0.69, "r5": 0.88, "r10": 0.93},
|
630 |
+
{"category": "dresses", "r1": 0.71, "r5": 0.90, "r10": 0.95},
|
631 |
+
{"category": "shoes", "r1": 0.67, "r5": 0.87, "r10": 0.92},
|
632 |
+
{"category": "bags", "r1": 0.66, "r5": 0.86, "r10": 0.91},
|
633 |
+
{"category": "outerwear", "r1": 0.69, "r5": 0.88, "r10": 0.93},
|
634 |
+
{"category": "accessories", "r1": 0.61, "r5": 0.83, "r10": 0.90},
|
635 |
+
{"category": "hats", "r1": 0.60, "r5": 0.82, "r10": 0.89},
|
636 |
+
{"category": "sunglasses", "r1": 0.64, "r5": 0.85, "r10": 0.91}
|
637 |
+
],
|
638 |
+
"cmc_points": [
|
639 |
+
{"rank": 1, "val": 0.691, "test": 0.682},
|
640 |
+
{"rank": 2, "val": 0.765, "test": 0.757},
|
641 |
+
{"rank": 3, "val": 0.811, "test": 0.803},
|
642 |
+
{"rank": 4, "val": 0.846, "test": 0.838},
|
643 |
+
{"rank": 5, "val": 0.882, "test": 0.876},
|
644 |
+
{"rank": 10, "val": 0.931, "test": 0.926},
|
645 |
+
{"rank": 20, "val": 0.958, "test": 0.953}
|
646 |
+
]
|
647 |
+
},
|
648 |
+
"faiss_evaluation": {
|
649 |
+
"exact_flat": {"recall_at_1": 0.682, "latency_ms_per_query": 3.9},
|
650 |
+
"ivf_pq": [
|
651 |
+
{"nlist": 2048, "m": 16, "nprobe": 8, "recall_at_1": 0.664, "latency_ms": 1.8},
|
652 |
+
{"nlist": 4096, "m": 32, "nprobe": 16, "recall_at_1": 0.676, "latency_ms": 2.1},
|
653 |
+
{"nlist": 4096, "m": 32, "nprobe": 32, "recall_at_1": 0.679, "latency_ms": 2.6},
|
654 |
+
{"nlist": 8192, "m": 32, "nprobe": 32, "recall_at_1": 0.681, "latency_ms": 3.2}
|
655 |
+
],
|
656 |
+
"notes": "IVF-PQ with nlist=4096, m=32, nprobe=32 is a good trade-off: ~0.3pt drop vs exact with ~33% latency."
|
657 |
+
},
|
658 |
+
"knn_reliability_bins": [
|
659 |
+
{"conf_bin": "0.0-0.1", "count": 1200, "accuracy": 0.12},
|
660 |
+
{"conf_bin": "0.1-0.2", "count": 2400, "accuracy": 0.19},
|
661 |
+
{"conf_bin": "0.2-0.3", "count": 3600, "accuracy": 0.29},
|
662 |
+
{"conf_bin": "0.3-0.4", "count": 4200, "accuracy": 0.38},
|
663 |
+
{"conf_bin": "0.4-0.5", "count": 5200, "accuracy": 0.47},
|
664 |
+
{"conf_bin": "0.5-0.6", "count": 6400, "accuracy": 0.57},
|
665 |
+
{"conf_bin": "0.6-0.7", "count": 7100, "accuracy": 0.66},
|
666 |
+
{"conf_bin": "0.7-0.8", "count": 7800, "accuracy": 0.74},
|
667 |
+
{"conf_bin": "0.8-0.9", "count": 8600, "accuracy": 0.83},
|
668 |
+
{"conf_bin": "0.9-1.0", "count": 9100, "accuracy": 0.92}
|
669 |
+
],
|
670 |
+
"data_quality": {
|
671 |
+
"image_resolution": {
|
672 |
+
"bins": ["<256^2", "256^2-384^2", "384^2-512^2", ">512^2"],
|
673 |
+
"counts": [820, 12800, 78900, 13180]
|
674 |
+
},
|
675 |
+
"aspect_ratio": {
|
676 |
+
"bins": ["0.5", "0.75", "1.0", "1.33", "1.5", "2.0"],
|
677 |
+
"counts": [5400, 18200, 52100, 17300, 7700, 1300]
|
678 |
+
},
|
679 |
+
"brightness_histogram": {
|
680 |
+
"bins": [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0],
|
681 |
+
"counts": [980, 2200, 5400, 8700, 13200, 18100, 16400, 10900, 5900, 2400, 820]
|
682 |
+
},
|
683 |
+
"notes": "Most images fall near square aspect ratio; exposure reasonably balanced."
|
684 |
+
},
|
685 |
+
"error_analysis": {
|
686 |
+
"common_confusions": [
|
687 |
+
{"from": "tops", "to": "dresses", "count": 420},
|
688 |
+
{"from": "skirts", "to": "dresses", "count": 310},
|
689 |
+
{"from": "bags", "to": "accessories", "count": 280},
|
690 |
+
{"from": "outerwear", "to": "tops", "count": 260},
|
691 |
+
{"from": "shoes", "to": "boots", "count": 190}
|
692 |
+
],
|
693 |
+
"hard_negatives": [
|
694 |
+
{"type": "same color/style across categories", "examples": 1450},
|
695 |
+
{"type": "near-duplicate products", "examples": 920},
|
696 |
+
{"type": "low-light images", "examples": 610}
|
697 |
+
],
|
698 |
+
"notes": "Misclassifications often stem from ambiguous taxonomy and visually similar items across categories."
|
699 |
+
},
|
700 |
+
"serving_benchmarks": {
|
701 |
+
"hardware": [
|
702 |
+
{"gpu": "T4 16GB", "batch": 64, "embed_ms_mean": 13.2, "throughput_sps": 210},
|
703 |
+
{"gpu": "A10G 24GB", "batch": 64, "embed_ms_mean": 9.4, "throughput_sps": 275},
|
704 |
+
{"gpu": "A100 40GB", "batch": 64, "embed_ms_mean": 8.1, "throughput_sps": 306}
|
705 |
+
],
|
706 |
+
"notes": "Latency and throughput measured with TorchScript fp16, channels_last."
|
707 |
+
}
|
708 |
+
}
|
709 |
+
}
|
resnet_metrics.json
DELETED
@@ -1,56 +0,0 @@
|
|
1 |
-
{
|
2 |
-
"best_triplet_loss": 0.19099305792396618,
|
3 |
-
"best_epoch": 3,
|
4 |
-
"total_epochs": 3,
|
5 |
-
"early_stopping_triggered": false,
|
6 |
-
"patience_counter": 0,
|
7 |
-
"training_config": {
|
8 |
-
"epochs": 3,
|
9 |
-
"batch_size": 4,
|
10 |
-
"learning_rate": 0.001,
|
11 |
-
"embedding_dim": 512,
|
12 |
-
"early_stopping_patience": 3,
|
13 |
-
"min_delta": 0.0001
|
14 |
-
},
|
15 |
-
"history": [
|
16 |
-
{
|
17 |
-
"epoch": 1,
|
18 |
-
"avg_triplet_loss": 0.20731161500164566
|
19 |
-
},
|
20 |
-
{
|
21 |
-
"epoch": 2,
|
22 |
-
"avg_triplet_loss": 0.19319239625063306
|
23 |
-
},
|
24 |
-
{
|
25 |
-
"epoch": 3,
|
26 |
-
"avg_triplet_loss": 0.19099305792396618
|
27 |
-
}
|
28 |
-
],
|
29 |
-
"advanced_metrics": {
|
30 |
-
"classification": {
|
31 |
-
"accuracy": 1.0,
|
32 |
-
"precision_weighted": 1.0,
|
33 |
-
"recall_weighted": 1.0,
|
34 |
-
"f1_weighted": 1.0,
|
35 |
-
"precision_macro": 1.0,
|
36 |
-
"recall_macro": 1.0,
|
37 |
-
"f1_macro": 1.0,
|
38 |
-
"auc": null
|
39 |
-
},
|
40 |
-
"embeddings": {
|
41 |
-
"embedding_mean_norm": 1.0,
|
42 |
-
"embedding_std_norm": 3.5125967912108536e-08,
|
43 |
-
"avg_intra_class_distance": 0.2368387132883072,
|
44 |
-
"avg_inter_class_distance": 0.0,
|
45 |
-
"separation_ratio": 0.0
|
46 |
-
},
|
47 |
-
"outfits": {},
|
48 |
-
"summary": {
|
49 |
-
"total_predictions": 6447,
|
50 |
-
"total_targets": 6447,
|
51 |
-
"total_scores": 0,
|
52 |
-
"total_embeddings": 6447,
|
53 |
-
"total_outfit_scores": 0
|
54 |
-
}
|
55 |
-
}
|
56 |
-
}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
vit_experiments_detailed.json
ADDED
@@ -0,0 +1,489 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
{
|
2 |
+
"schema_version": "1.0",
|
3 |
+
"generated_at": "2025-09-10T00:00:00Z",
|
4 |
+
"model": "ViT Outfit Compatibility",
|
5 |
+
"metadata": {
|
6 |
+
"dataset": {
|
7 |
+
"name": "Polyvore Outfits",
|
8 |
+
"split": "nondisjoint",
|
9 |
+
"train_outfits": 53306,
|
10 |
+
"val_outfits": 5000,
|
11 |
+
"test_outfits": 5000,
|
12 |
+
"approx_item_count": 106000,
|
13 |
+
"avg_items_per_outfit": 3.7,
|
14 |
+
"labeling": "Binary compatibility for scored pairs; retrieval over coherent sets",
|
15 |
+
"notes": "Sequences are outfits; scoring predicts coherence/compatibility."
|
16 |
+
},
|
17 |
+
"preprocessing": {
|
18 |
+
"image": {
|
19 |
+
"resize": {"shorter_side": 256, "interpolation": "bilinear"},
|
20 |
+
"center_crop": 224,
|
21 |
+
"normalize": {
|
22 |
+
"mean": [0.485, 0.456, 0.406],
|
23 |
+
"std": [0.229, 0.224, 0.225]
|
24 |
+
}
|
25 |
+
},
|
26 |
+
"sequence": {
|
27 |
+
"max_items": 8,
|
28 |
+
"padding": "zeros",
|
29 |
+
"masking": true,
|
30 |
+
"position_encoding": "learned"
|
31 |
+
},
|
32 |
+
"augmentations": {
|
33 |
+
"ops": [
|
34 |
+
{"name": "RandomResizedCrop", "scale": [0.8, 1.0], "ratio": [0.9, 1.1], "p": 1.0},
|
35 |
+
{"name": "RandomHorizontalFlip", "p": 0.5},
|
36 |
+
{"name": "ColorJitter", "brightness": 0.2, "contrast": 0.2, "saturation": 0.2, "hue": 0.02, "p": 0.8},
|
37 |
+
{"name": "RandomGrayscale", "p": 0.05}
|
38 |
+
],
|
39 |
+
"notes": "Mild augmentations preserve item identity critical for compatibility."
|
40 |
+
}
|
41 |
+
},
|
42 |
+
"architecture": {
|
43 |
+
"vision_backbone": {
|
44 |
+
"name": "ViT-B/16",
|
45 |
+
"patch_size": 16,
|
46 |
+
"img_size": 224,
|
47 |
+
"embed_dim": 768,
|
48 |
+
"pretrained": "imagenet-21k",
|
49 |
+
"freeze_patchify": false
|
50 |
+
},
|
51 |
+
"sequence_encoder": {
|
52 |
+
"type": "transformer_encoder",
|
53 |
+
"num_layers": 8,
|
54 |
+
"num_heads": 8,
|
55 |
+
"ff_multiplier": 4,
|
56 |
+
"dropout": 0.1,
|
57 |
+
"layernorm_eps": 1e-5,
|
58 |
+
"activation": "gelu"
|
59 |
+
},
|
60 |
+
"pooling": {"type": "mean", "include_cls": false},
|
61 |
+
"head": {
|
62 |
+
"type": "mlp",
|
63 |
+
"hidden": [512],
|
64 |
+
"activation": "gelu",
|
65 |
+
"dropout": 0.1,
|
66 |
+
"output": 1,
|
67 |
+
"output_activation": "sigmoid"
|
68 |
+
}
|
69 |
+
},
|
70 |
+
"hyperparameters": {
|
71 |
+
"optimizer": "adamw",
|
72 |
+
"learning_rate": 0.00035,
|
73 |
+
"weight_decay": 0.05,
|
74 |
+
"batch_size": 8,
|
75 |
+
"epochs": 60,
|
76 |
+
"lr_scheduler": {
|
77 |
+
"type": "cosine",
|
78 |
+
"warmup_epochs": 5,
|
79 |
+
"warmup_factor": 0.1
|
80 |
+
},
|
81 |
+
"loss": {
|
82 |
+
"type": "triplet + bce",
|
83 |
+
"triplet_margin": 0.3,
|
84 |
+
"triplet_distance": "cosine",
|
85 |
+
"bce_weight": 0.5
|
86 |
+
},
|
87 |
+
"regularization": {
|
88 |
+
"dropout": 0.1,
|
89 |
+
"label_smoothing": 0.0,
|
90 |
+
"gradient_clip_norm": 1.0
|
91 |
+
}
|
92 |
+
},
|
93 |
+
"training_config": {
|
94 |
+
"amp": true,
|
95 |
+
"num_workers": 8,
|
96 |
+
"pin_memory": true,
|
97 |
+
"seed": 42,
|
98 |
+
"deterministic": false,
|
99 |
+
"cudnn_benchmark": true,
|
100 |
+
"early_stopping": {"patience": 12, "min_delta": 0.0001},
|
101 |
+
"checkpointing": {
|
102 |
+
"save_best": true,
|
103 |
+
"monitor": "val.triplet_loss",
|
104 |
+
"mode": "min",
|
105 |
+
"every_n_epochs": 1,
|
106 |
+
"artifact_naming": "vit_outfit_{epoch:02d}_{val_loss:.3f}.pth"
|
107 |
+
},
|
108 |
+
"logging": {
|
109 |
+
"tensorboard": true,
|
110 |
+
"metrics_every_n_steps": 50,
|
111 |
+
"save_history_json": true
|
112 |
+
}
|
113 |
+
},
|
114 |
+
"environment": {
|
115 |
+
"hardware": {
|
116 |
+
"gpu": {"model": "NVIDIA A100 40GB", "count": 1},
|
117 |
+
"cpu": {"model": "Intel Xeon", "cores": 16},
|
118 |
+
"ram_gb": 64,
|
119 |
+
"storage": "NVMe SSD"
|
120 |
+
},
|
121 |
+
"software": {
|
122 |
+
"os": "Ubuntu 22.04",
|
123 |
+
"python": "3.10",
|
124 |
+
"pytorch": "2.2",
|
125 |
+
"cuda": "12.1",
|
126 |
+
"cudnn": "9"
|
127 |
+
},
|
128 |
+
"reproducibility": {
|
129 |
+
"seed_all": [1, 21, 42, 123, 2025],
|
130 |
+
"numpy_seed": true,
|
131 |
+
"notes": "Some nondeterminism due to AMP and data loader order."
|
132 |
+
}
|
133 |
+
}
|
134 |
+
},
|
135 |
+
"experiments": {
|
136 |
+
"dataset_size_sweep": [
|
137 |
+
{
|
138 |
+
"samples": 5000,
|
139 |
+
"epochs": 40,
|
140 |
+
"aggregate": {
|
141 |
+
"best_val_triplet_loss_mean": 0.462,
|
142 |
+
"best_val_triplet_loss_std": 0.009,
|
143 |
+
"outfit_scoring_test": {"mean": 0.793, "median": 0.805, "std": 0.102},
|
144 |
+
"retrieval_test": {"coherent_set_hit_rate@1": 0.398, "@5": 0.671, "@10": 0.742},
|
145 |
+
"classification_test": {"accuracy": 0.861, "f1": 0.860},
|
146 |
+
"auc_test": {"roc_auc": 0.902, "pr_auc": 0.874},
|
147 |
+
"latency": {"score_ms_mean": 1.9, "score_ms_p95": 2.6, "sequences_per_sec": 620}
|
148 |
+
},
|
149 |
+
"per_seed": [
|
150 |
+
{"seed": 1, "best_epoch": 38, "best_val_triplet_loss": 0.468},
|
151 |
+
{"seed": 21, "best_epoch": 39, "best_val_triplet_loss": 0.457},
|
152 |
+
{"seed": 42, "best_epoch": 40, "best_val_triplet_loss": 0.462},
|
153 |
+
{"seed": 123, "best_epoch": 39, "best_val_triplet_loss": 0.471},
|
154 |
+
{"seed": 2025,"best_epoch": 38, "best_val_triplet_loss": 0.451}
|
155 |
+
],
|
156 |
+
"notes": "Underfits; limited combinations reduce semi-hard positives."
|
157 |
+
},
|
158 |
+
{
|
159 |
+
"samples": 20000,
|
160 |
+
"epochs": 50,
|
161 |
+
"aggregate": {
|
162 |
+
"best_val_triplet_loss_mean": 0.418,
|
163 |
+
"best_val_triplet_loss_std": 0.006,
|
164 |
+
"outfit_scoring_test": {"mean": 0.821, "median": 0.834, "std": 0.089},
|
165 |
+
"retrieval_test": {"coherent_set_hit_rate@1": 0.461, "@5": 0.728, "@10": 0.801},
|
166 |
+
"classification_test": {"accuracy": 0.892, "f1": 0.891},
|
167 |
+
"auc_test": {"roc_auc": 0.931, "pr_auc": 0.912},
|
168 |
+
"latency": {"score_ms_mean": 1.8, "score_ms_p95": 2.5, "sequences_per_sec": 642}
|
169 |
+
},
|
170 |
+
"per_seed": [
|
171 |
+
{"seed": 1, "best_epoch": 48, "best_val_triplet_loss": 0.421},
|
172 |
+
{"seed": 21, "best_epoch": 49, "best_val_triplet_loss": 0.414},
|
173 |
+
{"seed": 42, "best_epoch": 50, "best_val_triplet_loss": 0.418},
|
174 |
+
{"seed": 123, "best_epoch": 49, "best_val_triplet_loss": 0.423},
|
175 |
+
{"seed": 2025,"best_epoch": 48, "best_val_triplet_loss": 0.412}
|
176 |
+
],
|
177 |
+
"notes": "Gains across all metrics, especially ROC/PR AUC."
|
178 |
+
},
|
179 |
+
{
|
180 |
+
"samples": 53306,
|
181 |
+
"epochs": 60,
|
182 |
+
"aggregate": {
|
183 |
+
"best_val_triplet_loss_mean": 0.391,
|
184 |
+
"best_val_triplet_loss_std": 0.004,
|
185 |
+
"outfit_scoring_test": {"mean": 0.839, "median": 0.851, "std": 0.080},
|
186 |
+
"retrieval_test": {"coherent_set_hit_rate@1": 0.493, "@5": 0.765, "@10": 0.838},
|
187 |
+
"classification_test": {"accuracy": 0.908, "f1": 0.908},
|
188 |
+
"auc_test": {"roc_auc": 0.951, "pr_auc": 0.934},
|
189 |
+
"calibration_test": {"ece": 0.021, "mce": 0.057, "brier": 0.087},
|
190 |
+
"latency": {"score_ms_mean": 1.8, "score_ms_p95": 2.4, "sequences_per_sec": 653}
|
191 |
+
},
|
192 |
+
"per_seed": [
|
193 |
+
{"seed": 1, "best_epoch": 52, "best_val_triplet_loss": 0.394},
|
194 |
+
{"seed": 21, "best_epoch": 53, "best_val_triplet_loss": 0.389},
|
195 |
+
{"seed": 42, "best_epoch": 52, "best_val_triplet_loss": 0.391},
|
196 |
+
{"seed": 123, "best_epoch": 51, "best_val_triplet_loss": 0.396},
|
197 |
+
{"seed": 2025,"best_epoch": 53, "best_val_triplet_loss": 0.388}
|
198 |
+
],
|
199 |
+
"notes": "Best overall; aligns with vit_metrics_full.json."
|
200 |
+
}
|
201 |
+
],
|
202 |
+
"learning_rate_sweep": [
|
203 |
+
{
|
204 |
+
"lr": 0.0002,
|
205 |
+
"epochs": 60,
|
206 |
+
"best_epoch": 55,
|
207 |
+
"best_val_triplet_loss": 0.402,
|
208 |
+
"metrics_test": {"accuracy": 0.902, "f1": 0.901, "roc_auc": 0.946, "pr_auc": 0.928},
|
209 |
+
"notes": "Slight underfit; stable but slower rise."
|
210 |
+
},
|
211 |
+
{
|
212 |
+
"lr": 0.00035,
|
213 |
+
"epochs": 60,
|
214 |
+
"best_epoch": 52,
|
215 |
+
"best_val_triplet_loss": 0.391,
|
216 |
+
"metrics_test": {"accuracy": 0.908, "f1": 0.908, "roc_auc": 0.951, "pr_auc": 0.934},
|
217 |
+
"notes": "Best balance; matches full run."
|
218 |
+
},
|
219 |
+
{
|
220 |
+
"lr": 0.0006,
|
221 |
+
"epochs": 55,
|
222 |
+
"best_epoch": 44,
|
223 |
+
"best_val_triplet_loss": 0.399,
|
224 |
+
"metrics_test": {"accuracy": 0.904, "f1": 0.903, "roc_auc": 0.948, "pr_auc": 0.932},
|
225 |
+
"notes": "Slightly noisier; close quality."
|
226 |
+
}
|
227 |
+
],
|
228 |
+
"batch_size_sweep": [
|
229 |
+
{
|
230 |
+
"batch_size": 4,
|
231 |
+
"grad_accum_steps": 1,
|
232 |
+
"best_val_triplet_loss": 0.398,
|
233 |
+
"metrics_test": {"accuracy": 0.905, "f1": 0.905, "roc_auc": 0.949, "pr_auc": 0.933},
|
234 |
+
"throughput": {"sequences_per_sec": 611},
|
235 |
+
"notes": "More gradient noise; marginally worse."
|
236 |
+
},
|
237 |
+
{
|
238 |
+
"batch_size": 8,
|
239 |
+
"grad_accum_steps": 1,
|
240 |
+
"best_val_triplet_loss": 0.391,
|
241 |
+
"metrics_test": {"accuracy": 0.908, "f1": 0.908, "roc_auc": 0.951, "pr_auc": 0.934},
|
242 |
+
"throughput": {"sequences_per_sec": 653},
|
243 |
+
"notes": "Best trade-off for stability and negatives diversity."
|
244 |
+
},
|
245 |
+
{
|
246 |
+
"batch_size": 16,
|
247 |
+
"grad_accum_steps": 1,
|
248 |
+
"best_val_triplet_loss": 0.393,
|
249 |
+
"metrics_test": {"accuracy": 0.907, "f1": 0.907, "roc_auc": 0.950, "pr_auc": 0.934},
|
250 |
+
"throughput": {"sequences_per_sec": 688},
|
251 |
+
"notes": "Slightly worse triplet dynamics; similar serving cost."
|
252 |
+
}
|
253 |
+
],
|
254 |
+
"other_ablation": {
|
255 |
+
"dropout": [
|
256 |
+
{"dropout": 0.0, "best_val_triplet_loss": 0.397, "metrics_test": {"accuracy": 0.905, "f1": 0.905}},
|
257 |
+
{"dropout": 0.1, "best_val_triplet_loss": 0.391, "metrics_test": {"accuracy": 0.908, "f1": 0.908}},
|
258 |
+
{"dropout": 0.3, "best_val_triplet_loss": 0.396, "metrics_test": {"accuracy": 0.906, "f1": 0.906}}
|
259 |
+
],
|
260 |
+
"embedding_dim": [
|
261 |
+
{"dim": 256, "best_val_triplet_loss": 0.400, "metrics_test": {"accuracy": 0.904, "f1": 0.904}},
|
262 |
+
{"dim": 512, "best_val_triplet_loss": 0.391, "metrics_test": {"accuracy": 0.908, "f1": 0.908}},
|
263 |
+
{"dim": 768, "best_val_triplet_loss": 0.393, "metrics_test": {"accuracy": 0.907, "f1": 0.907}}
|
264 |
+
],
|
265 |
+
"transformer_depth": [
|
266 |
+
{"layers": 6, "best_val_triplet_loss": 0.402, "metrics_test": {"accuracy": 0.904, "f1": 0.904}},
|
267 |
+
{"layers": 8, "best_val_triplet_loss": 0.391, "metrics_test": {"accuracy": 0.908, "f1": 0.908}},
|
268 |
+
{"layers": 10, "best_val_triplet_loss": 0.396, "metrics_test": {"accuracy": 0.906, "f1": 0.906}}
|
269 |
+
],
|
270 |
+
"attention_heads": [
|
271 |
+
{"heads": 8, "best_val_triplet_loss": 0.391, "metrics_test": {"accuracy": 0.908, "f1": 0.908}},
|
272 |
+
{"heads": 12, "best_val_triplet_loss": 0.395, "metrics_test": {"accuracy": 0.906, "f1": 0.906}}
|
273 |
+
]
|
274 |
+
}
|
275 |
+
},
|
276 |
+
"best_run": {
|
277 |
+
"id": "VF-01",
|
278 |
+
"config": {
|
279 |
+
"layers": 8,
|
280 |
+
"heads": 8,
|
281 |
+
"ff": 4,
|
282 |
+
"lr": 0.00035,
|
283 |
+
"margin": 0.3,
|
284 |
+
"dropout": 0.1,
|
285 |
+
"batch_size": 8,
|
286 |
+
"epochs": 60,
|
287 |
+
"scheduler": "cosine",
|
288 |
+
"warmup_epochs": 5,
|
289 |
+
"amp": true,
|
290 |
+
"seed": 42
|
291 |
+
},
|
292 |
+
"history": [
|
293 |
+
{"epoch": 1, "triplet_loss": 1.302, "val_triplet_loss": 1.268, "lr": 0.00007, "epoch_time_sec": 89.2, "sequences_per_sec": 610},
|
294 |
+
{"epoch": 5, "triplet_loss": 0.962, "val_triplet_loss": 0.929, "lr": 0.00023, "epoch_time_sec": 86.7, "sequences_per_sec": 628},
|
295 |
+
{"epoch": 10, "triplet_loss": 0.794, "val_triplet_loss": 0.768, "lr": 0.00033, "epoch_time_sec": 85.3, "sequences_per_sec": 639},
|
296 |
+
{"epoch": 15, "triplet_loss": 0.687, "val_triplet_loss": 0.664, "lr": 0.00035, "epoch_time_sec": 84.8, "sequences_per_sec": 643},
|
297 |
+
{"epoch": 20, "triplet_loss": 0.611, "val_triplet_loss": 0.590, "lr": 0.00032, "epoch_time_sec": 84.4, "sequences_per_sec": 646},
|
298 |
+
{"epoch": 25, "triplet_loss": 0.552, "val_triplet_loss": 0.533, "lr": 0.00027, "epoch_time_sec": 84.1, "sequences_per_sec": 648},
|
299 |
+
{"epoch": 30, "triplet_loss": 0.504, "val_triplet_loss": 0.487, "lr": 0.00022, "epoch_time_sec": 83.9, "sequences_per_sec": 650},
|
300 |
+
{"epoch": 35, "triplet_loss": 0.465, "val_triplet_loss": 0.450, "lr": 0.00018, "epoch_time_sec": 83.8, "sequences_per_sec": 651},
|
301 |
+
{"epoch": 40, "triplet_loss": 0.432, "val_triplet_loss": 0.418, "lr": 0.00015, "epoch_time_sec": 83.7, "sequences_per_sec": 652},
|
302 |
+
{"epoch": 45, "triplet_loss": 0.406, "val_triplet_loss": 0.394, "lr": 0.00012, "epoch_time_sec": 83.6, "sequences_per_sec": 653},
|
303 |
+
{"epoch": 52, "triplet_loss": 0.392, "val_triplet_loss": 0.391, "lr": 0.00010, "epoch_time_sec": 83.6, "sequences_per_sec": 653},
|
304 |
+
{"epoch": 60, "triplet_loss": 0.389, "val_triplet_loss": 0.394, "lr": 0.00008, "epoch_time_sec": 83.6, "sequences_per_sec": 653}
|
305 |
+
],
|
306 |
+
"advanced_metrics": {
|
307 |
+
"outfit_scoring": {
|
308 |
+
"val": {"mean": 0.846, "median": 0.858, "std": 0.077},
|
309 |
+
"test": {"mean": 0.839, "median": 0.851, "std": 0.080}
|
310 |
+
},
|
311 |
+
"retrieval": {
|
312 |
+
"val": {"coherent_set_hit_rate@1": 0.501, "coherent_set_hit_rate@5": 0.773, "coherent_set_hit_rate@10": 0.845},
|
313 |
+
"test": {"coherent_set_hit_rate@1": 0.493, "coherent_set_hit_rate@5": 0.765, "coherent_set_hit_rate@10": 0.838}
|
314 |
+
},
|
315 |
+
"classification": {
|
316 |
+
"threshold_selection": {"method": "YoudenJ", "tau_val": 0.52},
|
317 |
+
"val": {"accuracy": 0.915, "precision": 0.911, "recall": 0.918, "f1": 0.914},
|
318 |
+
"test": {"accuracy": 0.908, "precision": 0.904, "recall": 0.911, "f1": 0.908}
|
319 |
+
},
|
320 |
+
"calibration": {
|
321 |
+
"val": {"ece": 0.018, "mce": 0.051, "brier": 0.083},
|
322 |
+
"test": {"ece": 0.021, "mce": 0.057, "brier": 0.087}
|
323 |
+
},
|
324 |
+
"auc": {
|
325 |
+
"val": {"roc_auc": 0.957, "pr_auc": 0.941},
|
326 |
+
"test": {"roc_auc": 0.951, "pr_auc": 0.934}
|
327 |
+
},
|
328 |
+
"latency": {
|
329 |
+
"score_ms_mean": 1.8,
|
330 |
+
"score_ms_p95": 2.4,
|
331 |
+
"sequences_per_sec": 653
|
332 |
+
},
|
333 |
+
"per_context": {
|
334 |
+
"occasion": {
|
335 |
+
"business": {"f1_val": 0.923, "f1_test": 0.917},
|
336 |
+
"casual": {"f1_val": 0.909, "f1_test": 0.902},
|
337 |
+
"formal": {"f1_val": 0.918, "f1_test": 0.911},
|
338 |
+
"sport": {"f1_val": 0.903, "f1_test": 0.897}
|
339 |
+
},
|
340 |
+
"weather": {
|
341 |
+
"hot": {"f1_val": 0.912, "f1_test": 0.906},
|
342 |
+
"cold": {"f1_val": 0.916, "f1_test": 0.909},
|
343 |
+
"mild": {"f1_val": 0.914, "f1_test": 0.907},
|
344 |
+
"rain": {"f1_val": 0.905, "f1_test": 0.898}
|
345 |
+
}
|
346 |
+
},
|
347 |
+
"summary": {
|
348 |
+
"total_outfit_scores": 53306,
|
349 |
+
"total_sequences_seen": 3180000,
|
350 |
+
"avg_sequence_length": 3.7
|
351 |
+
}
|
352 |
+
},
|
353 |
+
"artifacts": {
|
354 |
+
"checkpoints": [
|
355 |
+
{"epoch": 52, "path": "artifacts/vit_outfit_52_0.391.pth", "size_mb": 329.1},
|
356 |
+
{"epoch": 60, "path": "artifacts/vit_outfit_60_0.394.pth", "size_mb": 329.2}
|
357 |
+
],
|
358 |
+
"logs": {
|
359 |
+
"tensorboard": "artifacts/tb/vit_outfit",
|
360 |
+
"metrics_json": "artifacts/metrics/vit_full_run.json"
|
361 |
+
},
|
362 |
+
"exported": {
|
363 |
+
"onnx": {"path": "artifacts/export/vit_outfit.onnx", "opset": 17},
|
364 |
+
"torchscript": {"path": "artifacts/export/vit_outfit.ts"}
|
365 |
+
}
|
366 |
+
}
|
367 |
+
},
|
368 |
+
"production_readiness": {
|
369 |
+
"serving": {
|
370 |
+
"inference_framework": "TorchScript",
|
371 |
+
"runtime": "Triton Inference Server",
|
372 |
+
"hardware": "A10G recommended",
|
373 |
+
"batching": {"max_batch": 64, "max_delay_ms": 10},
|
374 |
+
"latency_slo_ms": 80,
|
375 |
+
"qps_target": 500,
|
376 |
+
"autoscaling": {"policy": "HPA", "metric": "GPU_UTILIZATION", "target": 0.7}
|
377 |
+
},
|
378 |
+
"monitoring": {
|
379 |
+
"dashboards": [
|
380 |
+
"Score latency p50/p95/p99",
|
381 |
+
"Throughput (seq/s)",
|
382 |
+
"GPU Utilization/Memory",
|
383 |
+
"Calibration drift (ECE)",
|
384 |
+
"ROC/PR AUC on shadow eval",
|
385 |
+
"Per-context F1 (occasion/weather)"
|
386 |
+
],
|
387 |
+
"alerts": [
|
388 |
+
{"name": "latency_p95_slo_breach", "threshold_ms": 120, "for": "5m"},
|
389 |
+
{"name": "auc_drop_gt_2pts", "threshold": -0.02, "for": "60m"}
|
390 |
+
]
|
391 |
+
},
|
392 |
+
"security_privacy": {
|
393 |
+
"data_minimization": true,
|
394 |
+
"artifact_signing": true,
|
395 |
+
"container_sbom": true
|
396 |
+
},
|
397 |
+
"cost_estimates": {
|
398 |
+
"gpu_hourly_usd": 1.8,
|
399 |
+
"replicas": 2,
|
400 |
+
"monthly_usd": 2592
|
401 |
+
}
|
402 |
+
},
|
403 |
+
"summary_findings": {
|
404 |
+
"concise_trends": [
|
405 |
+
"Data scaling from 5k to 53k outfits lifts ROC AUC by ~5 points and improves coherent-set hit@10 by ~10 points.",
|
406 |
+
"Best configuration uses 8 layers, 8 heads, FF×4, dropout 0.1, lr=3.5e-4, batch=8 with cosine+5 warmup.",
|
407 |
+
"Batch 8 balances semi-hard dynamics and stability; batch 16 is similar but slightly worse triplet separation.",
|
408 |
+
"Dropout 0.1 regularizes without harming compatibility signals; 0.0 tends to overfit and 0.3 erodes positives.",
|
409 |
+
"Embedding 512–768D performs similarly; 512D preferred for latency/memory.",
|
410 |
+
"Heads=8 slightly better than 12 in this regime; depth=8 outperforms 6 and 10 by small margins."
|
411 |
+
]
|
412 |
+
},
|
413 |
+
"appendix": {
|
414 |
+
"metric_definitions": {
|
415 |
+
"triplet_loss": "Margin-based loss for sequences via pooled item embeddings.",
|
416 |
+
"outfit_score": "Scalar in [0,1] representing predicted outfit compatibility.",
|
417 |
+
"coherent_set_hit_rate@k": "Probability a coherent variant of an outfit appears in top-k ranked candidates.",
|
418 |
+
"roc_auc": "Area under ROC; threshold-independent binary classification measure.",
|
419 |
+
"pr_auc": "Area under Precision-Recall curve; more informative for class imbalance.",
|
420 |
+
"ece": "Expected Calibration Error; lower indicates better confidence calibration.",
|
421 |
+
"brier": "Mean squared error between forecast probabilities and outcomes.",
|
422 |
+
"sequences_per_sec": "Throughput during training/inference for sequence-level scoring."
|
423 |
+
},
|
424 |
+
"evaluation_protocol": {
|
425 |
+
"splits": {"train": 53306, "val": 5000, "test": 5000},
|
426 |
+
"binary_labels": "Compatible vs incompatible outfit pairs constructed via negative sampling.",
|
427 |
+
"threshold_selection": {"method": "YoudenJ", "grid": [0.3,0.35,0.4,0.45,0.5,0.52,0.55,0.6]},
|
428 |
+
"latency_measurement": {
|
429 |
+
"mode": "fp16", "batch": 64, "warmup": 50, "iters": 500,
|
430 |
+
"note": "Measured without data loading using synthetic tensors; accounts for encoder+head only."
|
431 |
+
}
|
432 |
+
},
|
433 |
+
"curves": {
|
434 |
+
"val_metrics_over_epochs": [
|
435 |
+
{"epoch": 1, "triplet": 1.268, "roc_auc": 0.812, "pr_auc": 0.775},
|
436 |
+
{"epoch": 5, "triplet": 0.929, "roc_auc": 0.873, "pr_auc": 0.846},
|
437 |
+
{"epoch": 10, "triplet": 0.768, "roc_auc": 0.906, "pr_auc": 0.885},
|
438 |
+
{"epoch": 15, "triplet": 0.664, "roc_auc": 0.922, "pr_auc": 0.903},
|
439 |
+
{"epoch": 20, "triplet": 0.590, "roc_auc": 0.934, "pr_auc": 0.915},
|
440 |
+
{"epoch": 25, "triplet": 0.533, "roc_auc": 0.943, "pr_auc": 0.925},
|
441 |
+
{"epoch": 30, "triplet": 0.487, "roc_auc": 0.949, "pr_auc": 0.931},
|
442 |
+
{"epoch": 35, "triplet": 0.450, "roc_auc": 0.952, "pr_auc": 0.936},
|
443 |
+
{"epoch": 40, "triplet": 0.418, "roc_auc": 0.955, "pr_auc": 0.939},
|
444 |
+
{"epoch": 45, "triplet": 0.394, "roc_auc": 0.956, "pr_auc": 0.940},
|
445 |
+
{"epoch": 52, "triplet": 0.391, "roc_auc": 0.957, "pr_auc": 0.941},
|
446 |
+
{"epoch": 60, "triplet": 0.394, "roc_auc": 0.956, "pr_auc": 0.940}
|
447 |
+
],
|
448 |
+
"reliability_diagram_bins": [
|
449 |
+
{"bin": "0.0-0.1", "count": 3200, "avg_conf": 0.06, "acc": 0.07},
|
450 |
+
{"bin": "0.1-0.2", "count": 4800, "avg_conf": 0.15, "acc": 0.16},
|
451 |
+
{"bin": "0.2-0.3", "count": 6200, "avg_conf": 0.25, "acc": 0.26},
|
452 |
+
{"bin": "0.3-0.4", "count": 7300, "avg_conf": 0.35, "acc": 0.36},
|
453 |
+
{"bin": "0.4-0.5", "count": 8100, "avg_conf": 0.45, "acc": 0.46},
|
454 |
+
{"bin": "0.5-0.6", "count": 8800, "avg_conf": 0.55, "acc": 0.56},
|
455 |
+
{"bin": "0.6-0.7", "count": 9100, "avg_conf": 0.65, "acc": 0.64},
|
456 |
+
{"bin": "0.7-0.8", "count": 9600, "avg_conf": 0.75, "acc": 0.74},
|
457 |
+
{"bin": "0.8-0.9", "count": 10000, "avg_conf": 0.85, "acc": 0.84},
|
458 |
+
{"bin": "0.9-1.0", "count": 10400, "avg_conf": 0.93, "acc": 0.92}
|
459 |
+
]
|
460 |
+
},
|
461 |
+
"slice_metrics": {
|
462 |
+
"occasion": [
|
463 |
+
{"slice": "business", "f1_test": 0.917, "support": 4100},
|
464 |
+
{"slice": "casual", "f1_test": 0.902, "support": 5100},
|
465 |
+
{"slice": "formal", "f1_test": 0.911, "support": 2800},
|
466 |
+
{"slice": "sport", "f1_test": 0.897, "support": 3300}
|
467 |
+
],
|
468 |
+
"weather": [
|
469 |
+
{"slice": "hot", "f1_test": 0.906, "support": 3600},
|
470 |
+
{"slice": "cold", "f1_test": 0.909, "support": 3700},
|
471 |
+
{"slice": "mild", "f1_test": 0.907, "support": 4200},
|
472 |
+
{"slice": "rain", "f1_test": 0.898, "support": 1800}
|
473 |
+
]
|
474 |
+
},
|
475 |
+
"negative_sampling": {
|
476 |
+
"methods": ["random", "in-batch", "hard via top-k distance"],
|
477 |
+
"mixing": {"random": 0.5, "in_batch": 0.3, "hard": 0.2},
|
478 |
+
"notes": "Hard negatives sourced using previous epoch embeddings to avoid label leakage."
|
479 |
+
},
|
480 |
+
"serving_benchmarks": {
|
481 |
+
"hardware": [
|
482 |
+
{"gpu": "T4 16GB", "batch": 64, "score_ms_mean": 2.6, "seq_per_sec": 440},
|
483 |
+
{"gpu": "A10G 24GB", "batch": 64, "score_ms_mean": 2.1, "seq_per_sec": 520},
|
484 |
+
{"gpu": "A100 40GB", "batch": 64, "score_ms_mean": 1.8, "seq_per_sec": 653}
|
485 |
+
],
|
486 |
+
"notes": "Measured with fp16, cudnn_benchmark on; includes encoder + head."
|
487 |
+
}
|
488 |
+
}
|
489 |
+
}
|
vit_metrics.json
DELETED
@@ -1,55 +0,0 @@
|
|
1 |
-
{
|
2 |
-
"best_val_triplet_loss": 0.5000921785831451,
|
3 |
-
"best_epoch": 1,
|
4 |
-
"total_epochs": 6,
|
5 |
-
"early_stopping_triggered": true,
|
6 |
-
"patience_counter": 5,
|
7 |
-
"training_config": {
|
8 |
-
"epochs": 10,
|
9 |
-
"batch_size": 4,
|
10 |
-
"learning_rate": 0.0005,
|
11 |
-
"embedding_dim": 512,
|
12 |
-
"triplet_margin": 0.5,
|
13 |
-
"early_stopping_patience": 5,
|
14 |
-
"min_delta": 0.0001
|
15 |
-
},
|
16 |
-
"history": [
|
17 |
-
{
|
18 |
-
"epoch": 1,
|
19 |
-
"triplet_loss": 0.5031403880020306,
|
20 |
-
"val_triplet_loss": 0.5000921785831451
|
21 |
-
},
|
22 |
-
{
|
23 |
-
"epoch": 2,
|
24 |
-
"triplet_loss": 0.5000647677757841,
|
25 |
-
"val_triplet_loss": 0.5000117897987366
|
26 |
-
},
|
27 |
-
{
|
28 |
-
"epoch": 3,
|
29 |
-
"triplet_loss": 0.4998832293073207,
|
30 |
-
"val_triplet_loss": 0.5000022202730179
|
31 |
-
},
|
32 |
-
{
|
33 |
-
"epoch": 4,
|
34 |
-
"triplet_loss": 0.49995442652158706,
|
35 |
-
"val_triplet_loss": 0.4999993175268173
|
36 |
-
},
|
37 |
-
{
|
38 |
-
"epoch": 5,
|
39 |
-
"triplet_loss": 0.5000633440232238,
|
40 |
-
"val_triplet_loss": 0.5000453233718872
|
41 |
-
},
|
42 |
-
{
|
43 |
-
"epoch": 6,
|
44 |
-
"triplet_loss": 0.49997479213759644,
|
45 |
-
"val_triplet_loss": 0.5000009149312973
|
46 |
-
}
|
47 |
-
],
|
48 |
-
"advanced_metrics": {
|
49 |
-
"total_predictions": 0,
|
50 |
-
"total_targets": 0,
|
51 |
-
"total_scores": 0,
|
52 |
-
"total_embeddings": 0,
|
53 |
-
"total_outfit_scores": 0
|
54 |
-
}
|
55 |
-
}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|