# LogSAD:基于视觉和语言基础模型的无训练异常检测方法详解 ## 项目概述 LogSAD(Towards Training-free Anomaly Detection with Vision and Language Foundation Models)是一个发表在CVPR 2025的无需训练的异常检测方法。该方法通过结合多个预训练的视觉和语言基础模型,实现了对MVTec LOCO数据集的逻辑异常和结构异常检测。 ## 整体架构与流程 ### 核心理念 LogSAD的核心思想是利用预训练模型的强大表示能力,通过多模态特征融合和逻辑推理来检测异常,无需对特定数据集进行训练。 ### 系统架构 ``` 输入图像 (448x448) ↓ ┌─────────────────────────────────────────────────┐ │ 多模态特征提取层 │ │ ├─ CLIP ViT-L-14 (图像+文本特征) │ │ ├─ DINOv2 ViT-L-14 (图像特征) │ │ └─ SAM ViT-H (实例分割) │ └─────────────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────┐ │ 特征处理与融合层 │ │ ├─ K-means聚类分割 │ │ ├─ 文本引导的语义分割 │ │ └─ 多尺度特征融合 │ └─────────────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────┐ │ 异常检测层 │ │ ├─ 结构异常检测 (PatchCore) │ │ ├─ 逻辑异常检测 (直方图匹配) │ │ └─ 实例匹配检测 (Hungarian算法) │ └─────────────────────────────────────────────────┘ ↓ 最终异常分数 ``` ## 预训练模型详解 ### 1. CLIP ViT-L-14 模型 **作用**:视觉-语言理解的核心 - **模型**:`hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K` - **输入尺寸**:448×448 - **特征提取层**:[6, 12, 18, 24] - **特征维度**:1024维 - **输出特征尺寸**:32×32 → 64×64(插值) **具体实现**: ```python # model_ensemble.py:96-97 self.model_clip, _, _ = open_clip.create_model_and_transforms('hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K') self.feature_list = [6, 12, 18, 24] ``` **协作机制**: - 提供图像的语义特征表示 - 通过文本提示编码不同物体的语义信息 - 用于语义分割和异常分类 ### 2. DINOv2 ViT-L-14 模型 **作用**:提供更丰富的视觉特征 - **模型**:`dinov2_vitl14` - **特征提取层**:[6, 12, 18, 24] - **特征维度**:1024维 - **输出特征尺寸**:32×32 → 64×64(插值) **具体实现**: ```python # model_ensemble.py:181-186 from dinov2.dinov2.hub.backbones import dinov2_vitl14 self.model_dinov2 = dinov2_vitl14() self.feature_list_dinov2 = [6, 12, 18, 24] ``` **协作机制**: - 为某些类别(splicing_connectors, breakfast_box, juice_bottle)提供更强的视觉特征 - 与CLIP特征互补,提高检测精度 ### 3. SAM (Segment Anything Model) **作用**:实例分割 - **模型**:ViT-H版本 - **检查点**:`./checkpoint/sam_vit_h_4b8939.pth` - **功能**:自动生成物体mask **具体实现**: ```python # model_ensemble.py:102-103 self.model_sam = sam_model_registry["vit_h"](checkpoint = "./checkpoint/sam_vit_h_4b8939.pth") self.mask_generator = SamAutomaticMaskGenerator(model = self.model_sam) ``` **协作机制**: - 提供精确的物体边界 - 用于实例级别的异常检测 - 与语义分割结果融合 ## 数据处理与尺寸变换详解 ### 图像预处理流程 1. **输入尺寸标准化**: ```python # evaluation.py:184 datamodule = MVTecLoco(root=dataset_path, eval_batch_size=1, image_size=(448, 448), category=category) ``` 2. **归一化处理**: ```python # model_ensemble.py:88-92 self.transform = v2.Compose([ v2.Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711)), ]) ``` 3. **特征图尺寸变换**: ```python # model_ensemble.py:155-156 self.feat_size = 64 # 目标特征图大小 self.ori_feat_size = 32 # 原始特征图大小 ``` ### 详细的Resize流程 **CLIP特征处理**: ```python # model_ensemble.py:245-255 # 1. 从32x32插值到64x64 patch_tokens_clip = patch_tokens_clip.view(1, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2) patch_tokens_clip = F.interpolate(patch_tokens_clip, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners) patch_tokens_clip = patch_tokens_clip.permute(0, 2, 3, 1).view(-1, self.vision_width * len(self.feature_list)) ``` **DINOv2特征处理**: ```python # model_ensemble.py:253-263 # 相同的插值流程 patch_tokens_dinov2 = F.interpolate(patch_tokens_dinov2, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners) ``` **插值参数**: - **插值模式**:双线性插值(`bilinear`) - **对齐角点**:`align_corners=True` - **抗锯齿**:`antialias=True` ## SAM多Mask处理机制 ### SAM生成多个Mask的处理 **Mask生成**: ```python # model_ensemble.py:394 masks = self.mask_generator.generate(raw_image) sorted_masks = sorted(masks, key=(lambda x: x['area']), reverse=True) ``` **Mask融合策略**: ```python # model_ensemble.py:347-367 def merge_segmentations(a, b, background_class): """将SAM mask与语义分割结果融合""" # 通过投票机制确定每个SAM区域的语义标签 for label_a in unique_labels_a: mask_a = (a == label_a) labels_b = b[mask_a] if labels_b.size > 0: count_b = np.bincount(labels_b, minlength=unique_labels_b.max() + 1) label_map[label_a] = np.argmax(count_b) # 多数投票 ``` **多Mask协作流程**: 1. SAM生成所有可能的实例mask 2. K-means聚类生成语义分割mask 3. 文本引导生成patch级别的语义mask 4. 通过投票机制融合不同来源的mask 5. 过滤小区域噪声(阈值:32像素) ## Ground Truth多Mask处理机制 ### MVTec LOCO数据集的Mask组织结构 **文件结构**: ``` dataset/ ├── test/category/image_filename.png # 测试图像 ├── ground_truth/category/image_filename/ # 对应的GT mask目录 │ ├── 000.png # 第一个异常区域mask │ ├── 001.png # 第二个异常区域mask │ ├── 002.png # 第三个异常区域mask │ └── ... # 更多异常区域mask ``` **数据加载时的多Mask聚合**: ```python # anomalib/data/image/mvtec_loco.py:142-148 mask_samples = ( mask_samples.groupby(["path", "split", "label", "image_folder"])["image_path"] .agg(list) # 将同一图像的多个mask路径聚合成列表 .reset_index() .rename(columns={"image_path": "mask_path"}) ) ``` ### 多Mask融合策略 **步骤1:Mask路径处理**: ```python # anomalib/data/image/mvtec_loco.py:279-280 if isinstance(mask_path, str): mask_path = [mask_path] # 确保mask_path是列表格式 ``` **步骤2:语义Mask堆叠**: ```python # anomalib/data/image/mvtec_loco.py:281-285 semantic_mask = ( Mask(torch.zeros(image.shape[-2:])).to(torch.uint8) # 正常图像:零mask if label_index == LabelName.NORMAL else Mask(torch.stack([self._read_mask(path) for path in mask_path])) # 异常图像:堆叠所有mask ) ``` **步骤3:二值Mask生成**: ```python # anomalib/data/image/mvtec_loco.py:287 binary_mask = Mask(semantic_mask.view(-1, *semantic_mask.shape[-2:]).int().any(dim=0).to(torch.uint8)) ``` ### 关键融合机制解析 **维度变换**: - 输入:多个mask,每个形状为 (H, W) - 堆叠后:(N, H, W),其中N为mask数量 - `view(-1, H, W)`:重塑为 (N, H, W) - `any(dim=0)`:沿第一维度求或运算,得到 (H, W) **融合逻辑**: ```python # 伪代码示例 mask1 = [[0, 1, 0], mask2 = [[0, 0, 1], [1, 0, 1], [0, 1, 0], [0, 1, 0]] [1, 0, 0]] # 堆叠:shape (2, 3, 3) stacked = torch.stack([mask1, mask2]) # any操作:逐像素求或 result = [[0, 1, 1], # max(0,0), max(1,0), max(0,1) [1, 1, 1], # max(1,0), max(0,1), max(1,0) [1, 1, 0]] # max(0,1), max(1,0), max(0,0) ``` ### 数据加载完整流程 **MVTec LOCO数据项结构**: ```python # 正常样本 item = { "image_path": "/path/to/normal_image.png", "label": 0, "image": torch.Tensor(...), "mask": torch.zeros(H, W), # 零mask "mask_path": [], # 空列表 "semantic_mask": torch.zeros(H, W) # 零mask } # 异常样本 item = { "image_path": "/path/to/abnormal_image.png", "label": 1, "image": torch.Tensor(...), "mask": torch.Tensor(...), # 融合后的二值mask "mask_path": [ # 多个mask路径列表 "/path/to/ground_truth/image/000.png", "/path/to/ground_truth/image/001.png", "/path/to/ground_truth/image/002.png" ], "semantic_mask": torch.Tensor(...) # 原始多mask堆叠,shape (N, H, W) } ``` ### 评估时的Mask使用 **重要特性**:LogSAD在推理过程中**不使用**ground truth mask,完全基于输入图像进行异常检测。Ground truth mask仅用于: 1. **性能评估**:计算AUROC、F1等指标 2. **可视化对比**:与预测结果对比 3. **指标计算**:像素级和语义级异常检测性能 **验证机制**: ```python # anomalib/data/image/mvtec_loco.py:158-174 # 验证mask文件与图像文件的对应关系 image_stems = samples.loc[samples.label_index == LabelName.ABNORMAL]["image_path"].apply(lambda x: Path(x).stem) mask_parent_stems = samples.loc[samples.label_index == LabelName.ABNORMAL]["mask_path"].apply( lambda x: {Path(mask_path).parent.stem for mask_path in x}, ) # 确保 image: '005.png' 对应 mask: '005/000.png', '005/001.png' 等 ``` ### 多Mask场景的实际应用 **典型场景**: 1. **Splicing Connectors**:连接器、电缆、夹具可能分别标注 2. **Juice Bottle**:液体、标签、瓶身缺陷可能分别标注 3. **Breakfast Box**:不同食物的缺失可能分别标注 4. **Screw Bag**:不同螺丝、螺母、垫圈的异常分别标注 **处理优势**: - 保留了详细的异常区域信息 - 支持多类型异常的联合评估 - 便于细粒度的性能分析 - 兼容传统二值异常检测评估 ## 关键特判逻辑详解 代码中存在**5个主要特判分支**,分别对应不同的数据集类别: ### 1. Pushpins类别特判 **位置**:`model_ensemble.py:432-479` **逻辑**: ```python if self.class_name == 'pushpins': # 1. 物体计数检测 pushpins_count = num_labels - 1 if self.few_shot_inited and pushpins_count != self.pushpins_count: self.anomaly_flag = True # 2. Patch直方图匹配 clip_patch_hist = np.bincount(patch_mask.reshape(-1), minlength=self.patch_query_obj.shape[0]) patch_hist_similarity = (clip_patch_hist @ self.patch_token_hist.T) score = 1 - patch_hist_similarity.max() ``` **检测异常类型**: - 推钉数量异常(标准数量:15个) - 颜色分布异常 ### 2. Splicing Connectors类别特判 **位置**:`model_ensemble.py:481-615` **复杂逻辑**: ```python elif self.class_name == 'splicing_connectors': # 1. 连接组件检测 if count != 1: self.anomaly_flag = True # 2. 电缆颜色与夹具数量匹配检测 foreground_pixel_count = np.sum(erode_binary) / self.splicing_connectors_count[idx_color] ratio = foreground_pixel_count / self.foreground_pixel_hist_splicing_connectors if ratio > 1.2 or ratio < 0.8: self.anomaly_flag = True # 3. 左右对称性检测 ratio = np.sum(left_count) / (np.sum(right_count) + 1e-5) if ratio > 1.2 or ratio < 0.8: self.anomaly_flag = True # 4. 距离检测 distance = np.sqrt((x1/w - x2/w)**2 + (y1/h - y2/h)**2) ratio = distance / self.splicing_connectors_distance if ratio < 0.6 or ratio > 1.4: self.anomaly_flag = True ``` **检测异常类型**: - 电缆断裂或缺失 - 颜色与夹具数量不匹配(黄色2夹、蓝色3夹、红色5夹) - 左右夹具不对称 - 电缆长度异常 ### 3. Screw Bag类别特判 **位置**:`model_ensemble.py:617-670` **逻辑**: ```python elif self.class_name == 'screw_bag': # 前景像素统计异常检测 foreground_pixel_count = np.sum(np.bincount(kmeans_mask.reshape(-1))[:len(self.foreground_label_idx[self.class_name])]) ratio = foreground_pixel_count / self.foreground_pixel_hist_screw_bag if ratio < 0.94 or ratio > 1.06: self.anomaly_flag = True ``` **检测异常类型**: - 螺丝、螺母、垫圈数量异常 - 前景像素比例异常(阈值:±6%) ### 4. Juice Bottle类别特判 **位置**:`model_ensemble.py:715-771` **逻辑**: ```python elif self.class_name == 'juice_bottle': # 液体与水果匹配检测 liquid_idx = (liquid_feature @ query_liquid.T).argmax(-1).squeeze(0).item() fruit_idx = (fruit_feature @ query_fruit.T).argmax(-1).squeeze(0).item() if liquid_idx != fruit_idx: self.anomaly_flag = True ``` **检测异常类型**: - 液体颜色与标签水果不匹配 - 标签错位 ### 5. Breakfast Box类别特判 **位置**:`model_ensemble.py:672-713` **逻辑**: ```python elif self.class_name == 'breakfast_box': # 主要依靠patch直方图匹配 sam_patch_hist = np.bincount(patch_merge_sam.reshape(-1), minlength=self.patch_query_obj.shape[0]) patch_hist_similarity = (sam_patch_hist @ self.patch_token_hist.T) score = 1 - patch_hist_similarity.max() ``` **检测异常类型**: - 食物分布异常 - 缺失或多余物品 ## Few-shot与Full-data模式区别 ### 数据处理差异 **Few-shot模式**(`model_ensemble_few_shot.py`): ```python # 直接使用所有few-shot样本 FEW_SHOT_SAMPLES = [0, 1, 2, 3] # 固定4个样本 self.k_shot = few_shot_samples.size(0) ``` **Full-data模式**(`model_ensemble.py`): ```python # 使用完整训练集构建coreset FEW_SHOT_SAMPLES = range(len(datamodule.train_data)) # 所有训练样本 self.k_shot = 4 if self.total_size > 4 else self.total_size ``` ### Coreset子采样机制 **Few-shot模式**:无coreset,直接使用原始特征 ```python # model_ensemble_few_shot.py:852 self.mem_patch_feature_clip_coreset = patch_tokens_clip self.mem_patch_feature_dinov2_coreset = patch_tokens_dinov2 ``` **Full-data模式**:使用K-Center Greedy算法进行coreset子采样 ```python # model_ensemble.py:892-896 clip_sampler = KCenterGreedy(embedding=mem_patch_feature_clip_coreset, sampling_ratio=0.25) mem_patch_feature_clip_coreset = clip_sampler.sample_coreset() dinov2_sampler = KCenterGreedy(embedding=mem_patch_feature_dinov2_coreset, sampling_ratio=0.25) mem_patch_feature_dinov2_coreset = dinov2_sampler.sample_coreset() ``` ### 统计信息差异 **Few-shot模式**: ```python # model_ensemble_few_shot.py:185 self.stats = pickle.load(open("memory_bank/statistic_scores_model_ensemble_few_shot_val.pkl", "rb")) ``` **Full-data模式**: ```python # model_ensemble.py:188 self.stats = pickle.load(open("memory_bank/statistic_scores_model_ensemble_val.pkl", "rb")) ``` ### 计算流程差异 **Few-shot模式流程**: 1. 直接计算4个样本的特征 2. 无需coreset计算 3. 直接进行异常检测 **Full-data模式流程**: 1. 计算所有训练样本特征(`compute_coreset.py`) 2. 使用K-Center Greedy算法选择代表性特征 3. 保存coreset到`memory_bank/`目录 4. 加载预计算的coreset进行异常检测 ## 实现细节与优化 ### 内存优化策略 **批处理机制**: ```python # model_ensemble.py:926-928 for i in range(self.total_size//self.k_shot): self.process(class_name, few_shot_samples[self.k_shot*i : min(self.k_shot*(i+1), self.total_size)], few_shot_paths[self.k_shot*i : min(self.k_shot*(i+1), self.total_size)]) ``` **特征缓存**: - 预计算的coreset特征保存在`memory_bank/`目录 - 统计信息预计算并缓存 ### 多模态特征融合 **特征层选择策略**: - **聚类特征**:使用CLIP的第0、1层(`cluster_feature_id = [0, 1]`) - **检测特征**:使用第6、12、18、24层的完整特征 **不同类别的模型选择**: ```python # model_ensemble.py:290-310 if self.class_name in ['pushpins', 'screw_bag']: # 使用CLIP特征进行PatchCore检测 len_feature_list = len(self.feature_list) for patch_feature, mem_patch_feature in zip(patch_tokens_clip.chunk(len_feature_list, dim=-1), mem_patch_feature_clip_coreset.chunk(len_feature_list, dim=-1)): if self.class_name in ['splicing_connectors', 'breakfast_box', 'juice_bottle']: # 使用DINOv2特征进行PatchCore检测 len_feature_list = len(self.feature_list_dinov2) for patch_feature, mem_patch_feature in zip(patch_tokens_dinov2.chunk(len_feature_list, dim=-1), mem_patch_feature_dinov2_coreset.chunk(len_feature_list, dim=-1)): ``` ## 文本提示工程 ### 语义查询词典 **物体级别查询**: ```python # model_ensemble.py:123-136 self.query_words_dict = { "breakfast_box": ['orange', "nectarine", "cereals", "banana chips", 'almonds', 'white box', 'black background'], "juice_bottle": ['bottle', ['black background', 'background']], "pushpins": [['pushpin', 'pin'], ['plastic box', 'black background']], "screw_bag": [['screw'], 'plastic bag', 'background'], "splicing_connectors": [['splicing connector', 'splice connector',], ['cable', 'wire'], ['grid']], } ``` **Patch级别查询**: ```python # model_ensemble.py:138-145 self.patch_query_words_dict = { "juice_bottle": [['glass'], ['liquid in bottle'], ['fruit'], ['label', 'tag'], ['black background', 'background']], "screw_bag": [['hex screw', 'hexagon bolt'], ['hex nut', 'hexagon nut'], ['ring washer', 'ring gasket'], ['plastic bag', 'background']], # ... } ``` ### 文本编码策略 **多模板编码**: ```python # prompt_ensemble.py:98-120 def encode_obj_text(model, query_words, tokenizer, device): for qw in query_words: if type(qw) == list: for qw2 in qw: token_input.extend([temp(qw2) for temp in openai_imagenet_template]) else: token_input = [temp(qw) for temp in openai_imagenet_template] ``` 使用82个不同的ImageNet模板进行文本增强,提高文本特征的鲁棒性。 ## 性能评估 ### 评估指标 **图像级别指标**: - F1-Max(Image) - AUROC(Image) **异常类型指标**: - F1-Max(Logical):逻辑异常 - AUROC(Logical):逻辑异常 - F1-Max(Structural):结构异常 - AUROC(Structural):结构异常 ### 评估流程 **数据分离**: ```python # evaluation.py:222-227 if 'logical' not in image_path[0]: image_metric_structure.update(output["pred_score"].cpu(), data["label"]) if 'structural' not in image_path[0]: image_metric_logical.update(output["pred_score"].cpu(), data["label"]) ``` **分数融合**: ```python # model_ensemble.py:227-231 standard_structural_score = (structural_score - self.stats[self.class_name]["structural_scores"]["mean"]) / self.stats[self.class_name]["structural_scores"]["unbiased_std"] standard_instance_hungarian_match_score = (instance_hungarian_match_score - self.stats[self.class_name]["instance_hungarian_match_scores"]["mean"]) / self.stats[self.class_name]["instance_hungarian_match_scores"]["unbiased_std"] pred_score = max(standard_instance_hungarian_match_score, standard_structural_score) pred_score = sigmoid(pred_score) ``` ## 总结 LogSAD通过巧妙结合多个预训练模型的优势,实现了无需训练的异常检测: 1. **多模态协作**:CLIP提供语义理解、DINOv2提供视觉特征、SAM提供精确分割 2. **逻辑推理**:通过领域知识编码的特判逻辑检测复杂的逻辑异常 3. **特征融合**:多尺度特征提取和融合提高检测精度 4. **高效优化**:Coreset子采样和特征缓存机制保证实用性 该方法在MVTec LOCO数据集上取得了优异的性能,展示了预训练模型在异常检测任务中的巨大潜力。