adding comments to update
README.md CHANGED
@@ -10,4 +10,24 @@ base_model:
pipeline_tag: image-to-text
tags:
- CLIP
---

# CLIP Model based on DistilBERT and ViT

This repository contains a CLIP (Contrastive Language-Image Pretraining) model that combines the power of two state-of-the-art architectures:

- **DistilBERT** (based on `distilbert-base-uncased`): A smaller, faster, and lighter version of BERT.
- **Vision Transformer (ViT)** (based on `google/vit-base-patch16-224`): A powerful vision transformer architecture for image processing.

The model is trained to learn joint representations of images and text, enabling a variety of multimodal tasks such as image-text matching, zero-shot classification, and cross-modal retrieval.
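
As a rough illustration of the cross-modal retrieval task, the sketch below ranks a handful of caption embeddings against a single image embedding by cosine similarity. The tensors are random placeholders standing in for the outputs of the two encoders, and the 512-dimensional shared space is an assumption; this card does not state the checkpoint's actual embedding size.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings: in practice these would come from the image and
# text encoders (ViT and DistilBERT) after projection into the shared space.
# The 512-dim size is an assumption, not the checkpoint's actual dimension.
image_embedding = F.normalize(torch.randn(1, 512), dim=-1)      # one image
caption_embeddings = F.normalize(torch.randn(5, 512), dim=-1)   # five candidate captions

# Cosine similarity reduces to a dot product once both sides are L2-normalized.
similarities = image_embedding @ caption_embeddings.T           # shape (1, 5)

# Rank the captions from best to worst match for the image.
ranking = similarities.squeeze(0).argsort(descending=True)
print("caption indices ranked by similarity:", ranking.tolist())
```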
## Model Overview
CLIP combines a text encoder and an image encoder to map both images and texts into a shared embedding space. By training the model on a large number of image-text pairs, it can perform various downstream tasks without needing task-specific fine-tuning.
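
The training signal behind that shared space is the standard CLIP-style contrastive objective: each image should be most similar to its own caption and dissimilar to every other caption in the batch, and vice versa. The sketch below shows that symmetric cross-entropy loss in general form; the temperature and the toy batch at the end are illustrative values, not this repository's training configuration.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise cosine similarities scaled by a temperature.
    logits = image_emb @ text_emb.T / temperature        # (batch, batch)

    # The i-th image matches the i-th text, so the targets are the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)          # image -> text
    loss_t2i = F.cross_entropy(logits.T, targets)        # text -> image
    return (loss_i2t + loss_t2i) / 2

# Toy batch of 8 pairs with 512-dim embeddings (sizes are illustrative).
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```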
### Components:
- **Text Encoder**: `distilbert-base-uncased` is used to encode the textual input into a dense vector.
- **Image Encoder**: `google/vit-base-patch16-224` processes image data by dividing images into patches and learning their contextual relationships.
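
As a sketch of how these two components could be wired together, the snippet below wraps the named checkpoints in a dual-encoder module with linear projection heads into a shared space. The `DualEncoderCLIP` class, the 512-dimensional projection, and the CLS-token pooling are illustrative assumptions; the actual architecture and weights shipped in this repository may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, ViTModel

class DualEncoderCLIP(nn.Module):
    """Illustrative dual encoder: DistilBERT for text, ViT for images,
    each projected into a shared embedding space."""

    def __init__(self, embed_dim: int = 512):  # embed_dim is an assumption
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("distilbert-base-uncased")
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")
        self.text_proj = nn.Linear(self.text_encoder.config.dim, embed_dim)
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, embed_dim)

    def encode_text(self, input_ids, attention_mask):
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]       # first-token (CLS) representation
        return F.normalize(self.text_proj(cls), dim=-1)

    def encode_image(self, pixel_values):
        out = self.image_encoder(pixel_values=pixel_values)
        cls = out.last_hidden_state[:, 0]       # ViT [CLS] token
        return F.normalize(self.image_proj(cls), dim=-1)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = DualEncoderCLIP()

# Zero-shot style scoring of one image against two candidate captions.
# A real image would first be preprocessed (e.g. with ViTImageProcessor);
# here a random tensor stands in for a 224x224 RGB image.
texts = ["a photo of a dog", "a photo of a cat"]
tokens = tokenizer(texts, padding=True, return_tensors="pt")
pixels = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    text_emb = model.encode_text(**tokens)      # shape (2, 512)
    image_emb = model.encode_image(pixels)      # shape (1, 512)
    print((image_emb @ text_emb.T).softmax(dim=-1))
```

Replacing the random pixel tensor with a real preprocessed image, and loading this repository's trained projection weights, would turn these scores into meaningful zero-shot predictions.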
### Future work:
Train on larger datasets and with more compute resources.