sebastiansarasti committed on
Commit d1b54e9 · verified · 1 Parent(s): 448c982

adding comments to update

Files changed (1): README.md (+21 -1)
README.md CHANGED

pipeline_tag: image-to-text
tags:
- CLIP
---

# CLIP Model based on DistilBERT and ViT

This repository contains a CLIP (Contrastive Language-Image Pretraining) model that combines the power of two state-of-the-art architectures:
- **DistilBERT** (based on `distilbert-base-uncased`): A smaller, faster, and lighter version of BERT.
- **Vision Transformer (ViT)** (based on `google/vit-base-patch16-224`): A powerful vision transformer architecture for image processing.

The model is trained to learn joint representations of images and text, enabling a variety of multimodal tasks such as image-text matching, zero-shot classification, and cross-modal retrieval.

## Model Overview

CLIP combines a text encoder and an image encoder to map both images and texts into a shared embedding space. By training the model on a large number of image-text pairs, it can perform various downstream tasks without needing task-specific fine-tuning.
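The shared-embedding idea can be sketched without the pretrained backbones: each encoder's features are projected into a common space, L2-normalized, and compared with a temperature-scaled similarity matrix. This is a minimal NumPy sketch, not this repository's actual code; the projection size (256) and temperature (0.07) are illustrative assumptions.

```python
import numpy as np

def project_and_normalize(features, weights):
    # Project encoder features into the shared space, then L2-normalize
    # so that dot products become cosine similarities.
    embeddings = features @ weights
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def clip_logits(text_features, image_features, w_text, w_image, temperature=0.07):
    # Score every text against every image: entry [i, j] is the scaled
    # cosine similarity between text i and image j.
    t = project_and_normalize(text_features, w_text)
    v = project_and_normalize(image_features, w_image)
    return (t @ v.T) / temperature

rng = np.random.default_rng(0)
# Stand-ins for encoder outputs: both DistilBERT-base and ViT-base emit 768-d features.
text_feats = rng.normal(size=(4, 768))
image_feats = rng.normal(size=(4, 768))
w_text = rng.normal(size=(768, 256))
w_image = rng.normal(size=(768, 256))

logits = clip_logits(text_feats, image_feats, w_text, w_image)
print(logits.shape)  # (4, 4)
```

During training, a symmetric cross-entropy loss pushes the diagonal of this matrix (matching pairs) up and the off-diagonal entries down.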

### Components:
- **Text Encoder**: `distilbert-base-uncased` is used to encode the textual input into a dense vector.
- **Image Encoder**: `google/vit-base-patch16-224` processes image data by dividing images into patches and learning their contextual relationships.

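With both encoders mapping into the same space, zero-shot classification reduces to comparing one image embedding against embeddings of candidate class prompts. The sketch below is hypothetical: the prompt embeddings are random stand-ins for real encoder outputs, and the 256-d size and temperature are assumptions.

```python
import numpy as np

def zero_shot_probs(image_embedding, prompt_embeddings, temperature=0.07):
    # Normalize everything, score the image against each class prompt,
    # and softmax the scaled cosine similarities into class probabilities.
    img = image_embedding / np.linalg.norm(image_embedding)
    prompts = prompt_embeddings / np.linalg.norm(prompt_embeddings, axis=1, keepdims=True)
    logits = (prompts @ img) / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(1)
# Stand-ins for embeddings of prompts like "a photo of a cat", "a photo of a dog", ...
prompt_embs = rng.normal(size=(3, 256))
# Build an image embedding close to class 0 so it should win.
image_emb = prompt_embs[0] + 0.1 * rng.normal(size=256)

probs = zero_shot_probs(image_emb, prompt_embs)
print(int(np.argmax(probs)))  # 0
```

In practice the prompt embeddings would come from the text encoder and the image embedding from the ViT branch; no per-task fine-tuning is needed.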
### Future work:

Train on larger datasets and with more compute resources.