adding comments to update
README.md CHANGED
@@ -10,4 +10,24 @@ base_model:
pipeline_tag: image-to-text
tags:
- CLIP
---

# CLIP Model based on DistilBERT and ViT

This repository contains a CLIP (Contrastive Language-Image Pretraining) model that combines the power of two state-of-the-art architectures:

- **DistilBERT** (based on `distilbert-base-uncased`): A smaller, faster, and lighter version of BERT.
- **Vision Transformer (ViT)** (based on `google/vit-base-patch16-224`): A powerful vision transformer architecture for image processing.

The model is trained to learn joint representations of images and text, enabling a variety of multimodal tasks such as image-text matching, zero-shot classification, and cross-modal retrieval.
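
As a rough illustration of the cross-modal retrieval task, the sketch below ranks a handful of caption embeddings against a single image embedding by cosine similarity. The tensors are random placeholders standing in for the outputs of the two encoders, and the 512-dimensional shared space is an assumption; this card does not state the checkpoint's actual embedding size.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings: in practice these would come from the image and
# text encoders (ViT and DistilBERT) after projection into the shared space.
# The 512-dim size is an assumption, not the checkpoint's actual dimension.
image_embedding = F.normalize(torch.randn(1, 512), dim=-1)      # one image
caption_embeddings = F.normalize(torch.randn(5, 512), dim=-1)   # five candidate captions

# Cosine similarity reduces to a dot product once both sides are L2-normalized.
similarities = image_embedding @ caption_embeddings.T           # shape (1, 5)

# Rank the captions from best to worst match for the image.
ranking = similarities.squeeze(0).argsort(descending=True)
print("caption indices ranked by similarity:", ranking.tolist())
```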
## Model Overview
CLIP combines a text encoder and an image encoder to map both images and texts into a shared embedding space. By training the model on a large number of image-text pairs, it can perform various downstream tasks without needing task-specific fine-tuning.
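
The training signal behind that shared space is the standard CLIP-style contrastive objective: each image should be most similar to its own caption and dissimilar to every other caption in the batch, and vice versa. The sketch below shows that symmetric cross-entropy loss in general form; the temperature and the toy batch at the end are illustrative values, not this repository's training configuration.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise cosine similarities scaled by a temperature.
    logits = image_emb @ text_emb.T / temperature        # (batch, batch)

    # The i-th image matches the i-th text, so the targets are the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)          # image -> text
    loss_t2i = F.cross_entropy(logits.T, targets)        # text -> image
    return (loss_i2t + loss_t2i) / 2

# Toy batch of 8 pairs with 512-dim embeddings (sizes are illustrative).
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```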
### Components:
- **Text Encoder**: `distilbert-base-uncased` is used to encode the textual input into a dense vector.
- **Image Encoder**: `google/vit-base-patch16-224` processes image data by dividing images into patches and learning their contextual relationships.
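
As a sketch of how these two components could be wired together, the snippet below wraps the named checkpoints in a dual-encoder module with linear projection heads into a shared space. The `DualEncoderCLIP` class, the 512-dimensional projection, and the CLS-token pooling are illustrative assumptions; the actual architecture and weights shipped in this repository may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, ViTModel

class DualEncoderCLIP(nn.Module):
    """Illustrative dual encoder: DistilBERT for text, ViT for images,
    each projected into a shared embedding space."""

    def __init__(self, embed_dim: int = 512):  # embed_dim is an assumption
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("distilbert-base-uncased")
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")
        self.text_proj = nn.Linear(self.text_encoder.config.dim, embed_dim)
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, embed_dim)

    def encode_text(self, input_ids, attention_mask):
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]       # first-token (CLS) representation
        return F.normalize(self.text_proj(cls), dim=-1)

    def encode_image(self, pixel_values):
        out = self.image_encoder(pixel_values=pixel_values)
        cls = out.last_hidden_state[:, 0]       # ViT [CLS] token
        return F.normalize(self.image_proj(cls), dim=-1)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = DualEncoderCLIP()

# Zero-shot style scoring of one image against two candidate captions.
# A real image would first be preprocessed (e.g. with ViTImageProcessor);
# here a random tensor stands in for a 224x224 RGB image.
texts = ["a photo of a dog", "a photo of a cat"]
tokens = tokenizer(texts, padding=True, return_tensors="pt")
pixels = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    text_emb = model.encode_text(**tokens)      # shape (2, 512)
    image_emb = model.encode_image(pixels)      # shape (1, 512)
    print((image_emb @ text_emb.T).softmax(dim=-1))
```

Replacing the random pixel tensor with a real preprocessed image, and loading this repository's trained projection weights, would turn these scores into meaningful zero-shot predictions.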
### Future work:
Train on larger datasets and with more compute resources.