---
language:
- ja
tags:
- clip
- japanese-stable-clip
pipeline_tag: feature-extraction
license: other
extra_gated_prompt: >-
  By clicking "Agree", you agree to the [License Agreement](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16/blob/main/LICENSE.md)
  and acknowledge Stability AI's [Privacy Policy](https://stability.ai/privacy-policy).
extra_gated_fields:
  Name: text
  Email: text
  Country: country
  Organization or Affiliation: text
  Receive email updates and promotions on Stability AI products, services, and research?:
    type: select
    options:
    - 'Yes'
    - 'No'
---

# Japanese Stable CLIP ViT-L/16

Please note: for commercial usage of this model, please see https://stability.ai/license

For inquiries in Japanese regarding commercial use, please contact partners-jp@stability.ai.

## Model Details

Japanese Stable CLIP is a Japanese [CLIP (Contrastive Language-Image Pre-Training)](https://arxiv.org/abs/2103.00020) model that maps both Japanese text and images into the same embedding space. On its own, the model can perform tasks such as zero-shot image classification and text-to-image retrieval. Combined with other components, it can also serve as part of generative models, for example in image-to-text and text-to-image generation.

## Usage
1. Install packages

```
pip install ftfy pillow requests transformers torch sentencepiece protobuf
```

2. Run!

```python
from typing import Union, List
import ftfy, html, re, io
import requests
from PIL import Image
import torch
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor, BatchFeature


# taken from https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/tokenizer.py#L65C8-L65C8
def basic_clean(text):
    text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()


def whitespace_clean(text):
    text = re.sub(r"\s+", " ", text)
    text = text.strip()
    return text


def tokenize(
    tokenizer,
    texts: Union[str, List[str]],
    max_seq_len: int = 77,
):
    """
    Counterpart of the tokenize function in the original CLIP code.
    https://github.com/openai/CLIP/blob/main/clip/clip.py#L195
    """
    if isinstance(texts, str):
        texts = [texts]
    texts = [whitespace_clean(basic_clean(text)) for text in texts]

    # reserve one position for the BOS token that is prepended below
    inputs = tokenizer(
        texts,
        max_length=max_seq_len - 1,
        padding="max_length",
        truncation=True,
        add_special_tokens=False,
    )
    # prepend the BOS token to every sequence
    input_ids = [[tokenizer.bos_token_id] + ids for ids in inputs["input_ids"]]
    attention_mask = [[1] + am for am in inputs["attention_mask"]]
    position_ids = [list(range(0, len(input_ids[0])))] * len(texts)

    return BatchFeature(
        {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
            "position_ids": torch.tensor(position_ids, dtype=torch.long),
        }
    )


device = "cuda" if torch.cuda.is_available() else "cpu"
model_path = "stabilityai/japanese-stable-clip-vit-l-16"
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoImageProcessor.from_pretrained(model_path)

# Run!
image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = processor(images=image, return_tensors="pt").to(device)
text = tokenize(
    tokenizer=tokenizer,
    texts=["犬", "猫", "象"],  # candidate labels: dog, cat, elephant
).to(device)

with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    # scaled similarity between the image and each label, turned into probabilities
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# [[1.0, 0.0, 0.0]]
```
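The same embeddings can also drive text-to-image retrieval: embed one query text and several images, then sort the images by their similarity to the text. The block below is a minimal sketch of that ranking step in plain Python. The helper names `l2_normalize` and `rank_images` are hypothetical, and the toy 3-dimensional vectors only stand in for real model outputs. The sketch normalizes explicitly so that it does not rely on the embeddings already being unit length.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_images(
    text_emb: Sequence[float],
    image_embs: Sequence[Sequence[float]],
) -> List[int]:
    """Return image indices ordered from most to least similar to the text.

    Similarity is the cosine similarity between the text embedding
    and each image embedding.
    """
    query = l2_normalize(text_emb)
    scores = [
        sum(a * b for a, b in zip(query, l2_normalize(img)))
        for img in image_embs
    ]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy vectors (assumption: placeholders, not real embeddings).
text_emb = [1.0, 0.0, 0.0]
image_embs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.1, 0.0],  # almost parallel to the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]
print(rank_images(text_emb, image_embs))  # [1, 2, 0]
```

With the real model, the inputs would presumably come from `model.get_text_features` and `model.get_image_features` as in the usage example, and the same ranking is better expressed with tensor operations than with Python lists.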
## Model Details

* **Developed by**: [Stability AI](https://stability.ai/)
* **Model type**: Contrastive Image-Text, Zero-Shot Image Classification
* **Language(s)**: Japanese
* **License**: [STABILITY AI COMMUNITY LICENSE](./LICENSE.md).

| Model | ImageNet top-1 accuracy\* |
| :-- | --: |
| **Japanese Stable CLIP ViT-L/16** | 62.06 |
| [rinna/japanese-cloob-vit-b-16](https://huggingface.co/rinna/japanese-cloob-vit-b-16) | 54.64 |
| [laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k) | 53 |
| [rinna/japanese-clip-vit-b-16](https://huggingface.co/rinna/japanese-clip-vit-b-16) | 50.69 |

\* Scores were computed based on https://github.com/rinnakk/japanese-clip.

### Training

The model uses a ViT-L/16 Transformer architecture as the image encoder and a 12-layer BERT as the text encoder, with the Japanese tokenizer from [rinna/japanese-roberta-base](https://huggingface.co/rinna/japanese-roberta-base). During training, the image encoder was initialized from the [AugReg](https://arxiv.org/abs/2106.10270) [vit-large-patch16-224](https://huggingface.co/timm/vit_large_patch16_224.augreg_in21k_ft_in1k) model, and we applied [SigLIP (Sigmoid loss for Language-Image Pre-training)](https://arxiv.org/abs/2303.15343).

### Training Dataset

The training dataset includes the following public datasets:

- [CC12M](https://github.com/google-research-datasets/conceptual-12m) with captions translated into Japanese
- [MS-COCO](https://cocodataset.org/#home) with [STAIR Captions](http://captions.stair.center/)

## Use and Limitations

### Intended Use

This model is intended to be used by the open-source community in vision-language applications.

### Limitations and Bias

The training dataset may have contained offensive or inappropriate content even though we applied data filters. We recommend users exercise reasonable caution when using these models in production systems.
Do not use the model for any applications that may cause harm or distress to individuals or groups.

## How to cite

```bibtex
@misc{JapaneseStableCLIP,
    url    = {https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16},
    title  = {Japanese Stable CLIP ViT-L/16},
    author = {Shing, Makoto and Akiba, Takuya}
}
```

## Contact

* For questions and comments about the model, please join [Stable Community Japan](https://discord.com/invite/StableJP).
* For future announcements / information about Stability AI models, research, and events, please follow https://twitter.com/StabilityAI_JP.
* For business and partnership inquiries, please contact partners-jp@stability.ai.

For business and collaboration inquiries, please contact sales-jp@stability.ai.