# FLIP (Facial Language Image Pretrain)

This repository is the official implementation of [FaceCaption-15M](https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-15M).

**Overview of FLIP architecture.**

![image-20240318101027127](https://img.yutangli.net/img/202403181010116.png)

**(a) The same color represents shared parameters. "12x" stands for 12 transformer layers. (b), (c), and (d) show FLIP-based models applied to the tasks of text-image retrieval, facial attribute prediction, and sketch less facial image retrieval, respectively.**

## Training

Coming soon. (The training code is only meaningful once the datasets have been published.)

```shell
python pretrain.py > log.log
```

## Pre-trained Models

We provide pretrained model weights for the [ViT-Base version](https://huggingface.co/OpenFace-CQUPT/Facial-language-image-pretraining-model/tree/main/ckpt).
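As a minimal sketch, the checkpoint folder linked above could be fetched with the `huggingface_hub` client. The repository ID and the `ckpt/` sub-folder come from the link; the helper name `download_flip_checkpoints` and the default target directory are assumptions made for illustration, and the checkpoint file names inside `ckpt/` are not assumed.

```python
from pathlib import Path

# Repository ID and sub-folder taken from the link above.
REPO_ID = "OpenFace-CQUPT/Facial-language-image-pretraining-model"
CKPT_PATTERN = "ckpt/*"


def download_flip_checkpoints(local_dir: str = "./flip_weights") -> Path:
    """Download only the ckpt/ folder of the model repository.

    Hypothetical helper: the name and default directory are illustrative.
    """
    # Imported lazily so this module can be imported without the dependency.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[CKPT_PATTERN],  # skip everything outside ckpt/
        local_dir=local_dir,
    )
    return Path(path)
```

Calling `download_flip_checkpoints()` returns the local folder containing the downloaded files; listing it with `Path.rglob("*")` shows which checkpoint files the repository actually ships.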

## Datasets

> **Coming soon.**

**Overview of our proposed FaceCaption-15M containing over 15 million facial image-text (right and left) pairs.**

![image-20240318100601414](https://img.yutangli.net/img/202403181006981.png)

**Comparisons with other popular facial image datasets.**

![image-20240318100734131](https://img.yutangli.net/img/202403181007778.png)

**Image quality score distribution.**

![image-20240318100849106](https://img.yutangli.net/img/202403181008178.png)

**Text distribution.**

![image-20240318100913176](https://img.yutangli.net/img/202403181009312.png)

## Results

### Task 1: Text-Image Retrieval

**Comparison with other classical pretrained models. All pretrained model backbones are frozen, with only the linear layer being fine-tuned. † represents the model pretrained on the LAION-Face [86] dataset; * represents the model pretrained on the FaceCaption dataset constructed without using LLM text generation.**

![](https://img.yutangli.net/img/202403181015142.png)
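The frozen-backbone setup described in the caption above can be sketched in PyTorch as follows. This is an illustration under stated assumptions, not the repository's training code: the `LinearProbe` class, the toy backbone, the tensor shapes, and the mean-squared-error objective are all hypothetical placeholders chosen only to show that gradients reach the linear layer and nothing else.

```python
import torch
from torch import nn


class LinearProbe(nn.Module):
    """Frozen backbone followed by a single trainable linear layer."""

    def __init__(self, backbone: nn.Module, feat_dim: int, out_dim: int):
        super().__init__()
        self.backbone = backbone
        # Freeze every backbone parameter so only the head receives gradients.
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.backbone.eval()
        self.head = nn.Linear(feat_dim, out_dim)

    def train(self, mode: bool = True):
        # Keep the frozen backbone in eval mode even while the head trains.
        super().train(mode)
        self.backbone.eval()
        return self

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone(x)
        return self.head(feats)


# Hypothetical stand-in for a pretrained encoder; swap in the real backbone.
toy_backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 32), nn.ReLU())
model = LinearProbe(toy_backbone, feat_dim=32, out_dim=16)

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)

images = torch.randn(4, 3, 8, 8)  # dummy batch (assumed shape)
targets = torch.randn(4, 16)      # dummy targets for a placeholder loss
loss = nn.functional.mse_loss(model(images), targets)
loss.backward()
optimizer.step()

# Names of the parameters that are still trainable.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

Passing only `model.head.parameters()` to the optimizer, in addition to setting `requires_grad = False`, keeps the backbone weights untouched even if a later code change re-enables gradients by mistake.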

### Task 2: Facial Attributes Prediction

**Comparison with other classical models. † represents the model pre-trained on the original LAION-Face dataset.**

![image-20240318101126897](https://img.yutangli.net/img/202403181011115.png)

### Task 3: Sketch Less Facial Image Retrieval

**Comparative results with different baseline methods. † represents the model pre-trained on the LAION-Face dataset.**

![image-20240318101633671](https://img.yutangli.net/img/202403181016876.png)

**Early retrieval performance on the SLFIR problem. Instead of showing the complete sketch, we plot performance against the percentage of the sketch that has been drawn. A higher value indicates better early retrieval performance.**

![image-20240318101704679](https://img.yutangli.net/img/202403181017013.png)

## Citations & Contacts

> Coming soon.