---
library_name: transformers
tags: []
---

# Model Card for DeTi*k*Zify<sub>v2</sub> (8b)

DeTi*k*Zify<sub>v2</sub> (8b) is a multimodal language model that automatically
converts sketches and existing scientific figures into editable,
semantics-preserving Ti*k*Z graphics programs. It is based on
[LLaMA<sub>3.1</sub> (8b)](https://huggingface.co/meta-llama/Llama-3.1-8B) and
the SigLIP vision encoder of [PaliGemma<sub>Mix-448</sub>
(3b)](https://huggingface.co/google/paligemma-3b-mix-448). Check out the
[DeTi*k*Zify](https://github.com/potamides/DeTikZify) project for more
information and tips on how to best run the model.

## Usage

```python
from operator import itemgetter

from detikzify.model import load
from detikzify.infer import DetikzifyPipeline

image = "https://w.wiki/A7Cc"
pipeline = DetikzifyPipeline(*load(
    model_name_or_path="nllg/detikzify-v2-8b",
    device_map="auto",
    torch_dtype="bfloat16",
))

# generate a single TikZ program
fig = pipeline.sample(image=image)

# if it compiles, rasterize it and show it
if fig.is_rasterizable:
    fig.rasterize().show()

# run MCTS for 10 minutes and generate multiple TikZ programs
figs = set()
for score, fig in pipeline.simulate(image=image, timeout=600):
    figs.add((score, fig))

# save the best TikZ program
best = sorted(figs, key=itemgetter(0))[-1][1]
best.save("fig.tex")
```

## Changes from DeTi*k*Zify<sub>v1</sub>

We document all changes between DeTi*k*Zify<sub>v1</sub> and
DeTi*k*Zify<sub>v2</sub> in our paper, "[TikZero: Zero-Shot Text-Guided
Graphics Program Synthesis](https://arxiv.org/abs/2503.11509)". For
convenience, they are also listed below.

### Architecture

Similar to DeTi*k*Zify<sub>v1</sub>, DeTi*k*Zify<sub>v2</sub> uses a SigLIP
vision encoder. However, inspired by the continued ViT pretraining of
[InternVL](https://arxiv.org/abs/2404.16821), we initialize the weights with
the fine-tuned vision encoder of [PaliGemma<sub>Mix-448</sub>
(3b)](https://arxiv.org/abs/2407.07726) and increase DeTi*k*Zify's resolution
to 420x420 pixels. Further, the vision encoder is no longer kept frozen but
fully fine-tuned with the rest of the model.
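
As a rough illustration of this initialization, the sketch below pulls the
SigLIP vision tower out of PaliGemma<sub>Mix-448</sub> (3b) with the
`transformers` library and leaves all of its parameters trainable. This is a
hedged example of the idea, not the project's actual training code; the
`vision_tower` attribute refers to PaliGemma's SigLIP encoder in recent
`transformers` versions.

```python
import torch
from transformers import PaliGemmaForConditionalGeneration

# Load PaliGemma and take its fine-tuned SigLIP vision tower, which is
# what DeTikZify v2 uses to initialize its own vision encoder.
paligemma = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-mix-448",
    torch_dtype=torch.bfloat16,
)
vision_encoder = paligemma.vision_tower  # a SiglipVisionModel

# Instead of freezing the encoder, keep every parameter trainable so it
# can be fully fine-tuned together with the rest of the model.
for param in vision_encoder.parameters():
    param.requires_grad = True

print(sum(p.numel() for p in vision_encoder.parameters() if p.requires_grad))
```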

### Training Data

For pretraining the modality connector, we switch from MetaFig to the much
larger [ArXivCap](https://huggingface.co/datasets/MMInstruction/ArxivCap)
dataset and extract 1 million (figure, caption, OCR) tuples. For fine-tuning,
we create the new
[DaTi*k*Z<sub>v3</sub>](https://huggingface.co/datasets/nllg/datikz-v3)
dataset with over 450k Ti*k*Z drawings.
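
Both datasets are hosted on the Hugging Face Hub, so a minimal sketch for
inspecting DaTi*k*Z<sub>v3</sub> with the `datasets` library could look as
follows (the split name and column layout are assumptions; check the dataset
card for the actual schema):

```python
from datasets import load_dataset

# Stream the DaTikZ-v3 training split without downloading it all at once.
datikz = load_dataset("nllg/datikz-v3", split="train", streaming=True)

# Inspect the fields of a single example (column names may differ; the
# dataset card is authoritative).
example = next(iter(datikz))
print(example.keys())
```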

We also train a new model called
[UltraSketch](https://huggingface.co/nllg/ultrasketch) to generate synthetic
sketches during training. It is based on
[UltraEdit](https://arxiv.org/abs/2407.05282) and achieves a congruence
coefficient (CC) of 0.74. Additionally, we generate synthetic sketches using
image transformations. While these sketches are less diverse, they are better
at preserving text rendering, achieving a similar CC of 0.75. When we average
the sketch representations produced by both methods, the resulting CC
increases to 0.82, indicating that the methods are orthogonal and complement
each other effectively.
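
For reference, the congruence coefficient between two feature vectors is
their uncentered correlation. The toy sketch below (with random stand-in
embeddings, not real sketch representations) shows the formula and why
averaging two independent noisy views of the same target tends to raise the
CC:

```python
import numpy as np

def congruence_coefficient(x: np.ndarray, y: np.ndarray) -> float:
    """Tucker's congruence coefficient: the uncentered correlation of x and y."""
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

rng = np.random.default_rng(0)
human = rng.normal(size=512)                # stand-in for a human sketch representation
ultrasketch = human + rng.normal(size=512)  # noisy view 1 (e.g., UltraSketch)
transformed = human + rng.normal(size=512)  # noisy view 2 (e.g., image transformations)

print(congruence_coefficient(ultrasketch, human))  # ~0.7
print(congruence_coefficient(transformed, human))  # ~0.7
# Averaging the two views cancels their independent noise, so the combined
# representation is more congruent with the target.
print(congruence_coefficient((ultrasketch + transformed) / 2, human))  # ~0.8
```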

### Training & Inference

We observe improved performance by extending the training to 5 epochs and
increasing the learning rate to 5e-5. Fully fine-tuning the vision encoder
means that we can no longer compute SelfSim as the cosine similarity between
pooled outputs during inference, as the pooling head is not fine-tuned.
However, computing SelfSim as the Earth Mover's Distance between the
fine-tuned patch embeddings instead actually enhances the correlation with
human judgments (0.456 segment-level and 0.911 system-level correlation).
This means that DeTi*k*Zify<sub>v2</sub> also works well with our MCTS-based
inference algorithm.
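
A minimal sketch of such an EMD-style self-similarity is shown below. It
assumes two images yield equal-sized sets of patch embeddings (e.g., 30x30 =
900 patches for 420x420 inputs with a 14x14 patch size), in which case the
Earth Mover's Distance with uniform weights reduces to an optimal one-to-one
matching; the cosine cost matrix, the matching solver, and the final
rescaling are illustrative choices, not the project's exact implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def selfsim(patches_a: np.ndarray, patches_b: np.ndarray) -> float:
    """EMD-based similarity between two equal-sized sets of patch embeddings."""
    a = patches_a / np.linalg.norm(patches_a, axis=1, keepdims=True)
    b = patches_b / np.linalg.norm(patches_b, axis=1, keepdims=True)
    cost = 1 - a @ b.T  # pairwise cosine distances between patches
    rows, cols = linear_sum_assignment(cost)  # optimal transport with uniform weights
    emd = cost[rows, cols].mean()  # minimal average transport cost
    return 1 - emd  # convert the distance into a similarity score

rng = np.random.default_rng(0)
patches = rng.normal(size=(900, 64))
print(selfsim(patches, patches))                     # identical inputs -> 1.0
print(selfsim(patches, rng.normal(size=(900, 64))))  # unrelated inputs -> lower
```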

## Evaluation

Here is how DeTi*k*Zify<sub>v2</sub> (8b) compares to
[DeTi<i>k</i>Zify<sub>DS</sub> (7b)](https://huggingface.co/nllg/detikzify-ds-7b),
previously the best-performing DeTi*k*Zify model, as evaluated on the test
split of DaTi*k*Z<sub>v3</sub>. Scores are multiplied by 100.

<table>
  <tr>
    <th></th>
    <th colspan="5">Reference Figures</th>
    <th colspan="5">Synthetic Sketches</th>
  </tr>
  <tr>
    <th>Model</th>
    <th>DSim<sub>↑</sub></th>
    <th>KID<sub>↓</sub></th>
    <th>cBLEU<sub>↑</sub></th>
    <th>TED<sub>↓</sub></th>
    <th>MTE<sub>↑</sub></th>
    <th>DSim<sub>↑</sub></th>
    <th>KID<sub>↓</sub></th>
    <th>cBLEU<sub>↑</sub></th>
    <th>TED<sub>↓</sub></th>
    <th>MTE<sub>↑</sub></th>
  </tr>
  <tr>
    <td>DeTi<i>k</i>Zify<sub>DS</sub> (7b)</td>
    <td>75.46</td>
    <td>0.842</td>
    <td>2.953</td>
    <td>56.851</td>
    <td>84.019</td>
    <td>67.379</td>
    <td>0.766</td>
    <td>1.541</td>
    <td>59.589</td>
    <td>84.401</td>
  </tr>
  <tr>
    <td>DeTi<i>k</i>Zify<sub>v2</sub> (8b)</td>
    <td><b>80.503</b></td>
    <td><b>0.626</b></td>
    <td><b>6.105</b></td>
    <td><b>54.946</b></td>
    <td><b>93.326</b></td>
    <td><b>74.584</b></td>
    <td><b>0.751</b></td>
    <td><b>3.356</b></td>
    <td><b>58.32</b></td>
    <td><b>93.858</b></td>
  </tr>
</table>