---
library_name: transformers
tags: []
---
# Model Card for DeTi*k*Zify<sub>v2</sub> (8b)
DeTi*k*Zify<sub>v2</sub> (8b) is a multimodal language model that automatically
converts sketches and existing scientific figures into editable,
semantics-preserving Ti*k*Z graphics programs. It is based on
[LLaMA<sub>3.1</sub> (8b)](https://huggingface.co/meta-llama/Llama-3.1-8B) and
the SigLIP vision encoder of [PaliGemma<sub>Mix-448</sub>
(3b)](https://huggingface.co/google/paligemma-3b-mix-448). Check out the
[DeTi*k*Zify](https://github.com/potamides/DeTikZify) project for more
information and tips on how to best run the model.
## Usage
```python
from operator import itemgetter
from detikzify.model import load
from detikzify.infer import DetikzifyPipeline
image = "https://w.wiki/A7Cc"
pipeline = DetikzifyPipeline(*load(
    model_name_or_path="nllg/detikzify-v2-8b",
    device_map="auto",
    torch_dtype="bfloat16",
))
# generate a single TikZ program
fig = pipeline.sample(image=image)
# if it compiles, rasterize it and show it
if fig.is_rasterizable:
    fig.rasterize().show()
# run MCTS for 10 minutes and generate multiple TikZ programs
figs = set()
for score, fig in pipeline.simulate(image=image, timeout=600):
    figs.add((score, fig))
# save the best TikZ program
best = sorted(figs, key=itemgetter(0))[-1][1]
best.save("fig.tex")
```
## Changes from DeTi*k*Zify<sub>v1</sub>
We document all changes between DeTi*k*Zify<sub>v1</sub> and
DeTi*k*Zify<sub>v2</sub> in our paper, "[TikZero: Zero-Shot Text-Guided Graphics
Program Synthesis](https://arxiv.org/abs/2503.11509)". For convenience, they
are also listed below.
### Architecture
Like DeTi*k*Zify<sub>v1</sub>, DeTi*k*Zify<sub>v2</sub> uses a SigLIP
vision encoder. However, inspired by the continued ViT pretraining of
[InternVL](https://arxiv.org/abs/2404.16821), we initialize its weights with
the fine-tuned vision encoder of [PaliGemma<sub>Mix-448</sub>
(3b)](https://arxiv.org/abs/2407.07726) and increase DeTi*k*Zify's input
resolution to 420x420 pixels. Further, the vision encoder is no longer kept
frozen but is fully fine-tuned together with the rest of the model.
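The following is a minimal, illustrative sketch (not our actual training
code) of what such an initialization could look like with the
`transformers` implementation of PaliGemma; handling the smaller 420x420
input grid via position-embedding interpolation is an assumption about the
setup:

```python
# Hedged sketch: initialize a trainable SigLIP encoder from PaliGemma's
# fine-tuned vision tower (illustration only, not DeTikZify's training code).
import torch
from transformers import PaliGemmaForConditionalGeneration

paligemma = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-mix-448", torch_dtype=torch.bfloat16
)
vision_encoder = paligemma.vision_tower  # a SiglipVisionModel

# Unlike DeTikZify-v1, the encoder is not frozen but fully fine-tuned.
for param in vision_encoder.parameters():
    param.requires_grad = True

# PaliGemma was trained on 448x448 inputs; for 420x420 inputs the position
# embeddings can be interpolated on the fly.
pixel_values = torch.randn(1, 3, 420, 420, dtype=torch.bfloat16)
patch_embeddings = vision_encoder(
    pixel_values, interpolate_pos_encoding=True
).last_hidden_state
```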
### Training Data
For pretraining the modality connector, we switch from MetaFig to the much
larger [ArXivCap](https://huggingface.co/datasets/MMInstruction/ArxivCap)
dataset, from which we extract 1 million (figure, caption, OCR) tuples. For
fine-tuning, we create the new
[DaTi*k*Z<sub>v3</sub>](https://huggingface.co/datasets/nllg/datikz-v3) dataset
with over 450k Ti*k*Z drawings.
We also train a new model called
[UltraSketch](https://huggingface.co/nllg/ultrasketch) to generate synthetic
sketches during training. It is based on
[UltraEdit](https://arxiv.org/abs/2407.05282) and achieves a congruence
coefficient (CC) of 0.74. Additionally, we generate synthetic sketches using
image transformation. While these sketches are less diverse, they are better at
preserving text rendering, achieving a similar CC of 0.75. When we average the
sketch representations produced by both methods, the resulting CC increases to
0.82, indicating that the methods are orthogonal and complement each other
effectively.
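To make the metric concrete, the sketch below computes Tucker's congruence
coefficient between two embedding vectors and averages the representations
produced by both sketch-generation methods; the embedding model, its
dimensionality, and the exact CC variant used in the paper are assumptions
here:

```python
# Illustrative sketch, not the paper's evaluation code.
import numpy as np

def congruence_coefficient(x: np.ndarray, y: np.ndarray) -> float:
    # Tucker's CC: like cosine similarity, but without mean-centering.
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

# Hypothetical embeddings of the same figure sketched by both methods.
ultrasketch_emb = np.random.randn(768)  # UltraSketch sketch embedding
transform_emb = np.random.randn(768)    # image-transformation sketch embedding

# Averaging both representations combines their complementary strengths.
combined = (ultrasketch_emb + transform_emb) / 2
```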
### Training & Inference
We observe improved performance by extending the training to 5 epochs and
increasing the learning rate to 5e-5. Fully fine-tuning the vision encoder
means that we can no longer compute SelfSim as the cosine similarity between
pooled outputs during inference, as the pooling head is not fine-tuned.
However, computing the Earth Mover's Distance on the fine-tuned patch
embeddings instead actually enhances the correlation with human judgments
(0.456 segment-level and 0.911 system-level correlation). This means that
DeTi*k*Zify<sub>v2</sub> also works well with our MCTS-based inference
algorithm.
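A minimal sketch of such a SelfSim computation, assuming uniform weights
over patches and the `ot` package from [POT](https://pythonot.github.io/)
(the actual implementation in the DeTi*k*Zify repository may differ):

```python
# Hedged sketch: SelfSim as Earth Mover's Distance between patch embeddings.
import numpy as np
import ot  # Python Optimal Transport (POT)

def selfsim_emd(patches_a: np.ndarray, patches_b: np.ndarray) -> float:
    """EMD between two (num_patches, hidden_dim) patch-embedding sets;
    lower values indicate more similar images."""
    # Cosine-distance cost matrix between all pairs of patches.
    a = patches_a / np.linalg.norm(patches_a, axis=1, keepdims=True)
    b = patches_b / np.linalg.norm(patches_b, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T
    # Uniform mass over the patches of each image.
    wa = np.full(len(a), 1.0 / len(a))
    wb = np.full(len(b), 1.0 / len(b))
    return float(ot.emd2(wa, wb, cost))
```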
## Evaluation
Here is how DeTi*k*Zify<sub>v2</sub> (8b) compares to
[DeTi<i>k</i>Zify<sub>DS</sub>
(7b)](https://huggingface.co/nllg/detikzify-ds-7b), previously the
best-performing DeTi*k*Zify model, as evaluated on the test split of
DaTi*k*Z<sub>v3</sub>. Scores are multiplied by 100.
<table>
<tr>
<th></th>
<th colspan="5">Reference Figures</th>
<th colspan="5">Synthetic Sketches</th>
</tr>
<tr>
<th>Model</th>
<th>DSim<sub>&uarr;</sub></th>
<th>KID<sub>&darr;</sub></th>
<th>cBLEU<sub>&uarr;</sub></th>
<th>TED<sub>&darr;</sub></th>
<th>MTE<sub>&uarr;</sub></th>
<th>DSim<sub>&uarr;</sub></th>
<th>KID<sub>&darr;</sub></th>
<th>cBLEU<sub>&uarr;</sub></th>
<th>TED<sub>&darr;</sub></th>
<th>MTE<sub>&uarr;</sub></th>
</tr>
<tr>
<td>DeTi<i>k</i>Zify<sub>DS</sub> (7b)</td>
<td>75.46 </td>
<td> 0.842</td>
<td> 2.953</td>
<td>56.851</td>
<td>84.019</td>
<td>67.379</td>
<td> 0.766</td>
<td> 1.541</td>
<td>59.589</td>
<td>84.401</td>
</tr>
<tr>
<td>DeTi<i>k</i>Zify<sub>v2</sub> (8b)</td>
<td><b>80.503</b></td>
<td><b> 0.626</b></td>
<td><b> 6.105</b></td>
<td><b>54.946</b></td>
<td><b>93.326</b></td>
<td><b>74.584</b></td>
<td><b> 0.751</b></td>
<td><b> 3.356</b></td>
<td><b>58.32 </b></td>
<td><b>93.858</b></td>
</tr>
</table>