---
library_name: transformers
tags: []
---

# Model Card for DeTi*k*Zify<sub>v2</sub> (8b)
DeTi*k*Zify<sub>v2</sub> (8b) is a multimodal language model that automatically
converts sketches and existing scientific figures into editable,
semantics-preserving Ti*k*Z graphics programs. It is based on
[LLaMA<sub>3.1</sub> (8b)](https://huggingface.co/meta-llama/Llama-3.1-8B) and
the SigLIP vision encoder of [PaliGemma<sub>Mix-448</sub>
(3b)](https://huggingface.co/google/paligemma-3b-mix-448). Check out the
[DeTi*k*Zify](https://github.com/potamides/DeTikZify) project for more
information and tips on how to best run the model.

## Usage
```python
from operator import itemgetter

from detikzify.model import load
from detikzify.infer import DetikzifyPipeline

image = "https://w.wiki/A7Cc"
pipeline = DetikzifyPipeline(*load(
    model_name_or_path="nllg/detikzify-v2-8b",
    device_map="auto",
    torch_dtype="bfloat16",
))

# generate a single TikZ program
fig = pipeline.sample(image=image)

# if it compiles, rasterize it and show it
if fig.is_rasterizable:
    fig.rasterize().show()

# run MCTS for 10 minutes and generate multiple TikZ programs
figs = set()
for score, fig in pipeline.simulate(image=image, timeout=600):
    figs.add((score, fig))

# save the best TikZ program
best = sorted(figs, key=itemgetter(0))[-1][1]
best.save("fig.tex")
```

## Changes from DeTi*k*Zify<sub>v1</sub>
We document all changes between DeTi*k*Zify<sub>v1</sub> and
DeTi*k*Zify<sub>v2</sub> in our paper, "[TikZero: Zero-Shot Text-Guided Graphics
Program Synthesis](https://arxiv.org/abs/2503.11509)". For convenience, they
are also listed below.

### Architecture
Similar to DeTi*k*Zify<sub>v1</sub>, DeTi*k*Zify<sub>v2</sub> uses a SigLIP
vision encoder. However, inspired by the continued ViT pretraining of
[InternVL](https://arxiv.org/abs/2404.16821), we initialize the weights with
the fine-tuned vision encoder of [PaliGemma<sub>Mix-448</sub>
(3b)](https://arxiv.org/abs/2407.07726) and increase DeTi*k*Zify's
resolution to 420x420 pixels. Further, the vision encoder is no longer kept
frozen but fully fine-tuned with the rest of the model.
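
As an illustrative sketch (not the project's training code), this
initialization can be written against the Hugging Face transformers API; the
`vision_tower` attribute name follows recent library versions, and the
position-embedding interpolation required by the 448-to-420 resolution change
is omitted for brevity:

```python
# Sketch only: reuse PaliGemma's fine-tuned SigLIP vision tower and keep it
# trainable. Interpolating position embeddings for 420x420 inputs (PaliGemma
# was trained at 448x448) is an additional step not shown here.
from transformers import PaliGemmaForConditionalGeneration

paligemma = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-mix-448",
    torch_dtype="bfloat16",
)
vision_encoder = paligemma.vision_tower  # the SigLIP ViT

# Unlike DeTikZify v1, the encoder is not kept frozen during fine-tuning.
for param in vision_encoder.parameters():
    param.requires_grad = True
```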

### Training Data
For pretraining the modality connector, we switch from MetaFig to the much
larger [ArXivCap](https://huggingface.co/datasets/MMInstruction/ArxivCap)
dataset, from which we extract 1 million (figure, caption, OCR) tuples. For
fine-tuning, we create the new
[DaTi*k*Z<sub>v3</sub>](https://huggingface.co/datasets/nllg/datikz-v3) dataset
with over 450k Ti*k*Z drawings.

We also train a new model called
[UltraSketch](https://huggingface.co/nllg/ultrasketch) to generate synthetic
sketches during training. It is based on
[UltraEdit](https://arxiv.org/abs/2407.05282) and achieves a congruence
coefficient (CC) of 0.74. Additionally, we generate synthetic sketches using
image transformation. While these sketches are less diverse, they are better at
preserving text rendering, achieving a similar CC of 0.75. When we average the
sketch representations produced by both methods, the resulting CC increases to
0.82, indicating that the methods are orthogonal and complement each other
effectively.
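
For reference, the congruence coefficient in Tucker's general form is an
uncentered cosine between two (flattened) representations. The sketch below
shows just that formula plus the averaging step; how the sketch
representations themselves are computed is specific to the paper, and
`repr_diffusion`/`repr_transform` are hypothetical placeholders:

```python
# Minimal sketch of Tucker's congruence coefficient (an uncentered cosine).
import numpy as np

def congruence_coefficient(x: np.ndarray, y: np.ndarray) -> float:
    x, y = x.ravel(), y.ravel()
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

# Hypothetical placeholders for the representations produced by the
# diffusion-based (UltraSketch) and transformation-based methods:
# combined = (repr_diffusion + repr_transform) / 2
# cc = congruence_coefficient(combined, reference_representation)
```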

### Training & Inference
We observe improved performance by extending training to 5 epochs and
increasing the learning rate to 5e-5. Fully fine-tuning the vision encoder
means that we can no longer compute SelfSim as the cosine similarity between
pooled outputs during inference, as the pooling head is not fine-tuned.
However, computing the Earth Mover's Distance on the fine-tuned patch
embeddings instead actually enhances the correlation with human judgments
(0.456 segment-level and 0.911 system-level correlation). This means that
DeTi*k*Zify<sub>v2</sub> also works well with our MCTS-based inference
algorithm.
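
As an illustrative sketch (not the project's exact implementation), an Earth
Mover's Distance over patch embeddings can be computed with SciPy: with equal
patch counts and uniform weights, the EMD reduces to a minimal-cost one-to-one
matching, solved here with the Hungarian algorithm. The patch count and
embedding width in the comments are assumptions based on a SigLIP ViT at
420x420 input resolution:

```python
# Sketch of a SelfSim-style score: EMD between two images' patch embeddings.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def patch_emd(a: np.ndarray, b: np.ndarray) -> float:
    """EMD between two (num_patches, hidden_dim) embedding matrices."""
    cost = cdist(a, b, metric="cosine")       # pairwise transport costs
    rows, cols = linear_sum_assignment(cost)  # minimal-cost matching
    return float(cost[rows, cols].mean())     # average transport cost

# E.g., 30x30 = 900 patches for a 420x420 input with 14x14 patches:
a = np.random.rand(900, 1152)  # embeddings of the reference figure
b = np.random.rand(900, 1152)  # embeddings of the generated figure
similarity = 1.0 - patch_emd(a, b)  # lower distance = higher similarity
```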

## Evaluation
Here is how DeTi*k*Zify<sub>v2</sub> (8b) compares to
[DeTi<i>k</i>Zify<sub>DS</sub>
(7b)](https://huggingface.co/nllg/detikzify-ds-7b), previously the best
performing DeTi*k*Zify model, as evaluated on the test split of
DaTi*k*Z<sub>v3</sub>. Scores are multiplied by 100.

<table>
  <tr>
    <th></th>
    <th colspan="5">Reference Figures</th>
    <th colspan="5">Synthetic Sketches</th>
  </tr>
  <tr>
    <th>Model</th>
    <th>DSim<sub>&uarr;</sub></th>
    <th>KID<sub>&darr;</sub></th>
    <th>cBLEU<sub>&uarr;</sub></th>
    <th>TED<sub>&darr;</sub></th>
    <th>MTE<sub>&uarr;</sub></th>
    <th>DSim<sub>&uarr;</sub></th>
    <th>KID<sub>&darr;</sub></th>
    <th>cBLEU<sub>&uarr;</sub></th>
    <th>TED<sub>&darr;</sub></th>
    <th>MTE<sub>&uarr;</sub></th>
  </tr>
  <tr>
    <td>DeTi<i>k</i>Zify<sub>DS</sub> (7b)</td>
    <td>75.46 </td>
    <td> 0.842</td>
    <td> 2.953</td>
    <td>56.851</td>
    <td>84.019</td>
    <td>67.379</td>
    <td> 0.766</td>
    <td> 1.541</td>
    <td>59.589</td>
    <td>84.401</td>
  </tr>
  <tr>
    <td>DeTi<i>k</i>Zify<sub>v2</sub> (8b)</td>
    <td><b>80.503</b></td>
    <td><b> 0.626</b></td>
    <td><b> 6.105</b></td>
    <td><b>54.946</b></td>
    <td><b>93.326</b></td>
    <td><b>74.584</b></td>
    <td><b> 0.751</b></td>
    <td><b> 3.356</b></td>
    <td><b>58.32 </b></td>
    <td><b>93.858</b></td>
  </tr>
</table>