---
library_name: transformers
tags: []
---

# Model Card for DeTi*k*Zify<sub>v2</sub> (8b)

DeTi*k*Zify<sub>v2</sub> (8b) is a multimodal language model that automatically
converts sketches and existing scientific figures into editable,
semantics-preserving Ti*k*Z graphics programs. It is based on
[LLaMA<sub>3.1</sub> (8b)](https://huggingface.co/meta-llama/Llama-3.1-8B) and
the SigLIP vision encoder of [PaliGemma<sub>Mix-448</sub>
(3b)](https://huggingface.co/google/paligemma-3b-mix-448). Check out the
[DeTi*k*Zify](https://github.com/potamides/DeTikZify) project for more
information and tips on how to best run the model.

## Usage

```python
from operator import itemgetter

from detikzify.model import load
from detikzify.infer import DetikzifyPipeline

image = "https://w.wiki/A7Cc"
pipeline = DetikzifyPipeline(*load(
    model_name_or_path="nllg/detikzify-v2-8b",
    device_map="auto",
    torch_dtype="bfloat16",
))

# generate a single TikZ program
fig = pipeline.sample(image=image)

# if it compiles, rasterize it and show it
if fig.is_rasterizable:
    fig.rasterize().show()

# run MCTS for 10 minutes and generate multiple TikZ programs
figs = set()
for score, fig in pipeline.simulate(image=image, timeout=600):
    figs.add((score, fig))

# save the best TikZ program
best = sorted(figs, key=itemgetter(0))[-1][1]
best.save("fig.tex")
```

## Changes from DeTi*k*Zify<sub>v1</sub>

We document all changes between DeTi*k*Zify<sub>v1</sub> and
DeTi*k*Zify<sub>v2</sub> in our paper, "[TikZero: Zero-Shot Text-Guided
Graphics Program Synthesis](https://arxiv.org/abs/2503.11509)". For
convenience, they are also listed below.

### Architecture

Similar to DeTi*k*Zify<sub>v1</sub>, DeTi*k*Zify<sub>v2</sub> uses a SigLIP
vision encoder. However, inspired by the continued ViT pretraining of
[InternVL](https://arxiv.org/abs/2404.16821), we initialize the weights with
the fine-tuned vision encoder of [PaliGemma<sub>Mix-448</sub>
(3b)](https://arxiv.org/abs/2407.07726) and increase DeTi*k*Zify's resolution
to 420x420 pixels. Further, the vision encoder is no longer kept frozen but
fully fine-tuned with the rest of the model.
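
As a rough illustration of this initialization, the sketch below pulls the
SigLIP vision tower out of PaliGemma<sub>Mix-448</sub> (3b) with the
`transformers` library and leaves all of its parameters trainable. This is a
hedged example of the idea, not the project's actual training code; the
`vision_tower` attribute refers to PaliGemma's SigLIP encoder in recent
`transformers` versions.

```python
import torch
from transformers import PaliGemmaForConditionalGeneration

# Load PaliGemma and take its fine-tuned SigLIP vision tower, which is
# what DeTikZify v2 uses to initialize its own vision encoder.
paligemma = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-mix-448",
    torch_dtype=torch.bfloat16,
)
vision_encoder = paligemma.vision_tower  # a SiglipVisionModel

# Instead of freezing the encoder, keep every parameter trainable so it
# can be fully fine-tuned together with the rest of the model.
for param in vision_encoder.parameters():
    param.requires_grad = True

print(sum(p.numel() for p in vision_encoder.parameters() if p.requires_grad))
```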

### Training Data

For pretraining the modality connector, we switch from MetaFig to the much
larger [ArXivCap](https://huggingface.co/datasets/MMInstruction/ArxivCap)
dataset and extract 1 million (figure, caption, OCR) tuples. For fine-tuning,
we create the new
[DaTi*k*Z<sub>v3</sub>](https://huggingface.co/datasets/nllg/datikz-v3)
dataset with over 450k Ti*k*Z drawings.
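
Both datasets are hosted on the Hugging Face Hub, so a minimal sketch for
inspecting DaTi*k*Z<sub>v3</sub> with the `datasets` library could look as
follows (the split name and column layout are assumptions; check the dataset
card for the actual schema):

```python
from datasets import load_dataset

# Stream the DaTikZ-v3 training split without downloading it all at once.
datikz = load_dataset("nllg/datikz-v3", split="train", streaming=True)

# Inspect the fields of a single example (column names may differ; the
# dataset card is authoritative).
example = next(iter(datikz))
print(example.keys())
```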

We also train a new model called
[UltraSketch](https://huggingface.co/nllg/ultrasketch) to generate synthetic
sketches during training. It is based on
[UltraEdit](https://arxiv.org/abs/2407.05282) and achieves a congruence
coefficient (CC) of 0.74. Additionally, we generate synthetic sketches using
image transformations. While these sketches are less diverse, they are better
at preserving text rendering, achieving a similar CC of 0.75. When we average
the sketch representations produced by both methods, the resulting CC
increases to 0.82, indicating that the methods are orthogonal and complement
each other effectively.
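
For reference, the congruence coefficient between two feature vectors is
their uncentered correlation. The toy sketch below (with random stand-in
embeddings, not real sketch representations) shows the formula and why
averaging two independent noisy views of the same target tends to raise the
CC:

```python
import numpy as np

def congruence_coefficient(x: np.ndarray, y: np.ndarray) -> float:
    """Tucker's congruence coefficient: the uncentered correlation of x and y."""
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

rng = np.random.default_rng(0)
human = rng.normal(size=512)                # stand-in for a human sketch representation
ultrasketch = human + rng.normal(size=512)  # noisy view 1 (e.g., UltraSketch)
transformed = human + rng.normal(size=512)  # noisy view 2 (e.g., image transformations)

print(congruence_coefficient(ultrasketch, human))  # ~0.7
print(congruence_coefficient(transformed, human))  # ~0.7
# Averaging the two views cancels their independent noise, so the combined
# representation is more congruent with the target.
print(congruence_coefficient((ultrasketch + transformed) / 2, human))  # ~0.8
```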

### Training & Inference

We observe improved performance by extending the training to 5 epochs and
increasing the learning rate to 5e-5. Fully fine-tuning the vision encoder
means that we can no longer compute SelfSim as the cosine similarity between
pooled outputs during inference, as the pooling head is not fine-tuned.
However, computing SelfSim as the Earth Mover's Distance between the
fine-tuned patch embeddings instead actually enhances the correlation with
human judgments (0.456 segment-level and 0.911 system-level correlation).
This means that DeTi*k*Zify<sub>v2</sub> also works well with our MCTS-based
inference algorithm.
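
A minimal sketch of such an EMD-style self-similarity is shown below. It
assumes two images yield equal-sized sets of patch embeddings (e.g., 30x30 =
900 patches for 420x420 inputs with a 14x14 patch size), in which case the
Earth Mover's Distance with uniform weights reduces to an optimal one-to-one
matching; the cosine cost matrix, the matching solver, and the final
rescaling are illustrative choices, not the project's exact implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def selfsim(patches_a: np.ndarray, patches_b: np.ndarray) -> float:
    """EMD-based similarity between two equal-sized sets of patch embeddings."""
    a = patches_a / np.linalg.norm(patches_a, axis=1, keepdims=True)
    b = patches_b / np.linalg.norm(patches_b, axis=1, keepdims=True)
    cost = 1 - a @ b.T  # pairwise cosine distances between patches
    rows, cols = linear_sum_assignment(cost)  # optimal transport with uniform weights
    emd = cost[rows, cols].mean()  # minimal average transport cost
    return 1 - emd  # convert the distance into a similarity score

rng = np.random.default_rng(0)
patches = rng.normal(size=(900, 64))
print(selfsim(patches, patches))                     # identical inputs -> 1.0
print(selfsim(patches, rng.normal(size=(900, 64))))  # unrelated inputs -> lower
```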

## Evaluation

Here is how DeTi*k*Zify<sub>v2</sub> (8b) compares to
[DeTi<i>k</i>Zify<sub>DS</sub> (7b)](https://huggingface.co/nllg/detikzify-ds-7b),
previously the best-performing DeTi*k*Zify model, as evaluated on the test
split of DaTi*k*Z<sub>v3</sub>. Scores are multiplied by 100.

<table>
  <tr>
    <th></th>
    <th colspan="5">Reference Figures</th>
    <th colspan="5">Synthetic Sketches</th>
  </tr>
  <tr>
    <th>Model</th>
    <th>DSim<sub>↑</sub></th>
    <th>KID<sub>↓</sub></th>
    <th>cBLEU<sub>↑</sub></th>
    <th>TED<sub>↓</sub></th>
    <th>MTE<sub>↑</sub></th>
    <th>DSim<sub>↑</sub></th>
    <th>KID<sub>↓</sub></th>
    <th>cBLEU<sub>↑</sub></th>
    <th>TED<sub>↓</sub></th>
    <th>MTE<sub>↑</sub></th>
  </tr>
  <tr>
    <td>DeTi<i>k</i>Zify<sub>DS</sub> (7b)</td>
    <td>75.46</td>
    <td>0.842</td>
    <td>2.953</td>
    <td>56.851</td>
    <td>84.019</td>
    <td>67.379</td>
    <td>0.766</td>
    <td>1.541</td>
    <td>59.589</td>
    <td>84.401</td>
  </tr>
  <tr>
    <td>DeTi<i>k</i>Zify<sub>v2</sub> (8b)</td>
    <td><b>80.503</b></td>
    <td><b>0.626</b></td>
    <td><b>6.105</b></td>
    <td><b>54.946</b></td>
    <td><b>93.326</b></td>
    <td><b>74.584</b></td>
    <td><b>0.751</b></td>
    <td><b>3.356</b></td>
    <td><b>58.32</b></td>
    <td><b>93.858</b></td>
  </tr>
</table>