Model Card for DeTikZifyv2.5 (8b)

DeTikZifyv2.5 (8b) is a multimodal language model that automatically converts sketches and existing scientific figures into editable, semantics-preserving TikZ graphics programs. It builds on DeTikZifyv2 and is post-trained with reinforcement learning using self-computed rewards. This approach, which we call reinforcement learning from self-feedback (RLSF), allows the model to improve itself considerably without requiring external reward functions. Check out the DeTikZify project for more information and tips on how to best run the model.

Usage

from operator import itemgetter

from detikzify.model import load
from detikzify.infer import DetikzifyPipeline

image = "https://w.wiki/A7Cc"
pipeline = DetikzifyPipeline(*load(
    model_name_or_path="nllg/detikzify-v2.5-8b",
    device_map="auto",
    torch_dtype="bfloat16",
))

# generate a single TikZ program
fig = pipeline.sample(image=image)

# if it compiles, rasterize it and show it
if fig.is_rasterizable:
    fig.rasterize().show()

# run MCTS for 10 minutes and generate multiple TikZ programs
figs = set()
for score, fig in pipeline.simulate(image=image, timeout=600):
    figs.add((score, fig))

# save the best TikZ program
best = sorted(figs, key=itemgetter(0))[-1][1]
best.save("fig.tex")

Reinforcement Learning from Self-Feedback

Background

DeTikZify employs an iterative inference algorithm based on Monte Carlo Tree Search (MCTS), enabling it to continuously refine its outputs without additional training. The reward scores required by MCTS are computed entirely with DeTikZify's own vision encoder, which visually assesses the similarity between the input figure and the rendered output of each generated program. External reward models are not only unnecessary, they often correlate worse with human judgments, since the vision encoder was fine-tuned end-to-end with the rest of the model and is therefore specialized for exactly this evaluation task. We refer readers to the DeTikZify and TikZero papers for further details.
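
As an illustration of this kind of reward, the minimal sketch below computes the cosine similarity between pooled vision-encoder embeddings of the input figure and the rendered output. It assumes a Hugging Face-style vision encoder and image processor; the names are placeholders, not the DeTikZify API.

import torch
import torch.nn.functional as F

def self_similarity_reward(vision_encoder, image_processor, input_figure, rendered_output):
    # Embed both images with the same (frozen) vision encoder ...
    inputs = image_processor(images=[input_figure, rendered_output], return_tensors="pt")
    with torch.no_grad():
        features = vision_encoder(**inputs).pooler_output  # shape: (2, hidden_dim)
    features = F.normalize(features, dim=-1)
    # ... and use their cosine similarity as the reward (higher = more similar)
    return features[0].dot(features[1]).item()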

These self-computed rewards have proven effective at enhancing model outputs during inference. With reinforcement learning algorithms like Group Relative Policy Optimization (GRPO), the same reward signal can also be used to let the model improve itself in a dedicated post-training step, i.e., reinforcement learning from self-feedback (RLSF).
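
To make the connection to GRPO concrete, here is a minimal sketch (an illustration, not the actual training code) of how group-relative advantages can be derived from such self-computed rewards:

import torch

def group_relative_advantages(rewards):
    # Normalize rewards within a group of outputs sampled for the same image,
    # so no learned value model or external reward model is needed.
    rewards = torch.tensor(rewards, dtype=torch.float32)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# e.g., similarity rewards for a group of programs sampled from one figure
print(group_relative_advantages([0.81, 0.74, 0.92, 0.55]))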

Model Training

Our post-training setup only requires figures and, unlike supervised fine-tuning, no aligned code, which grants us more flexibility in selecting training data. 50% of the training data comes from the subset of DaTikZv3 that was filtered out during the training of DeTikZifyv2. The remaining 50% is sampled from the SPIQA dataset, which provides category labels for figures extracted from arXiv papers. We exclude all figures from papers included in DaTikZv3 and sample this split so that 60% of the figures are labeled as schematics, 20% as plots, and 20% come from other categories. Since these figures were not necessarily created with TikZ, they may help improve the model's generalization capabilities. As with DeTikZifyv2, input figures are randomly converted into synthetic sketches using image transformations and UltraSketch.
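
A minimal sketch of this sampling scheme follows; the dataset objects and category labels below are placeholders, not the actual data pipeline.

import random

def sample_training_figure(datikz_filtered_out, spiqa_by_category):
    # 50%: figures from the DaTikZv3 subset filtered out during DeTikZifyv2 training
    if random.random() < 0.5:
        return random.choice(datikz_filtered_out)
    # 50%: SPIQA figures, re-weighted to 60% schematics, 20% plots, 20% other
    category = random.choices(["schematic", "plot", "other"], weights=[0.6, 0.2, 0.2])[0]
    return random.choice(spiqa_by_category[category])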

Using this dataset, we post-train DeTikZifyv2 with RLSF, employing a batch size of 16 images. For each image, 32 outputs are generated, so the model is trained on 512 outputs per step. We train for a total of 500 steps, which takes 5 days on eight Nvidia H200 GPUs. We keep the vision encoder frozen to mitigate reward hacking.
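
The reported hyperparameters can be summarized as follows (names are illustrative, not taken from the actual training script):

from dataclasses import dataclass

@dataclass
class RLSFConfig:
    batch_size: int = 16                 # images per optimization step
    rollouts_per_image: int = 32         # sampled TikZ programs per image
    total_steps: int = 500               # ~5 days on eight Nvidia H200 GPUs
    freeze_vision_encoder: bool = True   # keeps the reward signal fixed to mitigate reward hacking

config = RLSFConfig()
print(config.batch_size * config.rollouts_per_image)  # 512 scored outputs per step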

Experiments and Results

We evaluate DeTikZifyv2.5 (8b) on the test split of DaTikZv3 and compare it to DeTikZifyv2 (8b). The metrics employed include DreamSim (DSim), Kernel Inception Distance (KID), CrystalBLEU (cBLEU), TeX Edit Distance (TED), Mean Token Efficiency (MTE), and Mean Sampling Throughput (MST). Refer to the DeTikZify paper for further details. All scores except MST are multiplied by 100.

Sampling-based Inference

Reference Figures
Model                DSim↑    KID↓   cBLEU↑  TED↓    MTE↑
DeTikZifyv2 (8b)     80.503   0.626  6.105   54.946  93.326
DeTikZifyv2.5 (8b)   84.6438  0.298  4.202   52.939  100

Synthetic Sketches
Model                DSim↑    KID↓   cBLEU↑  TED↓    MTE↑
DeTikZifyv2 (8b)     74.584   0.751  3.356   58.32   93.858
DeTikZifyv2.5 (8b)   78.257   0.577  1.551   56.121  100

In sampling-based inference (i.e., accepting the first output that compiles successfully) with reference figures and synthetic sketches as input, DeTikZifyv2.5 (8b) outperforms DeTikZifyv2 (8b) on most metrics, demonstrating that RLSF effectively enhances performance. The considerably higher DreamSim scores indicate that DeTikZifyv2.5 (8b) generates outputs that are much more visually similar to the reference figures. It is also much less likely to produce outputs that do not compile, as evidenced by its perfect MTE score. Interestingly, while it scores lower on the code-based metric CrystalBLEU, it performs better on the likewise code-based TED. DeTikZifyv2.5 (8b) tends to generate more concise programs with less syntactic noise; this likely reduces the n-gram overlap with the reference code but also decreases the number of edits needed to convert one program into the other, which explains the discrepancy. Generally, more concise programs are beneficial as long as the semantics are preserved.
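
For reference, sampling-based inference can be approximated with the pipeline from the Usage section; this is a simplified sketch of the protocol, not the exact evaluation harness.

def sample_until_compilable(pipeline, image, max_attempts=10):
    # Accept the first generated program that compiles (and thus rasterizes)
    for _ in range(max_attempts):
        fig = pipeline.sample(image=image)
        if fig.is_rasterizable:
            return fig
    return None  # no compilable program within the attempt budget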

MCTS-based Inference

Reference Figures
Model                DSim↑   KID↓    cBLEU↑  TED↓    MST↑
DeTikZifyv2 (8b)     89.020  0.016   6.593   52.466  52.723
DeTikZifyv2.5 (8b)   90.889  -0.047  4.646   51.824  68.12

Synthetic Sketches
Model                DSim↑   KID↓   cBLEU↑  TED↓    MST↑
DeTikZifyv2 (8b)     81.482  0.313  3.344   56.405  53.586
DeTikZifyv2.5 (8b)   83.74   0.61   1.976   55.239  78.908

We observe similar trends with our MCTS-based inference algorithm and a time budget of 10 minutes. Compared to sampling-based inference, DeTikZifyv2.5 (8b) noticeably improves its scores, illustrating that MCTS on top of RLSF still yields additional gains. Additionally, within the same timeframe, DeTikZifyv2.5 (8b) generates 25 more outputs than DeTikZifyv2 (8b), supporting our hypothesis that the generated programs are more concise. On reference figures, DeTikZifyv2.5 (8b) scores better on both DreamSim and KID, with the KID score even turning slightly negative, which KID's unbiased estimator permits when the compared distributions are highly similar. For synthetic sketches, it achieves a higher DreamSim score but performs worse on KID, indicating that it prioritizes faithfulness to the individual reference figure over matching distribution-level statistics.

Inference with TikZero Adapters

Captions
Model DSim↑ KID↓ CLIP↑ cBLEU↑ TED↓ MTE↑
DeTikZifyv2 (8b) 52.829 5.103 10.051 1.603 65.51 82.291
DeTikZifyv2.5 (8b) 53.564 7.471 7.968 0.732 62.189 100

TikZero adapters integrate into the vision encoder of DeTikZify models, enabling them to be conditioned on text in addition to images. Since we keep the vision encoder frozen during RLSF, adapters trained for DeTikZifyv2 (8b) remain compatible, and we can evaluate DeTikZifyv2.5 (8b) with them. Compared to our previous experiments, the results are more mixed. While DeTikZifyv2.5 (8b) achieves a better DreamSim value and maintains a perfect MTE, it performs worse on CLIPScore, suggesting difficulties in reproducing text from captions. This could be due to an increased modality gap, as RLSF further specializes the model for image-only inputs. We plan to address this in future work by incorporating caption inputs into RLSF training.
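
Conceptually, such an adapter maps caption embeddings into the space of the frozen vision encoder's patch features. The sketch below only illustrates this idea; layer sizes, dimensions, and wiring are assumptions, not TikZero's actual implementation.

import torch
import torch.nn as nn

class TextToVisionAdapter(nn.Module):
    def __init__(self, text_dim=768, vision_dim=1152, num_patches=729):
        super().__init__()
        # Learned query tokens shaped like the vision encoder's patch features
        self.queries = nn.Parameter(torch.randn(num_patches, vision_dim))
        self.text_proj = nn.Linear(text_dim, vision_dim)
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)

    def forward(self, text_embeds):  # (batch, seq_len, text_dim)
        keys_values = self.text_proj(text_embeds)
        queries = self.queries.unsqueeze(0).expand(text_embeds.size(0), -1, -1)
        # Cross-attend from the queries to the caption, yielding image-like features
        features, _ = self.cross_attn(queries, keys_values, keys_values)
        return features  # (batch, num_patches, vision_dim)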

Summary

Overall, RLSF greatly enhances model performance for most tasks. For image and sketch inputs, DeTikZifyv2.5 (8b) emerges as the clear leader. For text inputs via TikZero adapters, the choice between model versions depends on the specific use case, given the trade-offs involved.

Acknowledgments

This model was trained using computational resources provided by the bwForCluster Helix, as part of the bwHPC-S5 project. The authors acknowledge support from the state of Baden-Württemberg through the bwHPC initiative and the German Research Foundation (DFG) under grant INST 35/1597-1 FUGG. This project was inspired by the paper Rendering-Aware Reinforcement Learning for Vector Graphics Generation.
