Model Card for DeTikZifyv2.5 (8b)

DeTikZifyv2.5 (8b) is a multimodal language model that automatically converts sketches and existing scientific figures into editable, semantics-preserving TikZ graphics programs. It builds on DeTikZifyv2 and is post-trained with reinforcement learning using self-computed rewards. This approach, which we call reinforcement learning from self-feedback (RLSF), allows the model to improve itself considerably without requiring external reward functions. Check out the DeTikZify project for more information and tips on how to best run the model.

Usage

from operator import itemgetter

from detikzify.model import load
from detikzify.infer import DetikzifyPipeline

image = "https://w.wiki/A7Cc"
pipeline = DetikzifyPipeline(*load(
    model_name_or_path="nllg/detikzify-v2.5-8b",
    device_map="auto",
    torch_dtype="bfloat16",
))

# generate a single TikZ program
fig = pipeline.sample(image=image)

# if it compiles, rasterize it and show it
if fig.is_rasterizable:
    fig.rasterize().show()

# run MCTS for 10 minutes and generate multiple TikZ programs
figs = set()
for score, fig in pipeline.simulate(image=image, timeout=600):
    figs.add((score, fig))

# save the best TikZ program
best = sorted(figs, key=itemgetter(0))[-1][1]
best.save("fig.tex")

Reinforcement Learning from Self-Feedback

Background

DeTikZify employs an iterative inference algorithm based on Monte Carlo Tree Search (MCTS), enabling it to continuously refine its outputs without additional training. The reward scores required by MCTS are computed entirely with DeTikZify's own vision encoder, which visually assesses the similarity between the input figure and the rendered output of each generated program. External reward models are not only unnecessary, they often correlate worse with human judgments, since the vision encoder was fine-tuned end-to-end with the rest of the model and is therefore specialized for exactly this evaluation task. We refer readers to the DeTikZify and TikZero papers for further details.
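
As an illustration of this kind of reward, the minimal sketch below computes the cosine similarity between pooled vision-encoder embeddings of the input figure and the rendered output. It assumes a Hugging Face-style vision encoder and image processor; the names are placeholders, not the DeTikZify API.

import torch
import torch.nn.functional as F

def self_similarity_reward(vision_encoder, image_processor, input_figure, rendered_output):
    # Embed both images with the same (frozen) vision encoder ...
    inputs = image_processor(images=[input_figure, rendered_output], return_tensors="pt")
    with torch.no_grad():
        features = vision_encoder(**inputs).pooler_output  # shape: (2, hidden_dim)
    features = F.normalize(features, dim=-1)
    # ... and use their cosine similarity as the reward (higher = more similar)
    return features[0].dot(features[1]).item()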

These self-computed rewards have proven effective at enhancing model outputs during inference. With reinforcement learning algorithms like Group Relative Policy Optimization (GRPO), the same reward signal can also be used to let the model improve itself in a dedicated post-training step, i.e., reinforcement learning from self-feedback (RLSF).
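
To make the connection to GRPO concrete, here is a minimal sketch (an illustration, not the actual training code) of how group-relative advantages can be derived from such self-computed rewards:

import torch

def group_relative_advantages(rewards):
    # Normalize rewards within a group of outputs sampled for the same image,
    # so no learned value model or external reward model is needed.
    rewards = torch.tensor(rewards, dtype=torch.float32)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# e.g., similarity rewards for a group of programs sampled from one figure
print(group_relative_advantages([0.81, 0.74, 0.92, 0.55]))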

Model Training

Our post-training setup only requires figures and, unlike supervised fine-tuning, no aligned code, which grants us more flexibility in selecting training data. 50% of the training data comes from the subset of DaTikZv3 that was filtered out during the training of DeTikZifyv2. The remaining 50% is sampled from the SPIQA dataset, which provides category labels for figures extracted from arXiv papers. We exclude all figures from papers included in DaTikZv3 and sample this split so that 60% of the figures are labeled as schematics, 20% as plots, and 20% come from other categories. Since these figures were not necessarily created with TikZ, they may help improve the model's generalization capabilities. As with DeTikZifyv2, input figures are randomly converted into synthetic sketches using image transformations and UltraSketch.
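
A minimal sketch of this sampling scheme follows; the dataset objects and category labels below are placeholders, not the actual data pipeline.

import random

def sample_training_figure(datikz_filtered_out, spiqa_by_category):
    # 50%: figures from the DaTikZv3 subset filtered out during DeTikZifyv2 training
    if random.random() < 0.5:
        return random.choice(datikz_filtered_out)
    # 50%: SPIQA figures, re-weighted to 60% schematics, 20% plots, 20% other
    category = random.choices(["schematic", "plot", "other"], weights=[0.6, 0.2, 0.2])[0]
    return random.choice(spiqa_by_category[category])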

Using this dataset, we post-train DeTikZifyv2 with RLSF, employing a batch size of 16 images. For each image, 32 outputs are generated, so the model is trained on 512 outputs per step. We train for a total of 500 steps, which takes 5 days on eight Nvidia H200 GPUs. We keep the vision encoder frozen to mitigate reward hacking.
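
The reported hyperparameters can be summarized as follows (names are illustrative, not taken from the actual training script):

from dataclasses import dataclass

@dataclass
class RLSFConfig:
    batch_size: int = 16                 # images per optimization step
    rollouts_per_image: int = 32         # sampled TikZ programs per image
    total_steps: int = 500               # ~5 days on eight Nvidia H200 GPUs
    freeze_vision_encoder: bool = True   # keeps the reward signal fixed to mitigate reward hacking

config = RLSFConfig()
print(config.batch_size * config.rollouts_per_image)  # 512 scored outputs per step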

Experiments and Results

We evaluate DeTikZifyv2.5 (8b) on the test split of DaTikZv3 and compare it to DeTikZifyv2 (8b). The metrics employed include DreamSim (DSim), Kernel Inception Distance (KID), CrystalBLEU (cBLEU), TeX Edit Distance (TED), Mean Token Efficiency (MTE), and Mean Sampling Throughput (MST). Refer to the DeTikZify paper for further details. All scores except MST are multiplied by 100.

Sampling-based Inference

Reference Figures
Model                DSim↑    KID↓   cBLEU↑  TED↓    MTE↑
DeTikZifyv2 (8b)     80.503   0.626  6.105   54.946  93.326
DeTikZifyv2.5 (8b)   84.6438  0.298  4.202   52.939  100

Synthetic Sketches
Model                DSim↑    KID↓   cBLEU↑  TED↓    MTE↑
DeTikZifyv2 (8b)     74.584   0.751  3.356   58.32   93.858
DeTikZifyv2.5 (8b)   78.257   0.577  1.551   56.121  100

In sampling-based inference (i.e., accepting the first output that compiles successfully) with reference figures and synthetic sketches as input, DeTikZifyv2.5 (8b) outperforms DeTikZifyv2 (8b) on most metrics, demonstrating that RLSF effectively enhances performance. The considerably higher DreamSim scores indicate that DeTikZifyv2.5 (8b) generates outputs that are much more visually similar to the reference figures. It is also much less likely to produce outputs that do not compile, as evidenced by its perfect MTE score. Interestingly, while it scores lower on the code-based metric CrystalBLEU, it performs better on the likewise code-based TED. DeTikZifyv2.5 (8b) tends to generate more concise programs with less syntactic noise; this likely reduces the n-gram overlap with the reference code but also decreases the number of edits needed to convert one program into the other, which explains the discrepancy. Generally, more concise programs are beneficial as long as the semantics are preserved.
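
For reference, sampling-based inference can be approximated with the pipeline from the Usage section; this is a simplified sketch of the protocol, not the exact evaluation harness.

def sample_until_compilable(pipeline, image, max_attempts=10):
    # Accept the first generated program that compiles (and thus rasterizes)
    for _ in range(max_attempts):
        fig = pipeline.sample(image=image)
        if fig.is_rasterizable:
            return fig
    return None  # no compilable program within the attempt budget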

MCTS-based Inference

Reference Figures
Model                DSim↑   KID↓    cBLEU↑  TED↓    MST↑
DeTikZifyv2 (8b)     89.020  0.016   6.593   52.466  52.723
DeTikZifyv2.5 (8b)   90.889  -0.047  4.646   51.824  68.12

Synthetic Sketches
Model                DSim↑   KID↓   cBLEU↑  TED↓    MST↑
DeTikZifyv2 (8b)     81.482  0.313  3.344   56.405  53.586
DeTikZifyv2.5 (8b)   83.74   0.61   1.976   55.239  78.908

We observe similar trends with our MCTS-based inference algorithm and a time budget of 10 minutes. Compared to sampling-based inference, DeTikZifyv2.5 (8b) noticeably improves its scores, illustrating that MCTS on top of RLSF still yields additional gains. Additionally, within the same timeframe, DeTikZifyv2.5 (8b) generates 25 more outputs than DeTikZifyv2 (8b), supporting our hypothesis that the generated programs are more concise. On reference figures, DeTikZifyv2.5 (8b) scores better on both DreamSim and KID, with the KID score even turning slightly negative, which KID's unbiased estimator permits when the compared distributions are highly similar. For synthetic sketches, it achieves a higher DreamSim score but performs worse on KID, indicating that it prioritizes faithfulness to the individual reference figure over matching distribution-level statistics.

Inference with TikZero Adapters

Captions
Model DSim↑ KID↓ CLIP↑ cBLEU↑ TED↓ MTE↑
DeTikZifyv2 (8b) 52.829 5.103 10.051 1.603 65.51 82.291
DeTikZifyv2.5 (8b) 53.564 7.471 7.968 0.732 62.189 100

TikZero adapters integrate into the vision encoder of DeTikZify models, enabling them to be conditioned on text in addition to images. Since we keep the vision encoder frozen during RLSF, adapters trained for DeTikZifyv2 (8b) remain compatible, and we can evaluate DeTikZifyv2.5 (8b) with them. Compared to our previous experiments, the results are more mixed. While DeTikZifyv2.5 (8b) achieves a better DreamSim value and maintains a perfect MTE, it performs worse on CLIPScore, suggesting difficulties in reproducing text from captions. This could be due to an increased modality gap, as RLSF further specializes the model for image-only inputs. We plan to address this in future work by incorporating caption inputs into RLSF training.
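
Conceptually, such an adapter maps caption embeddings into the space of the frozen vision encoder's patch features. The sketch below only illustrates this idea; layer sizes, dimensions, and wiring are assumptions, not TikZero's actual implementation.

import torch
import torch.nn as nn

class TextToVisionAdapter(nn.Module):
    def __init__(self, text_dim=768, vision_dim=1152, num_patches=729):
        super().__init__()
        # Learned query tokens shaped like the vision encoder's patch features
        self.queries = nn.Parameter(torch.randn(num_patches, vision_dim))
        self.text_proj = nn.Linear(text_dim, vision_dim)
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)

    def forward(self, text_embeds):  # (batch, seq_len, text_dim)
        keys_values = self.text_proj(text_embeds)
        queries = self.queries.unsqueeze(0).expand(text_embeds.size(0), -1, -1)
        # Cross-attend from the queries to the caption, yielding image-like features
        features, _ = self.cross_attn(queries, keys_values, keys_values)
        return features  # (batch, num_patches, vision_dim)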

Summary

Overall, RLSF greatly enhances model performance for most tasks. For image and sketch inputs, DeTikZifyv2.5 (8b) emerges as the clear leader. For text inputs via TikZero adapters, the choice between model versions depends on the specific use case, given the trade-offs involved.

Acknowledgments

This model was trained using computational resources provided by the bwForCluster Helix, as part of the bwHPC-S5 project. The authors acknowledge support from the state of Baden-Württemberg through the bwHPC initiative and the German Research Foundation (DFG) under grant INST 35/1597-1 FUGG. This project was inspired by the paper Rendering-Aware Reinforcement Learning for Vector Graphics Generation.
