arxiv:2503.15055

ELTEX: A Framework for Domain-Driven Synthetic Data Generation

Published on Mar 19 · Submitted by lavriz on Mar 20

Authors: Arina Razmyslovich, Kseniia Murasheva, Julien Capitaine

Abstract

We present ELTEX (Efficient LLM Token Extraction), a domain-driven framework for generating high-quality synthetic training data in specialized domains. While Large Language Models (LLMs) have shown impressive general capabilities, their performance in specialized domains like cybersecurity remains limited by the scarcity of domain-specific training data. ELTEX addresses this challenge by systematically integrating explicit domain indicator extraction with dynamic prompting to preserve critical domain knowledge throughout the generation process. We demonstrate ELTEX's effectiveness in the context of blockchain-related cyberattack detection, where we fine-tune Gemma-2B using various combinations of real and ELTEX-generated data. Our results show that the ELTEX-enhanced model achieves performance competitive with GPT-4 across both standard classification metrics and uncertainty calibration, while requiring significantly fewer computational resources. We release a curated synthetic dataset of social media texts for cyberattack detection in blockchain. Our work demonstrates that domain-driven synthetic data generation can effectively bridge the performance gap between resource-efficient models and larger architectures in specialized domains.
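The abstract describes combining explicit domain indicator extraction with dynamic prompting. As a rough illustration of that general idea only, the sketch below pulls indicator terms from seed messages and injects them into a generation prompt. It is not the authors' implementation: the vocabulary `INDICATOR_TERMS`, the functions `extract_indicators` and `build_prompt`, the keyword-matching approach, and the prompt wording are all assumptions made for this example.

```python
import random

# Hypothetical indicator vocabulary, invented for this sketch.
# The abstract does not list the indicators the framework actually uses.
INDICATOR_TERMS = {
    "attack_type": ["phishing", "rug pull", "flash loan exploit"],
    "asset": ["wallet", "bridge", "smart contract"],
}


def extract_indicators(text, vocabulary=INDICATOR_TERMS):
    """Return the vocabulary terms found in `text`, grouped by category.

    Simple case-insensitive substring matching stands in for whatever
    extraction method a real pipeline would use.
    """
    lowered = text.lower()
    found = {}
    for category, terms in vocabulary.items():
        hits = [term for term in terms if term in lowered]
        if hits:
            found[category] = hits
    return found


def build_prompt(seed_texts, rng, n_examples=2):
    """Assemble a generation prompt from sampled seeds and their indicators.

    The prompt changes with each sample of seed texts, which is the
    "dynamic" part of this sketch.
    """
    examples = rng.sample(seed_texts, k=min(n_examples, len(seed_texts)))

    # Merge indicators across the sampled examples, keeping first-seen order.
    merged = {}
    for text in examples:
        for category, hits in extract_indicators(text).items():
            bucket = merged.setdefault(category, [])
            for hit in hits:
                if hit not in bucket:
                    bucket.append(hit)

    indicator_lines = [
        f"- {category}: {', '.join(hits)}"
        for category, hits in sorted(merged.items())
    ]
    example_lines = [f"- {text}" for text in examples]
    return "\n".join(
        ["Write a new social media message that preserves these domain indicators:"]
        + indicator_lines
        + ["Example messages:"]
        + example_lines
    )


seeds = [
    "Phishing site stole my wallet keys",
    "New phishing scam targets bridge users",
]
prompt = build_prompt(seeds, random.Random(0))
print(prompt)
```

The resulting string would be passed to whichever generator model is in use. Listing the indicators explicitly in the prompt is one simple way to keep domain terms from being lost during generation.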

Community

lavriz (paper author, paper submitter) · about 20 hours ago · edited about 11 hours ago

Hey! We just published the paper and released the dataset of synthetic social media messages for early cyberattack detection on blockchain. Let's discuss!

Datasets citing this paper 1
