Model Name: Llama Data Model Generator
Description: This fine-tuned LoRA layer is optimized for the specific task of generating structured data models in JSON format from a dump of tabular data files.
It requires only the headers from the tabular files, allowing usage in environments where access to the underlying data may require separate authorization. It was trained and evaluated against a diverse range of research data models; this initial release leans towards biomedical research data models, since the real models and projects available to us tended towards that domain. As a result, it is likely strongest on typical biomedical datasets covering research studies, patients, samples, genomics, imaging, and similar entities.
The creation of this model leveraged high-quality synthetic data, generated by a custom algorithm that takes real, expert-curated, production data models in use today as input. This algorithm addresses the challenge of assembling a robust, high-quality training foundation for AI-driven data model generation.
Intended Use: This model acts as an interim step in a broader AI-assisted data harmonization tool that is still under development. We are releasing this portion of the work separately in case its use in isolation (or in other tools and pipelines) can help advance innovative solutions.
Base Model: Meta's Open Source Llama 3.1 8B Instruct
Model Size: 8B parameters
LoRA Configuration:
- LoRA Rank: 8
- LoRA Alpha: 32
- Training Epochs: 1 epoch over 58,000 samples (Note that we used 3 epochs when evaluating models trained with only 5k samples)
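For reference, this configuration maps onto the Hugging Face PEFT library roughly as shown below. This is a minimal sketch: the rank and alpha come from this card, while the base model id, target modules, and dropout are illustrative assumptions rather than confirmed training settings.

```python
# Sketch of the LoRA setup with Hugging Face PEFT.
# r and lora_alpha come from this card; everything else below
# (base model id, target_modules, dropout) is an assumption.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=8,                                  # LoRA rank (from this card)
    lora_alpha=32,                        # LoRA alpha (from this card)
    lora_dropout=0.05,                    # assumption: not stated in this card
    target_modules=["q_proj", "v_proj"],  # assumption: a common choice for Llama
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # sanity check: adapter is a small fraction of 8B
```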
Training Data:
We will detail the training data generation algorithms and methods elsewhere; in summary, we used the following:
- Data Generation Strategy: We leveraged our experience in harmonizing novel datasets to existing biomedical data models to create synthetic data contributions that mimic expert-curated versions used across various real-world projects.
- Data Generation Methods: We used a bottom-up approach ("Method 2") to generate 10,000 synthetic data models from 27 real biomedical data models. Synthetic models were then used to generate hundreds of thousands of synthetic data contributions in "Dataset 3" that have realistic, novel node combinations.
- Dataset Size: We used 58,000 samples from Dataset 3 for training, half of which were intentionally diversified further. We also evaluated other sample sizes (5k, 20k, and 87k), none of which performed as well as 58k.
- Data Diversity: We used an LLM (llama3.2) to further diversify the training data in order to improve the model’s robustness and generalization capabilities.
Training Info:
- Base Models Evaluated: Llama-3-8B-Instruct, Mistral-7B-Instruct, DeepSeek-R1-Distill-Llama-8B.
- Training Data Quantities Tested: 5,000 - 87,000 samples.
- GPUs: Models were fine-tuned and evaluated on 2 NVIDIA A100s in an on-prem GPU cluster.
Inference Parameters:
- Decoding: Unstructured decoding performed better in our evaluations and reduces errors when combined with a low temperature.
- Temperature: 0.1. Adjustable, but lower values perform better with unstructured decoding.
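A minimal inference sketch using transformers and PEFT under the recommended settings is shown below; the base model id, dtype, and token budget are assumptions, and the prompt placeholder stands in for the Recommended Prompt that follows.

```python
# Minimal inference sketch: unstructured (unconstrained) decoding at
# the recommended low temperature. Base model id, dtype, and
# max_new_tokens are illustrative assumptions.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumption: matches the stated base model
adapter_id = "uc-ctds/llama-data-model-generator"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)

prompt = "..."  # build from the Recommended Prompt below

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,  # assumption: budget depends on schema size
    do_sample=True,
    temperature=0.1,      # recommended setting
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```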
Recommended Prompt:
A slight variation of this was used in training and evaluation:
"""
You are a data structuring expert tasked with analyzing data files (CSV, TXT, TSV, XML) to identify their schema and
generate output in a JSON Data Dictionary format. Review the data files for column names, data types,
and relationships, and if a data dictionary is provided, ensure strict alignment with its metadata.
Column names may have embedded information to infer the type and/or units from.
Follow these steps:
- Examine each data file to define the schema
- Cross-reference with the data dictionary, if available, to match all column definitions and metadata exactly
- Generate an output schema that mirrors the provided data structure WITHOUT adding any new entities or attributes
- Limit your output to the smallest amount possible of JSON to capture the necessary information. DO NOT BE VERBOSE
The output must include nodes, properties of those nodes, descriptions of those properties, and links to other nodes.
The output must format as ONLY JSON, do not include additional text and please be concise. Limit your output to only what's
necessary (nodes, properties, descriptions, relationships / links).
File name: `$file_name`
File contents:
$file_contents
Please generate the JSON Data Dictionary:
"""
We have done minimal evaluation of alternative prompts, which is an area for future development.
Note: we provided CSVs/TSVs containing only the file headers. We did not provide data dictionaries, but left the prompt as-is.
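A hypothetical helper for producing such header-only inputs (the data directory and glob pattern are assumptions, not part of any released tooling):

```python
# Hypothetical helper: keep only the header row of each tabular file,
# mirroring the header-only inputs used in training and evaluation.
from pathlib import Path

def headers_only(path: Path) -> str:
    """Return just the first (header) line of a CSV/TSV file."""
    with open(path, newline="") as f:
        return f.readline().strip()

for p in sorted(Path("data").glob("*.?sv")):  # assumption: files live in ./data
    print(p.name, "->", headers_only(p))
```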
Intended Use:
This LoRA model is intended for:
- De novo generation of structured data models based on a collection of tabular file headers.
- As an initial step in data harmonization and transformation tasks, to represent the original data model before requesting harmonization to a target.
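Downstream consumers may want a quick shape check on the generated model. The sketch below illustrates the expected structure (nodes with properties, descriptions, and links to other nodes); the key names are assumptions based on the prompt's output description, not a published schema.

```python
# Hypothetical shape check on generated output. The key names
# ("nodes", "properties", "description", "links") follow the prompt's
# output description but are assumptions, not a published schema.
import json

raw = (
    '{"nodes": {"sample": {"properties": {"sample_id": '
    '{"description": "Unique sample identifier"}}, "links": ["subject"]}}}'
)

data_model = json.loads(raw)  # fails fast if the output is not pure JSON
for name, node in data_model["nodes"].items():
    assert "properties" in node, f"node '{name}' is missing properties"
    print(name, "->", sorted(node["properties"]), "links:", node.get("links", []))
```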
Challenges & Considerations:
- Synthetic Model Generation: One core challenge was creating high-quality synthetic data models with diverse structures and contents. The model's performance hinges on the quality and complexity of these synthetic models.
- Evaluation of Model Effectiveness: We had to create a custom metrics and evaluation framework, along with a custom benchmark, to measure effectiveness for this specific problem at an appropriately granular level.
Built with Llama