SyReC-Mistral-7B-Reconstructor-v1
Model Description
This model is a specialized, fine-tuned version of mistralai/Mistral-7B-Instruct-v0.3. It has been trained explicitly for syntactic and semantic reconstruction: rebuilding a coherent, grammatically correct paragraph from a disordered "bag of words."
The model was fine-tuned on the SyReC (Syntactic Reconstruction Corpus), a dataset generated from English Wikipedia articles. This training process teaches the model to infer grammatical structure, logical flow, and narrative coherence from a fixed set of semantic tokens, forcing it to develop a deeper understanding of language structure.
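To make the task concrete, the sketch below shows roughly how a paragraph can be turned into a SyReC-style "bag of words" (lowercased, alphabetically sorted, comma-separated, matching the sample input further down). The exact preprocessing used to build the published dataset may differ; treat this as an illustration only.
import re

def scramble(paragraph: str) -> str:
    # Illustrative guess at the SyReC scrambling: lowercase, keep word tokens
    # (including apostrophes), sort alphabetically, join with ", ".
    words = re.findall(r"[a-z0-9]+(?:['’][a-z]+)?", paragraph.lower())
    return ", ".join(sorted(words))

print(scramble("Kepler eventually discovered that the orbit of Mars is an ellipse."))
# -> an, discovered, ellipse, eventually, is, kepler, mars, of, orbit, that, the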
The primary goal of this model is to serve as an expert tool for tasks requiring high-fidelity adherence to a provided context.
Intended Use
This model excels at tasks that require strict grounding in a source text and precise adherence to constraints.
- Primary Use Case: Solving the syntactic reconstruction task as defined by the SyReC benchmark.
- Downstream Applications:
  - High-Fidelity RAG (Retrieval-Augmented Generation): Answering questions based only on the provided context documents, with a reduced tendency to hallucinate or inject outside knowledge (see the prompt sketch after this list).
  - Fact-Based Summarization: Creating summaries that are more extractive and factually grounded in the source text.
  - Complex Instruction Following: Adhering to strict positive and negative constraints within a prompt (e.g., "use only these words," "do not mention X").
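As an illustration of the grounded-QA use case, the snippet below shows one way a retrieved context and a question could be packed into the chat format. The system wording and layout here are assumptions for illustration only, not part of the model's training setup.
# Hypothetical grounded-QA prompt; the system wording below is an assumption,
# not the prompt the model was trained with.
rag_messages = [
    {"role": "system", "content": "Answer using only the provided context. If the answer is not in the context, say so."},
    {"role": "user", "content": "Context:\nKepler discovered that the orbit of Mars is an ellipse.\n\nQuestion: What shape is the orbit of Mars?"},
]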
This model is not intended for general-purpose creative writing, as its training may have biased it towards more literal and structured outputs.
How to Use
You can use this model with the transformers library's text-generation pipeline. Make sure to format the input using the model's chat template.
from transformers import pipeline
import torch
# The model you fine-tuned
model_id = "ambrosfitz/SyReC-Mistral-7B-Reconstructor-v1"
# The system prompt the model was trained with
system_prompt = "You are an expert at syntactic reconstruction. Reconstruct the original, coherent paragraph using only the provided words. You must use every word exactly once."
# A sample scrambled paragraph (from the SyReC dataset)
scrambled_text = "a, a, a, a, a, allow, an, as, assumed, belonging, brahe’s, but, called, circle, circle, circles, closed, conic, consistent, curve, curves, data, did, discovered, doing, ellipse, ellipse, eventually, family, find, flattened, for, had, he, him, his, initially, is, is, its, kepler, kind, known, mars, next, not, object, observations, of, of, of, of, orbit, orbit, orbits, path, planet, planets, sections, shape, simplest, so, somewhat, space, that, that, the, the, the, the, the, the, through, to, to, were, were, with, with, working"
# Setup the pipeline
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
# Format the prompt using the chat template
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": scrambled_text},
]
# The tokenizer is loaded with the pipeline and applies the template
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Generate the output
outputs = pipe(prompt, max_new_tokens=256, do_sample=False)
# The pipeline returns a list with one dict; strip the prompt to keep only the reconstruction
reconstructed_text = outputs[0]["generated_text"][len(prompt):].strip()
print("--- Model Reconstruction ---")
print(reconstructed_text)
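If GPU memory is limited, the same checkpoint can also be loaded with 4-bit quantization and used with AutoModelForCausalLM directly. This is an optional alternative to the pipeline setup above, assuming bitsandbytes is installed.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Optional: 4-bit quantized loading for smaller GPUs (requires bitsandbytes)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

model_id = "ambrosfitz/SyReC-Mistral-7B-Reconstructor-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)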
Training Details
- Base Model: mistralai/Mistral-7B-Instruct-v0.3
- Dataset: ambrosfitz/SyReC
- Fine-tuning Method: Parameter-Efficient Fine-Tuning (PEFT) using LoRA, trained for 1 epoch (a representative LoRA configuration is sketched after this list).
- Framework: Hugging Face transformers, peft, and trl's SFTTrainer.
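The exact LoRA hyperparameters are not reported here; the configuration below is a representative sketch of a typical setup for this kind of run using the peft API, and the specific values (rank, alpha, dropout, target modules) are assumptions rather than the ones used for this checkpoint.
from peft import LoraConfig

# Representative LoRA configuration -- the actual training values are not published,
# so the numbers below are illustrative only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)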
Training Loss
The training loss showed a general downward trend with some step-to-step fluctuation, indicating that the model was learning the reconstruction task over the course of training.
| Step | Training Loss |
|---|---|
| 25 | 2.0018 |
| 150 | 1.5794 |
| 250 | 1.5323 |
| 350 | 1.4766 |
| 450 | 1.6083 |
| 550 | 1.5455 |
| 650 | 1.4824 |
| 750 | 1.4943 |
| 850 | 1.5606 |
| 950 | 1.4989 |
| 1050 | 1.4267 |
| 1150 | 1.4909 |
| 1250 | 1.5304 |
Evaluation
This model is experimental. Its primary evaluation should be its performance on the SyReC benchmark, specifically measuring the improvement in BLEU and Levenshtein scores compared to the base model on unseen reconstruction tasks.
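For reference, a minimal way to score a reconstruction against the original paragraph is sketched below, using sacrebleu for BLEU and a small dynamic-programming routine for Levenshtein distance. Whether the SyReC benchmark's official evaluation uses these exact settings is not specified here.
import sacrebleu

def levenshtein(a: str, b: str) -> int:
    # Classic character-level edit-distance DP
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Placeholder reference; in practice this is the original (unscrambled) paragraph
reference = "Kepler eventually discovered that the orbit of Mars is an ellipse."
hypothesis = reconstructed_text  # output from the generation example above

bleu = sacrebleu.sentence_bleu(hypothesis, [reference]).score
print(f"BLEU: {bleu:.2f}, Levenshtein distance: {levenshtein(hypothesis, reference)}")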
Limitations and Bias
- This model inherits all the biases of its base model, mistralai/Mistral-7B-Instruct-v0.3.
- The training data comes exclusively from English Wikipedia, and therefore reflects the topical and cultural biases of that source.
- Due to the nature of the fine-tuning task (the "alignment tax"), the model may exhibit reduced creativity or performance on general-purpose, open-ended tasks compared to the base model.
- The model is intended for use in English.
Citation
If you use this model in your work, please consider citing the project.
@misc{syrec_mistral_reconstructor_2025,
  author    = {Fitzgerald, Ambrose},
  title     = {SyReC-Mistral-7B-Reconstructor-v1: A Model for Syntactic Reconstruction},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/ambrosfitz/SyReC-Mistral-7B-Reconstructor-v1}
}