SyReC-Mistral-7B-Reconstructor-v1
Model Description
This model is a specialized, fine-tuned version of mistralai/Mistral-7B-Instruct-v0.3. It has been trained explicitly for syntactic and semantic reconstruction: rebuilding a coherent, grammatically correct paragraph from a disordered "bag of words."
The model was fine-tuned on the SyReC (Syntactic Reconstruction Corpus), a dataset generated from English Wikipedia articles. This training process teaches the model to infer grammatical structure, logical flow, and narrative coherence from a fixed set of semantic tokens, forcing it to develop a deeper understanding of language structure.
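To make the task concrete, the sketch below shows roughly how a paragraph can be turned into a SyReC-style "bag of words" (lowercased, alphabetically sorted, comma-separated, matching the sample input further down). The exact preprocessing used to build the published dataset may differ; treat this as an illustration only.
import re

def scramble(paragraph: str) -> str:
    # Illustrative guess at the SyReC scrambling: lowercase, keep word tokens
    # (including apostrophes), sort alphabetically, join with ", ".
    words = re.findall(r"[a-z0-9]+(?:['’][a-z]+)?", paragraph.lower())
    return ", ".join(sorted(words))

print(scramble("Kepler eventually discovered that the orbit of Mars is an ellipse."))
# -> an, discovered, ellipse, eventually, is, kepler, mars, of, orbit, that, the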
The primary goal of this model is to serve as an expert tool for tasks requiring high-fidelity adherence to a provided context.
Intended Use
This model excels at tasks that require strict grounding in a source text and precise adherence to constraints.
- Primary Use Case: Solving the syntactic reconstruction task as defined by the SyReC benchmark.
- Downstream Applications:
  - High-Fidelity RAG (Retrieval-Augmented Generation): Answering questions based only on the provided context documents, with a reduced tendency to hallucinate or inject outside knowledge (see the prompt sketch after this list).
  - Fact-Based Summarization: Creating summaries that are more extractive and factually grounded in the source text.
  - Complex Instruction Following: Adhering to strict positive and negative constraints within a prompt (e.g., "use only these words," "do not mention X").
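As an illustration of the grounded-QA use case, the snippet below shows one way a retrieved context and a question could be packed into the chat format. The system wording and layout here are assumptions for illustration only, not part of the model's training setup.
# Hypothetical grounded-QA prompt; the system wording below is an assumption,
# not the prompt the model was trained with.
rag_messages = [
    {"role": "system", "content": "Answer using only the provided context. If the answer is not in the context, say so."},
    {"role": "user", "content": "Context:\nKepler discovered that the orbit of Mars is an ellipse.\n\nQuestion: What shape is the orbit of Mars?"},
]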
This model is not intended for general-purpose creative writing, as its training may have biased it towards more literal and structured outputs.
How to Use
You can use this model with the transformers library's text-generation pipeline. Make sure to format the input using the model's chat template.
from transformers import pipeline
import torch
# The model you fine-tuned
model_id = "ambrosfitz/SyReC-Mistral-7B-Reconstructor-v1"
# The system prompt the model was trained with
system_prompt = "You are an expert at syntactic reconstruction. Reconstruct the original, coherent paragraph using only the provided words. You must use every word exactly once."
# A sample scrambled paragraph (from the SyReC dataset)
scrambled_text = "a, a, a, a, a, allow, an, as, assumed, belonging, brahe’s, but, called, circle, circle, circles, closed, conic, consistent, curve, curves, data, did, discovered, doing, ellipse, ellipse, eventually, family, find, flattened, for, had, he, him, his, initially, is, is, its, kepler, kind, known, mars, next, not, object, observations, of, of, of, of, orbit, orbit, orbits, path, planet, planets, sections, shape, simplest, so, somewhat, space, that, that, the, the, the, the, the, the, through, to, to, were, were, with, with, working"
# Setup the pipeline
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
# Format the prompt using the chat template
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": scrambled_text},
]
# The tokenizer is loaded with the pipeline and applies the template
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Generate the output
outputs = pipe(prompt, max_new_tokens=256, do_sample=False)
# The pipeline returns a list with one dict; strip the prompt to keep only the reconstruction
reconstructed_text = outputs[0]["generated_text"][len(prompt):].strip()
print("--- Model Reconstruction ---")
print(reconstructed_text)
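If GPU memory is limited, the same checkpoint can also be loaded with 4-bit quantization and used with AutoModelForCausalLM directly. This is an optional alternative to the pipeline setup above, assuming bitsandbytes is installed.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Optional: 4-bit quantized loading for smaller GPUs (requires bitsandbytes)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

model_id = "ambrosfitz/SyReC-Mistral-7B-Reconstructor-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)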
Training Details
- Base Model: mistralai/Mistral-7B-Instruct-v0.3
- Dataset: ambrosfitz/SyReC
- Fine-tuning Method: Parameter-Efficient Fine-Tuning (PEFT) using LoRA, trained for 1 epoch (a representative LoRA configuration is sketched after this list).
- Framework: Hugging Face transformers, peft, and trl's SFTTrainer.
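The exact LoRA hyperparameters are not reported here; the configuration below is a representative sketch of a typical setup for this kind of run using the peft API, and the specific values (rank, alpha, dropout, target modules) are assumptions rather than the ones used for this checkpoint.
from peft import LoraConfig

# Representative LoRA configuration -- the actual training values are not published,
# so the numbers below are illustrative only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)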
Training Loss
The training loss showed a general downward trend with some step-to-step fluctuation, indicating that the model was learning the reconstruction task over the course of training.
| Step | Training Loss |
|---|---|
| 25 | 2.0018 |
| 150 | 1.5794 |
| 250 | 1.5323 |
| 350 | 1.4766 |
| 450 | 1.6083 |
| 550 | 1.5455 |
| 650 | 1.4824 |
| 750 | 1.4943 |
| 850 | 1.5606 |
| 950 | 1.4989 |
| 1050 | 1.4267 |
| 1150 | 1.4909 |
| 1250 | 1.5304 |
Evaluation
This model is experimental. Its primary evaluation should be its performance on the SyReC benchmark, specifically measuring the improvement in BLEU and Levenshtein scores compared to the base model on unseen reconstruction tasks.
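For reference, a minimal way to score a reconstruction against the original paragraph is sketched below, using sacrebleu for BLEU and a small dynamic-programming routine for Levenshtein distance. Whether the SyReC benchmark's official evaluation uses these exact settings is not specified here.
import sacrebleu

def levenshtein(a: str, b: str) -> int:
    # Classic character-level edit-distance DP
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Placeholder reference; in practice this is the original (unscrambled) paragraph
reference = "Kepler eventually discovered that the orbit of Mars is an ellipse."
hypothesis = reconstructed_text  # output from the generation example above

bleu = sacrebleu.sentence_bleu(hypothesis, [reference]).score
print(f"BLEU: {bleu:.2f}, Levenshtein distance: {levenshtein(hypothesis, reference)}")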
Limitations and Bias
- This model inherits all the biases of its base model, mistralai/Mistral-7B-Instruct-v0.3.
- The training data comes exclusively from English Wikipedia, and therefore reflects the topical and cultural biases of that source.
- Due to the nature of the fine-tuning task (the "alignment tax"), the model may exhibit reduced creativity or performance on general-purpose, open-ended tasks compared to the base model.
- The model is intended for use in English.
Citation
If you use this model in your work, please consider citing the project.
@misc{syrec_mistral_reconstructor_2025,
  author    = {Fitzgerald, Ambrose},
  title     = {SyReC-Mistral-7B-Reconstructor-v1: A Model for Syntactic Reconstruction},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/ambrosfitz/SyReC-Mistral-7B-Reconstructor-v1}
}