---
base_model: meta-llama/Llama-3.2-1B-Instruct
datasets:
- fineinstructions/template_instantiator_training
tags:
- datadreamer
- datadreamer-0.46.0
- synthetic
- text-generation
pipeline_tag: text-generation
---
|
This model takes an instruction template in the format of [FineTemplates](https://huggingface.co/datasets/fineinstructions/finetemplates) and a document, and returns an instantiated instruction and answer pair.
|
|
|
The output will be a JSON object.
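Concretely, the input is a single JSON object with an `instruction_template` field and a `document` field, and the returned JSON contains an `answer` field in which long quotes from the document are compressed into `<excerpt>prefix<...>suffix</excerpt>` markers (see the `expand` helper in the usage example below). The sketch below is illustrative only; any output key other than `answer`, such as the `instruction` key shown here, is an assumption and may differ from what the model actually emits.

```python
# Illustrative input/output shapes only. The "instruction" key below is an
# assumption; the usage example further down only relies on the "answer" key.
example_input = {
    "instruction_template": "Summarize the following document in one paragraph.",
    "document": "Full source document text goes here ...",
}
example_output = {
    "instruction": "Summarize this article about solar panels in one paragraph.",  # assumed field
    "answer": "<excerpt>Solar panels<...>silicon.</excerpt>",  # excerpts come back compressed
}
```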
|
|
|
## Simple Usage Example
|
|
|
```python
import json
import re

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Helper to expand compressed <excerpt>prefix<...>suffix</excerpt> markers in the
# answer back into the full spans they refer to in the source document
def expand(document, text):
    excerpt_pattern = r"<excerpt>(.*?)<\.\.\.>(.*?)</excerpt>"
    matches = re.findall(excerpt_pattern, text, flags=re.DOTALL)
    replacements = {}
    for prefix, suffix in matches:
        match = re.search(
            re.escape(prefix) + r" (.*?) " + re.escape(suffix),
            document,
            flags=re.DOTALL,
        )
        try:
            if match:
                replacements[f"<excerpt>{prefix}<...>{suffix}</excerpt>"] = match.group(0)
            else:
                return None
        except Exception:
            return None
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('fineinstructions/template_instantiator', revision=None)
tokenizer.padding_side = 'left'
model = AutoModelForCausalLM.from_pretrained('fineinstructions/template_instantiator', revision=None)
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, pad_token_id=tokenizer.pad_token_id, return_full_text=False)

# Run inference to instantiate the instruction template and generate an answer
inputs = [json.dumps({
    "instruction_template": "...",
    "document": "..."
}, indent=2)]
prompts = [tokenizer.apply_chat_template([{'role': 'user', 'content': i}], tokenize=False, add_generation_prompt=True) for i in inputs]
generations = pipe(prompts, max_length=131072, truncation=True, temperature=None, top_p=None, do_sample=False)
output = generations[0][0]['generated_text']
output_json = json.loads(output)

# Expand the answer (inputs[0] is a JSON string, so parse it to recover the document)
output_json["answer"] = expand(document=json.loads(inputs[0])["document"], text=output_json["answer"])

# Print the output JSON
print(output_json)

##### Output JSON:
# {
#   ..
# }
#
```
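
To make the excerpt mechanism concrete, here is a small toy example (hypothetical data, not real model output) showing how the `expand` helper above stitches a compressed answer back together from the source document:

```python
doc = "Solar panels convert sunlight into electricity using photovoltaic cells made of silicon."
compressed = "<excerpt>Solar panels<...>photovoltaic cells</excerpt> are widely deployed today."
print(expand(document=doc, text=compressed))
# -> "Solar panels convert sunlight into electricity using photovoltaic cells are widely deployed today."
```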
|
---
|
This model was trained on a synthetic dataset generated with [DataDreamer 🤖💤](https://datadreamer.dev). The synthetic dataset card and model card can be found [here](datadreamer.json). The training arguments can be found [here](training_args.json).