---
base_model: meta-llama/Llama-3.2-1B-Instruct
datasets:
- fineinstructions/template_instantiator_training
tags:
- datadreamer
- datadreamer-0.46.0
- synthetic
- text-generation
pipeline_tag: text-generation
---
This model takes an instruction template in the format of [FineTemplates](https://huggingface.co/datasets/fineinstructions/finetemplates) and a document, and returns an instantiated instruction and answer pair as a JSON object.
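For reference, here is a minimal sketch of the input and output shapes. The `instruction_template`, `document`, and `answer` keys appear in the usage example below; the `instruction` key name is assumed here for illustration:

```python
# Input (serialized to a JSON string and sent as the user message):
example_input = {
    "instruction_template": "...",  # a FineTemplates instruction template
    "document": "...",              # the document to ground the pair in
}
# Output (a JSON object; the "instruction" key name is an assumption):
example_output = {
    "instruction": "...",  # the instantiated instruction
    "answer": "...",       # the answer, possibly containing <excerpt>...<...>...</excerpt> markers
}
```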
## Simple Usage Example
```python
import json
import re
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# Helper to expand "<excerpt>prefix<...>suffix</excerpt>" markers in the answer
# back into the full span of document text they stand for
def expand(document, text):
    excerpt_pattern = r"<excerpt>(.*?)<\.\.\.>(.*?)</excerpt>"
    matches = re.findall(excerpt_pattern, text, flags=re.DOTALL)
    replacements = {}
    for prefix, suffix in matches:
        match = re.search(
            re.escape(prefix) + r" (.*?) " + re.escape(suffix),
            document,
            flags=re.DOTALL,
        )
        try:
            if match:
                replacements[f"<excerpt>{prefix}<...>{suffix}</excerpt>"] = match.group(0)
            else:
                return None
        except Exception:
            return None
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('fineinstructions/template_instantiator', revision=None)
tokenizer.padding_side = 'left'
model = AutoModelForCausalLM.from_pretrained('fineinstructions/template_instantiator', revision=None)
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, pad_token_id=tokenizer.pad_token_id, return_full_text=False)
# Run inference to instantiate the instruction template and generate an answer
input_data = {
    "instruction_template": "...",
    "document": "..."
}
inputs = [json.dumps(input_data, indent=2)]
prompts = [tokenizer.apply_chat_template([{'role': 'user', 'content': i}], tokenize=False, add_generation_prompt=True) for i in inputs]
generations = pipe(prompts, max_length=131072, truncation=True, temperature=None, top_p=None, do_sample=False)
output = generations[0][0]['generated_text']
output_json = json.loads(output)
# Expand the answer
output_json["answer"] = expand(document=input_data["document"], text=output_json["answer"])
# Print the output JSON
print(output_json)
##### Output JSON:
# {
#   ...
# }
```
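Answers may contain compressed excerpts of the form `<excerpt>prefix<...>suffix</excerpt>`, which the `expand` helper resolves back into the full span of text between `prefix` and `suffix` in the source document. A small demonstration, run after the snippet above, with a made-up document and answer:

```python
doc = "The quick brown fox jumps over the lazy dog near the river."
ans = "As the text notes, <excerpt>The quick<...>the lazy dog</excerpt>."
print(expand(document=doc, text=ans))
# As the text notes, The quick brown fox jumps over the lazy dog.
```

If any excerpt marker cannot be located in the document, `expand` returns `None`.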
---
This model was trained on a synthetic dataset with [DataDreamer 🤖💤](https://datadreamer.dev). The synthetic dataset card and model card can be found [here](datadreamer.json). The training arguments can be found [here](training_args.json).