---
base_model: meta-llama/Llama-3.2-1B-Instruct
datasets:
- fineinstructions/template_instantiator_training
tags:
- datadreamer
- datadreamer-0.46.0
- synthetic
- text-generation
pipeline_tag: text-generation
---
This model takes an instruction template in the format of [FineTemplates](https://huggingface.co/datasets/fineinstructions/finetemplates) and a document, and returns an instantiated instruction and answer pair.

The output will be a JSON object.
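
As a rough sketch, the input is a single JSON object holding the template and the document, and the output is a JSON object holding the instantiated pair. The values below are invented for illustration; of the output fields, only `answer` is confirmed by the usage example that follows, and the `instruction` field is an assumption:

```python
# Illustrative shapes only -- values are invented. The "answer" field is
# used by the usage example below; the "instruction" field is an assumption.
model_input = {
    "instruction_template": "Summarize this document in one paragraph.",
    "document": "Transformers are a neural network architecture ...",
}
model_output = {
    "instruction": "Summarize the document in one paragraph.",  # assumed field
    "answer": "<excerpt>Transformers are<...>architecture ...</excerpt>",
}
```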

## Simple Usage Example

```python
import json
import re
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Helper to expand excerpts in the answer. The model may abbreviate long
# quotations as "<excerpt>prefix<...>suffix</excerpt>" markers; this helper
# finds the full "prefix ... suffix" span in the source document and splices
# it back into the answer. Returns None if any span cannot be recovered.
def expand(document, text):
    excerpt_pattern = r"<excerpt>(.*?)<\.\.\.>(.*?)</excerpt>"
    matches = re.findall(excerpt_pattern, text, flags=re.DOTALL)
    replacements = {}
    for prefix, suffix in matches:
        # Locate the full span between the excerpt's prefix and suffix
        match = re.search(
            re.escape(prefix) + r" (.*?) " + re.escape(suffix),
            document,
            flags=re.DOTALL,
        )
        try:
            if match:
                replacements[f"<excerpt>{prefix}<...>{suffix}</excerpt>"] = match.group(
                    0
                )
            else:
                return None
        except Exception:
            return None
    # Replace each abbreviated excerpt marker with its recovered full span
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('fineinstructions/template_instantiator', revision=None)
tokenizer.padding_side = 'left'
model = AutoModelForCausalLM.from_pretrained('fineinstructions/template_instantiator', revision=None)
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, pad_token_id=tokenizer.pad_token_id, return_full_text=False)

# Run inference to instantiate the instruction template and generate an answer
inputs = [json.dumps({
  "instruction_template": "...",
  "document": "..."
}, indent=2)]
prompts = [tokenizer.apply_chat_template([{'role': 'user', 'content': i}], tokenize=False, add_generation_prompt=True) for i in inputs]
generations = pipe(prompts, max_length=131072, truncation=True, temperature=None, top_p=None, do_sample=False)
output = generations[0][0]['generated_text']
output_json = json.loads(output)

# Expand the answer (inputs[0] is a JSON string, so parse it to get the document)
output_json["answer"] = expand(document=json.loads(inputs[0])["document"], text=output_json["answer"])

# Print the output JSON
print(output_json)

##### Output JSON:
# {
# ..
# }
# 
```
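
Because the model may abbreviate long quotations as `<excerpt>prefix<...>suffix</excerpt>` markers, the `expand` helper re-locates the full `prefix ... suffix` span in the source document and splices it back into the answer, returning `None` if a span cannot be found. A minimal sketch of that behavior on invented strings:

```python
document = "The quick brown fox jumps over the lazy dog near the river bank."
answer = "It says: <excerpt>The quick<...>the river bank.</excerpt>"

# expand() recovers the full span between the prefix and the suffix
# from `document` and substitutes it for the abbreviated marker.
print(expand(document=document, text=answer))
# It says: The quick brown fox jumps over the lazy dog near the river bank.
```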
---
This model was trained on a synthetic dataset generated with [DataDreamer 🤖💤](https://datadreamer.dev). The synthetic dataset card and model card can be found [here](datadreamer.json), and the training arguments can be found [here](training_args.json).