Shresth12345 committed (verified)
Commit: ca19b60
Parent(s): de1c04f

Upload fine-tuned entity extraction model
README.md ADDED
@@ -0,0 +1,135 @@
---
license: apache-2.0
base_model: HuggingFaceTB/SmolLM-360M
tags:
- text-generation
- entity-extraction
- calendar-events
- fine-tuned
- pytorch
language:
- en
pipeline_tag: text-generation
library_name: transformers
---

# Entity Extraction Model - Fine-tuned SmolLM-360M

This model is a fine-tuned version of [HuggingFaceTB/SmolLM-360M](https://huggingface.co/HuggingFaceTB/SmolLM-360M) for extracting structured entities from natural-language calendar event descriptions.

## Model Description

- **Base Model**: HuggingFaceTB/SmolLM-360M
- **Task**: Entity extraction for calendar events
- **Language**: English
- **License**: Apache 2.0

## Intended Use

This model extracts structured entities from natural-language text describing calendar events. It outputs JSON with the following fields:

- `action`: Type of event (e.g., "meeting", "lunch")
- `date`: Date in DD/MM/YYYY format (relative expressions such as "tomorrow" are passed through as-is, as in the example below)
- `time`: Time in HH:MM AM/PM format
- `attendees`: Array of attendee names (or null)
- `location`: Event location (or null)
- `duration`: Duration description (or null)
- `recurrence`: Recurrence pattern (or null)
- `notes`: Additional notes (or null)

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "Shresth12345/entity-extraction-smollm"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example usage
text = "Meeting with John tomorrow at 2pm for 1 hour at the office"
prompt = f"Extract entities from: {text}"

# Tokenize and generate
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode and strip the prompt to keep only the generated continuation
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
generated_text = response[len(prompt):].strip()
print(generated_text)
```

## Expected Output Format

```json
{
  "action": "Meeting",
  "date": "tomorrow",
  "time": "2:00 PM",
  "attendees": ["John"],
  "location": "office",
  "duration": "1 hour",
  "recurrence": null,
  "notes": null
}
```
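Because the model is a free-form text generator, the JSON object can occasionally be surrounded by stray text. Below is a minimal sketch for turning `generated_text` into a Python dict; the `parse_entities` helper, the regex fallback, and the `None`-on-failure behaviour are illustrative assumptions rather than part of the original example.

```python
import json
import re

def parse_entities(generated_text):
    """Extract the first JSON object from the model output and parse it.

    Illustrative helper: returns None if no well-formed JSON object is found.
    """
    match = re.search(r"\{.*\}", generated_text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

entities = parse_entities(generated_text)
if entities is not None:
    print(entities.get("action"), entities.get("date"), entities.get("attendees"))
```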

## Training Details

- **Training Data**: 793 calendar event samples
- **Training Split**: 70% train, 15% validation, 15% test
- **Custom Loss Function**: Entity-aware loss with a weighted output portion
- **Training Framework**: PyTorch (custom trainer)
- **Evaluation Metrics**: Exact-match accuracy, field-wise accuracy, JSON quality (see the sketch after this list)
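The metric definitions are not spelled out in the card; the sketch below shows one plausible reading of field-wise accuracy, assuming predictions and references are lists of dicts with the eight fields documented above. The helper name and the convention that missing keys count as null are assumptions.

```python
FIELDS = ["action", "date", "time", "attendees",
          "location", "duration", "recurrence", "notes"]

def field_wise_accuracy(predictions, references):
    """Fraction of (example, field) pairs where the prediction matches the reference.

    Illustrative reconstruction only; missing keys are treated as null (None).
    """
    correct, total = 0, 0
    for pred, ref in zip(predictions, references):
        for field in FIELDS:
            total += 1
            if pred.get(field) == ref.get(field):
                correct += 1
    return correct / total if total else 0.0
```

Under the same assumptions, exact-match accuracy would count an example as correct only when all eight fields match.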

## Model Performance

The model performs well on:
- Accurate entity extraction from natural language
- Consistent JSON output format
- Handling of missing/null values
- Recognition of temporal expressions
- Identification of people and locations

## Limitations

- Primarily trained on English calendar events
- May struggle with very complex or ambiguous temporal expressions
- Performance may vary with domain-specific terminology
- Requires the specific input format: "Extract entities from: [text]"

## Training Procedure

This model was fine-tuned using:
1. A custom PyTorch trainer implementation
2. An entity-weighted loss function (weight: 2.0; see the sketch after this list)
3. A cosine annealing learning rate schedule
4. Gradient accumulation for an effectively larger batch size
5. Early stopping based on validation performance
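The entity-aware loss itself is not included in this upload; below is a minimal sketch of how a weighted output portion could be implemented for a causal LM, assuming prompt tokens receive weight 1.0 and the JSON output tokens receive the documented weight of 2.0. The function name, the `output_mask` convention, and the masking scheme are assumptions, not the exact training code.

```python
import torch
import torch.nn.functional as F

def entity_weighted_loss(logits, labels, output_mask, entity_weight=2.0):
    """Token-level cross-entropy with extra weight on output (JSON) tokens.

    logits:      (batch, seq_len, vocab_size) from the causal LM
    labels:      (batch, seq_len) target token ids, -100 for ignored positions
    output_mask: (batch, seq_len) bool tensor, True where the token belongs
                 to the JSON output portion of the training example
    Illustrative reconstruction of an "entity-aware" loss, not the exact
    function used to train this model.
    """
    # Shift so that position i predicts token i+1 (standard causal LM setup).
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    shift_mask = output_mask[:, 1:].contiguous()

    # Per-token loss; ignored positions (-100) contribute zero.
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
        reduction="none",
    ).view(shift_labels.size())

    # Weight of 2.0 on output tokens, 1.0 elsewhere; average over valid tokens.
    weights = 1.0 + (entity_weight - 1.0) * shift_mask.float()
    valid = (shift_labels != -100).float()
    return (loss * weights * valid).sum() / valid.sum().clamp(min=1.0)
```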

## Citation

If you use this model, please cite:

```bibtex
@misc{entity-extraction-smollm,
  title={Entity Extraction Fine-tuned SmolLM-360M},
  author={Your Name},
  year={2024},
  howpublished={\url{https://huggingface.co/Shresth12345/entity-extraction-smollm}}
}
```

## Contact

For questions about this model, please open an issue in the repository or contact the author.
config.json ADDED
@@ -0,0 +1,29 @@
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "eos_token_id": 0,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 960,
  "initializer_range": 0.02,
  "intermediate_size": 2560,
  "max_position_embeddings": 2048,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 15,
  "num_hidden_layers": 32,
  "num_key_value_heads": 5,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": true,
  "torch_dtype": "float32",
  "transformers_version": "4.55.0",
  "use_cache": true,
  "vocab_size": 49152
}
generation_config.json ADDED
@@ -0,0 +1,6 @@
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": 0,
  "transformers_version": "4.55.0"
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7d458fbc9b5a0fc2c9c4572a5bb385a2363af0ebe089f29564eb26e2a1b32dcd
size 1447317080
special_tokens_map.json ADDED
@@ -0,0 +1,43 @@
{
  "additional_special_tokens": [
    "<|endoftext|>",
    "<|im_start|>",
    "<|im_end|>",
    "<repo_name>",
    "<reponame>",
    "<file_sep>",
    "<filename>",
    "<gh_stars>",
    "<issue_start>",
    "<issue_comment>",
    "<issue_closed>",
    "<jupyter_start>",
    "<jupyter_text>",
    "<jupyter_code>",
    "<jupyter_output>",
    "<jupyter_script>",
    "<empty_output>"
  ],
  "bos_token": { "content": "<|endoftext|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false },
  "eos_token": { "content": "<|endoftext|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false },
  "pad_token": "<|endoftext|>",
  "unk_token": { "content": "<|endoftext|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,169 @@
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0":  { "content": "<|endoftext|>",    "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "1":  { "content": "<|im_start|>",     "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "2":  { "content": "<|im_end|>",       "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "3":  { "content": "<repo_name>",      "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "4":  { "content": "<reponame>",       "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "5":  { "content": "<file_sep>",       "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "6":  { "content": "<filename>",       "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "7":  { "content": "<gh_stars>",       "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "8":  { "content": "<issue_start>",    "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "9":  { "content": "<issue_comment>",  "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "10": { "content": "<issue_closed>",   "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "11": { "content": "<jupyter_start>",  "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "12": { "content": "<jupyter_text>",   "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "13": { "content": "<jupyter_code>",   "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "14": { "content": "<jupyter_output>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "15": { "content": "<jupyter_script>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true },
    "16": { "content": "<empty_output>",   "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true }
  },
  "additional_special_tokens": [
    "<|endoftext|>",
    "<|im_start|>",
    "<|im_end|>",
    "<repo_name>",
    "<reponame>",
    "<file_sep>",
    "<filename>",
    "<gh_stars>",
    "<issue_start>",
    "<issue_comment>",
    "<issue_closed>",
    "<jupyter_start>",
    "<jupyter_text>",
    "<jupyter_code>",
    "<jupyter_output>",
    "<jupyter_script>",
    "<empty_output>"
  ],
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "extra_special_tokens": {},
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<|endoftext|>",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>",
  "vocab_size": 49152
}
training_config.json ADDED
@@ -0,0 +1,14 @@
{
  "model_name": "HuggingFaceTB/SmolLM-360M",
  "max_length": 512,
  "learning_rate": 5e-05,
  "batch_size": 8,
  "num_epochs": 3,
  "entity_loss_weight": 2.0,
  "training_completed": true,
  "checkpoint_info": {
    "epoch": 1,
    "global_step": 24,
    "best_eval_score": "unknown"
  }
}
usage_example.json ADDED
@@ -0,0 +1,28 @@
{
  "example_usage": {
    "input": "Extract entities from: Meeting with John tomorrow at 2pm for 1 hour",
    "expected_output": {
      "action": "Meeting",
      "date": "tomorrow",
      "time": "2:00 PM",
      "attendees": [
        "John"
      ],
      "location": null,
      "duration": "1 hour",
      "recurrence": null,
      "notes": null
    }
  },
  "supported_fields": [
    "action",
    "date",
    "time",
    "attendees",
    "location",
    "duration",
    "recurrence",
    "notes"
  ],
  "input_format": "Extract entities from: [your event description]"
}
vocab.json ADDED
The diff for this file is too large to render. See raw diff