AshwinSankar committed
Commit d6686f1 · verified · 1 Parent(s): bbcf2db

Upload model
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,27 +1,172 @@
  ---
- license: apache-2.0
  language:
- - hi
- - en
- - ta
- - te
- - bn
- - gu
- - kn
- - ml
- - mr
- - pa
- - or
- - as
  tags:
- - punctuation
- - indic-languages
- - cadence
  datasets:
  - ai4bharat/sangraha
  - HuggingFaceFW/fineweb-2
  metrics:
  - f1
- base_model:
- - google/gemma-3-1b-pt
- ---
  ---
  language:
+ - en
+ - as # Assamese
+ - bn # Bengali
+ - brx # Bodo
+ - doi # Dogri
+ - gu # Gujarati
+ - hi # Hindi
+ - kn # Kannada
+ - ks # Kashmiri
+ - kok # Konkani
+ - mai # Maithili
+ - ml # Malayalam
+ - mni # Manipuri (Meitei Mayek script requires transliteration to the Bengali script for the tokenizer)
+ - mr # Marathi
+ - ne # Nepali
+ - or # Odia
+ - pa # Punjabi
+ - sa # Sanskrit
+ - sat # Santali
+ - sd # Sindhi
+ - ta # Tamil
+ - te # Telugu
+ - ur # Urdu
+ license: mit
  tags:
+ - punctuation-restoration
+ - multilingual
+ - indic-languages
+ - ai4bharat
  datasets:
  - ai4bharat/sangraha
  - HuggingFaceFW/fineweb-2
  metrics:
  - f1
+ pipeline_tag: token-classification
+ library_name: cadence-punctuation
+ base_model: google/gemma-3-1b-it
+ widget:
+ - text: "hello world how are you today"
+   example_title: "English Punctuation"
+ - text: "यह एक हिंदी वाक्य है"
+   example_title: "Hindi Punctuation"
+ - text: "cadence is a great model for punctuation"
+   example_title: "Another English Example"
+ ---
+
+ # Cadence
+
+ A multilingual punctuation restoration model for English and 22 Indic languages, built on Gemma-3-1B.
+
+ ## Features
+ - **Multilingual Support**: English + 22 Indic languages
+ - **Script-Aware**: Handles multiple scripts with the appropriate punctuation rules
+ - **Single Model**: One model covers every language and punctuation mark (no language identifier required)
+ - **Bidirectional Encoder**: Non-causal attention sees the full context in one pass, making inference fast
+ - **Efficient Processing**: Supports batch processing and a sliding window for long texts
+ - **AutoModel Compatible**: Easy integration with the Hugging Face ecosystem
+
+ ## Installation
+
+ ```bash
+ pip install cadence-punctuation
+ ```
+
+ ## Quick Start
+
+ ### Using the Simple Interface
+
+ ```python
+ from cadence import PunctuationModel  # the "cadence-punctuation" pip package installs the `cadence` module
+
+ # Load model (local path or Hugging Face model ID)
+ model = PunctuationModel("path/to/download/weights")
+
+ # Punctuate a single text
+ text = "hello world how are you today"
+ result = model.punctuate([text])
+ print(result[0])  # "Hello world, how are you today?"
+
+ # Punctuate multiple texts
+ texts = [
+     "hello world how are you",
+     "this is another test sentence",
+     "यह एक हिंदी वाक्य है"  # Hindi example
+ ]
+ results = model.punctuate(texts, batch_size=8)
+ for original, punctuated in zip(texts, results):
+     print(f"Original: {original}")
+     print(f"Punctuated: {punctuated}")
+     print()
+ ```
+
+ ### Using AutoModel
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+
+ # Load model and tokenizer
+ model_name = "ai4bharat/Cadence"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
+
+ # Prepare input
+ text = "hello world how are you"
+ inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
+
+ # Get predictions
+ with torch.no_grad():
+     outputs = model(**inputs)
+     predictions = torch.argmax(outputs.logits, dim=-1)
+
+ print(predictions)
+ ```
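+
+ The output is a tensor of per-token label IDs. A minimal decoding sketch (illustrative only, not the packaged pipeline; it relies on the `id2label` mapping shipped in `config.json` below, where the label `O` means "no punctuation after this token"):
+
+ ```python
+ # Map each predicted label ID back to its punctuation mark and
+ # re-assemble the text; special tokens (<bos>, <pad>, ...) are skipped.
+ id2label = model.config.id2label  # e.g. {0: "O", 1: ".", 2: ",", ...}
+ tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+ pieces = []
+ for token, label_id in zip(tokens, predictions[0].tolist()):
+     if token in tokenizer.all_special_tokens:
+         continue
+     label = id2label[label_id]  # transformers normalizes these keys to int
+     pieces.append(token if label == "O" else token + label)
+ print(tokenizer.convert_tokens_to_string(pieces))
+ ```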
+
+ ## Officially Supported Languages
+ English, Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Sanskrit, Santali, Sindhi, Tamil, Telugu, Urdu
+
+ The tokenizer does not support Manipuri's Meitei Mayek script; the model can punctuate Manipuri text once it has been transliterated to the Bengali script (one possible approach is sketched below).
+
+ You can also try this model on languages not listed above, though performance may vary.
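+
+ A minimal transliteration sketch (the third-party `aksharamukha` package and the script names used here are assumptions for illustration; any Meitei Mayek → Bengali transliterator works):
+
+ ```python
+ # pip install aksharamukha
+ from aksharamukha import transliterate
+
+ mni_text = "..."  # Manipuri text written in Meitei Mayek script
+ # Script identifiers are assumptions; check the aksharamukha documentation.
+ bengali_text = transliterate.process("MeeteiMayek", "Bengali", mni_text)
+
+ # "model" is the PunctuationModel instance from the Quick Start example
+ result = model.punctuate([bengali_text])
+ ```
+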
+ ## Supported Punctuation
+ The model can predict the following punctuation marks:
+ - Period (.)
+ - Comma (,)
+ - Question mark (?)
+ - Exclamation mark (!)
+ - Semicolon (;)
+ - Colon (:)
+ - Hyphen (-)
+ - Quotes (" and ')
+ - Ellipsis (...)
+ - Parentheses ( and )
+ - Hindi danda (।)
+ - Urdu punctuation (۔ ، ؟)
+ - Arabic punctuation (٬ ،)
+ - Santali punctuation (᱾ ᱾।)
+ - Sanskrit double danda (॥)
+ - And various combinations (e.g. ." or ।")
+
+ ## Configuration Options
+
+ ### PunctuationModel Parameters
+
+ - `model_path`: Local path to the model weights (or the directory to download them into)
+ - `gpu_id`: GPU device ID (None for auto-detection)
+ - `cpu`: Force CPU usage (default: False)
+ - `max_length`: Maximum sequence length (default: 300)
+ - `sliding_window`: Enable a sliding window for long texts (default: True)
+ - `verbose`: Enable verbose logging (default: False)
+ - `d_type`: Precision with which the weights are loaded (default: bfloat16)
+
+ ```python
+ # Custom configuration
+ model = PunctuationModel(
+     model_path="path/to/download/weights",
+     gpu_id=0,             # Use a specific GPU
+     max_length=512,       # Allow longer sequences
+     sliding_window=True,  # Handle long texts
+     verbose=False,        # Quiet mode
+     d_type="bfloat16"
+ )
+
+ # Process long texts with the sliding window
+ long_text = "Your very long text here..." * 100
+ result = model.punctuate([long_text])
+ ```
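+
+ With `sliding_window=False`, inputs longer than `max_length` are presumably truncated (an assumption worth verifying for your version). A naive manual fallback, purely illustrative, since the built-in sliding window already stitches overlapping windows and should normally be preferred:
+
+ ```python
+ # Split on whitespace into ~250-word chunks, punctuate each chunk
+ # independently, then rejoin. Unlike the built-in sliding window,
+ # this loses cross-chunk context at the boundaries.
+ words = long_text.split()
+ chunks = [" ".join(words[i:i + 250]) for i in range(0, len(words), 250)]
+ punctuated_chunks = model.punctuate(chunks, batch_size=8)
+ print(" ".join(punctuated_chunks))
+ ```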
config.json ADDED
@@ -0,0 +1,102 @@
+ {
+   "architectures": [
+     "Gemma3ForTokenClassification"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "attn_logit_softcapping": null,
+   "bos_token_id": 2,
+   "cache_implementation": "hybrid",
+   "classifier_dropout_prob": 0.0,
+   "eos_token_id": 1,
+   "final_logit_softcapping": null,
+   "head_dim": 256,
+   "hidden_activation": "gelu_pytorch_tanh",
+   "hidden_size": 1152,
+   "id2label": {
+     "0": "O",
+     "1": ".",
+     "10": "\"",
+     "11": "\u0964",
+     "12": "(",
+     "13": ")",
+     "14": ":",
+     "15": "\u066c",
+     "16": "\u06d4",
+     "17": "\u061f",
+     "18": ".\"",
+     "19": ").",
+     "2": ",",
+     "20": "),",
+     "21": "\",",
+     "22": "\".",
+     "23": "?\"",
+     "24": "\"?",
+     "25": "\u0964\"",
+     "26": "\"\u0964",
+     "27": "\u060c",
+     "28": "\u1c7e",
+     "29": "\u0965",
+     "3": "?",
+     "30": "\u1c7e\u0964",
+     "4": "-",
+     "5": ";",
+     "6": "_",
+     "7": "!",
+     "8": "'",
+     "9": "..."
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 6912,
+   "label2id": {
+     "!": 7,
+     "\"": 10,
+     "\",": 21,
+     "\".": 22,
+     "\"?": 24,
+     "\"\u0964": 26,
+     "'": 8,
+     "(": 12,
+     ")": 13,
+     "),": 20,
+     ").": 19,
+     ",": 2,
+     "-": 4,
+     ".": 1,
+     ".\"": 18,
+     "...": 9,
+     ":": 14,
+     ";": 5,
+     "?": 3,
+     "?\"": 23,
+     "O": 0,
+     "_": 6,
+     "\u060c": 27,
+     "\u061f": 17,
+     "\u066c": 15,
+     "\u06d4": 16,
+     "\u0964": 11,
+     "\u0964\"": 25,
+     "\u0965": 29,
+     "\u1c7e": 28,
+     "\u1c7e\u0964": 30
+   },
+   "max_position_embeddings": 32768,
+   "model_type": "cadence_punctuation",
+   "num_attention_heads": 4,
+   "num_hidden_layers": 26,
+   "num_key_value_heads": 1,
+   "pad_token_id": 0,
+   "query_pre_attn_scalar": 256,
+   "rms_norm_eps": 1e-06,
+   "rope_local_base_freq": 10000,
+   "rope_scaling": null,
+   "rope_theta": 1000000,
+   "sliding_window": 512,
+   "sliding_window_pattern": 6,
+   "torch_dtype": "float32",
+   "transformers_version": "4.51.3",
+   "use_cache": true,
+   "use_non_causal_attention": true,
+   "vocab_size": 262146
+ }
generation_config.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 2,
+   "cache_implementation": "hybrid",
+   "eos_token_id": 1,
+   "pad_token_id": 0,
+   "transformers_version": "4.51.3"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2fd66fdcd492b575773877dbaa9a113a5f710d8a79020cfb26c35927dcfc9142
+ size 3999735300
special_tokens_map.json ADDED
@@ -0,0 +1,40 @@
+ {
+   "boi_token": "<start_of_image>",
+   "bos_token": {
+     "content": "<bos>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eoi_token": "<end_of_image>",
+   "eos_token": {
+     "content": "<eos>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "image_token": "<image_soft_token>",
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:88ec6df915623f4b307188dbb6fe60ddb8a1ef273c864ba38de97a320dd17dea
+ size 33384751
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff