Mir-2002 committed · Commit 4dd2328 · verified · 1 Parent(s): 2f7f445

Update README.md

Files changed (1): README.md +27 -1
README.md CHANGED
@@ -88,7 +88,33 @@ Task: a task that was abandoned but should be retried or None if there are
  no abandoned tasks that should be retried.</s>
  ```

- This helps the decoder know under what downstream task it is currently being fine tuned in, improving the process.
+ This helps the decoder know which downstream task it is currently being fine-tuned on, improving the process. However, the paper doesn't clearly state whether the token is already included in the tokenizer's vocabulary. To be safe, I manually added the token to the tokenizer's vocabulary using this script:
+
+ ```python
+ import os
+
+ from transformers import AutoTokenizer, T5ForConditionalGeneration
+
+ model_name = "Salesforce/codet5p-220m-bimodal"
+ model_path = "/path/to/your/model"
+
+ os.makedirs(model_path, exist_ok=True)
+
+ # Load the base model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = T5ForConditionalGeneration.from_pretrained(model_name)
+
+ # Register <tdec> as an additional special token
+ tokenizer.add_special_tokens({"additional_special_tokens": ["<tdec>"]})
+
+ # Resize the embedding matrix to match the new vocabulary size
+ model.resize_token_embeddings(len(tokenizer))
+
+ # Save both so fine-tuning can load them from the custom directory
+ tokenizer.save_pretrained(model_path)
+ model.save_pretrained(model_path)
+ ```
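+
+ As a quick sanity check, here is a minimal sketch (assuming the same `model_path` as above; the input string is a hypothetical example) that reloads the saved tokenizer and confirms `<tdec>` now maps to a single token ID before being prepended to fine-tuning inputs:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ model_path = "/path/to/your/model"
+
+ # Reload the tokenizer that was saved with the added special token
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+ # <tdec> should encode to exactly one token ID
+ ids = tokenizer.encode("<tdec>", add_special_tokens=False)
+ assert len(ids) == 1, f"expected a single ID, got {ids}"
+
+ # Prepend the token to a source sequence (toy example) for the downstream task
+ batch = tokenizer("<tdec> def add(a, b): return a + b")
+ print(batch.input_ids)
+ ```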
+
  # Hyperparameters

  MAX_SOURCE_LENGTH = 256 <br>