Mir-2002 committed · Commit 4dd2328 · verified · 1 Parent(s): 2f7f445

Update README.md

Files changed (1): README.md +27 -1
README.md CHANGED
@@ -88,7 +88,33 @@ Task: a task that was abandoned but should be retried or None if there are
  no abandoned tasks that should be retried.</s>
  ```

- This helps the decoder know under what downstream task it is currently being fine tuned in, improving the process.
+ This helps the decoder know which downstream task it is currently being fine-tuned on, improving the process. However, the paper doesn't clearly state whether the token is already included in the tokenizer's vocabulary. To be safe, I manually added the token to the tokenizer's vocabulary using this script:
+
+ ```python
+ import os
+
+ from transformers import AutoTokenizer, T5ForConditionalGeneration
+
+ model_name = "Salesforce/codet5p-220m-bimodal"
+ model_path = "/path/to/your/model"
+
+ os.makedirs(model_path, exist_ok=True)
+
+ # Load the base model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = T5ForConditionalGeneration.from_pretrained(model_name)
+
+ # Register <tdec> as an additional special token
+ tokenizer.add_special_tokens({"additional_special_tokens": ["<tdec>"]})
+
+ # Resize the embedding matrix to match the new vocabulary size
+ model.resize_token_embeddings(len(tokenizer))
+
+ # Save both so fine-tuning can load them from the custom directory
+ tokenizer.save_pretrained(model_path)
+ model.save_pretrained(model_path)
+ ```
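+
+ As a quick sanity check, here is a minimal sketch (assuming the same `model_path` as above; the input string is a hypothetical example) that reloads the saved tokenizer and confirms `<tdec>` now maps to a single token ID before being prepended to fine-tuning inputs:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ model_path = "/path/to/your/model"
+
+ # Reload the tokenizer that was saved with the added special token
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+ # <tdec> should encode to exactly one token ID
+ ids = tokenizer.encode("<tdec>", add_special_tokens=False)
+ assert len(ids) == 1, f"expected a single ID, got {ids}"
+
+ # Prepend the token to a source sequence (toy example) for the downstream task
+ batch = tokenizer("<tdec> def add(a, b): return a + b")
+ print(batch.input_ids)
+ ```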
+
  # Hyperparameters

  MAX_SOURCE_LENGTH = 256 <br>