---
datasets:
- Mir-2002/python-google-style-docstrings
language:
- en
metrics:
- bleu
- rouge
base_model:
- Salesforce/codet5p-220m-bimodal
pipeline_tag: summarization
tags:
- code
---

# Overview

This is a fine-tuned CodeT5+ (220m) bimodal model trained on a dataset of 59,000 Python code-docstring pairs. The docstrings follow the Google style format, which is structured as follows:

```
Args:
    param_name (type): Description of the parameter.

Returns:
    type: Description of the return value.

Raises:
    ExceptionType: Description of when the exception is raised.
```

For more information on the dataset, see Mir-2002/python-google-style-docstrings, referenced above.

You can test the model as follows:

```
from transformers import T5ForConditionalGeneration, AutoTokenizer

checkpoint = "Mir-2002/codet5p-google-style-docstrings"
device = "cuda"  # or "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)

source = """
def calculate_sum(a, b):
    return a + b
"""

inputs = tokenizer.encode(source, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Calculate the sum of two numbers.
#
# Args:
#     a (int): The first number.
#     b (int): The second number.
```

# Hyperparameters

```
MAX_SOURCE_LENGTH = 256
MAX_TARGET_LENGTH = 128
BATCH_SIZE = 16
NUM_EPOCHS = 35
LEARNING_RATE = 3e-5
GRADIENT_ACCUMULATION_STEPS = 4
EARLY_STOPPING_PATIENCE = 2
WEIGHT_DECAY = 0.01
OPTIMIZER = ADAFACTOR
LR_SCHEDULER = LINEAR
```
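For reference, here is a minimal sketch of how these hyperparameters could map onto Hugging Face `Seq2SeqTrainingArguments`. The output directory and the exact argument names are assumptions (they vary between `transformers` versions and are not specified in this card), so treat this as an illustration rather than the exact training script:

```
from transformers import EarlyStoppingCallback, Seq2SeqTrainingArguments

# Hypothetical training configuration mirroring the hyperparameters above.
training_args = Seq2SeqTrainingArguments(
    output_dir="./codet5p-google-style-docstrings",  # placeholder path
    per_device_train_batch_size=16,   # BATCH_SIZE
    gradient_accumulation_steps=4,    # GRADIENT_ACCUMULATION_STEPS
    num_train_epochs=35,              # NUM_EPOCHS
    learning_rate=3e-5,               # LEARNING_RATE
    weight_decay=0.01,                # WEIGHT_DECAY
    optim="adafactor",                # OPTIMIZER
    lr_scheduler_type="linear",       # LR_SCHEDULER
    eval_strategy="epoch",            # "evaluation_strategy" on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,      # required for early stopping
    metric_for_best_model="eval_loss",
    predict_with_generate=True,
)

# EARLY_STOPPING_PATIENCE = 2 is applied through a Trainer callback.
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)
```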
# Loss

On the 35th epoch, the model achieved the following loss:

| Epoch | Training Loss | Validation Loss |
| ----- | ------------- | --------------- |
| 35    | 0.894800      | 1.268536        |

# BLEU and ROUGE Scores

| BLEU  | ROUGE-1 | ROUGE-2 | ROUGE-L |
| ----- | ------- | ------- | ------- |
| 35.40 | 58.55   | 39.46   | 52.43   |
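The metadata lists BLEU and ROUGE as the evaluation metrics but does not state which implementations were used. As a rough sketch, comparable scores could be computed with the `evaluate` library (sacrebleu is an assumption, and the prediction and reference lists below are placeholders):

```
import evaluate

# Placeholder generated docstrings and ground-truth references.
predictions = ["Calculate the sum of two numbers."]
references = ["Calculates the sum of two numbers."]

bleu = evaluate.load("sacrebleu")  # one common BLEU implementation (assumption)
rouge = evaluate.load("rouge")

bleu_result = bleu.compute(predictions=predictions, references=[[r] for r in references])
rouge_result = rouge.compute(predictions=predictions, references=references)

print(bleu_result["score"])  # corpus-level BLEU
print(rouge_result["rouge1"], rouge_result["rouge2"], rouge_result["rougeL"])
```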