---
datasets:
- Mir-2002/python-google-style-docstrings
language:
- en
metrics:
- bleu
- rouge
base_model:
- Salesforce/codet5p-220m-bimodal
pipeline_tag: summarization
tags:
- code
---

# Overview

This is a fine-tuned CodeT5+ (220m) bimodal model trained on a dataset of 59,000 Python code-docstring pairs. The docstrings are in Google style format.
A Google-style docstring is formatted as follows:
```
<Description of the code>

Args:
<var1> (<data-type>) : <description of var1>
<var2> (<data-type>) : <description of var2>

Returns:
<data-type> : <description of the return value>

Raises:
<ExceptionType> : <description of when it is raised>
```
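
As a concrete, purely illustrative example (not taken from the dataset), a function documented in this style might look like:

```python
def divide(numerator, denominator):
    """Divides one number by another.

    Args:
    numerator (float): The number to be divided.
    denominator (float): The number to divide by.

    Returns:
    float: The result of the division.

    Raises:
    ZeroDivisionError: If denominator is zero.
    """
    return numerator / denominator
```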

For more information on the dataset, please see the dataset referenced in the metadata above.

You can test the model using this:

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

checkpoint = "Mir-2002/codet5p-google-style-docstrings"
device = "cuda" # or CPU

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)

input = """
def calculate_sum(a, b):
    return a + b
"""

inputs = tokenizer.encode(input, return_tensors="pt").to(device)
outputs = model.generate(
            inputs,
            max_length=128,
            num_beams=8,
            early_stopping=True,
            no_repeat_ngram_size=3,
            pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Calculate the sum of two numbers.

# Args:
# a (int): The first number.
# b (int): The second number.

```
# Fine tuning

In fine-tuning the model, I used the special token `<tdec>`. According to the CodeT5+ paper:

" Specifically, when the input is a text
sample, we prepend a [CDec] token to the input
sequence to the decoder. In this case, the decoder
operates under code generation functionality. Alternatively, when the input is a code sample, we
prepend a [TDec] token to the input sequence to
the decoder. The decoder operates under text generation functionality in this case. This type of Causal
LM has been shown to be an effective learning
objective to close the pretrain-finetune gap for generative downstream tasks"

Generally speaking, the `<tdec>` token was prepended to the target (the docstring) to signal to the decoder that it should operate in text generation mode. A sample row looks like this:

```
<s><tdec> Creates a task that to retry a previously abandoned task.

Returns:
Task: a task that was abandoned but should be retried or None if there are
no abandoned tasks that should be retried.</s>
```
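
Below is a minimal sketch of this preprocessing step, assuming a dataset with `code` and `docstring` columns (the function and column names here are illustrative, not the exact training code):

```python
from transformers import AutoTokenizer

# Tokenizer with <tdec> already added (see the script below)
tokenizer = AutoTokenizer.from_pretrained("/path/to/your/model")

def preprocess(example, max_source_length=256, max_target_length=128):
    # Tokenize the source code for the encoder
    model_inputs = tokenizer(
        example["code"],
        max_length=max_source_length,
        truncation=True,
    )

    # Prepend <tdec> to the docstring so the decoder operates in text generation mode
    labels = tokenizer(
        "<tdec> " + example["docstring"],
        max_length=max_target_length,
        truncation=True,
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```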

This tells the decoder which downstream task it is being fine-tuned on, improving the process. However, the paper doesn't clearly state whether the token is already included in the tokenizer's vocabulary. To be safe, I manually added the token to the tokenizer's vocabulary using this script:

```python
import os

from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "Salesforce/codet5p-220m-bimodal"
model_path = "/path/to/your/model"

os.makedirs(model_path, exist_ok=True)

# Load base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Add special token(s)
tokenizer.add_special_tokens({"additional_special_tokens": ["<tdec>"]})

# Resize embeddings to match new vocab size
model.resize_token_embeddings(len(tokenizer))

# Save both to a custom directory so they can be loaded during fine tuning
tokenizer.save_pretrained(model_path)
model.save_pretrained(model_path)
```

I then verified the token was added using this script:

```python
print("Token ID for <tdec>:", tokenizer.convert_tokens_to_ids("<tdec>"))
print("Tokenized form of '<tdec>':", tokenizer.tokenize("<tdec>"))

# Token ID for <tdec>: 32103
# Tokenized form of '<tdec>': ['<tdec>']
```

These scripts were run beforehand, and the modified model and tokenizer were used during fine tuning.

# Hyperparameters

MAX_SOURCE_LENGTH = 256 <br>
MAX_TARGET_LENGTH = 128 <br>
BATCH_SIZE = 16 <br>
NUM_EPOCHS = 35 <br>
LEARNING_RATE = 3e-5 <br>
GRADIENT_ACCUMULATION_STEPS = 4 <br>
EARLY_STOPPING_PATIENCE = 2 <br>
WEIGHT_DECAY = 0.01 <br>
OPTIMIZER = ADAFACTOR <br>
LR_SCHEDULER = LINEAR <br>

The model was trained via Colab Pro on an L4 GPU. Gradient accumulation steps of 4 were used to simulate an effective batch size of 64 (16 * 4).
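
For reference, here is a sketch of how these hyperparameters map onto Hugging Face `Seq2SeqTrainingArguments`. The output directory, evaluation/save strategy, and callback wiring are assumptions, not the exact training script:

```python
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

training_args = Seq2SeqTrainingArguments(
    output_dir="codet5p-google-style-docstrings",  # assumed output directory
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,      # effective batch size of 64 (16 * 4)
    num_train_epochs=35,
    learning_rate=3e-5,
    weight_decay=0.01,
    optim="adafactor",
    lr_scheduler_type="linear",
    predict_with_generate=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # required for early stopping
    metric_for_best_model="eval_loss",
)

# Early stopping with a patience of 2 evaluation rounds
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)
```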

# Loss

On the 35th epoch, the model achieved the following loss:

| Epoch | Training Loss | Validation Loss |
| ----- | ------------- | --------------- |
| 35    | 0.894800      | 1.268536        |


# BLEU and ROUGE Scores

| SacreBLEU | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --------- | ------- | ------- | ------- |
| 35.40     | 58.55   | 39.46   | 52.43   |
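
For reference, comparable corpus-level scores can be computed with the Hugging Face `evaluate` library. This is only a sketch with placeholder data, not the exact evaluation script:

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")
rouge = evaluate.load("rouge")

# Placeholder examples; in practice these are the decoded model outputs
# and the ground-truth docstrings from the test split.
predictions = ["Calculate the sum of two numbers."]
references = ["Calculates the sum of two integers."]

bleu = sacrebleu.compute(predictions=predictions, references=[[r] for r in references])
rouge_scores = rouge.compute(predictions=predictions, references=references)

print(bleu["score"])  # corpus-level SacreBLEU
print(rouge_scores["rouge1"], rouge_scores["rouge2"], rouge_scores["rougeL"])
```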

While a SacreBLEU score of 35 is only a moderate score, it is important to consider that Google-style docstrings vary widely. Some outliers contain extra sections that are not found in most of the dataset, which leads the model to generate "hallucinations". An example is this particular sample:

```
Reference:  Validate timestamp specified by request.

See `validate.request` for additional info.

Args:
stamp: str. Time request was made as ISO 8601 timestamp.
tolerance: int. Number of seconds request remains valid from timestamp.

Returns
bool: True if valid, False otherwise.
-----------------------------------------------------------------------
Prediction:  Validate timestamp.

Args:
stamp (str): A date string in the format YYYY-MM-DDThh:mm:ss.######[+-]##:##

Returns:
bool: True if valid, False otherwise.
```

As you can see, the model generated gibberish in the prediction's Args section, specifically the string format for the date.