---
datasets:
- Mir-2002/python-google-style-docstrings
language:
- en
metrics:
- bleu
- rouge
base_model:
- Salesforce/codet5p-220m-bimodal
pipeline_tag: summarization
tags:
- code
---
|
|
|
# Overview |
|
|
|
This is a CodeT5+ (220m) bimodal model fine-tuned on a dataset of 59,000 Python code-docstring pairs. The docstrings are in Google style format.
|
A Google style docstring is formatted as follows:
|
```
<Description of the code>

Args:
    <var1> (<data-type>): <description of var1>
    <var2> (<data-type>): <description of var2>

Returns:
    <var3> (<data-type>): <description of var3>

Raises:
    <ExceptionType>: <description of when it is raised>
```
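
For example, a function that divides two numbers might carry a docstring like this (an illustrative example, not taken from the dataset):

```
Divides one number by another.

Args:
    numerator (float): The number to be divided.
    denominator (float): The number to divide by.

Returns:
    float: The result of the division.

Raises:
    ZeroDivisionError: If denominator is zero.
```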
|
|
|
For more information, please see the dataset referenced above.
|
|
|
You can test the model with the following snippet:
|
|
|
```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

checkpoint = "Mir-2002/codet5p-google-style-docstrings"
device = "cuda"  # or "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)

code = """
def calculate_sum(a, b):
    return a + b
"""

inputs = tokenizer.encode(code, return_tensors="pt").to(device)
outputs = model.generate(
    inputs,
    max_length=128,
    num_beams=8,
    early_stopping=True,
    no_repeat_ngram_size=3,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Calculate the sum of two numbers.
#
# Args:
#     a (int): The first number.
#     b (int): The second number.
```
|
# Fine tuning |
|
|
|
In fine-tuning the model, I used the special token `<tdec>`. According to CodeT5+'s paper:
|
|
|
" Specifically, when the input is a text |
|
sample, we prepend a [CDec] token to the input |
|
sequence to the decoder. In this case, the decoder |
|
operates under code generation functionality. Alternatively, when the input is a code sample, we |
|
prepend a [TDec] token to the input sequence to |
|
the decoder. The decoder operates under text generation functionality in this case. This type of Causal |
|
LM has been shown to be an effective learning |
|
objective to close the pretrain-finetune gap for generative downstream tasks" |
|
|
|
Generally speaking, the `<tdec>` token was prepended to the target (the docstring) to signal to the decoder that it is operating in text generation mode. A sample row looks like this:
|
|
|
```
<s><tdec> Creates a task that to retry a previously abandoned task.

Returns:
    Task: a task that was abandoned but should be retried or None if there are
    no abandoned tasks that should be retried.</s>
```
|
|
|
This tells the decoder which downstream task it is being fine-tuned on, improving the process. However, the paper doesn't clearly state whether the token
is already included in the tokenizer's vocabulary. To be safe, I manually added the token to the tokenizer's vocabulary using this script:
|
|
|
```python
import os

from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "Salesforce/codet5p-220m-bimodal"
model_path = "/path/to/your/model"

os.makedirs(model_path, exist_ok=True)

# Load base model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Add the <tdec> special token
tokenizer.add_special_tokens({"additional_special_tokens": ["<tdec>"]})

# Resize embeddings to match the new vocab size
model.resize_token_embeddings(len(tokenizer))

# Save both to a custom directory
tokenizer.save_pretrained(model_path)
model.save_pretrained(model_path)
```
|
|
|
I then verified the token was added using this script: |
|
|
|
```python
print("Token ID for <tdec>:", tokenizer.convert_tokens_to_ids("<tdec>"))
print("Tokenized form of '<tdec>':", tokenizer.tokenize("<tdec>"))

# Token ID for <tdec>: 32103
# Tokenized form of '<tdec>': ['<tdec>']
```
|
|
|
These scripts were run beforehand, and the modified model and tokenizer were used during fine-tuning.
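
For reference, the target-side preprocessing looked roughly like this (a minimal sketch; the actual preprocessing code isn't part of this card, and the function and column names are illustrative):

```python
MAX_SOURCE_LENGTH = 256
MAX_TARGET_LENGTH = 128

def preprocess(example):
    # Tokenize the source code as the encoder input
    model_inputs = tokenizer(
        example["code"],
        max_length=MAX_SOURCE_LENGTH,
        truncation=True,
    )
    # Prepend <tdec> to the docstring so the decoder operates in
    # text generation mode; the tokenizer adds <s> and </s> itself
    labels = tokenizer(
        "<tdec> " + example["docstring"],
        max_length=MAX_TARGET_LENGTH,
        truncation=True,
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```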
|
|
|
# Hyperparameters |
|
|
|
| Hyperparameter | Value |
| ----------- | ----------- |
| MAX_SOURCE_LENGTH | 256 |
| MAX_TARGET_LENGTH | 128 |
| BATCH_SIZE | 16 |
| NUM_EPOCHS | 35 |
| LEARNING_RATE | 3e-5 |
| GRADIENT_ACCUMULATION_STEPS | 4 |
| EARLY_STOPPING_PATIENCE | 2 |
| WEIGHT_DECAY | 0.01 |
| OPTIMIZER | Adafactor |
| LR_SCHEDULER | Linear |
|
|
|
The model was trained via Colab Pro on an L4 GPU. Gradient accumulation over 4 steps was used to simulate an effective batch size of 64 (16 × 4).
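
Put together, this setup corresponds roughly to the following `Seq2SeqTrainingArguments` (a sketch under the assumption that the Hugging Face `Seq2SeqTrainer` was used; the card doesn't include the actual training script, and `train_dataset`/`eval_dataset` are assumed to be the preprocessed splits):

```python
from transformers import (
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="codet5p-google-style-docstrings",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,   # effective batch size 64
    num_train_epochs=35,
    learning_rate=3e-5,
    weight_decay=0.01,
    optim="adafactor",
    lr_scheduler_type="linear",
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
```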
|
|
|
# Loss |
|
|
|
On the 35th epoch, the model achieved the following loss: |
|
|
|
| Epoch | Training Loss | Validation Loss |
| ----------- | ----------- | ----------- |
| 35 | 0.894800 | 1.268536 |
|
|
|
|
|
# BLEU and ROUGE Scores |
|
|
|
| SacreBLEU | ROUGE-1 | ROUGE-2 | ROUGE-L |
| ----------- | ----------- | ----------- | ----------- |
| 35.40 | 58.55 | 39.46 | 52.43 |
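
Scores like these can be computed with the Hugging Face `evaluate` library (a sketch of the metric computation; the actual evaluation script isn't included in this card, and `predictions`/`references` are assumed to be lists of generated and ground-truth docstrings):

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")
rouge = evaluate.load("rouge")

# SacreBLEU expects a list of reference lists per prediction
bleu_score = sacrebleu.compute(
    predictions=predictions,
    references=[[ref] for ref in references],
)
rouge_scores = rouge.compute(predictions=predictions, references=references)

print(bleu_score["score"])  # SacreBLEU
print(rouge_scores["rouge1"], rouge_scores["rouge2"], rouge_scores["rougeL"])
```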
|
|
|
While a SacreBLEU score of 35 is only moderate, it is important to consider that Google style docstrings vary widely. Some outliers contain extra sections
that are not usually present in the rest of the dataset, which leads the model to generate "hallucinations". One example is this particular sample:
|
|
|
```
Reference: Validate timestamp specified by request.

See `validate.request` for additional info.

Args:
    stamp: str. Time request was made as ISO 8601 timestamp.
    tolerance: int. Number of seconds request remains valid from timestamp.

Returns
    bool: True if valid, False otherwise.
-----------------------------------------------------------------------
Prediction: Validate timestamp.

Args:
    stamp (str): A date string in the format YYYY-MM-DDThh:mm:ss.######[+-]##:##

Returns:
    bool: True if valid, False otherwise.
```
|
|
|
As you can see, the model generated gibberish in the prediction's Args section, specifically in the string format for the date.