Tesoro collection: datasets and models for technical debt (TD) detection in Java using comments and source code. See our GitHub page: https://github.com/NamCyan/tesoro
This model is part of the Tesoro project and is used to detect technical debt in source code. More information can be found on the Tesoro home page.
Use the code below to get started with the model.
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("NamCyan/codebert-base-technical-debt-code-tesoro")
model = AutoModelForSequenceClassification.from_pretrained("NamCyan/codebert-base-technical-debt-code-tesoro")
```
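Once the model is loaded, a typical next step is to score a piece of Java code. The following is a minimal sketch, not the authors' inference script: the helper names `label_logits` and `score_snippet`, the `max_length` of 512, and the sample snippet are assumptions made for illustration. It returns the raw logit for each label name found in the model config. How to turn logits into predictions (softmax for single-label, sigmoid with a threshold for multi-label) depends on how the checkpoint was trained, which this card does not state.

```python
from typing import Dict, Sequence

MODEL_ID = "NamCyan/codebert-base-technical-debt-code-tesoro"


def label_logits(logits: Sequence[float], id2label: Dict[int, str]) -> Dict[str, float]:
    """Pair each raw logit with the label name stored in the model config."""
    return {id2label[i]: float(score) for i, score in enumerate(logits)}


def score_snippet(java_code: str, max_length: int = 512) -> Dict[str, float]:
    """Tokenize one Java snippet and return the raw logit for every label."""
    # Imported lazily so label_logits can be used without these packages installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    model.eval()

    # Truncate long methods; max_length=512 is an assumed limit for this sketch.
    inputs = tokenizer(java_code, truncation=True, max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    return label_logits(logits.tolist(), model.config.id2label)


if __name__ == "__main__":
    # Hypothetical input, used only to show the call.
    snippet = "public int area(int w, int h) { return w * h; }"
    print(score_snippet(snippet))
```

Reading label names from `model.config.id2label` avoids hard-coding a label set that this card does not list.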
Training data: The model is fine-tuned on the tesoro-code dataset.
Infrastructure: Training was conducted on two NVIDIA A100 GPUs with 80 GB of VRAM.
| Model | Model size | EM | F1 |
|---|---|---|---|
| **Encoder-based PLMs** | | | |
| CodeBERT | 125M | 38.28 | 43.47 |
| UniXCoder | 125M | 38.12 | 42.58 |
| GraphCodeBERT | 125M | 39.38 | 44.21 |
| RoBERTa | 125M | 35.37 | 38.22 |
| ALBERT | 11.8M | 39.32 | 41.99 |
| **Encoder-Decoder-based PLMs** | | | |
| PLBART | 140M | 36.85 | 39.90 |
| CodeT5 | 220M | 32.66 | 35.41 |
| CodeT5+ | 220M | 37.91 | 41.96 |
| **Decoder-based PLMs (LLMs)** | | | |
| TinyLlama | 1.03B | 37.05 | 40.05 |
| DeepSeek-Coder | 1.28B | 42.52 | 46.19 |
| OpenCodeInterpreter | 1.35B | 38.16 | 41.76 |
| phi-2 | 2.78B | 37.92 | 41.57 |
| starcoder2 | 3.03B | 35.37 | 41.77 |
| CodeLlama | 6.74B | 34.14 | 38.16 |
| Magicoder | 6.74B | 39.14 | 42.49 |
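
The table reports EM and F1 without defining them. As a rough illustration only, the sketch below assumes that each sample carries a set of technical-debt labels, that EM is the percentage of samples whose predicted set matches the gold set exactly, and that F1 is macro-averaged over labels. These definitions, the function names, and the label names in the example are assumptions; the project's own evaluation may differ.

```python
from typing import List, Set


def exact_match(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Percentage of samples whose predicted label set equals the gold set."""
    hits = sum(1 for g, p in zip(gold, pred) if g == p)
    return 100.0 * hits / len(gold)


def macro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Unweighted mean of per-label F1 scores, as a percentage."""
    labels = sorted(set().union(*gold, *pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return 100.0 * sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical label names, used only to exercise the functions.
    gold = [{"DESIGN"}, {"DEFECT", "DESIGN"}, set()]
    pred = [{"DESIGN"}, {"DESIGN"}, set()]
    print(exact_match(gold, pred), macro_f1(gold, pred))
```

Under these assumptions, a prediction that finds only one of two gold labels counts as an EM miss but still contributes to F1, which is one reason the two columns can rank models differently.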
@article{nam2024tesoro,
title={Improving the detection of technical debt in Java source code with an enriched dataset},
author={Hai, Nam Le and Bui, Anh M. T. and Nguyen, Phuong T. and Ruscio, Davide Di and Kazman, Rick},
journal={},
year={2024}
}
Base model: microsoft/codebert-base