Impact of Padding on DNABERT Model Performance
Hi,
I'm working with a DNABERT-2 model and have a question about how padding affects its output. I tokenized a DNA sequence and compared the model's output for the original tokenized input against the same input with two pad tokens appended (masked out via the attention mask).
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, BertConfig

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
config = BertConfig.from_pretrained("zhihan1996/DNABERT-2-117M")
dnabert_model = AutoModel.from_config(config)
dnabert_model.eval()  # disable dropout so the two forward passes are comparable

dna = "CGTGGTTTCCTGTGGTTGGAATT"
inputs = tokenizer(dna, return_tensors="pt")["input_ids"]

with torch.no_grad():
    # append two pad tokens and mask them out in the attention mask
    padded_input = F.pad(inputs, (0, 2), value=tokenizer.pad_token_id)
    attention_mask = torch.cat([torch.ones_like(inputs), torch.zeros(1, 2, dtype=torch.long)], dim=1)

    # take the [CLS] hidden state with and without padding
    nonpad_hidden_states = dnabert_model(inputs)[0][:, 0]
    pad_hidden_states = dnabert_model(padded_input, attention_mask=attention_mask)[0][:, 0]

print(nonpad_hidden_states.squeeze()[0:10])
print(pad_hidden_states.squeeze()[0:10])
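To put a number on the gap rather than just eyeballing the printed slices, I also compare the two [CLS] vectors directly. This is a small sketch of my own on top of the code above; diff_norm and cos_sim are just names I picked:

# Quantify how far apart the two [CLS] embeddings are
diff_norm = (nonpad_hidden_states - pad_hidden_states).norm()
cos_sim = F.cosine_similarity(nonpad_hidden_states, pad_hidden_states)
print(f"L2 distance: {diff_norm.item():.4f}, cosine similarity: {cos_sim.item():.4f}")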
The hidden states for the non-padded and padded inputs come out noticeably different, even though the pad positions are masked out. I'm wondering whether padding is expected to affect the model's output in this way, or whether I'm misunderstanding how DNABERT handles padded tokens.
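For reference, the alternative I was considering is to let the tokenizer build the padding and attention mask itself rather than padding by hand. A minimal sketch, assuming this tokenizer follows the standard Hugging Face padding API (padding="max_length" plus the returned attention_mask); enc and tok_pad_hidden_states are just names I use here:

# Sketch: let the tokenizer pad the sequence and produce the attention mask
enc = tokenizer(dna, return_tensors="pt", padding="max_length", max_length=inputs.shape[1] + 2)
with torch.no_grad():
    tok_pad_hidden_states = dnabert_model(enc["input_ids"], attention_mask=enc["attention_mask"])[0][:, 0]
print(tok_pad_hidden_states.squeeze()[0:10])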
Can anyone tell me whether this behavior is expected, and what the recommended practice is for handling padding with DNABERT?
Thank you!