Part of the BiRNA-BERT collection, which contains BiRNA-BERT, its tokenizer, and related ablation-study models.
buetnlpbio/nuc-only-rna-bert is trained on nucleotide tokens only (for ablation). Please consider using buetnlpbio/birna-bert instead.
BiRNA-BERT is a BERT-style transformer encoder that generates embeddings for RNA sequences. It has been trained on both BPE tokens and individual nucleotides, so it can produce granular nucleotide-level embeddings as well as efficient sequence-level embeddings (using BPE).
BiRNA-BERT was trained using the MosaicBERT framework (https://huggingface.co/mosaicml/mosaic-bert-base).
import torch
import transformers
from transformers import AutoModelForMaskedLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("buetnlpbio/birna-tokenizer")
config = transformers.BertConfig.from_pretrained("buetnlpbio/nuc-only-rna-bert")
mysterybert = AutoModelForMaskedLM.from_pretrained("buetnlpbio/nuc-only-rna-bert", config=config, trust_remote_code=True)
mysterybert.cls = torch.nn.Identity()  # replace the MLM head so .logits returns the encoder's token embeddings
# To get nucleotide embeddings
char_embed = mysterybert(**tokenizer("A G C T A C G T A C G T", return_tensors="pt"))
print(char_embed.logits.shape) # CLS + 12 nucleotide token embeddings + SEP
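Since this ablation model was trained on nucleotide tokens only, the sequence-level BPE embeddings described above come from buetnlpbio/birna-bert. The sketch below continues the snippet above and assumes that model follows the same loading pattern, that the shared tokenizer applies its BPE merges when the sequence is passed without spaces, and that mean pooling is one illustrative (not prescribed) way to obtain a single vector per sequence.
# To get sequence-level (BPE) embeddings with the full BiRNA-BERT model
bpe_config = transformers.BertConfig.from_pretrained("buetnlpbio/birna-bert")
birnabert = AutoModelForMaskedLM.from_pretrained("buetnlpbio/birna-bert", config=bpe_config, trust_remote_code=True)
birnabert.cls = torch.nn.Identity()  # expose encoder embeddings instead of MLM logits

# Without separating spaces, the BPE tokenizer merges nucleotides into longer subword tokens
bpe_embed = birnabert(**tokenizer("AGCTACGTACGT", return_tensors="pt"))
print(bpe_embed.logits.shape)  # CLS + BPE token embeddings + SEP (fewer than 12 tokens)

# One fixed-size vector per sequence via mean pooling over the token dimension (illustrative choice)
seq_embedding = bpe_embed.logits.mean(dim=1)
print(seq_embedding.shape)  # (1, hidden_size)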