---
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
- biology
- genomics
- long-context
---

# GENERator-eukaryote-3b-base model

## **Important Notice**

If you are using **GENERator** for sequence generation, please ensure that the length of each input sequence is a multiple of **6**. This can be achieved by either:

1. Padding the sequence on the left with `'A'` (**left padding**);
2. Truncating the sequence from the left (**left truncation**).

This requirement arises because **GENERator** employs a 6-mer tokenizer. If the input sequence length is not a multiple of **6**, the tokenizer appends an out-of-vocabulary token to the end of the token sequence, which can lead to uninformative generations such as repeated `'AAAAAA'`. We apologize for any inconvenience this may cause and recommend following the guidelines above to obtain accurate and meaningful generation results.

## About

In this repository, we present GENERator, a generative genomic foundation model with a context length of 98k base pairs and 3B parameters, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA. The extensive and diverse pre-training data endow GENERator with enhanced understanding and generation capabilities across a wide range of organisms.

For more technical details, please refer to our paper [GENERator: A Long-Context Generative Genomic Foundation Model](https://arxiv.org/abs/2502.07272). The code and implementation details are available on GitHub: [https://github.com/GenerTeam/GENERator](https://github.com/GenerTeam/GENERator).

## How to use

### Simple example 1: generation

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")
config = model.config
max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

def left_padding(sequence, padding_char='A', multiple=6):
    # Pad the sequence on the left with `padding_char` until its length is a multiple of `multiple`.
    remainder = len(sequence) % multiple
    if remainder != 0:
        padding_length = multiple - remainder
        return padding_char * padding_length + sequence
    return sequence

def left_truncation(sequence, multiple=6):
    # Drop leading characters until the sequence length is a multiple of `multiple`.
    remainder = len(sequence) % multiple
    if remainder != 0:
        return sequence[remainder:]
    return sequence

# Apply left_padding to all sequences
# padded_sequences = [left_padding(seq) for seq in sequences]

# Apply left_truncation to all sequences
truncated_sequences = [left_truncation(seq) for seq in sequences]

# Prepend the BOS token to each sequence
sequences = [tokenizer.bos_token + sequence for sequence in truncated_sequences]

# Tokenize the sequences
tokenizer.padding_side = "left"
inputs = tokenizer(
    sequences,
    add_special_tokens=False,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Generate the sequences
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=32, temperature=0.00001, top_k=1)

# Decode the generated sequences
decoded_sequences = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Print the decoded sequences
print(decoded_sequences)

# Expect nonsensical decoded sequences (e.g., 'AAAAAA'):
# the input sequences are too short to provide sufficient context.
```
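Note that the near-zero `temperature` combined with `top_k=1` in the call above makes decoding effectively greedy. If you want more diverse continuations, you can pass the standard sampling arguments of `generate` instead. The snippet below is a minimal sketch that reuses `model`, `tokenizer`, and `inputs` from the example above; the parameter values are illustrative, not tuned recommendations for this model.

```python
# Sampled generation: a minimal sketch reusing `model`, `tokenizer`, and `inputs`
# from the example above. The sampling parameters are illustrative only.
with torch.inference_mode():
    sampled_outputs = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=True,   # sample from the distribution instead of (near-)greedy decoding
        temperature=0.8,  # soften the next-token distribution
        top_k=50,         # keep only the 50 most likely tokens at each step
        top_p=0.95,       # nucleus sampling
    )

print(tokenizer.batch_decode(sampled_outputs, skip_special_tokens=True))
```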
### Simple example 2: embedding

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GenerTeam/GENERator-eukaryote-3b-base")
config = model.config
max_length = config.max_position_embeddings

# Define input sequences.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Tokenize the sequences with add_special_tokens=True so that special tokens,
# such as the BOS and EOS tokens, are added automatically at the appropriate positions.
tokenizer.padding_side = "right"
inputs = tokenizer(
    sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Perform a forward pass through the model to obtain the outputs, including hidden states.
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)

# Retrieve the hidden states from the last layer.
hidden_states = outputs.hidden_states[-1]  # Shape: (batch_size, sequence_length, hidden_size)

# Use the attention_mask to determine the index of the last token in each sequence.
# Since add_special_tokens=True is used, the last token is typically the EOS token.
attention_mask = inputs["attention_mask"]
last_token_indices = attention_mask.sum(dim=1) - 1  # Index of the last token for each sequence

# Extract the embedding corresponding to the EOS token for each sequence.
seq_embeddings = []
for i, token_index in enumerate(last_token_indices):
    # Fetch the embedding for the last token (EOS token).
    seq_embedding = hidden_states[i, token_index, :]
    seq_embeddings.append(seq_embedding)

# Stack the embeddings into a tensor with shape (batch_size, hidden_size)
seq_embeddings = torch.stack(seq_embeddings)

print("Sequence Embeddings:", seq_embeddings)
```

## Citation

```
@misc{wu2025generator,
      title={GENERator: A Long-Context Generative Genomic Foundation Model},
      author={Wei Wu and Qiuyi Li and Mingyang Li and Kun Fu and Fuli Feng and Jieping Ye and Hui Xiong and Zheng Wang},
      year={2025},
      eprint={2502.07272},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07272},
}
```