GenerTeam/GENERator-eukaryote-3b-base

GenerTeam committed on Jun 8

Commit 5615914 · verified · 1 Parent(s): 9e889c1

Update README.md

Files changed (1)

README.md +10 -0

@@ -10,6 +10,16 @@ tags:
 # GENERator-eukaryote-3b-base model
+## **Important Notice**
+If you are using **GENERator** for sequence generation, please ensure that the length of each input sequence is a multiple of **6**. This can be achieved by either:
+1. Padding the sequence on the left with `'A'` (**left padding**);
+2. Truncating the sequence from the left (**left truncation**).
+This requirement arises because **GENERator** employs a 6-mer tokenizer. If the input sequence length is not a multiple of **6**, the tokenizer will append an `'<oov>'` (out-of-vocabulary) token to the end of the token sequence. This can result in uninformative subsequent generations, such as repeated `'AAAAAA'`.
+We apologize for any inconvenience this may cause and recommend adhering to the above guidelines to ensure accurate and meaningful generation results.
 ## Abouts
 In this repository, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs and 3B parameters, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA. The extensive and diverse pre-training data endow the GENERator with enhanced understanding and generation capabilities across various organisms.
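
The two options in the notice added by this commit (left padding with `'A'` and left truncation) can be sketched in plain Python as follows. This is a minimal illustration, not code from the repository: the helper names `left_pad` and `left_truncate` and the constant `K` are hypothetical, and the sketch only adjusts the input string before it is passed to the tokenizer.

```python
K = 6  # k-mer size of the tokenizer, as stated in the notice


def left_pad(seq: str, k: int = K, pad_char: str = "A") -> str:
    """Prepend pad_char until len(seq) is a multiple of k (hypothetical helper)."""
    remainder = len(seq) % k
    if remainder == 0:
        return seq
    return pad_char * (k - remainder) + seq


def left_truncate(seq: str, k: int = K) -> str:
    """Drop leading bases until len(seq) is a multiple of k (hypothetical helper)."""
    remainder = len(seq) % k
    return seq[remainder:]


if __name__ == "__main__":
    prompt = "ACGTACG"  # 7 bases, not a multiple of 6
    print(left_pad(prompt))       # 12 bases after padding
    print(left_truncate(prompt))  # 6 bases after truncation
```

Both helpers change only the left end of the sequence, so the bases closest to the generation point stay intact. Truncation discards up to five leading bases, while padding keeps every base at the cost of a few artificial `'A'` characters at the start.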