Update README.md
README.md
CHANGED
@@ -10,6 +10,16 @@ tags:
# GENERator-eukaryote-3b-base model

## **Important Notice**

If you are using **GENERator** for sequence generation, please ensure that the length of each input sequence is a multiple of **6**. This can be achieved by either:

1. Padding the sequence on the left with `'A'` (**left padding**);
2. Truncating the sequence from the left (**left truncation**).
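
The two options above can be sketched in Python as follows. This is a minimal illustration rather than code from the GENERator repository: the helper names `left_pad_to_multiple` and `left_truncate_to_multiple` are hypothetical, and the sketch assumes each sequence is a plain Python string.

```python
def left_pad_to_multiple(seq: str, k: int = 6, pad_char: str = "A") -> str:
    """Left-pad `seq` with `pad_char` until its length is a multiple of `k`.

    Hypothetical helper for illustration; not part of the GENERator codebase.
    """
    remainder = len(seq) % k
    if remainder == 0:
        return seq  # already aligned, nothing to add
    return pad_char * (k - remainder) + seq


def left_truncate_to_multiple(seq: str, k: int = 6) -> str:
    """Drop leading characters of `seq` until its length is a multiple of `k`.

    Hypothetical helper for illustration; not part of the GENERator codebase.
    """
    remainder = len(seq) % k
    return seq[remainder:]  # removes 0 to k-1 characters from the left


if __name__ == "__main__":
    example = "ACGTACGT"  # length 8, not a multiple of 6
    print(left_pad_to_multiple(example))       # length becomes 12
    print(left_truncate_to_multiple(example))  # length becomes 6
```

Either helper would be applied to each sequence before tokenization. Left padding keeps every input base, while left truncation discards up to five bases from the start of the sequence.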

This requirement arises because **GENERator** uses a 6-mer tokenizer. If the input sequence length is not a multiple of **6**, the tokenizer appends an `'<oov>'` (out-of-vocabulary) token to the end of the token sequence. The model may then produce uninformative output after that token, such as repeated `'AAAAAA'`.
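
To make the length requirement concrete, here is a toy illustration of 6-mer chunking. It is not the actual GENERator tokenizer: `toy_kmer_tokenize` is a hypothetical stand-in that assumes non-overlapping 6-mers and only mimics the behavior described above, where a trailing partial chunk becomes `'<oov>'`.

```python
from typing import List


def toy_kmer_tokenize(seq: str, k: int = 6, oov: str = "<oov>") -> List[str]:
    """Split `seq` into non-overlapping k-mers (toy model, assumption only).

    Any leftover characters at the end that do not fill a whole k-mer
    are represented by a single `oov` token.
    """
    full = len(seq) - len(seq) % k  # length covered by complete k-mers
    tokens = [seq[i:i + k] for i in range(0, full, k)]
    if full < len(seq):
        tokens.append(oov)  # trailing partial chunk
    return tokens


if __name__ == "__main__":
    print(toy_kmer_tokenize("ACGTACGTACGT"))  # length 12: two complete 6-mers
    print(toy_kmer_tokenize("ACGTACGTAC"))    # length 10: one 6-mer, then the oov token
```

In this toy model, a sequence whose length is a multiple of 6 never ends in `'<oov>'`, which is the condition the notice asks for.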

We apologize for any inconvenience this may cause and recommend adhering to the above guidelines to ensure accurate and meaningful generation results.

## About

In this repository, we present GENERator, a generative genomic foundation model with a context length of 98k base pairs and 3B parameters, trained on a dataset of 386 billion base pairs of eukaryotic DNA. The extensive and diverse pre-training data give GENERator enhanced understanding and generation capabilities across various organisms.