GenerTeam committed on
Commit 5615914 · verified · 1 Parent(s): 9e889c1

Update README.md

Files changed (1)
README.md +10 -0
README.md CHANGED
@@ -10,6 +10,16 @@ tags:

# GENERator-eukaryote-3b-base model

+ ## **Important Notice**
+ If you are using **GENERator** for sequence generation, please ensure that the length of each input sequence is a multiple of **6**. This can be achieved by either:
+ 1. Padding the sequence on the left with `'A'` (**left padding**);
+ 2. Truncating the sequence from the left (**left truncation**).
+
+ This requirement arises because **GENERator** employs a 6-mer tokenizer. If the input sequence length is not a multiple of **6**, the tokenizer will append an `'<oov>'` (out-of-vocabulary) token to the end of the token sequence. This can result in uninformative subsequent generations, such as repeated `'AAAAAA'`.
+
+ We apologize for any inconvenience this may cause and recommend adhering to the above guidelines to ensure accurate and meaningful generation results.
+
+
## Abouts
In this repository, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs and 3B parameters, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA. The extensive and diverse pre-training data endow the GENERator with enhanced understanding and generation capabilities across various organisms.
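
For reference, the length rule described in the added notice is straightforward to enforce before tokenization. The sketch below is not part of this commit: `pad_to_multiple_of_six` is a hypothetical helper name and the example sequence is illustrative; it only applies the two options from the notice (left-pad with `'A'`, or left-truncate) so the input length becomes a multiple of 6.

```python
def pad_to_multiple_of_six(sequence: str, mode: str = "pad") -> str:
    """Adjust a DNA sequence so its length is a multiple of 6 (for a 6-mer tokenizer).

    mode="pad": left-pad with 'A' (option 1 in the notice).
    mode="truncate": drop bases from the left (option 2 in the notice).
    """
    remainder = len(sequence) % 6
    if remainder == 0:
        return sequence
    if mode == "pad":
        return "A" * (6 - remainder) + sequence
    if mode == "truncate":
        return sequence[remainder:]
    raise ValueError("mode must be 'pad' or 'truncate'")


# A 16-base sequence becomes 18 bases when left-padded, or 12 bases when left-truncated.
seq = "ACGTACGTACGTACGT"
print(pad_to_multiple_of_six(seq, "pad"))       # AAACGTACGTACGTACGT (length 18)
print(pad_to_multiple_of_six(seq, "truncate"))  # ACGTACGTACGT (length 12)
```

Either option keeps the 6-mer frame intact; padding preserves the full original sequence, while truncation avoids introducing bases that were not in the input.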