banglagov committed on
Commit 11547fc · verified · 1 Parent(s): 5e72a6f

Update README.md

Files changed (1)
  1. README.md +0 -10
README.md CHANGED
@@ -30,16 +30,6 @@ We used 36 GB of text data to train the model. The used corpus has the following
 | Total sentences | 181,447,732 (about 181.45 million) |
 | Total documents | 17,516,890 (about 17.52 million) |
 
-The raw crawled text is pre-processed in several steps to produce the final 36 GB of data. The pre-processing contains the following steps:
-
-- Normalization of text
-- Cleaning text Text cleaning removes URLs, HTML tags, emojis, and multiple spaces.
-- Splitting the text into sentences
-- Removing sentences with fewer than 3 words or more than 50 words.
-- Removing sentences containing any non-Bangla characters.
-- Deduplicating the corpus at the document level.
-- Ensuring each text file contains one sentence per line, with each document separated by a blank line.
-
 
 ## Model Details
 
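
For reviewers who want a concrete picture of the pre-processing steps that this commit removes from the README, here is a minimal Python sketch of such a pipeline. It is not the authors' implementation. The function names (`clean_text`, `keep_sentence`, `preprocess`), the choice of NFC normalization, the rough emoji ranges, the sentence-splitting rule, and the exact definition of "Bangla characters" (Bengali Unicode block, danda, zero-width joiners, whitespace, and common punctuation) are all assumptions made for illustration.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
# Rough emoji ranges (assumption, not an exhaustive list).
EMOJI_RE = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F]")
SPACE_RE = re.compile(r"\s+")
# Assumed splitting rule: break after the danda, '?' or '!'.
SENT_SPLIT_RE = re.compile(r"(?<=[\u0964?!])\s+")
# Assumed definition of "Bangla only": Bengali block, danda, double danda,
# ZWNJ/ZWJ, whitespace, and common punctuation.
BANGLA_ONLY_RE = re.compile(r"[\u0980-\u09FF\u0964\u0965\u200C\u200D\s.,;:?!'\"()\-]+")


def clean_text(text):
    """Normalize, then strip URLs, HTML tags, emojis, and repeated spaces."""
    text = unicodedata.normalize("NFC", text)  # NFC is an assumption
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def keep_sentence(sentence, min_words=3, max_words=50):
    """Keep sentences of 3 to 50 words made only of the allowed characters."""
    n_words = len(sentence.split())
    if not (min_words <= n_words <= max_words):
        return False
    return BANGLA_ONLY_RE.fullmatch(sentence) is not None


def preprocess(documents):
    """Return one sentence per line, with documents separated by a blank line."""
    seen = set()
    blocks = []
    for doc in documents:
        sentences = [s for s in SENT_SPLIT_RE.split(clean_text(doc)) if keep_sentence(s)]
        block = "\n".join(sentences)
        # Document-level deduplication on the processed text.
        if not block or block in seen:
            continue
        seen.add(block)
        blocks.append(block)
    return "\n\n".join(blocks) + "\n"
```

Calling `preprocess(list_of_raw_documents)` returns a single string in the one-sentence-per-line layout described in the removed text. Exact-match deduplication is the simplest choice here; a real corpus of this size would more likely need hashing or near-duplicate detection, which the README does not specify.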