Update README.md
README.md CHANGED
@@ -30,16 +30,6 @@ We used 36 GB of text data to train the model. The corpus used has the following
 | Total sentences | 181,447,732 (about 181.45 million) |
 | Total documents | 17,516,890 (about 17.52 million) |
 
-The raw crawled text is pre-processed in several steps to produce the final 36 GB of data. The pre-processing consists of the following steps:
-
-- Normalization of text
-- Cleaning text. Text cleaning removes URLs, HTML tags, emojis, and multiple spaces.
-- Splitting the text into sentences
-- Removing sentences with fewer than 3 words or more than 50 words.
-- Removing sentences containing any non-Bangla characters.
-- Deduplicating the corpus at the document level.
-- Ensuring each text file contains one sentence per line, with each document separated by a blank line.
-
 
 ## Model Details
 
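The removed pre-processing section outlines a complete cleaning pipeline. The sketch below shows how those seven steps could fit together. It is a minimal illustration, assuming regex-based cleaning, whitespace word counting, danda-based sentence splitting, and the Bangla Unicode block (U+0980-U+09FF) as the "Bangla characters" test; none of these choices come from the README, and all names are illustrative.

```python
# A sketch of the described pipeline, not the released code: the regexes,
# the danda-based splitter, and the U+0980-U+09FF test for "Bangla
# characters" are all assumptions; the README does not specify them.
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<[^>]+>")
EMOJI_RE = re.compile("[\U0001F000-\U0001FAFF\U00002600-\U000027BF]")
SPACE_RE = re.compile(r"\s+")
# Anything outside the Bangla block, whitespace, or the danda marks counts
# as "non-Bangla" here (an assumption about the filter's definition).
NON_BANGLA_RE = re.compile(r"[^\u0980-\u09FF\s\u0964\u0965]")


def clean(text: str) -> str:
    """Normalize, then strip URLs, HTML tags, emojis, and multiple spaces."""
    text = unicodedata.normalize("NFC", text)
    for pattern in (URL_RE, HTML_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split on the danda (U+0964) plus ?/!, a deliberately simple splitter."""
    return [s.strip() for s in re.split(r"[\u0964?!]+", text) if s.strip()]


def keep(sentence: str) -> bool:
    """Keep sentences of 3-50 words containing only Bangla characters."""
    return 3 <= len(sentence.split()) <= 50 and not NON_BANGLA_RE.search(sentence)


def preprocess(raw_docs: list[str]) -> str:
    """Emit one sentence per line, documents separated by a blank line,
    dropping exact duplicates at the document level."""
    seen: set[str] = set()
    blocks: list[str] = []
    for doc in raw_docs:
        body = "\n".join(s for s in split_sentences(clean(doc)) if keep(s))
        if body and body not in seen:
            seen.add(body)
            blocks.append(body)
    return "\n\n".join(blocks)
```

Note that deduplication in this sketch is exact and document-level, matching the wording of the removed list; near-duplicate filtering (e.g. MinHash) would be an additional step the README does not describe.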