Update README.md
README.md CHANGED
@@ -30,16 +30,6 @@ We used 36 GB of text data to train the model. The corpus used has the following
 | Total sentences | 181,447,732 (about 181.45 million) |
 | Total documents | 17,516,890 (about 17.52 million) |
 
-The raw crawled text is pre-processed in several steps to produce the final 36 GB of data. The pre-processing consists of the following steps:
-
-- Normalization of text
-- Cleaning text. Text cleaning removes URLs, HTML tags, emojis, and multiple spaces.
-- Splitting the text into sentences
-- Removing sentences with fewer than 3 words or more than 50 words.
-- Removing sentences containing any non-Bangla characters.
-- Deduplicating the corpus at the document level.
-- Ensuring each text file contains one sentence per line, with each document separated by a blank line.
-
 
 ## Model Details
 
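The removed pre-processing section outlines a complete cleaning pipeline. The sketch below shows how those seven steps could fit together. It is a minimal illustration, assuming regex-based cleaning, whitespace word counting, danda-based sentence splitting, and the Bangla Unicode block (U+0980-U+09FF) as the "Bangla characters" test; none of these choices come from the README, and all names are illustrative.

```python
# A sketch of the described pipeline, not the released code: the regexes,
# the danda-based splitter, and the U+0980-U+09FF test for "Bangla
# characters" are all assumptions; the README does not specify them.
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<[^>]+>")
EMOJI_RE = re.compile("[\U0001F000-\U0001FAFF\U00002600-\U000027BF]")
SPACE_RE = re.compile(r"\s+")
# Anything outside the Bangla block, whitespace, or the danda marks counts
# as "non-Bangla" here (an assumption about the filter's definition).
NON_BANGLA_RE = re.compile(r"[^\u0980-\u09FF\s\u0964\u0965]")


def clean(text: str) -> str:
    """Normalize, then strip URLs, HTML tags, emojis, and multiple spaces."""
    text = unicodedata.normalize("NFC", text)
    for pattern in (URL_RE, HTML_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split on the danda (U+0964) plus ?/!, a deliberately simple splitter."""
    return [s.strip() for s in re.split(r"[\u0964?!]+", text) if s.strip()]


def keep(sentence: str) -> bool:
    """Keep sentences of 3-50 words containing only Bangla characters."""
    return 3 <= len(sentence.split()) <= 50 and not NON_BANGLA_RE.search(sentence)


def preprocess(raw_docs: list[str]) -> str:
    """Emit one sentence per line, documents separated by a blank line,
    dropping exact duplicates at the document level."""
    seen: set[str] = set()
    blocks: list[str] = []
    for doc in raw_docs:
        body = "\n".join(s for s in split_sentences(clean(doc)) if keep(s))
        if body and body not in seen:
            seen.add(body)
            blocks.append(body)
    return "\n\n".join(blocks)
```

Note that deduplication in this sketch is exact and document-level, matching the wording of the removed list; near-duplicate filtering (e.g. MinHash) would be an additional step the README does not describe.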