herooooooooo committed on
Commit 6ec1278 · verified · 1 Parent(s): 7507f04

Update README.md

  1. README.md +61 -7
README.md CHANGED
@@ -1,15 +1,69 @@
  ---
  license: cc-by-4.0
  datasets:
- - kenhktsui/math-classifiers-data
+ - kenhktsui/math-classifiers-data
  language:
- - en
+ - en
  metrics:
- - accuracy
- - recall
- - precision
+ - accuracy
+ - recall
+ - precision
  base_model:
- - facebook/fasttext-en-vectors
+ - facebook/fasttext-en-vectors
  pipeline_tag: text-classification
  library_name: fasttext
- ---
+ ---
+ # Model Card for FastText Math vs. Non-Math Classifier
+
+ A FastText-based binary classifier trained to distinguish “math” text from “non-math” text in English web pages. It is fine-tuned on the `kenhktsui/math-classifiers-data` dataset using `facebook/fasttext-en-vectors` as the base word-embedding model.
+
+ ---
+
+ ## Model Details
+
+ ### Overview
+
+ This model takes raw English text (for example, the plain-text extraction of an HTML page) and predicts whether the content is math-related (label `__label__math`) or not (label `__label__non-math`). It was developed by user **herooooooooo** and is released under the CC-BY-4.0 license.
+
+ - **Model type:** Supervised FastText classifier (binary classification)
+ - **Developed by:** herooooooooo
+ - **License:** CC-BY-4.0
+ - **Language:** English (en)
+ - **Base model:** `facebook/fasttext-en-vectors` (pretrained word vectors)
+ - **Fine-tuned on:** `kenhktsui/math-classifiers-data` (a public Hugging Face dataset of labeled math vs. non-math examples)
+
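To make the two labels concrete: the fastText Python bindings return a prediction as a pair of labels and probabilities. The following is a minimal sketch of turning such a pair into a yes/no decision, assuming the label strings listed above. The helper name `is_math` and the 0.5 threshold are illustrative assumptions, not part of the model card, and the example input is hand-written rather than real model output.

```python
from typing import Sequence, Tuple

MATH_LABEL = "__label__math"  # label name taken from the model card


def is_math(prediction: Tuple[Sequence[str], Sequence[float]],
            threshold: float = 0.5) -> bool:
    """Return True when the top label is the math label with enough confidence.

    `prediction` has the shape returned by fastText's `model.predict(text, k=1)`:
    a sequence of labels and a parallel sequence of probabilities.
    The helper name and the default threshold are illustrative assumptions.
    """
    labels, probs = prediction
    if len(labels) == 0:
        return False
    return labels[0] == MATH_LABEL and float(probs[0]) >= threshold


# Hand-written example shaped like a fastText prediction (not real model output).
example = (("__label__math",), [0.93])
print(is_math(example))
```

With a loaded model, the call would look like `is_math(model.predict(text, k=1))`, where `text` is a single line without newline characters. Raising the threshold trades recall for precision when filtering a crawl.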
+ ### Intended Use
+
+ - **Primary application:** Filtering or labeling large corpora of web pages or documents for math content (e.g., selecting only math-related pages from web crawls).
+ - **Foreseeable users:** Researchers preparing math-focused corpora, data engineers curating domain-specific text, or educators building math content pipelines.
+ - **Out-of-scope:**
+   - Not intended for general topic classification beyond “math vs. non-math.”
+   - Performance may degrade on extremely short texts (fewer than ~20 tokens) or on highly technical subdomains not well represented in the training set (e.g., very specialized LaTeX macros not covered by the dataset).
+   - Should not be used in any safety- or compliance-critical pipeline without additional validation.
+
+ ---
+
+ ## Bias, Risks, and Limitations
+
+ - **Biases:**
+   - The model is trained on the `kenhktsui/math-classifiers-data` dataset, which predominantly contains English posts from math forums and random English web text. It may underperform on English varieties from outside North America and Europe (e.g., Indian English math blogs) if they were underrepresented.
+   - The classifier can mislabel “math-adjacent” text (e.g., computer science blogs discussing algorithms, physics pages dense with formulas) as “non-math” if the training set did not include similar examples.
+
+ - **Technical limitations:**
+   - Because FastText uses a bag-of-words representation with n-gram features, it does not capture long-range dependencies or broader context. Very subtle math content (e.g., a single embedded formula in an otherwise non-math article) may be missed.
+   - Very short snippets (e.g., a single equation or a title) may be misclassified because they carry too little context to distinguish “math” from “non-math.”
+
+ ### Recommendations
+
+ - Before applying at scale, evaluate on a held-out set of your target web pages (especially if they come from a domain not represented in the original dataset).
+ - If you encounter persistent misclassification on a new subdomain (e.g., a specialized math blog), collect additional labeled examples from that source and fine-tune or retrain a new FastText model.
+ - Use appropriate preprocessing (HTML-to-text extraction, removal of boilerplate navigation) to feed only the main article content into the model for best results.
+
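The preprocessing recommendation above can be sketched with the Python standard library alone. This is a rough illustration, not the author's pipeline: the `SKIP_TAGS` set and the `html_to_single_line` helper are assumptions, and a dedicated main-content extractor would likely do better on real pages. The whitespace collapse matters because fastText's Python `predict` expects a single line and rejects text containing newline characters.

```python
import re
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate; this list is an assumption.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class _TextExtractor(HTMLParser):
    """Collect text nodes that are not nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_single_line(html: str) -> str:
    """Strip markup and boilerplate tags, then collapse all whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


page = "<html><nav>Home | About</nav><p>Solve x^2 = 4\nfor x.</p></html>"
print(html_to_single_line(page))
```

The resulting string can be passed directly to the classifier; the navigation text is dropped and the line break inside the paragraph becomes a single space.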
+ ---
+
+ ## How to Get Started with the Model
+
+ Install dependencies:
+
+ ```bash
+ pip install fasttext tiktoken