Update README.md
README.md
CHANGED
@@ -1,15 +1,69 @@
|
|
1 |
---
|
2 |
license: cc-by-4.0
|
3 |
datasets:
|
4 |
-
- kenhktsui/math-classifiers-data
|
5 |
language:
|
6 |
-
- en
|
7 |
metrics:
|
8 |
-
- accuracy
|
9 |
-
- recall
|
10 |
-
- precision
|
11 |
base_model:
|
12 |
-
- facebook/fasttext-en-vectors
|
13 |
pipeline_tag: text-classification
|
14 |
library_name: fasttext
|
15 |
-
---
|
1 |
---
|
2 |
license: cc-by-4.0
|
3 |
datasets:
|
4 |
+
- kenhktsui/math-classifiers-data
|
5 |
language:
|
6 |
+
- en
|
7 |
metrics:
|
8 |
+
- accuracy
|
9 |
+
- recall
|
10 |
+
- precision
|
11 |
base_model:
|
12 |
+
- facebook/fasttext-en-vectors
|
13 |
pipeline_tag: text-classification
|
14 |
library_name: fasttext
|
15 |
+
---
|
16 |
+
# Model Card for FastText Math vs. Non-Math Classifier
|
17 |
+
|
18 |
+
A FastText-based binary classifier trained to distinguish “math” text from “non-math” text in English webpages. It is fine-tuned on the `kenhktsui/math-classifiers-data` dataset using `facebook/fasttext-en-vectors` as the base word-embedding model.
|
19 |
+
|
20 |
+
---
|
21 |
+
|
22 |
+
## Model Details
|
23 |
+
|
24 |
+
### Overview
|
25 |
+
|
26 |
+
This model takes raw English text (for example, the plain-text extraction of an HTML page) and predicts whether the content is math-related (label `__label__math`) or not (label `__label__non-math`). It was developed by user **herooooooooo** and is released under the CC-BY-4.0 license.
|
27 |
+
|
28 |
+
- **Model type:** Supervised FastText classifier (binary classification)
|
29 |
+
- **Developed by:** herooooooooo
|
30 |
+
- **License:** CC-BY-4.0
|
31 |
+
- **Language:** English (en)
|
32 |
+
- **Base model:** `facebook/fasttext-en-vectors` (pretrained word vectors)
|
33 |
+
- **Fine-tuned on:** `kenhktsui/math-classifiers-data` (a public Hugging Face dataset of labeled math vs. non-math examples)
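
The label format described in the overview can be handled with a small wrapper. This is a minimal sketch, not the author's code: it assumes the model object exposes fastText's `predict(text, k=1)` method, which returns a tuple of labels and a sequence of probabilities. The helper names `parse_prediction` and `classify`, and the `model.bin` path in the usage comment, are hypothetical.

```python
LABEL_PREFIX = "__label__"


def parse_prediction(labels, probs):
    """Return (label without the fastText prefix, probability) for the top prediction."""
    top_label = labels[0]
    if top_label.startswith(LABEL_PREFIX):
        top_label = top_label[len(LABEL_PREFIX):]
    return top_label, float(probs[0])


def classify(model, text):
    """Classify one document with a fastText-style model (assumed interface)."""
    # fastText's predict() processes one line at a time and rejects newlines,
    # so collapse all whitespace runs into single spaces first.
    flat = " ".join(text.split())
    labels, probs = model.predict(flat, k=1)
    return parse_prediction(labels, probs)


# Hypothetical usage; "model.bin" is a placeholder, not a file named in this card:
# import fasttext
# model = fasttext.load_model("model.bin")
# print(classify(model, "Let x be a real number such that x^2 = 2."))
```

Stripping the prefix keeps downstream code independent of fastText's label convention; the probability is returned alongside so callers can apply their own threshold.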
|
34 |
+
|
35 |
+
### Intended Use
|
36 |
+
|
37 |
+
- **Primary application:** Filtering or labeling large corpora of webpages or documents for math content (e.g., selecting only math-related pages from web crawls).
|
38 |
+
- **Foreseeable users:** Researchers preparing math-focused corpora, data engineers curating domain-specific text, or educators building math content pipelines.
|
39 |
+
- **Out-of-scope:**
|
40 |
+
- Not intended for general topic classification beyond “math vs. non-math.”
|
41 |
+
- Performance may degrade on very short texts (fewer than ~20 tokens) or on highly technical subdomains that are poorly represented in the training set (e.g., very specialized LaTeX macros not covered by the dataset).
|
42 |
+
- Should not be used for any safety- or compliance-critical pipeline without additional validation.
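
The filtering use case and the short-text caveat above can be combined in one pass over a corpus. The sketch below is illustrative only: `filter_math`, `min_tokens`, and `min_prob` are hypothetical names, the 0.5 probability cutoff is an arbitrary default rather than a value from this card, and whitespace splitting is used as a rough stand-in for whatever token count the ~20-token figure refers to. The `classify` argument is any callable that returns a `(label, probability)` pair.

```python
def filter_math(docs, classify, min_tokens=20, min_prob=0.5):
    """Yield documents that the classifier labels as math with enough confidence."""
    for doc in docs:
        # Approximate the token count by whitespace splitting (an assumption).
        if len(doc.split()) < min_tokens:
            continue  # too short to classify reliably; skip or route to review
        label, prob = classify(doc)
        if label == "math" and prob >= min_prob:
            yield doc
```

Because the function is a generator, it can stream over a large crawl without holding every document in memory.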
|
43 |
+
|
44 |
+
---
|
45 |
+
|
46 |
+
## Bias, Risks, and Limitations
|
47 |
+
|
48 |
+
- **Biases:**
|
49 |
+
- The model is trained on the `kenhktsui/math-classifiers-data` dataset, which predominantly contains English posts from math forums and random English web text. It may underperform on varieties of English from outside North America and Europe (e.g., Indian English math blogs) if they were underrepresented.
|
50 |
+
- The classifier can mislabel “math-adjacent” text (e.g., computer science blogs discussing algorithms, physics pages dense with formulas) as “non-math” if the training set did not include similar examples.
|
51 |
+
|
52 |
+
- **Technical limitations:**
|
53 |
+
- Because FastText represents text as a bag of words and n-grams, it does not capture word order beyond the n-gram window or long-range context. Very subtle math content (e.g., a single embedded formula in an otherwise non-math article) may be missed.
|
54 |
+
- Very short snippets (e.g., a single equation or a title) may be misclassified because they offer too little context to distinguish “math” from “non-math.”
|
55 |
+
|
56 |
+
### Recommendations
|
57 |
+
|
58 |
+
- Before applying at scale, evaluate on a held-out set of your target webpages (especially if they come from a domain not represented in the original dataset).
|
59 |
+
- If you encounter persistent misclassification on a new subdomain (e.g., a specialized math blog), collect additional labeled examples from that source and fine-tune or retrain a new FastText model.
|
60 |
+
- Use appropriate preprocessing (HTML-to-text extraction, removal of boilerplate navigation) to feed only the main article content into the model for best results.
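
For the held-out evaluation recommended above, the three metrics listed in this card's metadata (accuracy, precision, recall) can be computed without extra dependencies. This is a minimal sketch under stated assumptions: `binary_metrics` is a hypothetical helper, labels are plain strings with the `__label__` prefix already removed, and `math` is treated as the positive class.

```python
def binary_metrics(gold, pred, positive="math"):
    """Compute accuracy, precision, and recall for two equal-length label lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        # Report 0.0 when a denominator is zero instead of raising.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

When the goal is to select math pages from a crawl, low recall on the target domain means math pages are being dropped, and low precision means non-math pages are leaking into the filtered corpus.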
|
61 |
+
|
62 |
+
---
|
63 |
+
|
64 |
+
## How to Get Started with the Model
|
65 |
+
|
66 |
+
Install dependencies:
|
67 |
+
|
68 |
+
```bash
|
69 |
+
pip install fasttext tiktoken
|