Model Card for FastText Math vs. Non-Math Classifier

A FastText-based binary classifier trained to distinguish “math” text from “non-math” text in English webpages. It is fine-tuned on the kenhktsui/math-classifiers-data dataset using facebook/fasttext-en-vectors as the base word-embedding model.


Model Details

Overview

This model takes raw English text (for example, the plain-text extraction of an HTML page) and predicts whether the content is math-related (label __label__math) or not (label __label__non-math). It was developed by user herooooooooo and is released under the CC-BY-4.0 license.

  • Model type: Supervised FastText classifier (binary classification)
  • Developed by: herooooooooo
  • License: CC-BY-4.0
  • Language: English (en)
  • Base model: facebook/fasttext-en-vectors (pretrained word vectors)
  • Fine-tuned on: kenhktsui/math-classifiers-data (a public Hugging Face dataset of labeled math vs. non-math examples)

Intended Use

  • Primary application: Filtering or labeling large corpora of webpages or documents for math content (e.g., selecting only math-related pages from web crawls).
  • Foreseeable users: Researchers preparing math-focused corpora, data engineers curating domain-specific text, or educators building math content pipelines.
  • Out-of-scope:
    • Not intended for general topic classification beyond “math vs. non-math.”
    • Performance may degrade on extremely short texts (fewer than ~20 tokens) or on highly technical subdomains not well represented in the training set (e.g., very specialized LaTeX macros not covered by the dataset).
    • Should not be used for any safety- or compliance-critical pipeline without additional validation.

Bias, Risks, and Limitations

  • Biases:

    • The model is trained on the kenhktsui/math-classifiers-data dataset, which predominantly contains English posts from math forums and random English web text. It may underperform on English varieties from outside North America and Europe (e.g., Indian English math blogs) if they were underrepresented.
    • The classifier can mislabel “math-adjacent” text (e.g., computer science blogs discussing algorithms, physics pages dense with formulas) as “non-math” if the training set did not include similar examples.
  • Technical limitations:

    • Because FastText represents a document as a bag of words and n-grams, it does not capture long-range dependencies or document-level context. Very subtle math content (e.g., a single embedded formula in an otherwise non-math article) may be missed.
    • Very short snippets (e.g., a single equation or a title) may be misclassified because there may not be enough context to distinguish “math” from “non-math.”
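
One way to act on these limitations is to abstain rather than force a label when the input is short or the prediction is weak. The sketch below is only an illustration: the 20-token floor echoes the rough figure given under Intended Use, while the whitespace tokenization, the 0.8 confidence threshold, and the "uncertain" fallback are hypothetical choices and not part of the released model.

```python
MIN_TOKENS = 20        # rough floor taken from the "~20 tokens" guidance
MIN_CONFIDENCE = 0.8   # hypothetical threshold; tune it on your own data


def guarded_label(text: str, label: str, prob: float) -> str:
    """Return the model's label, or "uncertain" for short or low-confidence inputs.

    `label` and `prob` are the classifier's top-1 output for `text`.
    """
    n_tokens = len(text.split())  # whitespace tokens, only an approximation
    if n_tokens < MIN_TOKENS or prob < MIN_CONFIDENCE:
        return "uncertain"
    return label
```

Documents marked "uncertain" can then be routed to manual review or a slower second-stage classifier instead of being silently dropped.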

Recommendations

  • Before applying at scale, evaluate on a held-out set of your target webpages (especially if they come from a domain not represented in the original dataset).
  • If you encounter persistent misclassification on a new subdomain (e.g., a specialized math blog), collect additional labeled examples from that source and fine-tune or retrain a new FastText model.
  • Use appropriate preprocessing (HTML-to-text extraction, removal of boilerplate navigation) to feed only the main article content into the model for best results.
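
The held-out evaluation recommended above can be as simple as comparing predicted labels with hand-assigned gold labels and reporting precision, recall, and F1 for the math class. The sketch below assumes you already have two parallel lists of label strings; the function name and the choice of __label__math as the positive class are illustrative.

```python
def math_prf(gold, predicted, positive="__label__math"):
    """Return (precision, recall, f1) for the positive class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)  # true positives
    fp = sum(g != positive and p == positive for g, p in pairs)  # false positives
    fn = sum(g == positive and p != positive for g, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Low recall on your target domain suggests math pages are being filtered out, while low precision means non-math pages are leaking into the corpus; which matters more depends on the pipeline.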

How to Get Started with the Model

Install dependencies:

pip install fasttext tiktoken
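
A minimal inference sketch, assuming the trained model is available locally as a FastText binary. The model path is left as a parameter because this card does not state the file name, and the helper names are hypothetical. `fasttext.load_model` and `model.predict` are the standard calls in the `fasttext` Python package; `predict` rejects input that contains newline characters, so the text is flattened to a single line first.

```python
import re


def normalize(text: str) -> str:
    """Collapse all whitespace, including newlines, into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def classify(model, text: str):
    """Return (label, probability) for one document.

    `model` is any object exposing a fastText-style predict(text, k) method.
    """
    labels, probs = model.predict(normalize(text), k=1)
    return labels[0], float(probs[0])


def classify_with_file(model_path: str, text: str):
    """Load a FastText binary from `model_path` and classify `text`."""
    import fasttext  # imported lazily so the helpers above work without it

    model = fasttext.load_model(model_path)
    return classify(model, text)
```

For bulk filtering, load the model once and call `classify` per document rather than calling `classify_with_file` repeatedly.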