Feature            Description
Name               en_textcat_culture
Version            0.0.3
spaCy              >=3.8.7,<3.9.0
Default Pipeline   textcat
Components         textcat
Vectors            0 keys, 0 unique vectors (0 dimensions)
Sources            n/a
License            n/a
Author             n/a

Description

Profile text classification using a spaCy textcat component. A very small model with no vectors; internally it uses an ensemble of a bag-of-words (BOW) model and single-headed attention. I got better scores with transformer-based models, but they are much slower to train and run on CPU and much larger, so I'm sticking to the basics for now.

The model is trained on profile fields formatted as [name] ([login], [email]) in [location], so it's critical to format the data with the provided format_input helper before applying the model to the text.
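For illustration, here is roughly what a formatted input looks like. The profile values are made up, and the exact output string (including how missing fields are handled) is determined by format_input itself; the commented result only assumes the "[name] ([login], [email]) in [location]" template described above.

from en_textcat_culture.utils import format_input

# Hypothetical profile values, for illustration only
text = format_input({
  "login": "ivan-petrov",
  "name": "Ivan Petrov",
  "location": "Moscow",
  "email": "ivan@example.com",
})
# Expected shape: "Ivan Petrov (ivan-petrov, ivan@example.com) in Moscow"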

Usage

from typing import Literal

from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

from en_textcat_culture import load as load_culture_model
from en_textcat_culture.utils import format_input

router = APIRouter()
culture_model = load_culture_model()

class PredictCultureByProfileInput(BaseModel):
  login: str
  name: str
  location: str | None = None
  email: str | None = None

type Culture = Literal["Ru", "Other"]

# Example of model use in the context of a FastAPI route
@router.post("/predictCultureByProfile")
async def predictCultureByProfile(body: list[PredictCultureByProfileInput]) -> list[Culture]:
  # Build model inputs in the exact format the model was trained on
  texts = [format_input({
    "login": input.login,
    "name": input.name,
    "location": input.location,
    "email": input.email,
  }) for input in body]
  # pipe() streams all texts through the pipeline in one pass
  docs = culture_model.pipe(texts)
  cultures: list[Culture] = []
  for doc in docs:
    # Pick the label with the highest score in doc.cats
    match max(doc.cats, key=doc.cats.get):
      case "RU": cultures.append("Ru")
      case "OTHER": cultures.append("Other")
      case _: raise HTTPException(500, "Invalid enum value")
  return cultures
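Outside of a web app, a single profile can be scored directly. This is a minimal sketch with made-up field values; doc.cats maps each label to a score, and the numbers in the comment are only illustrative.

from en_textcat_culture import load as load_culture_model
from en_textcat_culture.utils import format_input

nlp = load_culture_model()
text = format_input({
  "login": "octocat",
  "name": "Mona Lisa",
  "location": None,
  "email": None,
})
doc = nlp(text)
# doc.cats looks like {"RU": 0.03, "OTHER": 0.97} (illustrative numbers)
print(max(doc.cats, key=doc.cats.get), doc.cats)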

Label Scheme

Component Labels
textcat RU, OTHER
  • RU is predicted for profiles of Russian-speaking people or organizations
  • OTHER is predicted for all other profiles

I plan to extend this to capture Hispanic and other cultures in the future.
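If the argmax used in the usage example above is too coarse, the raw doc.cats scores can be thresholded instead. The 0.5 cut-off here is an arbitrary choice for illustration, not a value tuned for this model.

# Treat the profile as Russian-speaking only if the RU score clears a threshold
culture = "Ru" if doc.cats.get("RU", 0.0) >= 0.5 else "Other"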

Accuracy

Type Score
CATS_SCORE 94.09
CATS_MICRO_P 94.12
CATS_MICRO_R 94.12
CATS_MICRO_F 94.12
CATS_MACRO_P 94.14
CATS_MACRO_R 94.06
CATS_MACRO_F 94.09
CATS_MACRO_AUC 97.48
CATS_MACRO_AUC_PER_TYPE 0.00
TEXTCAT_LOSS 32.88
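These are spaCy's standard textcat metrics. They can be recomputed against your own held-out data with the spacy evaluate CLI; dev.spacy is a placeholder path to a DocBin with gold cats annotations.

python -m spacy evaluate en_textcat_culture dev.spacy --output metrics.json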