Feature            Description
Name               en_textcat_culture
Version            0.0.3
spaCy              >=3.8.7,<3.9.0
Default Pipeline   textcat
Components         textcat
Vectors            0 keys, 0 unique vectors (0 dimensions)
Sources            n/a
License            n/a
Author             n/a

Description

Profile text classification using a spaCy textcat component. A very small model with no vectors; internally it uses an ensemble of a bag-of-words (BOW) model and single-headed attention. I got better scores with transformer-based models, but they are much slower to train and run on CPU and much larger, so I'm sticking to the basics for now.

The model is trained on profile fields formatted as [name] ([login], [email]) in [location], so it's critical to format the data with the provided format_input helper before applying the model to the text.
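For illustration, here is roughly what a formatted input looks like. The profile values are made up, and the exact output string (including how missing fields are handled) is determined by format_input itself; the commented result only assumes the "[name] ([login], [email]) in [location]" template described above.

from en_textcat_culture.utils import format_input

# Hypothetical profile values, for illustration only
text = format_input({
  "login": "ivan-petrov",
  "name": "Ivan Petrov",
  "location": "Moscow",
  "email": "ivan@example.com",
})
# Expected shape: "Ivan Petrov (ivan-petrov, ivan@example.com) in Moscow"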

Usage

from typing import Literal

from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

from en_textcat_culture import load as load_culture_model
from en_textcat_culture.utils import format_input

router = APIRouter()
culture_model = load_culture_model()

class PredictCultureByProfileInput(BaseModel):
  login: str
  name: str
  location: str | None = None
  email: str | None = None

type Culture = Literal["Ru", "Other"]

# Example of model use in the context of a FastAPI route
@router.post("/predictCultureByProfile")
async def predictCultureByProfile(body: list[PredictCultureByProfileInput]) -> list[Culture]:
  # Build model inputs in the exact format the model was trained on
  texts = [format_input({
    "login": input.login,
    "name": input.name,
    "location": input.location,
    "email": input.email,
  }) for input in body]
  # pipe() streams all texts through the pipeline in one pass
  docs = culture_model.pipe(texts)
  cultures: list[Culture] = []
  for doc in docs:
    # Pick the label with the highest score in doc.cats
    match max(doc.cats, key=doc.cats.get):
      case "RU": cultures.append("Ru")
      case "OTHER": cultures.append("Other")
      case _: raise HTTPException(500, "Invalid enum value")
  return cultures
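Outside of a web app, a single profile can be scored directly. This is a minimal sketch with made-up field values; doc.cats maps each label to a score, and the numbers in the comment are only illustrative.

from en_textcat_culture import load as load_culture_model
from en_textcat_culture.utils import format_input

nlp = load_culture_model()
text = format_input({
  "login": "octocat",
  "name": "Mona Lisa",
  "location": None,
  "email": None,
})
doc = nlp(text)
# doc.cats looks like {"RU": 0.03, "OTHER": 0.97} (illustrative numbers)
print(max(doc.cats, key=doc.cats.get), doc.cats)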

Label Scheme

Component Labels
textcat RU, OTHER
  • RU is predicted for profiles of Russian-speaking people or organizations
  • OTHER is predicted for all other profiles

I plan to extend this to capture Hispanic and other cultures in the future.
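If the argmax used in the usage example above is too coarse, the raw doc.cats scores can be thresholded instead. The 0.5 cut-off here is an arbitrary choice for illustration, not a value tuned for this model.

# Treat the profile as Russian-speaking only if the RU score clears a threshold
culture = "Ru" if doc.cats.get("RU", 0.0) >= 0.5 else "Other"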

Accuracy

Type Score
CATS_SCORE 94.09
CATS_MICRO_P 94.12
CATS_MICRO_R 94.12
CATS_MICRO_F 94.12
CATS_MACRO_P 94.14
CATS_MACRO_R 94.06
CATS_MACRO_F 94.09
CATS_MACRO_AUC 97.48
CATS_MACRO_AUC_PER_TYPE 0.00
TEXTCAT_LOSS 32.88
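These are spaCy's standard textcat metrics. They can be recomputed against your own held-out data with the spacy evaluate CLI; dev.spacy is a placeholder path to a DocBin with gold cats annotations.

python -m spacy evaluate en_textcat_culture dev.spacy --output metrics.json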