ivan-kleshnin/en_textcat_gender

Description

Classifies profile text by gender using spaCy `textcat` components. A very small model with no vectors. Internally it uses an ensemble of a bag-of-words (BOW) model and single-headed attention. Transformer-based models gave me better scores, but they are much larger and much slower to train and run on CPU, so I'm sticking to the basics for now.

The model is trained on profile fields formatted as `[name] ([login], [email]) in [location]`, so it is critical to format the data with the provided `format_input` helper before passing text to the model.
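As a rough illustration of the template described above, the sketch below builds a string of the form `[name] ([login], [email]) in [location]`. The function name `sketch_format_profile` and its handling of missing fields (they are simply omitted) are assumptions made for this example. The packaged `format_input` helper is the source of truth and may behave differently, so use it for real inputs.

```python
from typing import Optional


def sketch_format_profile(
    name: str,
    login: str,
    email: Optional[str] = None,
    location: Optional[str] = None,
) -> str:
    """Hypothetical illustration of the "[name] ([login], [email]) in [location]" template.

    Assumption: missing optional fields are dropped. The real
    `format_input` helper may handle them differently.
    """
    # Collect the identifiers that go inside the parentheses.
    ids = ", ".join(part for part in (login, email) if part)
    text = f"{name} ({ids})"
    # Append the location only when it is known.
    if location:
        text += f" in {location}"
    return text


print(sketch_format_profile("Jane Doe", "jdoe", "jane@example.com", "Berlin"))
# Jane Doe (jdoe, jane@example.com) in Berlin
```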

Usage

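from typing import Literal

from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

from en_textcat_gender import load as load_gender_model
from en_textcat_gender.utils import format_input

gender_model = load_gender_model()
router = APIRouter()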

class PredictGenderByProfileInput(BaseModel):
  login: str
  name: str
  location: str | None = None
  email: str | None = None

type Gender = Literal["Male", "Female", "Neutral"]

# Example of using the model in a FastAPI route
@router.post("/predictGenderByProfile")
async def predictGenderByProfile(body: list[PredictGenderByProfileInput]) -> list[Gender]:
  # Format every profile with the provided helper before classification
  texts = [format_input({
    "login": profile.login,
    "name": profile.name,
    "location": profile.location,
    "email": profile.email,
  }) for profile in body]
  docs = gender_model.pipe(texts)
  genders: list[Gender] = []
  for doc in docs:
    match max(doc.cats, key=doc.cats.get):
      case "FEMALE": genders.append("Female")
      case "MALE": genders.append("Male")
      case "NEUTRAL": genders.append("Neutral")
      case _: raise HTTPException(500, "Invalid enum value")
  return genders
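
Outside of a web framework, the label selection step reduces to picking the highest-scoring key of `doc.cats`. The sketch below works on a plain dict shaped like `doc.cats`, so it runs without the model installed. The helper name `pick_gender` and the example scores are made up for illustration.

```python
from typing import Literal

Gender = Literal["Male", "Female", "Neutral"]

# Map the model's labels (see Label Scheme) to the API-facing values.
LABEL_TO_GENDER: dict[str, Gender] = {
    "MALE": "Male",
    "FEMALE": "Female",
    "NEUTRAL": "Neutral",
}


def pick_gender(cats: dict[str, float]) -> tuple[Gender, float]:
    """Return the top label and its score from a `doc.cats`-like dict."""
    label = max(cats, key=cats.get)
    if label not in LABEL_TO_GENDER:
        raise ValueError(f"Unexpected label: {label!r}")
    return LABEL_TO_GENDER[label], cats[label]


# Made-up scores, not real model output.
example_cats = {"MALE": 0.08, "FEMALE": 0.87, "NEUTRAL": 0.05}
print(pick_gender(example_cats))  # ('Female', 0.87)
```

With the real model, pass `doc.cats` for each `doc` returned by `gender_model.pipe(texts)`.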

| Feature | Description |
| --- | --- |
| Name | `en_textcat_gender` |
| Version | `0.0.3` |
| spaCy | `>=3.8.7,<3.9.0` |
| Default Pipeline | `textcat` |
| Components | `textcat` |
| Vectors | 0 keys, 0 unique vectors (0 dimensions) |
| Sources | n/a |
| License | n/a |
| Author | n/a |

Label Scheme

| Component | Labels |
| --- | --- |
| `textcat` | `MALE`, `FEMALE`, `NEUTRAL` |

`MALE` is predicted for male human profiles.
`FEMALE` is predicted for female human profiles.
`NEUTRAL` is predicted for gender-neutral names, unintelligible sequences ("Foo Bar"), and non-human profiles (organizations, companies).

Accuracy

| Type | Score |
| --- | --- |
| `CATS_SCORE` | 93.37 |
| `CATS_MICRO_P` | 93.19 |
| `CATS_MICRO_R` | 93.19 |
| `CATS_MICRO_F` | 93.19 |
| `CATS_MACRO_P` | 93.75 |
| `CATS_MACRO_R` | 93.04 |
| `CATS_MACRO_F` | 93.37 |
| `CATS_MACRO_AUC` | 98.76 |
| `CATS_MACRO_AUC_PER_TYPE` | 0.00 |
| `TEXTCAT_LOSS` | 105.21 |
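
For a single-label classifier like this one, every example contributes exactly one gold label and one predicted label, so micro-averaged precision, recall and F-score all equal plain accuracy. That is consistent with the three identical `CATS_MICRO_*` values above. Macro scores instead average the per-label values, and `CATS_SCORE` matches `CATS_MACRO_F` in the table. The sketch below illustrates the difference on invented labels. It is not spaCy's scorer, and the data is made up for the example.

```python
from collections import Counter

LABELS = ["MALE", "FEMALE", "NEUTRAL"]

# Invented gold and predicted labels, one per example.
gold = ["MALE", "MALE", "FEMALE", "FEMALE", "NEUTRAL", "NEUTRAL", "MALE", "FEMALE"]
pred = ["MALE", "FEMALE", "FEMALE", "FEMALE", "NEUTRAL", "MALE", "MALE", "FEMALE"]

tp, fp, fn = Counter(), Counter(), Counter()
for g, p in zip(gold, pred):
    if g == p:
        tp[g] += 1
    else:
        fp[p] += 1  # label p was predicted, but it was wrong
        fn[g] += 1  # gold label g was missed


def prf(tp_n, fp_n, fn_n):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp_n / (tp_n + fp_n) if tp_n + fp_n else 0.0
    r = tp_n / (tp_n + fn_n) if tp_n + fn_n else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Micro: pool the counts over all labels, then compute the scores once.
micro_p, micro_r, micro_f = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))

# Macro: compute the scores per label, then take the unweighted mean.
per_label = [prf(tp[label], fp[label], fn[label]) for label in LABELS]
macro_p, macro_r, macro_f = (sum(col) / len(LABELS) for col in zip(*per_label))

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"accuracy={accuracy:.4f} micro P/R/F={micro_p:.4f}/{micro_r:.4f}/{micro_f:.4f}")
print(f"macro P/R/F={macro_p:.4f}/{macro_r:.4f}/{macro_f:.4f}")
```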
