Description
Profile text data classification using spaCy textcat components. A very small model with no vectors.
Internally it uses an ensemble of bag-of-words (BOW) and single-headed attention. I've got better performance scores with transformer-based models,
but they are much larger and much slower to train and run on CPU, so I'm sticking to the basics for now.
The model is trained on profile fields formatted as [name] ([login], [email]) in [location], so it's critical to format the data with the provided format_input helper before applying the model to the text.
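To make the template concrete, here is a minimal sketch of the kind of string the model expects. `sketch_format_profile` is a hypothetical stand-in written for illustration, not the package's `format_input`; in particular, dropping missing fields is an assumption, since how the real helper treats them is not documented here. Always use `format_input` for real predictions.

```python
def sketch_format_profile(name, login, email=None, location=None):
    """Illustrative only: build '[name] ([login], [email]) in [location]'.

    Hypothetical helper, NOT the package's format_input. Skipping
    missing fields is an assumption made for this sketch.
    """
    # Join the identifiers that are present inside the parentheses.
    handles = ", ".join(part for part in (login, email) if part)
    text = f"{name} ({handles})" if handles else name
    # Append the location clause only when a location is given.
    if location:
        text += f" in {location}"
    return text


print(sketch_format_profile("Jane Doe", "jdoe", "jane@example.com", "Berlin"))
```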
Usage
from typing import Literal

from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

from en_textcat_gender import load as load_gender_model
from en_textcat_gender.utils import format_input

router = APIRouter()
# Load the pipeline once at import time so every request reuses it.
gender_model = load_gender_model()


class PredictGenderByProfileInput(BaseModel):
    login: str
    name: str
    location: str | None = None
    email: str | None = None


# `type` alias statements require Python 3.12+.
type Gender = Literal["Male", "Female", "Neutral"]


@router.post("/predictGenderByProfile")
async def predictGenderByProfile(body: list[PredictGenderByProfileInput]) -> list[Gender]:
    # Format every profile the same way the training data was formatted.
    texts = [format_input({
        "login": profile.login,
        "name": profile.name,
        "location": profile.location,
        "email": profile.email,
    }) for profile in body]
    docs = gender_model.pipe(texts)
    genders: list[Gender] = []
    for doc in docs:
        # doc.cats maps each label to its score; take the highest-scoring label.
        match max(doc.cats, key=doc.cats.get):
            case "FEMALE": genders.append("Female")
            case "MALE": genders.append("Male")
            case "NEUTRAL": genders.append("Neutral")
            case _: raise HTTPException(500, "Invalid enum value")
    return genders
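
The label selection in the endpoint above can be isolated into a small pure function, which is easier to test without loading the model. This is a sketch: `top_gender` and `LABEL_TO_GENDER` are names introduced here for illustration, and the scores in `example_cats` are made up rather than real model output.

```python
# Hypothetical mapping from the model's labels to the API's literals.
LABEL_TO_GENDER = {"MALE": "Male", "FEMALE": "Female", "NEUTRAL": "Neutral"}


def top_gender(cats: dict[str, float]) -> tuple[str, float]:
    """Return the highest-scoring label (mapped) together with its score.

    `cats` has the same shape as spaCy's doc.cats: label -> score.
    An unknown label raises KeyError instead of being guessed.
    """
    label = max(cats, key=cats.get)
    return LABEL_TO_GENDER[label], cats[label]


# Made-up scores for illustration only.
example_cats = {"MALE": 0.08, "FEMALE": 0.87, "NEUTRAL": 0.05}
print(top_gender(example_cats))
```

Returning the score alongside the label lets a caller apply its own confidence threshold if it needs one.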
| Feature | Description |
| --- | --- |
| Name | en_textcat_gender |
| Version | 0.0.3 |
| spaCy | >=3.8.7,<3.9.0 |
| Default Pipeline | textcat |
| Components | textcat |
| Vectors | 0 keys, 0 unique vectors (0 dimensions) |
| Sources | n/a |
| License | n/a |
| Author | n/a |
Label Scheme
| Component | Labels |
| --- | --- |
| textcat | MALE, FEMALE, NEUTRAL |
- MALE is a prediction for male human profiles
- FEMALE is a prediction for female human profiles
- NEUTRAL is a prediction for gender-neutral names, unintelligible sequences ("Foo Bar") and non-human profiles (organizations, companies)
Accuracy
| Type | Score |
| --- | --- |
| CATS_SCORE | 93.37 |
| CATS_MICRO_P | 93.19 |
| CATS_MICRO_R | 93.19 |
| CATS_MICRO_F | 93.19 |
| CATS_MACRO_P | 93.75 |
| CATS_MACRO_R | 93.04 |
| CATS_MACRO_F | 93.37 |
| CATS_MACRO_AUC | 98.76 |
| CATS_MACRO_AUC_PER_TYPE | 0.00 |
| TEXTCAT_LOSS | 105.21 |
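
For readers unfamiliar with the micro and macro rows, the sketch below computes both on a tiny made-up sample. It assumes macro scores are the unweighted mean of per-label scores and micro scores are computed from counts pooled over all labels; the gold and predicted labels are toy data, not the model's evaluation set.

```python
from collections import Counter


def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_macro(gold, pred, labels):
    """Return ((micro P, R, F), (macro P, R, F)) for single-label data."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[g] += 1  # gold label gains a false negative
    per_label = [prf(tp[l], fp[l], fn[l]) for l in labels]
    # Macro: average the per-label scores, each label weighted equally.
    macro = tuple(sum(s[i] for s in per_label) / len(labels) for i in range(3))
    # Micro: pool the counts over all labels, then score once.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Toy data for illustration only.
gold = ["MALE", "MALE", "MALE", "FEMALE", "FEMALE", "NEUTRAL"]
pred = ["MALE", "MALE", "FEMALE", "FEMALE", "FEMALE", "MALE"]
micro, macro = micro_macro(gold, pred, ["MALE", "FEMALE", "NEUTRAL"])
print("micro P/R/F:", micro)
print("macro P/R/F:", macro)
```

In a single-label task every wrong prediction counts as one false positive and one false negative, so micro precision, recall and F all equal plain accuracy. That is consistent with the three identical CATS_MICRO values in the table.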