Feature | Description |
---|---|
Name | en_textcat_culture |
Version | 0.0.3 |
spaCy | >=3.8.7,<3.9.0 |
Default Pipeline | textcat |
Components | textcat |
Vectors | 0 keys, 0 unique vectors (0 dimensions) |
Sources | n/a |
License | n/a |
Author | n/a |
Description
Classifies profile text data using spaCy textcat components. A very small model with no vectors.
Internally it uses an ensemble of BOW and single-headed attention. I've got better performance scores with transformer-based models,
but they are much slower to train and run on CPU and much larger, so I'm sticking to the basics for now.
The model is trained on profile fields in a specific format, `[name] ([login], [email]) in [location]`,
so it's critical to format the data with the provided `format_input` helper before applying the model to the text.
Usage
from typing import Literal

from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

from en_textcat_culture import load as load_culture_model
from en_textcat_culture.utils import format_input

router = APIRouter()

# Load the pipeline once at import time, not on every request
culture_model = load_culture_model()

class PredictCultureByProfileInput(BaseModel):
    login: str
    name: str
    location: str | None = None
    email: str | None = None

type Culture = Literal["Ru", "Other"]

# Example of model use in the context of a FastAPI route
@router.post("/predictCultureByProfile")
async def predictCultureByProfile(body: list[PredictCultureByProfileInput]) -> list[Culture]:
    # Format every profile the same way the training data was formatted
    texts = [format_input({
        "login": item.login,
        "name": item.name,
        "location": item.location,
        "email": item.email,
    }) for item in body]
    docs = culture_model.pipe(texts)
    cultures: list[Culture] = []
    for doc in docs:
        # Pick the label with the highest score (labels per the Label Scheme: RU, OTHER)
        match max(doc.cats, key=doc.cats.get):
            case "RU": cultures.append("Ru")
            case "OTHER": cultures.append("Other")
            case _: raise HTTPException(500, "Invalid enum value")
    return cultures
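Outside a web framework, the label-picking step can be isolated into a small function. The sketch below assumes the two labels listed under Label Scheme (RU and OTHER); `culture_from_cats` and `LABEL_TO_CULTURE` are hypothetical names, and the scores are made up so the snippet runs without the model installed.

```python
from typing import Literal

Culture = Literal["Ru", "Other"]

# Assumption: the pipeline emits exactly these two labels
LABEL_TO_CULTURE: dict[str, Culture] = {"RU": "Ru", "OTHER": "Other"}


def culture_from_cats(cats: dict[str, float]) -> Culture:
    """Map the highest-scoring textcat label to a Culture value."""
    label = max(cats, key=cats.get)
    if label not in LABEL_TO_CULTURE:
        raise ValueError(f"Unexpected label: {label!r}")
    return LABEL_TO_CULTURE[label]


# Made-up scores shaped like spaCy's doc.cats, for illustration only
example_cats = {"RU": 0.91, "OTHER": 0.09}
print(culture_from_cats(example_cats))
```

With a loaded pipeline, the same function would be called on `doc.cats` for each doc returned by `culture_model.pipe(texts)`.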
Label Scheme
Component | Labels |
---|---|
textcat | RU, OTHER |
- RU is a prediction of a RU-speaking human/org profile
- OTHER is a prediction of other profiles
I'm planning to extend this to capture Hispanic and other cultures in the future.
Accuracy
Type | Score |
---|---|
CATS_SCORE | 94.09 |
CATS_MICRO_P | 94.12 |
CATS_MICRO_R | 94.12 |
CATS_MICRO_F | 94.12 |
CATS_MACRO_P | 94.14 |
CATS_MACRO_R | 94.06 |
CATS_MACRO_F | 94.09 |
CATS_MACRO_AUC | 97.48 |
CATS_MACRO_AUC_PER_TYPE | 0.00 |
TEXTCAT_LOSS | 32.88 |
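As a reading aid for the micro and macro rows, the sketch below shows how the two kinds of averages differ. The per-label counts are made up and are not this model's evaluation data. It assumes the usual definitions: micro averages pool the counts across labels, while macro averages take the mean of the per-label scores.

```python
# Made-up confusion counts per label, for illustration only
counts = {
    "RU": {"tp": 45, "fp": 3, "fn": 2},
    "OTHER": {"tp": 50, "fp": 2, "fn": 3},
}


def precision(c: dict[str, int]) -> float:
    return c["tp"] / (c["tp"] + c["fp"])


def recall(c: dict[str, int]) -> float:
    return c["tp"] / (c["tp"] + c["fn"])


# Micro: pool true/false positives and false negatives over all labels
tp = sum(c["tp"] for c in counts.values())
fp = sum(c["fp"] for c in counts.values())
fn = sum(c["fn"] for c in counts.values())
micro_p = tp / (tp + fp)
micro_r = tp / (tp + fn)

# Macro: average the per-label scores, each label weighted equally
macro_p = sum(precision(c) for c in counts.values()) / len(counts)
macro_r = sum(recall(c) for c in counts.values()) / len(counts)

print(f"micro P={micro_p:.4f} R={micro_r:.4f}")
print(f"macro P={macro_p:.4f} R={macro_r:.4f}")
```

With two mutually exclusive labels, every error is a false positive for one label and a false negative for the other, so micro precision and recall coincide. That is consistent with the identical CATS_MICRO_P, CATS_MICRO_R and CATS_MICRO_F values in the table.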