INTRODUCTION:

This model, developed as part of the propp-fr project, is a NER model built on top of camembert-large embeddings and trained to predict nested entities in French literary texts.

The entity types are listed below. This model predicts PER only (see TRAINING PARAMETERS):

  • mentions of characters (PER): pronouns (je, tu, il, ...), possessive pronouns (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
  • facilities (FAC): château, sentier, chambre, couloir, ...
  • time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
  • geo-political entities (GPE): Montrouge, France, le petit hameau, ...
  • locations (LOC): le sud, Mars, l'océan, le bois, ...
  • vehicles (VEH): avion, voitures, calèche, vélos, ...

MODEL PERFORMANCE (LOOCV):

NER_tag      precision   recall    f1_score   support   support %
PER          94.58%      95.16%    94.87%     71,738    100.00%
micro_avg    94.58%      95.16%    94.87%     71,738    100.00%
macro_avg    94.58%      95.16%    94.87%     71,738    100.00%

TRAINING PARAMETERS:

  • Entities types: ['PER']
  • Tagging scheme: BIOES
  • Nested entities levels: [0, 1]
  • Split strategy: Leave-one-out cross-validation (31 files)
  • Train/Validation split: 0.85 / 0.15
  • Batch size: 16
  • Initial learning rate: 0.00014
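
To illustrate how the BIOES scheme and the two nesting levels fit together, here is a minimal sketch. It is not the project's own preprocessing code: the helper names `assign_levels` and `spans_to_bioes` are hypothetical, and treating level 0 as the outermost mentions is an assumption that the card does not confirm.

```python
from typing import List, Tuple

# (start, end_inclusive, label), indices are token positions
Span = Tuple[int, int, str]


def assign_levels(spans: List[Span]) -> List[List[Span]]:
    """Distribute possibly nested spans over levels so that no two spans
    on the same level overlap. Assumption: longest spans go to level 0."""
    levels: List[List[Span]] = []
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        for level in levels:
            if all(span[1] < other[0] or span[0] > other[1] for other in level):
                level.append(span)
                break
        else:
            # The span overlaps something on every existing level
            levels.append([span])
    return levels


def spans_to_bioes(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode the non-overlapping spans of one level as BIOES tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if start == end:
            tags[start] = f"S-{label}"  # single-token mention
        else:
            tags[start] = f"B-{label}"  # beginning
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"  # inside
            tags[end] = f"E-{label}"    # end
    return tags


tokens = ["le", "capitaine", "de", "mon", "navire"]
# "le capitaine de mon navire" contains the possessive mention "mon"
levels = assign_levels([(3, 3, "PER"), (0, 4, "PER")])
for depth, level in enumerate(levels):
    print(depth, spans_to_bioes(len(tokens), level))
```

With a single entity type, each level uses the five labels O, B-PER, I-PER, E-PER and S-PER.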

MODEL ARCHITECTURE:

Model Input: Maximum context camembert-large embeddings (1024 dimensions)

  • Locked Dropout: 0.5
  • Projection layer:
    • layer type: highway layer
    • input: 1024 dimensions
    • output: 2048 dimensions
  • BiLSTM layer:
    • input: 2048 dimensions
    • output: 256 dimensions (hidden state)
  • Linear layer:
    • input: 256 dimensions
    • output: 5 dimensions (predicted labels with BIOES tagging scheme)
  • CRF layer

Model Output: BIOES labels sequence
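
As an illustration of how a CRF-style decoding step can turn five per-token scores into a well-formed BIOES sequence, here is a pure-Python Viterbi sketch. It is not the model's code. A trained CRF learns real-valued transition scores, whereas this sketch assumes hard constraints (a transition is either allowed or forbidden). The names `LABELS`, `allowed` and `viterbi` and the example scores are made up for the illustration.

```python
import math
from typing import Dict, List

# The five output labels for a single entity type under BIOES
LABELS = ["O", "B-PER", "I-PER", "E-PER", "S-PER"]


def allowed(prev: str, curr: str) -> bool:
    """BIOES grammar: B/I must be followed by I/E; O/E/S by O/B/S."""
    if prev[0] in "BI":
        return curr[0] in "IE"
    return curr[0] in "OBS"


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring label sequence that respects `allowed`."""
    # Start as if the token before the sequence were tagged "O"
    score = {l: emissions[0][l] if allowed("O", l) else -math.inf for l in LABELS}
    backpointers: List[Dict[str, str]] = []
    for emit in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for curr in LABELS:
            candidates = [p for p in LABELS if allowed(p, curr)]
            best_prev = max(candidates, key=lambda p: score[p])
            new_score[curr] = score[best_prev] + emit[curr]
            pointers[curr] = best_prev
        backpointers.append(pointers)
        score = new_score
    # A sequence may not end in the middle of a mention
    last = max((l for l in LABELS if l[0] in "OES"), key=lambda l: score[l])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Made-up scores for three tokens, in LABELS order
rows = [
    [0.1, 2.0, 0.0, 0.0, 1.0],
    [1.5, 0.0, 0.0, 1.0, 0.2],
    [2.0, 0.0, 0.0, 0.0, 0.0],
]
emissions = [dict(zip(LABELS, row)) for row in rows]
greedy = [max(e, key=e.get) for e in emissions]
print(greedy)              # per-token argmax: B-PER followed by O is not valid BIOES
print(viterbi(emissions))  # ['B-PER', 'E-PER', 'O']
```

The per-token argmax yields B-PER followed directly by O, which is ill-formed; decoding over whole sequences avoids this.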

HOW TO USE:

*** UNDER CONSTRUCTION ***

TRAINING CORPUS:

Index  Document  Token count  Included in model evaluation
0 1731_Prévost-Antoine-François_Manon-Lescaut_PER-ONLY 71,219 tokens True
1 1830_Balzac-Honoré-de_La-maison-du-chat-qui-pelote 24,776 tokens True
2 1830_Balzac-Honoré-de_Sarrasine 15,408 tokens True
3 1832_Sand-George_Indiana_PER-ONLY 112,221 tokens True
4 1836_Gautier-Théophile_La-morte-amoureuse 14,293 tokens True
5 1837_Balzac-Honoré-de_La-maison-Nucingen 30,030 tokens True
6 1841_Sand-George_Pauline 12,398 tokens True
7 1856_Cousin-Victor_Madame-de-Hautefort 11,768 tokens True
8 1863_Gautier-Théophile_Le-capitaine-Fracasse 11,848 tokens True
9 1873_Zola-Émile_Le-ventre-de-Paris 12,613 tokens True
10 1881_Flaubert-Gustave_Bouvard-et-Pécuchet 12,308 tokens True
11 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-buche 2,267 tokens True
12 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-relique 2,041 tokens True
13 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-La-rouille 2,949 tokens True
14 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Madame-Baptiste 2,578 tokens True
15 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Marocca 4,078 tokens True
16 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-A-cheval 2,878 tokens True
17 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Fou 1,905 tokens True
18 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Mademoiselle-Fifi 5,439 tokens True
19 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Reveil 2,159 tokens True
20 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Un-reveillon 2,364 tokens True
21 1882-1883_Maupassant-Guy-de_Mademoiselle-Fifi-Nouveaux-contes-Une-ruse 2,469 tokens True
22 1901_Achard-Lucie_Rosalie-de-Constant-sa-famille-et-ses-amis 12,775 tokens True
23 1903_Conan-Laure_Élisabeth-Seton 13,046 tokens True
24 1904-1912_Rolland-Romain_Jean-Christophe(1) 10,982 tokens True
25 1904-1912_Rolland-Romain_Jean-Christophe(2) 10,305 tokens True
26 1917_Bourgeois-Adèle_Némoville 12,468 tokens True
27 1923_Delly_Dans-les-ruines 95,617 tokens True
28 1923_Radiguet-Raymond_Le-diable-au-corps 14,850 tokens True
29 1926_Audoux-Marguerite_De-la-ville-au-moulin 12,144 tokens True
30 1937_Audoux-Marguerite_Douce-Lumière 12,346 tokens True
TOTAL 554,542 tokens 31 files used for cross-validation
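
The leave-one-out strategy listed under TRAINING PARAMETERS can be sketched as follows. This is a hedged illustration, not the project's training script: the function `leave_one_out` and its `train_ratio` and `seed` parameters are hypothetical, and applying the 0.85 / 0.15 train/validation split at the file level is an assumption (the card does not say whether the split is made over files or over smaller units).

```python
import random
from typing import Iterator, List, Tuple


def leave_one_out(
    files: List[str], train_ratio: float = 0.85, seed: int = 0
) -> Iterator[Tuple[List[str], List[str], str]]:
    """Yield (train, validation, held_out) once per file.

    Each file is held out for evaluation exactly once; the remaining
    files are shuffled and split into train and validation sets.
    """
    rng = random.Random(seed)
    for held_out in files:
        rest = [f for f in files if f != held_out]
        rng.shuffle(rest)
        cut = round(train_ratio * len(rest))
        yield rest[:cut], rest[cut:], held_out


# A few document names taken from the corpus table
corpus = [
    "1830_Balzac-Honoré-de_Sarrasine",
    "1836_Gautier-Théophile_La-morte-amoureuse",
    "1841_Sand-George_Pauline",
    "1917_Bourgeois-Adèle_Némoville",
    "1937_Audoux-Marguerite_Douce-Lumière",
]
for train, validation, held_out in leave_one_out(corpus):
    print(f"eval on {held_out}: {len(train)} train, {len(validation)} validation")
```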

PREDICTIONS CONFUSION MATRIX:

Gold Labels   PER      O       support
PER           68,267   3,471   71,738
O             3,910    0       3,910
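
The scores reported under MODEL PERFORMANCES can be recomputed from this matrix. The short check below reads the rows as gold labels and the columns as predicted labels, so that 68,267 are true positives, 3,471 are missed mentions (false negatives) and 3,910 are spurious mentions (false positives). The helper name `precision_recall_f1` is hypothetical.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts read from the confusion matrix
p, r, f1 = precision_recall_f1(tp=68_267, fp=3_910, fn=3_471)
print(f"precision={p:.2%} recall={r:.2%} f1={f1:.2%}")
# precision=94.58% recall=95.16% f1=94.87%
```

Under this reading the counts reproduce the reported 94.58% precision, 95.16% recall and 94.87% F1.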

CONTACT:

mail: antoine [dot] bourgois [at] protonmail [dot] com
