Model Description

This model is a fine-tuned version of BioBERT on the GENIA dataset for constituency parsing. We adapt the approach from the paper "Constituent Parsing as Sequence Labeling" to convert each parse tree into a sequence of labels, which we use as the ground truth.
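
To make the labeling idea concrete, here is a minimal sketch of one way to turn a bracketed tree into per-word labels: each word except the last is paired with the number of ancestors it shares with the next word and the category of their lowest shared ancestor. This is an illustration under simplifying assumptions, not the code we adapted: the helper names (parse_tree, leaf_paths, encode) and the example sentence are hypothetical, counts are absolute, and unary chains and the final word are not handled.

```python
def parse_tree(text):
    """Parse a bracketed tree such as "(S (NP (DT The)))" into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def leaf_paths(node, address=(), ancestors=()):
    """Yield (word, path) pairs; path lists (address, label) for every ancestor, root first."""
    label, children = node
    here = ancestors + ((address, label),)
    for i, child in enumerate(children):
        if isinstance(child, str):
            yield child, here
        else:
            yield from leaf_paths(child, address + (i,), here)


def encode(tree):
    """Label each word (except the last) with (shared ancestor count, lowest shared category)."""
    leaves = list(leaf_paths(tree))
    labels = []
    for (word, path), (_, next_path) in zip(leaves, leaves[1:]):
        common = 0
        for a, b in zip(path, next_path):
            if a != b:
                break
            common += 1
        # The root is always shared, so common >= 1 here.
        labels.append((word, common, path[common - 1][1]))
    return labels


# Hypothetical example sentence, not taken from GENIA.
tree = parse_tree("(S (NP (DT The) (NN protein)) (VP (VBZ binds) (NP (NN DNA))))")
print(encode(tree))
# [('The', 2, 'NP'), ('protein', 1, 'S'), ('binds', 2, 'VP')]
```

With labels of this kind, parsing becomes a token classification task, which is what the fine-tuned BioBERT model predicts.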

Intended Use

This model is intended for analyzing the syntactic structure of biomedical text by constructing constituency-based parse trees.

Training Data

This model was trained on the GENIA dataset. The dataset is distributed in raw form, so it must be processed and split before training. We split it randomly, with the following result:

Dataset   Number of sentences
Train     14543
Dev       1824
Test      1824
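
A random split with these sizes could be produced along the following lines. This is a minimal sketch rather than the script we ran: the function name split_dataset, the seed, and the placeholder sentences are assumptions made for illustration.

```python
import random


def split_dataset(sentences, n_dev, n_test, seed=0):
    """Shuffle a copy of `sentences` and cut it into train/dev/test lists."""
    if n_dev + n_test > len(sentences):
        raise ValueError("dev and test sizes exceed the dataset size")
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


# Placeholder data with the same total size as the table above (14543 + 1824 + 1824).
sentences = [f"sentence-{i}" for i in range(18191)]
train, dev, test = split_dataset(sentences, n_dev=1824, n_test=1824)
print(len(train), len(dev), len(test))  # 14543 1824 1824
```

Fixing the seed means the same sentences land in the same split on every run, which keeps dev and test scores comparable across experiments.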

Training method

  • We collected the data and converted it into a trainable dataset by adapting the code from the tree2label research project.
  • tree2label could not generate labels for parts of the dataset. We list the issues and our fixes below. Once they were resolved, we moved on to fine-tuning the model.
    • Some chemical names, such as (OH), are mistaken for constituents. To solve this, we removed the brackets so that the code works.
    • The dataset also contains null constituents, which tree2label does not support. To solve this, we added a placeholder word such as NULL to each null constituent.
  • The number of labels generated is 704.
  • The pre-trained model we use is BioBERT from DMIS-Lab, which is suited to the biomedical domain. We use the .safetensors version, provided by Hugging Face staff in the pull request.
  • We froze the classifier layer during training to prevent overfitting.
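
The label count in the list above is the size of the label vocabulary collected over the converted dataset. A minimal sketch of building such a vocabulary for token classification follows, assuming each sentence has already been converted to a list of string labels. The function name build_label_vocab and the toy label strings are hypothetical and do not reflect the exact label format tree2label produces.

```python
def build_label_vocab(label_sequences):
    """Map every distinct label to an integer id and back."""
    labels = sorted({label for seq in label_sequences for label in seq})
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


# Toy label sequences in a made-up "count_category" format.
toy = [["2_NP", "1_S", "2_VP", "NONE"], ["1_S", "NONE"]]
label2id, id2label = build_label_vocab(toy)
print(len(label2id))  # 4
```

Sorting the labels before assigning ids keeps the mapping stable between runs, so a saved model and its label map stay in sync.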

Results

We trained for 4 epochs and evaluated on a Google Colab T4 GPU. Here are the results on the dev and test splits we created.

metric      dev    test
f1          81.7   81.5
precision   81.1   80.7
recall      82.9   82.9

Demo

We have included a demo. Please see the Space linked next to this model card for more information.
