mirth/chonky_distilbert_base_uncased_1

@@ -18,21 +18,76 @@ __Chonky__ is a transformer model that intelligently segments text into meaningf
-### Model Description
 The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.
 ## How to use
-### Training Data
 The model was trained to split paragraphs from the bookcorpus dataset.
-### Metrics
 | Metric   | Value |
 | -------- | ------|
@@ -41,6 +96,6 @@ The model was trained to split paragraphs from the bookcorpus dataset.
 | Recall   | 0.63  |
 | Accuracy | 0.99  |
-### Hardware
 The model was fine-tuned on 2x NVIDIA GTX 1080 Ti GPUs.

+## Model Description
 The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.
 ## How to use
+I've made a small Python library for this model: [chonky](https://github.com/mirth/chonky)
+Here is the usage:
+```
+from chonky import TextSplitter
+# on the first run it will download the transformer model
+splitter = TextSplitter(device="cpu")
+text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""
+for chunk in splitter(text):
+  print(chunk)
+  print("--")
+```
+You can also use this model with the standard NER pipeline:
+```
+from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
+model_name = "mirth/chonky_distilbert_base_uncased_1"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+id2label = {
+    0: "O",
+    1: "separator",
+}
+label2id = {
+    "O": 0,
+    "separator": 1,
+}
+model = AutoModelForTokenClassification.from_pretrained(
+    model_name,
+    num_labels=2,
+    id2label=id2label,
+    label2id=label2id,
+)
+pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
+text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""
+pipe(text)
+# Output
+[
+  {'entity_group': 'separator', 'score': 0.89515704, 'word': 'deep.', 'start': 333, 'end': 338},
+  {'entity_group': 'separator', 'score': 0.61160326, 'word': '.', 'start': 652, 'end': 653}
+]
+```
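The pipeline returns separator spans rather than finished chunks. Below is a minimal sketch of one way to turn those spans into text chunks, assuming each `end` value is the character offset just past the separator token, as in the output above. The helper name `split_at_separators` and the demo data are hypothetical, and this is not necessarily how the chonky library does it.

```python
def split_at_separators(text, separators):
    """Cut `text` right after each predicted separator span."""
    chunks = []
    start = 0
    # Process separators in document order, whatever order they arrive in.
    for sep in sorted(separators, key=lambda s: s["end"]):
        chunk = text[start:sep["end"]].strip()
        if chunk:
            chunks.append(chunk)
        start = sep["end"]
    # Whatever follows the last separator is the final chunk.
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks


# Hand-written stand-in for pipeline output (not real model predictions).
demo_text = "First topic. More on it. Second topic."
demo_separators = [{"entity_group": "separator", "start": 23, "end": 24}]

print(split_at_separators(demo_text, demo_separators))
# ['First topic. More on it.', 'Second topic.']
```

With the real pipeline, the equivalent call would be `split_at_separators(text, pipe(text))`.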
+## Training Data
 The model was trained to split paragraphs from the bookcorpus dataset.
+## Metrics
 | Metric   | Value |
 | -------- | ------|
 | Recall   | 0.63  |
 | Accuracy | 0.99  |
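Accuracy can sit far above recall when separator tokens are rare, because the many correctly labelled "O" tokens dominate the total. The sketch below illustrates this with invented counts. It assumes the metrics are computed per token, which this card does not state, and the numbers are not the model's evaluation data.

```python
def recall(tp, fn):
    """Share of true separators that were found."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Share of all tokens that were labelled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Invented counts for illustration: 1000 tokens, 10 of them true separators.
tp, fn = 6, 4     # separators found / missed
fp, tn = 2, 988   # "O" tokens wrongly flagged / correctly left alone

print(f"recall={recall(tp, fn):.2f}")                  # recall=0.60
print(f"accuracy={accuracy(tp, tn, fp, fn):.3f}")      # accuracy=0.994
```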
+## Hardware
 The model was fine-tuned on 2x NVIDIA GTX 1080 Ti GPUs.