mirth committed on
Commit
c244154
·
verified ·
1 Parent(s): c8ce218

Update README.md

  1. README.md +59 -4
README.md CHANGED
@@ -18,21 +18,76 @@ __Chonky__ is a transformer model that intelligently segments text into meaningf
18
 
19
 
20
 
21
- ### Model Description
22
 
23
  The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.
24
 
25
 
26
  ## How to use
27
 
 
28
 
 
29
 
30
- ### Training Data
31
 
32
  The model was trained to split paragraphs from the bookcorpus dataset.
33
 
34
 
35
- ### Metrics
36
 
37
  | Metric | Value |
38
  | -------- | ------|
@@ -41,6 +96,6 @@ The model was trained to split paragraphs from the bookcorpus dataset.
41
  | Recall | 0.63 |
42
  | Accuracy | 0.99 |
43
 
44
- ### Hardware
45
 
46
  The model was fine-tuned on two NVIDIA GTX 1080 Ti GPUs.
 
18
 
19
 
20
 
21
+ ## Model Description
22
 
23
  The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.
24
 
25
 
26
  ## How to use
27
 
28
+ I've made a small Python library for this model: [chonky](https://github.com/mirth/chonky)
29
 
30
+ Here is the usage:
31
 
32
+ ```
33
+ from chonky import TextSplitter
34
+
35
+ # on the first run it will download the transformer model
36
+ splitter = TextSplitter(device="cpu")
37
+
38
+ text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""
39
+
40
+ for chunk in splitter(text):
41
+     print(chunk)
42
+     print("--")
43
+ ```
44
+
45
+ You can also use this model with the standard NER pipeline:
46
+
47
+ ```
48
+ from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
49
+
50
+ model_name = "mirth/chonky_distilbert_uncased_1"
51
+
52
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
53
+
54
+ id2label = {
55
+     0: "O",
56
+     1: "separator",
57
+ }
58
+ label2id = {
59
+     "O": 0,
60
+     "separator": 1,
61
+ }
62
+
63
+ model = AutoModelForTokenClassification.from_pretrained(
64
+     model_name,
65
+     num_labels=2,
66
+     id2label=id2label,
67
+     label2id=label2id,
68
+ )
69
+
70
+ pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
71
+
72
+ text = """Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."""
73
+
74
+ pipe(text)
75
+
76
+ # Output
77
+
78
+ [
79
+     {'entity_group': 'separator', 'score': 0.89515704, 'word': 'deep.', 'start': 333, 'end': 338},
80
+     {'entity_group': 'separator', 'score': 0.61160326, 'word': '.', 'start': 652, 'end': 653}
81
+ ]
82
+
83
+ ```
84
+
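The `start` and `end` values in the pipeline output are character offsets into `text`, so one way to turn the raw result into chunks is to cut the string after each separator's `end`. The sketch below assumes that convention. `split_at_separators` and its `min_score` parameter are hypothetical names used for illustration; they are not part of `chonky` or `transformers`, and the `chonky` library may post-process the predictions differently.

```python
def split_at_separators(text, entities, min_score=0.0):
    """Cut `text` after the `end` offset of each predicted separator.

    `entities` is a list of dicts shaped like the NER pipeline output
    (only the "end" key is required; "score" is optional).
    `min_score` is a hypothetical knob for dropping low-confidence separators.
    """
    chunks = []
    cursor = 0
    for entity in sorted(entities, key=lambda e: e["end"]):
        # Skip separators the model is not confident about.
        if entity.get("score", 1.0) < min_score:
            continue
        chunk = text[cursor:entity["end"]].strip()
        if chunk:
            chunks.append(chunk)
        cursor = entity["end"]
    # Whatever follows the last separator is the final chunk.
    tail = text[cursor:].strip()
    if tail:
        chunks.append(tail)
    return chunks


if __name__ == "__main__":
    # Hand-written stand-in for pipeline output, not a real model prediction.
    sample = "My stories were awful. The first programs I wrote were on the IBM 1401."
    end = sample.index("awful.") + len("awful.")
    fake_entities = [{"entity_group": "separator", "score": 0.9, "end": end}]
    for chunk in split_at_separators(sample, fake_entities):
        print(chunk)
        print("--")
```

With the real pipeline, the call would be `split_at_separators(text, pipe(text))`. Raising `min_score` trades fewer, longer chunks for fewer spurious splits.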
85
+ ## Training Data
86
 
87
  The model was trained to split paragraphs from the bookcorpus dataset.
88
 
89
 
90
+ ## Metrics
91
 
92
  | Metric | Value |
93
  | -------- | ------|
 
96
  | Recall | 0.63 |
97
  | Accuracy | 0.99 |
98
 
99
+ ## Hardware
100
 
101
  The model was fine-tuned on two NVIDIA GTX 1080 Ti GPUs.