sahithkumar7 committed on
Commit bb883ca · verified · 1 Parent(s): 33c817b

Add new SentenceTransformer model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
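
Reviewer note: the config above enables only `pooling_mode_mean_tokens`, so the sentence embedding is the average of the token embeddings. The sketch below illustrates that idea in plain Python. It is an illustration under assumptions, not the library's code: the function name `mean_pool` and the toy vectors are hypothetical, and it assumes padding positions are excluded through an attention mask.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [total / count for total in summed]


# Toy example with 2-dimensional vectors; the real model uses 768 dimensions.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last row stands in for padding
mask = [1, 1, 0]
print(mean_pool(tokens, mask))
```

The padded row is ignored, so only the first two vectors contribute to the average.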
README.md ADDED
@@ -0,0 +1,775 @@
1
+ ---
2
+ tags:
3
+ - sentence-transformers
4
+ - sentence-similarity
5
+ - feature-extraction
6
+ - dense
7
+ - generated_from_trainer
8
+ - dataset_size:80
9
+ - loss:MultipleNegativesRankingLoss
10
+ base_model: microsoft/mpnet-base
11
+ widget:
12
+ - source_sentence: How many different active substances were detected in surface water
13
+ across all catchment areas?
14
+ sentences:
15
+ - 'metabolites were not detected in the water bodies.
16
+
17
+ 2.1.1. Antibiotics/Enzyme-Inhibitors and
18
+
19
+ Abacavir in Surface-Water
20
+
21
+ Fifty detections were found in all catchment areas in surface water, which corresponds
22
+ to 15 different active substances:
23
+
24
+ 12 antibiotics, two enzyme inhibitors, and one antiviral. The number of detections
25
+ per sampling station ranged from 0 to 7
26
+
27
+ different active substances. The Ave river-Prazins (Santo Tirso) and Serzedelo
28
+ I and II (Guimar ã es) as well as Ria
29
+
30
+ Formosa-coastal water (Faro and Olh ã o), each one with two sampling sites, showed
31
+ the most detected compounds in'
32
+ - '2. Results
33
+
34
+ 2.1. Frequency of Detections:
35
+
36
+ Antibiotics/Enzyme-Inhibitors and Abacavir
37
+
38
+ in Surface-Groundwater
39
+
40
+ During the screening framework beyond the antibiotics/enzyme-inhibitors, the antiviral
41
+ abacavir was detected. Therefore,
42
+
43
+ given the relevance of this compound, it was included in the present study. Although
44
+ enzyme inhibitors belong to the
45
+
46
+ antibiotic group, their specific pharmacological properties and detection were
47
+ sorted apart. In the present study, antibiotic
48
+
49
+ metabolites were not detected in the water bodies.
50
+
51
+ 2.1.1. Antibiotics/Enzyme-Inhibitors and
52
+
53
+ Abacavir in Surface-Water'
54
+ - 'surface water. The relatively higher detection of substances downstream of the
55
+ effluent discharge points compared with a
56
+
57
+ low detection in upstream samples could be attributed to the low efficiency in
58
+ urban wastewater treatment plants or
59
+
60
+ agricultural pressure. The environmental impact is more critical due to active
61
+ substances in drinking water or premix
62
+
63
+ medicated feeds in the veterinary site.
64
+
65
+ Furthermore, the detection of substances of exclusive human use (abacavir, tazobactam
66
+ and cilastatin) prove the weak'
67
+ - source_sentence: What group of pharmaceuticals was sulfamethazine matched to when
68
+ its quantity was missing?
69
+ sentences:
70
+ - 'ciprofloxacin
71
+
72
+ 43%
73
+
74
+ (3/7), enrofloxacin, norfloxacin, trimethoprim, lincomycin (29% (2/7), abacavir
75
+ and tetracycline
76
+
77
+ 14% (1/7). The enzyme inhibitors, namely clavulanic acid and cilastatin, were
78
+ detected once in an urban region located
79
+
80
+ well. This catchment point showed the most significant
81
+
82
+ number of pharmaceuticals. West/Tejo and Centre were the regions with the most
83
+ considerable number of substances in
84
+
85
+ groundwater, accounting for 43%. All groundwater
86
+
87
+ samples were contaminated by at least one antibiotic. Supplemental Tables S2 and
88
+ S4 contain a detailed description of
89
+
90
+ the'
91
+ - 'clarithromycin) were the only ones that demonstrated the potential to concentrate
92
+ in living organisms (log Kow ≥ 3) [14].
93
+
94
+ All the remaining antibiotics showed a relatively low log Kow and were expected
95
+ to be present mainly in surface water.
96
+
97
+ However, the soil mobility/adsorption detected The detected pharmaceuticals showed
98
+ high to moderate water solubility
99
+
100
+ and are small ionisable molecules (MW ≤ 900 g/mol). Regarding the octanol/water
101
+ partitioning coefficient (log Kow) data,'
102
+ - 'missing quantity for sulfamethazine, the sulfonamides group has been matched.
103
+
104
+ Consumption (Kg) of the detected pharmaceuticals in Portugal (2017).
105
+
106
+ 1 Amount from ESVAC Report-2017; 2 Match the sulfonamides amount; NA-not available.
107
+
108
+ Amount of detected pharmaceuticals consumption per Portuguese region. Amount of
109
+ detected pharmaceuticals
110
+
111
+ consumption per Portuguese region.'
112
+ - source_sentence: What directive sets environmental quality standards for substances
113
+ in surface waters?
114
+ sentences:
115
+ - 'As much as the specificities of each member state should be considered this issue
116
+ has become one of the European
117
+
118
+ community''s main concerns [8].
119
+
120
+ The strategies against water pollution are provided in the Water Framework Directive
121
+ [9] and the Directive on
122
+
123
+ Environmental Quality Standards that set environmental quality standards (EQS)
124
+ for the substances in surface waters
125
+
126
+ and confirm their designation as priority or priority hazardous substances [10].
127
+ Evidence of potential impacts and'
128
+ - 'seems to undertake a similar fate in the environment.
129
+
130
+ Nevertheless, due to stronger adsorption, with higher emergence in sediment, its
131
+ occurrence in the surface water is lower
132
+
133
+ [71]. The use of tetracyclines, mainly as medicated premix and oral solution for
134
+ food-producing animals [72], and the very
135
+
136
+ low bioavailability (e.g. in pig feed) [43] contribute to increasing its release
137
+ into the environment. Regarding macrolides,
138
+
139
+ erythromycin and clarithromycin exhibit a remarkable frequency of detection in
140
+ surface water samples. The most'
141
+ - 'low flows; otherwise, POCIS might be damage. In ground-waters was used one POCIS
142
+ unit/well. Due to the high sorption
143
+
144
+ capacity, POCIS was deployed approximately for 30 days, allowing the polar organic
145
+ compounds adsorbed to be in the
146
+
147
+ equilibrium stage with the active substances in an aqueous medium. In the laboratory,
148
+ POCIS disks were frozen until
149
+
150
+ extraction.
151
+
152
+ 4.2.2. Qualitative Analysis Method Used
153
+
154
+ for the Characterisation of Antibiotics in
155
+
156
+ Surface-Groundwater'
157
+ - source_sentence: What is the molecular weight range of the detected pharmaceuticals?
158
+ sentences:
159
+ - '2.3. Physicochemical Properties and Key Pharmacokinetic Features of Detected
160
+ Pharmaceuticals 2.3. Physicochemical
161
+
162
+ Properties and Key Pharmacokinetic Features of Detected Pharmaceuticals
163
+
164
+ The detected pharmaceuticals showed high to moderate water solubility and are
165
+ small ionisable molecules (MW ≤ 900
166
+
167
+ g/mol). Regarding the octanol/water partitioning coefficient (log Kow) data, macrolide
168
+ antibiotics (azithromycin and
169
+
170
+ clarithromycin) were the only ones that demonstrated the potential to concentrate
171
+ in living organisms (log Kow ≥ 3) [14].'
172
+ - 'As much as the specificities of each member state should be considered this issue
173
+ has become one of the European
174
+
175
+ community''s main concerns [8].
176
+
177
+ The strategies against water pollution are provided in the Water Framework Directive
178
+ [9] and the Directive on
179
+
180
+ Environmental Quality Standards that set environmental quality standards (EQS)
181
+ for the substances in surface waters
182
+
183
+ and confirm their designation as priority or priority hazardous substances [10].
184
+ Evidence of potential impacts and'
185
+ - 'passive samplers in groundwater considered the well technical features; the depth
186
+ and groundwater level were previously
187
+
188
+ determined since they should be detected at the superficial levels. The passive
189
+ sampler was placed using a water level
190
+
191
+ meter, 2 m below the groundwater level. The sampler always remained immersed in
192
+ water, avoiding extractions and the
193
+
194
+ regional lowering of the water table [104]. For the sampling stations, sites of
195
+ different environmental pressures were
196
+
197
+ considered, specifically urban, agricultural area/animal production, and aquaculture.
198
+ The information regarding the'
199
+ - source_sentence: What was the most frequently identified pharmaceutical in the groundwater
200
+ samples?
201
+ sentences:
202
+ - 'Pharmacokinetic characteristics may represent key features in understanding antibiotics
203
+ occurrence [62]. Most antibiotics
204
+
205
+ are not completely metabolised in humans and animals; thus, a high percentage
206
+ of the active substance (40-90%) is
207
+
208
+ excreted in urine/faeces in the unchanged form. These molecules are discharged
209
+ into water and soil through wastewater,
210
+
211
+ animal manure, and sewage sludge, frequently used as fertilisers to agricultural
212
+ lands. Also, it is expected that the
213
+
214
+ hospital effluent will contribute partly to the pharmaceutical load in the wastewater
215
+ treatment plant influence [63].'
216
+ - 'many domestic and livestock animals. Several formulations of powder for administration
217
+ in drinking water and medicated
218
+
219
+ premix are available for poultry and pigs. The excretion of amoxicillin is predominantly
220
+ renal; more than 80% of the parent
221
+
222
+ drug is recovered unchanged in the urine. While bioavailability of 75 to 80% is
223
+ reported in humans, a low value (~30%)
224
+
225
+ was observed in pigs, calves, foals, and pigeons [26,52]. Maybe this last group
226
+ of animals contribute more sharply to the'
227
+ - 'from one to five compounds. The most frequently identified pharmaceuticals, in
228
+ decreasing order, were ciprofloxacin 43%
229
+
230
+ (3/7), enrofloxacin, norfloxacin, trimethoprim, lincomycin (29% (2/7), abacavir
231
+ and tetracycline 14% (1/7). The enzyme
232
+
233
+ inhibitors, namely clavulanic acid and cilastatin, were detected once in an urban
234
+ region located well. This catchment point
235
+
236
+ showed the most significant number of pharmaceuticals. West/Tejo and Centre were
237
+ the regions with the most
238
+
239
+ considerable number of substances in groundwater, accounting for 43%. All groundwater
240
+ samples were contaminated by'
241
+ pipeline_tag: sentence-similarity
242
+ library_name: sentence-transformers
243
+ metrics:
244
+ - cosine_accuracy
245
+ model-index:
246
+ - name: SentenceTransformer based on microsoft/mpnet-base
247
+ results:
248
+ - task:
249
+ type: triplet
250
+ name: Triplet
251
+ dataset:
252
+ name: initial test
253
+ type: initial_test
254
+ metrics:
255
+ - type: cosine_accuracy
256
+ value: 0.9599999785423279
257
+ name: Cosine Accuracy
258
+ - task:
259
+ type: triplet
260
+ name: Triplet
261
+ dataset:
262
+ name: final test
263
+ type: final_test
264
+ metrics:
265
+ - type: cosine_accuracy
266
+ value: 0.6800000071525574
267
+ name: Cosine Accuracy
268
+ - type: cosine_accuracy
269
+ value: 0.8999999761581421
270
+ name: Cosine Accuracy
271
+ - type: cosine_accuracy
272
+ value: 0.9200000166893005
273
+ name: Cosine Accuracy
274
+ - type: cosine_accuracy
275
+ value: 0.9399999976158142
276
+ name: Cosine Accuracy
277
+ - type: cosine_accuracy
278
+ value: 0.9599999785423279
279
+ name: Cosine Accuracy
280
+ - type: cosine_accuracy
281
+ value: 0.9599999785423279
282
+ name: Cosine Accuracy
283
+ - type: cosine_accuracy
284
+ value: 0.9599999785423279
285
+ name: Cosine Accuracy
286
+ - type: cosine_accuracy
287
+ value: 0.9599999785423279
288
+ name: Cosine Accuracy
289
+ - type: cosine_accuracy
290
+ value: 0.9599999785423279
291
+ name: Cosine Accuracy
292
+ - type: cosine_accuracy
293
+ value: 0.9800000190734863
294
+ name: Cosine Accuracy
295
+ - type: cosine_accuracy
296
+ value: 0.9800000190734863
297
+ name: Cosine Accuracy
298
+ ---
299
+
300
+ # SentenceTransformer based on microsoft/mpnet-base
301
+
302
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base) on the json dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
303
+
304
+ ## Model Details
305
+
306
+ ### Model Description
307
+ - **Model Type:** Sentence Transformer
308
+ - **Base model:** [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base) <!-- at revision 6996ce1e91bd2a9c7d7f61daec37463394f73f09 -->
309
+ - **Maximum Sequence Length:** 512 tokens
310
+ - **Output Dimensionality:** 768 dimensions
311
+ - **Similarity Function:** Cosine Similarity
312
+ - **Training Dataset:**
313
+ - json
314
+ <!-- - **Language:** Unknown -->
315
+ <!-- - **License:** Unknown -->
316
+
317
+ ### Model Sources
318
+
319
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
320
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
321
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
322
+
323
+ ### Full Model Architecture
324
+
325
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'MPNetModel'})
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+ )
+ ```
331
+
332
+ ## Usage
333
+
334
+ ### Direct Usage (Sentence Transformers)
335
+
336
+ First install the Sentence Transformers library:
337
+
338
+ ```bash
+ pip install -U sentence-transformers
+ ```
341
+
342
+ Then you can load this model and run inference.
343
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("sahithkumar7/mpnet-base-finetuned-iter01")
+ # Run inference
+ sentences = [
+     'What was the most frequently identified pharmaceutical in the groundwater samples?',
+     'from one to five compounds. The most frequently identified pharmaceuticals, in decreasing order, were ciprofloxacin 43%\n(3/7), enrofloxacin, norfloxacin, trimethoprim, lincomycin (29% (2/7), abacavir and tetracycline 14% (1/7). The enzyme\ninhibitors, namely clavulanic acid and cilastatin, were detected once in an urban region located well. This catchment point\nshowed the most significant number of pharmaceuticals. West/Tejo and Centre were the regions with the most\nconsiderable number of substances in groundwater, accounting for 43%. All groundwater samples were contaminated by',
+     'Pharmacokinetic characteristics may represent key features in understanding antibiotics occurrence [62]. Most antibiotics\nare not completely metabolised in humans and animals; thus, a high percentage of the active substance (40-90%) is\nexcreted in urine/faeces in the unchanged form. These molecules are discharged into water and soil through wastewater,\nanimal manure, and sewage sludge, frequently used as fertilisers to agricultural lands. Also, it is expected that the\nhospital effluent will contribute partly to the pharmaceutical load in the wastewater treatment plant influence [63].',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # (3, 768)
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities)
+ # tensor([[ 1.0000, 0.4988, -0.0391],
+ #         [ 0.4988, 1.0000, 0.0047],
+ #         [-0.0391, 0.0047, 1.0000]])
+ ```
365
+
366
+ <!--
367
+ ### Direct Usage (Transformers)
368
+
369
+ <details><summary>Click to see the direct usage in Transformers</summary>
370
+
371
+ </details>
372
+ -->
373
+
374
+ <!--
375
+ ### Downstream Usage (Sentence Transformers)
376
+
377
+ You can finetune this model on your own dataset.
378
+
379
+ <details><summary>Click to expand</summary>
380
+
381
+ </details>
382
+ -->
383
+
384
+ <!--
385
+ ### Out-of-Scope Use
386
+
387
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
388
+ -->
389
+
390
+ ## Evaluation
391
+
392
+ ### Metrics
393
+
394
+ #### Triplet
395
+
396
+ * Datasets: `initial_test` and `final_test` (`final_test` was logged 11 times)
397
+ * Evaluated with [<code>TripletEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator)
398
+
399
+ | Metric | initial_test | final_test |
400
+ |:--------------------|:-------------|:-----------|
401
+ | **cosine_accuracy** | **0.96** | **0.98** |
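
Reviewer note: the `cosine_accuracy` reported above can be read as the share of (anchor, positive, negative) triplets in which the anchor is closer, by cosine similarity, to the positive than to the negative. The sketch below shows that calculation in plain Python on toy vectors. It is a minimal illustration, not the `TripletEvaluator` code: the helper names and the toy data are hypothetical, and it assumes a zero margin.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def triplet_cosine_accuracy(triplets: List[Tuple[Vector, Vector, Vector]]) -> float:
    """Fraction of triplets where cos(anchor, positive) > cos(anchor, negative)."""
    if not triplets:
        return 0.0
    hits = sum(1 for a, p, n in triplets if cosine(a, p) > cosine(a, n))
    return hits / len(triplets)


toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # positive is closer: counted as correct
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9]),  # negative is closer: counted as wrong
]
print(triplet_cosine_accuracy(toy_triplets))
```

In practice the vectors would come from `model.encode` applied to the anchor, positive, and negative texts.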
402
+
403
+ <!--
404
+ ## Bias, Risks and Limitations
405
+
406
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
407
+ -->
408
+
409
+ <!--
410
+ ### Recommendations
411
+
412
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
413
+ -->
414
+
415
+ ## Training Details
416
+
417
+ ### Training Dataset
418
+
419
+ #### json
420
+
421
+ * Dataset: json
422
+ * Size: 80 training samples
423
+ * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
424
+ * Approximate statistics based on the first 80 samples:
425
+ | | anchor | positive | negative |
426
+ |:--------|:----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
427
+ | type | string | string | string |
428
+ | details | <ul><li>min: 9 tokens</li><li>mean: 16.14 tokens</li><li>max: 33 tokens</li></ul> | <ul><li>min: 48 tokens</li><li>mean: 125.65 tokens</li><li>max: 218 tokens</li></ul> | <ul><li>min: 48 tokens</li><li>mean: 122.97 tokens</li><li>max: 211 tokens</li></ul> |
429
+ * Samples:
430
+ | anchor | positive | negative |
431
+ |:-----------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
432
+ | <code>Which two macrolide antibiotics are frequently detected in surface water samples?</code> | <code>seems to undertake a similar fate in the environment.<br>Nevertheless, due to stronger adsorption, with higher emergence in sediment, its occurrence in the surface water is lower<br>[71]. The use of tetracyclines, mainly as medicated premix and oral solution for food-producing animals [72], and the very<br>low bioavailability (e.g. in pig feed) [43] contribute to increasing its release into the environment. Regarding macrolides,<br>erythromycin and clarithromycin exhibit a remarkable frequency of detection in surface water samples. The most</code> | <code>Nonetheless, besides the sorption capacity, these antibiotics have high solubility in water. Crucial routes for these<br>substances into the environment are manure from animal production and sewage sludge from wastewater treatment<br>plant (WWTP) used as fertilisers. Therefore, these substances have been evidenced in topsoil samples [68]. These<br>quinolones and other antibiotics, for instance, norfloxacin and tetracycline, have been identified in groundwater samples<br>despite being influenced by sorption processes. They were not readily degraded; instead, the input into groundwater</code> |
433
+ | <code>What antimicrobial drugs were identified in the survey besides macrolides?</code> | <code>is one of the most frequently pharmaceutical in representative rivers [74,75]. The three macrolides identified in our<br>detection survey are included since 2018 in the first 'watch list' [76].<br>Another group of antimicrobial drugs identified in our survey were sulfamethoxazole/trimethoprim and sulfamethazine.<br>Sulfamethoxazole/trimethoprim are often used combined since the effectiveness of sulfonamides is enhanced. In the<br>present study, the detection of both substances was comparable; however, trimethoprim was detected in groundwater.</code> | <code>upstream samples obtained in rural locations was demonstrated and could be attributed to a low efficiency in the urban<br>wastewater treatment plants or due to agricultural pressure.<br>The higher frequency of detection for most substances was observed in the Ave river and Ria Formosa, confirming that<br>several effluents impact these water bodies from urban wastewater treatment plants and livestock production.<br>Pharmacokinetic characteristics may represent key features in understanding antibiotics occurrence [62]. Most antibiotics</code> |
434
+ | <code>How long was the observational period of the antibiotic survey in Portugal?</code> | <code>of antibiotics and their metabolites in surface- groundwater. It seeks to reflect the current demographic, spatial, drug<br>consumption, and drug profile on an observational period of 3 years in Portugal. The greatest challenge of this survey<br>data will be to promote the ecopharmacovigilance framework development shortly to implement measures for avoiding<br>misuse/overuse of antibiotics and slow down emission and antibiotic resistance.<br>2. Results<br>2.1. Frequency of Detections:<br>Antibiotics/Enzyme-Inhibitors and Abacavir<br>in Surface-Groundwater</code> | <code>despite being influenced by sorption processes. They were not readily degraded; instead, the input into groundwater<br>could be due to livestock farming pressure, namely by spreading manure in the soil or the possible sewage sludge<br>application in the area. High clay and low sand content in soils can decrease the mobility of pharmaceuticals, which is<br>attributed to clay intense exchange capacity. Thus, soil properties (e.g. particle composition) are a significant, influential</code> |
435
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
436
+ ```json
+ {
+     "scale": 20.0,
+     "similarity_fct": "cos_sim"
+ }
+ ```
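
Reviewer note: to make the two loss parameters concrete, the sketch below shows one common way to write a multiple-negatives ranking objective: each anchor is scored against every positive and negative in the batch with cosine similarity, the scores are multiplied by `scale`, and cross-entropy is taken with the anchor's own positive as the target. This is a plain-Python illustration under those assumptions, not the library implementation; the function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mnr_loss(anchors: List[Vector], positives: List[Vector],
             negatives: List[Vector], scale: float = 20.0) -> float:
    """Mean cross-entropy where anchor i should rank positive i first."""
    candidates = list(positives) + list(negatives)
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]
print(mnr_loss(anchors, positives, negatives))        # well aligned: close to zero
print(mnr_loss(anchors, positives[::-1], negatives))  # swapped positives: large loss
```

A larger `scale` sharpens the softmax, so small differences in cosine similarity translate into larger differences in loss.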
442
+
443
+ ### Evaluation Dataset
444
+
445
+ #### json
446
+
447
+ * Dataset: json
448
+ * Size: 20 evaluation samples
449
+ * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
450
+ * Approximate statistics based on the first 20 samples:
451
+ | | anchor | positive | negative |
452
+ |:--------|:----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------|
453
+ | type | string | string | string |
454
+ | details | <ul><li>min: 11 tokens</li><li>mean: 16.4 tokens</li><li>max: 25 tokens</li></ul> | <ul><li>min: 76 tokens</li><li>mean: 113.65 tokens</li><li>max: 148 tokens</li></ul> | <ul><li>min: 89 tokens</li><li>mean: 118.8 tokens</li><li>max: 162 tokens</li></ul> |
455
+ * Samples:
456
+ | anchor | positive | negative |
457
+ |:-----------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
458
+ | <code>What percentage of unchanged excretion did the most significant number of detected substances show?</code> | <code>coefficients were not available for lincomycin, clavulanic acid and cilastatin.<br>Physicochemical properties of detected pharmaceuticals.<br>1 Data retrieved from [16]; 2 Data retrieved from [17]; 3 Data retrieved from [18]; 4 Data retrieved from [19]; 5<br>Data retrieved from [20];<br>6 Data retrieved from [21]; 7 Data retrieved from [22]; 8 Data retrieved from [23]; 9 Data retrieved from [24]; 10<br>Data retrieved from [25];<br>NA-not available.<br>The most significant number of detected substances showed a percentage of unchanged excretion higher than 40%.</code> | <code>1. Introduction<br>Antibiotics are a critical component of human and veterinary modern medicine, developed to produce desirable or<br>beneficial effects on infections induced by pathogens. Like most pharmaceuticals, antibiotics tend to be small organic<br>polar compounds, generally ionisable, ordinarily subject to a metabolism or biotransformation process by the organism to<br>be eliminated more efficiently [1,2]. The excretion of these compounds and their metabolites occurs mainly through urine,</code> |
459
+ | <code>How many kilograms of abacavir were detected in Portugal in 2017?</code> | <code>Regarding the different regions, it has been concluded that North and West/Tejo were the regions with the higher<br>consuming values. Both regions presented a significant value (33%) for the abacavir. For the detected antiviral abacavir,<br>an amount of 1458 kg has been observed.<br>Regarding antibiotics used in veterinary medicine, the regional amount was not available. Likewise, due to the reported<br>missing quantity for sulfamethazine, the sulfonamides group has been matched.<br>Consumption (Kg) of the detected pharmaceuticals in Portugal (2017).</code> | <code>43%<br>(3/7), enrofloxacin, norfloxacin, trimethoprim, lincomycin (29% (2/7), abacavir and tetracycline<br>14% (1/7). The enzyme inhibitors, namely clavulanic acid and cilastatin, were detected once in an urban region located<br>well. This catchment point showed the most significant<br>number of pharmaceuticals. West/Tejo and Centre were the regions with the most considerable number of substances in<br>groundwater, accounting for 43%. All groundwater<br>samples were contaminated by at least one antibiotic. Supplemental Tables S2 and S4 contain a detailed description of<br>the</code> |
460
+ | <code>What must marketing authorisation procedures for medicines include since 2006?</code> | <code>substances in passive samplers [7]. Since 2006, marketing authorisation procedures for both human and veterinary<br>medicines must include an environmental risk assessment that comprises a prospective exposure assessment,<br>underestimating the possible impact and the occurrence of antibiotics after years of consumption. Ultimately, the potential<br>risk may not be correctly anticipated. It becomes urgent to generate new data, mainly to refine exposure assessments.<br>As much as the specificities of each member state should be considered this issue has become one of the European</code> | <code>clarithromycin/erythromycin, tetracycline, sulfamethoxazole, and abacavir. In groundwater, enrofloxacin/ciprofloxacin,<br>norfloxacin, trimethoprim, lincomycin, abacavir and tetracycline were recovered. Metabolites were not detected in water<br>bodies. Noticeable was the detection of enzyme inhibitors, tazobactam and cilastatin, which are both for exclusive<br>hospital use. The North region and Algarve (South) were the areas with the most significant frequency of substances in<br>surface water. The relatively higher detection of substances downstream of the effluent discharge points compared with a</code> |
461
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
462
+ ```json
+ {
+     "scale": 20.0,
+     "similarity_fct": "cos_sim"
+ }
+ ```
468
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `num_train_epochs`: 1
+ - `warmup_ratio`: 0.1
+ - `fp16`: True
+ - `batch_sampler`: no_duplicates
+
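The `no_duplicates` batch sampler pairs naturally with an in-batch-negatives loss: if the same text appeared twice in one batch, a true match would be treated as a negative. The sketch below shows the idea with a greedy grouping in plain Python. The function name and the sample data are hypothetical, and the real sampler in Sentence Transformers differs in details such as shuffling and handling of incomplete batches.

```python
def no_duplicate_batches(pairs, batch_size):
    """Greedily group (anchor, positive) pairs so no text repeats within a batch.

    Returns a list of batches, each a list of indices into `pairs`.
    Pairs that would introduce a repeated text are deferred to a later batch.
    """
    remaining = list(range(len(pairs)))
    batches = []
    while remaining:
        batch, seen, deferred = [], set(), []
        for idx in remaining:
            texts = set(pairs[idx])
            if len(batch) < batch_size and not (texts & seen):
                batch.append(idx)
                seen |= texts
            else:
                deferred.append(idx)
        batches.append(batch)
        remaining = deferred
    return batches


# Hypothetical data: pairs 0 and 1 share a passage, pairs 0 and 3 share a question.
pairs = [("q1", "p1"), ("q2", "p1"), ("q3", "p3"), ("q1", "p4")]
batches = no_duplicate_batches(pairs, batch_size=2)
```

In this toy run, pairs 0 and 2 land in the first batch and pairs 1 and 3 in the second, so no batch contains the same question or passage twice.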
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+ - `num_train_epochs`: 1
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.1
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: True
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: no_duplicates
+ - `multi_dataset_batch_sampler`: proportional
+ - `router_mapping`: {}
+ - `learning_rate_mapping`: {}
+
+ </details>
+
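The combination `learning_rate: 5e-05`, `lr_scheduler_type: linear`, `warmup_ratio: 0.1` and `warmup_steps: 0` describes a learning rate that ramps up linearly over the first tenth of the optimizer steps and then decays linearly to zero. The following sketch reproduces that shape in plain Python. The function name is an assumption for illustration, and the rounding of the warmup length (ceiling of `total_steps * warmup_ratio`) mirrors how the Transformers trainer is understood to derive it when `warmup_steps` is 0.

```python
import math


def linear_warmup_lr(step, total_steps, base_lr=5e-05, warmup_ratio=0.1):
    """Learning rate at a given optimizer step for linear warmup plus linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp from 0 up to base_lr during warmup.
        factor = step / max(1, warmup_steps)
    else:
        # Decay from base_lr down to 0 over the remaining steps.
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor


# Hypothetical run length of 20 optimizer steps, chosen only for illustration.
schedule = [linear_warmup_lr(s, total_steps=20) for s in range(21)]
```

With runs as short as the ones in the training logs, the warmup phase lasts only a few steps, so most of each run is spent in the decay phase.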
+ ### Training Logs
+ <details><summary>Click to expand</summary>
+
+ | Epoch | Step | Training Loss | Validation Loss | initial_test_cosine_accuracy | final_test_cosine_accuracy |
+ |:-----:|:----:|:-------------:|:---------------:|:----------------------------:|:--------------------------:|
+ | -1 | -1 | - | - | 0.7800 | - |
+ | 0.2 | 1 | 3.3315 | - | - | - |
+ | 0.4 | 2 | 3.0922 | - | - | - |
+ | 0.6 | 3 | 3.2635 | - | - | - |
+ | 0.8 | 4 | 3.0702 | - | - | - |
+ | 1.0 | 5 | 3.3282 | - | - | - |
+ | -1 | -1 | - | - | - | 0.6800 |
+ | 0.2 | 1 | 2.9487 | - | - | - |
+ | 0.4 | 2 | 2.9845 | - | - | - |
+ | 0.6 | 3 | 2.935 | - | - | - |
+ | 0.8 | 4 | 3.0702 | - | - | - |
+ | 1.0 | 5 | 3.0039 | - | - | - |
+ | 1.2 | 6 | 2.4806 | - | - | - |
+ | 1.4 | 7 | 2.2646 | - | - | - |
+ | 1.6 | 8 | 1.8101 | - | - | - |
+ | 1.8 | 9 | 1.3463 | - | - | - |
+ | 2.0 | 10 | 1.942 | - | - | - |
+ | -1 | -1 | - | - | - | 0.9000 |
+ | 0.2 | 1 | 2.2356 | - | - | - |
+ | 0.4 | 2 | 1.0123 | - | - | - |
+ | 0.6 | 3 | 1.2411 | - | - | - |
+ | 0.8 | 4 | 0.9194 | - | - | - |
+ | 1.0 | 5 | 0.891 | - | - | - |
+ | 1.2 | 6 | 0.602 | - | - | - |
+ | 1.4 | 7 | 0.5426 | - | - | - |
+ | 1.6 | 8 | 0.5738 | - | - | - |
+ | 1.8 | 9 | 0.2678 | - | - | - |
+ | 2.0 | 10 | 0.7113 | - | - | - |
+ | 2.2 | 11 | 0.2911 | - | - | - |
+ | 2.4 | 12 | 0.4745 | - | - | - |
+ | 2.6 | 13 | 0.4188 | - | - | - |
+ | 2.8 | 14 | 0.3708 | - | - | - |
+ | 3.0 | 15 | 0.2882 | - | - | - |
+ | -1 | -1 | - | - | - | 0.9200 |
+ | 0.2 | 1 | 0.5156 | - | - | - |
+ | 0.4 | 2 | 0.0749 | - | - | - |
+ | 0.6 | 3 | 0.0634 | - | - | - |
+ | 0.8 | 4 | 0.0534 | - | - | - |
+ | 1.0 | 5 | 0.019 | - | - | - |
+ | 1.2 | 6 | 0.0682 | - | - | - |
+ | 1.4 | 7 | 0.0381 | - | - | - |
+ | 1.6 | 8 | 0.171 | - | - | - |
+ | 1.8 | 9 | 0.1188 | - | - | - |
+ | 2.0 | 10 | 0.1861 | - | - | - |
+ | 2.2 | 11 | 0.0895 | - | - | - |
+ | 2.4 | 12 | 0.2492 | - | - | - |
+ | 2.6 | 13 | 0.0964 | - | - | - |
+ | 2.8 | 14 | 0.2424 | - | - | - |
+ | 3.0 | 15 | 0.1096 | - | - | - |
+ | 3.2 | 16 | 0.1981 | - | - | - |
+ | 3.4 | 17 | 0.1438 | - | - | - |
+ | 3.6 | 18 | 0.3454 | - | - | - |
+ | 3.8 | 19 | 0.4011 | - | - | - |
+ | 4.0 | 20 | 0.1591 | 0.5567 | 0.9400 | - |
+ | -1 | -1 | - | - | - | 0.9400 |
+ | 0.125 | 1 | 0.0594 | - | - | - |
+ | 0.25 | 2 | 0.0584 | - | - | - |
+ | 0.375 | 3 | 0.0146 | - | - | - |
+ | 0.5 | 4 | 0.0542 | - | - | - |
+ | 0.625 | 5 | 0.0965 | - | - | - |
+ | 0.75 | 6 | 0.2209 | - | - | - |
+ | 0.875 | 7 | 0.0312 | - | - | - |
+ | 1.0 | 8 | 0.1142 | - | - | - |
+ | -1 | -1 | - | - | - | 0.9600 |
+ | 0.125 | 1 | 0.0082 | - | - | - |
+ | 0.25 | 2 | 0.004 | - | - | - |
+ | 0.375 | 3 | 0.001 | - | - | - |
+ | 0.5 | 4 | 0.0118 | - | - | - |
+ | 0.625 | 5 | 0.0508 | - | - | - |
+ | 0.75 | 6 | 0.0816 | - | - | - |
+ | 0.875 | 7 | 0.0149 | - | - | - |
+ | 1.0 | 8 | 0.0163 | - | - | - |
+ | 1.125 | 9 | 0.038 | - | - | - |
+ | 1.25 | 10 | 0.0618 | - | - | - |
+ | 1.375 | 11 | 0.0097 | - | - | - |
+ | 1.5 | 12 | 0.0368 | - | - | - |
+ | 1.625 | 13 | 0.0212 | - | - | - |
+ | 1.75 | 14 | 0.0072 | - | - | - |
+ | 1.875 | 15 | 0.0037 | - | - | - |
+ | 2.0 | 16 | 0.128 | - | - | - |
+ | -1 | -1 | - | - | - | 0.9600 |
+ | 0.125 | 1 | 0.0012 | - | - | - |
+ | 0.25 | 2 | 0.0003 | - | - | - |
+ | 0.375 | 3 | 0.0008 | - | - | - |
+ | 0.5 | 4 | 0.0008 | - | - | - |
+ | 0.625 | 5 | 0.0013 | - | - | - |
+ | 0.75 | 6 | 0.0743 | - | - | - |
+ | 0.875 | 7 | 0.0024 | - | - | - |
+ | 1.0 | 8 | 0.001 | - | - | - |
+ | 1.125 | 9 | 0.0024 | - | - | - |
+ | 1.25 | 10 | 0.01 | - | - | - |
+ | 1.375 | 11 | 0.0009 | - | - | - |
+ | 1.5 | 12 | 0.1912 | - | - | - |
+ | 1.625 | 13 | 0.0024 | - | - | - |
+ | 1.75 | 14 | 0.002 | - | - | - |
+ | 1.875 | 15 | 0.0038 | - | - | - |
+ | 2.0 | 16 | 0.1492 | - | - | - |
+ | 2.125 | 17 | 0.004 | - | - | - |
+ | 2.25 | 18 | 0.0123 | - | - | - |
+ | 2.375 | 19 | 0.0348 | - | - | - |
+ | 2.5 | 20 | 0.0068 | 0.5351 | 0.9600 | - |
+ | 2.625 | 21 | 0.1679 | - | - | - |
+ | 2.75 | 22 | 0.0123 | - | - | - |
+ | 2.875 | 23 | 0.0934 | - | - | - |
+ | 3.0 | 24 | 0.0048 | - | - | - |
+ | -1 | -1 | - | - | - | 0.9600 |
+ | 0.2 | 1 | 0.0763 | - | - | - |
+ | 0.4 | 2 | 0.0119 | - | - | - |
+ | 0.6 | 3 | 0.0019 | - | - | - |
+ | 0.8 | 4 | 0.0034 | - | - | - |
+ | 1.0 | 5 | 0.001 | - | - | - |
+ | -1 | -1 | - | - | - | 0.9800 |
+
+ </details>
+
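The `initial_test_cosine_accuracy` and `final_test_cosine_accuracy` columns read like the output of a triplet-style evaluator: the share of (anchor, positive, negative) triplets for which the anchor is closer to the positive than to the negative under cosine similarity. That reading is an assumption, since the card does not show the evaluator setup. A minimal sketch of such a metric, with hypothetical function names and toy vectors:

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_cosine_accuracy(triplets):
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if cos_sim(anchor, positive) > cos_sim(anchor, negative)
    )
    return correct / len(triplets)


# Two toy triplets: the first is ranked correctly, the second is not.
triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.1, 0.9]),
    ([0.0, 1.0], [0.8, 0.2], [0.2, 0.8]),
]
accuracy = triplet_cosine_accuracy(triplets)
```

Every reported accuracy in the table is a multiple of 0.02, which would be consistent with a test set of about 50 triplets, though the card does not state the size.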
+ ### Framework Versions
+ - Python: 3.11.13
+ - Sentence Transformers: 5.0.0
+ - Transformers: 4.52.4
+ - PyTorch: 2.6.0+cu124
+ - Accelerate: 1.8.1
+ - Datasets: 3.6.0
+ - Tokenizers: 0.21.2
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "architectures": [
+     "MPNetModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "mpnet",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "relative_attention_num_buckets": 32,
+   "torch_dtype": "float32",
+   "transformers_version": "4.52.4",
+   "vocab_size": 30527
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "model_type": "SentenceTransformer",
+   "__version__": {
+     "sentence_transformers": "5.0.0",
+     "transformers": "4.52.4",
+     "pytorch": "2.6.0+cu124"
+   },
+   "prompts": {
+     "query": "",
+     "document": ""
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6a3fe86dee3b0d3265329a58a8bd973acb1baf97e46abdf20fb4da9b39ecc8b2
+ size 437967672
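The three lines above are a Git LFS pointer, not the weights themselves: the repository stores the spec version, a SHA-256 object id, and the byte size, while the binary lives in LFS storage. As a small sketch, the pointer can be parsed as space-separated key/value lines. The function name is hypothetical, and the parameter estimate assumes 4 bytes per weight (matching `"torch_dtype": "float32"` in `config.json`) and ignores the safetensors header, so it is a slight overestimate.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:6a3fe86dee3b0d3265329a58a8bd973acb1baf97e46abdf20fb4da9b39ecc8b2
size 437967672
"""

info = parse_lfs_pointer(pointer)
# Rough upper bound on the parameter count, assuming float32 (4 bytes each).
approx_params = info["size_bytes"] // 4
```

The estimate comes out at roughly 109 million parameters, which is in the range expected for a 12-layer, 768-dimensional MPNet encoder.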
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
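`modules.json` chains two stages: the Transformer produces one vector per token, and the Pooling module (configured in `1_Pooling/config.json` with `pooling_mode_mean_tokens: true`) collapses them into a single sentence embedding. The sketch below illustrates masked mean pooling in plain Python, using short toy vectors in place of the real 768-dimensional outputs. The function name and inputs are assumptions for illustration only.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention-mask entry is 1, skipping padding."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                summed[j] += value
    count = max(count, 1)  # guard against an all-padding input
    return [s / count for s in summed]


# Two real tokens followed by one padding position (hypothetical values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
```

Because the padding position is masked out, its large values do not affect the result; the sentence embedding is the average of the first two token vectors only.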
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,66 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "104": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "30526": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "do_lower_case": true,
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "MPNetTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff