Update README.md
## 简介 Brief Introduction

基于simcse无监督版本,用搜集整理的中文NLI数据进行simcse有监督任务的训练。在中文句子对任务上有良好的效果。

**Erlangshen-SimCSE-110M-Chinese** is based on the unsupervised version of SimCSE and is further trained on the supervised SimCSE task with collected and curated Chinese NLI data. It performs well on Chinese sentence-pair tasks.
| 需求 Demand | 任务 Task | 系列 Series | 模型 Model | 参数 Parameter | 额外 Extra |
| :----: | :----: | :----: | :----: | :----: | :----: |
| 通用 General | 自然语言理解 NLU | 二郎神 Erlangshen | Bert | 110M | 中文 Chinese |
## 模型信息 Model Information
### 加载模型 Loading Models

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('IDEA-CCNL/Erlangshen-SimCSE-110M-Chinese')
model = AutoModelForMaskedLM.from_pretrained('IDEA-CCNL/Erlangshen-SimCSE-110M-Chinese')
```
### 使用示例 Usage Examples

```python
import torch
from sklearn.metrics.pairwise import cosine_similarity

texta = '今天天气真不错,我们去散步吧!'
textb = '今天天气真糟糕,还是在宅家里写bug吧!'

inputs_a = tokenizer(texta, return_tensors="pt")
inputs_b = tokenizer(textb, return_tensors="pt")

# Disable gradient tracking so the hidden states can be converted to NumPy arrays
with torch.no_grad():
    outputs_a = model(**inputs_a, output_hidden_states=True)
    outputs_b = model(**inputs_b, output_hidden_states=True)

# Use the [CLS] token's last-layer hidden state as the sentence embedding
texta_embedding = outputs_a.hidden_states[-1][:, 0, :].squeeze()
textb_embedding = outputs_b.hidden_states[-1][:, 0, :].squeeze()

# if you use cuda, move the embeddings to cpu first, e.g. texta_embedding.cpu().numpy()
similarity_score = cosine_similarity(texta_embedding.reshape(1, -1),
                                     textb_embedding.reshape(1, -1))[0][0]
```
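A common next step after extracting embeddings is ranking a set of candidate sentences against a query by cosine similarity. The sketch below is illustrative and not part of the model card: `rank_by_similarity` is a hypothetical helper, and random vectors stand in for the model's 768-dimensional [CLS] embeddings.

```python
import numpy as np

def rank_by_similarity(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Return candidate row indices sorted from most to least cosine-similar."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argsort(-(c @ q))

rng = np.random.default_rng(0)
query = rng.normal(size=768)            # stand-in for a [CLS] embedding
candidates = rng.normal(size=(5, 768))  # stand-ins for candidate embeddings
candidates[2] = query + 0.01 * rng.normal(size=768)  # near-duplicate of the query

order = rank_by_similarity(query, candidates)
# the near-duplicate (index 2) ranks first
```

With the real model, `query` and each row of `candidates` would come from the `hidden_states[-1][:, 0, :]` pooling shown above.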
## 引用 Citation