utrobinmv committed
Commit 6033de2 · 1 Parent(s): d95876b

add readme

Files changed (1)
  1. README.md (+39 −9)
README.md CHANGED
@@ -70,6 +70,8 @@ You can conditionally limit the output to a given N number of words, just add the
 2) "summary brief 4 words: " - to generate shortened summary content in the source language
 3) "summary big 100 words: " - to generate extended summary content in the source language
 
+The word-level restriction works better with small values than with large ones.
+
 The model can understand text in any language from the list: Russian, Chinese or English. It can also translate the result into any language from the list: Russian, Chinese or English.
 
 For translation into the target language, the target language identifier is specified as the prefix "... to <lang>:", where lang can take the values ru, en or zh. The source language does not need to be specified; moreover, the source text may be multilingual.
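For example, a word-limited summary and a translated summary are requested like this (a short sketch; the prefix strings follow the rules above, and zh is just one of the three allowed target values):

src_text = 'summary brief 4 words: ' + text  # shortened summary in the source language, limited to about 4 words
src_text = 'summary big to zh: ' + text  # extended summary translated into Chinese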
@@ -88,6 +90,10 @@ task prefix:
 
 The model was trained to compress a context of 2048 tokens and to output a summary of up to 200 tokens for the big task, up to 50 tokens for summary, and up to 20 tokens for brief.
 
+A prefix in a translation task with a length restriction based on the number of words will look like this: "summary brief to en 4 words: "
+
+
+
 
 
 Example summary for English:
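Putting the pieces together, a small helper can compose any of these prefixes, and the input can be kept within the 2048-token training context. Note that make_prefix is a hypothetical convenience function written for this sketch (it is not part of the model or of transformers), and truncation/max_length are standard Hugging Face tokenizer arguments rather than something this model card prescribes:

def make_prefix(size='', target_lang='', n_words=None):
    """Compose a task prefix such as 'summary brief to en 4 words: '."""
    parts = ['summary']
    if size:  # '' (default), 'brief' or 'big'
        parts.append(size)
    if target_lang:  # ru, en or zh
        parts += ['to', target_lang]
    if n_words is not None:  # word-level length restriction
        parts += [str(n_words), 'words']
    return ' '.join(parts) + ': '

src_text = make_prefix(size='brief', target_lang='en', n_words=4) + text
# keep the prefixed input within the model's 2048-token context:
input_ids = tokenizer(src_text, return_tensors="pt", truncation=True, max_length=2048)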
@@ -101,6 +107,14 @@ model_name = 'utrobinmv/t5_summary_en_ru_zh_large_2048'
 model = T5ForConditionalGeneration.from_pretrained(model_name)
 model.eval()
 model.to(device)
+
+generation_config = model.generation_config
+
+# decoding settings for higher-quality generation
+generation_config.length_penalty = 0.6
+generation_config.no_repeat_ngram_size = 2
+generation_config.num_beams = 10
+
 tokenizer = T5Tokenizer.from_pretrained(model_name)
 
 text = """Videos that say approved vaccines are dangerous and cause autism, cancer or infertility are among those that will be taken down, the company said. The policy includes the termination of accounts of anti-vaccine influencers. Tech giants have been criticised for not doing more to counter false health information on their sites. In July, US President Joe Biden said social media platforms were largely responsible for people's scepticism in getting vaccinated by spreading misinformation, and appealed for them to address the issue. YouTube, which is owned by Google, said 130,000 videos were removed from its platform since last year, when it implemented a ban on content spreading misinformation about Covid vaccines. In a blog post, the company said it had seen false claims about Covid jabs "spill over into misinformation about vaccines in general". The new policy covers long-approved vaccines, such as those against measles or hepatitis B. "We're expanding our medical misinformation policies on YouTube with new guidelines on currently administered vaccines that are approved and confirmed to be safe and effective by local health authorities and the WHO," the post said, referring to the World Health Organization."""
@@ -110,7 +124,7 @@ prefix = 'summary: '
 src_text = prefix + text
 input_ids = tokenizer(src_text, return_tensors="pt")
 
-generated_tokens = model.generate(**input_ids.to(device))
+generated_tokens = model.generate(**input_ids.to(device), generation_config=generation_config)
 
 result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
 print(result)
@@ -121,7 +135,7 @@ prefix = 'summary brief: '
 src_text = prefix + text
 input_ids = tokenizer(src_text, return_tensors="pt")
 
-generated_tokens = model.generate(**input_ids.to(device))
+generated_tokens = model.generate(**input_ids.to(device), generation_config=generation_config)
 
 result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
 print(result)
@@ -132,7 +146,7 @@ prefix = 'summary big: '
 src_text = prefix + text
 input_ids = tokenizer(src_text, return_tensors="pt")
 
-generated_tokens = model.generate(**input_ids.to(device))
+generated_tokens = model.generate(**input_ids.to(device), generation_config=generation_config)
 
 result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
 print(result)
@@ -152,6 +166,14 @@ model_name = 'utrobinmv/t5_summary_en_ru_zh_large_2048'
 model = T5ForConditionalGeneration.from_pretrained(model_name)
 model.eval()
 model.to(device)
+
+generation_config = model.generation_config
+
+# decoding settings for higher-quality generation
+generation_config.length_penalty = 0.6
+generation_config.no_repeat_ngram_size = 2
+generation_config.num_beams = 10
+
 tokenizer = T5Tokenizer.from_pretrained(model_name)
 
 text = """在北京冬奥会自由式滑雪女子坡面障碍技巧决赛中,中国选手谷爱凌夺得银牌。祝贺谷爱凌!今天上午,自由式滑雪女子坡面障碍技巧决赛举行。决赛分三轮进行,取选手最佳成绩排名决出奖牌。第一跳,中国选手谷爱凌获得69.90分。在12位选手中排名第三。完成动作后,谷爱凌又扮了个鬼脸,甚是可爱。第二轮中,谷爱凌在道具区第三个障碍处失误,落地时摔倒。获得16.98分。网友:摔倒了也没关系,继续加油!在第二跳失误摔倒的情况下,谷爱凌顶住压力,第三跳稳稳发挥,流畅落地!获得86.23分!此轮比赛,共12位选手参赛,谷爱凌第10位出场。网友:看比赛时我比谷爱凌紧张,加油!"""
@@ -161,7 +183,7 @@ prefix = 'summary to en: '
 src_text = prefix + text
 input_ids = tokenizer(src_text, return_tensors="pt")
 
-generated_tokens = model.generate(**input_ids.to(device))
+generated_tokens = model.generate(**input_ids.to(device), generation_config=generation_config)
 
 result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
 print(result)
@@ -172,7 +194,7 @@ prefix = 'summary brief to en: '
 src_text = prefix + text
 input_ids = tokenizer(src_text, return_tensors="pt")
 
-generated_tokens = model.generate(**input_ids.to(device))
+generated_tokens = model.generate(**input_ids.to(device), generation_config=generation_config)
 
 result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
 print(result)
@@ -183,7 +205,7 @@ prefix = 'summary big to en: '
 src_text = prefix + text
 input_ids = tokenizer(src_text, return_tensors="pt")
 
-generated_tokens = model.generate(**input_ids.to(device))
+generated_tokens = model.generate(**input_ids.to(device), generation_config=generation_config)
 
 result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
 print(result)
@@ -203,6 +225,14 @@ model_name = 'utrobinmv/t5_summary_en_ru_zh_large_2048'
 model = T5ForConditionalGeneration.from_pretrained(model_name)
 model.eval()
 model.to(device)
+
+generation_config = model.generation_config
+
+# decoding settings for higher-quality generation
+generation_config.length_penalty = 0.6
+generation_config.no_repeat_ngram_size = 2
+generation_config.num_beams = 10
+
 tokenizer = T5Tokenizer.from_pretrained(model_name)
 
 text = """Высота башни составляет 324 метра (1063 фута), примерно такая же высота, как у 81-этажного здания, и самое высокое сооружение в Париже. Его основание квадратно, размером 125 метров (410 футов) с любой стороны. Во время строительства Эйфелева башня превзошла монумент Вашингтона, став самым высоким искусственным сооружением в мире, и этот титул она удерживала в течение 41 года до завершения строительство здания Крайслер в Нью-Йорке в 1930 году. Это первое сооружение которое достигло высоты 300 метров. Из-за добавления вещательной антенны на вершине башни в 1957 году она сейчас выше здания Крайслер на 5,2 метра (17 футов). За исключением передатчиков, Эйфелева башня является второй самой высокой отдельно стоящей структурой во Франции после виадука Мийо."""
@@ -212,7 +242,7 @@ prefix = 'summary: '
 src_text = prefix + text
 input_ids = tokenizer(src_text, return_tensors="pt")
 
-generated_tokens = model.generate(**input_ids.to(device))
+generated_tokens = model.generate(**input_ids.to(device), generation_config=generation_config)
 
 result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
 print(result)
@@ -223,7 +253,7 @@ prefix = 'summary brief: '
 src_text = prefix + text
 input_ids = tokenizer(src_text, return_tensors="pt")
 
-generated_tokens = model.generate(**input_ids.to(device))
+generated_tokens = model.generate(**input_ids.to(device), generation_config=generation_config)
 
 result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
 print(result)
@@ -234,7 +264,7 @@ prefix = 'summary big: '
 src_text = prefix + text
 input_ids = tokenizer(src_text, return_tensors="pt")
 
-generated_tokens = model.generate(**input_ids.to(device))
+generated_tokens = model.generate(**input_ids.to(device), generation_config=generation_config)
 
 result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
 print(result)