bart-base-japanese-news (base-sized model)

This repository provides a Japanese BART model. The model was trained by Stockmark Inc.

An introductory article on the model can be found at the following URL.

https://tech.stockmark.co.jp/blog/bart-japanese-base-news/

Model description

BART is a transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. BART is pre-trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text.
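The corrupt-then-reconstruct idea can be illustrated with a minimal sketch in plain Python. It is not the training code for this model: the helper names (`split_sentences`, `permute_sentences`, `mask_span`) are hypothetical, spans are masked at the character level for simplicity rather than at the token level, and the noising functions and hyperparameters actually used for this model are not stated in this README.

```python
import random

MASK = "<mask>"  # mask token string, as it appears in the mask-filling example in this README


def split_sentences(text):
    """Split Japanese text into sentences, keeping the trailing '。' (hypothetical helper)."""
    return [s + "。" for s in text.split("。") if s]


def permute_sentences(text, rng):
    """Noising example 1: shuffle the order of the sentences."""
    sentences = split_sentences(text)
    rng.shuffle(sentences)
    return "".join(sentences)


def mask_span(text, start, length):
    """Noising example 2: replace a character span with a single mask token."""
    return text[:start] + MASK + text[start + length:]


original = "明日は大雨です。電車は止まる可能性があります。ですから、自宅から働きます。"
rng = random.Random(0)  # fixed seed so the corruption is reproducible

# (corrupted, original) is one training pair: the model reads the corrupted
# text and is trained to reconstruct the original.
corrupted = permute_sentences(original, rng)

# Masking the single character "雨" yields the input used in the mask-filling example.
masked = mask_span("今日の天気は雨のため、傘が必要でしょう。", start=6, length=1)

print(corrupted)
print(masked)
```

In both cases the reconstruction target is the uncorrupted text, which is why the raw model can reorder sentences and fill masks without any fine-tuning.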

BART is particularly effective when fine-tuned for text generation (e.g. summarization, translation) but also works well for comprehension tasks (e.g. text classification, question answering).

Intended uses & limitations

You can use the raw model for text infilling. However, the model is mostly meant to be fine-tuned on a supervised dataset.

How to use the model

NOTE: Since we are using a custom tokenizer, please use trust_remote_code=True to initialize the tokenizer.

Simple use

from transformers import AutoTokenizer, BartModel

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartModel.from_pretrained(model_name)

inputs = tokenizer("今日は良い天気です。", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state

Sentence Permutation

import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartForConditionalGeneration.from_pretrained(model_name)

if torch.cuda.is_available():
    model = model.to("cuda")

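# The input sentences are shuffled. The correctly ordered text is "明日は大雨です。電車は止まる可能性があります。ですから、自宅から働きます。"
# (English: "Tomorrow there will be heavy rain. The trains may stop. So I will work from home.")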
text = "電車は止まる可能性があります。ですから、自宅から働きます。明日は大雨です。"

inputs = tokenizer([text], max_length=128, return_tensors="pt", truncation=True)
text_ids = model.generate(inputs["input_ids"].to(model.device), num_beams=3, max_length=128)
output = tokenizer.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(output)
# sample output: 明日は大雨です。電車は止まる可能性があります。ですから、自宅から働きます。

Mask filling

import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartForConditionalGeneration.from_pretrained(model_name)

if torch.cuda.is_available():
    model = model.to("cuda")

text = "今日の天気は<mask>のため、傘が必要でしょう。"

inputs = tokenizer([text], max_length=128, return_tensors="pt", truncation=True)
text_ids = model.generate(inputs["input_ids"].to(model.device), num_beams=3, max_length=128)
output = tokenizer.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(output)
# sample output: 今日の天気は、雨のため、傘が必要でしょう。
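Because the model regenerates the whole sentence rather than only the masked span, it can be handy to recover just the text that replaced the mask. The following is a minimal sketch in plain Python; `extract_infill` is a hypothetical helper, not part of this repository or of transformers, and it assumes the model reproduces the surrounding context verbatim, which is not guaranteed. When the context differs, it returns `None`.

```python
MASK = "<mask>"


def extract_infill(template, output):
    """Return the text generated in place of the mask, or None if the context changed."""
    prefix, sep, suffix = template.partition(MASK)
    if not sep:
        return None  # the template contains no mask token
    if len(output) < len(prefix) + len(suffix):
        return None  # output is too short to contain both context parts
    if not (output.startswith(prefix) and output.endswith(suffix)):
        return None  # the model altered the surrounding context
    return output[len(prefix):len(output) - len(suffix)]


template = "今日の天気は<mask>のため、傘が必要でしょう。"
output = "今日の天気は、雨のため、傘が必要でしょう。"  # the sample output shown above

print(extract_infill(template, output))  # 、雨
```

The sketch handles a single mask only; templates with several masks would need a different alignment strategy.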

Text generation

NOTE: You can use the raw model for text generation. However, the model is mostly meant to be fine-tuned on a supervised dataset.

import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartForConditionalGeneration.from_pretrained(model_name)

if torch.cuda.is_available():
    model = model.to("cuda")

text = "自然言語処理(しぜんげんごしょり、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、人工知能と言語学の一分野である。「計算言語学」(computational linguistics)との類似もあるが、自然言語処理は工学的な視点からの言語処理をさすのに対して、計算言語学は言語学的視点を重視する手法をさす事が多い。"

inputs = tokenizer([text], max_length=512, return_tensors="pt", truncation=True)
text_ids = model.generate(inputs["input_ids"].to(model.device), num_beams=3, min_length=0, max_length=40)
output = tokenizer.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(output)
# sample output: 自然言語処理(しぜんげんごしょり、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、言語学の一分野である。

Training

The model was trained on Japanese news articles.

Tokenization

The model uses a sentencepiece-based tokenizer. The vocabulary was trained on a selected subset of the training data using the official sentencepiece training script.

Licenses

The pretrained models are distributed under the terms of the MIT License.

NOTE: Only tokenization_bart_japanese_news.py is distributed under the Apache License, Version 2.0. Please see tokenization_bart_japanese_news.py for license details.

Contact

If you have any questions, please contact us using our contact form.

Acknowledgement

This comparison study was supported with Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC).
