---
language:
- cs
tags:
- abstractive summarization
- mbart-cc25
- Czech
license: apache-2.0
datasets:
- SumeCzech dataset news-based
metrics:
- rouge
- rougeraw
---

# mBART fine-tuned model for Czech abstractive summarization (AT2H-S)

This model is a checkpoint of [facebook/mbart-large-cc25](https://huggingface.co/facebook/mbart-large-cc25) fine-tuned on a Czech news dataset to produce Czech abstractive summaries.

## Task

The model addresses the ``Abstract + Text to Headline`` (AT2H) task, which consists of generating a one- or two-sentence summary, treated as a headline, from the abstract and full text of a Czech news article.

## Dataset

The model was trained on the [SumeCzech](https://ufal.mff.cuni.cz/sumeczech) dataset, which contains around 1M Czech news documents, each consisting of Headline, Abstract, and Full-text sections. Truncation and padding were set to 512 tokens for the encoder and 64 tokens for the decoder.

## Training

The model was trained on 1x NVIDIA Tesla A100 40GB for 40 hours. During training, it saw 2576K documents, corresponding to roughly 3 epochs.

# Use

The following example assumes you are using the provided Summarizer.ipynb file.

```python
from collections import OrderedDict

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# `Summarizer` is assumed to be defined in the provided Summarizer.ipynb notebook.


def summ_config():
    cfg = OrderedDict([
        # summarization model - checkpoint from website
        ("model_name", "krotima1/mbart-at2h-s"),
        # generation settings passed to the summarizer
        ("inference_cfg", OrderedDict([
            ("num_beams", 4),
            ("top_k", 40),
            ("top_p", 0.92),
            ("do_sample", True),
            ("temperature", 0.89),
            ("repetition_penalty", 1.2),
            ("no_repeat_ngram_size", None),
            ("early_stopping", True),
            ("max_length", 64),
            ("min_length", 10),
        ])),
        # texts to summarize
        ("text", [
            "Input your Czech text",
        ]),
    ])
    return cfg


cfg = summ_config()
# load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(cfg["model_name"])
tokenizer = AutoTokenizer.from_pretrained(cfg["model_name"])
# init summarizer
summarize = Summarizer(model, tokenizer, cfg["inference_cfg"])
summarize(cfg["text"])
```
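
If the Summarizer.ipynb notebook is not at hand, the same checkpoint can probably be driven through the `transformers` API directly. The block below is a minimal sketch under that assumption, not the notebook's `Summarizer` class: `build_generation_kwargs` and `summarize_texts` are hypothetical helper names, and the generation settings are copied from the configuration above. Depending on how the checkpoint was saved, mBART language codes may also need to be set on the tokenizer; that is not handled here.

```python
from collections import OrderedDict

# Generation settings copied from the model card's inference_cfg.
INFERENCE_CFG = OrderedDict([
    ("num_beams", 4),
    ("top_k", 40),
    ("top_p", 0.92),
    ("do_sample", True),
    ("temperature", 0.89),
    ("repetition_penalty", 1.2),
    ("no_repeat_ngram_size", None),
    ("early_stopping", True),
    ("max_length", 64),
    ("min_length", 10),
])


def build_generation_kwargs(inference_cfg):
    """Drop entries set to None so only explicit settings reach generate()."""
    return {key: value for key, value in inference_cfg.items() if value is not None}


def summarize_texts(texts, model_name, inference_cfg, max_input_tokens=512):
    """Hypothetical helper: tokenize, generate, and decode a batch of texts."""
    # Imported lazily so build_generation_kwargs works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Truncate inputs to the 512-token encoder limit and pad to the longest text in the batch.
    inputs = tokenizer(
        texts,
        truncation=True,
        padding=True,
        max_length=max_input_tokens,
        return_tensors="pt",
    )
    output_ids = model.generate(**inputs, **build_generation_kwargs(inference_cfg))
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)


# Example call (downloads the checkpoint, so it is left commented out):
# print(summarize_texts(["Input your Czech text"], "krotima1/mbart-at2h-s", INFERENCE_CFG))
```

Filtering out `None` values keeps the call to `generate` limited to settings that were set explicitly, so unset options such as `no_repeat_ngram_size` fall back to the model's own generation defaults.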