|
--- |
|
license: cc-by-nc-4.0 |
|
language: |
|
- ru |
|
- en |
|
base_model: |
|
- d0rj/rut5-base-summ |
|
pipeline_tag: summarization |
|
tags: |
|
- summarization |
|
- natural-language-processing |
|
- text-summarization |
|
- machine-learning |
|
- deep-learning |
|
- transformer |
|
- artificial-intelligence |
|
- text-analysis |
|
- sequence-to-sequence |
|
- pytorch |
|
- tensorflow |
|
- safetensors |
|
- t5 |
|
library_name: transformers |
|
--- |
|
|
|
|
|
|
# Russian Text Summarization Model - LaciaSUM V1 (small) |
|
This model is a fine-tuned version of d0rj/rut5-base-summ for automatic text summarization. It is adapted specifically for processing Russian texts and was fine-tuned on a custom CSV dataset containing original texts and their corresponding summaries.
|
|
|
# Key Features |
|
* Objective: Automatic abstractive summarization of texts. |
|
* Base Model: d0rj/rut5-base-summ. |
|
* Dataset: A custom CSV file with the columns `Text` (original text) and `Summarize` (summary).
|
* Preprocessing: Before tokenization, the prefix `summarize:` is added to the original text, which helps the model focus on the summarization task.
|
# Training Settings
|
* Number of epochs: 9. |
|
* Batch size: 4 per device. |
|
* Warmup steps: 1000. |
|
* FP16 training enabled (if CUDA is available). |
|
* Hardware: Training was performed on an RTX 3070 (approximately 40 minutes of training); a configuration sketch follows below.
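
For reference, here is a minimal sketch of how these settings map onto Hugging Face `Seq2SeqTrainingArguments`. The output directory is a placeholder, and any argument not listed above keeps its library default.

```python
import torch
from transformers import Seq2SeqTrainingArguments

# Hypothetical training arguments mirroring the settings listed above;
# "./results" is a placeholder output path, not the path used originally.
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    num_train_epochs=9,
    per_device_train_batch_size=4,
    warmup_steps=1000,
    fp16=torch.cuda.is_available(),  # FP16 training if CUDA is available
)
```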
|
|
|
# Description |
|
The model was fine-tuned using the Transformers library together with the Hugging Face `Seq2SeqTrainer`. The training script includes:
|
|
|
* Custom Dataset: The `SummarizationDataset` class reads the CSV file (ensuring the correct encoding and separator), trims extra spaces from column names, and tokenizes both the source text and the target summary; a sketch of such a class follows below.

* Token Processing: To improve loss computation, padding token IDs in the target labels are replaced with -100 so they are ignored by the loss function.
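
As an illustration, the following is a minimal sketch of what such a dataset class could look like. Only the class name, the `Text`/`Summarize` column names, the `summarize:` prefix, and the -100 label masking come from the description above; the CSV separator, encoding, and length limits are assumptions and should be adjusted to match the actual training script.

```python
import pandas as pd
from torch.utils.data import Dataset


class SummarizationDataset(Dataset):
    """Reads a CSV with `Text` and `Summarize` columns, as described above.

    Encoding, separator, and max lengths are assumptions, not the exact
    values used during training.
    """

    def __init__(self, csv_path, tokenizer, max_source_len=512, max_target_len=150):
        df = pd.read_csv(csv_path, encoding="utf-8", sep=",")
        df.columns = df.columns.str.strip()  # trim extra spaces from column names
        self.texts = df["Text"].astype(str).tolist()
        self.summaries = df["Summarize"].astype(str).tolist()
        self.tokenizer = tokenizer
        self.max_source_len = max_source_len
        self.max_target_len = max_target_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Add the task prefix, exactly as during training
        source = self.tokenizer(
            "summarize: " + self.texts[idx],
            max_length=self.max_source_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        target = self.tokenizer(
            self.summaries[idx],
            max_length=self.max_target_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        labels = target["input_ids"].squeeze(0)
        # Replace padding tokens with -100 so the loss ignores them
        labels[labels == self.tokenizer.pad_token_id] = -100
        return {
            "input_ids": source["input_ids"].squeeze(0),
            "attention_mask": source["attention_mask"].squeeze(0),
            "labels": labels,
        }
```

An instance of this dataset can then be passed to `Seq2SeqTrainer` as `train_dataset`.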
|
|
|
This model is suitable for rapid prototyping and practical applications in automatic summarization of Russian documents, news articles, and other text formats. |
|
**The model also supports English, but this support has not been tested.**
|
|
|
# Example Usage |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM |
|
|
|
# Load the tokenizer and the model |
|
tokenizer = AutoTokenizer.from_pretrained("your_username/model_name") |
|
model = AutoModelForSeq2SeqLM.from_pretrained("your_username/model_name") |
|
|
|
# Example text to summarize |
|
text = "Your long text that needs summarizing." |
|
|
|
# Add the prefix as during training |
|
input_text = "summarize: " + text |
|
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True) |
|
|
|
# Generate the summary |
|
summary_ids = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], max_length=150, num_beams=4, early_stopping=True)
|
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True) |
|
|
|
print("Summary:", summary) |
|
``` |
|
|
|
**Created by LaciaStudio | LaciaAI** |