---
license: cc-by-nc-4.0
language:
- ru
- en
base_model:
- d0rj/rut5-base-summ
pipeline_tag: summarization
tags:
- summarization
- natural-language-processing
- text-summarization
- machine-learning
- deep-learning
- transformer
- artificial-intelligence
- text-analysis
- sequence-to-sequence
- pytorch
- tensorflow
- safetensors
- t5
library_name: transformers
---
![Official LaciaSUM Logo](https://huggingface.co/LaciaStudio/Lacia_sum_small_v1/resolve/main/LaciaSUM.png)

# Russian Text Summarization Model - LaciaSUM V1 (small)

This model is a fine-tuned version of d0rj/rut5-base-summ for automatic text summarization. It is adapted specifically for Russian texts and was fine-tuned on a custom CSV dataset containing original texts and their corresponding summaries.

# Key Features

* Objective: Automatic abstractive summarization of texts.
* Base model: d0rj/rut5-base-summ.
* Dataset: A custom CSV file with the columns `Text` (original text) and `Summarize` (summary).
* Preprocessing: Before tokenization, the prefix `summarize:` is added to the original text, which helps the model focus on the summarization task.

# Training Settings

* Number of epochs: 9.
* Batch size: 4 per device.
* Warmup steps: 1000.
* FP16 training enabled (if CUDA is available).
* Hardware: Training was performed on an RTX 3070 in approximately 40 minutes.

# Description

The model was fine-tuned with the Transformers library using the Hugging Face Seq2SeqTrainer. The training script includes:

* Custom dataset: The SummarizationDataset class reads the CSV file (with the correct encoding and separator), trims extra spaces from column names, and tokenizes both the source text and the target summary.
* Token processing: Padding tokens in the target sequence are replaced with -100 so that the loss function ignores them.

A minimal sketch of this setup appears at the end of this card.

This model is suitable for rapid prototyping and for practical applications such as automatic summarization of Russian documents, news articles, and other text formats.

**The model also supports English, but English support has not been tested.**

# Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Lacia_sum_small_v1")
model = AutoModelForSeq2SeqLM.from_pretrained("LaciaStudio/Lacia_sum_small_v1")

# Example text to summarize
text = "Your long text that needs summarizing."

# Add the same prefix used during training
input_text = "summarize: " + text
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

# Generate the summary
summary_ids = model.generate(inputs["input_ids"], max_length=150, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:", summary)
```

**Created by LaciaStudio | LaciaAI**
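
# Training Script Sketch

The full training script is not shipped with this repository. Below is a minimal sketch of the SummarizationDataset class described above, written as a PyTorch Dataset. The class name, the `Text`/`Summarize` column names, the `summarize:` prefix, and the -100 label masking come from this card; the file encoding, separator, and maximum sequence lengths are assumptions.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset


class SummarizationDataset(Dataset):
    """Reads a CSV of (Text, Summarize) pairs and tokenizes them for seq2seq training."""

    def __init__(self, csv_path, tokenizer, max_source_len=512, max_target_len=150):
        # Encoding and separator are assumptions; the card only says they are handled correctly.
        df = pd.read_csv(csv_path, encoding="utf-8", sep=",")
        df.columns = df.columns.str.strip()  # trim extra spaces from column names
        self.texts = df["Text"].astype(str).tolist()
        self.summaries = df["Summarize"].astype(str).tolist()
        self.tokenizer = tokenizer
        self.max_source_len = max_source_len
        self.max_target_len = max_target_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Add the same task prefix used at inference time.
        source = self.tokenizer(
            "summarize: " + self.texts[idx],
            max_length=self.max_source_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        target = self.tokenizer(
            self.summaries[idx],
            max_length=self.max_target_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        labels = target["input_ids"].squeeze(0)
        # Replace padding token ids with -100 so the cross-entropy loss ignores them.
        labels[labels == self.tokenizer.pad_token_id] = -100
        return {
            "input_ids": source["input_ids"].squeeze(0),
            "attention_mask": source["attention_mask"].squeeze(0),
            "labels": labels,
        }
```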
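
The training settings listed above map roughly onto Seq2SeqTrainingArguments as follows. The epoch count, batch size, warmup steps, base model, and the FP16-if-CUDA condition come from this card; the output directory, CSV path, logging, and saving options are hypothetical placeholders.

```python
import torch
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("d0rj/rut5-base-summ")
model = AutoModelForSeq2SeqLM.from_pretrained("d0rj/rut5-base-summ")

# Uses the SummarizationDataset sketch above; the CSV path is hypothetical.
train_dataset = SummarizationDataset("data.csv", tokenizer)

training_args = Seq2SeqTrainingArguments(
    output_dir="./lacia_sum_small_v1",   # hypothetical
    num_train_epochs=9,                  # from the card
    per_device_train_batch_size=4,       # from the card
    warmup_steps=1000,                   # from the card
    fp16=torch.cuda.is_available(),      # FP16 only when CUDA is available
    logging_steps=100,                   # assumption
    save_strategy="epoch",               # assumption
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```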