---
license: cc-by-nc-4.0
language:
- ru
- en
base_model:
- d0rj/rut5-base-summ
pipeline_tag: summarization
tags:
- summarization
- natural-language-processing
- text-summarization
- machine-learning
- deep-learning
- transformer
- artificial-intelligence
- text-analysis
- sequence-to-sequence
- pytorch
- tensorflow
- safetensors
- t5
library_name: transformers
---
![Official LaciaSUM Logo](https://huggingface.co/LaciaStudio/Lacia_sum_small_v1/resolve/main/LaciaSUM.png)
# Russian Text Summarization Model - LaciaSUM V1 (small)
This model is a version of d0rj/rut5-base-summ fine-tuned for automatic text summarization. It is adapted specifically for Russian text and was trained on a custom CSV dataset containing original texts and their corresponding summaries.
# Key Features
* Objective: Automatic abstractive summarization of texts.
* Base Model: d0rj/rut5-base-summ.
* Dataset: A custom CSV file with the columns `Text` (original text) and `Summarize` (summary).
* Preprocessing: Before tokenization, the prefix `summarize:` is added to the original text, which helps the model focus on the summarization task.
# Training Settings
* Number of epochs: 9.
* Batch size: 4 per device.
* Warmup steps: 1000.
* FP16 training enabled (if CUDA is available).
* Hardware: Training was performed on an RTX 3070 (approximately 40 minutes of training).
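For reference, below is a minimal sketch of how these settings map onto `Seq2SeqTrainingArguments` from Transformers. The `output_dir` and any options not listed above are illustrative assumptions, not the original training script.

```python
import torch
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the configuration described above;
# output_dir is a placeholder, not the original path.
training_args = Seq2SeqTrainingArguments(
    output_dir="./lacia_sum_small_v1",   # placeholder
    num_train_epochs=9,                  # number of epochs: 9
    per_device_train_batch_size=4,       # batch size: 4 per device
    warmup_steps=1000,                   # warmup steps: 1000
    fp16=torch.cuda.is_available(),      # FP16 training if CUDA is available
)
```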
# Description
The model was fine-tuned with the Transformers library and the Hugging Face `Seq2SeqTrainer`. The training script includes:
* Custom Dataset: The `SummarizationDataset` class reads the CSV file (ensuring the correct encoding and separator), trims extra spaces from column names, and tokenizes both the source text and the target summary.
* Token Processing: To improve loss computation, padding tokens in the target summary are replaced with -100, as shown in the sketch below.
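A minimal sketch of what such a dataset class might look like follows. It reproduces the behavior described above (task prefix, column cleanup, -100 label masking), but the constructor signature, encoding/separator values, and maximum lengths are assumptions rather than the original code.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset

class SummarizationDataset(Dataset):
    """Sketch of the dataset described above; signature and defaults are assumed."""

    def __init__(self, csv_path, tokenizer, max_source_len=512, max_target_len=150):
        # Read the CSV with an explicit encoding and separator (assumed values).
        df = pd.read_csv(csv_path, encoding="utf-8", sep=",")
        df.columns = [c.strip() for c in df.columns]  # trim extra spaces from column names
        self.texts = df["Text"].tolist()
        self.summaries = df["Summarize"].tolist()
        self.tokenizer = tokenizer
        self.max_source_len = max_source_len
        self.max_target_len = max_target_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Add the task prefix used during training.
        source = self.tokenizer(
            "summarize: " + self.texts[idx],
            max_length=self.max_source_len,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        target = self.tokenizer(
            self.summaries[idx],
            max_length=self.max_target_len,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        labels = target["input_ids"].squeeze(0)
        # Replace padding tokens in the labels with -100 so the loss ignores them.
        labels[labels == self.tokenizer.pad_token_id] = -100
        return {
            "input_ids": source["input_ids"].squeeze(0),
            "attention_mask": source["attention_mask"].squeeze(0),
            "labels": labels,
        }
```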
This model is suitable for rapid prototyping and practical applications in automatic summarization of Russian documents, news articles, and other text formats.
**The model also supports English, but English support has not been tested.**
# Example Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Lacia_sum_small_v1")
model = AutoModelForSeq2SeqLM.from_pretrained("LaciaStudio/Lacia_sum_small_v1")

# Example text to summarize
text = "Your long text that needs summarizing."

# Add the same prefix that was used during training
input_text = "summarize: " + text
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

# Generate the summary
summary_ids = model.generate(**inputs, max_length=150, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:", summary)
```
**Created by LaciaStudio | LaciaAI**