	---
	license: bigscience-bloom-rail-1.0
	language:
	- it
	---
	# Model Card for basilepp19/bloom-1b7-it-dolly-evalita


	This model is obtained by adapting bloom-1b7 to the Italian language. Italian is not among the languages supported by the BLOOM model, which makes using it
	for Italian challenging. We adapt the original BLOOM model using the MAD-X language adaptation strategy.
	The adapted model is then fine-tuned on an Italian translation of the dolly dataset and on prompts built from two classification tasks. For the classification prompts, we use data
	from two well-known EVALITA tasks: AMI2020 (misogyny detection) and HASPEEDE-v2-2020 (hate-speech detection).

	## Model Details

	### Model Description

	We adapt bloom-1b7 to the Italian language using the MAD-X language adaptation strategy,
	following the procedure proposed in: https://arxiv.org/abs/2212.09535

	We use the default script parameters and a sample of 100,000 Italian examples, drawn from the Filtered Oscar Dataset for
	the Italian Language released by Sarti.
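
One way to draw a fixed-size sample from a large corpus without loading it into memory is reservoir sampling. The sketch below is only an illustration of that idea: this card does not state how the 100,000 examples were actually selected, and the function name `reservoir_sample`, the `seed` parameter, and the toy input are hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream (Algorithm R)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


# Toy usage: in practice the stream would iterate over corpus documents
# and k would be 100_000.
sample = reservoir_sample(range(1_000), k=100)
```

If the corpus has fewer than `k` documents, the function simply returns all of them.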

	Then, the adapted model is fine-tuned on an Italian translation of the dolly dataset and on prompts built from two well-known EVALITA classification tasks:
	AMI2020 (misogyny detection) and HASPEEDE-v2-2020 (hate-speech detection).

	We transformed the training data of the two tasks into an LLM prompt following a template. For the AMI task, we used the following template:

	instruction: Nel testo seguente si esprime odio contro le donne? Rispondi sì o no., input: \<text\>, output: \<sì/no\>.

	Similarly, for HASPEEDE we used:

	instruction: Il testo seguente incita all’odio? Rispondi sì o no., input: \<text\>, output: \<sì/no\>.

	To fill these templates, we mapped the label "1" to the word "sì" and the label "0" to the word "no"; \<text\> is the sentence from the
	dataset to classify.
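
The template filling described above can be sketched as follows. The instruction strings and the label mapping are taken from this card. Storing each example as a dictionary with `instruction`, `input` and `output` keys is an assumption about the on-disk format, and the names `build_example` and `LABEL_TO_WORD` are hypothetical.

```python
from typing import Dict, Union

# Instruction strings copied from the two templates in this card.
AMI_INSTRUCTION = "Nel testo seguente si esprime odio contro le donne? Rispondi sì o no."
HASPEEDE_INSTRUCTION = "Il testo seguente incita all’odio? Rispondi sì o no."

# Label mapping stated in this card: "1" -> "sì", "0" -> "no".
LABEL_TO_WORD = {"1": "sì", "0": "no"}


def build_example(instruction: str, text: str, label: Union[str, int]) -> Dict[str, str]:
    """Turn one labelled sentence into an instruction/input/output record."""
    return {
        "instruction": instruction,
        "input": text,
        # Accept the label as either the string "1"/"0" or the integer 1/0.
        "output": LABEL_TO_WORD[str(label)],
    }


# Toy usage with placeholder sentences (not taken from the datasets).
ami_example = build_example(AMI_INSTRUCTION, "frase di esempio", 1)
haspeede_example = build_example(HASPEEDE_INSTRUCTION, "altra frase di esempio", "0")
```

A label other than 0 or 1 raises a `KeyError`, which makes unexpected values in the source data visible early.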

	The dolly dataset is automatically translated into Italian using an open-source machine translation tool: https://pypi.org/project/argostranslate/
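
A minimal sketch of how a record could be translated with Argos Translate. It assumes that the English-to-Italian Argos package is already installed and that each dolly record has `instruction`, `context` and `response` text fields; neither detail is stated in this card. The helper names are hypothetical, and the exact translation pipeline used by the authors is not documented here.

```python
from typing import Callable, Dict

# Assumed text fields of a dolly record; other fields are copied unchanged.
TEXT_FIELDS = ("instruction", "context", "response")


def argos_en_to_it(text: str) -> str:
    """Translate English text to Italian with Argos Translate."""
    # Imported lazily so translate_record can be used without Argos installed.
    import argostranslate.translate

    return argostranslate.translate.translate(text, "en", "it")


def translate_record(record: Dict[str, str], translate: Callable[[str], str]) -> Dict[str, str]:
    """Return a copy of the record with its non-empty text fields translated."""
    translated = dict(record)
    for field in TEXT_FIELDS:
        value = record.get(field, "")
        if value:  # skip missing or empty fields
            translated[field] = translate(value)
    return translated
```

With Argos installed, a record would be translated with `translate_record(record, argos_en_to_it)`. Passing the translation function as an argument keeps the helper easy to test with a stub.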

	To fine-tune the adapted model, we use the script available here: https://github.com/hyintell/BLOOM-fine-tuning/tree/main

	**Important: when you use the adapted LLM or one of its fine-tuned models, you must use the tokenizer of the adapted model.
	The BLOOM model adapted to the Italian language is available here: https://huggingface.co/basilepp19/bloom-1b7_it.**
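
The tokenizer requirement above can be sketched with the Hugging Face `transformers` library: the tokenizer is loaded from the adapted base model and the weights from this fine-tuned model. This is a minimal sketch, not an official usage recipe. It assumes `transformers` and PyTorch are installed; the function names and the `max_new_tokens` default are hypothetical, and this card does not specify the exact prompt layout expected at inference time.

```python
# Tokenizer repository: the adapted base model, as required above.
TOKENIZER_ID = "basilepp19/bloom-1b7_it"
# Weights repository: this fine-tuned model.
MODEL_ID = "basilepp19/bloom-1b7-it-dolly-evalita"


def load_model_and_tokenizer():
    """Load the fine-tuned weights together with the adapted tokenizer."""
    # Imported lazily so the constants can be reused without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    return model, tokenizer


def generate(prompt: str, max_new_tokens: int = 20) -> str:
    """Generate a continuation for the prompt and return the decoded text."""
    model, tokenizer = load_model_and_tokenizer()
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate(...)` downloads both repositories on first use. For repeated calls, load the model and tokenizer once and reuse them.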

	- Developed by: Pierpaolo Basile, Pierluigi Cassotti, Marco Polignano, Lucia Siciliani, Giovanni Semeraro. Department of Computer Science, University of Bari Aldo Moro, Italy
	- Model type: BLOOM
	- Language(s) (NLP): Italian
	- License: BigScience BLOOM RAIL 1.0

	## Citation

	Pierpaolo Basile, Pierluigi Cassotti, Marco Polignano, Lucia Siciliani, Giovanni Semeraro. On the impact of Language Adaptation for Large Language Models: A
	case study for the Italian language using only open resources. Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023).