---
license: apache-2.0
language:
- en
- it
base_model:
- mistralai/Magistral-Small-2506
pipeline_tag: text-generation
library_name: transformers
tags:
- ita
- italian
- anita
- magistral
- 24b
- uniba
- bari
- italy
- italia
- Conversational
- LLaMantino
---

<img src="https://huggingface.co/m-polignano/ANITA-NEXT-24B-Magistral-2506-ITA/resolve/main/Anita-Next_full.png" alt="anita_next" border="0" width="600px">
<hr>
<h3><i>"Built on <b>mistralai/Magistral-Small-2506</b>"</i></h3>
<p style="text-align:justify;"><b>ANITA-NEXT-24B-Magistral-2506-ITA</b> is a <b>Thinking Model</b> of the <a href="https://arxiv.org/abs/2405.07101"><b>ANITA</b></a> - <i>Large Language Models family</i>.
The model is a fine-tuned version of <a href="https://huggingface.co/mistralai/Magistral-Small-2506"><b>Magistral-Small-2506</b></a> (a fine-tuned <b>Mistral model</b>).
This model version aims to be a <b>Multilingual Model</b> 🏁  (EN 🇺🇸 + ITA 🇮🇹) suitable for further fine-tuning on specific tasks in Italian.</p>

❗❗❗Use at your own risk. The model may hallucinate and generate incorrect, invented, offensive, unethical, or dangerous responses. We are not responsible for any dangerous/offensive/criminal use. The model is released for research purposes only.❗❗❗


The 🌟**ANITA project**🌟 *(**A**dvanced **N**atural-based interaction for the **ITA**lian language)*
aims to provide Italian NLP researchers with an improved model for Italian-language 🇮🇹 use cases.

The **NEXT** family includes **four models**:
- m-polignano/ANITA-NEXT-24B-Magistral-2506-ITA - **General Purpose**
- m-polignano/ANITA-NEXT-24B-Dolphin-Mistral-UNCENSORED-ITA - **Uncensored**
- m-polignano/ANITA-NEXT-24B-Magistral-2506-VISION-ITA - **Vision-Language**
- m-polignano/ANITA-NEXT-20B-gpt-oss-ITA - **Agentic Ready**

<hr>

**GGUF - OLLAMA**: [m-polignano/ANITA-NEXT-24B-Magistral-2506-ITA-GGUF](https://huggingface.co/m-polignano/ANITA-NEXT-24B-Magistral-2506-ITA-GGUF)

<hr>

**Colab Demo:** [A100 - 40GB - Colab Notebook](https://colab.research.google.com/drive/1mhZLAdpOr3TRq-ZTG4XD52J98orHj_na?usp=sharing)<br>
The model runs on a single GPU, using about 19.56 GB of VRAM with *4-bit quantization*.

<hr>

## Specifications

- **Model developers**: <br><a href="https://marcopoli.github.io/">Ph.D. Marco Polignano</a> - University of Bari Aldo Moro, Italy <br> <a href="https://huggingface.co/swap-uniba">SWAP Research Group</a> <br>
- **Variations**: The model has been trained via **supervised fine-tuning (SFT)** using **QLoRA** (4-bit) on instruction-based datasets, followed by a **DPO** step over the *mlabonne/orpo-dpo-mix-40k* dataset to align it with human preferences for helpfulness and safety (a hypothetical sketch of this step follows the list).
- **Input**: The model accepts text input only.
- **Language**: Multilingual 🏁 + Italian 🇮🇹
- **Output**: The model generates text and code only.
- **Model Architecture**: *Mistral architecture*.
- **Context length**: 128k tokens, though quality degrades beyond 40k.
- **Library Used**: [Transformers 4.56.0.dev0](https://huggingface.co/docs/transformers/index)
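
As a reference for the training recipe above, here is a minimal, hypothetical sketch of the DPO step using Hugging Face `trl`. Only the dataset name comes from this card; the starting checkpoint, hyperparameters, and trainer settings are illustrative assumptions, not the authors' actual configuration.

```python
# Hypothetical sketch of the DPO alignment step described above.
# Only the dataset name is from the model card; everything else is illustrative.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "mistralai/Magistral-Small-2506"  # assumed starting point after SFT
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Preference pairs (prompt / chosen / rejected) used for alignment.
dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")

args = DPOConfig(
    output_dir="anita-next-dpo",
    beta=0.1,  # illustrative KL-penalty strength, not the authors' value
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)
trainer = DPOTrainer(model=model, args=args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```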
  
<hr>

## Playground
There are several ways to use the model directly; choose one of the following to get started.

### Prompt Template

```
<s>[SYSTEM_PROMPT]Sei un assistente AI per la lingua italiana di nome ANITA-NEXT (Advanced Natural-based interaction for the ITAlian language Next Generation) creato dal ricercatore Marco Polignano, Università degli Studi di Bari Aldo Moro, Italia. Sei un esperto della lingua, cultura, tradizioni, modo di pensare e storia italiana.

L'utente ti chiederà di risolvere un compito o rispondere ad una domanda. Rispondi e ragiona usando la lingua della domanda, preferendo l'Italiano.
Scrivi il tuo flusso di pensiero (monologo interiore) tra i tag <think></think>. Ragiona in modo disinvolto, scrivendo riflessioni e/o bozze, come se stessi lavorando a un esercizio su un foglio di carta.
Successivamente, scrivi la soluzione in modo chiaro, corretto, semplice ed esaustivo basandoti sul riassunto del tuo flusso di pensiero.
Se necessario, usa la notazione markdown per formattare la risposta.[/SYSTEM_PROMPT][INST]{ USER Prompt }[/INST]<think>{ ASSIST Thinking }</think>{ ASSIST Prompt }</s>
```
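
If you build prompts by hand instead of relying on the chat template, the mapping is straightforward. A minimal sketch follows (the user question is just an example; `apply_chat_template`, shown in the next section, does this for you):

```python
# Filling the raw template by hand.
SYSTEM_PROMPT = "Sei un assistente AI per la lingua italiana di nome ANITA-NEXT ..."  # full text above
user_prompt = "Chi è Carlo Magno?"

# <s> is already included here, so tokenize with add_special_tokens=False.
prompt = f"<s>[SYSTEM_PROMPT]{SYSTEM_PROMPT}[/SYSTEM_PROMPT][INST]{user_prompt}[/INST]"
# The model then generates "<think>{reasoning}</think>{answer}</s>".
```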
### Transformers

For direct use with `transformers`, you can get started with the following steps.

- First, install the required packages with `pip`:

  ```bash
  pip install -U --no-deps bitsandbytes accelerate xformers transformers peft trl cut_cross_entropy unsloth_zoo
  pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
  ```

- Then you can use the model directly:

  ```python
  import torch
  from threading import Thread
  from transformers import (
      AutoModelForCausalLM,
      AutoTokenizer,
      BitsAndBytesConfig,
      TextIteratorStreamer,
  )

  # 4-bit NF4 quantization keeps the 24B model within ~20 GB of VRAM.
  nf4_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_use_double_quant=True,
      bnb_4bit_compute_dtype=torch.bfloat16,
  )

  model_dir = "m-polignano/ANITA-NEXT-24B-Magistral-2506-ITA"
  tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
  model = AutoModelForCausalLM.from_pretrained(
      model_dir,
      quantization_config=nf4_config,
      device_map="auto",
      torch_dtype=torch.bfloat16,
  )

  # --- Method 1: plain generation ---
  sys = '''Sei un assistente AI per la lingua italiana di nome ANITA-NEXT (Advanced Natural-based interaction for the ITAlian language Next Generation) creato dal ricercatore Marco Polignano, Università degli Studi di Bari Aldo Moro, Italia. Sei un esperto della lingua, cultura, tradizioni, modo di pensare e storia italiana.

  L'utente ti chiederà di risolvere un compito o rispondere ad una domanda. Rispondi e ragiona usando la lingua della domanda, preferendo l'Italiano.
  Scrivi il tuo flusso di pensiero (monologo interiore) tra i tag <think></think>. Ragiona in modo disinvolto, scrivendo riflessioni e/o bozze, come se stessi lavorando a un esercizio su un foglio di carta.
  Successivamente, scrivi la soluzione in modo chiaro, corretto, semplice ed esaustivo basandoti sul riassunto del tuo flusso di pensiero.
  Se necessario, usa la notazione markdown per formattare la risposta.'''
  messages = [
      {"role": "system", "content": sys},
      {"role": "user", "content": "Chi è Carlo Magno?"},
  ]
  prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
  outputs = model.generate(**inputs, max_new_tokens=32786, do_sample=True, top_p=0.9, temperature=0.7)
  results = tokenizer.batch_decode(outputs)[0]
  print(results)

  # --- Method 2: token-by-token streaming ---
  messages = [
      {"role": "user", "content": "Chi è Marco Polo?"},
  ]
  prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

  # skip_prompt=True yields only the newly generated tokens, not the input prompt;
  # skip_special_tokens=True drops special tokens such as <s>, </s>, and <pad>.
  streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

  generation_kwargs = dict(
      inputs,
      streamer=streamer,
      max_new_tokens=32786,
      do_sample=True,
      top_p=0.9,
      temperature=0.7,
  )

  # model.generate is a blocking call, so run it in a background thread and
  # consume tokens from the streamer in the main thread as they are generated.
  thread = Thread(target=model.generate, kwargs=generation_kwargs)
  thread.start()

  print("Generated text (streaming token by token):")
  for new_text in streamer:
      if "\\boxed" in new_text:
          break
      print(new_text, end="")  # end="" avoids newlines between tokens
      # You can also send new_text to a web socket, a GUI, or any other sink.

  thread.join()
  ```
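
Because the model interleaves its reasoning with the final answer, you will often want to strip the `<think>` block before showing output to users. A small post-processing sketch (the helper name is ours, not part of the model's API):

```python
import re

def strip_thinking(text: str) -> str:
    """Remove the <think>...</think> reasoning block, keeping only the answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print(strip_thinking(results))  # `results` from Method 1 above
```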

<hr>

  
## Citation instructions
```bibtex
@misc{polignano2024advanced,
      title={Advanced Natural-based interaction for the ITAlian language: LLaMAntino-3-ANITA}, 
      author={Marco Polignano and Pierpaolo Basile and Giovanni Semeraro},
      year={2024},
      eprint={2405.07101},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
```bibtex
@article{rastogi2025magistral,
  title={Magistral},
  author={Rastogi, Abhinav and Jiang, Albert Q and Lo, Andy and Berrada, Gabrielle and Lample, Guillaume and Rute, Jason and Barmentlo, Joep and Yadav, Karmesh and Khandelwal, Kartik and Chandu, Khyathi Raghavi and others},
  journal={arXiv preprint arXiv:2506.10910},
  year={2025}
}
```