Instruct/Base clarification + formatting
First of all, please clarify whether this model is base/completion or instruct-tuned. Additionally, it would be useful to know whether few-shot capabilities were tested. A prompting format, if any was tested, would be nice too.
Please understand that models without instruction capabilities are quite hard to test in a practical environment; such models will never get enough attention even when they deserve it.
It seems to be a base model and appears to have some few-shot abilities, as it can do 1-shot translation. But yeah, an instruct version would be nice.
First of all, please clarify whether this model is base/completion or instruct-tuned. Additionally, it would be useful to know whether few-shot capabilities were tested. A prompting format, if any was tested, would be nice too.
Please understand that models without instruction capabilities are quite hard to test in a practical environment; such models will never get enough attention even when they deserve it.
Hi! Sorry for the confusion - it is a base model; I thought that was conveyed by "Foundational". We are currently working on 1) evaluating it and 2) instruction-tuning it for grounded tasks. The resulting models will be published as they become available.
Creating a general open-question answering model is very low on our priority list, but we might do so in a few months. Aligning it with human preferences and ethics is very difficult, and not doing the alignment poses ethical risks.
Aligning it with human preferences and ethics is very difficult, and not doing the alignment poses ethical risks.
This is unfortunate to hear; please look at Mistral and learn from their mistakes. They released Mistral Small 3 in a more censored state than they should have, and they had to spend two more iterations fixing it. While MS 3.2 is relatively popular, it's not nearly as popular as it could be.
The "ethics" you talk about will only limit the adoption of your model and hurt your research in the future, as less community interest means less testing, less feedback and less improvement over time. It's refreshing that you communicate (unlike Mistral, again), but if you repeat the same mistakes, it will lead you nowhere.
Creating a general open-question answering model is very low on our priority list, but we might do so in a few months. Aligning it with human preferences and ethics is very difficult, and not doing the alignment poses ethical risks.
Don't waste too many resources on ethics beyond legal liabilities. In line with what Notafraud said, there's not much interest in dumbified models around here, and it's harder to de-dumbify them later than it is to dumbify them in the first place. It's up to those adopting your foundation models to align/censor them for their purpose and related audience.
Ok! We will see what we can do!
@TBergmanis If you have trained the model on some instruct data (as seems to be the case with v1.1), which instruct template did you use? The tokenizer config did not seem to have any special tokens for this purpose, so I assume the format was like Alpaca, which doesn't use special tokens? (If not, the instruct-format tokens should be added to the tokenizer.)
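For illustration, here is a minimal sketch of how one could check whether the tokenizer defines a chat template / special tokens and fall back to an Alpaca-style plain-text prompt if it doesn't. The model path is a placeholder (not the real repo id), and the template wording is the generic Alpaca one, not anything confirmed for this model:

```python
# Minimal sketch: check for a chat template / special tokens, otherwise use an
# Alpaca-style plain-text prompt. "path/to/model" is a placeholder, not the real repo id.
from transformers import AutoTokenizer

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

tok = AutoTokenizer.from_pretrained("path/to/model")
print(tok.special_tokens_map)            # any role/turn tokens would show up here
if tok.chat_template is not None:        # tokenizer ships its own chat format
    prompt = tok.apply_chat_template(
        [{"role": "user", "content": "Translate 'labdien' into English."}],
        tokenize=False,
        add_generation_prompt=True,
    )
else:                                    # no special tokens: plain-text Alpaca style
    prompt = ALPACA_TEMPLATE.format(instruction="Translate 'labdien' into English.")
print(prompt)
```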
I’ll ask this here - there’s no point in creating a new thread for the same topic.
If someone were to build an instruction-tuned model based on this, how “smart” would it really be? I recall Madara mentioning in a podcast that the core issue is data - even with access to training infrastructure in the EU, the lack of high-quality, diverse training data makes it pointless to use such a model as a base for a general purpose chatbot. User interest would be low. The only viable use cases would be niche applications like support staff, agents, internal tools, or MCP integrations, where broad knowledge isn’t critical.
That, more or less, aligns with what Tilde has already said: the real value lies in tool use for language and text processing, translation, and similar tasks.
In a way, this creates a misunderstanding between what’s promised to the public and what’s actually delivered. I’m not attacking anyone specifically. This is a general reflection on the current state of LLM development in the EU as far as I know.
Most people, ranging from “soccer moms” to members of the European Parliament, don’t understand much about LLMs. When they hear “an EU vendor is creating an LLM.”, their immediate thought is: “Ohh - another ChatGPT alternative!” But from my own testing, almost no EU company has actually built an open model that’s competitive with ChatGPT, Claude, Grok, or Gemini. (It’s not a fair comparison, but that’s how it’s judged)
Even when compared to other open models like Qwen3, Gemma3, or Llama, the gap remains. The closest contender is Mistral, but even it has its limitations.
I’m getting philosophical here, but my point is this: I hope the EU steps up and invests more in projects like this. If we want to be competitive globally, we need at least an instruction-tuned, multimodal model that’s accessible to everyone - home users, schools, businesses, governments etc. That’s a lot to ask, and it is a lot. But that’s how far behind we are.
Another issue: the average person doesn’t grasp how much work, time, and money goes into building an LLM - it can’t be done in a day, and it doesn’t cost €1,000.
If someone were to build an instruction-tuned model based on this, how “smart” would it really be? I recall Madara mentioning in a podcast that the core issue is data - even with access to training infrastructure in the EU, the lack of high-quality, diverse training data makes it pointless to use such a model as a base for a general purpose chatbot.
That's called "skill issue".
User interest would be low. The only viable use cases would be niche applications like support staff, agents, internal tools, or MCP integrations, where broad knowledge isn’t critical.
Almost all modern LLMs have instruct versions that are stable and useful even in production. Text analysis has much wider application than you might think - and in the end, these are still language models.
The censorship mentioned earlier is what truly stops adoption - it serves neither the people nor companies.
That, more or less, aligns with what Tilde has already said: the real value lies in tool use for language and text processing, translation, and similar tasks.
In a way, this creates a misunderstanding between what’s promised to the public and what’s actually delivered. I’m not attacking anyone specifically. This is a general reflection on the current state of LLM development in the EU as far as I know.
No, that's the view of people who have spent too much time with LLMs from back in the day - modern instruct-tuned LLMs are perfectly capable of the aforementioned tasks.
Most people, ranging from “soccer moms” to members of the European Parliament, don’t understand much about LLMs. When they hear “an EU vendor is creating an LLM.”, their immediate thought is: “Ohh - another ChatGPT alternative!” But from my own testing, almost no EU company has actually built an open model that’s competitive with ChatGPT, Claude, Grok, or Gemini. (It’s not a fair comparison, but that’s how it’s judged)
Even when compared to other open models like Qwen3, Gemma3, or Llama, the gap remains. The closest contender is Mistral, but even it has its limitations.
Again, nobody needs LLMs to be on the same level as "ChatGPT, Claude, Grok, or Gemini" - we need capable LLMs that can follow complex prompts and don't break within a 64k context; LLMs that don't teach us how to be a "good citizen". Companies need reliable tools (again, prompt adherence and no refusals) that can be hosted on their own corporate servers to work with corporate data.
Hosting such a model as a service with web crawling will be good enough even for the general public - Mistral does it, for example.
I’m getting philosophical here, but my point is this: I hope the EU steps up and invests more in projects like this. If we want to be competitive globally, we need at least an instruction-tuned, multimodal model that’s accessible to everyone - home users, schools, businesses, governments etc. That’s a lot to ask, and it is a lot. But that’s how far behind we are.
Multimodality slows adoption too - you'll see how unpopular Omni will be in production. First of all, it needs to be a stable, reliable LLM that becomes popular among enthusiasts and companies, and only then also offered as a service.
Another issue: the average person doesn’t grasp how much work, time, and money goes into building an LLM - it can’t be done in a day, and it doesn’t cost €1,000.
"The average person" sees new articles about another initiative/collaboration that funds billions to "AI companies", and that's where their expectations are built. In their eyes it's a rich industry, and anything that;s not usable immediately will not be valued.
If someone were to build an instruction-tuned model based on this, how “smart” would it really be? I recall Madara mentioning in a podcast that the core issue is data - even with access to training infrastructure in the EU, the lack of high-quality, diverse training data makes it pointless to use such a model as a base for a general purpose chatbot.
That's called "skill issue".
No, it's mostly a data issue. For many European languages there is no high-quality instruction fine-tuning data, while there is, for example, high-quality translation data. Synthetic data generated with big models can help somewhat, but it also has quality issues large enough that I think it is a valid question whether it is worth doing at all compared to, e.g., a translation model that would do translation better.
Ultimately I believe it is mostly about the lack of affordable data workers. Labor is expensive in Europe and having people manually write or annotate data is just not economically viable. For English, you can use any data workers in the whole world (if you don't take ethical issues into account), while for small European languages you can only use data workers from small and expensive European countries.
Also, as a side note, even if you did somehow manage to get funding for training a model, techniques like RLHF require the annotations to be redone for each new model, as the core idea of RLHF is to train the model on its own outputs. For this reason, using pre-made instruction data can only get you so far, and the level of commercial models is still unattainable.
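To make the reuse problem concrete, here is an illustrative sketch (not anyone's actual pipeline) of a single preference record of the kind RLHF/DPO-style training consumes; the field names follow common convention but are my own illustration:

```python
# Illustrative preference record for RLHF/DPO-style tuning. Both completions are
# sampled from the model currently being tuned, so when the base model changes,
# the completions (and the human rankings of them) have to be produced again.
preference_record = {
    "prompt": "Paskaidro vienkārši, kas ir saules panelis.",  # Latvian: "Explain simply what a solar panel is."
    "chosen": "<completion A, sampled from the current model, preferred by the annotator>",
    "rejected": "<completion B, sampled from the same model, rated worse>",
}
# Pre-made instruction data (prompt + single reference answer) can be reused across
# models, which is why it only gets you partway towards commercial-level quality.
```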
Almost all modern LLMs have instruct versions that are stable and useful even in production. Text analysis has much wider application than you might think - and in the end, these are still language models.
For English. For small European languages, most LLMs are horrible for the reasons I mentioned above.
Ultimately I believe it is mostly about the lack of affordable data workers. Labor is expensive in Europe and having people manually write or annotate data is just not economically viable. For English, you can use any data workers in the whole world (if you don't take ethical issues into account), while for small European languages you can only use data workers from small and expensive European countries.
Looks like I've missed this point, thanks for pointing it out. It would be helpful to have an initiative to create such a dataset first of all, then, because it would enable European companies to at least start working towards great LLMs.
Ultimately I believe it is mostly about the lack of affordable data workers. Labor is expensive in Europe and having people manually write or annotate data is just not economically viable. For English, you can use any data workers in the whole world (if you don't take ethical issues into account), while for small European languages you can only use data workers from small and expensive European countries.
Looks like I've missed this point, thanks for pointing it out. It would be helpful to have an initiative to create such a dataset first of all, then, because it would enable European companies to at least start working towards great LLMs.
Good points from @fergusq. I’ve been thinking a lot about datasets - they’re the make-or-break part of building strong EU LLMs.
If we want models that actually work across all EU languages, we need high-quality training data. And that means we can’t just rely on English-only sources. You can’t feed an LLM a bunch of English encyclopedias and expect it to answer fluently in, say, Lithuanian or Catalan - that just doesn’t work. The model needs real, contextual data in each language it’s supposed to handle.
So here’s the challenge: we can’t expect every EU company to build its own language datasets from scratch. That’s inefficient, expensive, and leads to duplication. Instead, we need a coordinated, EU-wide effort to collect, curate, and share language data - ideally through a public or semi-public consortium.
This doesn’t mean open access to everything - privacy laws (like GDPR), data ownership, and commercial interests still apply. But it does mean creating fair, transparent pathways so that trusted institutions, researchers, and developers can access the data they need - responsibly and sustainably.
That’s why I think every EU country should have a dedicated institution - a university, language center, or public cultural body - responsible for preparing and contributing language-specific data. No one knows your language, your culture, or your regional nuances better than the people who live it every day. Those subtleties - idioms, humor, traditions - are what make a model truly usable.
And yes, even with great data, you’re not done. As @fergusq mentioned, fine-tuning, alignment, and ongoing maintenance are still needed. That’s why I don’t see the point in building multiple weak EU LLMs that all end up underperforming. We’d be better off focusing on one strong, open, well-supported model - built on shared data, with national contributions, and maintained by a collaborative EU effort.
That way, we get both scale and cultural depth.
If we want models that actually work across all EU languages, we need high-quality training data. And that means we can’t just rely on English-only sources. You can’t feed an LLM a bunch of English encyclopedias and expect it to answer fluently in, say, Lithuanian or Catalan - that just doesn’t work. The model needs real, contextual data in each language it’s supposed to handle.
That, as a general statement, is not quite true. There are tasks like grounded question answering for which models learn abilities that transfer well across languages. For example, we trained a model for in-context question answering on English Trivia QA data and tested it on Latvian and Russian data. We got 94% accuracy when evaluating the semantic equivalence of the reference and the hypothesis answer.
Why does this work? Providing concise answers to questions grounded in context is 90% a reading comprehension and simple reasoning task. Providing long-form answers could be trickier because getting good-quality data is harder and therefore costs more.
There are other tasks that require cultural knowledge or a specific form, depending on the language, culture, and traditions.
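For readers unfamiliar with the term, here is a rough sketch of what a grounded (in-context) QA example looks like; the template wording and the example itself are my own illustration, not the actual training format:

```python
# Rough illustration of grounded / in-context QA: the answer must be extracted from
# the supplied context, so the skill is mostly reading comprehension plus light
# reasoning, which is why it transfers well across languages.
example = {
    "context": (
        "The Gauja is the longest river that flows entirely within Latvia, "
        "with a length of about 452 km."
    ),
    "question": "Which is the longest river entirely within Latvia?",
    "answer": "The Gauja",
}
prompt = (
    f"Context:\n{example['context']}\n\n"
    f"Question: {example['question']}\n"
    "Answer concisely, using only the context:"
)
print(prompt)
```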
So here’s the challenge: we can’t expect every EU company to build its own language datasets from scratch. That’s inefficient, expensive, and leads to duplication. Instead, we need a coordinated, EU-wide effort to collect, curate, and share language data - ideally through a public or semi-public consortium.
We plan to publish the data we use for instruction tuning.
Just want to mention here that there is a huge difference between comprehension and grammatically correct text generation. It's easy to see by comparing the Mistral Nemo or Mistral Small 24b models to any other LLM of similar size. Other models will "understand" your non-English text, but will not be able to generate the answer correctly. The tokenizer itself should be ready to handle non-English grammar, but the training data also needs more variety - just see how formal and constrained in style Mistral Small 24b becomes.
So yes, using something like English Trivia QA for training is OK, but it may not be good enough, especially in the long run and with a focus on the whole diversity of languages.
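One quick way to check the tokenizer side of this is to compare tokens-per-word ("fertility") across languages; the tokenizer path below is a placeholder, and the sample sentences are my own:

```python
# Quick tokenizer "fertility" check: tokens per word for English vs. a smaller language.
# A large gap usually means slower, more expensive and often lower-quality generation
# in the smaller language. "path/to/tokenizer" is a placeholder.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/tokenizer")

samples = {
    "en": "The weather will be cold and windy tomorrow.",
    "lv": "Rīt laiks būs auksts un vējains.",
}
for lang, text in samples.items():
    n_tokens = len(tok(text, add_special_tokens=False)["input_ids"])
    n_words = len(text.split())
    print(f"{lang}: {n_tokens} tokens / {n_words} words = {n_tokens / n_words:.2f} tokens per word")
```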
If we want models that actually work across all EU languages, we need high-quality training data. And that means we can’t just rely on English-only sources. You can’t feed an LLM a bunch of English encyclopedias and expect it to answer fluently in, say, Lithuanian or Catalan - that just doesn’t work. The model needs real, contextual data in each language it’s supposed to handle.
That, as a general statement, is not quite true. There are tasks like grounded question answering for which models learn abilities that transfer well across languages. For example, we trained a model for in-context question answering on English Trivia QA data and tested it on Latvian and Russian data. We got 94% accuracy when evaluating the semantic equivalence of the reference and the hypothesis answer.
Why does this work? Providing concise answers to questions grounded in context is 90% a reading comprehension and simple reasoning task. Providing long-form answers could be trickier because getting good-quality data is harder and therefore costs more.
There are other tasks that require cultural knowledge or a specific form, depending on the language, culture, and traditions.
So here’s the challenge: we can’t expect every EU company to build its own language datasets from scratch. That’s inefficient, expensive, and leads to duplication. Instead, we need a coordinated, EU-wide effort to collect, curate, and share language data - ideally through a public or semi-public consortium.
We plan to publish the data we use for instruction tuning.
I used Qwen3 to correct grammar and style and, as usual, it misunderstood me in a few places and amplified some parts of the text… I didn’t proofread it enough before posting. My original argument wasn’t "You can’t feed an LLM a bunch of English encyclopedias and expect it to answer fluently in, say, Lithuanian or Catalan - that just doesn’t work." Rather, it was more along the lines of: as far as I know, it won’t produce good (culturally correct) answers, and they might even be poor. This aligns with @notafraud's comment - I’ve noticed the same thing. Most popular LLMs can understand Latvian, for example, but the answers they generate are hardly passable by my standards. Still, they can be useful sometimes.
I did more digging into “grounded question answering,” and that’s good news - within its limits.
Good to hear that the instruction-tuning data will be published; I’m looking forward to it. All of this is new and hard to understand, especially for non-academics.