AI & ML interests

None defined yet.

fraug-library's activity

lbourdois 
posted an update 7 days ago
We introduce FAT5 (Flash Attention T5) ⚡

An implementation of T5 in PyTorch with the UL2 objective, optimized for GPUs for both training and inference thanks to 13 different optimizations.
The main one is a CUDA kernel we designed that extends Flash Attention by @tridao with RPE biases and supports other positional encodings such as RoPE, ALiBi, or FIRE.
The resulting kernel is 2× faster than an SDPA implementation.
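
For context, here is a minimal sketch of what such an SDPA baseline with an additive RPE bias looks like in plain PyTorch; the shapes and names are illustrative, not the repo's actual code:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
batch, heads, seq, head_dim = 2, 8, 128, 64
q = torch.randn(batch, heads, seq, head_dim, device=device, dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# T5-style RPE: a learned additive bias per (head, relative position) pair,
# materialized as a (heads, seq, seq) tensor and broadcast over the batch.
rpe_bias = torch.randn(heads, seq, seq, device=device, dtype=torch.bfloat16)

# SDPA accepts an additive float mask; this is the baseline path that the
# custom kernel (Flash Attention fused with the bias) is compared against.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=rpe_bias)
```
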
We also use Triton kernels to optimize certain parts of the architecture, such as the cross-entropy loss and the RMSNorm layer.
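
For reference, the RMSNorm computation that such a kernel fuses can be written as a plain PyTorch module; this is a sketch of the standard formulation, not the repo's implementation:

```python
import torch

class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square of the features; unlike LayerNorm,
        # there is no mean subtraction and no bias term.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```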

The various kernels have been carefully built to be compatible with BF16 and torch.compile, to go even faster and achieve efficient pretraining.
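
In generic PyTorch terms, the training setup those kernels target looks like the following; this is a minimal sketch with a stand-in model, not FAT5-specific code:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)  # stand-in for the actual model
model = torch.compile(model)  # custom kernels must not break graph capture

# BF16 autocast: the forward pass runs in bfloat16 where it is safe to do so.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    out = model(torch.randn(8, 512, device=device))
```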

All the other optimizations are described in a 📝 blog post available on @huggingface 🤗: CATIE-AQ/FAT5-report.

This methodology enabled us, as a proof of concept, to efficiently pretrain a FAT5 with 147M parameters in French in a reasonable time (1,461 hours for 419B tokens), with limited resources (a single A100, i.e. a computational budget of ~€1,900) and a low carbon footprint (13.5 kg CO2 eq).

The model's weights are also available on Hugging Face: CATIE-AQ/FAT5-small.
It's not very useful in practice, as it's a PoC and not an instruction-tuned model (that's planned for later).

All the code is available on GitHub if you want to pretrain your own model in your own language or for a specific domain: https://github.com/catie-aq/flashT5
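
As a hedged sketch, loading the released checkpoint could look like this; since FAT5 relies on custom modeling code, trust_remote_code=True is an assumption here, so check the model card for the supported recipe:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CATIE-AQ/FAT5-small")
# trust_remote_code is assumed to be needed for the custom FAT5 modeling code.
model = AutoModel.from_pretrained("CATIE-AQ/FAT5-small", trust_remote_code=True)
```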

To wrap up, note that this was a joint project with @BorisAlbar at hf.co/CATIE-AQ.
lbourdois 
posted an update 12 months ago
I stopped procrastinating and finally took the time to write the second article in my series of blog posts on SSMs: https://huggingface.co/blog/lbourdois/ssm-2022.
In this blog post, I review the history of the SSM models released in 2022, with more than 14 models discussed in a condensed format.
They are separated into two parts: "theoretical" (DSS, S4D, GSS, Mega, S5, etc.) and "applications" (Sashimi, ViS4mer, CCNN, etc.).

To understand everything, it's best to have first read the introductory blog post on SSMs and S4: https://huggingface.co/blog/lbourdois/get-on-the-ssm-train.
All the articles in the series are listed in this space: lbourdois/SSM_blog_posts

Happy reading :)
lbourdois 
posted an update about 1 year ago
The most widely used French NER models on HF (Jean-Baptiste/camembert-ner and cmarkea/distilcamembert-base-ner) are trained on a single dataset (WikiNER) which, on the one hand, contains leaks and therefore distorts these models' true results and, on the other hand, overspecializes them in a particular domain (texts from Wikipedia). They are also only available in a base version (110M parameters).

That's why I've trained new French NER models, both on more data (3× as much) and in base and large (336M parameters) versions. They are available with 3 entities (PER, ORG, LOC) or 4 entities (PER, ORG, LOC, MISC); a quick usage sketch follows the list:
- CATIE-AQ/NERmembert-base-4entities
- CATIE-AQ/NERmembert-large-4entities
- CATIE-AQ/NERmembert-base-3entities
- CATIE-AQ/NERmembert-large-3entities
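
If you want to try them, here is a minimal sketch using the standard transformers pipeline; the aggregation strategy and the example sentence are illustrative, so check the model cards for the recommended usage:

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="CATIE-AQ/NERmembert-base-4entities",
    aggregation_strategy="simple",  # merge sub-tokens into whole entities
)
print(ner("Emmanuel Macron s'est rendu à Bordeaux hier."))
```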

Datasets without leaks are also available:
- CATIE-AQ/frenchNER_4entities
- CATIE-AQ/frenchNER_3entities
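
They can be pulled with the datasets library; this is a sketch, and split names or configurations may differ, so see the dataset cards:

```python
from datasets import load_dataset

ds = load_dataset("CATIE-AQ/frenchNER_4entities")
print(ds)
```
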
lbourdois 
posted an update about 1 year ago