Urchade Zaratiana

urchade

AI & ML interests

None yet

Organizations

Emergent Methods · fastino · GLiNER Community

urchade's activity

reacted to tomaarsen's post with 🔥 11 months ago
@Omartificial-Intelligence-Space has trained and released 6 Arabic embedding models for semantic similarity. 4 of them outperform all previous models on the STS17 Arabic-Arabic task!

📚 Trained on a large dataset of 558k Arabic triplets translated from the AllNLI triplet dataset: Omartificial-Intelligence-Space/Arabic-NLi-Triplet
6️⃣ 6 different base models: AraBERT, MarBERT, LaBSE, MiniLM, paraphrase-multilingual-mpnet-base, mpnet-base, ranging from 109M to 471M parameters.
🪆 Trained with a Matryoshka loss, allowing you to truncate embeddings with minimal performance loss: smaller embeddings are faster to compare.
📈 Outperforms all commonly used multilingual models like intfloat/multilingual-e5-large, sentence-transformers/paraphrase-multilingual-mpnet-base-v2, and sentence-transformers/LaBSE.

Check them out here:
- Omartificial-Intelligence-Space/Arabic-mpnet-base-all-nli-triplet
- Omartificial-Intelligence-Space/Arabic-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabic-labse-Matryoshka
- Omartificial-Intelligence-Space/Marbert-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet
Or the collection with all: https://huggingface.co/collections/Omartificial-Intelligence-Space/arabic-matryoshka-embedding-models-666f764d3b570f44d7f77d4e

My personal favourite is likely Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka: at a very efficient 135M parameters, it scores #1 on the mteb/leaderboard.
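
If you'd like to try the Matryoshka truncation, here is a minimal sketch, assuming sentence-transformers >= 2.7 (which introduced the truncate_dim argument); the two Arabic sentences are illustrative:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the Matryoshka model, keeping only the first 256 embedding dimensions
model = SentenceTransformer(
    "Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka",
    truncate_dim=256,
)

# "The weather is beautiful today" / "The weather is wonderful today"
embeddings = model.encode(["الطقس جميل اليوم", "الجو رائع اليوم"])
print(cos_sim(embeddings[0], embeddings[1]))
```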
reacted to turiabu's post with 🤗 about 1 year ago
Can anyone see my post on 🤗?
Reply with 🤗
reacted to tomaarsen's post with 🤗❤️🔥 about 1 year ago
โ€ผ๏ธSentence Transformers v3.0 is out! You can now train and finetune embedding models with multi-GPU training, bf16 support, loss logging, callbacks & much more. I also release 50+ datasets to train on.

1๏ธโƒฃ Training Refactor
Embedding models can now be trained using an extensive trainer with a lot of powerful features:
- MultiGPU Training (Data Parallelism (DP) and Distributed Data Parallelism (DDP))
- bf16 training support; loss logging
- Evaluation datasets + evaluation loss
- Improved callback support + an excellent Weights & Biases integration
- Gradient checkpointing, gradient accumulation
- Improved model card generation
- Resuming from a training checkpoint without performance loss
- Hyperparameter Optimization
and much more!
Read my detailed blogpost to learn about the components that make up this new training approach: https://huggingface.co/blog/train-sentence-transformers
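
As a rough sketch of the new trainer flow (adapted freely from that blog post; the base model, dataset slice, and hyperparameters below are just example choices):

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Base model to finetune, plus an (anchor, positive, negative) triplet dataset
model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="train[:10000]")
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="mpnet-base-all-nli",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    bf16=True,  # requires bf16-capable hardware
    logging_steps=100,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```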

2๏ธโƒฃ Similarity Score
Not sure how to compare embeddings? Don't worry, you can now use model.similarity(embeddings1, embeddings2) and you'll get your similarity scores immediately. Model authors can specify their desired similarity score, so you don't have to worry about it anymore!
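
A minimal sketch (the model and sentences are just examples):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
])

# 3x3 matrix of pairwise scores, computed with the similarity
# function chosen by the model author (e.g. cosine)
similarities = model.similarity(embeddings, embeddings)
print(model.similarity_fn_name)
print(similarities)
```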

3๏ธโƒฃ Additional Kwargs
Sentence Transformers relies on various Transformers instances (AutoModel, AutoTokenizer, AutoConfig), but it was hard to provide valuable keyword arguments to these (like 'torch_dtype=torch.bfloat16' to load a model a lower precision for 2x inference speedup). This is now easy!
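
For example, a short sketch of loading in bfloat16 (assumes hardware and a torch build with bf16 support):

```python
import torch
from sentence_transformers import SentenceTransformer

# model_kwargs is forwarded to the underlying AutoModel.from_pretrained call
model = SentenceTransformer(
    "all-MiniLM-L6-v2",
    model_kwargs={"torch_dtype": torch.bfloat16},
)
```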

4๏ธโƒฃ Hyperparameter Optimization
Sentence Transformers now ships with HPO, allowing you to effectively choose your hyperparameters for your data and task.
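
HPO plugs into the trainer via model_init plus hyperparameter_search, inherited from the transformers Trainer; a hedged sketch, reusing args and train_dataset from the trainer sketch above and assuming optuna is installed:

```python
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

def model_init():
    # A fresh model per trial, so trials don't share weights
    return SentenceTransformer("microsoft/mpnet-base")

def hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [16, 32, 64]),
    }

trainer = SentenceTransformerTrainer(
    model=None,
    model_init=model_init,
    args=args,                          # from the trainer sketch above
    train_dataset=train_dataset,        # idem
    loss=MultipleNegativesRankingLoss,  # callable that receives each trial's model
)
best_trial = trainer.hyperparameter_search(hp_space=hp_space, n_trials=10, direction="minimize")
print(best_trial)
```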

5๏ธโƒฃ Dataset Release
To help you out with finetuning models, I've released 50+ ready-to-go datasets that can be used with training or finetuning embedding models: sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552

Full release notes: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.0.0
replied to their post about 1 year ago

Hi @meduri30

Thank you for your interest in GLiNER, I am looking forward to your domain-specific version 😀

I have started working on relation extraction (RE). An initial version (beta) is available to try in Colab; see this repo: https://github.com/urchade/GraphER

For now, the results are not robust, but I think it can work for some domains.

posted an update about 1 year ago
**Release Announcement: gliner_multi_pii-v1**

I am pleased to announce the release of gliner_multi_pii-v1, a model developed for recognizing a wide range of Personally Identifiable Information (PII). This model is the result of fine-tuning urchade/gliner_multi-v2.1 on a synthetic dataset (urchade/synthetic-pii-ner-mistral-v1).

**Model Features:**
- Capable of identifying multiple PII types including addresses, passport numbers, emails, social security numbers, and more.
- Designed to assist with data protection and compliance across various domains.
- Multilingual (English, French, Spanish, German, Italian, Portuguese)

Link: urchade/gliner_multi_pii-v1

```python
from gliner import GLiNER

# Load the fine-tuned multilingual PII model
model = GLiNER.from_pretrained("urchade/gliner_multi_pii-v1")

text = """
Harilala Rasoanaivo, un homme d'affaires local d'Antananarivo, a enregistré une nouvelle société nommée "Rasoanaivo Enterprises" au Lot II M 92 Antohomadinika. Son numéro est le +261 32 22 345 67, et son adresse électronique est [email protected]. Il a fourni son numéro de sécu 501-02-1234 pour l'enregistrement.
"""

# PII entity types to extract, matched zero-shot
labels = ["work", "booking number", "personally identifiable information", "driver licence", "person", "address", "company", "email", "passport number", "Social Security Number", "phone number"]
entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
```

Output:

```
Harilala Rasoanaivo => person
Rasoanaivo Enterprises => company
Lot II M 92 Antohomadinika => address
+261 32 22 345 67 => phone number
[email protected] => email
501-02-1234 => Social Security Number
```
replied to their post about 1 year ago
posted an update about 1 year ago
**Some updates on GLiNER**

🆕 A new commercially permissible multilingual version is available: urchade/gliner_multi-v2.1

🐛 A subtle bug that caused performance degradation on some models has been corrected. Thanks to @yyDing1 for raising the issue.

```python
from gliner import GLiNER

# Initialize GLiNER with the new multilingual checkpoint
model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")

text = "This is a text about Bill Gates and Microsoft."

# Labels for entity prediction
labels = ["person", "organization", "email"]

entities = model.predict_entities(text, labels, threshold=0.5)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
```
reacted to tomaarsen's post with 🔥 about 1 year ago
🎉 Today, the 5000th Sentence Transformer model was uploaded to Hugging Face! Embedding models are extremely versatile, so it's no wonder that they're still being trained.

Here are a few resources to get you started with them:
- All Sentence Transformer models: https://huggingface.co/models?library=sentence-transformers&sort=trending
- Sentence Transformer documentation: https://sbert.net/
- Massive Text Embedding Benchmark (MTEB) Leaderboard: mteb/leaderboard

The embedding space is extremely active right now, so if you're using an embedding model for your retrieval, semantic similarity, reranking, classification, clustering, etc., then be sure to keep an eye out on the trending Sentence Transformer models & new models on MTEB.

Also, I'm curious if you've ever used Sentence Transformers via a third party library, like a RAG framework or vector database. I'm quite interested in more integrations to bring everyone free, efficient & powerful embedding models!
reacted to giux78's post with ❤️ about 1 year ago
Super work from @DeepMount00:

🚀 **Discover Universal NER: A GLiNER-Based Italian NER**

Introducing **Universal NER for the Italian language**, a revolutionary Named Entity Recognition (NER) model evolved from the GLiNER architecture and meticulously tailored for the Italian language. This advanced model is a beacon of efficiency and versatility, engineered to **recognize any entity type** within the rich nuances of Italian, using a bidirectional transformer encoder. It stands out as an ideal solution for those navigating the challenges of resource-limited environments or seeking an efficient alternative to cumbersome Large Language Models (LLMs).
**Runs fast also on CPU!**

Experience this Italian-focused innovation live on Hugging Face Spaces:
DeepMount00/universal_ner_ita

Paper: https://arxiv.org/abs/2311.08526. Urchade Zaratiana et al., great work!
replied to tomaarsen's post about 1 year ago
reacted to tomaarsen's post with ❤️ over 1 year ago
I remember very well that about two years ago, 0-shot named entity recognition (i.e. where you can choose any labels on the fly) was completely infeasible. Fast forward a year, and Universal-NER/UniNER-7B-all surprised me by showing that 0-shot NER is possible! However, I had a bunch of concerns that prevented me from ever adopting it myself. For example, the model was 7B parameters, only worked with 1 custom label at a time, and it had a cc-by-nc-4.0 license.

Since then, a little-known research paper introduced GLiNER, which was a modified & finetuned variant of the microsoft/deberta-v3-base line of models. Notably, GLiNER outperforms UniNER-7B, despite being almost 2 orders of magnitude smaller! It also allows for multiple labels at once, supports nested NER, and the models are Apache 2.0.

Very recently, the models were uploaded to Hugging Face, and I was inspired to create a demo for the English model. The demo runs on CPU, and can still very efficiently compute labels with great performance. I'm very impressed at the models.

There are two models right now:
* base (english): urchade/gliner_base
* multi (multilingual): urchade/gliner_multi

And my demo to experiment with the base model can be found here: https://huggingface.co/spaces/tomaarsen/gliner_base
replied to their post over 1 year ago

Oh, OK. I forgot to update it, thanks!

replied to their post over 1 year ago

Yes, the weights are uploaded and hosted on Hugging Face. They should be visible on my profile :)

posted an update over 1 year ago
Hi everyone,

I'd like to share our project on open-type Named Entity Recognition (NER). Our model uses a transformer encoder (BERT-like), making the computational overhead minimal compared to using LLMs. I've developed a demo that runs on CPU in Google Colab.

Colab Demo: https://colab.research.google.com/drive/1mhalKWzmfSTqMnR0wQBZvt9-ktTsATHB?usp=sharing

Code: https://github.com/urchade/GLiNER

Paper: https://arxiv.org/abs/2311.08526