Model Card for metga97/Modern-EgyBert-Embedding

Model Details

Model Description

Developed by: Mohammad Essam (metga97)
Model type: BERT-style encoder
Language(s): Arabic (MSA + Egyptian dialect)
License: MIT
Finetuned from model: metga97/Modern-EgyBert-Base

Uses

This model generates sentence embeddings for downstream tasks such as:

Sentence similarity
Semantic retrieval
Clustering of Arabic sentences
Intent classification
Duplicate detection
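
Most of the tasks above reduce to comparing embedding vectors. The model card does not name a similarity metric, so the sketch below assumes cosine similarity, a common choice for sentence embeddings. The helper names (`cosine_similarity`, `rank_by_similarity`) and the toy 3-dimensional vectors are illustrative stand-ins, not part of the model's API; in practice the vectors would come from the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

# Toy vectors standing in for real sentence embeddings (assumption).
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
ranking = rank_by_similarity(query, candidates)
print(ranking)  # [1, 0, 2]
```

The same scoring can serve retrieval (take the top-ranked candidates) or duplicate detection (flag pairs above a threshold tuned on your own data).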

How to Get Started with the Model

The code below loads the model with Hugging Face Transformers and mean-pools the final hidden states into one embedding per sentence.

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("metga97/Modern-EgyBert-Embedding")
model = AutoModel.from_pretrained("metga97/Modern-EgyBert-Embedding")

# Egyptian Arabic: "The weather today is nice"
text = ["الجو النهارده جميل"]
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)
    last_hidden = outputs.last_hidden_state

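# Mean pooling: average the token vectors, ignoring padding positions
attention_mask = inputs["attention_mask"]
# Broadcast the mask to (batch, seq_len, hidden) so it zeroes out padded tokens
input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden.size()).float()
sum_embeddings = torch.sum(last_hidden * input_mask_expanded, dim=1)
# Number of real tokens per sentence; the clamp guards against division by zero
sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
sentence_embedding = sum_embeddings / sum_mask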

print(sentence_embedding.shape)  # torch.Size([1, 768])
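
When embedding batches of different-length sentences, it helps to wrap the mean pooling step above in a reusable function. The following is a minimal sketch under stated assumptions: `mean_pool` is a hypothetical helper name, the tensors are synthetic stand-ins for `outputs.last_hidden_state` and `inputs["attention_mask"]`, and the L2 normalization at the end is an optional addition not stated in the model card. It shows that padded positions do not affect the result.

```python
import torch
import torch.nn.functional as F

def mean_pool(last_hidden, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden.dtype)  # (batch, seq, 1)
    summed = (last_hidden * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                   # (batch, 1)
    return summed / counts

# Synthetic stand-in for model output: 2 sentences, 3 tokens, hidden size 4.
last_hidden = torch.tensor([
    [[1.0] * 4, [3.0] * 4, [100.0] * 4],  # third token is padding
    [[2.0] * 4, [4.0] * 4, [6.0] * 4],
])
attention_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = mean_pool(last_hidden, attention_mask)
# Optional: unit-length vectors make the dot product equal cosine similarity.
normalized = F.normalize(pooled, p=2, dim=1)
print(pooled)
```

The first row averages only its two real tokens, so the large padding values never leak into the embedding.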