RexBERT-base

License: Apache 2.0 · Models · Data · GitHub

TL;DR: An encoder-only transformer (ModernBERT-style) for e-commerce applications, trained in three phases (Pre-training, Context Extension, and Decay) to power product search, attribute extraction, classification, and embedding use cases. The model was trained on 2.3T+ tokens, including 350B+ e-commerce-specific tokens.


Quick Start

import torch
from transformers import AutoTokenizer, AutoModel, pipeline

MODEL_ID = "thebajajra/RexBERT-base"

# Tokenizer
tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

# 1) Fill-Mask (if MLM head is present)
mlm = pipeline("fill-mask", model=MODEL_ID, tokenizer=tok)
print(mlm("These running shoes are great for [MASK] training."))

# 2) Feature extraction (CLS or mean-pooled embeddings)
enc = AutoModel.from_pretrained(MODEL_ID)
inputs = tok(["wireless mouse", "ergonomic mouse pad"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = enc(**inputs, output_hidden_states=True)
# Mean-pool last hidden state for sentence embeddings
emb = (out.last_hidden_state * inputs.attention_mask.unsqueeze(-1)).sum(dim=1) / inputs.attention_mask.sum(dim=1, keepdim=True)

Intended Uses & Limitations

Use cases

  • Product & query retrieval/semantic search (titles, descriptions, attributes)
  • Attribute extraction / slot filling (brand, color, size, material)
  • Classification (category assignment, unsafe/regulated item filtering, review sentiment)
  • Reranking and query understanding (spelling/ASR normalization, acronym expansion)

Out of scope

  • Long-form generation (use a decoder/seq-to-seq LM instead)
  • High-stakes decisions without human review (pricing, compliance, safety flags)

Target users

  • Search/recs engineers, e-commerce data teams, ML researchers working on domain-specific encoders

Model Description

RexBERT-base is an encoder-only, 150M-parameter transformer trained with a masked-language-modeling objective and optimized for e-commerce-related text. The three-phase training curriculum builds general language understanding, extends context handling, and then specializes on a very large corpus of commerce data to capture domain-specific terminology and entity distributions.


Training Recipe

RexBERT-base was trained in three phases:

  1. Pre-training
    General-purpose MLM pre-training on diverse English text for robust linguistic representations.

  2. Context Extension
    Continued training with increased max sequence length to better handle long product pages, concatenated attribute blocks, multi-turn queries, and facet strings. This preserves prior capabilities while expanding context handling.

  3. Decay on 350B+ e-commerce tokens
    Final specialization stage on 350B+ domain-specific tokens (product catalogs, queries, reviews, taxonomy/attributes). Learning rate and sampling weights are annealed (decayed) to consolidate domain knowledge and stabilize performance on commerce tasks.

Training details (fill in):

  • Optimizer / LR schedule: TODO
  • Effective batch size / steps per phase: TODO
  • Context lengths per phase (e.g., 512 → 1k/2k): TODO
  • Tokenizer/vocab: TODO
  • Hardware & wall-clock: TODO
  • Checkpoint tags: TODO (e.g., pretrain, ext, decay)
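
The exact hyperparameters above are still marked TODO. If you want to run a similar domain-adaptive MLM stage on your own commerce corpus, the sketch below shows the general pattern with 🤗 Transformers. It is a minimal illustration under assumed settings, not the released recipe: the dataset train_ds, the output path, the masking rate, and the learning rate are all placeholders.

from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)

MODEL_ID = "thebajajra/RexBERT-base"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# train_ds is a placeholder: a tokenized 🤗 Dataset with an input_ids column,
# e.g. built from your own product titles, descriptions, and queries.
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=True, mlm_probability=0.3)  # illustrative masking rate

args = TrainingArguments(
    output_dir="rexbert-domain-mlm",  # placeholder path
    per_device_train_batch_size=16,
    learning_rate=1e-4,               # illustrative, not the released schedule
    num_train_epochs=1,
    fp16=True,
    report_to="none",
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, data_collator=collator, tokenizer=tok)
trainer.train()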

Data Overview

We identified nine domains that overlap with e-commerce and contain a significant amount of relevant tokens but required filtering. Below are the domains and their filtered sizes.

Domain            Size (GB)
Hobby             114
News              66
Health            66
Entertainment     64
Travel            52
Food              22
Automotive        19
Sports            12
Music and Dance   7

Additionally, six more domains had almost complete overlap and were taken directly from FineFineWeb.

Domain      Size (GB)
Fashion     37
Beauty      37
Celebrity   28
Movie       26
Photo       15
Painting    2

By focusing on these domains, we narrow the search space to the parts of the web data where shopping-related text is likely to appear. However, even within a chosen domain, not every item is actually about buying or selling; many are informational articles, news, or unrelated discussions. Thus, more fine-grained filtering within each domain is required to extract only the e-commerce-specific content. We accomplish this by training lightweight per-domain classifiers to distinguish e-commerce from non-e-commerce content, as sketched below.
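
The card does not specify the classifier architecture, so the sketch below uses a TF-IDF + logistic-regression pipeline from scikit-learn as one plausible lightweight setup; the labeled examples and the 0.5 threshold are placeholders for illustration only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder labeled data for one domain (e.g. "Hobby"):
# 1 = e-commerce context (listings, offers, product pages), 0 = everything else.
labeled_texts = [
    "Carbon-fiber fishing rod, 2-piece, free shipping, $89.99",
    "The history of fly fishing in Scotland dates back several centuries.",
]
labels = [1, 0]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
clf.fit(labeled_texts, labels)

# Score unlabeled documents from the same domain and keep likely e-commerce text.
docs = [
    "Shop the best beginner telescopes under $200 with next-day delivery",
    "The local astronomy club meets every Friday at the observatory",
]
keep = [d for d, p in zip(docs, clf.predict_proba(docs)[:, 1]) if p > 0.5]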


Evaluation

Token Classification

(Figure: token classification benchmark results.)

With 2–3x fewer parameters, RexBERT surpasses the performance of the ModernBERT series.

Semantic Similarity

(Figure: semantic similarity benchmark results.)

RexBERT models outperform all other models in their parameter/size category.


Usage Examples

1) Masked language modeling

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

m = AutoModelForMaskedLM.from_pretrained("thebajajra/RexBERT-base")
t = AutoTokenizer.from_pretrained("thebajajra/RexBERT-base")
fill = pipeline("fill-mask", model=m, tokenizer=t)

fill("Best [MASK] headphones under $100.")

2) Embeddings / feature extraction

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("thebajajra/RexBERT-base")
enc = AutoModel.from_pretrained("thebajajra/RexBERT-base")

texts = ["nike air zoom pegasus 40", "running shoes pegasus zoom nike"]
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = enc(**batch)
# Mean-pool last hidden state
attn = batch["attention_mask"].unsqueeze(-1)
emb = (out.last_hidden_state * attn).sum(1) / attn.sum(1)
# Normalize for cosine similarity (recommended for retrieval)
emb = torch.nn.functional.normalize(emb, p=2, dim=1)
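
Since the embeddings are L2-normalized, cosine similarity reduces to a dot product; a quick check on the two example texts:

# Cosine similarity between the two product strings above
sim = (emb[0] @ emb[1]).item()
print(f"cosine similarity: {sim:.3f}")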

3) Text classification fine-tune

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

tok = AutoTokenizer.from_pretrained("thebajajra/RexBERT-base")
NUM_LABELS = 4  # placeholder: number of target classes (e.g., product categories)
model = AutoModelForSequenceClassification.from_pretrained("thebajajra/RexBERT-base", num_labels=NUM_LABELS)

# Prepare your Dataset objects: train_ds, val_ds (text→label)
args = TrainingArguments(
    output_dir="rexbert-cls",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=3e-5,
    num_train_epochs=3,
    eval_strategy="steps",
    save_strategy="steps",  # must match eval_strategy when load_best_model_at_end=True
    fp16=True,
    report_to="none",
    load_best_model_at_end=True,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds, tokenizer=tok)
trainer.train()
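
4) Token classification fine-tune (attribute extraction)

Attribute extraction (brand, color, size, material) from the use-case list can be framed as token classification. The sketch below is illustrative: the BIO label set, train_ds, and val_ds are placeholders for your own attribute-annotated data.

from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

# Placeholder BIO label set for product attributes
labels = ["O", "B-BRAND", "I-BRAND", "B-COLOR", "I-COLOR", "B-SIZE", "I-SIZE"]

tok = AutoTokenizer.from_pretrained("thebajajra/RexBERT-base")
model = AutoModelForTokenClassification.from_pretrained(
    "thebajajra/RexBERT-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

collator = DataCollatorForTokenClassification(tokenizer=tok)

# train_ds / val_ds are tokenized datasets whose "labels" are aligned to subword tokens
args = TrainingArguments(
    output_dir="rexbert-attr",
    per_device_train_batch_size=32,
    learning_rate=3e-5,
    num_train_epochs=3,
    fp16=True,
    report_to="none",
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds,
                  data_collator=collator, tokenizer=tok)
trainer.train()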

Model Architecture & Compatibility

  • Architecture: Encoder-only, ModernBERT-style base model.
  • Libraries: Works with 🤗 Transformers; supports fill-mask and feature-extraction pipelines.
  • Context length: Increased during the Context Extension phase; ensure max_position_embeddings in config.json matches your desired maximum length (see the snippet after this list).
  • Files: config.json, tokenizer files, and (optionally) heads for MLM or classification.
  • Export: Standard PyTorch weights; you can export ONNX / TorchScript for production if needed.
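
A quick way to check the shipped context length from the config (assuming the standard 🤗 max_position_embeddings field):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("thebajajra/RexBERT-base")
print(cfg.max_position_embeddings)  # maximum sequence length the model is configured for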

Responsible & Safe Use

  • Biases: Commerce data can encode brand, price, and region biases; audit downstream classifiers/retrievers for disparate error rates across categories/regions.
  • Sensitive content: Add filters for adult/regulated items; document moderation thresholds if you release classifiers.
  • Privacy: Do not expose PII; ensure training data complies with terms and applicable laws.
  • Misuse: This model is not a substitute for legal/compliance review for listings.

License

  • License: apache-2.0.

Maintainers & Contact

