TFLite version of sentence-transformers/all-MiniLM-L6-v2

This repository contains the sentence-transformers/all-MiniLM-L6-v2 model, converted to the TensorFlow Lite (TFLite) format for efficient on-device inference.

Two versions of the model are provided:

  • all-MiniLM-L6-v2.tflite: The standard Float32 model.
  • all-MiniLM-L6-v2-quant.tflite: An INT8 (dynamic range) quantized version, roughly 4x smaller and significantly faster on CPU, making it well suited to mobile and edge applications.

Model Details

Model Description

This is a sentence-transformers model that maps sentences & paragraphs to a 384-dimensional dense vector space. It can be used for tasks like clustering, semantic search, or sentence similarity. This specific version has been converted to TFLite to enable high-performance applications on edge devices like mobile phones, Raspberry Pi, or embedded systems.

The conversion was performed using the standard TensorFlow Lite converter with dynamic range quantization for the INT8 version.
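
For reference, a minimal sketch of how such a conversion can be reproduced, assuming the original model has already been exported as a TensorFlow SavedModel (the "minilm_savedmodel" path below is a placeholder, not a file in this repository):

import tensorflow as tf

# Placeholder path: point this at your own SavedModel export of the original model
converter = tf.lite.TFLiteConverter.from_saved_model("minilm_savedmodel")

# Standard Float32 model
tflite_model = converter.convert()
with open("all-MiniLM-L6-v2.tflite", "wb") as f:
    f.write(tflite_model)

# Dynamic range quantization: weights stored as INT8, activations kept in float
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
with open("all-MiniLM-L6-v2-quant.tflite", "wb") as f:
    f.write(tflite_quant_model)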

  • Original Model by: UKPLab / Sentence-Transformers
  • Model type: Sentence-Embedding Transformer
  • Language(s): English
  • License: apache-2.0
  • Converted from: sentence-transformers/all-MiniLM-L6-v2

Model Sources
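
  • Original model: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2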

Uses

Direct Use

The primary use of this model is to compute sentence embeddings on-device. It is particularly well-suited for:

  • Semantic search in mobile apps (a ranking sketch follows this list).
  • Text clustering and classification on edge devices.
  • Finding similar items based on text descriptions in offline-first applications.
  • Running as a lightweight microservice on CPU-constrained hardware.
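
To make the semantic search use case concrete, here is a minimal ranking sketch. It assumes the query and corpus embeddings have already been produced (and L2-normalized) with the TFLite example further down, so cosine similarity reduces to a dot product:

import numpy as np

def search(query_embedding, corpus_embeddings, top_k=3):
    # With L2-normalized vectors, cosine similarity is a plain dot product.
    scores = corpus_embeddings @ query_embedding  # shape: (N,)
    best = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in best]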

Out-of-Scope Use

This is an embedding model, not a generative one: it cannot produce text. Its knowledge is frozen at the original training data's cut-off date, and it has no information about later events.

Bias, Risks, and Limitations

The biases, risks, and limitations of this model are inherited from the base model, all-MiniLM-L6-v2. The original model was trained on a large corpus of text from the internet and may reflect the societal and historical biases present in that data. Users should be aware of this when using the model in downstream applications. In addition, dynamic range quantization can cause a small drop in embedding quality relative to the Float32 model, so the quantized variant should be validated on your own similarity or retrieval task before deployment.

How to Get Started with the Model

You can use the TFLite models with the tf.lite interpreter from the tensorflow Python package (or the lighter tflite-runtime package), or directly in a mobile application (Android/iOS).

Python Example

# 1. Install necessary libraries
!pip install tensorflow
!pip install huggingface_hub
!pip install tokenizers

# 2. Import libraries
import tensorflow as tf
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
import numpy as np

# 3. Download model and tokenizer from the Hub
REPO_ID = "Nihal2000/all-MiniLM-L6-v2-quant.tflite"  # this repository
TFLITE_MODEL_FILENAME = "all-MiniLM-L6-v2-quant.tflite"
TOKENIZER_FILENAME = "tokenizer.json"  # the original model's tokenizer.json must be present in the repo

model_path = hf_hub_download(repo_id=REPO_ID, filename=TFLITE_MODEL_FILENAME)
tokenizer_path = hf_hub_download(repo_id=REPO_ID, filename=TOKENIZER_FILENAME)

# 4. Load the TFLite model and tokenizer
tokenizer = Tokenizer.from_file(tokenizer_path)
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]", length=128)
tokenizer.enable_truncation(max_length=128)  # keep inputs within the model's sequence length
interpreter = tf.lite.Interpreter(model_path=model_path)

# 5. Prepare and run inference
sentences = ["This is an example sentence.", "Here is another one."]

# Tokenize input (padded to a fixed length above) as int32 arrays
encoded_input = [tokenizer.encode(s) for s in sentences]
input_ids = np.array([e.ids for e in encoded_input], dtype=np.int32)
attention_mask = np.array([e.attention_mask for e in encoded_input], dtype=np.int32)

# Resize interpreter inputs to the batch shape and allocate tensors.
# get_input_details() returns a list; match entries by name, since the
# input order is not guaranteed (adjust the substrings below if your
# converted graph uses different input names).
input_details = interpreter.get_input_details()
ids_detail = next(d for d in input_details if "input_ids" in d["name"])
mask_detail = next(d for d in input_details if "attention_mask" in d["name"])
interpreter.resize_tensor_input(ids_detail["index"], input_ids.shape)
interpreter.resize_tensor_input(mask_detail["index"], attention_mask.shape)
interpreter.allocate_tensors()

# Set input tensors, cast to the dtype each input expects (int32 or int64)
interpreter.set_tensor(ids_detail["index"], input_ids.astype(ids_detail["dtype"]))
interpreter.set_tensor(mask_detail["index"], attention_mask.astype(mask_detail["dtype"]))

# Run inference
interpreter.invoke()

# Get the output and L2-normalize it
output_details = interpreter.get_output_details()
embeddings = interpreter.get_tensor(output_details[0]["index"])
normalized_embeddings = tf.math.l2_normalize(embeddings, axis=1).numpy()

print("Embeddings generated successfully!")
print(f"Shape: {normalized_embeddings.shape}")
print(f"First embedding: {normalized_embeddings[:5]}...")