Model seems to be incredibly slow on CPU
Using this on Google Colab, getting the embedding of a sentence with 1400 words took 31 minutes. Is this normal behavior?
When I run it locally, I see that only one core is being used while all the other cores sit idle. Other embedding models use all my cores. Is this the expected behavior? We want to use this model in production, and a 30-minute latency for a single sentence is ludicrously high.
Getting the sentence embedding for around 250 sentences took around 7 hours.
Here's the CPU utilization while model.encode() is running:
Am I doing something wrong? Is there a flag to enable multithreading?
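In case it's relevant, torch's CPU thread settings can be checked and set with the standard torch calls below (nothing model-specific, just a sketch of what I'm looking at):

import torch

# Intra-op threads (tensor math) and inter-op threads torch will use
print(torch.get_num_threads())
print(torch.get_num_interop_threads())

# Explicitly request all physical cores (adjust the count to your machine)
torch.set_num_threads(16)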
Hi @adi751, I'll look into the issue. In the meantime, you can try using sentence-transformers for inference; it should be much faster.
I am using sentence transformers.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
model.encode(text)  # text is a sentence of 1400 words
Hmm yes, it does take an unusually long time on Colab. Does the same issue occur when you run it locally? I just ran this code snippet on my machine, and it took only 1.5 seconds, while bge-m3 took 1.3 seconds. In general, our model is expected to be slightly slower than bge because we use relative positional embeddings and LoRA adapters.
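For reference, the comparison boils down to something like this (rough sketch; the text is just a placeholder and timings will vary by machine):

import time
from sentence_transformers import SentenceTransformer

text = " ".join(["word"] * 1400)  # stand-in for a ~1400-word sentence

# assumes both models are already downloaded; the first encode can include warm-up overhead
for name in ["jinaai/jina-embeddings-v3", "BAAI/bge-m3"]:
    model = SentenceTransformer(name, trust_remote_code=True)
    start = time.perf_counter()
    model.encode(text)
    print(name, f"{time.perf_counter() - start:.2f} s")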
Yes, the issue first showed up locally, and I could see that the resource utilization was off. I tested on Colab just to rule out any weird configuration on my local machine.
Are there any other dependencies besides einops that are needed? Only one core being used at a time is extremely strange to me, since torch should handle parallel tensor operations on its own. I don't see why only one core is active...
edit: Are these latency figures for CPU or GPU?
I ran it on a CPU.
As for additional dependencies, I don't think there are any more than what's listed in the README. I tested this in a freshly initialized venv. Here's what it looks like, in case it's helpful:
certifi==2024.8.30
charset-normalizer==3.3.2
einops==0.8.0
filelock==3.16.1
fsspec==2024.9.0
huggingface-hub==0.25.1
idna==3.10
Jinja2==3.1.4
joblib==1.4.2
MarkupSafe==2.1.5
mpmath==1.3.0
networkx==3.3
numpy==2.1.1
packaging==24.1
pillow==10.4.0
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
safetensors==0.4.5
scikit-learn==1.5.2
scipy==1.14.1
sentence-transformers==3.1.1
sympy==1.13.3
threadpoolctl==3.5.0
tokenizers==0.19.1
torch==2.4.1
tqdm==4.66.5
transformers==4.44.2
typing_extensions==4.12.2
urllib3==2.2.3
I have created fresh environments and installed sentence_transformers, torch, and einops on 3 separate machines (local laptop, EC2 server, Colab). I'm seeing similar latencies on all 3 systems, and the CPU utilization behavior on the EC2 server is the same as what I described above.
edit: here's the pastebin for my pip freeze: https://p.ip.fi/XxE1
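If it helps narrow things down, a quick sanity check of the loaded model's device and weight dtype (generic torch / sentence-transformers attributes) looks like this:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

# Which device the model lives on and what dtype its weights are in
print(model.device)                    # e.g. cpu
print(next(model.parameters()).dtype)  # e.g. torch.bfloat16 or torch.float32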
I observed the exact same behavior on my machine: only one or two cores are in use (I have 16 cores available), and it takes forever to process some sentences.
I experience the same issues. Very slow.
OK, so the issue seems to be that many CPUs lack support for efficient bf16 operations, which causes the workload to run on a single core instead of being distributed across all available cores. As a result, operations like F.linear take around 400x longer in bf16 than in fp32. To fix this, I'm updating the implementation to default to fp32 when running on CPU. Let me know if inference is faster now!
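If you want to verify this on your own hardware, or you're pinned to an older revision, here's a rough sketch: a standalone F.linear timing in bf16 vs fp32, plus a user-side cast to fp32 (.float() works on any torch module; the exact slowdown factor will depend on your CPU):

import time
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# 1) Isolate the dtype effect: time F.linear in fp32 vs bf16.
#    Shapes are arbitrary, just large enough to make the gap visible on CPUs
#    without native bf16 support.
x = torch.randn(1, 4096, 1024)
w = torch.randn(1024, 1024)
for dtype in (torch.float32, torch.bfloat16):
    xd, wd = x.to(dtype), w.to(dtype)
    start = time.perf_counter()
    for _ in range(10):
        F.linear(xd, wd)
    print(dtype, f"{time.perf_counter() - start:.3f} s")

# 2) User-side workaround on an older revision: cast the loaded model to fp32.
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
model = model.float()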
Yes, much much faster now, and utilizing all cores.
Thanks a ton for the quick fix!