Model seems to be incredibly slow on CPU

#34
by adi751 - opened

Using this on Google Colab, getting the embedding of a sentence with 1400 words took 31 minutes. Is this normal behavior?

When I run it locally, I see that only one core is being used; all the other cores are dormant. Other embedding models use all my cores. Is this the expected behavior? We want to use this model in production, and a 30-minute latency for a single sentence is ludicrously high.

Getting the sentence embedding for around 250 sentences took around 7 hours.

Here's the CPU utilization when model.encode() is running
[screenshot: CPU utilization while model.encode() is running]

Am I doing something wrong? Is there a flag to enable multithreading?

Here's what my CPU utilization looks like when I use a different embedding model (BGE-large-en) on the same input sentence:

[screenshot: CPU utilization with BGE-large-en on the same input]

Jina AI org

Hi @adi751, I'll look into the issue. In the meantime, you can try using sentence-transformers for inference; it should be much faster.

I am using sentence-transformers.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
model.encode(text)  # text is a sentence of 1400 words
Jina AI org

Hmm yes, it does take an unusually long time on Colab. Does the same issue occur when you run it locally? I just ran this code snippet on my machine, and it took only 1.5 seconds, while bge-m3 took 1.3 seconds. In general, our model is expected to be slightly slower than bge because we use relative positional embeddings and LoRA adapters.
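
For anyone who wants to reproduce the comparison, something along these lines should do it (the repeated-word text is just a stand-in for the 1400-word sentence, and I'm assuming BAAI/bge-m3 as the model id for bge-m3):

import time
from sentence_transformers import SentenceTransformer

text = "word " * 1400  # stand-in for a ~1400-word sentence

for name in ["jinaai/jina-embeddings-v3", "BAAI/bge-m3"]:
    model = SentenceTransformer(name, trust_remote_code=True)
    start = time.perf_counter()
    model.encode(text)
    print(f"{name}: {time.perf_counter() - start:.2f}s")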

Yes, the issue first showed up locally, and I could see that the resource utilization was whack. I tested on Colab just to rule out any weird configuration on my local machine.

Are there any dependencies other than einops that are needed? Only one core being used at a time is extremely weird to me; torch should handle parallelizing the tensor operations. I don't see why only one core is being used...
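
For what it's worth, here's the quick sanity check I'd run to confirm torch actually sees multiple cores (get_num_threads / set_num_threads are standard torch settings, nothing model-specific):

import torch

print(torch.get_num_threads())          # intra-op threads used for tensor ops
print(torch.get_num_interop_threads())  # inter-op parallelism
torch.set_num_threads(8)                # e.g. pin to 8 threads; adjust to your core count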

edit: Are these the latency figures for CPU or GPU?

Jina AI org

I ran it on a CPU.

As for additional dependencies, I don't think there are any more than what's listed in the README. I tested this in a freshly initialized venv. Here's what it looks like, in case it's helpful:
certifi==2024.8.30
charset-normalizer==3.3.2
einops==0.8.0
filelock==3.16.1
fsspec==2024.9.0
huggingface-hub==0.25.1
idna==3.10
Jinja2==3.1.4
joblib==1.4.2
MarkupSafe==2.1.5
mpmath==1.3.0
networkx==3.3
numpy==2.1.1
packaging==24.1
pillow==10.4.0
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
safetensors==0.4.5
scikit-learn==1.5.2
scipy==1.14.1
sentence-transformers==3.1.1
sympy==1.13.3
threadpoolctl==3.5.0
tokenizers==0.19.1
torch==2.4.1
tqdm==4.66.5
transformers==4.44.2
typing_extensions==4.12.2
urllib3==2.2.3

I have created fresh environments and installed sentence_transformers, torch, and einops on 3 separate machines (local laptop, EC2 server, Colab). I'm facing similar latencies on all 3 systems. The CPU utilization behavior is similar on EC2.

edit: here's the pastebin for my pip freeze: https://p.ip.fi/XxE1

I observed the exact same behavior on my machine. Only one or two cores are in use (I have 16 cores available).
It takes forever to process some sentences.

I experience the same issues. Very slow.

Jina AI org

OK, so the issue seems to be that many CPUs lack support for efficient bf16 operations, which causes the workload to run on a single core instead of being distributed across all available cores. As a result, operations like F.linear take around 400x longer in bf16 than in fp32. To fix this, I'm updating the implementation to use fp32 by default when running on CPU. Let me know if inference is faster now!
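
If you want to check whether your CPU is affected, a microbenchmark along these lines should show the gap (the matrix sizes are arbitrary; the dtype is what matters):

import time
import torch
import torch.nn.functional as F

x = torch.randn(256, 1024)
w = torch.randn(1024, 1024)

for label, dtype in [("fp32", torch.float32), ("bf16", torch.bfloat16)]:
    xd, wd = x.to(dtype), w.to(dtype)
    start = time.perf_counter()
    for _ in range(100):
        F.linear(xd, wd)   # the op that dominates the slowdown in bf16
    print(f"{label}: {time.perf_counter() - start:.3f}s")

Until the update lands, casting the loaded model with model.to(torch.float32) should have a similar effect, but treat that as an untested workaround.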

Yes, much much faster now, and utilizing all cores.

Thanks a ton for the quick fix!

adi751 changed discussion status to closed
