Model seems to be incredibly slow on CPU

#34

by adi751 - opened Sep 24, 2024

Sep 24, 2024

On Google Colab, getting the embedding of a single 1,400-word sentence took 31 minutes. Is this normal behavior?

When I run it locally, I see that only one core is being used while all the other cores sit idle. Other embedding models use all my cores. Is this the expected behavior? We want to use this model in production, and a 30-minute latency for one sentence is ludicrously high.

Getting the sentence embedding for around 250 sentences took around 7 hours.

Here's the CPU utilization while model.encode() is running: [CPU utilization screenshot]

Am I doing something wrong? Is there a flag to enable multithreading?

adi751

Sep 24, 2024

Here's what my CPU utilization looks like when I use a different embedding model (BGE-large-en) on the same input sentence: [CPU utilization screenshot]

jupyterjazz

Jina AI org Sep 24, 2024

Hi @adi751, I'll look into the issue. In the meantime, you can try using sentence-transformers for inference; it should be much faster.

adi751

Sep 24, 2024

I am using sentence-transformers:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
text = "..."  # a single sentence of about 1,400 words
embedding = model.encode(text)

jupyterjazz

Jina AI org Sep 24, 2024

Hmm, yes, it does take an unusually long time on Colab. Does the same issue occur when you run it locally? I just ran this code snippet on my machine, and it took only 1.5 seconds, while bge-m3 took 1.3 seconds. In general, our model is expected to be slightly slower than BGE because we use relative positional embeddings and LoRA adapters.
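When comparing latencies as close as 1.5 s and 1.3 s, a single run can be misleading, because the first call may include one-off costs such as lazy initialization. A minimal timing sketch, assuming plain wall-clock measurement is good enough; time_call is a hypothetical helper, not part of sentence-transformers:

```python
import statistics
import time
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 5, warmup: int = 1) -> dict:
    """Hypothetical helper: run fn several times and report wall-clock latency in seconds."""
    for _ in range(warmup):
        fn()  # discard warm-up runs, which may include one-off setup costs
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {"min": min(samples), "median": statistics.median(samples), "runs": repeats}
```

With the snippet above, usage would look like time_call(lambda: model.encode(text)), run once per model on the same machine and input.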

adi751

Sep 24, 2024


edited Sep 24, 2024

Yes, the issue first showed up locally, and I could see that the resource utilization was clearly off. I tested on Colab just to rule out any odd configuration on my local machine.

Are any libraries needed other than einops? Only one core being used at a time is extremely weird to me, since torch should parallelize the tensor operations. I don't see why only one core is active...
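Since which kernels PyTorch runs (and how well they parallelize) depends partly on the tensor dtype, another cheap diagnostic is to check what dtypes the loaded weights are in. The sketch below counts parameter elements per dtype for any torch.nn.Module; SentenceTransformer subclasses nn.Sequential, so param_dtypes(model) should work on it. param_dtypes is a hypothetical helper, and the toy layer is only there to show the output shape.

```python
from collections import Counter

import torch
from torch import nn


def param_dtypes(module: nn.Module) -> Counter:
    """Hypothetical helper: count parameter elements per dtype in a module."""
    counts = Counter()
    for p in module.parameters():
        counts[p.dtype] += p.numel()
    return counts


if __name__ == "__main__":
    # Toy example: a Linear(8, 4) layer cast to bf16 has 8*4 weights + 4 biases.
    toy = nn.Linear(8, 4).to(torch.bfloat16)
    print(param_dtypes(toy))  # Counter({torch.bfloat16: 36})
```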

Edit: Are these latency figures for CPU or GPU?

jupyterjazz

Jina AI org Sep 24, 2024

I ran it on a CPU.

As for additional dependencies, I don't think there are any beyond what's listed in the README. I tested this in a freshly created venv; here's its pip freeze, in case it's helpful:
certifi==2024.8.30
charset-normalizer==3.3.2
einops==0.8.0
filelock==3.16.1
fsspec==2024.9.0
huggingface-hub==0.25.1
idna==3.10
Jinja2==3.1.4
joblib==1.4.2
MarkupSafe==2.1.5
mpmath==1.3.0
networkx==3.3
numpy==2.1.1
packaging==24.1
pillow==10.4.0
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
safetensors==0.4.5
scikit-learn==1.5.2
scipy==1.14.1
sentence-transformers==3.1.1
sympy==1.13.3
threadpoolctl==3.5.0
tokenizers==0.19.1
torch==2.4.1
tqdm==4.66.5
transformers==4.44.2
typing_extensions==4.12.2
urllib3==2.2.3
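To compare environments across machines without posting a full pip freeze, a short script can print just the packages that matter here. This sketch uses the standard library's importlib.metadata; the package list and the installed_versions helper are assumptions chosen for illustration.

```python
from importlib import metadata

# Assumed shortlist of packages relevant to this thread; adjust as needed.
PACKAGES = ["torch", "transformers", "sentence-transformers", "tokenizers", "einops"]


def installed_versions(names: list) -> dict:
    """Hypothetical helper: map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```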

adi751

Sep 24, 2024


edited Sep 24, 2024

I have created fresh environments and installed sentence-transformers, torch, and einops on three separate machines (local laptop, EC2 server, Colab). I'm seeing similar latencies on all three systems, and the CPU utilization pattern on EC2 looks the same as on my laptop.

Edit: here's the pastebin for my pip freeze: https://p.ip.fi/XxE1
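When the same code behaves identically on several machines, it can also help to record what instruction-set extensions each CPU reports. The sketch below, which assumes Linux and its /proc/cpuinfo format, looks for the avx512_bf16 and amx_bf16 flags. The helper name is hypothetical, and whether PyTorch actually uses these instructions depends on its build, so treat the result as a hint rather than proof.

```python
from pathlib import Path

# Linux /proc/cpuinfo flags indicating hardware bf16 support (x86).
BF16_FLAGS = ("avx512_bf16", "amx_bf16")


def bf16_cpu_flags(cpuinfo_text: str) -> set:
    """Hypothetical helper: return which bf16-related flags appear in cpuinfo-style text."""
    present = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            present |= flags.intersection(BF16_FLAGS)
    return present


if __name__ == "__main__":
    info = Path("/proc/cpuinfo")
    if info.exists():
        print(bf16_cpu_flags(info.read_text()) or "no bf16 flags reported")
    else:
        print("/proc/cpuinfo not available (not Linux?)")
```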

cnmoro

Sep 24, 2024

I observed exactly the same behavior on my machine: only one or two cores in use, although I have 16 available. It takes forever to process some sentences.

hveigz

Sep 25, 2024

I'm experiencing the same issue. Very slow.

jupyterjazz

Jina AI org Sep 25, 2024

OK, so the issue seems to be that many CPUs lack support for efficient bf16 operations, which causes the workload to run on a single core instead of being distributed across all available cores. As a result, operations like F.linear take around 400x longer in bf16 than in fp32. To fix this, I'm updating the implementation to use fp32 by default when running on CPU. Let me know if inference is faster now!
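The slowdown described above can be checked in isolation with a micro-benchmark like the one below, a sketch that times a single F.linear call on CPU in fp32 and in bf16. The tensor sizes are arbitrary assumptions, and the ratio depends heavily on the CPU and the PyTorch build, so it may be nowhere near 400x on a given machine. pick_dtype and prepare are hypothetical helpers that illustrate an "fp32 by default on CPU" policy; they are not the model's actual implementation.

```python
import time

import torch
import torch.nn.functional as F
from torch import nn


def bench_linear(dtype: torch.dtype, tokens: int = 256, dim: int = 512, repeats: int = 3) -> float:
    """Best-of-N wall time in seconds for one F.linear call on CPU in the given dtype."""
    torch.manual_seed(0)
    x = torch.randn(tokens, dim).to(dtype)
    w = torch.randn(dim, dim).to(dtype)
    F.linear(x, w)  # warm-up call
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        F.linear(x, w)
        best = min(best, time.perf_counter() - start)
    return best


def pick_dtype(device: torch.device, preferred: torch.dtype = torch.bfloat16) -> torch.dtype:
    """Hypothetical policy: keep the low-precision dtype off-CPU, fall back to fp32 on CPU."""
    return torch.float32 if device.type == "cpu" else preferred


def prepare(module: nn.Module, device: torch.device) -> nn.Module:
    """Move a module to the device and cast it to the dtype chosen by pick_dtype."""
    return module.to(device=device, dtype=pick_dtype(device))


if __name__ == "__main__":
    for dt in (torch.float32, torch.bfloat16):
        print(dt, f"{bench_linear(dt):.4f} s")
```

Casting weights to fp32 doubles their memory footprint relative to bf16, which is usually an acceptable trade on CPU when the alternative is falling back to slow single-core bf16 kernels.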

adi751

Sep 25, 2024

Yes, much much faster now, and utilizing all cores.

Thanks a ton for the quick fix!

adi751 changed discussion status to closed Sep 25, 2024
