---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- embedder
- embedding
- models
- GGUF
- Bert
- Nomic
- Gist
- BGE
- Jina
- text-embeddings-inference
- RAG
- Rerank
- similarity
- PDF
- Parsing
- Parser
misc:
- text-embeddings-inference
language:
- en
- de
architecture:

---

# <b>All models tested with ALLM (AnythingLLM) using LM-Studio as server; all models should also work with Ollama</b>
<b>The setup for local documents described below is almost the same everywhere; GPT4All has only one model (nomic), and koboldcpp has no built-in support yet, but it is in development</b><br>

(sometimes the results are more truthful if the "chat with document only" option is used)<br>
By the way, the embedder is only one part of a good RAG pipeline<br>
<b>&#x21e8;</b> give me a ❤️, if you like  ;)<br>
<br>
<b>My short impression:</b>
<ul style="line-height: 1.05;">
<li>nomic-embed-text (up to 2048t context length)</li> 
<li>mxbai-embed-large</li>
<li>mug-b-1.6</li>
<li>snowflake-arctic-embed-l-v2.0 (up to 8192t context length)</li>
<li>Ger-RAG-BGE-M3 (german, up to 8192t context length)</li>
<li>german-roberta</li>
<li>bge-m3 (up to 8192t context length)</li>
</ul>
These work well; all others are up to you! Some models are very similar. (jina- and qwen-based models are not yet supported by LM-Studio)<br>
With the same settings, these embedders found the same 6-7 snippets out of 10 from a book. That means only 3-4 snippets differed, but I didn't test it extensively.
<br>
<br>
...

# Short usage hints (example for a large context with many expected hits):
Set the context length (Max Tokens) of your main LLM to 16000t, set your embedder model's (Max Embedding Chunk Length) to 1024t, set (Max Context Snippets) to 14,
and in ALLM also set (Text Splitting & Chunking Preferences - Text Chunk Size) to 1024-character parts and (Search Preference) to "accuracy".
<br>

-> OK, what does that mean?<br>
Your document will be embedded as x chunks (snippets) of 1024t each.<br>
You can retrieve 14 snippets of 1024t each (~14000t) from your document, i.e. ~10000 words (10 pages), leaving ~2000t (of 16000t) for the answer, i.e. ~1000 words (2 pages).
<br>
You can tune this to your needs, e.g. 8 snippets of 2048t, or 28 snippets of 512t ... (every time you change the chunk length, the document must be embedded again). With these settings everything fits best for ONE answer; if you need more room for a conversation, set lower values and/or disable the document.
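As a rough sanity check of this arithmetic, here is a minimal sketch (the words-per-token ratio is an assumed English rule of thumb, not a measured value):

```python
# Back-of-the-envelope token budget for the ALLM example above.
# Assumption: ~0.75 English words per token (rule of thumb, not exact).
WORDS_PER_TOKEN = 0.75

context_length = 16000   # main LLM (Max Tokens)
chunk_length   = 1024    # embedder (Max Embedding Chunk Length), in tokens
snippets       = 14      # Max Context Snippets

retrieval_budget = snippets * chunk_length            # 14336 t of document text
answer_budget    = context_length - retrieval_budget  # ~1664 t left for the answer

print(f"retrieval: {retrieval_budget} t (~{retrieval_budget * WORDS_PER_TOKEN:.0f} words)")
print(f"answer:    {answer_budget} t (~{answer_budget * WORDS_PER_TOKEN:.0f} words)")
```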
<ul style="line-height: 1.05;">
English and German differ by about 50%<br>
~5000 characters is one page of a book (whether German or English), but German words are longer, which means more tokens per word<br>
the example above is for English; for German add approx. 50% more tokens (1000 words ~1800t)<br>
<li>1200t (~1000 words, ~5000 characters) ~0.1GB VRAM usage; this is approx. one page with a small font</li>
<li>8000t (~6000 words) ~0.8GB VRAM usage</li>
<li>16000t (~12000 words) ~1.5GB VRAM usage</li>
<li>32000t (~24000 words) ~3GB VRAM usage</li>
</ul>
<br>
here is a tokenizer calculator<br>
<a href="https://quizgecko.com/tools/token-counter">https://quizgecko.com/tools/token-counter</a><br>
and a VRAM calculator (you need the link to the original model, NOT the GGUF)<br>
<a href="https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator">https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator</a><br>

...
<br>

# How embedding and search works:

You have a txt/pdf file, maybe 90000 words (~300 pages), say a book. You ask the model something like "what is described in the chapter called XYZ in relation to person ZYX".
Now it searches for keywords or semantically similar terms in the document. If it finds them, say words and meanings around "XYZ and ZYX",
a piece of text of 1024 tokens around these words "XYZ/ZYX" is cut out at that point. (In reality this is all done with encoded numbers (embeddings), but the principle is the same.)<br>
This text snippet is then used for your answer; a minimal sketch of this retrieval loop follows the list below.<br>
<ul style="line-height: 1.05;">
<li>If, for example, the word "XYZ" occurs 50 times in one file, not all 50 occurrences are used for the answer; only the top snippets from a fast ranking are used</li>
<li>If only one snippet matches your question, all other snippets can negatively influence your answer because they do not fit the topic (usually 4 to 32 snippets are fine)</li>
<li>If you expect multiple hits in your docs, try 16 snippets or more; if you expect only 2, don't use more!</li>
<li>If you use a chunk length of ~2048 characters you get more context; with ~512 characters you get more isolated facts, BUT a lower chunk length means more chunks and takes much longer.</li>
<li>A question like "summarize the document" is usually not useful; if the document has an introduction or summaries, the search lands there if you are lucky.</li>
<li>If a book has a table of contents or a bibliography, I would delete those pages, as they often contain relevant search terms but do not help answer your question.</li>
<li>If the document is small, like 10-20 pages, it is better to copy the whole text into the chat; some tools call this option "pin".</li>
</ul>
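A minimal sketch of this retrieval loop using sentence-transformers (the model name, file name, chunk size, and top-k below are placeholders; ALLM splits by characters rather than tokens, and real pipelines often add a reranking step on top):

```python
from sentence_transformers import SentenceTransformer, util

CHUNK_SIZE = 1024   # characters per snippet (placeholder, see settings above)
TOP_K = 8           # snippets handed to the main model

# Any embedder from the list at the bottom of this card should work here.
model = SentenceTransformer("BAAI/bge-m3")

text = open("book.txt", encoding="utf-8").read()
chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]

# Embed all snippets once, then the question; normalized, so dot product = cosine.
chunk_emb = model.encode(chunks, convert_to_tensor=True, normalize_embeddings=True)
query = "what is described in the chapter called XYZ in relation to person ZYX"
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

# Rank all snippets by similarity and keep only the best TOP_K for the answer.
scores = util.cos_sim(query_emb, chunk_emb)[0]
top = scores.topk(TOP_K)
for score, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{score:.3f}  {chunks[idx][:80]!r}")
```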
<br>
...
<br>

# Nevertheless, the <b>main model is also important</b>! 
Especially in how it handles the context length, and I don't mean just the theoretical number you can set.
Some models can handle 128k or 1M tokens, but even with 16k or 32k input, the response with the same snippets as input is worse than with other well-developed models.<br>
<br>
llama3.1, llama3.2, qwen2.5, deepseek-r1-distill, gemma-3, granite, SauerkrautLM-Nemo(german) ... <br>
(llama3 or phi3.5 do not work well) <br><br>
<b>&#x21e8;</b> best models for english and german:<br>
granite3.2-8b (also the 2b version) - https://huggingface.co/ibm-research/granite-3.2-8b-instruct-GGUF<br>
Chocolatine-2-14B (other versions also) - https://huggingface.co/mradermacher/Chocolatine-2-14B-Instruct-DPO-v2.0b11-GGUF<br>
QwQ-LCoT- (7/14b) - https://huggingface.co/mradermacher/QwQ-LCoT-14B-Conversational-GGUF<br><br>

...
# Important -> the system prompt (some examples):
<li>The system prompt is weighted with a certain amount of influence around your question. You can easily test this once without a system prompt or with a nonsensical one.</li>

"You are a helpful assistant who provides an overview of ... under the aspects of ... . 
You use attached excerpts from the collection to generate your answers! 
Weight each individual excerpt in order, with the most important excerpts at the top and the less important ones further down. 
The context of the entire article should not be given too much weight.  
Answer the user's question!  
After your answer, briefly explain why you included excerpts (1 to X) in your response and justify briefly if you considered some of them unimportant!"<br>
<i>(adapt it to your needs; this example works well when I consult a book about a person and a term related to them; the explanation part was just a test for myself)</i><br>

or:<br>

"You are an imaginative storyteller who crafts compelling narratives with depth, creativity, and coherence. 
Your goal is to develop rich, engaging stories that captivate readers, staying true to the themes, tone, and style appropriate for the given prompt.
You use attached excerpts from the collection to generate your answers!
When generating stories, ensure the coherence in characters, setting, and plot progression. Be creative and introduce imaginative twists and unique perspectives."<br>

or:<br>

"You are are a warm and engaging companion who loves to talk about cooking, recipes and the joy of food. 
Your aim is to share delicious recipes, cooking tips and the stories behind different cultures in a personal, welcoming and knowledgeable way."<br>
<br>
btw. <b>Jinja</b> templates are very new ... the usual templates with the usual models are fine, but merged models have a lot of optimization potential (but don't ask me, I'm not a coder)<br>
<br><br>

...
<br>
# DOC/PDF 2 TXT<br>
Prepare your documents yourself!<br>
Bad input = bad output!<br>
In most cases it is not immediately obvious in what form the document is handed to the embedder.
In nearly all cases images and tables, page numbers, chapters, and section/paragraph formatting are not handled well.
An easy start is to use a Python-based PDF parser (there are plenty).<br>
Options for simple text/table conversion (a minimal sketch follows the list):
<ul style="line-height: 1.05;">
<li>pdfplumber</li>
<li>fitz/PyMuPDF</li>
<li>Camelot</li>
</ul>
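For example, a minimal pdfplumber sketch, assuming a simple text-plus-tables PDF (the file name is a placeholder):

```python
import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    for page in pdf.pages:
        # Plain text, page by page; returns None on empty/image-only pages.
        text = page.extract_text() or ""
        print(text)
        # Tables come back as lists of rows; join cells with a separator
        # so the embedder later sees one line per table row.
        for table in page.extract_tables():
            for row in table:
                print(" | ".join(cell or "" for cell in row))
```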
All in all you can tune your code a lot, and you can add OCR manually.<br>
my option:<br>
<a href="https://huggingface.co/kalle07/pdf2txt_parser_converter">https://huggingface.co/kalle07/pdf2txt_parser_converter</a>

<br><br>
an all-in-one option for the future:
<ul style="line-height: 1.05;">
<li>docling - (opensource on github)</li>
</ul>
It comes with some ready-to-use examples that are already pretty good, ~10-20 lines of code each; see the sketch below.
<br>
<a href="https://github.com/docling-project/docling/tree/main/docs/examples">https://github.com/docling-project/docling/tree/main/docs/examples</a><br>
It also downloads some models automatically for OCR. The only thing I haven't found yet (maybe it doesn't exist) is reading out the font type, which works very well with <b>fitz</b>, for example, as sketched below.
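For that font readout, a minimal fitz/PyMuPDF sketch (the file name is a placeholder):

```python
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
for page in doc:
    # "dict" mode exposes one entry per text span, including its font name.
    for block in page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):   # image blocks have no "lines"
            for span in line["spans"]:
                print(span["font"], round(span["size"], 1), span["text"][:40])
```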
<br><br>
a large option for playing with many parser types (UI-based):
<ul style="line-height: 1.05;">
<li>ParseMyPDF</li>
</ul>
<a href="https://github.com/genieincodebottle/parsemypdf">https://github.com/genieincodebottle/parsemypdf</a><br>
<br>

...
<br>
# Indexing-only option<br>
One hint for fast search across 10000s of PDFs (it's only indexing, not embedding): use it as a simple way to find your top 5-10 articles or books, which you can then make available to an LLM.<br>
Jabref - https://github.com/JabRef/jabref/tree/v6.0-alpha?tab=readme-ov-file <br>
https://builds.jabref.org/main/ <br>
or<br>
docfetcher - https://docfetcher.sourceforge.io/en/index.html (yes, it's old, but very useful)
<br><br>
...
<br>
" on discord <b>sevenof9</b> "
<br><br>
...
<br>


# (ALL licenses and terms of use belong to the original authors)

...

<ul style="line-height: 1.05;">
<li>avemio/German-RAG-BGE-M3-MERGED-x-SNOWFLAKE-ARCTIC-HESSIAN-AI (German, English)</li>
<li>maidalun1020/bce-embedding-base_v1 (English and Chinese)</li>
<li>maidalun1020/bce-reranker-base_v1 (English, Chinese, Japanese and Korean)</li>
<li>BAAI/bge-reranker-v2-m3 (English and Chinese)</li>
<li>BAAI/bge-reranker-v2-gemma (English and Chinese)</li>
<li>BAAI/bge-m3 (English and Chinese)</li>
<li>avsolatorio/GIST-large-Embedding-v0 (English)</li>
<li>ibm-granite/granite-embedding-278m-multilingual (English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese)</li>
<li>ibm-granite/granite-embedding-125m-english</li>
<li>Labib11/MUG-B-1.6 (?)</li>
<li>mixedbread-ai/mxbai-embed-large-v1 (multi)</li>
<li>nomic-ai/nomic-embed-text-v1.5 (English, multi)</li>
<li>Snowflake/snowflake-arctic-embed-l-v2.0 (English, multi)</li>
<li>intfloat/multilingual-e5-large-instruct (100 languages)</li>
<li>T-Systems-onsite/german-roberta-sentence-transformer-v2</li>
<li>mixedbread-ai/mxbai-embed-2d-large-v1</li>
<li>jinaai/jina-embeddings-v2-base-en</li>
<li>Qwen/Qwen3-Embedding-0.6B</li>
<li>HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5</li>
  
</ul>