writinwaters committed · 4d5d480 · Parent(s): 3256beb

Updated RAGFlow's dataset configuration UI (#3376)

### What problem does this PR solve?

### Type of change

- [x] Documentation Update

Files changed:

- docs/guides/configure_knowledge_base.md +2 -2
- docs/quickstart.mdx +1 -1
- web/src/locales/en.ts +12 -14
docs/guides/configure_knowledge_base.md CHANGED

```diff
@@ -107,8 +107,8 @@ RAGFlow features visibility and explainability, allowing you to view the chunkin
 
 
 
-:::caution NOTE
-You can add keywords to a file chunk to increase its …
+:::caution NOTE
+You can add keywords to a file chunk to increase its ranking for queries containing those keywords. This action increases its keyword weight and can improve its position in search list.
 :::
 
 4. In Retrieval testing, ask a quick question in **Test text** to double check if your configurations work:
```

(The removed line was cut off by the page scrape; `…` marks the truncation.)
docs/quickstart.mdx CHANGED

```diff
@@ -307,7 +307,7 @@ RAGFlow features visibility and explainability, allowing you to view the chunkin
 
 
 :::caution NOTE
-You can add keywords to a file chunk to …
+You can add keywords to a file chunk to improve its ranking for queries containing those keywords. This action increases its keyword weight and can improve its position in search list.
 :::
 
 4. In Retrieval testing, ask a quick question in **Test text** to double check if your configurations work:
```

(The removed line was cut off by the page scrape; `…` marks the truncation.)
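The NOTE added to both docs describes keyword weighting: keywords attached to a chunk raise its score for queries that contain them. A minimal sketch of that idea in TypeScript, where `boostedScore`, `keywordWeight`, and the additive scoring rule are illustrative assumptions, not RAGFlow's actual implementation:

```typescript
// Hypothetical sketch of keyword-weight boosting in retrieval ranking.
// `baseScore` stands in for the similarity score a real system computes.

interface Chunk {
  text: string;
  keywords: string[]; // keywords manually added to the chunk
}

// Boost a chunk's base similarity score for every query term that
// matches one of its added keywords.
function boostedScore(
  baseScore: number,
  chunk: Chunk,
  query: string,
  keywordWeight = 0.1,
): number {
  const terms = query.toLowerCase().split(/\s+/);
  const kws = new Set(chunk.keywords.map((k) => k.toLowerCase()));
  const hits = terms.filter((t) => kws.has(t)).length;
  return baseScore + keywordWeight * hits;
}

const chunk: Chunk = {
  text: 'Configure the knowledge base before parsing files.',
  keywords: ['configuration', 'dataset'],
};
// Two of the three query terms match added keywords, so the base
// score of 0.5 is boosted by 2 * 0.1 (≈ 0.7).
console.log(boostedScore(0.5, chunk, 'dataset configuration tips'));
```

A chunk with no matching keywords keeps its base score unchanged, which is why adding keywords can only improve a chunk's position in the search list, never hurt it.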
web/src/locales/en.ts CHANGED

```diff
@@ -158,9 +158,9 @@ export default {
   html4excel: 'Excel to HTML',
   html4excelTip: `Excel will be parsed into HTML table or not. If it's FALSE, every row in Excel will be formed as a chunk.`,
   autoKeywords: 'Auto-keyword',
-  autoKeywordsTip: `Extract N keywords for each chunk to …`,
+  autoKeywordsTip: `Extract N keywords for each chunk to increase their ranking for queries containing those keywords. You can check or update the added keywords for a chunk from the chunk list. Be aware that extra tokens will be consumed by the LLM specified in 'System model settings'.`,
   autoQuestions: 'Auto-question',
-  autoQuestionsTip: `Extract N questions for each chunk to …`,
+  autoQuestionsTip: `Extract N questions for each chunk to increase their ranking for queries containing those questions. You can check or update the added questions for a chunk from the chunk list. This feature will not disrupt the chunking process if an error occurs, except that it may add an empty result to the original chunk. Be aware that extra tokens will be consumed by the LLM specified in 'System model settings'.`,
 },
 knowledgeConfiguration: {
   titleDescription:
@@ -210,13 +210,13 @@ export default {
   We assume that the manual has a hierarchical section structure, using the lowest section titles as basic unit for chunking documents. Therefore, figures and tables in the same section will not be separated, which may result in larger chunk sizes.
   </p>`,
   naive: `<p>Supported file formats are <b>DOCX, EXCEL, PPT, IMAGE, PDF, TXT, MD, JSON, EML, HTML</b>.</p>
-  <p>This method chunks files using …
+  <p>This method chunks files using a 'naive' method: </p>
   <p>
   <li>Use vision detection model to split the texts into smaller segments.</li>
   <li>Then, combine adjacent segments until the token count exceeds the threshold specified by 'Chunk token number', at which point a chunk is created.</li></p>`,
   paper: `<p>Only <b>PDF</b> file is supported.</p><p>
   Papers will be split by section, such as <i>abstract, 1.1, 1.2</i>. </p><p>
-  This approach enables the LLM to summarize the paper more effectively and provide more comprehensive, understandable responses.
+  This approach enables the LLM to summarize the paper more effectively and to provide more comprehensive, understandable responses.
   However, it also increases the context for AI conversations and adds to the computational cost for the LLM. So during a conversation, consider reducing the value of ‘<b>topN</b>’.</p>`,
   presentation: `<p>Supported file formats are <b>PDF</b>, <b>PPTX</b>.</p><p>
   Every page in the slides is treated as a chunk, with its thumbnail image stored.</p><p>
@@ -261,25 +261,23 @@ export default {
   <li>Every row in table will be treated as a chunk.</li>
   </ul>`,
   picture: `
-  <p>Image files are supported …
-  …
+  <p>Image files are supported, with video support coming soon.</p><p>
+  This method employs an OCR model to extract texts from images.
   </p><p>
-  If the text extracted by OCR is …
+  If the text extracted by the OCR model is deemed insufficient, a specified visual LLM will be used to provide a description of the image.
   </p>`,
   one: `
   <p>Supported file formats are <b>DOCX, EXCEL, PDF, TXT</b>.
   </p><p>
-  …
+  This method treats each document in its entirety as a chunk.
   </p><p>
-  …
+  Applicable when you require the LLM to summarize the entire document, provided it can handle that amount of context length.
   </p>`,
   knowledgeGraph: `<p>Supported file formats are <b>DOCX, EXCEL, PPT, IMAGE, PDF, TXT, MD, JSON, EML</b>
 
-  <p>…
-  …
-  <p>…
-  …
-  Mind the entiry type you need to specify.</p>`,
+  <p>This approach chunks files using the 'naive'/'General' method. It splits a document into segments and then combines adjacent segments until the token count exceeds the threshold specified by 'Chunk token number', at which point a chunk is created.</p>
+  <p>The chunks are then fed to the LLM to extract nodes and relationships for a knowledge graph and a mind map.</p>
+  <p>Ensure that you set the <b>Entity types</b>.</p>`,
   useRaptor: 'Use RAPTOR to enhance retrieval',
   useRaptorTip:
   'Recursive Abstractive Processing for Tree-Organized Retrieval, see https://huggingface.co/papers/2401.18059 for more information',
```

(Removed lines cut off by the page scrape are marked with `…`.)
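Several of the updated tooltips ('naive'/General and knowledge graph) describe the same chunking rule: split the document into small segments, then combine adjacent segments until the token count exceeds the threshold given by 'Chunk token number'. A hedged TypeScript sketch of that merge step, where `mergeSegments` is an illustrative name and token counting is approximated by whitespace word count rather than RAGFlow's actual tokenizer:

```typescript
// Sketch of the merge rule from the tooltips: accumulate adjacent
// segments until the token count reaches the threshold, then emit a
// chunk. Word count stands in for a real tokenizer here.

function countTokens(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

function mergeSegments(segments: string[], chunkTokenNum = 128): string[] {
  const chunks: string[] = [];
  let current = '';
  for (const seg of segments) {
    current = current ? `${current} ${seg}` : seg;
    if (countTokens(current) >= chunkTokenNum) {
      chunks.push(current); // threshold reached: emit a chunk
      current = '';
    }
  }
  if (current) chunks.push(current); // trailing partial chunk
  return chunks;
}

const segs = ['alpha beta', 'gamma delta epsilon', 'zeta', 'eta theta'];
console.log(mergeSegments(segs, 4));
// → [ 'alpha beta gamma delta epsilon', 'zeta eta theta' ]
```

With a threshold of 4, the first two segments merge to five tokens and are emitted as one chunk; the remaining segments never reach the threshold and are flushed as a final partial chunk. This is why a larger 'Chunk token number' yields fewer, longer chunks.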