writinwaters committed on
Commit
4d5d480
·
1 Parent(s): 3256beb

Updated RAGFlow's dataset configuration UI (#3376)

### What problem does this PR solve?



### Type of change

- [x] Documentation Update

docs/guides/configure_knowledge_base.md CHANGED
@@ -107,8 +107,8 @@ RAGFlow features visibility and explainability, allowing you to view the chunkin

 ![update chunk](https://github.com/infiniflow/ragflow/assets/93570324/1d84b408-4e9f-46fd-9413-8c1059bf9c76)

-:::caution NOTE
-You can add keywords to a file chunk to increase its relevance. This action increases its keyword weight and can improve its position in search list.
+:::caution NOTE
+You can add keywords to a file chunk to increase its ranking for queries containing those keywords. This action increases its keyword weight and can improve its position in the search list.
 :::

 4. In Retrieval testing, ask a quick question in **Test text** to double check if your configurations work:
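The note above says that keywords added to a chunk raise its keyword weight and thus its ranking for queries that mention them. As a purely hypothetical illustration (this is not RAGFlow's retrieval code; the scoring function and weight value are invented for the sketch), a scorer might add a fixed bonus for each query term that matches a chunk's keyword list:

```python
def score_chunk(query: str, chunk_text: str, keywords: list[str],
                keyword_weight: float = 2.0) -> float:
    """Hypothetical sketch of keyword-weight boosting, not RAGFlow's
    actual ranking (which combines term and vector similarity)."""
    q_terms = set(query.lower().split())
    # Base score: naive term overlap between the query and the chunk body.
    base = sum(1.0 for t in q_terms if t in chunk_text.lower())
    # Manually added keywords earn an extra weight when the query mentions them.
    boost = sum(keyword_weight for kw in keywords if kw.lower() in q_terms)
    return base + boost

chunks = [
    ("Install via Docker and start the server.", []),
    ("General troubleshooting steps for startup failures.", ["docker"]),
]
ranked = sorted(chunks, key=lambda c: score_chunk("docker fails", c[0], c[1]),
                reverse=True)
```

With the boost, the second chunk (tagged with the keyword "docker") outranks the first even though its body never contains the word, which is the effect the caution note describes.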
docs/quickstart.mdx CHANGED
@@ -307,7 +307,7 @@ RAGFlow features visibility and explainability, allowing you to view the chunkin
 ![update chunk](https://github.com/infiniflow/ragflow/assets/93570324/1d84b408-4e9f-46fd-9413-8c1059bf9c76)

 :::caution NOTE
-You can add keywords to a file chunk to increase its relevance. This action increases its keyword weight and can improve its position in search list.
+You can add keywords to a file chunk to improve its ranking for queries containing those keywords. This action increases its keyword weight and can improve its position in the search list.
 :::

 4. In Retrieval testing, ask a quick question in **Test text** to double check if your configurations work:
web/src/locales/en.ts CHANGED
@@ -158,9 +158,9 @@ export default {
 html4excel: 'Excel to HTML',
 html4excelTip: `Excel will be parsed into HTML table or not. If it's FALSE, every row in Excel will be formed as a chunk.`,
 autoKeywords: 'Auto-keyword',
-autoKeywordsTip: `Extract N keywords for each chunk to improve their ranking for queries containing those keywords. You can check or update the added keywords for a chunk from the chunk list. Be aware that extra tokens will be consumed by the LLM specified in 'System model settings'.`,
+autoKeywordsTip: `Extract N keywords for each chunk to increase their ranking for queries containing those keywords. You can check or update the added keywords for a chunk from the chunk list. Be aware that extra tokens will be consumed by the LLM specified in 'System model settings'.`,
 autoQuestions: 'Auto-question',
-autoQuestionsTip: `Extract N questions for each chunk to improve their ranking for queries containing those questions. You can check or update the added questions for a chunk from the chunk list. This feature will not disrupt the chunking process if an error occurs, except that it may add an empty result to the original chunk. Be aware that extra tokens will be consumed by the LLM specified in 'System model settings'.`,
+autoQuestionsTip: `Extract N questions for each chunk to increase their ranking for queries containing those questions. You can check or update the added questions for a chunk from the chunk list. This feature will not disrupt the chunking process if an error occurs, except that it may add an empty result to the original chunk. Be aware that extra tokens will be consumed by the LLM specified in 'System model settings'.`,
 },
 knowledgeConfiguration: {
 titleDescription:
@@ -210,13 +210,13 @@
 We assume that the manual has a hierarchical section structure, using the lowest section titles as basic unit for chunking documents. Therefore, figures and tables in the same section will not be separated, which may result in larger chunk sizes.
 </p>`,
 naive: `<p>Supported file formats are <b>DOCX, EXCEL, PPT, IMAGE, PDF, TXT, MD, JSON, EML, HTML</b>.</p>
-<p>This method chunks files using the 'naive' way: </p>
+<p>This method chunks files using a 'naive' method: </p>
 <p>
 <li>Use vision detection model to split the texts into smaller segments.</li>
 <li>Then, combine adjacent segments until the token count exceeds the threshold specified by 'Chunk token number', at which point a chunk is created.</li></p>`,
 paper: `<p>Only <b>PDF</b> file is supported.</p><p>
 Papers will be split by section, such as <i>abstract, 1.1, 1.2</i>. </p><p>
-This approach enables the LLM to summarize the paper more effectively and provide more comprehensive, understandable responses.
+This approach enables the LLM to summarize the paper more effectively and to provide more comprehensive, understandable responses.
 However, it also increases the context for AI conversations and adds to the computational cost for the LLM. So during a conversation, consider reducing the value of ‘<b>topN</b>’.</p>`,
 presentation: `<p>Supported file formats are <b>PDF</b>, <b>PPTX</b>.</p><p>
 Every page in the slides is treated as a chunk, with its thumbnail image stored.</p><p>
@@ -261,25 +261,23 @@
 <li>Every row in table will be treated as a chunk.</li>
 </ul>`,
 picture: `
-<p>Image files are supported. Video is coming soon.</p><p>
-If the picture has text in it, OCR is applied to extract the text as its text description.
+<p>Image files are supported, with video support coming soon.</p><p>
+This method employs an OCR model to extract texts from images.
 </p><p>
-If the text extracted by OCR is not enough, visual LLM is used to get the descriptions.
+If the text extracted by the OCR model is deemed insufficient, a specified visual LLM will be used to provide a description of the image.
 </p>`,
 one: `
 <p>Supported file formats are <b>DOCX, EXCEL, PDF, TXT</b>.
 </p><p>
-For a document, it will be treated as an entire chunk, no split at all.
+This method treats each document in its entirety as a chunk.
 </p><p>
-If you want to summarize something that needs all the context of an article and the selected LLM's context length covers the document length, you can try this method.
+Applicable when you require the LLM to summarize the entire document, provided it can handle that amount of context length.
 </p>`,
 knowledgeGraph: `<p>Supported file formats are <b>DOCX, EXCEL, PPT, IMAGE, PDF, TXT, MD, JSON, EML</b>

-<p>After files being chunked, it uses chunks to extract knowledge graph and mind map of the entire document. This method apply the naive ways to chunk files:
-Successive text will be sliced into pieces each of which is around 512 token number.</p>
-<p>Next, chunks will be transmited to LLM to extract nodes and relationships of a knowledge graph, and a mind map.</p>
-
-Mind the entiry type you need to specify.</p>`,
+<p>This approach chunks files using the 'naive'/'General' method. It splits a document into segments and then combines adjacent segments until the token count exceeds the threshold specified by 'Chunk token number', at which point a chunk is created.</p>
+<p>The chunks are then fed to the LLM to extract nodes and relationships for a knowledge graph and a mind map.</p>
+<p>Ensure that you set the <b>Entity types</b>.</p>`,
 useRaptor: 'Use RAPTOR to enhance retrieval',
 useRaptorTip:
 'Recursive Abstractive Processing for Tree-Organized Retrieval, see https://huggingface.co/papers/2401.18059 for more information',
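Several of the tooltips above describe the same merge step: combine adjacent segments until the token count exceeds the 'Chunk token number' threshold, then emit a chunk. A minimal sketch of that loop, assuming whitespace-separated words stand in for real tokenizer tokens (RAGFlow's actual tokenization and segment splitting differ):

```python
def merge_segments(segments: list[str], chunk_token_limit: int = 512) -> list[str]:
    """Combine adjacent text segments into chunks; a new chunk starts once
    adding the next segment would exceed the token limit. A 'token' here is
    just a whitespace-separated word, a stand-in for the real tokenizer."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for seg in segments:
        n = len(seg.split())
        if current and count + n > chunk_token_limit:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(seg)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# With a limit of 4 tokens, the second segment no longer fits after the first:
print(merge_segments(["alpha beta", "gamma delta epsilon", "zeta"], 4))
# → ['alpha beta', 'gamma delta epsilon zeta']
```

For the knowledge-graph method, these chunks would then be sent to the LLM for node/relationship extraction; only the merge step is sketched here.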