openbmb/MiniCPM4-Survey

@@ -58,7 +58,7 @@ Download [MiniCPM4-Survey](https://huggingface.co/openbmb/MiniCPM4-Survey) from
 We recommend using [MiniCPM-Embedding-Light](https://huggingface.co/openbmb/MiniCPM-Embedding-Light) as the embedding model, which can be downloaded from Hugging Face and placed in `model/MiniCPM-Embedding-Light`.
 ### Prepare the environment
-You can download the [paper data](https://www.kaggle.com/datasets/Cornell-University/arxiv) from Kaggle, then extract it. You can run `python dataset_process.py` to process the data and generate the retrieval database. Then you can run `python build_index.py` to build the retrieval database.
 ```
 cd ./code
@@ -66,7 +66,7 @@ curl -L -o ~/Downloads/arxiv.zip\
    https://www.kaggle.com/api/v1/datasets/download/Cornell-University/arxiv
 unzip ~/Downloads/arxiv.zip -d .
 mkdir data
-python ./src/preprocess/dataset_process.py
 mkdir index
 python ./src/preprocess/build_index.py
 ```
@@ -151,14 +151,14 @@ MiniCPM4-Survey是由[THUNLP](https://nlp.csai.tsinghua.edu.cn)、中国人民
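 We recommend using [MiniCPM-Embedding-Light](https://huggingface.co/openbmb/MiniCPM-Embedding-Light) as the embedding model, placed in `model/MiniCPM-Embedding-Light`.
 ### Prepare the environment
-Download the paper data from Kaggle, then extract it. Run `python dataset_process.py` to process the data and generate the retrieval database. Then run `python build_index.py` to build the retrieval database.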
 ``` bash
 cd ./code
 curl -L -o ~/Downloads/arxiv.zip\
    https://www.kaggle.com/api/v1/datasets/download/Cornell-University/arxiv
 unzip ~/Downloads/arxiv.zip -d .
 mkdir data
-python ./src/preprocess/dataset_process.py
 mkdir index
 python ./src/preprocess/build_index.py
 ```

 We recommend using [MiniCPM-Embedding-Light](https://huggingface.co/openbmb/MiniCPM-Embedding-Light) as the embedding model, which can be downloaded from Hugging Face and placed in `model/MiniCPM-Embedding-Light`.
 ### Prepare the environment
+You can download the [paper data](https://www.kaggle.com/datasets/Cornell-University/arxiv) from Kaggle, then extract it. You can run `python data_process.py` to process the data and generate the retrieval database. Then you can run `python build_index.py` to build the retrieval database.
 ```
 cd ./code
 curl -L -o ~/Downloads/arxiv.zip\
    https://www.kaggle.com/api/v1/datasets/download/Cornell-University/arxiv
 unzip ~/Downloads/arxiv.zip -d .
 mkdir data
+python ./src/preprocess/data_process.py
 mkdir index
 python ./src/preprocess/build_index.py
 ```
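 We recommend using [MiniCPM-Embedding-Light](https://huggingface.co/openbmb/MiniCPM-Embedding-Light) as the embedding model, placed in `model/MiniCPM-Embedding-Light`.
 ### Prepare the environment
+Download the paper data from Kaggle, then extract it. Run `python data_process.py` to process the data and generate the retrieval database. Then run `python build_index.py` to build the retrieval database.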
 ``` bash
 cd ./code
 curl -L -o ~/Downloads/arxiv.zip\
    https://www.kaggle.com/api/v1/datasets/download/Cornell-University/arxiv
 unzip ~/Downloads/arxiv.zip -d .
 mkdir data
+python ./src/preprocess/data_process.py
 mkdir index
 python ./src/preprocess/build_index.py
 ```
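
Plain `mkdir data` and `mkdir index` exit with an error when the directory already exists, so re-running the steps above can stop partway. A minimal sketch of an idempotent variant follows; the function name `prepare_dirs` and its optional root argument are hypothetical and not part of the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): create the working
# directories used by the preprocessing steps. `mkdir -p` succeeds even
# when the directories already exist, so the steps can be re-run safely.
prepare_dirs() {
    # $1 is the directory that should hold data/ and index/;
    # it defaults to the current directory.
    mkdir -p "${1:-.}/data" "${1:-.}/index"
}
```

Under these assumptions, calling `prepare_dirs` from `./code` before `python ./src/preprocess/data_process.py` and `python ./src/preprocess/build_index.py` would stand in for the two `mkdir` lines.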

code/requirements.txt ADDED

	@@ -0,0 +1,8 @@

+openai
+vllm
+jsonlines
+faiss-cpu
+# faiss-gpu
+fastapi
+uvicorn
+yarl
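
The added file keeps `faiss-cpu` active and leaves `faiss-gpu` commented out. As a quick way to see which lines are active before installing, the sketch below prints the non-blank, non-comment lines of a requirements file. The function name `active_requirements` is hypothetical, and the usage lines assume the commands are run from the `code` directory.

```bash
# Hypothetical helper (not part of the repository): print the lines of a
# requirements file that are neither blank nor full-line `#` comments.
active_requirements() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Assumed usage from the ./code directory:
#   active_requirements requirements.txt
#   python -m pip install -r requirements.txt
```

For the file above this would list seven packages and omit `faiss-gpu`.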