Victor Geneste committed on
Commit
2d18c85
·
unverified ·
1 Parent(s): 40e9eac

Update README.md

Files changed (1)
  1. README.md +127 -12
README.md CHANGED
@@ -1,18 +1,8 @@
1
- ---
2
- title: EU Explorer (MDA Assignment)
3
- emoji: 🤖
4
- colorFrom: purple
5
- colorTo: indigo
6
- sdk: docker
7
- app_port: 4444
8
- pinned: false
9
- ---
10
-
11
  # Interactive Retrieval-Augmented Generation for Semantic Exploration of Horizon Europe Research Data
12
 
13
- **A Cloud-Native Web Application for Multilingual Question Answering and Research Trend Analysis**
14
 
15
- This project presents a scalable, cloud-native system that leverages Retrieval-Augmented Generation (RAG) to provide semantic access to the Horizon Europe research project database (CORDIS). Combining dense and sparse retrieval methods with advanced multilingual language models, the system enables users to ask natural language questions and receive document-grounded answers, complete with citations.
16
 
17
  The backend, built using FastAPI and integrated with tools like FAISS, Whoosh, and LangChain, supports both semantic and keyword search, hybrid retrieval, and re-ranking. A user-facing web application and chatbot interface make the system interactive and intuitive, allowing researchers, policymakers, and the public to explore EU-funded research projects in an intelligent, multilingual, and conversational manner.
18
 
@@ -27,6 +17,131 @@ The backend, built using FastAPI and integrated with tools like FAISS, Whoosh, a
27
  - [Usage](#usage)
28
  - [Web Application](#web-application)
29
  - [API Endpoints](#api-endpoints)
 
30
  - [Retrieval-Augmented Generation Pipeline](#retrieval-augmented-generation-pipeline)
31
  - [Limitations and Future Work](#limitations-and-future-work)
32
1
  # Interactive Retrieval-Augmented Generation for Semantic Exploration of Horizon Europe Research Data
2
 
3
+ **A Web Application for Question Answering and Research Trend Analysis**
4
 
5
+ This project presents a scalable system that leverages Retrieval-Augmented Generation (RAG) to provide semantic access to the Horizon Europe research project database (CORDIS). Combining dense and sparse retrieval methods with advanced multilingual language models, the system enables users to ask natural language questions and receive document-grounded answers, complete with citations.
6
 
7
  The backend, built using FastAPI and integrated with tools like FAISS, Whoosh, and LangChain, supports both semantic and keyword search, hybrid retrieval, and re-ranking. A user-facing web application and chatbot interface make the system interactive and intuitive, allowing researchers, policymakers, and the public to explore EU-funded research projects in an intelligent, multilingual, and conversational manner.
8
 
 
17
  - [Usage](#usage)
18
  - [Web Application](#web-application)
19
  - [API Endpoints](#api-endpoints)
20
+ - [Predictive Modelling](#predictive-modelling)
21
  - [Retrieval-Augmented Generation Pipeline](#retrieval-augmented-generation-pipeline)
22
  - [Limitations and Future Work](#limitations-and-future-work)
23
 
24
+ ---
25
+
26
+ ## Overview
27
+
28
+ This repository contains an application for semantic exploration and trend analysis of the Horizon Europe research dataset. It enables:
29
+
30
+ - Multilingual question answering over the CORDIS database.
31
+ - Research trend analysis.
32
+ - Document-grounded answers with citations.
33
+ - Both semantic and keyword search.
34
+
35
+ ## Dataset: Horizon Europe Projects
36
+
37
+ The system is built around data from the Horizon Europe research program (CORDIS), including metadata and deliverables for EU-funded projects. Data processing scripts and notebooks are provided for cleaning the CSV datasets and converting them to more efficient formats (e.g., Parquet).
38
+
39
+ ## Features
40
+
41
+ - **Retrieval-Augmented Generation (RAG):** Combines dense and sparse retrieval for robust search.
42
+ - **Multilingual Support:** Uses advanced language models for question answering in multiple languages.
43
+ - **Hybrid Search:** Supports semantic (vector-based) and keyword (Whoosh) retrieval, including hybrid and re-ranking.
44
+ - **Web Interface & Chatbot:** Intuitive UI for interactive exploration.
45
+ - **API Access:** RESTful endpoints for programmatic access.
46
+
47
+ ## System Architecture
48
+
49
+ - **Backend:** FastAPI-based, integrating FAISS (vector search), Whoosh (keyword search), and LangChain (RAG pipeline).
50
+ - **Frontend:** Web app and chatbot for user queries and result visualization.
51
+ - **Data Pipeline:** Notebooks and scripts for ingesting, cleaning, and transforming CORDIS data.
52
+
53
+ ## Technologies Used
54
+
55
+ - FastAPI
56
+ - FAISS
57
+ - Whoosh
58
+ - LangChain
59
+ - Polars
60
+ - Python 3.10+
61
+ - Docker, Cloud deployment tools
62
+
63
+ ## Installation
64
+
65
+ 1. Clone the repository:
66
+ ```bash
67
+ git clone https://github.com/Romainkul/MDA.git
68
+ cd MDA
69
+ ```
70
+
71
+ 2. (Optional) Create a virtual environment:
72
+ ```bash
73
+ python -m venv venv
74
+ source venv/bin/activate # On Windows use venv\Scripts\activate
75
+ ```
76
+
77
+ 3. Install dependencies:
78
+ ```bash
79
+ pip install -r requirements.txt
80
+ ```
81
+
82
+ 4. Prepare datasets:
83
+ - Place Horizon Europe (CORDIS) CSV files in the appropriate data directory.
84
+ - Run provided Jupyter notebooks/scripts (e.g., `DataExploration.ipynb`) to clean and convert data.
85
+
86
+ 5. Start the backend API:
87
+ ```bash
88
+ cd backend
89
+ uvicorn main:app --host ::1 --reload
90
+ ```
91
+
92
+ 6. Run the frontend (in a separate terminal, from the repository root):
93
+ ```bash
94
+ cd frontend
95
+ npm run dev
96
+ ```
97
+ ## Usage
98
+
99
+ ### Web Application
100
+
101
+ - Start the backend API as above.
102
+ - Access the web UI via your browser at `http://localhost:8000`.
103
+
104
+ ### API Endpoints
105
+
106
+ - Documentation is available at `http://localhost:8000/docs` (FastAPI Swagger UI).
107
+ - Example endpoints include:
108
+   - `/api/rag`
109
+   - `/api/projects`
110
+   - `/api/filters`
111
+   - `/api/project/{id}/organizations`
112
+   - `/api/stats`
113
+
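As a rough illustration of programmatic access, the sketch below builds request URLs for the endpoints listed above using only the Python standard library. It assumes the endpoints answer `GET` requests with JSON, which this README does not confirm; the `page` query parameter and the helper names `build_url` and `get_json` are hypothetical. Check the Swagger UI for the actual request shapes.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # address used elsewhere in this README


def build_url(path, params=None, base_url=BASE_URL):
    """Join the base URL and an endpoint path, adding optional query parameters."""
    url = urljoin(base_url, path)
    if params:
        url += "?" + urlencode(params)
    return url


def get_json(path, params=None, timeout=10.0):
    """Send a GET request and decode the JSON body (needs a running backend).

    Assumption: the endpoint accepts GET and returns JSON.
    """
    with urlopen(build_url(path, params), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


stats_url = build_url("/api/stats")
# "page" is a hypothetical query parameter, used only to show the encoding.
projects_url = build_url("/api/projects", {"page": 1})
print(stats_url)
print(projects_url)
```

With the backend running, `get_json("/api/stats")` would fetch and decode the response, provided the assumptions above hold.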
114
+ ## Predictive Modelling
115
+
116
+ The predictive modelling script provides an end-to-end pipeline for predicting project status in the MDA project. It features:
117
+
118
+ - **Data Preparation**: Cleans and engineers features, including handling multi-label and text fields.
119
+ - **Text Embedding**: Uses Sentence Transformers with SVD for dimensionality reduction.
120
+ - **ML Pipeline**: Builds a scikit-learn pipeline with preprocessing, anomaly detection, resampling, feature selection, and model calibration.
121
+ - **Model Training & Tuning**: Supports Optuna-based hyperparameter optimization.
122
+ - **Evaluation & Explanation**: Outputs classification metrics, SHAP explanations, and monitors data drift using Evidently.
123
+ - **Scoring**: Loads saved models to predict and explain results on new data.
124
+
125
+ Run the script to train the model, evaluate it, save artifacts, and score incoming data.
126
+
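The core of such a pipeline can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the project's script: the data is synthetic, the logistic regression classifier is a stand-in, the name `build_status_pipeline` is hypothetical, and the anomaly detection, resampling, feature selection, tuning, and explanation steps are left out.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_status_pipeline(n_components=8, seed=0):
    """Scale features, reduce them with SVD, then fit a calibrated classifier."""
    return Pipeline(
        steps=[
            ("scale", StandardScaler()),
            ("svd", TruncatedSVD(n_components=n_components, random_state=seed)),
            # Calibration wraps the classifier so predict_proba is better behaved.
            ("clf", CalibratedClassifierCV(LogisticRegression(max_iter=1000), cv=3)),
        ]
    )


# Synthetic stand-in for text embeddings: 200 rows of 32-dimensional vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
# Hypothetical binary status label that depends on the first two dimensions.
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = build_status_pipeline()
model.fit(X[:150], y[:150])
probabilities = model.predict_proba(X[150:])
print(probabilities.shape)
```

Keeping every step inside one `Pipeline` means the scaler and SVD are fitted only on training data, which avoids leaking information from the held-out rows.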
127
+ ## Retrieval-Augmented Generation Pipeline
128
+
129
+ - **Data Ingestion:** Clean and preprocess CORDIS project and deliverable datasets.
130
+ - **Indexing:** Build FAISS (dense) and Whoosh (sparse) indexes.
131
+ - **Hybrid Retrieval:** Combine results from both indexes, optionally re-rank.
132
+ - **Generation:** Use a multilingual language model to generate grounded answers with citations.
133
+
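The hybrid retrieval step can be illustrated with reciprocal rank fusion (RRF), one common way to merge a dense and a sparse ranking. This README does not state which fusion method the project uses, so RRF is an assumption here; the function name, the constant `k=60`, and the project ids are illustrative only.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse ranked lists of document ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (highest first); break ties by id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


# Hypothetical project ids, as they might come back from each index.
dense_hits = ["p3", "p1", "p7", "p2"]   # e.g. from the FAISS index
sparse_hits = ["p1", "p9", "p3"]        # e.g. from the Whoosh index

fused = reciprocal_rank_fusion([dense_hits, sparse_hits], top_n=3)
print(fused)  # ['p1', 'p3', 'p9']
```

Documents found by both indexes (`p1`, `p3`) rise to the top because their reciprocal ranks add up. RRF uses only rank positions, so the dense and sparse scores need not be on comparable scales.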
134
+ ## Limitations and Future Work
135
+
136
+ - The current language model and retrieval performance leave room for improvement.
137
+ - Predictive modelling can be improved.
138
+ - UI/UX enhancements are planned.
139
+ - Additional analytics and trend visualizations are under development.
140
+ - Support for more languages and larger datasets is a future goal.
141
+
142
+ ---
143
+
144
+ ## Acknowledgements
145
+
146
+ - European Union Open Data Portal (CORDIS)
147
+ - Open-source contributors and projects (FastAPI, FAISS, Whoosh, LangChain, Polars)