File size: 6,379 Bytes
1d0f7f6
 
 
 
 
 
 
 
 
 
 
79cfa91
 
2d18c85
79cfa91
2d18c85
79cfa91
 
 
 
 
 
 
 
 
 
 
 
 
 
2d18c85
79cfa91
 
 
2d18c85
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7972e5a
 
 
 
 
 
2d18c85
 
 
 
 
 
7972e5a
2d18c85
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7972e5a
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
---
title: EU Explorer (MDA Assignment)   
emoji: 🤖                         
colorFrom: purple                 
colorTo: indigo                   
sdk: docker                       
app_port: 4444                    
pinned: false
---
Hugginface spaces setup

# Interactive Retrieval-Augmented Generation for Semantic Exploration of Horizon Europe Research Data

**A Web Application for Question Answering and Research Trend Analysis**

This project presents a scalable system that leverages Retrieval-Augmented Generation (RAG) to provide semantic access to the Horizon Europe research project database (CORDIS). Combining dense and sparse retrieval methods with advanced multilingual language models, the system enables users to ask natural language questions and receive document-grounded answers, complete with citations.

The backend, built using FastAPI and integrated with tools like FAISS, Whoosh, and LangChain, supports both semantic and keyword search, hybrid retrieval, and re-ranking. A user-facing web application and chatbot interface make the system interactive and intuitive, allowing researchers, policymakers, and the public to explore EU-funded research projects in an intelligent, multilingual, and conversational manner.

## Table of Contents

- [Overview](#overview)
- [Dataset: Horizon Europe Projects](#dataset-horizon-europe-projects)
- [Features](#features)
- [System Architecture](#system-architecture)
- [Technologies Used](#technologies-used)
- [Installation](#installation)
- [Usage](#usage)
  - [Web Application](#web-application)
  - [API Endpoints](#api-endpoints)
- [Predictive Modelling](#predictive-modelling)
- [Retrieval-Augmented Generation Pipeline](#retrieval-augmented-generation-pipeline)
- [Limitations and Future Work](#limitations-and-future-work)

---

## Overview

This repository contains an application for semantic exploration and trend analysis of the Horizon Europe research dataset. It enables:

- Multilingual question answering over the CORDIS database.
- Research trend analysis.
- Document-grounded answers with citations.
- Both semantic and keyword search.

## Dataset: Horizon Europe Projects

The system is built around data from the Horizon Europe research program (CORDIS), including metadata and deliverables for EU-funded projects. Data processing scripts and notebooks are provided for cleaning and transforming the CSV datasets into efficient formats (e.g., parquet).

## Features

- **Retrieval-Augmented Generation (RAG):** Combines dense and sparse retrieval for robust search.
- **Multilingual Support:** Uses advanced language models for question answering in multiple languages.
- **Hybrid Search:** Supports semantic (vector-based) and keyword (Whoosh) retrieval, including hybrid and re-ranking.
- **Web Interface & Chatbot:** Intuitive UI for interactive exploration.
- **API Access:** RESTful endpoints for programmatic access.

## System Architecture

- **Backend:** FastAPI-based, integrating FAISS (vector search), Whoosh (keyword search), and LangChain (RAG pipeline).
- **Frontend:** Web app and chatbot for user queries and result visualization.
- **Data Pipeline:** Notebooks and scripts for ingesting, cleaning, and transforming CORDIS data.

## Technologies Used

- FastAPI
- FAISS
- Whoosh
- LangChain
- Polars
- Python 3.10+
- Docker, Cloud deployment tools

## Installation

1. Clone the repository:
    ```bash
    git clone https://github.com/Romainkul/MDA.git
    cd MDA
    ```

2. (Optional) Create a virtual environment:
    ```bash
    python -m venv venv
    source venv/bin/activate  # On Windows use venv\Scripts\activate
    ```

3. Install dependencies:
    ```bash
    pip install -r requirements.txt
    ```

4. Prepare datasets:
    - Place Horizon Europe (CORDIS) CSV files in the appropriate data directory.
    - Run provided Jupyter notebooks/scripts (e.g., `DataExploration.ipynb`) to clean and convert data.

5. Start the backend API:
    ```bash
    cd backend
    uvicorn main.app --host ::1 --reload
    ```

6. Run the frontend:
    ```bash
    cd frontend
    npm run dev
    ```

7. Alternatively it can be launched as a Docker Image:
```bash
docker build -t mda_eu_project:latest .
```

## Usage

### Web Application

- Start the backend API as above.
- Access the web UI via your browser at `http://localhost:8000`.
- With Docker, the 8000 becomes 4444/api due to the reverse proxy.

### API Endpoints

- Documentation is available at `http://localhost:8000/docs` (FastAPI Swagger UI).
- Example endpoints include:
    - `/api/rag`
    - `/api/projects`
    - `/api/filters`
    - `/api/project/id/organizations`
    - `/api/stats`
 
## Predictive Modelling

This script provides an end-to-end pipeline for status prediction in the MDA project. It features:

- **Data Preparation**: Cleans and engineers features, including handling multi-label and text fields.
- **Text Embedding**: Uses Sentence Transformers with SVD for dimensionality reduction.
- **ML Pipeline**: Builds a scikit-learn pipeline with preprocessing, anomaly detection, resampling, feature selection, and model calibration.
- **Model Training & Tuning**: Supports Optuna-based hyperparameter optimization.
- **Evaluation & Explanation**: Outputs classification metrics, SHAP explanations, and monitors data drift using Evidently.
- **Scoring**: Loads saved models to predict and explain results on new data.
  
Run the script to train the model, evaluate it, save artifacts, and score incoming data.

## Retrieval-Augmented Generation Pipeline

- **Data Ingestion:** Clean and preprocess CORDIS project and deliverable datasets.
- **Indexing:** Build FAISS (dense) and Whoosh (sparse) indexes.
- **Hybrid Retrieval:** Combine results from both indexes, optionally re-rank.
- **Generation:** Use a multilingual language model to generate grounded answers with citations.

## Limitations and Future Work

- Current language model and retrieval performance may be improved.
- Improve predictive modelling
- UI/UX enhancements planned.
- Additional analytics and trend visualizations under development.
- Support for more languages and larger datasets.

---

## Acknowledgements

- European Union Open Data Portal (CORDIS)
- Open-source contributors and projects (FastAPI, FAISS, Whoosh, LangChain, Polars)
- Course and teachers of Modern Data Analytics which/who made this project possible