---
license: apache-2.0
---
# Model Card for MatroidNN

## Model Details

### Model Description

**Model type:** Neural Network with Matroid-based Feature Selection (MatroidNN)

**Version:** 1.0

**Framework:** PyTorch

**Last updated:** February 27, 2025

### Overview

MatroidNN is a neural network architecture that uses matroid theory for feature selection. It addresses feature redundancy by selecting a maximal independent set of features before the neural network is trained.
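
A minimal sketch of how a correlation-based greedy selection could work, assuming two features count as dependent when their absolute Pearson correlation exceeds a threshold. The function name `greedy_independent_features` and its `threshold` parameter are hypothetical; this is not the `MatroidFeatureSelector` implementation, and pairwise correlation is only a heuristic stand-in for a formal matroid independence test.

```python
import numpy as np

def greedy_independent_features(X, threshold=0.7):
    """Return indices of features kept by a greedy correlation filter (illustrative only)."""
    # Absolute Pearson correlation between every pair of feature columns.
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = []
    for j in range(X.shape[1]):
        # Keep feature j only if it is not strongly correlated with any kept feature.
        if all(corr[j, k] <= threshold for k in selected):
            selected.append(j)
    return selected

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
# Column 3 is a near-copy of column 0, so the filter should drop it.
X = np.column_stack([base, base[:, 0] + 0.01 * rng.normal(size=500)])
kept = greedy_independent_features(X, threshold=0.7)
print(kept)
```

Because every rejected feature is strongly correlated with at least one kept feature, the returned set cannot be extended without violating the threshold, which is the sense in which it is maximal.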

### Model Architecture

- **Feature Selection Component**: MatroidFeatureSelector using correlation-based dependency analysis
- **Neural Network**: 3-layer feedforward network with batch normalization and dropout
- **Input**: Varies based on the number of features selected by the matroid selector
- **Hidden Layers**: Configurable hidden layer sizes (default 64 → 32)
- **Output**: Multi-class classification (configurable number of classes)
- **Parameters**: ~5K-10K parameters (varies based on input/output dimensions)
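
The architecture bullets above could be realized roughly as follows. This is a sketch under assumptions, not the released `MatroidNN` class: the class name `FeedForwardSketch`, the layer ordering (Linear, BatchNorm, ReLU, Dropout), and the dropout rate of 0.3 are illustrative choices that this card does not specify.

```python
import torch
from torch import nn

class FeedForwardSketch(nn.Module):
    """Three linear layers (input -> 64 -> 32 -> classes) with batch norm and dropout."""

    def __init__(self, input_size, num_classes, hidden_sizes=(64, 32), dropout=0.3):
        super().__init__()
        h1, h2 = hidden_sizes
        self.net = nn.Sequential(
            nn.Linear(input_size, h1), nn.BatchNorm1d(h1), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(h1, h2), nn.BatchNorm1d(h2), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(h2, num_classes),  # raw logits; CrossEntropyLoss applies log-softmax
        )

    def forward(self, x):
        return self.net(x)

# input_size=12 stands in for however many features the selector keeps.
model = FeedForwardSketch(input_size=12, num_classes=3)
model.eval()  # disable dropout and use running batch-norm statistics
with torch.no_grad():
    logits = model(torch.randn(4, 12))
print(logits.shape)
```

The input size is left as a constructor argument because it is only known after feature selection has run.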

## Uses

### Direct Use

MatroidNN is designed for classification tasks where feature redundancy is a potential issue. It's particularly useful for:

- High-dimensional datasets with correlated features
- Feature selection in biological/medical data
- Financial prediction with multicollinear variables
- Any classification task where feature independence is desired

### Out-of-Scope Use

This model is not intended for:
- Regression tasks (without modification)
- Time series prediction (without temporal adaptations)
- Raw image or text classification (without appropriate feature extraction)

## Training Data

The model was developed and tested using synthetic data with deliberate feature dependencies. For real-world applications, the model should be retrained on domain-specific data.

### Training Dataset

- **Type**: Synthetic data with controlled dependencies
- **Size**: 1000 samples (default), configurable
- **Features**: 20 initial features (default), configurable
- **Classes**: 3 classes (default), configurable
- **Distribution**: Equal class distribution in the synthetic data
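
For illustration, data of this shape could be generated along the following lines. The generator is an assumption, not the script used for this card: `make_dependent_data`, the split into 8 independent and 12 derived features, and the noise scale are all hypothetical.

```python
import numpy as np

def make_dependent_data(n_samples=1000, n_independent=8, n_dependent=12,
                        n_classes=3, noise=0.05, seed=0):
    """Balanced synthetic classification data with linearly dependent features."""
    rng = np.random.default_rng(seed)
    # Cycle through the class labels so each class gets an (almost) equal share.
    y = np.arange(n_samples) % n_classes
    # Independent features: class-specific means plus unit Gaussian noise.
    class_means = rng.normal(scale=2.0, size=(n_classes, n_independent))
    X_ind = class_means[y] + rng.normal(size=(n_samples, n_independent))
    # Dependent features: random linear mixes of the independent ones plus small noise.
    mix = rng.normal(size=(n_independent, n_dependent))
    X_dep = X_ind @ mix + noise * rng.normal(size=(n_samples, n_dependent))
    return np.hstack([X_ind, X_dep]), y

X, y = make_dependent_data()
print(X.shape, np.bincount(y))
```

With 1000 samples and 3 classes the split cannot be exactly equal, so one class receives a single extra sample.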

## Performance

### Metrics

On synthetic test data with 3 classes:
- **Accuracy**: 94.0%
- **Macro-average F1-score**: 0.93
- **Per-class metrics**:
  - Class 0: Precision 0.96, Recall 1.00, F1 0.98
  - Class 1: Precision 0.86, Recall 0.86, F1 0.86
  - Class 2: Precision 0.97, Recall 0.93, F1 0.95
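
As a consistency check on the figures above, each F1 score is the harmonic mean of precision and recall, and the macro average is the unweighted mean of the per-class F1 scores. The snippet recomputes both from the rounded precision and recall values in the list, so small rounding differences are possible.

```python
# Per-class (precision, recall) as reported in the metrics list.
reported = {0: (0.96, 1.00), 1: (0.86, 0.86), 2: (0.97, 0.93)}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

per_class_f1 = {c: f1_score(p, r) for c, (p, r) in reported.items()}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

for c, f1 in per_class_f1.items():
    print(f"Class {c}: F1 = {f1:.2f}")
print(f"Macro F1 = {macro_f1:.2f}")
```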

### Factors

Performance may vary based on:
- Feature correlation structure in the dataset
- Number of initial features and their information content
- Class distribution balance
- Rank threshold parameter in the MatroidFeatureSelector

## Limitations

- The matroid-based feature selection uses correlation as a proxy for independence, which may not capture all forms of dependency
- The current implementation assumes numerical features and may require adaptation for categorical features
- Feature selection is performed once before training and does not adapt during training
- The rank threshold parameter requires careful tuning based on the dataset
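
The first limitation can be seen in a small example: a feature that is a deterministic function of another can still have near-zero Pearson correlation with it. The sketch below uses plain Python, and `pearson` is a helper written for this illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [i / 10 for i in range(-10, 11)]  # symmetric around zero
ys = [x ** 2 for x in xs]              # fully determined by xs

r = pearson(xs, ys)
print(f"correlation = {r:.6f}")  # close to zero despite total dependence
```

A correlation-based selector would treat these two features as independent and keep both, even though the second carries no information beyond the first.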

## Ethical Considerations

- Feature selection might unintentionally exclude features that are important for fairness considerations
- The model inherits any biases present in the training data
- In high-stakes applications, results should be interpreted with caution and under human oversight

## Technical Specifications

### Hardware Requirements

- Training: CUDA-capable GPU recommended for larger datasets
- Inference: CPU sufficient for most applications

### Software Requirements

- Python 3.8+
- PyTorch 1.8+
- NumPy 1.20+
- scikit-learn 0.24+

### Training Hyperparameters

- **Batch size**: 32 (default)
- **Learning rate**: 0.001 (default)
- **Optimizer**: Adam
- **Loss function**: Cross-Entropy Loss
- **Epochs**: Early stopping based on validation loss (patience=10)
- **Feature selection rank threshold**: 0.7 (default, configurable)
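
The early-stopping rule in the list above (stop once validation loss has not improved for 10 consecutive epochs) can be sketched independently of the training loop. `EarlyStopping` and its `step` method are hypothetical names for this illustration, and the loss values are made up to exercise the logic.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without a new best validation loss."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

# Made-up losses: improvement for 5 epochs, then a plateau.
val_losses = [1.0, 0.8, 0.6, 0.5, 0.45] + [0.5] * 20

stopper = EarlyStopping(patience=10)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_loss)
```

In a real training loop, `step` would be called once per epoch with the measured validation loss, and the weights from the best epoch would typically be saved for later restoration.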

## How to Use

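```python
from matroid_nn import MatroidFeatureSelector, MatroidNN

# X_train, X_test: 2-D numeric feature arrays; num_classes: number of target classes.
# These are assumed to be defined by the caller.

# Initialize the feature selector
selector = MatroidFeatureSelector(rank_threshold=0.7)

# Fit the selector on the training split only, then apply it to both splits
X_train_selected = selector.fit_transform(X_train)
X_test_selected = selector.transform(X_test)

# Create the model; the input size is the number of selected features
model = MatroidNN(
    input_size=X_train_selected.shape[1],
    hidden_size=64,
    output_size=num_classes
)
```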