---
language: en
license: apache-2.0
tags:
- conversational
- dialogue-generation
- rotary-embedding
- hrom
- HROM-V1.5
- swiglu
pipeline_tag: text2text-generation
base_model:
- TimurHromek/HROM-V1
---

# HROM-V1.5: Hybrid Rotary-Optimized Model

## Architectural Overview

HROM-V1.5 implements several key innovations in transformer architecture design:

### Core Components

1. **Rotary Position Embeddings (RoPE)**
   - Position-aware attention without absolute position embeddings
   - Relative position encoding via rotation matrices
   - Stable gradient propagation for long sequences
   - Dynamic sequence length handling (0 to 512 tokens)

2. **SwiGLU Activation**
   - Gated linear unit variant in the feed-forward network (GELU-gated in this implementation)
   - FFN hidden dimension scaled to 2/3 of the conventional 4×d_model (2048 vs. 3072), offsetting the extra gate projection
   - Improved gradient flow compared to ReLU/GELU
   - Gating formula as implemented: `x * GELU(gate)`, where `x` and `gate` are the two halves of the FFN projection

3. **Attention Mechanism**
   - 8 attention heads, each of dimension 96
   - Combined causal and padding mask support (a masking sketch follows this list)
   - Scaled dot-product attention with 1/√d_k normalization
   - Attention dropout (p = 0.1)
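
As a reference for item 3, here is a minimal sketch of scaled dot-product attention with a combined causal and key-padding mask. The function name `masked_attention` and the mask convention (`True` marks padding) are illustrative assumptions, not the model's actual code.

```python
import math
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, padding_mask=None, dropout_p=0.1):
    # q, k, v: (batch, heads, seq_len, head_dim); head_dim = 96 in HROM-V1.5
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)

    # Causal mask: position i may only attend to positions j <= i
    seq_len = q.size(-2)
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                   device=q.device), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))

    # Key-padding mask: True marks padding positions (assumed convention)
    if padding_mask is not None:
        scores = scores.masked_fill(padding_mask[:, None, None, :], float("-inf"))

    weights = torch.softmax(scores, dim=-1)
    weights = F.dropout(weights, p=dropout_p)
    return torch.matmul(weights, v)
```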

### Model Specifications

| Component          | Specification                          |
|--------------------|----------------------------------------|
| Layers             | 8                                      |
| Hidden Dimension   | 768                                    |
| FFN Dimension      | 2048 (SwiGLU-activated)                |
| Attention Heads    | 8                                      |
| Head Dimension     | 96                                     |
| Vocabulary Size    | 32,000                                 |
| Max Sequence Length| 512 tokens                             |
| Dropout Rate       | 0.1                                    |
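
For convenience, the table can be expressed as a small configuration object. The class and field names below (`HROMConfig`, `d_model`, etc.) are illustrative and do not necessarily match the repository's attribute names.

```python
from dataclasses import dataclass

@dataclass
class HROMConfig:          # hypothetical name, mirrors the table above
    n_layers: int = 8
    d_model: int = 768
    d_ff: int = 2048       # SwiGLU-activated FFN dimension
    n_heads: int = 8
    head_dim: int = 96
    vocab_size: int = 32_000
    max_seq_len: int = 512
    dropout: float = 0.1
```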

## Training Configuration

### Dataset Composition
- **DailyDialog**: 11k conversational samples
- **EmpatheticDialogues**: 18k emotionally rich exchanges
- **BlendedSkillTalk**: 5k multi-skill interactions
- **Persona-Chat**: 18k personality-driven dialogues

### Optimization Parameters
- **Batch Size**: 16 (effective 128 via 8-step gradient accumulation)
- **Learning Rate**: 2e-5 with linear warmup (1k steps)
- **Optimizer**: AdamW (β1=0.9, β2=0.95)
- **Weight Decay**: 0.1
- **Epochs**: 30
- **Gradient Clipping**: 1.0
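
A minimal PyTorch sketch of this setup, assuming the learning rate stays constant after the 1k-step linear warmup; `model`, `train_loader`, and `compute_loss` are hypothetical placeholders, and the actual training script may differ.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

optimizer = AdamW(model.parameters(), lr=2e-5,
                  betas=(0.9, 0.95), weight_decay=0.1)

warmup_steps = 1_000  # linear warmup, then constant (assumed)
scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))

accum_steps = 8  # micro-batch 16 -> effective batch 128
optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    loss = compute_loss(model, batch) / accum_steps  # hypothetical loss helper
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```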

## Technical Implementation

### Position Encoding
```python
import torch
import torch.nn as nn

class RotaryEmbedding(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Inverse frequencies for each even dimension index (dim/2 values)
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, seq_len):
        # Edge case: an empty sequence yields an empty frequency table
        if seq_len == 0:
            return torch.empty((0, self.inv_freq.shape[0] * 2),
                               device=self.inv_freq.device, dtype=self.inv_freq.dtype)
        # Outer product of positions and inverse frequencies: (seq_len, dim/2)
        t = torch.arange(seq_len, device=self.inv_freq.device).type_as(self.inv_freq)
        freqs = torch.einsum("i, j -> i j", t, self.inv_freq)
        # Duplicate along the last dimension to cover the full head size: (seq_len, dim)
        return torch.cat((freqs, freqs), dim=-1)
```
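
The frequency table above is typically consumed by a rotate-half application step. The sketch below follows the common convention that pairs `cat((freqs, freqs))` with a `rotate_half` helper; the helper names and tensor shapes are assumptions, not necessarily the repository's exact API.

```python
import torch

def rotate_half(x):
    # Split the last dimension in half and rotate: (x1, x2) -> (-x2, x1)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, freqs):
    # freqs: (seq_len, head_dim) as returned by RotaryEmbedding.forward(seq_len)
    cos, sin = freqs.cos(), freqs.sin()
    q_rot = (q * cos) + (rotate_half(q) * sin)
    k_rot = (k * cos) + (rotate_half(k) * sin)
    return q_rot, k_rot

# Example with HROM-V1.5 head dimensions; shapes assumed (batch, heads, seq, head_dim)
rope = RotaryEmbedding(dim=96)
q = torch.randn(2, 8, 16, 96)
k = torch.randn(2, 8, 16, 96)
q, k = apply_rotary_pos_emb(q, k, rope(16))
```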

### SwiGLU Implementation
```python
import torch.nn as nn

class SwiGLU(nn.Module):
    def forward(self, x):
        # Split the projected features into a value half and a gate half
        x, gate = x.chunk(2, dim=-1)
        # GELU-gated linear unit, matching the formula in the overview
        return x * nn.functional.gelu(gate)
```
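
As a usage sketch, the SwiGLU module sits between two linear projections sized per the specification table (768 → 2×2048 → gated 2048 → 768). The `nn.Sequential` layout below is illustrative; the repository's module names may differ.

```python
import torch
import torch.nn as nn

ffn = nn.Sequential(
    nn.Linear(768, 2 * 2048),  # project to concatenated value and gate halves
    SwiGLU(),                  # gate -> (..., 2048)
    nn.Linear(2048, 768),      # project back to the model dimension
)

x = torch.randn(4, 512, 768)
print(ffn(x).shape)  # torch.Size([4, 512, 768])
```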

## License

Apache License 2.0  
Copyright 2025 Timur Hromek

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.