gpriday committed on
Commit
81ac0b7
·
verified ·
1 Parent(s): f956a91

Upload folder using huggingface_hub

.ipynb_checkpoints/README-checkpoint.md ADDED
@@ -0,0 +1,202 @@
+ ---
+ language: en
+ license: apache-2.0
+ tags:
+ - text-generation
+ - domain-names
+ - reformer
+ - character-level
+ datasets:
+ - custom
+ metrics:
+ - loss
+ model-index:
+ - name: domain-generator-reformer
+   results:
+   - task:
+       type: text-generation
+       name: Domain Name Generation
+     metrics:
+     - type: loss
+       value: 0.9716
+       name: Validation Loss
+ ---
24
+
25
+ # Domain Name Generator - Reformer Character-Level Model
26
+
27
+ A character-level Reformer model trained to generate domain names based on descriptive tags. The model takes a set of content and style tags as input and generates appropriate, creative domain names.
28
+
29
+ ## Model Description
30
+
31
+ This model is a fine-tuned version of `google/reformer-enwik8` specifically adapted for domain name generation. It uses a pure tag-based approach where both content descriptors (e.g., "tech", "health") and style descriptors (e.g., "modern", "minimal") are treated as equal tags.
32
+
33
+ ### Key Features
34
+ - **Character-level generation**: Generates domains character by character for maximum flexibility
35
+ - **Tag-based prompting**: Uses 3-4 descriptive tags to guide generation
36
+ - **Style-aware**: Understands style tags like "modern", "minimal", "playful"
37
+ - **Position-independent**: Tag order doesn't matter due to training-time shuffling
38
+
39
+ ## Model Details
40
+
41
+ - **Architecture**: Reformer with interleaved local and LSH attention layers (9 local, 3 LSH)
42
+ - **Base Model**: google/reformer-enwik8
43
+ - **Model Size**: ~149M parameters (~595 MB of float32 weights)
44
+ - **Vocabulary Size**: 258 (byte-level encoding)
45
+ - **Max Sequence Length**: 256 characters
46
+ - **Hidden Size**: 1024
47
+ - **Layers**: 12
48
+ - **Attention Heads**: 8
49
+
50
+ ## Training Details
51
+
52
+ ### Training Data
53
+ - **Primary Dataset**: 250k real domains from BrandBucket
54
+ - **Synthetic Dataset**: 1.75M AI-generated domains
55
+ - **Total Examples**: ~2M domains
56
+ - **Data Split**: ~87.5% synthetic, ~12.5% real (by the example counts above)
57
+
58
+ ### Training Configuration
59
+ - **Epochs**: 5
60
+ - **Batch Size**: 256 (128 × 2 gradient accumulation)
61
+ - **Learning Rate**: 5e-05
62
+ - **Tag Dropout**: 10%
63
+ - **Style Tag Probability**: 30%
64
+ - **Hardware**: NVIDIA H100 GPU
65
+ - **Training Time**: 17.6 hours
66
+
67
+ ### Training Results
68
+ - **Final Training Loss**: 1.1113
69
+ - **Best Validation Loss**: 0.9716
70
+ - **Loss Reduction**: 75%
71
+ - **Training Stability**: std=0.0014 (very stable)
72
+
73
+ ## Intended Use
74
+
75
+ ### Primary Use Cases
76
+ - Generate domain names for startups and businesses
77
+ - Brainstorm creative domain ideas based on keywords
78
+ - Explore domain variations with different styles
79
+
80
+ ### Input Format
81
+ ```
82
+ tags: tag1;tag2;tag3 domain:
83
+ ```
84
+
85
+ ### Supported Tags
86
+
87
+ **Content Tags** (examples):
88
+ - `tech`, `ai`, `startup`, `app`, `software`
89
+ - `health`, `wellness`, `fitness`, `medical`
90
+ - `eco`, `green`, `sustainable`, `organic`
91
+ - `fashion`, `beauty`, `style`, `boutique`
92
+ - `food`, `restaurant`, `cafe`, `delivery`
93
+
94
+ **Style Tags**:
95
+ - `modern` - Clean, contemporary
96
+ - `classic` - Traditional, timeless
97
+ - `playful` - Fun, casual
98
+ - `bold` - Strong, impactful
99
+ - `elegant` - Sophisticated, refined
100
+ - `techy` - Technical, digital
101
+ - `eco` - Environmental, green
102
+ - `luxury` - Premium, high-end
103
+ - `minimal` - Simple, short
104
+ - `creative` - Artistic, unique
105
+ - `professional` - Business-oriented
106
+ - `casual` - Relaxed, informal
107
+ - `trendy` - Current, fashionable
108
+ - `simple` - Straightforward
109
+ - `unique` - Distinctive
110
+
111
+ ## Usage
112
+
113
+ ### With Transformers Library
114
+
+ ```python
+ from transformers import ReformerModelWithLMHead
+ import torch
+
+ # Load model
+ model = ReformerModelWithLMHead.from_pretrained("path/to/domain-generator")
+ model.eval()
+
+ # Byte-level encoding: each UTF-8 byte is shifted by 2 to form a token id
+ def encode_text(text):
+     return [c + 2 for c in text.encode('utf-8')]
+
+ def decode_ids(ids):
+     # Skip ids <= 2 (padding = 0, EOS = 2) and undo the +2 shift
+     return bytes(i - 2 for i in ids if i > 2).decode('utf-8', errors='ignore')
+
+ # Generate domain
+ prompt = "tags: tech;startup;modern domain:"
+ input_ids = torch.tensor([encode_text(prompt)])
+
+ with torch.no_grad():
+     output = model.generate(
+         input_ids,
+         max_new_tokens=50,
+         temperature=1.2,
+         top_p=0.95,
+         do_sample=True,
+         pad_token_id=0,
+         eos_token_id=2
+     )
+
+ # The output contains the prompt followed by the generated bytes
+ generated = decode_ids(output[0].tolist())
+ domain = generated.split("domain:")[-1].strip()
+ print(f"Generated: {domain}")
+ ```
149
+
150
+ ### Generation Parameters
151
+ - **Temperature**: 1.2 (recommended for creativity)
152
+ - **Top-p**: 0.95
153
+ - **Max New Tokens**: 50 (generated after the prompt)
154
+
155
+ ## Examples
156
+
157
+ ### Input → Output Examples
158
+
159
+ ```
160
+ tags: tech;startup;ai → techflow.ai
161
+ tags: eco;sustainable;modern → greenleaf.eco
162
+ tags: health;wellness;minimal → purelife.health
163
+ tags: fashion;luxury;elegant → velvetrose.com
164
+ tags: food;delivery;playful → snackdash.io
165
+ ```
166
+
167
+ ## Limitations
168
+
169
+ - Best results with 3-4 tags (trained range)
170
+ - May occasionally generate non-standard TLDs
171
+ - Domain availability not guaranteed
172
+ - Works best with English keywords
173
+
174
+ ## Ethical Considerations
175
+
176
+ - Generated domains should be checked for trademark conflicts
177
+ - May reflect biases present in training data
178
+ - Should not be used to generate misleading or deceptive domains
179
+
180
+ ## Model Card Contact
181
+
182
+ For questions or issues, please open an issue in the repository.
183
+
184
+ ## Citation
185
+
186
+ If you use this model, please cite:
187
+
188
+ ```bibtex
189
+ @software{domain_generator_reformer,
190
+ title = {Domain Generator - Character-Level Reformer},
191
+ year = {2024},
192
+ publisher = {HuggingFace},
193
+ url = {https://huggingface.co/your-username/domain-generator-reformer}
194
+ }
195
+ ```
196
+
197
+ ## Changelog
198
+
+ - **v1.0** (2024-01): Initial release
+   - 5 epochs training on combined dataset
+   - 0.9716 validation loss
+   - Stable generation quality
README.md CHANGED
@@ -1,3 +1,202 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language: en
+ license: apache-2.0
+ tags:
+ - text-generation
+ - domain-names
+ - reformer
+ - character-level
+ datasets:
+ - custom
+ metrics:
+ - loss
+ model-index:
+ - name: reformer-character-domain-generator
+   results:
+   - task:
+       type: text-generation
+       name: Domain Name Generation
+     metrics:
+     - type: loss
+       value: 0.9716
+       name: Validation Loss
+ ---
24
+
25
+ # Domain Name Generator - Reformer Character-Level Model
26
+
27
+ A character-level Reformer model trained to generate domain names based on descriptive tags. The model takes a set of content and style tags as input and generates appropriate, creative domain names.
28
+
29
+ ## Model Description
30
+
31
+ This model is a fine-tuned version of `google/reformer-enwik8` specifically adapted for domain name generation. It uses a pure tag-based approach where both content descriptors (e.g., "tech", "health") and style descriptors (e.g., "modern", "minimal") are treated as equal tags.
32
+
33
+ ### Key Features
34
+ - **Character-level generation**: Generates domains character by character for maximum flexibility
35
+ - **Tag-based prompting**: Uses 3-4 descriptive tags to guide generation
36
+ - **Style-aware**: Understands style tags like "modern", "minimal", "playful"
37
+ - **Position-independent**: Tag order doesn't matter due to training-time shuffling
38
+
39
+ ## Model Details
40
+
41
+ - **Architecture**: Reformer with interleaved local and LSH attention layers (9 local, 3 LSH)
42
+ - **Base Model**: google/reformer-enwik8
43
+ - **Model Size**: ~149M parameters (~595 MB of float32 weights)
44
+ - **Vocabulary Size**: 258 (byte-level encoding)
45
+ - **Max Sequence Length**: 256 characters
46
+ - **Hidden Size**: 1024
47
+ - **Layers**: 12
48
+ - **Attention Heads**: 8
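A vocabulary of 258 is consistent with a byte-level scheme in which each of the 256 byte values is shifted by 2. The sketch below illustrates that mapping under the same `+2` offset used by the usage example in this README; the helper names and the choice to skip ids up to the EOS id on decoding are illustrative assumptions, not a documented tokenizer.

```python
PAD_ID = 0   # pad_token_id in config.json
EOS_ID = 2   # eos_token_id in config.json
OFFSET = 2   # assumed shift between a byte value and its token id

VOCAB_SIZE = 256 + OFFSET  # 256 byte values plus the shifted-out low ids


def encode_text(text: str) -> list[int]:
    """Map each UTF-8 byte of the text to a token id."""
    return [b + OFFSET for b in text.encode("utf-8")]


def decode_ids(ids: list[int]) -> str:
    """Invert encode_text, dropping padding and EOS ids."""
    return bytes(i - OFFSET for i in ids if i > EOS_ID).decode("utf-8", errors="ignore")


ids = encode_text("ai")
print(ids)                                   # [99, 107]
print(decode_ids(ids + [EOS_ID, PAD_ID]))    # ai
print(VOCAB_SIZE)                            # 258
```

Because one token is one byte, the 256-position limit is a limit on prompt plus output measured in bytes, which only differs from characters for non-ASCII input.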
49
+
50
+ ## Training Details
51
+
52
+ ### Training Data
53
+ - **Primary Dataset**: 250k real domains from BrandBucket
54
+ - **Synthetic Dataset**: 1.75M AI-generated domains
55
+ - **Total Examples**: ~2M domains
56
+ - **Data Split**: ~87.5% synthetic, ~12.5% real (by the example counts above)
57
+
58
+ ### Training Configuration
59
+ - **Epochs**: 5
60
+ - **Batch Size**: 256 (128 × 2 gradient accumulation)
61
+ - **Learning Rate**: 5e-05
62
+ - **Tag Dropout**: 10%
63
+ - **Style Tag Probability**: 30%
64
+ - **Hardware**: NVIDIA H100 GPU
65
+ - **Training Time**: 17.6 hours
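The tag-related settings above can be pictured with a small augmentation sketch. This is not the training code: the function `make_training_prompt`, the per-tag reading of the 10% dropout, the rule of appending one random style tag with 30% probability, and the `domain: <name>` target layout are all assumptions made for illustration.

```python
import random

TAG_DROPOUT = 0.10      # "Tag Dropout" from the configuration list
STYLE_TAG_PROB = 0.30   # "Style Tag Probability" from the configuration list
STYLE_TAGS = ["modern", "classic", "playful", "minimal"]  # subset, for illustration


def make_training_prompt(content_tags, domain, rng):
    """Hypothetical augmentation: drop, extend, and shuffle tags, then format."""
    # Drop each tag independently, but always keep at least one.
    tags = [t for t in content_tags if rng.random() >= TAG_DROPOUT] or [content_tags[0]]
    # Occasionally add a style tag so style descriptors appear as ordinary tags.
    if rng.random() < STYLE_TAG_PROB:
        tags.append(rng.choice(STYLE_TAGS))
    # Shuffle so that no tag position carries meaning.
    rng.shuffle(tags)
    return f"tags: {';'.join(tags)} domain: {domain}"


rng = random.Random(0)
print(make_training_prompt(["tech", "startup", "ai"], "techflow.ai", rng))
```

Shuffling at training time is what would make tag order irrelevant at inference, and dropout would expose the model to prompts with fewer tags than usual.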
66
+
67
+ ### Training Results
68
+ - **Final Training Loss**: 1.1113
69
+ - **Best Validation Loss**: 0.9716
70
+ - **Loss Reduction**: 75%
71
+ - **Training Stability**: std=0.0014 (very stable)
72
+
73
+ ## Intended Use
74
+
75
+ ### Primary Use Cases
76
+ - Generate domain names for startups and businesses
77
+ - Brainstorm creative domain ideas based on keywords
78
+ - Explore domain variations with different styles
79
+
80
+ ### Input Format
81
+ ```
82
+ tags: tag1;tag2;tag3 domain:
83
+ ```
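A small helper can assemble prompts in this layout. The sketch below is one way to do it; the name `build_prompt` and the lowercasing and whitespace trimming are assumptions based on the lowercase example tags, not documented requirements.

```python
def build_prompt(tags):
    """Format a list of tags as 'tags: a;b;c domain:'."""
    # Normalise tags and discard empty entries.
    cleaned = [t.strip().lower() for t in tags if t.strip()]
    if not cleaned:
        raise ValueError("at least one tag is required")
    return f"tags: {';'.join(cleaned)} domain:"


print(build_prompt(["tech", "startup", "modern"]))
# tags: tech;startup;modern domain:
```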
84
+
85
+ ### Supported Tags
86
+
87
+ **Content Tags** (examples):
88
+ - `tech`, `ai`, `startup`, `app`, `software`
89
+ - `health`, `wellness`, `fitness`, `medical`
90
+ - `eco`, `green`, `sustainable`, `organic`
91
+ - `fashion`, `beauty`, `style`, `boutique`
92
+ - `food`, `restaurant`, `cafe`, `delivery`
93
+
94
+ **Style Tags**:
95
+ - `modern` - Clean, contemporary
96
+ - `classic` - Traditional, timeless
97
+ - `playful` - Fun, casual
98
+ - `bold` - Strong, impactful
99
+ - `elegant` - Sophisticated, refined
100
+ - `techy` - Technical, digital
101
+ - `eco` - Environmental, green
102
+ - `luxury` - Premium, high-end
103
+ - `minimal` - Simple, short
104
+ - `creative` - Artistic, unique
105
+ - `professional` - Business-oriented
106
+ - `casual` - Relaxed, informal
107
+ - `trendy` - Current, fashionable
108
+ - `simple` - Straightforward
109
+ - `unique` - Distinctive
110
+
111
+ ## Usage
112
+
113
+ ### With Transformers Library
114
+
+ ```python
+ from transformers import ReformerModelWithLMHead
+ import torch
+
+ # Load model
+ model = ReformerModelWithLMHead.from_pretrained("humbleworth/reformer-character-domain-generator")
+ model.eval()
+
+ # Byte-level encoding: each UTF-8 byte is shifted by 2 to form a token id
+ def encode_text(text):
+     return [c + 2 for c in text.encode('utf-8')]
+
+ def decode_ids(ids):
+     # Skip ids <= 2 (padding = 0, EOS = 2) and undo the +2 shift
+     return bytes(i - 2 for i in ids if i > 2).decode('utf-8', errors='ignore')
+
+ # Generate domain
+ prompt = "tags: tech;startup;modern domain:"
+ input_ids = torch.tensor([encode_text(prompt)])
+
+ with torch.no_grad():
+     output = model.generate(
+         input_ids,
+         max_new_tokens=50,
+         temperature=1.2,
+         top_p=0.95,
+         do_sample=True,
+         pad_token_id=0,
+         eos_token_id=2
+     )
+
+ # The output contains the prompt followed by the generated bytes
+ generated = decode_ids(output[0].tolist())
+ domain = generated.split("domain:")[-1].strip()
+ print(f"Generated: {domain}")
+ ```
149
+
150
+ ### Generation Parameters
151
+ - **Temperature**: 1.2 (recommended for creativity)
152
+ - **Top-p**: 0.95
153
+ - **Max New Tokens**: 50 (generated after the prompt)
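To show what the temperature and top-p settings do to a single sampling step, here is a minimal pure-Python sketch. It illustrates the general idea of temperature scaling followed by nucleus filtering and is not a reproduction of the library's implementation; `sample_top_p` and the example logits are made up for illustration.

```python
import math
import random


def sample_top_p(logits, temperature=1.2, top_p=0.95, rng=random):
    """Sample one index using temperature scaling, then nucleus (top-p) filtering."""
    # Temperature > 1 flattens the distribution; < 1 sharpens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # random.choices accepts unnormalised weights, so no explicit renormalisation.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [4.0, 3.5, 1.0, -2.0]
print(sample_top_p(logits, rng=random.Random(0)))
```

Lowering `temperature` or `top_p` narrows the candidate set and gives more conservative names; raising them trades consistency for variety.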
154
+
155
+ ## Examples
156
+
157
+ ### Input → Output Examples
158
+
159
+ ```
160
+ tags: tech;startup;ai → techflow.ai
161
+ tags: eco;sustainable;modern → greenleaf.eco
162
+ tags: health;wellness;minimal → purelife.health
163
+ tags: fashion;luxury;elegant → velvetrose.com
164
+ tags: food;delivery;playful → snackdash.io
165
+ ```
166
+
167
+ ## Limitations
168
+
169
+ - Best results with 3-4 tags (trained range)
170
+ - May occasionally generate non-standard TLDs
171
+ - Domain availability not guaranteed
172
+ - Works best with English keywords
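Since the model may emit non-standard TLDs or malformed strings, generated text is worth filtering before use. The sketch below is one possible post-processing step; `clean_domain`, the regular expression, and the `ALLOWED_TLDS` allow-list are illustrative assumptions, and passing this check says nothing about whether a domain is actually available.

```python
import re

# Dot-separated labels of letters, digits, and inner hyphens, ending in an alphabetic TLD.
DOMAIN_RE = re.compile(r"^(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$")
ALLOWED_TLDS = {"com", "io", "ai", "eco", "health"}  # hypothetical allow-list


def clean_domain(raw, allowed_tlds=ALLOWED_TLDS):
    """Return a normalised domain, or None if the text is not acceptable."""
    parts = raw.strip().lower().split()
    candidate = parts[0] if parts else ""  # keep only the first whitespace-delimited token
    if not DOMAIN_RE.match(candidate):
        return None
    tld = candidate.rsplit(".", 1)[1]
    return candidate if tld in allowed_tlds else None


print(clean_domain(" TechFlow.ai \n"))   # techflow.ai
print(clean_domain("greenleaf.zzz"))     # None
```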
173
+
174
+ ## Ethical Considerations
175
+
176
+ - Generated domains should be checked for trademark conflicts
177
+ - May reflect biases present in training data
178
+ - Should not be used to generate misleading or deceptive domains
179
+
180
+ ## Model Card Contact
181
+
182
+ For questions or issues, please open an issue in the repository.
183
+
184
+ ## Citation
185
+
186
+ If you use this model, please cite:
187
+
188
+ ```bibtex
189
+ @software{domain_generator_reformer,
190
+ title = {Domain Generator - Character-Level Reformer},
191
+ year = {2025},
192
+ publisher = {HuggingFace},
193
+ url = {https://huggingface.co/humbleworth/reformer-character-domain-generator}
194
+ }
195
+ ```
196
+
197
+ ## Changelog
198
+
+ - **v1.0** (2024-01): Initial release
+   - 5 epochs training on combined dataset
+   - 0.9716 validation loss
+   - Stable generation quality
config.json ADDED
@@ -0,0 +1,68 @@
+ {
+   "architectures": [
+     "ReformerModelWithLMHead"
+   ],
+   "attention_head_size": 128,
+   "attn_layers": [
+     "local",
+     "local",
+     "lsh",
+     "local",
+     "local",
+     "local",
+     "lsh",
+     "local",
+     "local",
+     "local",
+     "lsh",
+     "local"
+   ],
+   "axial_norm_std": 1.0,
+   "axial_pos_embds": true,
+   "axial_pos_embds_dim": [
+     256,
+     768
+   ],
+   "axial_pos_shape": [
+     16,
+     16
+   ],
+   "chunk_size_lm_head": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "feed_forward_size": 4096,
+   "hash_seed": null,
+   "hidden_act": "relu",
+   "hidden_dropout_prob": 0.2,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "is_decoder": true,
+   "layer_norm_eps": 1e-12,
+   "local_attention_probs_dropout_prob": 0.2,
+   "local_attn_chunk_length": 128,
+   "local_num_chunks_after": 0,
+   "local_num_chunks_before": 1,
+   "lsh_attention_probs_dropout_prob": 0.1,
+   "lsh_attn_chunk_length": 256,
+   "lsh_num_chunks_after": 0,
+   "lsh_num_chunks_before": 1,
+   "max_position_embeddings": 256,
+   "model_type": "reformer",
+   "num_attention_heads": 8,
+   "num_buckets": 512,
+   "num_hashes": 4,
+   "num_hidden_layers": 12,
+   "output_past": true,
+   "pad_token_id": 0,
+   "task_specific_params": {
+     "text-generation": {
+       "do_sample": true,
+       "max_length": 100
+     }
+   },
+   "tie_word_embeddings": false,
+   "torch_dtype": "float32",
+   "transformers_version": "4.53.1",
+   "use_cache": true,
+   "vocab_size": 258
+ }
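Several values in this config are related to one another, and a quick check makes those relationships explicit. The sketch below copies a subset of the fields by hand and asserts the arithmetic that holds for them; it is a sanity check on this file, not a statement of every constraint the library enforces.

```python
# Subset of config.json, copied by hand for illustration.
config = {
    "hidden_size": 1024,
    "num_attention_heads": 8,
    "attention_head_size": 128,
    "axial_pos_shape": [16, 16],
    "axial_pos_embds_dim": [256, 768],
    "max_position_embeddings": 256,
    "num_hidden_layers": 12,
    "attn_layers": ["local", "local", "lsh", "local"] * 3,
}

# One attention type is listed per layer.
assert len(config["attn_layers"]) == config["num_hidden_layers"]

# The axial position grid covers exactly the maximum sequence length: 16 * 16 = 256.
rows, cols = config["axial_pos_shape"]
assert rows * cols == config["max_position_embeddings"]

# The two axial embedding widths are concatenated to the hidden size: 256 + 768 = 1024.
assert sum(config["axial_pos_embds_dim"]) == config["hidden_size"]

# In this config the heads also tile the hidden size: 8 * 128 = 1024.
assert config["num_attention_heads"] * config["attention_head_size"] == config["hidden_size"]

print(config["attn_layers"].count("local"), "local /", config["attn_layers"].count("lsh"), "lsh")
# 9 local / 3 lsh
```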
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "eos_token_id": 2,
+   "pad_token_id": 0,
+   "transformers_version": "4.53.1"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:00b73d5dfd3169de30acef09570180e4d5116696b265a93e22dffd1bf3098f21
+ size 595111584
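The pointer's `size` field gives a rough way to estimate the parameter count. The sketch below assumes that every weight is stored as float32 (per `torch_dtype` in `config.json`) and ignores the small safetensors header, so it slightly overestimates.

```python
FILE_SIZE_BYTES = 595_111_584   # "size" field of the LFS pointer
BYTES_PER_PARAM = 4             # float32, assumed for every tensor

approx_params = FILE_SIZE_BYTES / BYTES_PER_PARAM
print(f"~{approx_params / 1e6:.1f}M parameters")
# ~148.8M parameters
```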