jacob-c committed on
Commit
1f99d4e
·
verified ·
1 Parent(s): b16f442

Upload folder using huggingface_hub

LYRICS_GENERATION_GUIDE.md ADDED
@@ -0,0 +1,137 @@
+ # Lyrics Generation Guide
+
+ This guide explains how to use the fine-tuned GPT-Neo 2.7B model to generate lyrics in different styles and themes.
+
+ ## Basic Usage
+
+ The basic command to generate lyrics is:
+
+ ```bash
+ python generate_lyrics_english.py --artist "Artist Name" --use_cpu
+ ```
+
+ This will generate lyrics in the style of the specified artist.
+
+ ## Available Parameters
+
+ The script supports the following parameters:
+
+ | Parameter | Description | Default |
+ |-----------|-------------|---------|
+ | `--artist` | The artist name to emulate | "Taylor Swift" |
+ | `--theme` | Theme or topic for the lyrics | "" (none) |
+ | `--style` | Style of the lyrics (e.g., romantic, upbeat, sad) | "" (none) |
+ | `--prompt` | Custom text to start the lyrics | "" (none) |
+ | `--max_length` | Maximum length of generated text | 200 |
+ | `--temperature` | Generation temperature (higher = more diverse) | 0.7 |
+ | `--top_p` | Nucleus sampling probability threshold | 0.9 |
+ | `--num_samples` | Number of samples to generate | 1 |
+ | `--use_cpu` | Use CPU for inference | False |
+ | `--model_path` | Path to the model | Final model path |
+ | `--checkpoint_path` | Path to a specific checkpoint | Checkpoint-900 |
+
+ ## Example Prompts
+
+ ### 1. Basic Artist Emulation
+
+ ```bash
+ python generate_lyrics_english.py --artist "Taylor Swift" --use_cpu
+ ```
+
+ ### 2. Artist with Theme
+
+ ```bash
+ python generate_lyrics_english.py --artist "Beyonce" --theme "empowerment" --use_cpu
+ ```
+
+ ### 3. Artist with Style
+
+ ```bash
+ python generate_lyrics_english.py --artist "Ed Sheeran" --style "romantic" --use_cpu
+ ```
+
+ ### 4. Artist with Theme and Style
+
+ ```bash
+ python generate_lyrics_english.py --artist "Adele" --theme "heartbreak" --style "emotional" --use_cpu
+ ```
+
+ ### 5. Artist with Custom Starting Prompt
+
+ ```bash
+ python generate_lyrics_english.py --artist "Bruno Mars" --prompt "Dancing in the moonlight" --use_cpu
+ ```
+
+ ### 6. Complete Specification
+
+ ```bash
+ python generate_lyrics_english.py --artist "Lady Gaga" --theme "freedom" --style "upbeat" --prompt "I was born this way" --use_cpu
+ ```
+
+ ## Controlling Generation Parameters
+
+ ### Temperature
+
+ The temperature parameter controls the randomness of the generation. Higher values (e.g., 1.0) make the output more diverse but potentially less coherent, while lower values (e.g., 0.5) make the output more focused and closer to deterministic.
+
+ ```bash
+ # More creative output
+ python generate_lyrics_english.py --artist "Ariana Grande" --temperature 1.0 --use_cpu
+
+ # More focused output
+ python generate_lyrics_english.py --artist "Drake" --temperature 0.5 --use_cpu
+ ```
+
+ ### Top-p (Nucleus Sampling)
+
+ The top-p parameter limits sampling to the smallest set of most probable tokens whose cumulative probability reaches the threshold. Lower values narrow that set and make the output more focused; higher values widen it and make the output more diverse.
+
+ ```bash
+ # More diverse output
+ python generate_lyrics_english.py --artist "Beyonce" --top_p 0.95 --use_cpu
+
+ # More focused output
+ python generate_lyrics_english.py --artist "Beyonce" --top_p 0.7 --use_cpu
+ ```
+
+ ### Maximum Length
+
+ Control the length of the generated lyrics:
+
+ ```bash
+ # Shorter lyrics
+ python generate_lyrics_english.py --artist "Taylor Swift" --max_length 100 --use_cpu
+
+ # Longer lyrics
+ python generate_lyrics_english.py --artist "Taylor Swift" --max_length 300 --use_cpu
+ ```
+
+ ## Batch Generation
+
+ Generate multiple samples at once:
+
+ ```bash
+ python generate_lyrics_english.py --artist "Taylor Swift" --num_samples 5 --use_cpu
+ ```
+
+ ## Tips for Better Results
+
+ 1. **Be specific with artists**: The model was trained on specific artists' styles, so prompting with an artist from the training data should yield better results.
+
+ 2. **Combine parameters**: Using a combination of theme, style, and prompt often yields the most interesting and coherent results.
+
+ 3. **Experiment with temperature**: If the output is too repetitive, try increasing the temperature. If it's too random, try decreasing it.
+
+ 4. **Use CPU mode for reliability**: GPU mode is faster, but CPU mode avoids GPU memory issues.
+
+ 5. **Try different starting prompts**: The starting prompt can significantly influence the direction of the generated lyrics.
+
+ ## Running the Examples Script
+
+ For a quick demonstration of different prompt combinations, run:
+
+ ```bash
+ ./english_prompt_examples.sh
+ ```
+
+ This will run through several example prompts with different configurations.
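The temperature and top-p settings described in the guide above can be illustrated with a small, dependency-free sketch. It is not taken from `generate_lyrics_english.py` (which is not part of this commit); the function names and the four example scores are made up for illustration, and library implementations may handle edge cases differently.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.5)  # more peaked
flat = softmax_with_temperature(logits, temperature=1.0)   # less peaked
nucleus = top_p_filter(flat, top_p=0.7)                    # token index -> renormalized probability
```

With these example scores, lowering the temperature from 1.0 to 0.5 shifts more probability onto the top-scoring token, and a top-p of 0.7 keeps only the two most probable tokens.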
README.md CHANGED
@@ -1,3 +1,129 @@
- ---
- license: mit
- ---
+ ---
+ language: en
+ license: mit
+ tags:
+ - gpt-neo
+ - lora
+ - text-generation
+ - lyrics
+ - music
+ datasets:
+ - smgriffin/modern-pop-lyrics
+ ---
+
+ # GPT-Neo 2.7B LoRA Fine-tuned for Lyrics Generation
+
+ This model is a fine-tuned version of [EleutherAI/gpt-neo-2.7B](https://huggingface.co/EleutherAI/gpt-neo-2.7B) on a dataset of modern pop lyrics using Low-Rank Adaptation (LoRA). The model can generate lyrics in the style of different artists.
+
+ ## Model Description
+
+ This model uses LoRA (Low-Rank Adaptation) to efficiently fine-tune GPT-Neo 2.7B for lyrics generation. LoRA keeps the base weights frozen and adds trainable low-rank decomposition matrices alongside them, greatly reducing the number of trainable parameters.
+
+ - **Base model**: [EleutherAI/gpt-neo-2.7B](https://huggingface.co/EleutherAI/gpt-neo-2.7B)
+ - **Model type**: Causal language model with LoRA adapters
+ - **Language**: English
+ - **Training data**: [Modern Pop Lyrics](https://huggingface.co/datasets/smgriffin/modern-pop-lyrics) dataset
+ - **LoRA config**:
+   - rank (r): 16
+   - alpha: 32
+   - dropout: 0.05
+   - Target modules: q_proj, k_proj, v_proj, out_proj (attention projections, as listed in adapter_config.json)
+
+ ## How to Use
+
+ You can use this model to generate lyrics in the style of different artists. Here's a simple example:
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/gptneo-2.7Bloratunning")
+ model = AutoModelForCausalLM.from_pretrained(
+     "YOUR_USERNAME/gptneo-2.7Bloratunning",
+     load_in_8bit=True,  # For memory efficiency
+     device_map="auto",
+     torch_dtype=torch.float16,
+ )
+
+ # Set the artist name
+ artist = "Taylor Swift"
+ prompt = f"Artist: {artist}\nLyrics:"
+
+ # Tokenize input
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ # Generate lyrics
+ with torch.no_grad():
+     outputs = model.generate(
+         **inputs,
+         max_length=200,
+         temperature=0.7,
+         top_p=0.9,
+         top_k=50,
+         num_return_sequences=1,
+         pad_token_id=tokenizer.eos_token_id,
+         do_sample=True,
+     )
+
+ # Decode and print the generated lyrics
+ generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(generated_text)
+ ```
+
+ ## Generation Parameters
+
+ You can adjust these parameters to control the output:
+
+ - **temperature**: Controls randomness (higher = more diverse, lower = more deterministic)
+ - **top_p**: Nucleus sampling threshold (sample only from the smallest set of most probable tokens whose cumulative probability reaches top_p)
+ - **top_k**: Consider only the top k tokens (set to 0 to disable)
+ - **max_length**: Maximum total length in tokens, including the prompt
+ - **num_return_sequences**: Number of different outputs to generate
+
+ ## Example Usage
+
+ ### Basic Generation
+
+ ```bash
+ python generate.py --artist "Taylor Swift"
+ ```
+
+ ### Customized Generation
+
+ ```bash
+ python generate.py --artist "Beyonce" --temperature 0.8 --top_p 0.92 --max_length 300
+ ```
+
+ ## Training Details
+
+ The model was trained with the following configuration:
+
+ - **Batch size**: 1 per device
+ - **Gradient accumulation steps**: 16
+ - **Learning rate**: 2e-4
+ - **Training steps**: 900
+ - **Sequence length**: 512
+ - **Optimizer**: AdamW
+ - **Scheduler**: Linear with warmup
+
+ ## Limitations
+
+ - The model is trained on a specific set of artists, so it works best with those artists
+ - May generate content that reflects biases in the training data
+ - Can sometimes produce repetitive content, especially with lower temperature settings
+ - Output quality varies based on the specific artist and generation parameters
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{gptneo-2.7Bloratunning,
+   author = {YOUR_NAME},
+   title = {GPT-Neo 2.7B LoRA Fine-tuned for Lyrics Generation},
+   year = {2024},
+   publisher = {HuggingFace},
+   howpublished = {\url{https://huggingface.co/YOUR_USERNAME/gptneo-2.7Bloratunning}}
+ }
+ ```
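The Training Details listed in the README imply an effective batch size and an upper bound on the data seen during fine-tuning. The arithmetic below is a rough sketch under two assumptions that the README does not state: training ran on a single device, and "training steps" counts optimizer updates rather than forward passes.

```python
# Values taken from the README's Training Details section.
per_device_batch_size = 1
gradient_accumulation_steps = 16
training_steps = 900
sequence_length = 512

# Assumption: one device. The README does not say how many were used.
num_devices = 1

# Sequences contributing to each optimizer update.
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices

# Assumption: each training step is one optimizer update.
sequences_seen = effective_batch * training_steps

# Upper bound only: sequences shorter than 512 tokens contribute fewer tokens.
max_tokens_seen = sequences_seen * sequence_length

print(effective_batch, sequences_seen, max_tokens_seen)
# prints: 16 14400 7372800
```

Under those assumptions the run saw at most about 7.4 million tokens, which is small relative to the base model's pre-training and consistent with the README's note that the model works best for artists present in the training data.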
adapter_config.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "alpha_pattern": {},
+   "auto_mapping": null,
+   "base_model_name_or_path": "EleutherAI/gpt-neo-2.7B",
+   "bias": "none",
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 32,
+   "lora_bias": false,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "r": 16,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "v_proj",
+     "q_proj",
+     "out_proj",
+     "k_proj"
+   ],
+   "task_type": "CAUSAL_LM",
+   "use_dora": false,
+   "use_rslora": false
+ }
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0fb605f746c8ee79ede9567114e61860fcc972cb802a23b522b9391ec4c0279c
+ size 41979088
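The adapter file size above can be sanity-checked against the LoRA settings in `adapter_config.json`. The sketch below assumes GPT-Neo 2.7B has a hidden size of 2560 and 32 transformer blocks, that every targeted projection is a square hidden-by-hidden matrix, and that the adapter weights are stored in 32-bit floats; none of these values appear in this commit, so treat the result as an estimate.

```python
# From adapter_config.json in this commit.
rank = 16
target_modules = ["q_proj", "k_proj", "v_proj", "out_proj"]

# Assumptions about the base model, not stated in this commit.
hidden_size = 2560
num_layers = 32
bytes_per_weight = 4  # assumes float32 storage

# Each adapted projection gets two low-rank factors:
# A with shape (rank, hidden_size) and B with shape (hidden_size, rank).
params_per_module = rank * hidden_size + hidden_size * rank
lora_params = params_per_module * len(target_modules) * num_layers
estimated_bytes = lora_params * bytes_per_weight

reported_file_size = 41979088  # "size" field of the Git LFS pointer
overhead = reported_file_size - estimated_bytes

print(lora_params, estimated_bytes, overhead)
# prints: 10485760 41943040 36048
```

Under these assumptions the estimate falls about 36 KB short of the reported size, a gap small enough to be plausibly explained by the file's header and tensor metadata. That works out to roughly 10.5 million trainable parameters against the 2.7 billion in the base model.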
generate.py ADDED
@@ -0,0 +1,73 @@
+ #!/usr/bin/env python3
+ import os
+ import torch
+ import argparse
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import PeftModel, PeftConfig
+
+ # Set constants
+ USERNAME = "jc4121"  # Your username
+ BASE_PATH = f"/vol/bitbucket/{USERNAME}"
+ OUTPUT_PATH = f"{BASE_PATH}/gptneo-2.7Bloratunning/output"
+
+ def parse_args():
+     parser = argparse.ArgumentParser(description="Generate lyrics using fine-tuned GPT-Neo 2.7B")
+     parser.add_argument("--model_path", type=str, default=f"{OUTPUT_PATH}/final_model", help="Path to the fine-tuned model")
+     parser.add_argument("--artist", type=str, default="Taylor Swift", help="Artist name for conditioning")
+     parser.add_argument("--max_length", type=int, default=512, help="Maximum length of generated text")
+     parser.add_argument("--temperature", type=float, default=0.7, help="Sampling temperature")
+     parser.add_argument("--top_p", type=float, default=0.9, help="Top-p sampling parameter")
+     parser.add_argument("--top_k", type=int, default=50, help="Top-k sampling parameter")
+     parser.add_argument("--num_return_sequences", type=int, default=1, help="Number of sequences to generate")
+     parser.add_argument("--seed", type=int, default=42, help="Random seed")
+     return parser.parse_args()
+
+ def main():
+     args = parse_args()
+     torch.manual_seed(args.seed)
+
+     print(f"Loading model from {args.model_path}")
+
+     # Load tokenizer
+     tokenizer = AutoTokenizer.from_pretrained(args.model_path)
+
+     # Load model
+     model = AutoModelForCausalLM.from_pretrained(
+         args.model_path,
+         load_in_8bit=True,
+         device_map="auto",
+         torch_dtype=torch.float16,
+     )
+
+     # Set model to evaluation mode
+     model.eval()
+
+     # Prepare prompt
+     prompt = f"Artist: {args.artist}\nLyrics:"
+     print(f"Generating lyrics for artist: {args.artist}")
+
+     # Tokenize prompt
+     inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+     # Generate text
+     with torch.no_grad():
+         outputs = model.generate(
+             **inputs,
+             max_length=args.max_length,
+             temperature=args.temperature,
+             top_p=args.top_p,
+             top_k=args.top_k,
+             num_return_sequences=args.num_return_sequences,
+             pad_token_id=tokenizer.eos_token_id,
+             do_sample=True,
+         )
+
+     # Decode and print generated text
+     for i, output in enumerate(outputs):
+         generated_text = tokenizer.decode(output, skip_special_tokens=True)
+         print(f"\n--- Generated Lyrics {i+1} ---\n")
+         print(generated_text)
+         print("\n" + "-" * 50)
+
+ if __name__ == "__main__":
+     main()
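`generate.py` imports `PeftModel` and `PeftConfig` but passes the model path straight to `AutoModelForCausalLM.from_pretrained`. Whether that call picks up a LoRA adapter on its own may depend on the installed `transformers` and `peft` versions. The sketch below shows one explicit alternative, assuming the directory holds `adapter_config.json`, the adapter weights, and the tokenizer files as in this commit. The helper names `read_base_model_name` and `load_lora_model` are hypothetical and do not appear in the repository.

```python
import json
from pathlib import Path

def read_base_model_name(adapter_dir):
    """Return the base model identifier recorded in adapter_config.json."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    config = json.loads(config_path.read_text())
    return config["base_model_name_or_path"]

def load_lora_model(adapter_dir):
    """Load the base model, attach the LoRA adapter, and return (model, tokenizer).

    Hypothetical helper: heavy imports are deferred so that the lightweight
    config reader above can be used without transformers or peft installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base_name = read_base_model_name(adapter_dir)
    base_model = AutoModelForCausalLM.from_pretrained(base_name)
    # Wrap the frozen base model with the adapter weights stored in adapter_dir.
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
    model.eval()
    return model, tokenizer
```

Calling `load_lora_model` downloads the full base model, so it needs enough memory for GPT-Neo 2.7B; the 8-bit and device-placement options used in `generate.py` are left out here to keep the sketch short.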
merges.txt ADDED
The diff for this file is too large to render. See raw diff
requirements.txt ADDED
@@ -0,0 +1,10 @@
+ transformers>=4.30.0
+ datasets>=2.12.0
+ peft>=0.4.0
+ accelerate>=0.20.0
+ bitsandbytes>=0.40.0
+ torch>=2.0.0
+ wandb>=0.15.0
+ tqdm>=4.65.0
+ sentencepiece>=0.1.99
+ tensorboard>=2.13.0
run_generate.sh ADDED
@@ -0,0 +1,34 @@
+ #!/bin/bash
+
+ # Set environment variables
+ export CUDA_VISIBLE_DEVICES=0
+ export TRANSFORMERS_CACHE="/vol/bitbucket/jc4121/gptneo-2.7Bloratunning/cache"
+ export HF_DATASETS_CACHE="/vol/bitbucket/jc4121/gptneo-2.7Bloratunning/data"
+ export HF_HOME="/vol/bitbucket/jc4121/gptneo-2.7Bloratunning/hf_home"
+
+ # Create directories (quoted so paths containing spaces still work)
+ mkdir -p "$TRANSFORMERS_CACHE"
+ mkdir -p "$HF_DATASETS_CACHE"
+ mkdir -p "$HF_HOME"
+ mkdir -p /vol/bitbucket/jc4121/gptneo-2.7Bloratunning/lib
+
+ # Add custom lib path to PYTHONPATH
+ export PYTHONPATH=/vol/bitbucket/jc4121/gptneo-2.7Bloratunning/lib:$PYTHONPATH
+
+ # Default artist
+ ARTIST="Taylor Swift"
+
+ # Check if an artist name was provided
+ if [ $# -ge 1 ]; then
+     ARTIST="$1"
+ fi
+
+ # Run generation
+ echo "Generating lyrics for artist: $ARTIST"
+ python generate.py \
+     --artist "$ARTIST" \
+     --max_length 512 \
+     --temperature 0.7 \
+     --top_p 0.9 \
+     --top_k 50 \
+     --num_return_sequences 1
special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "bos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<|endoftext|>",
+   "unk_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "add_bos_token": false,
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "50256": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<|endoftext|>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|endoftext|>",
+   "errors": "replace",
+   "extra_special_tokens": {},
+   "model_max_length": 2048,
+   "pad_token": "<|endoftext|>",
+   "padding_side": "right",
+   "tokenizer_class": "GPT2Tokenizer",
+   "unk_token": "<|endoftext|>"
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff