Upload folder using huggingface_hub

Files changed (12)

LYRICS_GENERATION_GUIDE.md +137 -0
README.md +129 -3
adapter_config.json +34 -0
adapter_model.safetensors +3 -0
generate.py +73 -0
merges.txt +0 -0
requirements.txt +10 -0
run_generate.sh +34 -0
special_tokens_map.json +24 -0
tokenizer.json +0 -0
tokenizer_config.json +24 -0
vocab.json +0 -0

LYRICS_GENERATION_GUIDE.md ADDED

	@@ -0,0 +1,137 @@

+# Lyrics Generation Guide
+This guide explains how to use the fine-tuned GPT-Neo 2.7B model to generate lyrics in different styles and themes.
+## Basic Usage
+The basic command to generate lyrics is:
+```bash
+python generate_lyrics_english.py --artist "Artist Name" --use_cpu
+```
+This will generate lyrics in the style of the specified artist.
+## Available Parameters
+The script supports the following parameters:
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `--artist` | The artist name to emulate | "Taylor Swift" |
+| `--theme` | Theme or topic for the lyrics | "" (none) |
+| `--style` | Style of the lyrics (e.g., romantic, upbeat, sad) | "" (none) |
+| `--prompt` | Custom text to start the lyrics | "" (none) |
+| `--max_length` | Maximum length of generated text | 200 |
+| `--temperature` | Generation temperature (higher = more diverse) | 0.7 |
+| `--top_p` | Nucleus sampling probability threshold | 0.9 |
+| `--num_samples` | Number of samples to generate | 1 |
+| `--use_cpu` | Use CPU for inference | False |
+| `--model_path` | Path to the model | Final model path |
+| `--checkpoint_path` | Path to a specific checkpoint | Checkpoint-900 |
+## Example Prompts
+### 1. Basic Artist Emulation
+```bash
+python generate_lyrics_english.py --artist "Taylor Swift" --use_cpu
+```
+### 2. Artist with Theme
+```bash
+python generate_lyrics_english.py --artist "Beyonce" --theme "empowerment" --use_cpu
+```
+### 3. Artist with Style
+```bash
+python generate_lyrics_english.py --artist "Ed Sheeran" --style "romantic" --use_cpu
+```
+### 4. Artist with Theme and Style
+```bash
+python generate_lyrics_english.py --artist "Adele" --theme "heartbreak" --style "emotional" --use_cpu
+```
+### 5. Artist with Custom Starting Prompt
+```bash
+python generate_lyrics_english.py --artist "Bruno Mars" --prompt "Dancing in the moonlight" --use_cpu
+```
+### 6. Complete Specification
+```bash
+python generate_lyrics_english.py --artist "Lady Gaga" --theme "freedom" --style "upbeat" --prompt "I was born this way" --use_cpu
+```
+## Controlling Generation Parameters
+### Temperature
+The temperature parameter controls the randomness of the generation. Higher values (e.g., 1.0) make the output more diverse but potentially less coherent, while lower values (e.g., 0.5) make the output more focused and deterministic.
+```bash
+# More creative output
+python generate_lyrics_english.py --artist "Ariana Grande" --temperature 1.0 --use_cpu
+# More focused output
+python generate_lyrics_english.py --artist "Drake" --temperature 0.5 --use_cpu
+```
+### Top-p (Nucleus Sampling)
+The top-p parameter restricts sampling to the smallest set of most probable tokens whose cumulative probability reaches the threshold. Lower values narrow that set and give more focused output; higher values widen it and give more diverse output.
+```bash
+# More diverse output
+python generate_lyrics_english.py --artist "Beyonce" --top_p 0.95 --use_cpu
+# More focused output
+python generate_lyrics_english.py --artist "Beyonce" --top_p 0.7 --use_cpu
+```
+### Maximum Length
+Control the length of the generated lyrics:
+```bash
+# Shorter lyrics
+python generate_lyrics_english.py --artist "Taylor Swift" --max_length 100 --use_cpu
+# Longer lyrics
+python generate_lyrics_english.py --artist "Taylor Swift" --max_length 300 --use_cpu
+```
+## Batch Generation
+Generate multiple samples at once:
+```bash
+python generate_lyrics_english.py --artist "Taylor Swift" --num_samples 5 --use_cpu
+```
+## Tips for Better Results
+1. **Be specific with artists**: The model was trained on specific artists' styles, so using those artists will yield better results.
+2. **Combine parameters**: Using a combination of theme, style, and prompt often yields the most interesting and coherent results.
+3. **Experiment with temperature**: If the output is too repetitive, try increasing the temperature. If it's too random, try decreasing it.
+4. **Use CPU mode to avoid GPU memory errors**: GPU mode is faster, but CPU mode sidesteps out-of-memory failures on GPUs with limited memory, at the cost of much slower generation.
+5. **Try different starting prompts**: The starting prompt can significantly influence the direction of the generated lyrics.
+## Running the Examples Script
+For a quick demonstration of different prompt combinations, run:
+```bash
+./english_prompt_examples.sh
+```
+This will run through several example prompts with different configurations.
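
Note on the guide's Temperature and Top-p sections: the sketch below illustrates what the two parameters do to a next-token distribution. It is a toy model in plain Python, not the implementation used by `transformers`; the function names and the four example logits are hypothetical and chosen only for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens
sharp = softmax_with_temperature(toy_logits, 0.5)  # lower temperature
flat = softmax_with_temperature(toy_logits, 1.0)   # higher temperature
nucleus = top_p_filter(flat, 0.9)

print("top-token probability at T=0.5:", round(sharp[0], 3))
print("top-token probability at T=1.0:", round(flat[0], 3))
print("token indices kept by top_p=0.9:", sorted(nucleus))
```

Lowering the temperature concentrates probability on the highest-scoring token, while lowering `top_p` shrinks the set of tokens that can be sampled at all, so the two settings interact rather than duplicate each other.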

README.md CHANGED

@@ -1,3 +1,129 @@
----
-license: mit
----

+---
+language: en
+license: mit
+tags:
+  - gpt-neo
+  - lora
+  - text-generation
+  - lyrics
+  - music
+datasets:
+  - smgriffin/modern-pop-lyrics
+---
+# GPT-Neo 2.7B LoRA Fine-tuned for Lyrics Generation
+This model is a fine-tuned version of [EleutherAI/gpt-neo-2.7B](https://huggingface.co/EleutherAI/gpt-neo-2.7B) on a dataset of modern pop lyrics using Low-Rank Adaptation (LoRA). The model can generate lyrics in the style of different artists.
+## Model Description
+This model uses LoRA (Low-Rank Adaptation) to efficiently fine-tune GPT-Neo 2.7B for lyrics generation. LoRA freezes the pretrained weights and adds trainable low-rank decomposition matrices alongside them, greatly reducing the number of trainable parameters while largely preserving performance.
+- **Base model**: [EleutherAI/gpt-neo-2.7B](https://huggingface.co/EleutherAI/gpt-neo-2.7B)
+- **Model type**: Causal language model with LoRA adapters
+- **Language**: English
+- **Training data**: [Modern Pop Lyrics](https://huggingface.co/datasets/smgriffin/modern-pop-lyrics) dataset
+- **LoRA config**:
+  - rank (r): 16
+  - alpha: 32
+  - dropout: 0.05
+  - Target modules: q_proj, k_proj, v_proj, out_proj (attention modules)
+## How to Use
+You can use this model to generate lyrics in the style of different artists. Here's a simple example:
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# Load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/gptneo-2.7Bloratunning")
+model = AutoModelForCausalLM.from_pretrained(
+    "YOUR_USERNAME/gptneo-2.7Bloratunning",
+    load_in_8bit=True,  # For memory efficiency
+    device_map="auto",
+    torch_dtype=torch.float16,
+)
+# Set the artist name
+artist = "Taylor Swift"
+prompt = f"Artist: {artist}\nLyrics:"
+# Tokenize input
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+# Generate lyrics
+with torch.no_grad():
+    outputs = model.generate(
+        **inputs,
+        max_length=200,
+        temperature=0.7,
+        top_p=0.9,
+        top_k=50,
+        num_return_sequences=1,
+        pad_token_id=tokenizer.eos_token_id,
+        do_sample=True,
+    )
+# Decode and print the generated lyrics
+generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(generated_text)
+```
+## Generation Parameters
+You can adjust these parameters to control the output:
+- **temperature**: Controls randomness (higher = more diverse, lower = more deterministic)
+- **top_p**: Nucleus sampling threshold (sample only from the smallest set of most probable tokens whose cumulative probability reaches top_p)
+- **top_k**: Consider only the top k tokens (set to 0 to disable)
+- **max_length**: Maximum total length of the output in tokens, including the prompt
+- **num_return_sequences**: Number of different outputs to generate
+## Example Usage
+### Basic Generation
+```bash
+python generate.py --artist "Taylor Swift"
+```
+### Customized Generation
+```bash
+python generate.py --artist "Beyonce" --temperature 0.8 --top_p 0.92 --max_length 300
+```
+## Training Details
+The model was trained with the following configuration:
+- **Batch size**: 1 per device
+- **Gradient accumulation steps**: 16
+- **Learning rate**: 2e-4
+- **Training steps**: 900
+- **Sequence length**: 512
+- **Optimizer**: AdamW
+- **Scheduler**: Linear with warmup
+## Limitations
+- The model is trained on a specific set of artists, so it works best with those artists
+- May generate content that reflects biases in the training data
+- Can sometimes produce repetitive content, especially with lower temperature settings
+- Output quality varies based on the specific artist and generation parameters
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{gptneo-2.7Bloratunning,
+  author = {YOUR_NAME},
+  title = {GPT-Neo 2.7B LoRA Fine-tuned for Lyrics Generation},
+  year = {2024},
+  publisher = {HuggingFace},
+  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/gptneo-2.7Bloratunning}}
+}
+```
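
Note on the README's Training Details: the listed values imply an effective batch size and an upper bound on the data seen during training. The arithmetic below is a sketch under two assumptions that the README does not state: that training ran on a single device, and that "training steps" counts optimizer steps rather than micro-batches.

```python
# Values copied from the "Training Details" list in the README.
per_device_batch_size = 1
gradient_accumulation_steps = 16
training_steps = 900
sequence_length = 512

# Assumption (not stated in the README): a single training device.
num_devices = 1

# Sequences contributing to each optimizer step.
sequences_per_step = per_device_batch_size * gradient_accumulation_steps * num_devices

# Total sequences processed over the run, assuming 900 optimizer steps.
sequences_seen = sequences_per_step * training_steps

# Upper bound on tokens: real sequences may be shorter than 512 tokens.
max_tokens_seen = sequences_seen * sequence_length

print("effective batch size:", sequences_per_step)
print("sequences seen:", sequences_seen)
print("token upper bound:", max_tokens_seen)
```

If training used more than one device, multiply each result by the device count.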

adapter_config.json ADDED

	@@ -0,0 +1,34 @@

+{
+  "alpha_pattern": {},
+  "auto_mapping": null,
+  "base_model_name_or_path": "EleutherAI/gpt-neo-2.7B",
+  "bias": "none",
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 32,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "r": 16,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "v_proj",
+    "q_proj",
+    "out_proj",
+    "k_proj"
+  ],
+  "task_type": "CAUSAL_LM",
+  "use_dora": false,
+  "use_rslora": false
+}
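
Note on `adapter_config.json`: the rank and target modules determine how many parameters the adapter adds. The sketch below estimates that count. The hidden size (2560) and layer count (32) are assumptions taken from the published GPT-Neo 2.7B configuration rather than from this diff, and the estimate assumes each targeted projection is a square 2560 x 2560 linear layer.

```python
# From adapter_config.json in this commit.
lora_rank = 16
lora_alpha = 32
target_modules = ["v_proj", "q_proj", "out_proj", "k_proj"]

# Assumptions: values from the published GPT-Neo 2.7B base configuration.
hidden_size = 2560
num_layers = 32

# Each adapted linear layer gets two matrices:
#   A with shape (rank, in_features) and B with shape (out_features, rank).
params_per_module = lora_rank * hidden_size + hidden_size * lora_rank
lora_params = params_per_module * len(target_modules) * num_layers

# LoRA scales the low-rank update by alpha / rank.
scaling = lora_alpha / lora_rank

# Compare with the adapter_model.safetensors size recorded in this commit.
bytes_if_float32 = lora_params * 4
recorded_file_size = 41979088
remainder = recorded_file_size - bytes_if_float32

print("LoRA parameters:", lora_params)
print("scaling factor:", scaling)
print("bytes if stored as float32:", bytes_if_float32)
print("bytes left over:", remainder)
```

Under these assumptions the estimate lands within a few tens of kilobytes of the recorded file size, which is consistent with the adapter weights being stored as 32-bit floats plus file metadata. That reading is an inference from the numbers, not something the commit states.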

adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0fb605f746c8ee79ede9567114e61860fcc972cb802a23b522b9391ec4c0279c
+size 41979088

generate.py ADDED

	@@ -0,0 +1,73 @@

+#!/usr/bin/env python3
+import os
+import torch
+import argparse
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from peft import PeftModel, PeftConfig
+# Set constants
+USERNAME = "jc4121"  # Your username
+BASE_PATH = f"/vol/bitbucket/{USERNAME}"
+OUTPUT_PATH = f"{BASE_PATH}/gptneo-2.7Bloratunning/output"
+def parse_args():
+    parser = argparse.ArgumentParser(description="Generate lyrics using fine-tuned GPT-Neo 2.7B")
+    parser.add_argument("--model_path", type=str, default=f"{OUTPUT_PATH}/final_model", help="Path to the fine-tuned model")
+    parser.add_argument("--artist", type=str, default="Taylor Swift", help="Artist name for conditioning")
+    parser.add_argument("--max_length", type=int, default=512, help="Maximum length of generated text")
+    parser.add_argument("--temperature", type=float, default=0.7, help="Sampling temperature")
+    parser.add_argument("--top_p", type=float, default=0.9, help="Top-p sampling parameter")
+    parser.add_argument("--top_k", type=int, default=50, help="Top-k sampling parameter")
+    parser.add_argument("--num_return_sequences", type=int, default=1, help="Number of sequences to generate")
+    parser.add_argument("--seed", type=int, default=42, help="Random seed")
+    return parser.parse_args()
+def main():
+    args = parse_args()
+    torch.manual_seed(args.seed)
+    print(f"Loading model from {args.model_path}")
+    # Load tokenizer
+    tokenizer = AutoTokenizer.from_pretrained(args.model_path)
+    # Load model
+    model = AutoModelForCausalLM.from_pretrained(
+        args.model_path,
+        load_in_8bit=True,
+        device_map="auto",
+        torch_dtype=torch.float16,
+    )
+    # Set model to evaluation mode
+    model.eval()
+    # Prepare prompt
+    prompt = f"Artist: {args.artist}\nLyrics:"
+    print(f"Generating lyrics for artist: {args.artist}")
+    # Tokenize prompt
+    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+    # Generate text
+    with torch.no_grad():
+        outputs = model.generate(
+            **inputs,
+            max_length=args.max_length,
+            temperature=args.temperature,
+            top_p=args.top_p,
+            top_k=args.top_k,
+            num_return_sequences=args.num_return_sequences,
+            pad_token_id=tokenizer.eos_token_id,
+            do_sample=True,
+        )
+    # Decode and print generated text
+    for i, output in enumerate(outputs):
+        generated_text = tokenizer.decode(output, skip_special_tokens=True)
+        print(f"\n--- Generated Lyrics {i+1} ---\n")
+        print(generated_text)
+        print("\n" + "-" * 50)
+if __name__ == "__main__":
+    main()
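
Note on `generate.py`: the script imports `PeftModel` and `PeftConfig` but never uses them, and instead passes the model path directly to `AutoModelForCausalLM.from_pretrained`. When that path holds only adapter files, as this commit does, loading appears to rely on the PEFT integration in recent `transformers` releases. A more explicit alternative is sketched below. The helper names `build_prompt` and `load_with_adapter` are hypothetical, and the loader is untested against this repository.

```python
def build_prompt(artist: str) -> str:
    """Reproduce the prompt format used by generate.py."""
    return f"Artist: {artist}\nLyrics:"

def load_with_adapter(adapter_path: str, base_id: str = "EleutherAI/gpt-neo-2.7B"):
    """Load the base model, then attach the LoRA adapter on top of it.

    Imports are local so this module can be imported without the heavy
    dependencies installed. Calling the function downloads the base model.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    # The tokenizer files ship alongside the adapter in this commit.
    tokenizer = AutoTokenizer.from_pretrained(adapter_path)

    # Default precision; on a GPU, torch_dtype=torch.float16 would halve memory use.
    base_model = AutoModelForCausalLM.from_pretrained(base_id)

    # Wrap the base model with the adapter weights stored at adapter_path.
    model = PeftModel.from_pretrained(base_model, adapter_path)
    model.eval()
    return tokenizer, model

print(repr(build_prompt("Taylor Swift")))
```

Loading the base model and the adapter as separate steps makes the dependency on `EleutherAI/gpt-neo-2.7B` visible in the code instead of leaving it implicit in `adapter_config.json`.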

merges.txt ADDED

The diff for this file is too large to render. See raw diff

requirements.txt ADDED

	@@ -0,0 +1,10 @@

+transformers>=4.30.0
+datasets>=2.12.0
+peft>=0.4.0
+accelerate>=0.20.0
+bitsandbytes>=0.40.0
+torch>=2.0.0
+wandb>=0.15.0
+tqdm>=4.65.0
+sentencepiece>=0.1.99
+tensorboard>=2.13.0

run_generate.sh ADDED

	@@ -0,0 +1,34 @@

+#!/bin/bash
+# Set environment variables
+export CUDA_VISIBLE_DEVICES=0
+export TRANSFORMERS_CACHE="/vol/bitbucket/jc4121/gptneo-2.7Bloratunning/cache"
+export HF_DATASETS_CACHE="/vol/bitbucket/jc4121/gptneo-2.7Bloratunning/data"
+export HF_HOME="/vol/bitbucket/jc4121/gptneo-2.7Bloratunning/hf_home"
+# Create directories
+mkdir -p "$TRANSFORMERS_CACHE"
+mkdir -p "$HF_DATASETS_CACHE"
+mkdir -p "$HF_HOME"
+mkdir -p /vol/bitbucket/jc4121/gptneo-2.7Bloratunning/lib
+# Add custom lib path to PYTHONPATH
+export PYTHONPATH=/vol/bitbucket/jc4121/gptneo-2.7Bloratunning/lib:$PYTHONPATH
+# Default artist
+ARTIST="Taylor Swift"
+# Check if an artist name was provided
+if [ $# -ge 1 ]; then
+    ARTIST="$1"
+fi
+# Run generation
+echo "Generating lyrics for artist: $ARTIST"
+python generate.py \
+    --artist "$ARTIST" \
+    --max_length 512 \
+    --temperature 0.7 \
+    --top_p 0.9 \
+    --top_k 50 \
+    --num_return_sequences 1

special_tokens_map.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "bos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<|endoftext|>",
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "50256": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<|endoftext|>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|endoftext|>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "model_max_length": 2048,
+  "pad_token": "<|endoftext|>",
+  "padding_side": "right",
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>"
+}

vocab.json ADDED

The diff for this file is too large to render. See raw diff