PsiPi committed
Commit eaecbf6
Parent: 1c38862

Update README.md

Files changed (1):
  1. README.md +4 -4
README.md CHANGED
@@ -60,7 +60,7 @@ model = model.to(device)
 
 # Set up text and timing conditioning
 conditioning = [{
-    "prompt": "128 BPM tech house drum loop",
+    "prompt": "specialKay vocal, A cappella, 120 BPM twobob house vocalisations feat. Special Kay",
     "seconds_start": 0,
     "seconds_total": 30
 }]
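For context, the conditioning block above plugs into the stable-audio-tools generation example earlier in the README (the `model = model.to(device)` context line in the hunk header belongs to it). Below is a minimal sketch of how the new prompt would be used end to end, assuming the README follows the standard Stable Audio Open example; the repo id passed to `get_pretrained_model` and the sampler settings are assumptions, not taken from this diff:

```python
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the model (repo id is an assumption; substitute the id this README uses)
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
model = model.to(device)

# Set up text and timing conditioning (the prompt introduced by this commit)
conditioning = [{
    "prompt": "specialKay vocal, A cappella, 120 BPM twobob house vocalisations feat. Special Kay",
    "seconds_start": 0,
    "seconds_total": 30
}]

# Generate stereo audio (sampler settings follow the upstream example, not this diff)
output = generate_diffusion_cond(
    model,
    steps=100,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=sample_size,
    sigma_min=0.3,
    sigma_max=500,
    sampler_type="dpmpp-3m-sde",
    device=device
)

# Flatten the batch to one channels-first sequence, peak-normalize to int16, and save
output = rearrange(output, "b d n -> d (b n)")
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
```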
@@ -99,7 +99,7 @@ pipe = StableAudioPipeline.from_pretrained("PsiPi/audio", torch_dtype=torch.floa
 pipe = pipe.to("cuda")
 
 # define the prompts
-prompt = "The sound of a hammer hitting a wooden surface."
+prompt = "The sound of specialKay vocalising, A cappella"
 negative_prompt = "Low quality."
 
 # set the seed for generator
@@ -116,7 +116,7 @@ audio = pipe(
 ).audios
 
 output = audio[0].T.float().cpu().numpy()
-sf.write("hammer.wav", output, pipe.vae.sampling_rate)
+sf.write("specialk.wav", output, pipe.vae.sampling_rate)
 
 ```
 Refer to the [documentation](https://huggingface.co/docs/diffusers/main/en/index) for more details on optimization and usage.
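Likewise, the two diffusers hunks above belong to one `StableAudioPipeline` snippet. A hedged sketch of the complete snippet after this commit, assuming the unchanged README lines match the stock diffusers example; the seed and the call arguments such as `num_inference_steps` and `audio_end_in_s` are assumptions:

```python
import torch
import soundfile as sf
from diffusers import StableAudioPipeline

# Load the pipeline in half precision (repo id taken from the hunk header above)
pipe = StableAudioPipeline.from_pretrained("PsiPi/audio", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# define the prompts (the prompt introduced by this commit)
prompt = "The sound of specialKay vocalising, A cappella"
negative_prompt = "Low quality."

# set the seed for generator (seed value is an assumption)
generator = torch.Generator("cuda").manual_seed(0)

# run the generation (argument values follow the standard diffusers example, not this diff)
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    audio_end_in_s=10.0,
    num_waveforms_per_prompt=3,
    generator=generator,
).audios

# soundfile expects (frames, channels), hence the transpose before writing
output = audio[0].T.float().cpu().numpy()
sf.write("specialk.wav", output, pipe.vae.sampling_rate)
```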
@@ -134,7 +134,7 @@ Refer to the [documentation](https://huggingface.co/docs/diffusers/main/en/index
 ## Training dataset
 
 ### Datasets Used
-Our dataset consists of 486492 audio recordings, where 472618 are from Freesound and 13874 are from the Free Music Archive (FMA). All audio files are licensed under CC0, CC BY, or CC Sampling+. This data is used to train our autoencoder and DiT. We use a publicly available pre-trained T5 model ([t5-base](https://huggingface.co/google-t5/t5-base)) for text conditioning.
+Our dataset consists of 486492 audio recordings, where 472618 are from Freesound and 13874 are from the Free Music Archive (FMA). All audio files are licensed under CC0, CC BY, or CC Sampling+. This data is used to train our autoencoder and DiT. We use a publicly available pre-trained T5 model ([t5-base](https://huggingface.co/google-t5/t5-base)) for text conditioning. We then further fine-tuned on [this Kaggle dataset](https://www.kaggle.com/datasets/twobob/moar-bobtex-n-friends-gpu-fodder).
 
 ### Attribution
 Attribution for all audio recordings used to train Stable Audio Open 1.0 can be found in this repository.
 