guowenxiang committed on
Commit
f9309e7
·
verified ·
1 Parent(s): a19a05e

Create README.md

Files changed (1)
  1. README.md +107 -0
README.md ADDED
# Make-An-Audio 3: Transforming Text into Audio via Flow-based Large Diffusion Transformers

PyTorch implementation of [Lumina-t2x](https://arxiv.org/abs/2405.05945).

We will provide our implementation and pretrained models as open source in this repository soon.

[![arXiv](https://img.shields.io/badge/arXiv-Paper-b31b1b.svg)](https://arxiv.org/abs/2305.18474)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-blue)](https://huggingface.co/spaces/AIGC-Audio/Lumina-Audio)
[![GitHub Stars](https://img.shields.io/github/stars/Text-to-Audio/Make-An-Audio-3?style=social)](https://github.com/Text-to-Audio/Make-An-Audio-3)

## Use pretrained models
We provide our implementation and pretrained models as open source in this repository.

Visit our [demo page](https://make-an-audio-2.github.io/) for audio samples.

## Quick Start
### Pretrained Models
Simply download the weights from [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-blue)](https://huggingface.co/Alpha-VLLM/Lumina-T2Music); a download sketch follows the component list below.
- Text Encoder: [FLAN-T5-Large](https://huggingface.co/google/flan-t5-large)
- VAE: Make-An-Audio 2, fine-tuned from [Make-An-Audio](https://github.com/Text-to-Audio/Make-An-Audio)
- Decoder: [Vocoder](https://github.com/NVIDIA/BigVGAN)
- `Music` checkpoints: [huggingface](https://huggingface.co/Alpha-VLLM/Lumina-T2Music), `Audio` checkpoints: [huggingface]()

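For example, the checkpoints can be fetched with `huggingface_hub` (a minimal sketch, not the only way; the target directory `useful_ckpts/` is an assumption chosen to match the commands below):

```python
# Sketch: download the released checkpoints (assumes `pip install huggingface_hub`).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Alpha-VLLM/Lumina-T2Music",  # the `Music` checkpoint repo listed above
    local_dir="useful_ckpts",             # assumed local layout
)
```
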
### Generate audio/music from text
```
python3 scripts/txt2audio_for_2cap_flow.py \
  --outdir output_dir -r checkpoints_last.ckpt -b configs/txt2audio-cfm1-cfg-LargeDiT3.yaml --scale 3.0 \
  --vocoder-ckpt useful_ckpts/bigvnat --test-dataset audiocaps
```

### Generate audio/music from the audiocaps or musiccaps test dataset
- Remember to change `config["test_dataset"]` accordingly.
```
python3 scripts/txt2audio_for_2cap_flow.py \
  --outdir output_dir -r checkpoints_last.ckpt -b configs/txt2audio-cfm1-cfg-LargeDiT3.yaml --scale 3.0 \
  --vocoder-ckpt useful_ckpts/bigvnat --test-dataset testset
```

### Generate audio/music from video
```
python3 scripts/video2audio_flow.py \
  --outdir output_dir -r checkpoints_last.ckpt -b configs/txt2audio-cfm1-cfg-LargeDiT3.yaml --scale 3.0 \
  --vocoder-ckpt useful_ckpts/bigvnat --test-dataset vggsound
```

## Train
### Data preparation
- We cannot provide dataset download links due to copyright issues. We provide the processing code to generate mel-spectrograms, count audio durations, and generate structured captions.
- Before training, the dataset information must be collected in a tsv file with the following columns: name (an id for each audio clip), dataset (the dataset the clip belongs to), audio_path (the path to the .wav file), caption (the caption of the audio), mel_path (the path to the processed mel-spectrogram file of the clip), and duration (the duration of the audio); a construction sketch follows this list.
- We provide a tsv file of the audiocaps test set as a sample: ./audiocaps_test_16000_struct.tsv.

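A hypothetical sketch of constructing such a tsv with pandas; the id, paths, and caption below are placeholders, and `mel_path`/`duration` are filled in by the preprocessing steps described next:

```python
import pandas as pd

rows = [{
    "name": "audiocaps_000001",                   # unique id for the clip (placeholder)
    "dataset": "audiocaps",                       # dataset the clip belongs to
    "audio_path": "data/audiocaps/000001.wav",    # path to the .wav file (placeholder)
    "caption": "A dog barks while birds chirp.",  # natural-language caption (placeholder)
    "mel_path": "",                               # filled in after mel-spectrogram extraction
    "duration": 0.0,                              # filled in by preprocess/add_duration.py
}]
pd.DataFrame(rows).to_csv("tmp.tsv", sep="\t", index=False)
```
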
### Generate the mel-spectrogram file of each audio clip
Assume you already have a tsv file linking each caption to its audio_path, i.e. the tsv file has "name", "audio_path", "dataset" and "caption" columns.
To compute the mel-spectrograms, run the following command, which will save the mels in ./processed:
```
python preprocess/mel_spec.py --tsv_path tmp.tsv --num_gpus 1 --max_duration 10
```

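For illustration only, this is roughly what per-clip mel extraction looks like; the actual parameters (sample rate, hop size, number of mel bins) and output format are defined by preprocess/mel_spec.py, so the values below are assumptions:

```python
import os

import librosa
import numpy as np

os.makedirs("processed/mel", exist_ok=True)
wav, sr = librosa.load("data/audiocaps/000001.wav", sr=16000)      # assumed 16 kHz target rate
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80)      # n_mels=80 is an assumption
np.save("processed/mel/audiocaps_000001.npy", np.log(mel + 1e-5))  # assumed log-mel .npy output
```
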
### Count audio duration
To count the duration of each audio clip and save the duration information back into the tsv file, run the following command:
```
python preprocess/add_duration.py --tsv_path tmp.tsv
```

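Conceptually, this step amounts to the following (a sketch assuming the `soundfile` package; the repo's script may compute durations differently):

```python
import pandas as pd
import soundfile as sf

df = pd.read_csv("tmp.tsv", sep="\t")
df["duration"] = [sf.info(p).duration for p in df["audio_path"]]  # duration in seconds
df.to_csv("tmp.tsv", sep="\t", index=False)
```
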
### Generate structured captions from the original natural-language captions
First, get an API key from [OpenAI](https://openai.com/blog/openai-api) (here is a [tutorial](https://www.maisieai.com/help/how-to-get-an-openai-api-key-for-chatgpt)). Then set the variable openai_key in preprocess/n2s_by_openai.py to your key. Run the following command to add structured captions; the tsv file with structured captions will be saved as {tsv_file_name}_struct.tsv:
```
python preprocess/n2s_by_openai.py --tsv_path tmp.tsv
```

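The actual prompt and model are defined in preprocess/n2s_by_openai.py; the following is only a hypothetical sketch of the natural-to-structured rewriting call, assuming the `openai` Python client (v1+):

```python
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # replace with your own key
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # model choice is an assumption
    messages=[
        {"role": "system", "content": "Rewrite the audio caption as a structured caption."},
        {"role": "user", "content": "A dog barks while birds chirp."},
    ],
)
print(resp.choices[0].message.content)
```
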
### Place the tsv files
After generating the structured captions, put the tsv files with structured captions into ./data/main_spec_dir, and put the tsv files without structured captions into ./data/no_struct_dir.

Modify data.params.main_spec_dir and data.params.other_spec_dir_path accordingly in the config file configs/text2audio-ConcatDiT-ae1dnat_Skl20d2_struct2MLPanylen.yaml.

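A hypothetical sketch of setting those two keys with OmegaConf (the config machinery typically used by latent-diffusion-style main.py scripts); the key names are taken from the paragraph above and should be checked against the actual yaml file:

```python
from omegaconf import OmegaConf

cfg_path = "configs/text2audio-ConcatDiT-ae1dnat_Skl20d2_struct2MLPanylen.yaml"
cfg = OmegaConf.load(cfg_path)
cfg.data.params.main_spec_dir = "./data/main_spec_dir"        # tsv files with structured captions
cfg.data.params.other_spec_dir_path = "./data/no_struct_dir"  # tsv files without structured captions
OmegaConf.save(cfg, cfg_path)  # or simply edit the yaml by hand
```
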
## Train the variational autoencoder
Assume we have processed several datasets and saved the .tsv files in tsv_dir/*.tsv. Replace data.params.spec_dir_path with tsv_dir in the config file. Then we can train the VAE with the following command. If you don't have 8 GPUs on your machine, replace the --gpus list with the GPU ids you have, e.g. --gpus 0,1:
```
python main.py --base configs/research/autoencoder/autoencoder1d_kl20_natbig_r1_down2_disc2.yaml -t --gpus 0,1,2,3,4,5,6,7
```

## Train the latent diffusion model
After training the VAE, replace model.params.first_stage_config.params.ckpt_path with your trained VAE checkpoint path in the config file.
Run the following command to train the diffusion model:
```
python main.py --base configs/research/text2audio/text2audio-ConcatDiT-ae1dnat_Skl20d2_freezeFlananylen_drop.yaml -t --gpus 0,1,2,3,4,5,6,7
```

## Evaluation
Please refer to [Make-An-Audio](https://github.com/Text-to-Audio/Make-An-Audio?tab=readme-ov-file#evaluation).

## Acknowledgements
This implementation uses parts of the code from the following GitHub repos:
[Make-An-Audio](https://github.com/Text-to-Audio/Make-An-Audio),
[AudioLCM](https://github.com/Text-to-Audio/AudioLCM),
[CLAP](https://github.com/LAION-AI/CLAP),
as described in our code.

## Citations
If you find this code useful in your research, please consider citing:
```bibtex
```

## Disclaimer
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his or her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.