
Ruurd Kuiper PRO

Ruurd

AI & ML interests

None yet

Recent Activity

reacted to their post with 🔥 1 day ago
reacted to their post with 👀 1 day ago
reacted to their post with ❤️ 1 day ago

Organizations

None yet

Ruurd's activity

reacted to their post with 🔥👀❤️ 1 day ago
For the past year I have been trying to get diffusion models to work for language generation without having to retrain an LLM from scratch. And recently, we finally succeeded:

We introduce "LAD: LoRA-Adapted Denoiser", a method to convert a LLaMA model into a text diffusion model using LoRA finetuning and structured input corruption.

🎯 Try the demo and read the write-up here!
https://ruurdkuiper.github.io/tini-lad/

Unlike autoregressive (word-for-word) models like ChatGPT, diffusion models iteratively refine a noised sequence. However, most current diffusion approaches rely on all-parameter retraining and repeatedly remasking tokens, which is costly and slow during both training and inference!
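
In rough terms, the two decoding styles contrast like the minimal PyTorch sketch below. This is an illustration only, not LAD's actual code: `model` is a placeholder callable returning per-position logits, and the greedy argmax updates and `noise_id` filler are assumptions.

```python
# Illustration only, not LAD's code. `model` is a placeholder callable that maps
# token ids of shape (batch, length) to logits of shape (batch, length, vocab).
import torch

def autoregressive_decode(model, prompt_ids, max_new_tokens=64):
    """Generate one token at a time; each step conditions only on earlier tokens."""
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):
        logits = model(ids)                      # causal attention inside the model
        next_id = logits[:, -1].argmax(dim=-1)   # greedy pick for the next position
        ids = torch.cat([ids, next_id[:, None]], dim=-1)
    return ids

def diffusion_style_decode(model, prompt_ids, gen_len=64, num_steps=8, noise_id=0):
    """Start from a noised continuation and re-predict every position each step."""
    batch = prompt_ids.size(0)
    noised = torch.full((batch, gen_len), noise_id, dtype=torch.long)
    ids = torch.cat([prompt_ids, noised], dim=-1)
    for _ in range(num_steps):                   # num_steps is a free test-time knob
        logits = model(ids)                      # every token can attend to every other
        refined = logits.argmax(dim=-1)          # refine the whole sequence at once
        ids = torch.cat([prompt_ids, refined[:, prompt_ids.size(1):]], dim=-1)
    return ids
```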

🧠 With LAD:
- We can finetune an autoregressive model for diffusive generation in just 10 hours on a single GPU (a minimal LoRA setup sketch follows this list).
- Test-time compute is fully adjustable: fewer steps give faster outputs, while more steps improve output quality.
- Due to our unique noising schedule, remasking is not always needed during inference. All tokens are attended to in each iteration!
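
As a rough picture of the finetuning setup, the sketch below attaches LoRA adapters to a frozen LLaMA backbone with the Hugging Face peft library. The checkpoint id, adapter rank, and target modules are illustrative assumptions, not LAD's published configuration.

```python
# Minimal sketch: attach LoRA adapters to a frozen LLaMA backbone with peft.
# The checkpoint id, rank and target modules below are assumptions for illustration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # assumed checkpoint
for p in base.parameters():
    p.requires_grad = False          # keep the 8B backbone frozen

lora_cfg = LoraConfig(
    r=16,                            # adapter rank (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # only the LoRA adapter weights are trainable
```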

🔍 LAD is built using:
– A frozen LLaMA-8B backbone
– Structured noising: token swaps, duplications, replacements, span shifts (sketched after this list)
– Modified attention masks for bidirectional decoding
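
To make the structured noising concrete, here is an illustrative sketch of the four corruption operations applied to a list of token ids. The operation rates and the order in which they are applied are placeholders; the exact recipe and schedule are described in the write-up.

```python
# Illustrative sketch of structured input corruption on a list of token ids.
# Rates, ordering and span sizes are placeholders, not LAD's exact recipe.
import random

def swap_tokens(ids, rate=0.1):
    """Swap adjacent token pairs at random positions."""
    ids = ids[:]
    for i in range(len(ids) - 1):
        if random.random() < rate:
            ids[i], ids[i + 1] = ids[i + 1], ids[i]
    return ids

def duplicate_tokens(ids, rate=0.05):
    """Duplicate random tokens in place."""
    out = []
    for t in ids:
        out.append(t)
        if random.random() < rate:
            out.append(t)
    return out

def replace_tokens(ids, vocab_size, rate=0.1):
    """Replace random tokens with random vocabulary ids."""
    return [random.randrange(vocab_size) if random.random() < rate else t for t in ids]

def shift_span(ids, max_shift=3):
    """Move a random span a few positions left or right."""
    if len(ids) < 4:
        return ids[:]
    start = random.randrange(len(ids) - 2)
    end = random.randrange(start + 1, len(ids))
    span, rest = ids[start:end], ids[:start] + ids[end:]
    insert_at = max(0, min(len(rest), start + random.randint(-max_shift, max_shift)))
    return rest[:insert_at] + span + rest[insert_at:]

def corrupt(ids, vocab_size):
    """Apply one pass of the structured-noising pipeline."""
    ids = swap_tokens(ids)
    ids = duplicate_tokens(ids)
    ids = replace_tokens(ids, vocab_size)
    return shift_span(ids)
```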

💡 We show that even small, fast-trained models can perform diffusive generation, with competitive benchmark performance and perplexity, and more flexible test-time behavior than traditional transformers.

  • 2 replies
replied to their post 1 day ago

Thanks! I'm trying to bring it to people's attention, as I think the leap from pretraining diffusion models (10,000+ hours) to merely finetuning for adaptation (10 hours) is a big one, and it could really help this method gain some traction.

upvoted an article 2 days ago
published an article 3 days ago
posted an update 3 days ago