Sarath Shekkizhar
sarath-shekkizhar
AI & ML interests
None yet
Recent Activity
posted an update about 1 hour ago
Some interesting architectural choices made in Llama 4 models -- were these key to the 10M context? Possibly 🤔
Takeaways:
🧩 Interleaved Attention without position encoding
- Llama 4 removes explicit positional encoding in some attention layers, which helps performance on longer contexts.
- The principle here may be similar to residual connections: those layers can attend to early tokens directly, without positional decay (a rough sketch follows below).
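A rough sketch of what interleaved attention without position encoding could look like in a simple single-head setting: RoPE is applied in most layers, while every fourth layer skips positional encoding entirely. The interval, dimensions, and helper names here are illustrative assumptions, not Llama 4's actual configuration.

```python
import numpy as np

def rotary_embed(x, positions, base=10000.0):
    """Apply rotary position embedding (RoPE) to x of shape (seq, dim)."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))      # (half,)
    angles = positions[:, None] * freqs[None, :]          # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def attention_layer(q, k, v, positions, use_rope):
    """Single-head attention; RoPE is applied only when use_rope is True (NoPE otherwise)."""
    if use_rope:
        q, k = rotary_embed(q, positions), rotary_embed(k, positions)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Hypothetical interleaving: every 4th layer drops positional encoding entirely,
# so attention in those layers is position-agnostic and can reach early tokens
# without any positional decay.
nope_interval = 4
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))            # (seq_len, head_dim)
positions = np.arange(8)
for layer_idx in range(8):
    use_rope = (layer_idx + 1) % nope_interval != 0
    x = attention_layer(x, x, x, positions, use_rope)
```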
⚖️ Scaled softmax to sharpen attention at inference time
- The maximum attention value (the largest softmax output) shrinks as the context size grows, flattening attention over long sequences.
- Llama 4 incorporates a context-size-dependent temperature in the softmax to modify its slope, letting the model focus on relevant tokens even at long contexts.
- This is applied only at inference time -- my guess is it was a choice made after observing behavior on eval datasets. A hedged sketch follows below.
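A minimal sketch of the inference-time idea as I read it: scale the attention logits by a factor that grows with context length before taking the softmax, so the top attention weight does not wash out at very long contexts. The `length_scaled_softmax` name and the `base_len` and `beta` values are illustrative assumptions, not Llama 4's published parameterization.

```python
import numpy as np

def length_scaled_softmax(logits, context_len, base_len=8192, beta=0.2):
    """Softmax with a context-length-dependent scale on the logits (inference-time only).
    base_len and beta are illustrative hyperparameters, not Llama 4's actual values."""
    # scale > 1 for contexts longer than base_len, sharpening the distribution so the
    # top attention weight does not wash out as more tokens compete in the softmax.
    scale = 1.0 + beta * np.log(max(context_len / base_len, 1.0))
    z = logits * scale
    z = z - z.max(axis=-1, keepdims=True)
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

# Toy demo: one "relevant" token with logit 4.0 among otherwise-uniform tokens.
# The plain softmax weight on that token collapses as context grows; the scaled
# version keeps it noticeably higher.
for n in (8_192, 131_072, 1_048_576):
    logits = np.zeros(n)
    logits[0] = 4.0
    plain = np.exp(logits - logits.max())
    plain /= plain.sum()
    scaled = length_scaled_softmax(logits, context_len=n)
    print(f"ctx={n:>9}  plain max={plain[0]:.5f}  scaled max={scaled[0]:.5f}")
```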
What did you think of these choices?
updated a Space 9 months ago: tenyx/TenyxChat-7B-v1
updated a Space 9 months ago: tenyx/TenyxChat-8x7B-v1
Organizations
sarath-shekkizhar's activity
Failed evaluation for 70B model · 1 · #804 opened 10 months ago by sarath-shekkizhar
Dataset loading failing with HF load_dataset · 1 · #3 opened 10 months ago by sarath-shekkizhar
great evals · 1 · #2 opened 11 months ago by gblazex
Script to reproduce MT-Bench · 2 · #1 opened 11 months ago by MaziyarPanahi
Resubmitting failed 70B model (tenyx/Llama3-TenyxChat-70B) · 1 · #728 opened 11 months ago by sarath-shekkizhar
Evaluation for 70B model FAILED (tenyx/Llama3-TenyxChat-70B) · 5 · #719 opened 12 months ago by sarath-shekkizhar
Update README.md · 1 · #1 opened about 1 year ago by mostafagv