

Attention mechanisms allow models to dynamically focus on specific parts of their input when performing tasks. In our recent article, we discussed Multi-Head Latent Attention (MLA) in detail and now it's time to summarize other existing types of attention.
Here is a list of 15 types of attention mechanisms used in AI models:
1. Soft attention (Deterministic attention) -> Neural Machine Translation by Jointly Learning to Align and Translate (1409.0473)
Assigns a continuous weight distribution over all parts of the input. It produces a weighted sum of the input using attention weights that sum to 1.
2. Hard attention (Stochastic attention) -> Effective Approaches to Attention-based Neural Machine Translation (1508.04025)
Makes a discrete selection of some part of the input to focus on at each step, rather than attending to everything.
3. Self-attention -> Attention Is All You Need (1706.03762)
Each element in the sequence "looks" at other elements and "decides" how much to borrow from each of them for its new representation.
4. Cross-Attention (Encoder-Decoder attention) -> Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation (2104.08771)
The queries come from one sequence and the keys/values come from another sequence. It allows a model to combine information from two different sources.
5. Multi-Head Attention (MHA) -> Attention Is All You Need (1706.03762)
Multiple attention "heads" are run in parallel. The model computes several attention distributions (heads), each with its own set of learned projections of queries, keys, and values.
6. Multi-Head Latent Attention (MLA) -> DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (2405.04434)
Extends MHA by incorporating a latent space where attention heads can dynamically learn different latent factors or representations.
7. Memory-Based attention -> End-To-End Memory Networks (1503.08895)
Involves an external memory and uses attention to read from and write to this memory.
See other types in the comments.
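As a refresher on the core computation most of these variants build on, here is a minimal PyTorch sketch of scaled dot-product self-attention (item 3); multi-head attention (item 5) just runs several such projections in parallel and concatenates the results. The shapes and random weights below are illustrative placeholders, not taken from any particular model.

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (batch, seq_len, d_model); w_*: (d_model, d_head) learned projections
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_head = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5   # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                # soft attention: each row sums to 1
    return weights @ v                                 # weighted sum of the values

batch, seq_len, d_model, d_head = 2, 5, 16, 8
x = torch.randn(batch, seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([2, 5, 8])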

Whenever I introduce myself, people often start speaking French to me, even though my French is très basic. It turns out that AI systems do something similar:
Large language models infer cultural identity from names, shaping their responses based on presumed backgrounds. But is this helpful personalization or a reinforcement of stereotypes?
In our latest paper, we explored this question by testing DeepSeek, Llama, Aya, Mistral-Nemo, and GPT-4o-mini on how they associate names with cultural identities. We analysed 900 names from 30 cultures and found strong assumptions baked into AI responses: some cultures were overrepresented, while others barely registered.
For example, a name like "Jun" often triggered Japan-related responses, while "Carlos" was linked primarily to Mexico, even though these names exist in multiple countries. Meanwhile, names from places like Ireland led to more generic answers, suggesting weaker associations in the training data.
This has real implications for AI fairness: How should AI systems personalize without stereotyping? Should they adapt at all based on a name?
Work with some of my favourite researchers: @sidicity, Arnav Arora, and @IAugenstein
Read the full paper here: Presumed Cultural Identity: How Names Shape LLM Responses (2502.11995)

Just getting started, of course, but early users seem to like it, and we're always happy to partner with cool startups in the ecosystem.
Have you been using any integration and how can we make it better?
https://huggingface.co/blog/inference-providers
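In case it helps anyone trying it out, here's a rough sketch of calling a provider through huggingface_hub, based on my reading of the blog post above; the provider name, model id, and token are placeholders, and exact parameters may have changed since.

from huggingface_hub import InferenceClient

# Placeholder provider and model; the request is routed through Hugging Face to the chosen provider.
client = InferenceClient(provider="together", api_key="hf_xxx")
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "What are inference providers?"}],
    max_tokens=256,
)
print(completion.choices[0].message.content)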

Try it here: https://huggingface.co/spaces/benjamin-paine/zonos-longform
Getting started with Zonos in Taproot is easy; with a working CUDA toolkit and Python/Pip installation, all you have to do is:
apt install espeak-ng   # Zonos relies on espeak-ng for phonemization
pip install taproot   # install the Taproot CLI
taproot install speech-synthesis:zonos-transformer   # fetch the Zonos task and its dependencies
taproot invoke speech-synthesis:zonos-transformer --text "Hello, world!"   # synthesize speech
See more on GitHub at https://github.com/painebenjamin/taproot/
Yup! That stays one chunk.
chunker.push("Last week she said, βHi there. How are you?β");
chunker.flush()
Emitting "Last week she said, βHi there. How are you?β"
The only exception is with newlines - I wanted it to emit when a newline was encountered.
chunker.push("Last week she said,\nβHi there. How are you?β");
chunker.flush()
Emitting "Last week she said,"
Emitting "βHi there. How are you?β"
If you want to disable this behavior, pass {emitParagraphs: false} to the constructor, i.e.:
const chunker = new SentenceChunker({emitParagraphs: false});
There's also chunkLength to set the maximum character length (128 by default), and emitTrimmed to control whether each emitted chunk has leading/trailing whitespace trimmed (default true). One last thing: if your input is always growing - like if you're streaming one response and just concatenating it as one big string - you can use GrowingSentenceChunker instead (in the same file). Example:
const chunker = new GrowingSentenceChunker();
chunker.onChunk((chunk) => { console.log(`Emitting "${chunk}"`); });
chunker.push("Last week");
chunker.push("Last week she said");
chunker.push("Last week she said, βHi there. How are you?β");
chunker.flush()
Emitting "Last week she said, βHi there. How are you?β"
And just in case it's not obvious, the .flush() call will just emit anything left in the buffer, even if it's shorter than the maximum length. If you don't call .flush(), it will wait for another input that pushes it over the chunk limit before emitting again.
I spent a bit of time working on a JavaScript sentence splitter - it might work right out of the box for this purpose! It tries to split on punctuation when possible for smooth flow, but has a max length option to ensure run-on sentences still get split, too. It also maintains a buffer so you can just keep pushing streaming text into it and it will emit when it has a full chunk.
https://raw.githubusercontent.com/painebenjamin/anachrovox/refs/heads/main/www/sentence.js
Example:
const chunker = new SentenceChunker();
chunker.onChunk((sentenceChunk) => { console.log(`Emitting "${sentenceChunk}"`); });
chunker.push("The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.");
chunker.flush()
Output:
Emitting "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration."
Emitting "The best performing models also connect the encoder and decoder through an attention mechanism."
Emitting "We propose a new simple network architecture, the Transformer, based solely on attention mechanisms,"
Emitting "dispensing with recurrence and convolutions entirely."
Emitting "Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train."

Generate 10 seconds of speech in ~1 second for $0.
What will you build?
webml-community/kokoro-webgpu
The most difficult part was getting the model running in the first place, but the next steps are simple:
- Implement sentence splitting, allowing for streamed responses
- Multilingual support (only phonemization left)
Who wants to help?

pip install kokoro, and still 82M parameters.
GitHub: https://github.com/hexgrad/kokoro
PyPI: https://pypi.org/project/kokoro/
Space: hexgrad/Kokoro-TTS
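For anyone trying the pip package, here's a rough usage sketch based on my reading of the project's README; the KPipeline entry point, the lang_code value, and the 'af_heart' voice name are assumptions that may differ between package versions.

from kokoro import KPipeline   # assumed entry point per the project's README
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' is assumed to select American English
text = "Kokoro is an 82 million parameter text-to-speech model."
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice='af_heart')):
    sf.write(f'kokoro_{i}.wav', audio, 24000)  # Kokoro outputs 24 kHz audio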


We cover the supported models, the optimization knobs our users can turn, fine-tuning, and more.
5-6 GB for HunyuanVideo - the sky is the limit.
https://huggingface.co/blog/video_gen
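For a sense of what those optimizations look like in practice, here's a hedged sketch using diffusers' HunyuanVideoPipeline; the repo id and settings are assumptions on my part, and the lowest-memory figures in the post also rely on quantization, which this sketch omits.

import torch
from diffusers import HunyuanVideoPipeline

# Assumed diffusers-format checkpoint; see the blog post for the exact configurations measured.
pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", torch_dtype=torch.bfloat16
)
pipe.vae.enable_tiling()          # decode video latents in tiles to reduce peak VRAM
pipe.enable_model_cpu_offload()   # keep only the active component on the GPU

video = pipe(prompt="A cat walks on the grass", num_frames=61, num_inference_steps=30).frames[0]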
Thanks for doing this! I've been all-in on llama.cpp for a while now, but I would be lying if I said I didn't wonder if I was missing out on anything with other engines.

You can follow my account on Bluesky for updates on Shining Valiant 3, other Valiant Labs models, my open-source datasets, etc: https://bsky.app/profile/sequelbox.bsky.social
back to building :)

Multimodal
- MiniCPM-o 2.6 is a new SOTA any-to-any model by OpenBMB (vision, speech and text!)
- VideoChat-Flash-Qwen2.5-2B is a new video multimodal model by OpenGVLab; the family comes in 2B & 7B sizes and 224 & 448 resolutions
- ByteDance released a larger SA2VA variant with 26B parameters
- Dataset: VRC-Bench is a new diverse benchmark for multimodal LLM reasoning performance
LLMs
- MiniMax-Text-01 is a new huge language model (456B total, 45.9B active params) by MiniMaxAI with a context length of 4M tokens
- Dataset: Sky-T1-data-17k is a diverse dataset used to train Sky-T1-32B
- kyutai released Helium-1-Preview-2B, a new small multilingual LM
- Wayfarer-12B is a new LLM able to write D&D adventures
- ReaderLM-v2 is a new HTML parsing model by Jina AI
- Dria released Dria-Agent-a-3B, a new agentic coding model (Pythonic function calling) based on Qwen2.5 Coder
- Unsloth released faster, more memory-efficient versions of Phi-4 and Llama 3.3
Vision
- MatchAnything is a new foundation model for matching
- FitDiT is a high-fidelity VTON (virtual try-on) model based on the DiT architecture
Audio
- OuteTTS-0.3-1B is a new multilingual text-to-speech model with voice cloning and emotion control capabilities
Retrieval
- lightblue released LB-reranker-0.5B-v1.0, a new reranker based on Qwen2.5 that can handle 95+ languages
- cde-small-v2 is a new SOTA small retrieval model by @jxm

Hello again @JLouisBiz !
I've updated the spaces; they now use Kokoro instead of XTTS. It's drastically faster. Additionally, since the TTS is so much faster, I felt comfortable extending the output to 1024 tokens.

Hello! It's currently clipped at 512 tokens for output, so yes it won't be suitable for very long generation. It's also a very tiny model - Llama 3.2 3B - so definitely more for conversation and less for completing tasks.
I'm going to try to swap in Kokoro TTS, which should be faster on these small machines. Thanks for taking the time to test.

I'm sorry that it's not working for you - can you make sure you've given it permission to use your microphone, and that you're using the correct one (if you have multiple)? There should be an icon in the corner (in Chrome) that you can click to select microphones and check levels. Whenever I've had trouble activating it, I've found I was using the wrong microphone or my voice volume was turned way down.
If you're using a browser other than Chrome, please let me know - I've tested it in others, but there could always be something I'm missing.

Regarding the indicators in the bottom right:
- If the "recording" light doesn't turn on (the top one), then it did not hear you utter a wake phrase.
- If the "listening" light does turn on, it detected voice activity, but unless you utter a wake phrase it will not send the recording for transcription and completion.
So in short, if you say "Hex Vox, what's the news?" and you don't see the recording light turn on, then it didn't catch the wake phrase and you have to try again.
If instead you just want to speak your command without relying on wake phrase recognition, you can just click the "Call" button - that will start recording immediately and always send the audio for transcription.
This project was the one that set me off on making the wake phrase model in the first place. At first I didn't have it and relied instead on voice activity detection and transcription; however, this performs extremely poorly in noisy environments or with any kind of muted speech, with near-constant accidental activation. The only efficient way to be always-on AND hands-free was to use a front-end wake-word model to gate the rest of the audio workflow.
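For anyone curious, here's a rough sketch of that gating idea; the detector functions are stand-in stubs of my own, not the actual Anachrovox implementation.

def detect_voice_activity(frame) -> bool:
    return frame.get("energy", 0.0) > 0.01             # stand-in for a real VAD model

def detect_wake_phrase(frames) -> bool:
    return any(f.get("wake", False) for f in frames)   # stand-in for the wake-word model

def transcribe(frames) -> str:
    return " ".join(f.get("text", "") for f in frames) # stand-in for speech-to-text

def audio_loop(mic_frames):
    # Only audio captured after a wake phrase is forwarded for transcription and completion.
    buffer, armed = [], False
    for frame in mic_frames:
        if detect_voice_activity(frame):               # the "listening" light
            buffer.append(frame)
            if not armed and detect_wake_phrase(buffer):
                armed = True                           # the "recording" light
        elif armed and buffer:                         # speech ended after a wake phrase
            print("Transcribing:", transcribe(buffer))
            buffer, armed = [], False
        else:
            buffer, armed = [], False                  # no wake phrase heard: drop the audio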

You're very welcome! Just so it's clear, the code is licensed under Apache, and the wake-word models are licensed under CC-BY-4.0 (to match the licenses of the audio they were trained on). More info on the models here: https://huggingface.co/benjamin-paine/anachrovox