maomao88 committed on
Commit 3b4b5c1 · 1 Parent(s): e13cd84

update readme

Files changed (1):
  1. README.md +8 -7
README.md CHANGED
@@ -21,15 +21,16 @@ tags:
 
 This app aims to help users better understand the behavior behind the attention layers in transformer models by visualizing the cross-attention and self-attention weights in an encoder-decoder model to see the alignment between and within the source and target tokens.
 
-The app leverages the `Helsinki-NLP/opus-mt-en-zh` model to perform translation tasks from English to Chinese and by `output_attentions=True`, the attention weights are stored as follows:
+The app leverages the `Helsinki-NLP/opus-mt-en-zh` model to perform translation tasks from English to Chinese; by setting `output_attentions=True`, the attention weights are returned as follows:
 
-Attention Type | Shape | Role
-encoder_attentions | (layers, B, heads, src_len, src_len) | Encoder self-attention on source tokens
-decoder_attentions | (layers, B, heads, tgt_len, tgt_len) | Decoder self-attention on generated tokens
-cross_attentions | (layers, B, heads, tgt_len, src_len) | Decoder attention over source tokens (encoder outputs)
+| Attention Type     | Shape                                | Role                                                    |
+|--------------------|--------------------------------------|---------------------------------------------------------|
+| encoder_attentions | (layers, B, heads, src_len, src_len) | Encoder self-attention on source tokens                 |
+| decoder_attentions | (layers, B, heads, tgt_len, tgt_len) | Decoder self-attention on generated tokens              |
+| cross_attentions   | (layers, B, heads, tgt_len, src_len) | Decoder attention over source tokens (encoder outputs)  |
 
-By taking the weights from the last encoder and decoder layers and calculating the mean over the 8 heads, the attention weights (avg over heads) are obtained to build attention visualizations
+The attention weights used to build the visualizations are obtained by taking the weights from the last encoder and decoder layers and averaging them over all of the attention heads.
 
 **Note:**
-* `attn_weights = softmax(Q @ K.T / sqrt(d_k))`
+* `attention_weights = softmax(Q @ K.T / sqrt(d_k))` - a probability distribution over all keys (i.e., the tokens being attended to) for each query (i.e., the current token).
 * `(layers, B, heads, tgt_len, src_len)` - e.g. `(6, 1, 8, 24, 18)` for `cross_attentions`: 6 layers, batch size 1, 8 heads, 24 target tokens, 18 source tokens.
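For reference, here is a minimal sketch (not the app's actual code; the sample sentence and variable names are illustrative) of how the attention tuples described in the table can be extracted from `Helsinki-NLP/opus-mt-en-zh` with Hugging Face `transformers` and averaged over heads:

```python
# Minimal sketch: extract the attention weights described above and
# average them over the heads dimension. Illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()

batch = tokenizer("Attention lets the decoder look back at the source.",
                  return_tensors="pt")

# 1) Translate. num_beams=1 keeps the batch dimension B unchanged.
generated = model.generate(**batch, num_beams=1, max_new_tokens=64)

# 2) Re-run one forward pass with the generated target so that
#    encoder_attentions / decoder_attentions / cross_attentions come back
#    as plain per-layer tuples matching the shapes in the table above.
with torch.no_grad():
    outputs = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        decoder_input_ids=generated,
        output_attentions=True,
    )

# Each output is a tuple with one tensor per layer:
#   encoder_attentions[i]: (B, heads, src_len, src_len)
#   decoder_attentions[i]: (B, heads, tgt_len, tgt_len)
#   cross_attentions[i]:   (B, heads, tgt_len, src_len)
last_cross = outputs.cross_attentions[-1]  # last decoder layer
avg_cross = last_cross.mean(dim=1)         # average over heads -> (B, tgt_len, src_len)

src_tokens = tokenizer.convert_ids_to_tokens(batch["input_ids"][0])
tgt_tokens = tokenizer.convert_ids_to_tokens(generated[0])
print(avg_cross.shape, len(src_tokens), len(tgt_tokens))
```

Each row of `avg_cross` is one softmax distribution from the formula in the note, so it sums to 1 over the source tokens and can be rendered directly as a row of the cross-attention heatmap.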