Some examples of feature extraction include transforming raw text into word embeddings and extracting important features such as edges or shapes from image/video data.
feed forward chunking
In each residual attention block in transformers the self-attention layer is usually followed by 2 feed forward layers.
The intermediate embedding size of the feed forward layers is often bigger than the hidden size of the model (e.g., for
google-bert/bert-base-uncased).
For an input of size [batch_size, sequence_length], the memory required to store the intermediate feed forward
embeddings [batch_size, sequence_length, config.intermediate_size] can account for a large fraction of the memory
use.