A question about the "mem_tokens" position
Why is `mem_tokens` concatenated to the front of `enc_input` at line 281 of the code?

When the decoder itself is used as the compressor, the call chain `generate` -> `compress_and_replace_emb` -> `compr_decoder` directly calls the decoder model's `forward` method to encode the input, and the compressed document vectors are read off at the positions of the memory tokens. According to the `__init__` method, the decoder is initialized with `AutoModelForCausalLM`, and the `forward` method of `MistralForCausalLM` applies a triangular (causal) attention mask to the input. Doesn't concatenating the memory tokens at the front prevent them from attending to the tokens that come after them? Is this a bug in the code?
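To make the concern concrete, here is a toy illustration in plain PyTorch (not the repository's code) of why a causal mask keeps front-placed memory tokens from seeing the rest of the input, whereas appending them after the document would let them attend to all of it:

```python
import torch

# Toy illustration: with a causal (lower-triangular) mask, tokens placed at the
# FRONT of the sequence cannot attend to anything that comes after them.
n_mem, n_doc = 2, 4                  # 2 memory tokens + 4 document tokens
seq_len = n_mem + n_doc

causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Rows = query positions, columns = key positions a query may attend to.
# The memory-token rows (0 and 1) only see themselves, never the document:
print(causal_mask[:n_mem])
# [[ True, False, False, False, False, False],
#  [ True,  True, False, False, False, False]]

# If the memory tokens were instead appended AFTER the document,
# their rows would cover every document position:
print(causal_mask[-n_mem:])
# [[ True,  True,  True,  True,  True, False],
#  [ True,  True,  True,  True,  True,  True]]
```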
Hey, thanks for spotting what indeed looks like a bug.
This code is deprecated in favor of the 'PISCO' collection code https://huggingface.co/collections/naver/pisco-67683f2b553e5dc6b8c3f156
Thank you for sharing. I have another question: could you share the training code for reference? I ran into several problems while trying to reproduce your work, for example exploding gradients, poor compatibility with DeepSpeed ZeRO-2, and some issues with how the inputs for the pretraining task should be constructed.
Unfortunately we cannot share the training code. There is, however, no mystery to it: following what's described in the paper, you should not struggle too much to replicate our experiments. We did not need DeepSpeed to train the models, only DDP with "standard" transformers code (which does gradient clipping by default, although we never hit this 'exploding gradient' problem). If you have more detailed questions about reproducing the results, feel free to reach the first author by mail.
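For reference, a minimal sketch of the kind of "standard" DDP setup described above: a plain transformers `Trainer` launched with `torchrun`, no DeepSpeed config. The backbone name, dummy dataset, and hyper-parameters here are placeholders, not the values used in the paper.

```python
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.1"   # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Dummy data; replace with the actual pretraining / fine-tuning corpus.
train_dataset = Dataset.from_dict({"text": ["some example document"] * 64}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    max_grad_norm=1.0,                 # Trainer clips gradients by default
    bf16=True,
    ddp_find_unused_parameters=False,  # plain DDP, no DeepSpeed
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Launched with torchrun for multi-GPU DDP, e.g.:
#   torchrun --nproc_per_node=8 train.py
```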