A question about the "mem_tokens" position

#1
by Martrix - opened

Why is mem_token concatenated to the front of enc_input at line 281 of the code?


When the decoder itself is used as the compressor, the call chain generate -> compress_and_replace_emb -> compr_decoder calls the decoder's forward method directly to encode the input and read off the document's compressed vectors at the mem_token positions. According to the __init__ method, the decoder is loaded with AutoModelForCausalLM, and the forward method of MistralForCausalLM applies a lower-triangular (causal) attention mask. If the mem_tokens are concatenated at the front, won't they be unable to attend to the input tokens that come after them? Is this a bug in the code?
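To illustrate the concern, here is a minimal sketch (not the repository's code) of why the position matters under a causal mask: memory tokens placed at the front of the sequence cannot attend to anything after them, while memory tokens placed at the end see the whole document. The names `n_mem` and `n_doc` are illustrative placeholders.

```python
import torch

n_mem, n_doc = 4, 8
seq_len = n_mem + n_doc

# Lower-triangular causal mask as used by decoder-only models such as
# MistralForCausalLM: position i may only attend to positions j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Case 1: memory tokens prepended (indices 0..n_mem-1).
# Each memory token sees only itself and earlier memory tokens,
# so it receives no information from the document tokens that follow.
front_visibility = causal_mask[:n_mem, n_mem:]   # (n_mem, n_doc)
print(front_visibility.any().item())             # False

# Case 2: memory tokens appended (indices n_doc..seq_len-1).
# Each memory token attends to every document token, which is what
# a compressor needs to summarize the document into the memory slots.
back_visibility = causal_mask[n_doc:, :n_doc]    # (n_mem, n_doc)
print(back_visibility.all().item())              # True
```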

NAVER LABS Europe org

Hey, thanks for spotting what indeed looks like a bug.
This code is deprecated in favor of the 'PISCO' collection code https://huggingface.co/collections/naver/pisco-67683f2b553e5dc6b8c3f156

Thank you for sharing. I have another question: could you share the training code for reference? I ran into some problems while trying to reproduce your work, for example exploding gradients, poor compatibility with DeepSpeed ZeRO-2, and some issues with constructing the input for the pretraining task.

NAVER LABS Europe org

Unfortunately we cannot share the training code. There is, however, no mystery to it, and by following what is described in the paper you should not struggle too much to replicate our experiments. We did not need DeepSpeed to train the models, only DDP with "standard" transformers code (which does gradient clipping by default, though we never ran into this exploding-gradient problem). If you have more detailed questions about reproducing the results, feel free to reach out to the first author by email.
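For reference, a minimal sketch of the kind of "standard" setup described above: plain DDP launched with torchrun plus the Hugging Face Trainer, which clips gradients by default (max_grad_norm=1.0). The model, dataset, and hyperparameters below are placeholders, not the authors' actual configuration.

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    max_grad_norm=1.0,              # Trainer's default gradient clipping
    bf16=True,
    ddp_find_unused_parameters=False,
)

trainer = Trainer(
    model=model,                    # your compressor/decoder model (to be supplied)
    args=training_args,
    train_dataset=train_dataset,    # your pretraining dataset (to be supplied)
)
trainer.train()

# Launch with plain DDP, e.g.:
#   torchrun --nproc_per_node=8 train.py
```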
