Integrate with Transformers & SentenceTransformers

#3
by Samoed - opened

Hi! I've started integrating your model with Transformers, but when I do

from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".")
model = AutoModelForSequenceClassification.from_pretrained(".", trust_remote_code=True)

inputs = tokenizer(["Hello"], return_tensors="pt")
outputs = model(**inputs)

I got this output:

File ~/.cache/huggingface/modules/transformers_modules/listconranker.py:223, in ListConRankerModel.forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict, pair_num, **kwargs)
    220     last_hidden_state = ranker_out.last_hidden_state
    222 pair_features = self.average_pooling(last_hidden_state, attention_mask)
--> 223 pair_features = self.linear_in_embedding(pair_features)
    225 logits, pair_features_after_list_transformer = self.list_transformer(pair_features, pair_nums)
    226 logits = self.sigmoid(logits)

...

RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x1792 and 1024x1792)

And I don't know why, because I've copied the code from your implementation.

Samoed changed pull request status to open

I'm sorry I didn't get notified in time, because I'm not the owner of the repository. Thank you for your contribution to our repository!

I think it is because you modified hidden_size in the config.json. That changes the last_hidden_state size of ranker_out when you initialize the BertModel.

The size of last_hidden_state in last_hidden_state = ranker_out.last_hidden_state should be 1024, and it should only become 1792 after self.linear_in_embedding is applied.

The solution is that you probably shouldn't modify the hidden_size in the config.json. Try defining it with another variable name.
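For illustration, here is a minimal sketch of that idea on the config side, assuming the remote code defines its own config class; the class name ListConRankerConfig and the attribute name list_transformer_hidden_size are placeholders, not the repository's actual names:

from transformers import PretrainedConfig

class ListConRankerConfig(PretrainedConfig):
    model_type = "listconranker"

    def __init__(self, hidden_size=1024, list_transformer_hidden_size=1792, **kwargs):
        super().__init__(**kwargs)
        # hidden_size stays at the BERT backbone's value (1024), so
        # ranker_out.last_hidden_state keeps the shape linear_in_embedding expects.
        self.hidden_size = hidden_size
        # The projection dimension used after pooling lives under its own name
        # instead of overriding hidden_size.
        self.list_transformer_hidden_size = list_transformer_hidden_size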

Thanks! I've fixed the shapes problem. Now I need to handle pairs correctly. The problem is that forward receives only input_ids. I tried to find the sentence boundaries by the SEP token, but I think it's too complicated. Do you have any ideas on how to handle them properly?

I think there might be a problem with your code here?


The input_ids, attention_mask, token_type_ids and so on are not defined in the arguments of the forward method.

You can refer to the implementation of the BertModel class in the transformers library. I believe it's feasible to include pair_nums as an argument in the forward method.


By the way, it's not feasible to determine pair_nums using the SEP token. For example, the input format for ListConRanker is as follows:

For queries and passages with the following input formats (ListConRanker supports an arbitrary number of passages):

inputs = [
    ['query_1', 'passage_11', 'passage_12'],  # batch 1
    ['query_2', 'passage_21', 'passage_22', 'passage_23']   # batch 2
]

ListConRanker first concatenates them and then tokenizes them separately, like this:

inputs = tokenize(
    ['query_1',  'passage_11', 'passage_12', 'query_2', 'passage_21', 'passage_22', 'passage_23']
)

and the tokenization results will look like:

[CLS] query_1 [SEP]
[CLS] passage_11 [SEP]
[CLS] passage_12 [SEP]
and so on

Finally, we feed the tokenization results into the BertModel, and then use pair_nums to reorganize the inputs in the ListTransformer.
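As a rough sketch of that flow (the helper name is hypothetical, and the exact pair_nums convention, i.e. whether the query itself is counted, should follow the model's remote code):

def flatten_and_tokenize(inputs, tokenizer):
    # inputs = [['query_1', 'passage_11', 'passage_12'],
    #           ['query_2', 'passage_21', 'passage_22', 'passage_23']]
    flat = [text for group in inputs for text in group]
    group_sizes = [len(group) for group in inputs]  # [3, 4], query included
    # Each query/passage is tokenized on its own, giving [CLS] text [SEP] per row.
    encoded = tokenizer(flat, padding=True, truncation=True, return_tensors="pt")
    return encoded, group_sizes

# The model then needs the grouping alongside the usual tensors, e.g.
# outputs = model(**encoded, pair_num=group_sizes)  # argument name as in the traceback above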

Therefore, I believe that requiring the user to pass pair_nums to the forward method is the only solution.

The input_ids, attention_mask, token_type_ids and so on are not defined in the arguments of the forward method.

They're defined, just hidden in the collapsed section. I've copied forward from BertForSequenceClassification.


For queries and passages with the following input formats (ListConRanker supports an arbitrary number of passages):

I want to make it possible to run your model with SentenceTransformers' CrossEncoder correctly. The main problem there is that CrossEncoder passes only input_ids to forward, in a batched manner. So I want to split input_ids by the SEP token and then create the pairs. I will try to create a correct implementation later.

Sorry, I missed some code. The implementation of the forward method is correct.

I haven't tried adding a customized model to the library before😭. I'll also try to study the SentenceTransformer code if needed.

SentenceTransformers uses AutoModelForSequenceClassification for cross-encoders. The model loads now via the auto classes, but I need to figure out how to handle pairs.
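For reference, the CrossEncoder path being targeted looks roughly like this (assuming a sentence-transformers version whose CrossEncoder accepts trust_remote_code and forwards it to AutoModelForSequenceClassification):

from sentence_transformers import CrossEncoder

model = CrossEncoder(".", trust_remote_code=True)
# CrossEncoder tokenizes (query, passage) pairs in batches and only forwards
# the resulting tensors, so the pair grouping has to be recovered inside the
# model's forward.
scores = model.predict([
    ("query_1", "passage_11"),
    ("query_1", "passage_12"),
    ("query_2", "passage_21"),
])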

I think I understand your approach now. Suppose there is an input like this:

inputs = [
    ['query_1', 'passage_11', 'passage_12'],  # batch 1
    ['query_2', 'passage_21', 'passage_22', 'passage_23']   # batch 2
]

You can use the CrossEncoder's tokenizer to obtain input_ids like the following:

[CLS] query_1 [SEP] passage_11 [SEP]
[CLS] query_1 [SEP] passage_12 [SEP]
[CLS] query_2 [SEP] passage_21 [SEP]
and so on

You can separate the input_ids at the first [SEP] token, then check whether the preceding queries are the same. Use the number of identical queries as pair_nums.

But finally, you still need to reorganize the input_ids into a form similar to the following, taking care to change the first token of each passage to [CLS].

[CLS] query_1 [SEP]
[CLS] passage_11 [SEP]
[CLS] passage_12 [SEP]
and so on
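A sketch of that regrouping, assuming a tokenizer with the usual cls_token_id/sep_token_id attributes; the helper name is hypothetical, and padding/attention_mask handling is omitted for brevity:

def regroup_cross_encoder_inputs(input_ids, tokenizer):
    cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
    groups, last_query = [], None
    for row in input_ids.tolist():
        first_sep = row.index(sep_id)
        query = tuple(row[1:first_sep])            # tokens between [CLS] and the first [SEP]
        passage = [cls_id] + row[first_sep + 1:]   # passage re-prefixed with [CLS]
        if query != last_query:                    # a new query starts a new group
            groups.append([[cls_id] + list(query) + [sep_id]])
            last_query = query
        groups[-1].append(passage)
    flat = [seq for group in groups for seq in group]
    pair_nums = [len(group) for group in groups]   # whether the query row is counted must match the model
    return flat, pair_nums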

Yes, that's exactly what I wanted to do. However, the CrossEncoder can only handle one query-passage pair at a time. I will create a different method for multiple passages.

Maybe the way passages are handled could be changed directly in the CrossEncoder. cc @tomaarsen

