Integrate with Transformers & SentenceTransformers
Hi! I've started integrating your model with Transformers, but when I do

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".")
model = AutoModelForSequenceClassification.from_pretrained(".", trust_remote_code=True)

inputs = tokenizer(["Hello"], return_tensors="pt")
outputs = model(**inputs)
```

I get the following output:
```text
File ~/.cache/huggingface/modules/transformers_modules/listconranker.py:223, in ListConRankerModel.forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict, pair_num, **kwargs)
    220 last_hidden_state = ranker_out.last_hidden_state
    222 pair_features = self.average_pooling(last_hidden_state, attention_mask)
--> 223 pair_features = self.linear_in_embedding(pair_features)
    225 logits, pair_features_after_list_transformer = self.list_transformer(pair_features, pair_nums)
    226 logits = self.sigmoid(logits)
...
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x1792 and 1024x1792)
```
I don't know why this happens, because I've copied the code from your implementation.
I'm sorry I wasn't notified in time, because I'm not the owner of the repository. Thank you for your contribution to our repository!
I think it is because you modified the `hidden_size` in the `config.json`. That changes the size of `last_hidden_state` in `ranker_out` when you initialize the `BertModel`. The size of `last_hidden_state` in `last_hidden_state = ranker_out.last_hidden_state` should be 1024, and it should only become 1792 after passing through `self.linear_in_embedding`.

The solution is that you probably shouldn't modify the `hidden_size` in `config.json`. Try defining it with another variable name.
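If it helps, here is a minimal sketch of that idea, assuming the extra size is stored under a custom config attribute (`list_transformer_hidden_size` is just an illustrative name, not the repository's actual field):

```python
# Sketch only: keep BERT's own hidden_size (1024) untouched in config.json and
# read the ListTransformer embedding size from a differently named attribute.
import torch.nn as nn
from transformers import BertConfig, BertModel

config = BertConfig.from_pretrained(".")                           # hidden_size stays 1024
list_dim = getattr(config, "list_transformer_hidden_size", 1792)   # assumed custom field

bert = BertModel(config)                                           # emits 1024-dim hidden states
linear_in_embedding = nn.Linear(config.hidden_size, list_dim)      # maps 1024 -> 1792
```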
Thanks! I've fixed the shapes problem. Now I need to handle pairs correctly. The problem is that `forward` receives only `input_ids`. I tried to find the sentence boundaries using the `sep` token, but I think that's too complicated. Do you have any ideas on how to handle them properly?
I think there might be a problem with your code here? The `input_ids`, `attention_mask`, `token_type_ids` and so on are not defined in the arguments of the `forward` method.

You can refer to the implementation of the `BertModel` class in the `transformers` library. I believe it's feasible to include `pair_nums` as an argument of the `forward` method.

By the way, it's not feasible to determine `pair_nums` using the `[SEP]` token. For example, ListConRanker takes queries and passages in the following input format (ListConRanker supports an arbitrary number of passages):
```python
inputs = [
    ['query_1', 'passage_11', 'passage_12'],                # batch 1
    ['query_2', 'passage_21', 'passage_22', 'passage_23'],  # batch 2
]
```
ListConRanker first concatenates them and then tokenizes each sentence separately, like this:

```python
inputs = tokenize(
    ['query_1', 'passage_11', 'passage_12', 'query_2', 'passage_21', 'passage_22', 'passage_23']
)
```
and it will produce tokenization results like the following, and so on for the remaining sentences:

```text
[CLS] query_1 [SEP]
[CLS] passage_11 [SEP]
[CLS] passage_12 [SEP]
```
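Here is a small sketch of that flatten-then-tokenize step, assuming a standard `AutoTokenizer`; the variable names and the `"."` model path are placeholders:

```python
# Illustrative sketch: flatten the nested query/passage lists and tokenize every
# sentence separately, so each sequence becomes "[CLS] text [SEP]" (plus padding).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".")  # "." is a placeholder model path

batches = [
    ['query_1', 'passage_11', 'passage_12'],
    ['query_2', 'passage_21', 'passage_22', 'passage_23'],
]
flat = [text for batch in batches for text in batch]    # 7 individual sentences
encoded = tokenizer(flat, padding=True, return_tensors="pt")
```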
Finally, we use the tokenization results as the input to the `BertModel`, and then we use `pair_nums` to reorganize the inputs in the ListTransformer. Therefore, I believe that requiring the user to pass `pair_nums` to the `forward` method is the only solution.
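For illustration, here is a rough sketch of what such a `forward` could look like. The attribute names other than those visible in the traceback (`average_pooling`, `linear_in_embedding`, `list_transformer`, `sigmoid`) are assumptions, and the real signature and return value follow the repository's code:

```python
# Sketch of a BERT-style forward that additionally takes pair_nums.
def forward(
    self,
    input_ids=None,
    attention_mask=None,
    token_type_ids=None,
    pair_nums=None,      # e.g. [3, 4]: entries per query group (exact convention per the model)
    **kwargs,
):
    ranker_out = self.hf_model(              # assumed name for the wrapped BertModel
        input_ids=input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
    )
    last_hidden_state = ranker_out.last_hidden_state
    pair_features = self.average_pooling(last_hidden_state, attention_mask)
    pair_features = self.linear_in_embedding(pair_features)
    logits, _ = self.list_transformer(pair_features, pair_nums)
    return self.sigmoid(logits)              # simplified return value
```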
> The `input_ids`, `attention_mask`, `token_type_ids` and so on are not defined in the arguments of the `forward` method.

They're defined, just hidden in the collapsed view. I've copied `forward` from `BertForSequenceClassification`.
> For example, ListConRanker takes queries and passages in the following input format (ListConRanker supports an arbitrary number of passages):

I want to make it possible to run your model correctly with Sentence Transformers' `CrossEncoder`. The main problem there is that `CrossEncoder` passes only `input_ids` to `forward`, in a batched manner. So I want to split the `input_ids` by the `sep_token` and then create the pairs; I will try to create a correct implementation later.
Sorry, I missed some code. The implementation of the `forward` method is correct.

I haven't tried adding a customized model to the library before 😭. I'll also try to study the Sentence Transformers code if needed.
Sentence Transformers uses `AutoModelForSequenceClassification` for cross-encoders. The model now loads through the auto classes, but I still need to figure out how to handle pairs.
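For context, the target usage would look roughly like this (a sketch only; the `"."` path is a placeholder, and `trust_remote_code` support depends on the installed sentence-transformers version):

```python
# Usage sketch, not a verified integration: score (query, passage) pairs
# through sentence_transformers.CrossEncoder.
from sentence_transformers import CrossEncoder

model = CrossEncoder(".", trust_remote_code=True)  # "." is a placeholder model path
scores = model.predict([
    ("query_1", "passage_11"),
    ("query_1", "passage_12"),
    ("query_2", "passage_21"),
])
print(scores)  # one relevance score per (query, passage) pair
```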
I think I understand your approach now. Suppose there is an input like this:
```python
inputs = [
    ['query_1', 'passage_11', 'passage_12'],                # batch 1
    ['query_2', 'passage_21', 'passage_22', 'passage_23'],  # batch 2
]
```
You can use the tokenizer of `CrossEncoder` to obtain `input_ids` like the following, and so on for the remaining pairs:

```text
[CLS] query_1 [SEP] passage_11 [SEP]
[CLS] query_1 [SEP] passage_12 [SEP]
[CLS] query_2 [SEP] passage_21 [SEP]
```
You can split the `input_ids` at the first `[SEP]` token and then check whether the preceding queries are the same. Use the number of identical queries as `pair_nums`.

But finally, you still need to reorganize the `input_ids` into a form similar to the following (and so on for the remaining sentences), and pay attention to changing the first token of each passage into `[CLS]`:

```text
[CLS] query_1 [SEP]
[CLS] passage_11 [SEP]
[CLS] passage_12 [SEP]
```
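If it helps, here is a hedged sketch of that splitting and reorganizing step. It works on plain (unpadded) lists of token ids and assumes the tokenizer's standard `[CLS]`/`[SEP]` ids; the exact `pair_nums` convention and any padding/mask handling would need to follow the model's own code:

```python
# Illustrative sketch only, not ListConRanker's actual implementation.
# Input rows look like: [CLS] query [SEP] passage [SEP]
# Output: one [CLS] query [SEP] row per query group, followed by its
# [CLS] passage [SEP] rows, plus pair_nums = number of pairs sharing each query.
from typing import List, Tuple

def reorganize(input_ids: List[List[int]], cls_id: int, sep_id: int) -> Tuple[List[List[int]], List[int]]:
    groups = []                                      # (query_ids, [passage_ids, ...])
    for ids in input_ids:
        first_sep = ids.index(sep_id)
        query = ids[1:first_sep]                     # tokens between [CLS] and the first [SEP]
        passage = ids[first_sep + 1:]
        if passage and passage[-1] == sep_id:        # drop the trailing [SEP]
            passage = passage[:-1]
        if groups and groups[-1][0] == query:        # same query as the previous pair
            groups[-1][1].append(passage)
        else:
            groups.append((query, [passage]))

    reorganized, pair_nums = [], []
    for query, passages in groups:
        pair_nums.append(len(passages))              # number of pairs with this query
        reorganized.append([cls_id] + query + [sep_id])
        for passage in passages:
            reorganized.append([cls_id] + passage + [sep_id])
    return reorganized, pair_nums
```

In practice, the padded tensors coming out of the `CrossEncoder` tokenizer would first have to be stripped of padding (for example via the attention mask) before this kind of splitting, and the reorganized rows re-padded into a batch.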
Yes, that's exactly what I wanted to do. However, the `CrossEncoder` can only handle one (query, passage) pair at a time, so I will create a separate method for multiple passages.
Maybe the way passages are handled could be changed directly in the CrossEncoder. cc @tomaarsen