Tensor size mismatch error when passing a dataset to trainer.predict

#533
by GalenP - opened

Hi, hope you're having a good day!
I'm having an issue running inference with the model (zero-shot). When I call trainer.predict, I get a tensor size mismatch, which is strange because the same data passes through the train and eval stages without issue; only at inference do I get the mismatch.

The error is below:

RuntimeError Traceback (most recent call last)
Cell In[18], line 25
17 # Create the trainer with training arguments
18 trainer = Trainer(
19 model=model,
20 args=training_args,
(...)
23 data_collator=DataCollatorForCellClassification(token_dictionary=token_dict),
24 )
---> 25 predictions=trainer.predict(test_dataset_cancer)
26 # Use trainer
27 eval= trainer.evaluate(eval_dataset=test_dataset_cancer)

File /hpcfs/users/a1841503/myconda/envs/geneformer/lib/python3.12/site-packages/transformers/trainer.py:4151, in Trainer.predict(self, test_dataset, ignore_keys, metric_key_prefix)
4148 start_time = time.time()
4150 eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
-> 4151 output = eval_loop(
4152 test_dataloader, description="Prediction", ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix
4153 )
4154 total_batch_size = self.args.eval_batch_size * self.args.world_size
4155 if f"{metric_key_prefix}_jit_compilation_time" in output.metrics:

File /hpcfs/users/a1841503/myconda/envs/geneformer/lib/python3.12/site-packages/transformers/trainer.py:4294, in Trainer.evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
4292 logits = self.gather_function((logits))
4293 if not self.args.batch_eval_metrics or description == "Prediction":
-> 4294 all_preds.add(logits)
4295 if labels is not None:
4296 labels = self.gather_function((labels))

File /hpcfs/users/a1841503/myconda/envs/geneformer/lib/python3.12/site-packages/transformers/trainer_pt_utils.py:317, in EvalLoopContainer.add(self, tensors)
315 self.tensors = tensors if self.do_nested_concat else [tensors]
316 elif self.do_nested_concat:
--> 317 self.tensors = nested_concat(self.tensors, tensors, padding_index=self.padding_index)
318 else:
319 self.tensors.append(tensors)

File /hpcfs/users/a1841503/myconda/envs/geneformer/lib/python3.12/site-packages/transformers/trainer_pt_utils.py:129, in nested_concat(tensors, new_tensors, padding_index)
125 assert (
126 type(tensors) is type(new_tensors)
127 ), f"Expected tensors and new_tensors to have the same type but found {type(tensors)} and {type(new_tensors)}."
128 if isinstance(tensors, (list, tuple)):
--> 129 return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
130 elif isinstance(tensors, torch.Tensor):
131 return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)

File /hpcfs/users/a1841503/myconda/envs/geneformer/lib/python3.12/site-packages/transformers/trainer_pt_utils.py:129, in <genexpr>(.0)
125 assert (
126 type(tensors) is type(new_tensors)
127 ), f"Expected tensors and new_tensors to have the same type but found {type(tensors)} and {type(new_tensors)}."
128 if isinstance(tensors, (list, tuple)):
--> 129 return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
130 elif isinstance(tensors, torch.Tensor):
131 return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)

File /hpcfs/users/a1841503/myconda/envs/geneformer/lib/python3.12/site-packages/transformers/trainer_pt_utils.py:129, in nested_concat(tensors, new_tensors, padding_index)
125 assert (
126 type(tensors) is type(new_tensors)
127 ), f"Expected tensors and new_tensors to have the same type but found {type(tensors)} and {type(new_tensors)}."
128 if isinstance(tensors, (list, tuple)):
--> 129 return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
130 elif isinstance(tensors, torch.Tensor):
131 return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)

File /hpcfs/users/a1841503/myconda/envs/geneformer/lib/python3.12/site-packages/transformers/trainer_pt_utils.py:129, in <genexpr>(.0)
125 assert (
126 type(tensors) is type(new_tensors)
127 ), f"Expected tensors and new_tensors to have the same type but found {type(tensors)} and {type(new_tensors)}."
128 if isinstance(tensors, (list, tuple)):
--> 129 return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
130 elif isinstance(tensors, torch.Tensor):
131 return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)

File /hpcfs/users/a1841503/myconda/envs/geneformer/lib/python3.12/site-packages/transformers/trainer_pt_utils.py:131, in nested_concat(tensors, new_tensors, padding_index)
129 return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
130 elif isinstance(tensors, torch.Tensor):
--> 131 return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)
132 elif isinstance(tensors, Mapping):
133 return type(tensors)(
134 {k: nested_concat(t, new_tensors[k], padding_index=padding_index) for k, t in tensors.items()}
135 )

File /hpcfs/users/a1841503/myconda/envs/geneformer/lib/python3.12/site-packages/transformers/trainer_pt_utils.py:89, in torch_pad_and_concatenate(tensor1, tensor2, padding_index)
86 tensor2 = atleast_1d(tensor2)
88 if len(tensor1.shape) == 1 or tensor1.shape[1] == tensor2.shape[1]:
---> 89 return torch.cat((tensor1, tensor2), dim=0)
91 # Let's figure out the new shape
92 new_shape = (tensor1.shape[0] + tensor2.shape[0], max(tensor1.shape[1], tensor2.shape[1])) + tensor1.shape[2:]

RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1147 but got size 1326 for tensor number 1 in the list.

Thank you for your question. This can happen if the tensors are not padded properly to the same size within a batch. We would suggest checking that first. If that does not appear to be the problem, please provide the code you are using to help troubleshoot further.
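For example, a quick sanity check along these lines (a minimal sketch with placeholder names; substitute your own dataset and collator):

from collections import Counter

# Tokenized cells usually vary in length; the collator should pad each
# batch to one uniform shape before it reaches the model.
lengths = Counter(len(x) for x in my_dataset["input_ids"])  # my_dataset: placeholder
print(min(lengths), max(lengths))  # spread of raw, un-padded sequence lengths

batch = my_collator([my_dataset[i] for i in range(8)])  # my_collator: placeholder
print(batch["input_ids"].shape)  # expect (8, longest length within this batch)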

ctheodoris changed discussion status to closed

I used the Geneformer tokeniser to tokenise the data; I also suspected that the padding had gone wrong. Should I retokenise it and see if the issue persists?

Here is my code below:
import os
from geneformer import TranscriptomeTokenizer

token_dir = "/data/tokenized_data/cd8_cancer/"

if not os.path.exists(token_dir):
    os.makedirs(token_dir)

# dictionary of custom attributes {output dataset column name: input .loom column name}
tk = TranscriptomeTokenizer(
    custom_attr_name_dict={"joinid": "joinid", "cancerType": "cancerType"},
    model_input_size=2048,
    special_token=False,
    collapse_gene_ids=True,
    gene_median_file="/Geneformer/geneformer/gene_dictionaries_30m/gene_median_dictionary_gc30M.pkl",
    token_dictionary_file="Geneformer/geneformer/gene_dictionaries_30m/token_dictionary_gc30M.pkl",
    gene_mapping_file="Geneformer/geneformer/gene_dictionaries_30m/ensembl_mapping_dict_gc30M.pkl",
)
tk.tokenize_data(
    data_directory=h5ad_dir,  # directory of .h5ad files (defined earlier)
    output_directory=token_dir,
    output_prefix="cd8",
    file_format="h5ad",
)

from datasets import load_from_disk

# Load the tokenised data
dataset = load_from_disk("/Geneformer/Zheng/data/tokenized_data/cd8_cancer/cd8_cancer.dataset")

Then I split the data:

import datasets

# Add a dummy label column so the model can make predictions
token_dir_cancer = "/Geneformer/Zheng/data/tokenized_data/cd8_cancer/"
dataset_cancer = datasets.load_from_disk(token_dir_cancer + "cd8_cancer.dataset")
dataset_cancer = dataset_cancer.add_column("label", [0] * len(dataset_cancer))

# Create train, validation, and test splits
dataset_dict_cancer = dataset_cancer.train_test_split(test_size=0.2, seed=42)  # first split: 80% train, 20% test+val
train_dataset_cancer = dataset_dict_cancer["train"]

# Validation and test
test_val_dataset_cancer = dataset_dict_cancer["test"]
test_val_dict_cancer = test_val_dataset_cancer.train_test_split(test_size=0.5, seed=42)  # second split: 50% validation, 50% test

validation_dataset_cancer = test_val_dict_cancer["train"]  # validation set
test_dataset_cancer = test_val_dict_cancer["test"]  # test set

from transformers import BertForSequenceClassification, Trainer, TrainingArguments
from geneformer import DataCollatorForCellClassification

model = BertForSequenceClassification.from_pretrained(
    "/Geneformer/fine_tuned_geneformer", output_attentions=True
)

# Configure for multi-GPU training
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    save_strategy="epoch",
    fp16=True,  # mixed-precision training to reduce memory usage
    dataloader_num_workers=4,  # parallel data loading
)

# Create the trainer with training arguments
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset_cancer,
    eval_dataset=validation_dataset_cancer,
    data_collator=DataCollatorForCellClassification(token_dictionary=token_dict),  # token_dict loaded earlier
)

# Use trainer
predictions = trainer.predict(test_dataset_cancer)

Thank you for following up.

I am assuming that "/Geneformer/fine_tuned_geneformer" points to a model that has been previously fine-tuned for cell classification and that this was based on the 30M Geneformer pretrained model.

If so, you may consider using the function we provide for evaluating a previously saved model.

The padding does not occur at tokenization; it occurs as batches are assembled and presented to the model during training or inference. The function we provide should handle the padding. Otherwise, you will need to check that the collator does this properly for trainer.predict as you have set it up.
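For reference, a minimal sketch of that route (the class and argument names below follow the geneformer.Classifier examples; treat them as assumptions and check them against your installed version):

from geneformer import Classifier

# Sketch only: evaluate a previously fine-tuned cell classifier on labeled,
# tokenized test data. Paths and the "..." values are placeholders.
cc = Classifier(
    classifier="cell",
    cell_state_dict={"state_key": "cancerType", "states": "all"},  # assumed label column
    forward_batch_size=16,
    nproc=4,
)
all_metrics = cc.evaluate_saved_model(
    model_directory="/Geneformer/fine_tuned_geneformer",
    id_class_dict_file="...",  # id-to-class mapping saved during fine-tuning
    test_data_file="...",      # labeled, tokenized .dataset to evaluate on
    output_directory="./results",
    output_prefix="cd8_eval",
)

If you stay with your own Trainer setup instead, note that the model was loaded with output_attentions=True, so each prediction step also returns attention tensors of shape (batch, heads, seq_len, seq_len). As the torch_pad_and_concatenate source in your traceback shows, the evaluation loop can pad only the first non-batch dimension when concatenating across batches, which would match the 1147-vs-1326 mismatch you saw. One workaround (using the standard ignore_keys argument of trainer.predict) is to drop those outputs at prediction time, or simply load the model without output_attentions=True:

# Keep only fixed-shape logits out of the model outputs during prediction.
predictions = trainer.predict(test_dataset_cancer, ignore_keys=["attentions"])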

Yes exactly.
Thanks so much for that!! You're a gem.
