"ValueError: Image features and image tokens do not match: tokens: 0, features 2852"

#49
by jdkruzr - opened

I keep getting this whenever I ask this model to do transcription; it never gets to the point where it starts outputting text. I have a lot of logging implemented to figure out what's going wrong, and this is what I'm seeing:

2025-06-02 00:29:25,290 - qwen_processor - INFO - Processing page 1/2
2025-06-02 00:29:25,290 - db_manager - INFO - Updated document 1 progress to 20.0% (Recognizing text on page 1/2)
2025-06-02 00:29:25,291 - qwen_processor - INFO - Processing image of size 2925x3900
2025-06-02 00:29:25,292 - qwen_processor - INFO - Before image processing: GPU Memory: 3.79 GB allocated, 3.99 GB reserved, 1.80 GB free
2025-06-02 00:29:25,338 - qwen_processor - INFO - Saved original image to /mnt/rectangularfile/debug_images/original_20250602_002925.jpg
2025-06-02 00:29:25,338 - qwen_processor - INFO - Resizing image from 2925x3900 (11,407,500 pixels) to 1299x1732 (2,249,868 pixels) to conserve memory
2025-06-02 00:29:25,480 - qwen_processor - INFO - Saved resized image to /mnt/rectangularfile/debug_images/resized_20250602_002925.jpg
2025-06-02 00:29:25,480 - qwen_processor - INFO - Preparing inputs with processor...
2025-06-02 00:29:25,635 - qwen_processor - INFO - Processor output details:
2025-06-02 00:29:25,635 - qwen_processor - INFO - - input_ids: shape torch.Size([1, 83])
2025-06-02 00:29:25,635 - qwen_processor - INFO - - attention_mask: shape torch.Size([1, 83])
2025-06-02 00:29:25,635 - qwen_processor - INFO - - pixel_values: shape torch.Size([11408, 1176])
2025-06-02 00:29:25,635 - qwen_processor - INFO - - image_grid_thw: shape torch.Size([1, 3])
2025-06-02 00:29:25,644 - qwen_processor - INFO - Moved inputs to cuda
2025-06-02 00:29:25,645 - qwen_processor - INFO - After moving inputs to GPU: GPU Memory: 3.84 GB allocated, 4.05 GB reserved, 1.75 GB free
2025-06-02 00:29:25,645 - qwen_processor - INFO - Generating transcription...
192.168.100.2 - - [02/Jun/2025 00:29:29] "GET /new_files HTTP/1.1" 200 -
2025-06-02 00:29:30,507 - qwen_processor - ERROR - Error processing image: Image features and image tokens do not match: tokens: 0, features 2852

Claude seems to think the pixel_values parameter is odd, i.e. that the first dimension shouldn't be so huge for files this small. Does anyone have any clues as to what might be going on here?
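
For what it's worth, if Qwen2.5-VL really uses 14x14 patches with a 2x2 spatial merge (my assumption, i.e. 4 patches per image token), that huge first dimension actually lines up with the error message:

n_patches = 11408            # first dim of pixel_values above
n_features = n_patches // 4  # = 2852, the "features 2852" from the error

so the processor apparently did build 2852 image features; it just didn't find any image placeholder tokens in the text to pair them with ("tokens: 0").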

this is my image processing function:

    def _process_image(self, image: Image.Image) -> Tuple[str, float]:
        """Process a single image with Qwen2.5-VL."""
        try:
            self.logger.info(f"Processing image of size {image.width}x{image.height}")
            self._log_memory_usage("Before image processing")

            image = self._resize_image_if_needed(image)

            prompt = """<|im_start|>system
You are a helpful assistant that describes and transcribes handwritten text from images.
<|im_end|>
<|im_start|>user
First describe what you see in this image, including any visible characteristics of the handwriting, layout, or document structure. Then, transcribe all handwritten text, preserving layout and line breaks.
<|im_end|>
<|im_start|>assistant
I'll first describe what I see in the image, then provide the transcription:

Description:
"""

            self.logger.info("Preparing inputs with processor...")
            inputs = self.processor(
                text=prompt,
                images=image,
                return_tensors="pt"
            )

            self.logger.info("Processor output details:")
            for key, value in inputs.items():
                if hasattr(value, 'shape'):
                    self.logger.info(f"- {key}: shape {value.shape}")
                elif isinstance(value, list):
                    self.logger.info(f"- {key}: list of length {len(value)}")
                else:
                    self.logger.info(f"- {key}: type {type(value)}")

            inputs = inputs.to(self.device)
            self.logger.info(f"Moved inputs to {self.device}")
            self._log_memory_usage("After moving inputs to GPU")

            self.logger.info("Generating transcription...")
            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=500,
                    do_sample=False,
                    temperature=1.0,
                    pad_token_id=self.tokenizer.pad_token_id,
                    eos_token_id=self.tokenizer.eos_token_id
                )

            self._log_memory_usage("After generation")
            self.logger.info("Generation completed")

            generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=False)

            del inputs, outputs
            if self.device == "cuda":
                torch.cuda.empty_cache()
                self._log_memory_usage("After cleanup")

            # Log the complete response for debugging
            self.logger.info("=== Complete model response ===")
            self.logger.info(generated_text)
            self.logger.info("===============================")

            # Extract and log the description separately
            response_start = generated_text.find("<|im_start|>assistant")
            if response_start != -1:
                text_start = generated_text.find("Description:", response_start)
                if text_start != -1:
                    description_end = generated_text.find("Transcription:", text_start)
                    if description_end != -1:
                        description = generated_text[text_start:description_end].strip()
                        self.logger.info("=== Model's description of image ===")
                        self.logger.info(description)
                        self.logger.info("===================================")
                        text_start = description_end + len("Transcription:")
                    else:
                        text_start = text_start + len("Description:")
                
                text_end = generated_text.find("<|im_end|>", text_start)
                if text_end == -1:
                    text_end = None
                text = generated_text[text_start:text_end].strip()
            else:
                text = generated_text.strip()

            text_preview = text[:100] + "..." if text else "[No text recognized]"
            self.logger.info("=== Transcription preview ===")
            self.logger.info(text_preview)
            self.logger.info("============================")

            return text, 0.95 if text else 0.0

        except Exception as e:
            self.logger.error(f"Error processing image: {e}")
            import traceback
            self.logger.error(f"Traceback: {traceback.format_exc()}")
            return "", 0.0

The processor's default max_length value is truncating the image/video placeholder tokens in the text when the image/video is large. So you can either:

a) decrease the "max_pixels" value inside the message. This results in fewer vision tokens, and you can do it like this:

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "max_pixels": 16384,       # this is the default value, so decrease that to something smaller if you don't care about feeding all the pixels (eg. "max_pixels": 1920)
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
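
That messages list is meant to go through the chat template plus qwen_vl_utils rather than a hand-built prompt string, so the image placeholder tokens get inserted (and max_pixels gets applied) for you. Roughly the flow from the Qwen2.5-VL model card:

from qwen_vl_utils import process_vision_info

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)  # applies "max_pixels" per image
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)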

b) increase the max_length value when calling the processor (this accounts for both text + vision tokens):

        inputs = self.processor(
            text=prompt,
            max_length=8192,  # add this here; 8192 or more
            images=image,
            return_tensors="pt"
        )

If your inputs are large enough, try doing both.
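
Either way, you can sanity-check the processor output before calling generate(): the number of image placeholder tokens in input_ids has to match the number of image features the vision tower will produce, otherwise you get exactly the error above. A small check, assuming the usual "<|image_pad|>" placeholder token and the 2x2 spatial merge:

image_pad_id = self.processor.tokenizer.convert_tokens_to_ids("<|image_pad|>")
n_image_tokens = (inputs["input_ids"] == image_pad_id).sum().item()
n_image_features = inputs["image_grid_thw"].prod().item() // 4  # 4 = 2x2 merge, single image
print(n_image_tokens, n_image_features)  # must be equal, e.g. 2852 == 2852
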
Let me know how it went!
