Rotated Positional Embedding for Object Detection in Latent Space

The initial positional embeddings are rotated to align with the latent coordinates of the tagged objects, which places each embedding close to its corresponding object in the image.
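As a rough illustration of what rotating an embedding by a latent coordinate can look like, here is a minimal sketch in the style of rotary position embeddings. It is an assumption about the general technique, not this model's actual implementation: the function names `rotate_embedding` and `rotate_to_latent`, the `base` frequency, and the choice to split the vector into an x half and a y half are all hypothetical.

```python
import math


def rotate_embedding(embedding, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `embedding` by position-dependent angles.

    Hypothetical helper: pair i is rotated by position * base ** (-i / n_pairs),
    the frequency schedule commonly used for rotary embeddings.
    """
    if len(embedding) % 2 != 0:
        raise ValueError("embedding length must be even")
    n_pairs = len(embedding) // 2
    rotated = []
    for i in range(n_pairs):
        theta = position * base ** (-i / n_pairs)
        x, y = embedding[2 * i], embedding[2 * i + 1]
        # Standard 2D rotation of the pair (x, y) by angle theta.
        rotated.append(x * math.cos(theta) - y * math.sin(theta))
        rotated.append(x * math.sin(theta) + y * math.cos(theta))
    return rotated


def rotate_to_latent(embedding, latent_xy):
    """Rotate the first half by the latent x coordinate and the second half by y.

    Assumes the embedding length is a multiple of 4 so each half has whole pairs.
    """
    half = len(embedding) // 2
    latent_x, latent_y = latent_xy
    return (rotate_embedding(embedding[:half], latent_x)
            + rotate_embedding(embedding[half:], latent_y))


# Usage: rotate an 8-dimensional embedding toward an object at latent (3.0, 5.0).
aligned = rotate_to_latent([1.0, 0.0, 0.5, 0.5, 1.0, 0.0, 0.5, 0.5], (3.0, 5.0))
```

Because each pair undergoes a pure rotation, the norm of the embedding is unchanged; only its phase encodes the object's position.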

This model is built on a multimodal model, Wan2.1, which encodes the image.

Categories:

- [1] hat
- [2] hair
- [3] sunglasses
- [4] shirt
- [5] skirt
- [6] pants
- [7] dress
- [8] belt
- [9] shoes
- [11] face
- [12] legs
- [14] arms
- [16] bag
- [17] scarf
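
The category IDs above are not contiguous (10, 13, and 15 are not listed), so a lookup table is safer than indexing into a list. The sketch below assumes the model reports categories as integer IDs; the names `CATEGORIES` and `category_name` are hypothetical and not part of the model's API.

```python
# Category IDs and names exactly as listed above; missing IDs are intentionally absent.
CATEGORIES = {
    1: "hat",
    2: "hair",
    3: "sunglasses",
    4: "shirt",
    5: "skirt",
    6: "pants",
    7: "dress",
    8: "belt",
    9: "shoes",
    11: "face",
    12: "legs",
    14: "arms",
    16: "bag",
    17: "scarf",
}


def category_name(category_id):
    """Return the name for a category ID, or raise KeyError for an unlisted ID."""
    if category_id not in CATEGORIES:
        raise KeyError(f"unknown category id: {category_id}")
    return CATEGORIES[category_id]
```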

Disclaimer

The documentation and the model require citation and attribution to the author via a link to their Hugging Face profile.