---
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3-Embedding-0.6B
tags:
- encoder
- Text Generation
- embedding
---

# Qwen3-0.6B-T5-xxl-GGUF

## Model Description

This repository provides GGUF-quantized versions of the `Qwen3-0.6B-T5-xxl` model body, designed for fast, low-resource inference on CPUs. The goal of this project is to replicate the embedding outputs of `google/t5-v1_1-xxl` using a highly optimized pipeline.

To make this repository fully functional out of the box, the fine-tuned **projection head is also included**. This allows you to combine the GGUF model with the PyTorch-based head to obtain the final 4096-dimensional embeddings.

## Repository Contents

- `qwen3-0.6B-Q4_K_M.gguf`: The model body quantized using the Q4_K_M method; other quantization variants may be added.
- `projection_head/projection_head.pth`: The PyTorch state dictionary for the final projection layer.

## How to Use: Hybrid GGUF + PyTorch Pipeline

This tutorial shows how to use the GGUF model for fast base-embedding generation and the PyTorch head for the final projection.

### Step 1: Prerequisites

First, install the necessary libraries. `llama-cpp-python` is required to run GGUF models.

```bash
pip install llama-cpp-python torch numpy
```

### Step 2: Inference Script

The following script encapsulates the entire hybrid pipeline in a convenient class. You can save it as a `.py` file and import it into your projects.

```python
import numpy as np
import torch
from torch import nn
from llama_cpp import Llama


class HybridEmbedder:
    """
    A class that encapsulates the hybrid embedding pipeline.
    It loads the models once at initialization for optimal performance.
    """

    def __init__(self, gguf_path: str, head_path: str, n_ctx: int = 512):
        print("Initializing HybridEmbedder...")

        # 1. Load the GGUF body
        print(f"Loading GGUF body from: {gguf_path}")
        self.body_model = Llama(
            model_path=gguf_path,
            embedding=True,
            n_ctx=n_ctx,
            verbose=False
        )
        print(" -> GGUF body loaded.")

        # 2. Load the PyTorch projection head
        print(f"Loading projection head from: {head_path}")
        input_dim = self.body_model.n_embd()
        hidden_dim = 2048
        output_dim = 4096
        self.head_model = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, output_dim)
        )
        # map_location="cpu" keeps this CPU pipeline working even if the
        # head was saved from a GPU training run.
        self.head_model.load_state_dict(torch.load(head_path, map_location="cpu"))
        self.head_model.eval()
        print(" -> Projection head loaded.")
        print("\n✅ Embedder is ready to use.")

    def get_embedding(self, text: str) -> torch.Tensor:
        # a) Get the sequence of token embeddings from the GGUF model
        token_embeddings = self.body_model.embed(text)

        # b) Apply mean pooling to get a single sentence vector
        sentence_embedding = np.mean(token_embeddings, axis=0)

        # c) Convert to a PyTorch tensor and add a batch dimension
        sentence_tensor = torch.tensor(sentence_embedding).unsqueeze(0)

        # d) Pass through the projection head
        with torch.no_grad():
            final_embedding = self.head_model(sentence_tensor.float())

        return final_embedding


# --- Example Usage ---
if __name__ == "__main__":
    # Define the paths to your local model files
    GGUF_FILE = "qwen3-0.6B-Q4_K_M.gguf"
    HEAD_FILE = "./projection_head/projection_head.pth"

    # Create an instance of our embedder
    embedder = HybridEmbedder(gguf_path=GGUF_FILE, head_path=HEAD_FILE)

    # Use the embedder to get vectors
    prompt = "A sprawling fantasy city built into a giant tree."
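    # --- Added sanity check (not part of the original script) ---
    # Embeddings of related sentences should score a noticeably higher cosine
    # similarity than embeddings of unrelated ones. Both prompts below are
    # illustrative examples only.
    related = "A vast fantasy metropolis grown around an enormous tree."
    unrelated = "Quarterly earnings report for a semiconductor company."
    sim_related = torch.nn.functional.cosine_similarity(
        embedder.get_embedding(prompt), embedder.get_embedding(related)
    ).item()
    sim_unrelated = torch.nn.functional.cosine_similarity(
        embedder.get_embedding(prompt), embedder.get_embedding(unrelated)
    ).item()
    print(f"Cosine similarity (related):   {sim_related:.4f}")
    print(f"Cosine similarity (unrelated): {sim_unrelated:.4f}")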
    embedding = embedder.get_embedding(prompt)

    print("\n--- Inference Test ---")
    print(f"Prompt: '{prompt}'")
    print(f"Output dimension: {embedding.shape}")
    print(f"Vector excerpt: {embedding[0, :5]}...")
```

## License

This repository is licensed under the **Apache License 2.0**.
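## Appendix: Downloading the Files

If you prefer to fetch the files programmatically rather than cloning the repository, `huggingface_hub` (installable via `pip install huggingface_hub`) can download them individually. This is a minimal sketch; the `repo_id` below is a placeholder and must be replaced with this repository's actual id.

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id -- substitute the actual id of this repository.
REPO_ID = "your-username/Qwen3-0.6B-T5-xxl-GGUF"

gguf_path = hf_hub_download(repo_id=REPO_ID, filename="qwen3-0.6B-Q4_K_M.gguf")
head_path = hf_hub_download(repo_id=REPO_ID, filename="projection_head/projection_head.pth")

print(gguf_path)
print(head_path)
```

The returned local paths can then be passed directly to `HybridEmbedder(gguf_path=..., head_path=...)` from Step 2.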