---
license: apache-2.0
language:
  - en
base_model: Qwen/Qwen3-Embedding-0.6B
tags:
  - encoder
  - text-generation
  - embedding
---

# Qwen3-0.6B-T5-xxl-GGUF

## Model Description

This repository provides GGUF-quantized versions of the `Qwen3-0.6B-T5-xxl` model body. These quantizations are designed for fast, low-resource inference on CPUs.

The goal of this project is to replicate the embedding outputs of `google/t5-v1_1-xxl` with a much smaller, CPU-friendly pipeline.

To make this repository fully functional out of the box, the fine-tuned **projection head is also included**. This allows you to combine the GGUF model with the PyTorch-based head to obtain the final 4096-dimensional embeddings.
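
At a high level, the pipeline looks like this (`n_embd` is the body's hidden size, read from the GGUF file at runtime; 2048 and 4096 are the head dimensions used by the script below):

```
text --(GGUF body)--> token embeddings  [seq_len, n_embd]
     --(mean pooling)--> sentence vector  [n_embd]
     --(projection head: n_embd -> 2048 -> 4096)--> final embedding  [4096]
```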

## Repository Contents

- `qwen3-0.6B-Q4_K_M.gguf`: The model body quantized with the Q4_K_M method (other quantizations may also be provided).
- `projection_head/projection_head.pth`: The PyTorch state dictionary for the final projection layer.
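
If you would rather fetch both files programmatically, they can be downloaded with `huggingface_hub` (a minimal sketch; the `repo_id` below is a placeholder, substitute this repository's actual id):

```python
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

REPO_ID = "your-username/Qwen3-0.6B-T5-xxl-GGUF"  # placeholder; use the actual repo id

gguf_path = hf_hub_download(repo_id=REPO_ID, filename="qwen3-0.6B-Q4_K_M.gguf")
head_path = hf_hub_download(repo_id=REPO_ID, filename="projection_head/projection_head.pth")
```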

## How to Use: Hybrid GGUF + PyTorch Pipeline

This tutorial shows how to use the GGUF model for fast base embedding generation and the PyTorch head for the final projection.

### Step 1: Prerequisites

First, install the necessary libraries. `llama-cpp-python` is required to run GGUF models.

```bash
pip install llama-cpp-python torch numpy
```
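
To confirm the bindings installed correctly, a quick import check should print both versions without errors:

```python
import llama_cpp
import torch

print("llama-cpp-python:", llama_cpp.__version__)
print("torch:", torch.__version__)
```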

### Step 2: Inference Script

The following script wraps the entire hybrid pipeline in a convenient class. You can save it as a `.py` file and import it into your projects.

```python
import torch
from torch import nn
from llama_cpp import Llama
import numpy as np

class HybridEmbedder:
    """
    A class that encapsulates the hybrid embedding pipeline.
    It loads the models once at initialization for optimal performance.
    """
    def __init__(self, gguf_path: str, head_path: str, n_ctx: int = 512):
        print("Initializing HybridEmbedder...")
        
        # 1. Load the GGUF body
        print(f"Loading GGUF body from: {gguf_path}")
        self.body_model = Llama(
            model_path=gguf_path,
            embedding=True,
            n_ctx=n_ctx,
            verbose=False
        )
        print(" -> GGUF body loaded.")

        # 2. Load the PyTorch projection head
        print(f"Loading projection head from: {head_path}")
        input_dim = self.body_model.n_embd()
        hidden_dim = 2048
        output_dim = 4096
        
        self.head_model = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), 
            nn.GELU(),
            nn.Dropout(0.1), 
            nn.Linear(hidden_dim, output_dim)
        )
        self.head_model.load_state_dict(torch.load(head_path, map_location="cpu"))
        self.head_model.eval()
        print(" -> Projection head loaded.")
        print("\n✅ Embedder is ready to use.")

    def get_embedding(self, text: str) -> torch.Tensor:
        # a) Get embeddings from the GGUF model (per-token vectors when the
        #    model is loaded without built-in pooling)
        token_embeddings = np.asarray(self.body_model.embed(text), dtype=np.float32)

        # b) Mean-pool the token embeddings into a single sentence vector
        #    (skipped if llama.cpp already returned a pooled 1-D vector)
        if token_embeddings.ndim > 1:
            sentence_embedding = token_embeddings.mean(axis=0)
        else:
            sentence_embedding = token_embeddings
        
        # c) Convert to a PyTorch tensor and add a batch dimension
        sentence_tensor = torch.tensor(sentence_embedding).unsqueeze(0)
        
        # d) Pass through the projection head
        with torch.no_grad():
            final_embedding = self.head_model(sentence_tensor.float())
            
        return final_embedding

# --- Example Usage ---
if __name__ == "__main__":
    # Define the paths to your local model files
    GGUF_FILE = "qwen3-0.6B-Q4_K_M.gguf"
    HEAD_FILE = "./projection_head/projection_head.pth"

    # Create an instance of our embedder
    embedder = HybridEmbedder(gguf_path=GGUF_FILE, head_path=HEAD_FILE)

    # Use the embedder to get vectors
    prompt = "A sprawling fantasy city built into a giant tree."
    embedding = embedder.get_embedding(prompt)
    
    print("\n--- Inference Test ---")
    print(f"Prompt: '{prompt}'")
    print(f"Output dimension: {embedding.shape}")
    print(f"Vector excerpt: {embedding[0, :5]}...")
```
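
Once the embedder is instantiated, a common next step is semantic similarity. The sketch below reuses the `embedder` object from the script above and compares two related prompts with cosine similarity:

```python
import torch.nn.functional as F

emb_a = embedder.get_embedding("A sprawling fantasy city built into a giant tree.")
emb_b = embedder.get_embedding("An enormous tree with a city carved into its branches.")

# Both tensors have shape (1, 4096), so cosine_similarity returns a 1-element tensor
score = F.cosine_similarity(emb_a, emb_b).item()
print(f"Cosine similarity: {score:.4f}")
```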

## License

This repository is licensed under the **Apache License 2.0**.