---
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3-Embedding-0.6B
tags:
- encoder
- Text Generation
- embedding
---
# Qwen3-0.6B-T5-xxl-GGUF
## Model Description
This repository provides GGUF-quantized versions of the `Qwen3-0.6B-T5-xxl` model body, designed for fast, low-resource inference on CPUs.
The goal of this project is to replicate the embedding outputs of `google/t5-v1_1-xxl` using a highly optimized pipeline.
To make this repository fully functional out of the box, the fine-tuned **projection head is also included**. This lets you combine the GGUF body with the PyTorch-based head to get the final embeddings. The full pipeline is: text → GGUF body (base embedding) → mean pooling → projection head → 4096-dimensional vector.
## Repository Contents
- `qwen3-0.6B-Q4_K_M.gguf`: The model body quantized with the Q4_K_M method (other quantizations may be added).
- `projection_head/projection_head.pth`: The PyTorch state dictionary for the final projection layer.
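If you prefer to fetch the files programmatically, here is a minimal sketch using `huggingface_hub` (install with `pip install huggingface_hub`; the repo id below is a placeholder, so substitute this repository's actual id):
```python
from huggingface_hub import hf_hub_download

# NOTE: placeholder repo id; replace with this repository's actual id.
REPO_ID = "your-username/Qwen3-0.6B-T5-xxl-GGUF"

gguf_path = hf_hub_download(repo_id=REPO_ID, filename="qwen3-0.6B-Q4_K_M.gguf")
head_path = hf_hub_download(repo_id=REPO_ID, filename="projection_head/projection_head.pth")
```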
## How to Use: Hybrid GGUF + PyTorch Pipeline
This tutorial shows how to use the GGUF model for fast base embedding generation and the PyTorch head for the final projection.
### Step 1: Prerequisites
First, install the necessary libraries. `llama-cpp-python` is required to run GGUF models.
```bash
pip install llama-cpp-python torch numpy
```
### Step 2: Inference Script
The following script encapsulates the entire hybrid pipeline into a convenient class. You can save it as a `.py` file and import it into your projects.
```python
import torch
from torch import nn
from llama_cpp import Llama
import numpy as np
class HybridEmbedder:
"""
A class that encapsulates the hybrid embedding pipeline.
It loads the models once at initialization for optimal performance.
"""
def __init__(self, gguf_path: str, head_path: str, n_ctx: int = 512):
print("Initializing HybridEmbedder...")
# 1. Load the GGUF body
print(f"Loading GGUF body from: {gguf_path}")
self.body_model = Llama(
model_path=gguf_path,
embedding=True,
n_ctx=n_ctx,
verbose=False
)
print(" -> GGUF body loaded.")
# 2. Load the PyTorch projection head
print(f"Loading projection head from: {head_path}")
input_dim = self.body_model.n_embd()
hidden_dim = 2048
output_dim = 4096
self.head_model = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.GELU(),
nn.Dropout(0.1),
nn.Linear(hidden_dim, output_dim)
)
        self.head_model.load_state_dict(torch.load(head_path, map_location="cpu"))
self.head_model.eval()
print(" -> Projection head loaded.")
print("\n✅ Embedder is ready to use.")
    def get_embedding(self, text: str) -> torch.Tensor:
        # a) Get embeddings from the GGUF body. Depending on the model's
        #    pooling type, llama-cpp-python returns either one vector per
        #    token or a single pooled vector.
        token_embeddings = np.asarray(self.body_model.embed(text), dtype=np.float32)
        # b) Apply mean pooling only when we received per-token vectors
        if token_embeddings.ndim > 1:
            sentence_embedding = token_embeddings.mean(axis=0)
        else:
            sentence_embedding = token_embeddings
        # c) Convert to a PyTorch tensor and add a batch dimension
        sentence_tensor = torch.from_numpy(sentence_embedding).unsqueeze(0)
        # d) Pass through the projection head
        with torch.no_grad():
            final_embedding = self.head_model(sentence_tensor)
        return final_embedding
# --- Example Usage ---
if __name__ == "__main__":
# Define the paths to your local model files
GGUF_FILE = "qwen3-0.6B-Q4_K_M.gguf"
HEAD_FILE = "./projection_head/projection_head.pth"
# Create an instance of our embedder
embedder = HybridEmbedder(gguf_path=GGUF_FILE, head_path=HEAD_FILE)
# Use the embedder to get vectors
prompt = "A sprawling fantasy city built into a giant tree."
embedding = embedder.get_embedding(prompt)
print("\n--- Inference Test ---")
print(f"Prompt: '{prompt}'")
print(f"Output dimension: {embedding.shape}")
print(f"Vector excerpt: {embedding[0, :5]}...")
```
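Since the head is trained to approximate `google/t5-v1_1-xxl` embeddings, a quick sanity check is to compare two related prompts with cosine similarity. Below is a minimal sketch, assuming `HybridEmbedder` from the script above is in scope and the model files sit at the paths used in the example:
```python
import torch.nn.functional as F

embedder = HybridEmbedder(
    gguf_path="qwen3-0.6B-Q4_K_M.gguf",
    head_path="./projection_head/projection_head.pth",
)

a = embedder.get_embedding("A sprawling fantasy city built into a giant tree.")
b = embedder.get_embedding("An enormous tree with a city carved into its trunk.")

# Semantically close prompts should score noticeably higher than unrelated ones.
print(f"cosine similarity: {F.cosine_similarity(a, b).item():.4f}")
```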
## License
This repository is licensed under the **Apache License 2.0**.