From Text to Meaning: The Power of Embeddings with GraphBit
Introduction
“How do AI models actually understand what we mean?”
If you’ve ever asked yourself that question, the answer lies in embeddings — the mathematical magic that lets AI represent meaning, context, and similarity between pieces of text.
But while embeddings are powerful, building a system around them can be overwhelming. That’s where GraphBit comes in.
GraphBit is an agentic AI framework that simplifies turning text into knowledge, from preprocessing and embedding through semantic search and retrieval.
Whether you’re building a research paper summarizer, a chatbot, or a knowledge base assistant, this guide will walk you through every step to get started with GraphBit and use embeddings effectively.
What Is GraphBit?
At its heart, GraphBit is an open-source Python toolkit that bridges traditional text processing with modern AI workflows.
Think of it as a Lego kit for AI applications — you can snap together components to build systems that understand text, not just read it.
With GraphBit, You Can
- Load and preprocess documents (PDFs, Markdown, plain text, etc.)
- Split large text into context-friendly chunks
- Generate embeddings using OpenAI or similar models
- Store and query embeddings in vector stores such as FAISS, ChromaDB, or PGVector
It’s ideal for projects involving Retrieval-Augmented Generation (RAG), document summarization, and semantic search.
Step 1: Install GraphBit and Set Up Your Environment
Let’s start simple.
pip install graphbit
Then create a .env file in your project folder and add your OpenAI key:
OPENAI_API_KEY=your_openai_api_key_here
Finally, load your environment variables in Python:
from dotenv import load_dotenv
load_dotenv()
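If you want to be sure the key was actually picked up, a quick check with the standard library does it:
import os
# Fail fast if the key never made it into the environment
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY not found; check your .env file"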
That’s all the setup you need before diving in.
Step 2: Load Your Documents
GraphBit’s DocumentLoader takes care of reading text from various file formats.
from graphbit import DocumentLoader, DocumentLoaderConfig
loader_config = DocumentLoaderConfig()
loader = DocumentLoader(loader_config)
# Point these at your own data; the values below are just an example
filepath = "docs/research_paper.pdf"
datatype = "pdf"
documents = loader.load_documents(filepath, datatype)
Now you’ve got all your data neatly loaded into memory, ready for processing.
If you’re working with PDFs, GraphBit automatically merges multi-page documents and extracts text cleanly — no messy formatting or broken sentences.
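Before moving on, a cheap sanity check confirms the load worked (assuming load_documents returns a list of documents):
# Confirm something actually came back before chunking
print(f"Loaded {len(documents)} document(s)")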
Step 3: Split Text into Contextual Chunks
Embedding models perform best when the text is broken into smaller, meaningful pieces.
from graphbit import TextSplitter, TextSplitterConfig
splitter_config = TextSplitterConfig(
    chunk_size=500,
    chunk_overlap=50
)
splitter = TextSplitter(splitter_config)
chunks = splitter.split_documents(documents)
Why Use Overlaps?
Small overlaps (like 50 tokens) help preserve continuity between chunks, ensuring smoother summarization or question-answering later.
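You can see the overlap directly by printing the seam between two adjacent chunks (assuming each chunk is a plain string; adjust if GraphBit returns chunk objects):
# The tail of one chunk should reappear at the head of the next
print(repr(chunks[0][-60:]))
print(repr(chunks[1][:60]))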
Step 4: Generate Embeddings
Now it’s time for the core magic — turning text into vectors.
Embeddings are numerical fingerprints that capture meaning. Two sentences that say the same thing will have nearly identical embeddings, even if they use different words.
from graphbit import EmbeddingConfig, EmbeddingClient
embedding_config = EmbeddingConfig(model="text-embedding-3-small")
embedding_client = EmbeddingClient(embedding_config)
embeddings = [embedding_client.embed(chunk) for chunk in chunks]
Each chunk is now represented as a high-dimensional vector — a precise, machine-understandable version of your text.
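To make that concrete, you can embed two paraphrases and one unrelated sentence, then compare them with the client's similarity helper (the same one used in Step 6). Exact scores vary by model, but the paraphrases should land much closer together:
a = embedding_client.embed("The cat sat on the mat.")
b = embedding_client.embed("A cat was sitting on a rug.")
c = embedding_client.embed("Quarterly revenue grew by 12%.")
# Paraphrases score high; unrelated text scores noticeably lower
print("paraphrase pair:", embedding_client.similarity(a, b))
print("unrelated pair: ", embedding_client.similarity(a, c))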
Step 5: Store Embeddings in a Vector Database
To make embeddings useful, we need a way to store and search them efficiently.
Traditional databases have no native way to rank rows by vector similarity, and brute-force scans stop scaling once you have many thousands of 1,536-dimensional vectors; vector databases are built for exactly this.
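Under the hood, "similarity search" just means ranking stored vectors by a distance measure, most often cosine similarity. A brute-force NumPy sketch (NumPy is an extra dependency here, shown only to make the idea concrete) fits in a few lines; a vector database performs the same ranking, but with an index instead of a full scan:
import numpy as np
def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||); values near 1 mean "very similar"
    u, v = np.asarray(u), np.asarray(v)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
print(cosine_similarity([1.0, 0.0], [0.7, 0.7]))  # ~0.7071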
GraphBit supports multiple vector backends such as FAISS, ChromaDB, PGVector, and more. In this example, we’ll use PGVector.
Using PGVector
import psycopg2
import json
# Connect to PostgreSQL (adjust the credentials for your setup)
conn = psycopg2.connect(
    dbname="vector_db",
    user="postgres",
    password="your_password",
    host="localhost",
    port=5432
)
cur = conn.cursor()
# Enable PGVector and create the table
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS vector_data (
        id SERIAL PRIMARY KEY,
        item_id TEXT,
        embedding VECTOR(1536),
        metadata JSONB
    );
""")
cur.execute("""
    CREATE INDEX IF NOT EXISTS idx_embedding_vector ON vector_data
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
""")
conn.commit()
# Insert embeddings; pgvector accepts the bracketed text form, e.g. '[0.1, 0.2, ...]',
# which str() happens to produce for a Python list of floats
for idx, (chunk, emb) in enumerate(zip(chunks, embeddings)):
    cur.execute(
        """
        INSERT INTO vector_data (item_id, embedding, metadata)
        VALUES (%s, %s, %s)
        """,
        (f"chunk_{idx}", str(emb), json.dumps({"text": chunk}))
    )
conn.commit()
Now your data is ready for semantic querying.
Step 6: Search and Retrieve by Meaning
This is where GraphBit starts feeling magical — instead of keyword matching, you can now search by concept.
import ast
# Fetch all stored vectors
cur.execute("SELECT item_id, embedding, metadata FROM vector_data;")
all_rows = cur.fetchall()
# Create an embedding for the query
query = "What are the main findings of the research?"
query_embedding = embedding_client.embed(query)
best_score = -1
best_item = None
results = []
for item_id, embedding_vec, metadata in all_rows:
    # pgvector returns the vector column as text, e.g. '[0.1, 0.2, ...]'
    if isinstance(embedding_vec, str):
        embedding_vec = ast.literal_eval(embedding_vec)
    score = embedding_client.similarity(query_embedding, embedding_vec)
    results.append((score, item_id, metadata))
    if score > best_score:
        best_score = score
        best_item = (item_id, metadata)
# Sort by score only (metadata dicts are not comparable on ties)
results.sort(key=lambda r: r[0], reverse=True)
top_3 = results[:3]
print(f"Most similar document: {best_item[0]}, score: {best_score:.4f}")
for score, item_id, metadata in top_3:
    print(f"  {item_id}: {score:.4f}")
You’ll get the top 3 most semantically similar results, even if your query doesn’t contain the same words.
That’s the power of embeddings — meaning-based retrieval instead of keyword matching.
Under the hood, GraphBit's similarity helper computes the cosine similarity between the query embedding and each stored embedding to find the closest matches.
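One caveat: the loop above pulls every row back into Python, which is fine for small collections but bypasses the ivfflat index created in Step 5. For larger tables, you can push the ranking into Postgres instead; pgvector's <=> operator returns cosine distance (lower means closer), so a minimal sketch looks like this:
# Let Postgres rank matches with the ivfflat index (cosine distance: lower = closer)
cur.execute(
    """
    SELECT item_id, metadata, embedding <=> %s::vector AS distance
    FROM vector_data
    ORDER BY distance
    LIMIT 3;
    """,
    (str(query_embedding),),
)
for item_id, metadata, distance in cur.fetchall():
    print(f"{item_id}: distance={distance:.4f}")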
Step 7: Build Something Smarter on Top
Once your retrieval pipeline is in place, you can layer intelligence on top using OpenAI or other LLMs.
Here’s how to build a quick summarization or RAG pipeline:
from graphbit import LlmConfig, LlmClient
import os
# Pass the actual key value, not the literal string "OPENAI_API_KEY"
llm_config = LlmConfig.openai(os.getenv("OPENAI_API_KEY"), model="gpt-4o-mini")
llm_client = LlmClient(llm_config)
# Gather context from top results
context = " ".join([r[2]['text'] for r in top_3]) # r[2] = metadata
# Build a summarization prompt
prompt = f"Summarize the following context:\n\n{context}"
# Generate the response
response = llm_client.complete(prompt)
print("Summary:\n", response)
And just like that, you’ve built the foundation of a knowledge-aware AI system — capable of summarizing research papers, answering domain-specific questions, or powering chatbots that know your data inside-out.
Why GraphBit?
There are many AI frameworks out there — so what makes GraphBit stand out?
Key Advantages
- Lightweight & Modular – No bloated dependencies.
- Consistent APIs – Unified design across loaders, splitters, embedders, and vector stores.
- Switchable Backends – Easily switch between FAISS, ChromaDB, or PGVector.
- Production-Ready – Integrates cleanly with FastAPI backends or LangChain extensions.
In short, GraphBit lets you focus on building intelligence, not infrastructure.
Going Beyond: Combine GraphBit with Your Stack
GraphBit plays nicely with popular tools and frameworks:
- FastAPI – Turn your retrieval logic into REST APIs for production.
- Streamlit – Build lightweight UI prototypes for demos.
- LangChain / LlamaIndex – Use GraphBit embeddings inside larger knowledge graph workflows.
For example, you could use FastAPI to expose an endpoint like /search that returns GraphBit results as JSON, or integrate Streamlit for live semantic search visualization.
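As a rough sketch (the endpoint name, parameters, and response shape are all illustrative, and it reuses embedding_client and cur from the earlier steps), a /search endpoint might look like this:
from fastapi import FastAPI
app = FastAPI()
@app.get("/search")
def search(q: str, k: int = 3):
    # Embed the incoming query, then let pgvector rank matches server-side
    query_embedding = embedding_client.embed(q)
    cur.execute(
        """
        SELECT item_id, metadata, embedding <=> %s::vector AS distance
        FROM vector_data
        ORDER BY distance
        LIMIT %s;
        """,
        (str(query_embedding), k),
    )
    return [
        {"item_id": item_id, "metadata": metadata, "distance": distance}
        for item_id, metadata, distance in cur.fetchall()
    ]
In a real service you would manage the database connection per request (for example, with a connection pool) rather than sharing one module-level cursor, but the shape of the endpoint stays the same.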
Conclusion
Embeddings are the backbone of intelligent AI: they let systems built on models like ChatGPT retrieve information by meaning rather than by exact wording.
GraphBit takes this complex process and makes it approachable.
You’ve Now Learned How to
- Load and clean documents
- Split them into structured chunks
- Generate embeddings using OpenAI
- Store and query them in a vector database such as PGVector
- Retrieve and summarize results with context
From here, you can build research tools, AI documentation bots, or enterprise knowledge assistants that operate entirely on your own data.
GraphBit isn’t just another library — it’s your gateway to context-aware AI development.