@sometimesanotion Maybe this is useful to you! :)
CultriX PRO
CultriX's activity
Transform Your Personal Data into High-Quality Training Datasets with help from an LLM.
Inspired by a Reddit post (link below) I've created a script that converts custom documents into question-answer pairs for LLM fine-tuning.
What it does:
1. Split the input data into chunks (note: this is important, more below!)
2. QA generation: Creates contextually relevant question-answer pairs from each chunk.
3. Quality assurance: Validates outputs using both rule-based filters and LLM judges
4. Exports datasets in both CSV and JSON formats
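Steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the code from the linked gist: `passes_rules` and `export_pairs` are hypothetical names, and it assumes that `--question-chars` and `--answer-chars` act as minimum lengths, which the post does not state explicitly. The LLM-judge step is left out.

```python
import csv
import json
from pathlib import Path


def passes_rules(question: str, answer: str,
                 min_question_chars: int = 20,
                 min_answer_chars: int = 5) -> bool:
    """Cheap rule-based checks that can run before any LLM judge is called.

    Assumption: the length thresholds are minimums (hypothetical reading
    of --question-chars / --answer-chars).
    """
    question, answer = question.strip(), answer.strip()
    if len(question) < min_question_chars or len(answer) < min_answer_chars:
        return False
    # A generated question should actually be phrased as a question.
    return question.endswith("?")


def export_pairs(pairs: list[dict], output_dir: str, filename: str):
    """Write the same QA pairs to <filename>.json and <filename>.csv."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    json_path = out / f"{filename}.json"
    csv_path = out / f"{filename}.csv"
    json_path.write_text(
        json.dumps(pairs, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["question", "answer"])
        writer.writeheader()
        writer.writerows(pairs)
    return json_path, csv_path
```

Running the cheap checks first means the judge model only sees pairs that are at least well-formed, which should save API calls.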
Key features:
- Separate model configurations for generation and evaluation
- Configurable chunk sizes and question length
- Multi-language support (English and Dutch, but easy to add your own!)
- Local and cloud API compatibility
Quick start:
Place your documents (.txt for now) in an input folder and run:
python generate-rag-qav4.py \
--input-dir ./rag-input/ \
--output-dir ./rag-output/ \
--output-filename finetuning_qa_dataset \
--gen-model google/gemma-3-4b \
--gen-api-base http://127.0.0.1:1234/v1 \
--judge-model google/gemma-3-4b \
--judge-api-base http://127.0.0.1:1234/v1 \
--min-chunk-len 200 \
--question-chars 20 \
--answer-chars 5 \
--lang en
Pro tip: The --min-chunk-len parameter is critical. Too short (< 150 chars) and questions lack context; too long (> 1000 chars) and the model struggles with focus. Start with 200-400 characters and adjust based on your content type!
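To make the effect of the minimum chunk length concrete, here is a small sketch of one possible chunking strategy: merge paragraphs until a chunk reaches the minimum length. `chunk_text` is a hypothetical helper and the gist may well split text differently; this only illustrates the idea.

```python
def chunk_text(text: str, min_chunk_len: int = 200) -> list[str]:
    """Merge blank-line-separated paragraphs until each chunk has at
    least min_chunk_len characters (illustrative strategy only)."""
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        current = f"{current}\n\n{paragraph}" if current else paragraph
        if len(current) >= min_chunk_len:
            chunks.append(current)
            current = ""
    if current:
        # Leftover text is too short to stand alone: attach it to the
        # previous chunk, or keep it if it is the only text there is.
        if chunks:
            chunks[-1] = f"{chunks[-1]}\n\n{current}"
        else:
            chunks.append(current)
    return chunks
```

With this strategy a very long paragraph still ends up as a single chunk, so an upper bound would need separate handling.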
Use cases:
- Personal knowledge base fine-tuning
- Domain-specific QA dataset creation
- RAG system training data preparation
Note: The script includes comprehensive error handling and progress tracking, and allows resuming progress should the process get interrupted.
Note2: Original Reddit post that gave me the idea:
https://www.reddit.com/r/LocalLLaMA/s/avkdzk8NSn
The script can be found here:
https://gist.github.com/CultriX-Github/9d53565214d56b12b9002a56230d1c00
Oh, also definitely look into this! I don't know how I forgot to mention it in my first post; it's SUPER useful for RAG:
I know it's something very different from what you described, but have you read about AnythingLLM and their browser extension? I have been using it a lot and it works very well.
I also have been looking into MCP a lot lately (it seems to be very promising and imo is the next big thing happening right now) which could be used for this.
Finally, just because I found it super useful (although a bit unrelated), I also wanted to share this Python script that can turn pretty much any text data into an LLM dataset (even though it's technically not RAG-related. It's been a while since we talked haha): https://www.reddit.com/r/LocalLLaMA/comments/1ai2gby/comment/korunem/?share_id=DFUUUr1ZD2ZCKFGXwccvF
Supercharge your Retrieval Augmented Generation (RAG) pipelines with ease! I just finished working on the **RAG-Ready Content Scraper**, a mix between two very useful tools (RAG-Scraper and RepoMix); now available as a Hugging Face Space!
## What can it do?
This intuitive application helps you effortlessly gather and process content from various sources:
* **Webpages**: Scrape content from any URL (with RAG-Scraper). You can even control the scraping depth to fetch linked pages!
* **GitHub Repositories**: Process entire GitHub repos (using the power of Repomix) by simply providing a URL or `username/repo` ID.

## Various Output Formats
Convert the scraped content into a variety of RAG-friendly formats:
* **Markdown** (.md)
* **JSON** (.json)
* **CSV** (.csv)
* **Plain Text** (.txt)
* **PDF** (.pdf)
Perfect for building datasets, knowledge bases, and feeding your LLMs with high-quality, structured information.
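As a rough idea of what converting scraped content into these formats involves, here is a minimal sketch. It is not taken from the Space: `render` is a hypothetical function, it assumes each scraped page is a dict with `url` and `text` keys, and it skips PDF because that needs a third-party library.

```python
import csv
import io
import json


def render(pages: list[dict], fmt: str) -> str:
    """Serialize scraped pages (dicts with 'url' and 'text') to one format."""
    if fmt == "json":
        return json.dumps(pages, ensure_ascii=False, indent=2)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=["url", "text"], lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(pages)
        return buffer.getvalue()
    if fmt == "md":
        # One second-level heading per page, followed by its text.
        return "\n\n".join(f"## {p['url']}\n\n{p['text']}" for p in pages)
    if fmt == "txt":
        return "\n\n".join(p["text"] for p in pages)
    raise ValueError(f"Unsupported format: {fmt}")
```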
## Hope you enjoy!
Ready to streamline your RAG data preparation?
**Visit the RAG-Ready Content Scraper on Hugging Face Spaces:** [https://huggingface.co/spaces/CultriX/RAG-Scraper]
---
Feedback and feature requests are welcome! Let's build better RAG together.
- added sequence similarity reward (seems to work)
- improved vllm support (5x inference speed)
- adjusted reward scores (this helped with format/accuracy)
- can now push to hf hub (already pushed mine lol: Jaward/smollm2_360m_grpo_gsm8k_reasoner)
Code: https://github.com/Jaykef/ai-algorithms/blob/main/smollm2_360M_135M_grpo_gsm8k.ipynb
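For readers wondering what a sequence similarity reward could look like, here is a minimal sketch. It is an assumption about the general idea, not the notebook's implementation: it scores a completion by its character-level similarity to a reference answer using the standard library's `difflib.SequenceMatcher`, and the name `sequence_similarity_reward` and the `scale` parameter are made up for illustration.

```python
from difflib import SequenceMatcher


def sequence_similarity_reward(completion: str, reference: str,
                               scale: float = 1.0) -> float:
    """Return a reward in [0, scale] based on how closely the completion
    matches the reference string (1.0 * scale for an exact match)."""
    ratio = SequenceMatcher(None, completion.strip(), reference.strip()).ratio()
    return scale * ratio
```

A reward like this gives partial credit for near misses, which can provide a smoother training signal than a strict exact-match check.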
https://huggingface.co/blog/CultriX/reverse-engineering-customgpts
It now includes:
- a live stream of the progress being made on the task (see included video),
- The following components:
1. Automatic prompt optimization
2. An orchestrator that dynamically decides which agent to call, taking feedback from a human into account (human-in-the-loop)
3. A coding agent to complete the task
4. A code reviewing agent that iteratively provides feedback on the code generated by the coding agent until the code meets the required criteria, after which it is approved.
5. A testing agent that tests the approved code or provides information on how to test it.
6. A documentation agent that provides documentation and a help message for the approved and tested code.
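The coder/reviewer part of that list boils down to a loop like the one sketched here. This is a simplified illustration with plain callables standing in for the LLM agents; `run_review_loop` is a hypothetical name, and treating a review that ends in "APPROVE" as approval is an assumption about how the approval signal is detected.

```python
from typing import Callable


def run_review_loop(task: str,
                    coder: Callable[[str, str], str],
                    reviewer: Callable[[str, str], str],
                    max_iterations: int = 3) -> tuple[str, bool, int]:
    """Ask the coder for code, then revise it until the reviewer approves.

    Returns (latest code, whether it was approved, rounds used).
    """
    feedback = ""
    code = ""
    for iteration in range(max_iterations):
        # The coder sees the task plus the reviewer's latest feedback.
        code = coder(task, feedback)
        feedback = reviewer(task, code)
        # Assumption: the reviewer ends its message with APPROVE when satisfied.
        if feedback.strip().endswith("APPROVE"):
            return code, True, iteration + 1
    return code, False, max_iterations
```

Capping the number of rounds keeps a reviewer that never approves from burning through API calls indefinitely.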
OK nevermind I clicked that blog link and this is hella damn interesting how come I never heard of this haha. It states some really promising things right there... :o
the model that calls itself "Qwenconceited-14B-v13-DeepSuffering". <-- That cracked me up, lol!
And yeah very interesting but I'm going to have to read that again at another moment to fully understand all it is saying haha. Sounds like interesting stuff though!
Oh yeah for sure I'll hit you up sometime! Just to be clear I wasn't asking you to upload all your personal tweaks that you've spent probably weeks on to improve them haha. I was just curious about some of the things you said. For example, when you said " Extract a small LoRA from this" I was a little bit confused actually haha. As in: I have no idea how to do that, let alone apply it to smoothen out other models in the merge.
I know about adapter models and that you can create those with LoRA fine-tuning which you can either load on top during inference or you can merge with the base model, but extracting a LoRA from an existing model is kinda confusing me haha (sorry!). It sounds interesting though! Do I understand correctly that this would enable you to kind of "operate" on the model more precisely and with a lot less compute required (aka: more merges you can make and test in a given time window)?
Would you mind doing a writeup about your customized mergekit workflow, or do you prefer to keep some of the secret sauce to yourself? ;)
Or I guess the ReadMe as nobody can read that lol: https://huggingface.co/spaces/CultriX/MultiAgent-CodeTask/blob/main/README.md
This version does not rely on AutoGen.
The user simply enters their OPENAI_API_KEY and a task and the Space goes to work, employing:
- 1. a prompt-enhancer agent,
- 2. an orchestrator agent,
- 3. a coder agent,
- 4. a code-reviewing agent and
- 5. a code documentation generator agent.
See below image for an example workflow:
CultriX/MultiAgent-CodeTask
I made an improved version of this that does not rely on autogen and produces some really nice code! :)
https://huggingface.co/spaces/CultriX/MultiAgent-CodeTask
DEMO
========== USER PROMPT ==========
Write a random password generator in python.
========== Multi-Agent Conversation ==========
[Prompt Optimizer]: Optimized Task:
Instructions for Password Generator in Python
Objective: Create a Python script that generates secure, random passwords based on user-defined criteria.
Requirements:
- The script should allow users to customize the password:
- Length: Specify the desired length of the password (e.g., 12-16 characters).
- Character Types: Allow users to include or exclude uppercase letters, lowercase letters, numbers, and special characters.
- The password should be randomly generated and accessible for immediate use.
Output Specifications:
- The generated password should be printed to the console.
- The format should be simple: just the password string without any additional text.
Style:
- Write clear, maintainable Python code with appropriate comments explaining the functionality of key sections.
- Use descriptive variable names and adhere to PEP 8 formatting standards.
Example of Desired Output:
If a user requests a 12-character password with all character types included, the script might output something like:
D4v!7l@rTzq%
Guidance for Implementation:
- Start by importing necessary libraries (e.g., `random` and `string`).
- Define a function that takes parameters for password length and character type preferences.
- Use a combination of built-in functions and methods to dynamically create the password based on the specified criteria.
Here's a brief code outline to guide your implementation:
import random
import string

def generate_password(length=12, use_uppercase=True, use_lowercase=True, use_numbers=True, use_special_chars=True):
    # Create a pool of characters based on user preferences
    charset = ''
    if use_uppercase:
        charset += string.ascii_uppercase
    if use_lowercase:
        charset += string.ascii_lowercase
    if use_numbers:
        charset += string.digits
    if use_special_chars:
        charset += string.punctuation
    # Ensure the charset is not empty
    if not charset:
        raise ValueError("At least one character type must be selected.")
    # Generate a random password
    password = ''.join(random.choice(charset) for _ in range(length))
    print(password)
Feel free to modify the above snippet based on additional features or functionality you may want to incorporate. Ensure that the code is fully functional, handles user input safely, and returns a strong password.
Here's a step-by-step plan for completing the Password Generator task while coordinating tasks among a Coder Agent, a Code Reviewer Agent, and a Documentation Agent:
Step 1: Task Assignment
- Coder Agent: Responsible for writing the initial implementation of the password generator based on the provided requirements.
- Code Reviewer Agent: Responsible for reviewing the code for correctness, adherence to coding standards, and overall quality.
- Documentation Agent: Responsible for creating user documentation and comments within the code.
Step 2: Coder Agent - Implementation
- Step 2.1: Import necessary libraries (`random` and `string`).
- Step 2.2: Define the function `generate_password()` with parameters for password length and character type preferences.
- Step 2.3: Create a character set based on user preferences:
  - Initialize an empty string for `charset`.
  - Use `if` statements to append to `charset` depending on whether the user has selected uppercase letters, lowercase letters, numbers, and special characters.
- Step 2.4: Check if `charset` is empty and raise a `ValueError` if it is.
- Step 2.5: Generate the password using a list comprehension that iterates for the specified length, selecting random characters from `charset`.
- Step 2.6: Print the generated password.
Step 3: Coder Agent - Testing
- Step 3.1: Write test cases to validate that the password generation meets the specified criteria (length, character types).
- Step 3.2: Ensure that the function behaves as expected with various combinations of parameters.
Step 4: Code Reviewer Agent - Code Review
- Step 4.1: Review the code for:
- Correct functionality (does it generate a valid password?).
- Adherence to PEP 8 formatting standards (indentation, line length, etc.).
- Appropriate comments for understanding the code's functionality.
- Step 4.2: Provide feedback to the Coder Agent regarding any improvements or necessary changes.
Step 5: Coder Agent - Revise Code
- Step 5.1: Implement any changes or improvements suggested during the code review.
- Step 5.2: Re-test the code to ensure that revisions do not introduce any errors.
Step 6: Documentation Agent - Document the Code
- Step 6.1: Write clear and concise comments within the code, explaining the key sections and logic.
- Step 6.2: Create a README file or a user guide that explains how to use the password generator script, including:
- Functionality overview.
- Instructions for specifying parameters.
- Examples of usage and output.
Step 7: Final Review and Integration
- Step 7.1: Perform a final review of the combined code and documentation to ensure that everything is coherent and aligned with the original task requirements.
- Step 7.2: Package the final version of the code and documentation for delivery or deployment.
Step 8: Completion
- Step 8.1: Confirm all tasks are complete and that the password generator meets the specified objectives.
- Step 8.2: Consider potential further enhancements or features for future iterations based on user feedback.
By following this structured plan and assigning specific roles and responsibilities, the collaborative effort from the Coder Agent, Code Reviewer Agent, and Documentation Agent will lead to a well-crafted, maintainable, and user-friendly password generator in Python.
import random
import string

def generate_password(length=12, use_uppercase=True, use_lowercase=True, use_numbers=True, use_special_characters=True):
    """Generate a random password based on user preferences."""
    charset = ""
    if use_uppercase:
        charset += string.ascii_uppercase
    if use_lowercase:
        charset += string.ascii_lowercase
    if use_numbers:
        charset += string.digits
    if use_special_characters:
        charset += string.punctuation
    if not charset:
        raise ValueError("At least one character type must be selected for password generation.")
    password = ''.join(random.choice(charset) for _ in range(length))
    print(password)

# Testing the password generator
if __name__ == "__main__":
    print("Generated Passwords:")
    generate_password(length=16, use_uppercase=True, use_lowercase=True, use_numbers=True, use_special_characters=True)
    generate_password(length=12, use_uppercase=False, use_lowercase=True, use_numbers=True, use_special_characters=False)
    generate_password(length=20, use_uppercase=True, use_lowercase=False, use_numbers=False, use_special_characters=True)
[Code Reviewer]: Review (Iteration 0):
The provided code meets the task specifications and correctly implements the password generator based on user-defined criteria. Here is the feedback based on the requirements and code quality:
Functionality: The function `generate_password` accepts the correct parameters for customizing password length and character types, allowing appropriate flexibility for the user.
Character Sets: The character sets for uppercase letters, lowercase letters, numbers, and special characters are correctly created based on user preferences.
Error Handling: The code properly raises a `ValueError` if no character types are selected, which is essential for ensuring that a password can be generated.
Randomness: The generation of the password using `random.choice` correctly ensures randomness in the selection of characters.
Output: The password is printed directly to the console as specified, without any additional text or formatting.
PEP 8 Compliance: The variable names are descriptive, and the overall formatting adheres to PEP 8 standards. The use of double quotes for strings and the overall indentation are consistent and clear.
Comments: The code includes a comment block describing the function, which is beneficial for maintainability.
Main Guard: The `if __name__ == "__main__":` construct is used correctly, allowing the code to be tested when run as a script.
Testing: Sample calls to the password generator function provide a way to check its functionality during development. This is good for demonstration but may not be appropriate in a production setting if this code were to be used as a library.
Overall, the code is well-structured and addresses the task requirements effectively. There are no significant issues to note.
APPROVE
Documentation for generate_password
Function: generate_password
Purpose: Generates a random password based on user-specified criteria.
Parameters:
- `length` (int): Desired length of the password (default is 12).
- `use_uppercase` (bool): Include uppercase letters (default is True).
- `use_lowercase` (bool): Include lowercase letters (default is True).
- `use_numbers` (bool): Include numeric digits (default is True).
- `use_special_characters` (bool): Include special characters (default is True).
Returns:
- A randomly generated password printed to the console.
Exceptions:
- Raises `ValueError` if no character type is selected.
Notes:
- The function constructs a character set based on the selected options and generates a password by randomly selecting characters from this set.
Help Message
Usage: generate_password(length=12, use_uppercase=True, use_lowercase=True, use_numbers=True, use_special_characters=True)
Generate a random password based on preferences:
- length: Length of the password (default: 12)
- use_uppercase: Include uppercase letters? (default: True)
- use_lowercase: Include lowercase letters? (default: True)
- use_numbers: Include numbers? (default: True)
- use_special_characters: Include special characters? (default: True)
Example: generate_password(16, use_uppercase=True)
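One possible follow-up to the demo output, not something the Space produced: Python's `random` module is not intended for security-sensitive use, so for real passwords a variant built on the standard library's `secrets` module is probably the safer choice. The sketch below is a hypothetical adaptation (`generate_password_secure` is a made-up name) that also returns the password instead of printing it, so it can be reused as a library function.

```python
import secrets
import string


def generate_password_secure(length=12, use_uppercase=True, use_lowercase=True,
                             use_numbers=True, use_special_characters=True):
    """Return a random password drawn from a cryptographically secure source."""
    charset = ""
    if use_uppercase:
        charset += string.ascii_uppercase
    if use_lowercase:
        charset += string.ascii_lowercase
    if use_numbers:
        charset += string.digits
    if use_special_characters:
        charset += string.punctuation
    if not charset:
        raise ValueError("At least one character type must be selected.")
    if length < 1:
        raise ValueError("Password length must be at least 1.")
    # secrets.choice draws from the operating system's secure random source.
    return "".join(secrets.choice(charset) for _ in range(length))
```

Usage would then look like `print(generate_password_secure(16))`, leaving the decision to print or store the result to the caller.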
Congrats! Very nicely done :)!