CultriX (CultriX)

replied to their post 6 days ago

@sometimesanotion Maybe this is useful to you! :)

posted an update 6 days ago

Post

850

Script for QA-style dataset generation from custom data:
Transform Your Personal Data into High-Quality Training Datasets with help from a LLM.

Inspired by a Reddit post (link below) I've created a script that converts custom documents into question-answer pairs for LLM fine-tuning.
What it does:
1. Split the input data into chunks (note: this is important, more below!)
2. QA generation: Creates contextually relevant question-answer pairs from each chunk.
3. Quality assurance: Validates outputs using both rule-based filters and LLM judges
4. Exports datasets in both CSV and JSON formats

Key features:
- Separate model configurations for generation and evaluation
- Configurable chunk sizes and question length
- Multi-language support (English and Dutch, but easy to add your own!)
- Local and cloud API compatibility

Quick start:
Place your documents (.txt for now) in an input folder and run:

python generate-rag-qav4.py \
  --input-dir ./rag-input/ \
  --output-dir ./rag-output/ \
  --output-filename finetuning_qa_dataset \
  --gen-model google/gemma-3-4b \
  --gen-api-base http://127.0.0.1:1234/v1 \
  --judge-model google/gemma-3-4b \
  --judge-api-base http://127.0.0.1:1234/v1 \
  --min-chunk-len 200 \
  --question-chars 20 \
  --answer-chars 5 \
  --lang en

Pro tip: The --min-chunk-len parameter is critical. Too short (< 150 chars) and questions lack context; too long (> 1000 chars) and the model struggles with focus. Start with 200-400 characters and adjust based on your content type!

Use cases:
- Personal knowledge base fine-tuning
- Domain-specific QA dataset creation
- RAG system training data preparation

Note: The script includes comprehensive error handling and progress tracking, and allows resuming progress should the process get interrupted.

Note2: Original Reddit post that gave me the idea:
https://www.reddit.com/r/LocalLLaMA/s/avkdzk8NSn

The script can be found here:
https://gist.github.com/CultriX-Github/9d53565214d56b12b9002a56230d1c00

2 replies

·

replied to their post 7 days ago

Oh also definitely look into this! Don't know how I forgot to mention it in my first post it's SUPER useful for RAG:

https://github.com/unclecode/crawl4ai

replied to their post 7 days ago

I know it's something very different from what you described, but have you read about AnythingLLM and their browser extension? I have been using it a lot and it works very well.

I also have been looking into MCP a lot lately (it seems to be very promising and imo is the next big thing happening right now) which could be used for this.

Finally, just because I found it super useful (although a bit unrelated), this python script that can turn pretty much any text data into a LLM-dataset is something I wanted to share with you as well (even though technically not RAG-related. It's been a while since we talked haha): https://www.reddit.com/r/LocalLLaMA/comments/1ai2gby/comment/korunem/?share_id=DFUUUr1ZD2ZCKFGXwccvF

posted an update 12 days ago

Post

348

# Announcing the RAG-Ready Conteaant Scraper! 🚀

Supercharge your Retrieval Augmented Generation (RAG) pipelines with ease! I just finished working on the **RAG-Ready Content Scraper**, a mix between two very useful tools (RAG-Scraper and RepoMix); now available as a Hugging Face Space!

## What can it do?

This intuitive application helps you effortlessly gather and process content from various sources:

* 🌐 **Webpages**: Scrape content from any URL (with RAG-Scraper). You can even control the scraping depth to fetch linked pages!
* 📂 **GitHub Repositories**: Process entire GitHub repos (using the power of Repomix) by simply providing a URL or username/repo ID.

## Various Output Formats

Convert the scraped content into a variety of RAG-friendly formats:

* **Markdown** (.md)
* **JSON** (.json)
* **CSV** (.csv)
* **Plain Text** (.txt)
* **PDF** (.pdf)

Perfect for building datasets, knowledge bases, and feeding your LLMs with high-quality, structured information.

## Hope you enjoY!

Ready to streamline your RAG data preparation?

👉 **Visit the RAG-Ready Content Scraper on Hugging Face Spaces:** [https://huggingface.co/spaces/CultriX/RAG-Scraper]

---

Feedback and feature requests are welcome! Let's build better RAG together.

5 replies

·

reacted to Jaward's post with 🔥👍 3 months ago

Post

4987

made a few improvements on custom grpo trainer:
- added sequence similarity reward (seems to work)
- improved vllm support (5x inference speed)
- adjusted reward scores (this helped with format/accuracy)
- can now push to hf hub (already pushed mine lol: Jaward/smollm2_360m_grpo_gsm8k_reasoner)

Code: https://github.com/Jaykef/ai-algorithms/blob/main/smollm2_360M_135M_grpo_gsm8k.ipynb

reacted to their post with 👍 4 months ago

Post

1453

Reverse-engineering most custom-GPT's is stupidly simple:

https://huggingface.co/blog/CultriX/reverse-engineering-customgpts

reacted to their post with 🔥👍❤️ 4 months ago

Post

2684

Final upgrade to the Multi-Agent Task Completion Space: CultriX/MultiAgent-CodeTask .

It now includes :
- a live stream of the progress being made on the task (see included video),
- The following components:
1. Automatic prompt optimization
2. An orchestrator deciding which agent to call dynamically including feedback from a human (human-in-the-loop)
3. A coding agent to complete the task
4. A code reviewing agent to iteratively provide feedback to improve the code generated by the coding agent until the code meets the required criteria after which it is approved.
5. A testing agent that tests the approved code or provides information on how to test it.
6. A documentation agent that provides documentation and a help message for the approved and tested code.

posted an update 4 months ago

Post

2684

Final upgrade to the Multi-Agent Task Completion Space: CultriX/MultiAgent-CodeTask .

It now includes :
- a live stream of the progress being made on the task (see included video),
- The following components:
1. Automatic prompt optimization
2. An orchestrator deciding which agent to call dynamically including feedback from a human (human-in-the-loop)
3. A coding agent to complete the task
4. A code reviewing agent to iteratively provide feedback to improve the code generated by the coding agent until the code meets the required criteria after which it is approved.
5. A testing agent that tests the approved code or provides information on how to test it.
6. A documentation agent that provides documentation and a help message for the approved and tested code.

replied to sometimesanotion's post 4 months ago

OK nevermind I clicked that blog link and this is hella damn interesting how come I never heard of this haha. It states some really promising things right there... :o

replied to sometimesanotion's post 4 months ago

the model that calls itself "Qwenconceited-14B-v13-DeepSuffering". <-- That cracked me up, lol!

And yeah very interesting but I'm going to have to read that again at another moment to fully understand all it is saying haha. Sounds like interesting stuff though!

replied to sometimesanotion's post 4 months ago

Oh yeah for sure I'll hit you up sometime! Just to be clear I wasn't asking you to upload all your personal tweaks that you've spent probably weeks on to improve them haha. I was just curious about some of the things you said. For example, when you said " Extract a small LoRA from this" I was a little bit confused actually haha. As in: I have no idea how to do that, let alone apply it to smoothen out other models in the merge.

I know about adapter models and that you can create those with LoRA fine-tuning which you can either load on top during inference or you can merge with the base model, but extracting a LoRA from an existing model is kinda confusing me haha (sorry!). It sounds interesting though! Do I understand correctly that this would enable you to kind of "operate" on the model more precisely and with a lot less compute required (aka: more merges you can make and test in a given time window)?

replied to sometimesanotion's post 4 months ago

Would you mind doing a writeup about your customized mergekit workflow, or do you prefer to keep some of the secret sauce to yourself? ;)

replied to their post 4 months ago

Or I guess the ReadMe as nobody can read that lol: https://huggingface.co/spaces/CultriX/MultiAgent-CodeTask/blob/main/README.md

posted an update 4 months ago

Post

1769

# Multi-Agent Collaboration for Coding Tasks - Updated Space!

This version does not rely on AutoGen.
The user simply enters his OPENAI_API_KEY and a task and the Space goes to work, employing a
- 1. prompt-enhancer agent,
- 2. an orchestrator agent,
- 3. a coder agent,
- 4. a code-reviewing agent and
-5. a code documentation generator agent.

See below image for an example workflow:

CultriX/MultiAgent-CodeTask

1 reply

·

replied to their post 4 months ago

I made an improved version of this that does not rely on autogen and produces some really nice code! :)

https://huggingface.co/spaces/CultriX/MultiAgent-CodeTask

DEMO

========== USER PROMPT ==========
Write a random password generator in python.

========== Multi-Agent Conversation ==========
[Prompt Optimizer]: Optimized Task:

Instructions for Password Generator in Python

Objective: Create a Python script that generates secure, random passwords based on user-defined criteria.

Requirements:

The script should allow users to customize the password:
- Length: Specify the desired length of the password (e.g., 12-16 characters).
- Character Types: Allow users to include or exclude uppercase letters, lowercase letters, numbers, and special characters.
The password should be randomly generated and accessible for immediate use.

Output Specifications:

The generated password should be printed to the console.
The format should be simple: just the password string without any additional text.

Style:

Write clear, maintainable Python code with appropriate comments explaining the functionality of key sections.
Use descriptive variable names and adhere to PEP 8 formatting standards.

Example of Desired Output:
If a user requests a 12-character password with all character types included, the script might output something like:

D4v!7l@rTzq%

Guidance for Implementation:

Start by importing necessary libraries (e.g., random and string).
Define a function that takes parameters for password length and character type preferences.
Use a combination of built-in functions and methods to dynamically create the password based on the specified criteria.

Here's a brief code outline to guide your implementation:

import random
import string

def generate_password(length=12, use_uppercase=True, use_lowercase=True, use_numbers=True, use_special_chars=True):
    # Create a pool of characters based on user preferences
    charset = ''
    if use_uppercase:
        charset += string.ascii_uppercase
    if use_lowercase:
        charset += string.ascii_lowercase
    if use_numbers:
        charset += string.digits
    if use_special_chars:
        charset += string.punctuation

    # Ensure the charset is not empty
    if not charset:
        raise ValueError("At least one character type must be selected.")

    # Generate a random password
    password = ''.join(random.choice(charset) for _ in range(length))
    print(password)

Feel free to modify the above snippet based on additional features or functionality you may want to incorporate. Ensure that the code is fully functional, handles user input safely, and returns a strong password.

Here's a step-by-step plan for completing the Password Generator task while coordinating tasks among a Coder Agent, a Code Reviewer Agent, and a Documentation Agent:

Step 1: Task Assignment

Coder Agent: Responsible for writing the initial implementation of the password generator based on the provided requirements.
Code Reviewer Agent: Responsible for reviewing the code for correctness, adherence to coding standards, and overall quality.
Documentation Agent: Responsible for creating user documentation and comments within the code.

Step 2: Coder Agent - Implementation

Step 2.1: Import necessary libraries (random and string).
Step 2.2: Define the function generate_password() with parameters for password length and character type preferences.
Step 2.3: Create a character set based on user preferences:
- Initialize an empty string for charset.
- Use if statements to append to charset depending on whether the user has selected uppercase letters, lowercase letters, numbers, and special characters.
Step 2.4: Check if charset is empty and raise a ValueError if it is.
Step 2.5: Generate the password using a list comprehension that iterates for the specified length, selecting random characters from charset.
Step 2.6: Print the generated password.

Step 3: Coder Agent - Testing

Step 3.1: Write test cases to validate that the password generation meets the specified criteria (length, character types).
Step 3.2: Ensure that the function behaves as expected with various combinations of parameters.

Step 4: Code Reviewer Agent - Code Review

Step 4.1: Review the code for:
- Correct functionality (does it generate a valid password?).
- Adherence to PEP 8 formatting standards (indentation, line length, etc.).
- Appropriate comments for understanding the code's functionality.
Step 4.2: Provide feedback to the Coder Agent regarding any improvements or necessary changes.

Step 5: Coder Agent - Revise Code

Step 5.1: Implement any changes or improvements suggested during the code review.
Step 5.2: Re-test the code to ensure that revisions do not introduce any errors.

Step 6: Documentation Agent - Document the Code

Step 6.1: Write clear and concise comments within the code, explaining the key sections and logic.
Step 6.2: Create a README file or a user guide that explains how to use the password generator script, including:
- Functionality overview.
- Instructions for specifying parameters.
- Examples of usage and output.

Step 7: Final Review and Integration

Step 7.1: Perform a final review of the combined code and documentation to ensure that everything is coherent and aligned with the original task requirements.
Step 7.2: Package the final version of the code and documentation for delivery or deployment.

Step 8: Completion

Step 8.1: Confirm all tasks are complete and that the password generator meets the specified objectives.
Step 8.2: Consider potential further enhancements or features for future iterations based on user feedback.

By following this structured plan and assigning specific roles and responsibilities, the collaborative effort from the Coder Agent, Code Reviewer Agent, and Documentation Agent will lead to a well-crafted, maintainable, and user-friendly password generator in Python.

import random
import string

def generate_password(length=12, use_uppercase=True, use_lowercase=True, use_numbers=True, use_special_characters=True):
    """Generate a random password based on user preferences."""
    
    charset = ""
    
    if use_uppercase:
        charset += string.ascii_uppercase
    if use_lowercase:
        charset += string.ascii_lowercase
    if use_numbers:
        charset += string.digits
    if use_special_characters:
        charset += string.punctuation
    
    if not charset:
        raise ValueError("At least one character type must be selected for password generation.")
    
    password = ''.join(random.choice(charset) for _ in range(length))
    print(password)

# Testing the password generator
if __name__ == "__main__":
    print("Generated Passwords:")
    generate_password(length=16, use_uppercase=True, use_lowercase=True, use_numbers=True, use_special_characters=True)
    generate_password(length=12, use_uppercase=False, use_lowercase=True, use_numbers=True, use_special_characters=False)
    generate_password(length=20, use_uppercase=True, use_lowercase=False, use_numbers=False, use_special_characters=True)

[Code Reviewer]: Review (Iteration 0):
The provided code meets the task specifications and correctly implements the password generator based on user-defined criteria. Here is the feedback based on the requirements and code quality:

Functionality: The function generate_password accepts the correct parameters for customizing password length and character types, allowing appropriate flexibility for the user.
Character Sets: The character sets for uppercase letters, lowercase letters, numbers, and special characters are correctly created based on user preferences.
Error Handling: The code properly raises a ValueError if no character types are selected, which is essential for ensuring that a password can be generated.
Randomness: The generation of the password using random.choice correctly ensures randomness in the selection of characters.
Output: The password is printed directly to the console as specified, without any additional text or formatting.
PEP 8 Compliance: The variable names are descriptive, and the overall formatting adheres to PEP 8 standards. The use of double quotes for strings and the overall indentation are consistent and clear.
Comments: The code includes a comment block describing the function, which is beneficial for maintainability.
Main Guard: The if __name__ == "__main__": construct is used correctly, allowing the code to be tested when run as a script.
Testing: Sample calls to the password generator function provide a way to check its functionality during development. This is good for demonstration but may not be appropriate in a production setting if this code were to be used as a library.

Overall, the code is well-structured and addresses the task requirements effectively. There are no significant issues to note.

APPROVE

Documentation for `generate_password`

Function: generate_password

Purpose: Generates a random password based on user-specified criteria.

Parameters:

length (int): Desired length of the password (default is 12).
use_uppercase (bool): Include uppercase letters (default is True).
use_lowercase (bool): Include lowercase letters (default is True).
use_numbers (bool): Include numeric digits (default is True).
use_special_characters (bool): Include special characters (default is True).

Returns:

A randomly generated password printed to the console.

Exceptions:

Raises ValueError if no character type is selected.

Notes:

The function constructs a character set based on the selected options and generates a password by randomly selecting characters from this set.

Help Message

Usage: generate_password(length=12, use_uppercase=True, use_lowercase=True, use_numbers=True, use_special_characters=True)

Generate a random password based on preferences:
- length: Length of the password (default: 12)
- use_uppercase: Include uppercase letters? (default: True)
- use_lowercase: Include lowercase letters? (default: True)
- use_numbers: Include numbers? (default: True)
- use_special_characters: Include special characters? (default: True)

Example: generate_password(16, use_uppercase=True)

replied to sometimesanotion's post 4 months ago

Congrats! Very nicely done :)!

CultriX PRO

AI & ML interests

Recent Activity

Organizations

CultriX's activity

DEMO

Instructions for Password Generator in Python

Guidance for Implementation:

Step 1: Task Assignment

Step 2: Coder Agent - Implementation

Step 3: Coder Agent - Testing

Step 4: Code Reviewer Agent - Code Review

Step 5: Coder Agent - Revise Code

Step 6: Documentation Agent - Document the Code

Step 7: Final Review and Integration

Step 8: Completion

Documentation for `generate_password`

Parameters:

Returns:

Exceptions:

Notes:

Help Message

CultriX PRO

AI & ML interests

Recent Activity

Organizations

CultriX's activity

DEMO

Instructions for Password Generator in Python

Guidance for Implementation:

Step 1: Task Assignment

Step 2: Coder Agent - Implementation

Step 3: Coder Agent - Testing

Step 4: Code Reviewer Agent - Code Review

Step 5: Coder Agent - Revise Code

Step 6: Documentation Agent - Document the Code

Step 7: Final Review and Integration

Step 8: Completion

Documentation for generate_password

Parameters:

Returns:

Exceptions:

Notes:

Help Message

Documentation for `generate_password`