---
title: PDF Q&A Dataset Generator
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---

# PDF Q&A Dataset Generator

A Gradio application that generates Q&A datasets from PDF documents using language models.

## Features

- **PDF Processing**: Automatically extract and chunk text from uploaded PDFs
- **Q&A Generation**: Create questions, answers, tags, and difficulty levels
- **Multiple Models**: Choose from several language models, most of them instruction-tuned
- **Customization**: Configure the number of questions, tags, and difficulty settings
- **Multiple Output Formats**: Export datasets as JSON, CSV, or Excel

## How It Works

This application:

1. Extracts text from uploaded PDFs
2. Splits the content into manageable chunks to maintain context
3. Uses the selected language model to generate Q&A pairs with tags
4. Combines these into a single dataset ready for export

## Use Cases

- Creating educational resources and assessment materials
- Generating training data for Q&A systems
- Building flashcard datasets for studying
- Developing content for educational applications
- Preparing comprehension testing materials

## Getting Started

### Local Installation

```bash
git clone https://github.com/your-username/pdf-qa-generator.git
cd pdf-qa-generator
pip install -r requirements.txt
python app.py
```

### Using on Hugging Face Spaces

1. Duplicate this Space to your account
2. Upload your PDFs
3. Configure your settings
4. Generate your Q&A dataset

### Enabling GPU on Hugging Face Spaces

To enable GPU acceleration on Hugging Face Spaces:

1. Uncomment the `# import spaces` line at the top of `app.py`
2. Uncomment the `# @spaces.GPU` decorator above the `process_pdf_generate_qa` function
3. Save and redeploy your Space with GPU hardware selected

## Models

The app includes a selection of language models. The Dolly and Falcon entries are instruction-tuned; the GPT-Neo entries are base models:

- `databricks/dolly-v2-3b` (default)
- `databricks/dolly-v2-7b`
- `EleutherAI/gpt-neo-1.3B`
- `EleutherAI/gpt-neo-2.7B`
- `tiiuae/falcon-7b-instruct`

## License

MIT
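
## Example: Chunking Sketch

The chunking step listed under How It Works (splitting extracted text into manageable pieces while keeping some context) can be illustrated with a minimal sketch. This is not necessarily how `app.py` does it: the function name `chunk_text`, the word-based splitting, and the `max_words` and `overlap` parameters are all assumptions made for illustration.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-based chunks that share `overlap` words.

    Hypothetical helper: the name, parameters, and defaults are
    illustrative assumptions, not taken from app.py.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk made only of overlap words is emitted.
        if start + max_words >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = "one two three four five six seven eight nine ten"
    for i, chunk in enumerate(chunk_text(sample, max_words=4, overlap=1)):
        print(i, chunk)
```

Repeating a few words between neighbouring chunks is one simple way to keep a sentence that straddles a boundary visible to the model in both chunks; a larger `overlap` trades more repeated text for more shared context.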