| Text-to-Code Generator using CodeGen-350M-Multi | |
| ============================================= | |
| This project provides a text-to-code generator using a fine-tuned Salesforce/codegen-350M-multi | |
| model, designed to run on low-end laptops (8GB RAM, CPU-only) for students to experiment with AI | |
| model development. The model is fine-tuned on a custom dataset and includes a Flask web interface | |
| for easy interaction. All resources are open-source under the Apache-2.0 license, with attribution | |
| to the original model by Salesforce. | |
| Do's and Setup Process | |
| --------------------- | |
| 1. **System Requirements**: | |
| - Laptop with at least 8GB RAM and 2GB free disk space. | |
| - Windows, macOS, or Linux (CPU-only, no GPU required). | |
| - Internet connection for initial model download. | |
| 2. **Install Python**: | |
| - Use Python 3.10.9. Download from https://www.python.org/downloads/release/python-3109/. | |
| - Verify installation: `python --version`. | |
| 3. **Clone or Download Repository**: | |
| - Download the project files from the Hugging Face repository: | |
| https://huggingface.co/remiai3/text-to-code-using-codegen-project. | |
| - Extract files to a folder (e.g., `text-to-code-codegen`). | |
| 4. **Set Up Virtual Environment**: | |
| - Open a terminal in the project folder. | |
| - Create a virtual environment: `python -m venv venv`. | |
| - Activate it: | |
| - Windows: `venv\Scripts\activate` | |
| - macOS/Linux: `source venv/bin/activate` | |
| 5. **Install Dependencies**: | |
| - Run: `pip install -r requirements.txt`. | |
| - Required libraries: torch, transformers, datasets, accelerate, protobuf, matplotlib, flask. | |
| NOTE: If the pinned matplotlib version (3.7.2) is not compatible with your setup, remove the | |
| version pin. If any other library conflicts with your Python version or with previously | |
| installed packages, remove the version pins from `requirements.txt` and list the library | |
| names only; pip will then install the latest version compatible with your environment. | |
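Removing version pins by hand is error-prone when several libraries conflict. The following is a minimal sketch of a helper that does it, assuming `requirements.txt` uses ordinary `name==version` style lines. The function name `strip_pins` and the output file `requirements-unpinned.txt` are hypothetical and not part of the project.

```python
import re
from pathlib import Path

# Matches the package name (with optional extras) at the start of a requirement line.
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*(?:\[[^\]]*\])?)")


def strip_pins(lines):
    """Return requirement lines with version specifiers removed."""
    result = []
    for line in lines:
        stripped = line.strip()
        # Leave blanks, comments, pip options (e.g. "-r other.txt"), and URLs untouched.
        if not stripped or stripped.startswith(("#", "-")) or "://" in stripped:
            result.append(stripped)
            continue
        match = _NAME.match(stripped)
        result.append(match.group(1) if match else stripped)
    return result


if __name__ == "__main__":
    source = Path("requirements.txt")
    if source.exists():
        unpinned = strip_pins(source.read_text(encoding="utf-8").splitlines())
        # Written to a separate (hypothetical) file so the original stays intact.
        Path("requirements-unpinned.txt").write_text(
            "\n".join(unpinned) + "\n", encoding="utf-8"
        )
```

If the script writes the unpinned file, it can be installed with `pip install -r requirements-unpinned.txt`.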
| 6. **Prepare Custom Dataset**: | |
| - Ensure the `custom_dataset.jsonl` file exists in the project folder. | |
| - Format: Each line is a JSON object with `prompt` (natural language) and `code` (Python code). | |
| - Example: | |
| {"prompt": "Write a Python program to print 'Hello, World!'", "code": "print('Hello, World!')"} | |
| {"prompt": "Write a Python function to add two numbers.", "code": "def add_numbers(a, b):\n    return a + b"} | |
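A malformed line in the dataset may only surface as an error partway through fine-tuning, so it can help to check the file first. Below is a minimal sketch of such a check, assuming the format described above (one JSON object per line with non-empty string values for `prompt` and `code`). The function name `validate_jsonl_lines` is hypothetical and not part of the project.

```python
import json
import os

REQUIRED_KEYS = ("prompt", "code")


def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means valid."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty '{key}'"))
    return problems


if __name__ == "__main__":
    path = "custom_dataset.jsonl"
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            for number, problem in validate_jsonl_lines(handle):
                print(f"line {number}: {problem}")
```

Run from the project folder, the script prints one line per problem and prints nothing when every record is well formed.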
| 7. **Run the Model**: | |
| - Option 1: Run the full pipeline (download, fine-tune, test): | |
| - Update `run_all.py` with your Hugging Face token (`HF_TOKEN`). | |
| - Run: `python run_all.py`. | |
| - This downloads the model, fine-tunes it, tests it, and generates a loss plot. | |
| - Option 2: Test the fine-tuned model directly: | |
| - Run: `python test_codegen.py` to test with sample prompts. | |
| - Option 3: Use the web interface: | |
| - Run: `python app.py`. | |
| - Open a browser and go to `http://127.0.0.1:5000`. | |
| 8. **Using the AI Model**: | |
| - **Command Line Testing**: Use `test_codegen.py` to input prompts and generate Python code. | |
| - **Web Interface**: Use the Flask app (`app.py`) to enter prompts via a browser and view generated code. | |
| - Example prompts: | |
| - "Write a Python function to calculate factorial of a number" | |
| - "Write a Python function to check if a number is prime" | |
| - Fine-tuning output is saved to `./finetuned_codegen` (model weights) and | |
| `./finetuned_codegen/loss_plot.png` (loss plot). | |
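For readers who want to call the fine-tuned model from their own script, the following is a rough sketch of loading it with the Transformers library and generating code on CPU. It is not the project's `test_codegen.py`. It assumes that the tokenizer was saved alongside the weights in `./finetuned_codegen` and that the raw prompt text is passed to the model unchanged; the prompt format actually used during fine-tuning may differ. The names `generate_code` and `strip_prompt` are hypothetical.

```python
def strip_prompt(generated, prompt):
    """Remove the echoed prompt from the start of the generated text, if present."""
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    return generated


def generate_code(prompt, model_dir="./finetuned_codegen", max_new_tokens=128):
    """Generate code for a prompt (assumes model and tokenizer live in model_dir)."""
    # Imported here so the helper above can be used without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding for repeatable output
            pad_token_id=tokenizer.eos_token_id,
        )
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return strip_prompt(text, prompt)
```

Calling `generate_code("Write a Python function to check if a number is prime")` after fine-tuning should return the model's completion as a string. Lowering `max_new_tokens` shortens generation time on slow CPUs.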
| 9. **Model Details**: | |
| - Model: Salesforce/codegen-350M-multi (Apache-2.0 license). | |
| - Source: https://huggingface.co/Salesforce/codegen-350M-multi. | |
| - Fine-tuned on a custom dataset for Python code generation. | |
| - Attribution: This project uses the Salesforce CodeGen model, fine-tuned by remiai3 for | |
| educational purposes. | |
| 10. **Troubleshooting**: | |
| - Ensure ~2GB disk space for model weights. | |
| - If memory issues occur, reduce dataset size or batch size in `run_all.py`. | |
| - Check terminal output for errors and ensure all files (`custom_dataset.jsonl`, | |
| `finetuned_codegen`) are in place. | |
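The file and disk-space checks above can be automated. This is a minimal sketch, assuming the file names listed in this README and a 2 GB free-space threshold; the function name `preflight` is hypothetical. Note that `finetuned_codegen` only exists once fine-tuning has run, so a "missing" report for it is expected before the first run of `run_all.py`.

```python
import os
import shutil


def preflight(project_dir=".", min_free_gb=2.0):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    expected = (
        ("custom_dataset.jsonl", os.path.isfile),
        ("finetuned_codegen", os.path.isdir),
    )
    for name, exists in expected:
        if not exists(os.path.join(project_dir, name)):
            problems.append(f"missing: {name}")

    # Free space on the drive that holds the project folder, in GiB.
    free_gb = shutil.disk_usage(project_dir).free / 1024**3
    if free_gb < min_free_gb:
        problems.append(f"low disk space: {free_gb:.1f} GB free")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```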
| 11. **Contributing**: | |
| - Add more examples to `custom_dataset.jsonl` to improve model performance. | |
| - Share feedback or improvements via the Hugging Face repository: | |
| https://huggingface.co/remiai3. | |
| Attribution | |
| ----------- | |
| This project is built using the Salesforce/codegen-350M-multi model, licensed under Apache-2.0. | |
| The fine-tuned model and resources are provided by remiai3 for free educational use to help students | |
| learn and experiment with AI models. |