Upload folder using huggingface_hub

Files changed (11)

.gitattributes +3 -33
README.md +144 -0
adapter_config.json +39 -0
adapter_model.safetensors +3 -0
added_tokens.json +24 -0
finetuning.py +191 -0
merges.txt +0 -0
special_tokens_map.json +31 -0
tokenizer.json +3 -0
tokenizer_config.json +208 -0
vocab.json +0 -0

.gitattributes CHANGED

@@ -1,35 +1,5 @@
-*.7z filter=lfs diff=lfs merge=lfs -text
-*.arrow filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
-*.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
-*.pkl filter=lfs diff=lfs merge=lfs -text
 *.pt filter=lfs diff=lfs merge=lfs -text
-*.pth filter=lfs diff=lfs merge=lfs -text
-*.rar filter=lfs diff=lfs merge=lfs -text
-*.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
-*.tflite filter=lfs diff=lfs merge=lfs -text
-*.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text

+adapter_model.safetensors filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
 *.pt filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,144 @@

+# JurisQwen: Legal Domain Fine-tuned Qwen2.5-7B Model
+## Overview
+JurisQwen is a specialized legal domain language model based on Qwen2.5-7B, fine-tuned on Indian legal datasets. It is designed to assist with legal queries, document analysis, and questions about Indian law.
+## Model Details
+### Model Description
+- **Developed by:** [Your Name/Organization]
+- **Base Model:** Qwen2.5-7B by Qwen
+- **Model Type:** Language Model with LoRA fine-tuning
+- **Language:** English with focus on Indian legal terminology
+- **License:** [Specify License - inherited from Qwen2.5 or your custom license]
+- **Finetuned from model:** Qwen/Qwen2.5-7B
+- **Framework:** PEFT 0.15.1 with Unsloth optimization
+### Training Dataset
+The model was fine-tuned on the `viber1/indian-law-dataset` dataset, which contains instruction-response pairs focused on Indian legal knowledge and terminology.
+## Technical Specifications
+### Model Architecture
+- Base model: Qwen2.5-7B
+- Fine-tuning method: LoRA (Low-Rank Adaptation)
+- LoRA configuration:
+  - Rank (r): 32
+  - Alpha: 64
+  - Dropout: 0.05
+  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
+### Training Procedure
+- **Training Infrastructure:** NVIDIA A100-40GB GPU
+- **Quantization:** 4-bit quantization using bitsandbytes
+- **Mixed Precision:** bfloat16
+- **Attention Implementation:** Flash Attention 2
+- **Training Hyperparameters:**
+  - Epochs: 3
+  - Batch size: 16
+  - Gradient accumulation steps: 2
+  - Learning rate: 2e-4
+  - Weight decay: 0.001
+  - Scheduler: Cosine with 10% warmup
+  - Optimizer: AdamW 8-bit
+  - Maximum sequence length: 4096
+  - TF32 enabled for A100
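The hyperparameters listed above imply an effective batch size and a learning-rate curve that can be worked out directly. The sketch below assumes a single GPU and a hypothetical total of 1,000 optimizer steps (the real count depends on the dataset size and on sequence packing, neither of which is stated here). `lr_at` is an illustrative helper, not part of the training script; it follows the usual linear-warmup-then-cosine shape.

```python
import math

# Values taken from the hyperparameter list
PER_DEVICE_BATCH = 16
GRAD_ACCUM_STEPS = 2
PEAK_LR = 2e-4
WARMUP_RATIO = 0.1

# Hypothetical value, for illustration only
TOTAL_STEPS = 1000

# Sequences that contribute to each optimizer step on one GPU
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS

# Number of steps spent ramping the learning rate up from zero
warmup_steps = math.ceil(WARMUP_RATIO * TOTAL_STEPS)


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (linear warmup, cosine decay)."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, TOTAL_STEPS - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch size: {effective_batch}")
print(f"warmup steps: {warmup_steps}")
print(f"lr at end of warmup: {lr_at(warmup_steps):.2e}")
```

With these settings each optimizer step sees 16 x 2 = 32 sequences, and the learning rate reaches its peak of 2e-4 once warmup ends.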
+### Deployment Infrastructure
+- Deployed using Modal cloud platform
+- GPU: NVIDIA A100-40GB
+- Persistent volume storage for model checkpoints
+## Usage
+### Setting Up the Environment
+This model is deployed using Modal. To use it, you'll need to:
+1. Install Modal:
+```bash
+pip install modal
+```
+2. Authenticate with Modal:
+```bash
+modal token new
+```
+3. Deploy the application:
+```bash
+python finetuning.py
+```
+### Running Fine-tuning
+To run the fine-tuning process:
+```python
+from finetuning import app, finetune_qwen
+# Deploy the app
+app.deploy()
+# Run fine-tuning
+result = finetune_qwen.remote()
+print(f"Fine-tuning result: {result}")
+```
+### Inference
+To run inference with the fine-tuned model:
+```python
+from finetuning import app, test_inference
+# Example legal query
+response = test_inference.remote("What are the key provisions of the Indian Contract Act?")
+print(response)
+```
+## Input Format
+The model expects prompts in the ChatML layout used during fine-tuning, ending with the assistant header:
+```
+<|im_start|>user
+[Your legal question here]<|im_end|>
+<|im_start|>assistant
+```
+The model then completes the assistant turn:
+```
+[Legal response]<|im_end|>
+```
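A prompt in this layout can be assembled with a small helper. This is a minimal sketch: `build_prompt` is a hypothetical name, and the exact whitespace follows the training text built in `finetuning.py` (`<|im_end|>` directly after the content, then a newline before the next turn). The trailing assistant header is what cues the model to write a reply.

```python
def build_prompt(question: str) -> str:
    """Wrap a user question in ChatML and open the assistant turn."""
    return (
        f"<|im_start|>user\n{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_prompt("What are the key provisions of the Indian Contract Act?")
print(prompt)
```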
+## Limitations and Biases
+- The model is specifically trained on Indian legal data and may not generalize well to other legal systems
+- Output from the model should not be treated as professional legal advice
+- The model may exhibit biases present in the training data
+- Performance on complex or novel legal scenarios not present in the training data may be limited
+## Recommendations
+- Users should validate important legal information with qualified legal professionals
+- Always cross-reference model outputs with authoritative legal sources
+- Be aware that legal interpretations may vary and the model provides one possible interpretation
+## Environmental Impact
+- Hardware: NVIDIA A100-40GB GPU
+- Training time: Approximately 3-5 hours
+- Cloud Provider: Modal
+## Citation
+If you use this model in your research, please cite:
+```
+@software{JurisQwen,
+  author = {Prathamesh Devadiga},
+  title = {JurisQwen: Indian Legal Domain Fine-tuned Qwen2.5-7B Model},
+  year = {2025},
+  url = {https://github.com/devadigapratham/JurisQwen}
+}
+```
+## Acknowledgments
+- Qwen team for the original Qwen2.5-7B model
+- Unsloth for optimization tools
+- Modal for deployment infrastructure
+- Creator of the "viber1/indian-law-dataset"

adapter_config.json ADDED

	@@ -0,0 +1,39 @@

+{
+  "alpha_pattern": {},
+  "auto_mapping": null,
+  "base_model_name_or_path": "unsloth/qwen2.5-7b-unsloth-bnb-4bit",
+  "bias": "none",
+  "corda_config": null,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 64,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "r": 32,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "q_proj",
+    "o_proj",
+    "v_proj",
+    "gate_proj",
+    "k_proj",
+    "up_proj",
+    "down_proj"
+  ],
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_dora": false,
+  "use_rslora": false
+}
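The adapter settings above determine both the LoRA scaling factor and the number of trainable parameters. The sketch below works both out. The layer dimensions are assumptions taken from the published Qwen2.5-7B architecture (28 layers, hidden size 3584, MLP size 18944, 4 key/value heads of dimension 128); they do not appear in this commit and should be checked against the base model's `config.json`. Each adapted linear layer of shape `d_in x d_out` adds `r * (d_in + d_out)` parameters.

```python
# Values from adapter_config.json
R = 32
LORA_ALPHA = 64

# Assumed Qwen2.5-7B dimensions (not part of this commit)
NUM_LAYERS = 28
HIDDEN = 3584
INTERMEDIATE = 18944
KV_DIM = 4 * 128  # key/value heads x head dimension

# (input dim, output dim) of each targeted projection in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in TARGET_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS

# With use_rslora set to false, the low-rank update is scaled by alpha / r
scaling = LORA_ALPHA / R

print(f"LoRA scaling: {scaling}")
print(f"trainable parameters: {total_trainable:,}")
```

Under these assumed dimensions the adapter has roughly 80.7 million trainable parameters, and the update is scaled by a factor of 2.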

adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a6b6b40b7f5a2311bb1a00e2d2665129ed926d4b87fd04578ec504452d5d5b84
+size 134
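The three lines above are a Git LFS pointer rather than the weights themselves: the pointer records the spec version, the SHA-256 of the real file, and its size in bytes. The sketch below parses such a pointer; `parse_lfs_pointer` is a hypothetical helper written for illustration. A recorded size of 134 bytes is roughly the length of a pointer file itself, so it may be worth confirming that the actual adapter weights were uploaded.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form "<hash algorithm>:<hex digest>"
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:a6b6b40b7f5a2311bb1a00e2d2665129ed926d4b87fd04578ec504452d5d5b84
size 134
"""

info = parse_lfs_pointer(pointer_text)
print(info["hash_algo"], info["size"])
```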

added_tokens.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "</tool_call>": 151658,
+  "<tool_call>": 151657,
+  "<|box_end|>": 151649,
+  "<|box_start|>": 151648,
+  "<|endoftext|>": 151643,
+  "<|file_sep|>": 151664,
+  "<|fim_middle|>": 151660,
+  "<|fim_pad|>": 151662,
+  "<|fim_prefix|>": 151659,
+  "<|fim_suffix|>": 151661,
+  "<|im_end|>": 151645,
+  "<|im_start|>": 151644,
+  "<|image_pad|>": 151655,
+  "<|object_ref_end|>": 151647,
+  "<|object_ref_start|>": 151646,
+  "<|quad_end|>": 151651,
+  "<|quad_start|>": 151650,
+  "<|repo_name|>": 151663,
+  "<|video_pad|>": 151656,
+  "<|vision_end|>": 151653,
+  "<|vision_pad|>": 151654,
+  "<|vision_start|>": 151652
+}

finetuning.py ADDED

	@@ -0,0 +1,191 @@

+import modal
+import os
+from pathlib import Path
+# Define Modal app
+app = modal.App("qwen-law-finetuning")
+# Create a custom image with all dependencies.
+# The pip installs are split into several steps to make the build more reliable.
+# Start from NVIDIA's CUDA development image so the CUDA toolchain is pre-configured.
+image = (
+    modal.Image.from_registry(
+        "nvidia/cuda:12.1.0-devel-ubuntu22.04",
+        add_python="3.10"
+    )
+    .apt_install(["git", "build-essential", "ninja-build"])
+    .pip_install("unsloth", "datasets")
+    .pip_install("torch>=2.0.1", "transformers>=4.33.0")
+    .pip_install("peft>=0.5.0", "trl>=0.7.1", "tensorboard")
+    .pip_install("bitsandbytes>=0.41.1", "accelerate>=0.23.0")
+    .pip_install("xformers>=0.0.21", "einops", "sentencepiece", "protobuf")
+    .pip_install("flash-attn>=2.3.0")
+    # Add the local directory to the image once, as the final build step
+    .add_local_dir(".", remote_path="/root/code")
+)
+# Define volume to persist model checkpoints
+volume = modal.Volume.from_name("finetune-volume", create_if_missing=True)
+VOLUME_PATH = "/data"
+@app.function(
+    image=image,
+    gpu="A100-40GB",
+    timeout=60 * 60 * 5,  # 5 hour timeout
+    volumes={VOLUME_PATH: volume},
+)
+def finetune_qwen():
+    import torch
+    from datasets import load_dataset
+    from unsloth import FastLanguageModel
+    from transformers import TrainingArguments
+    from trl import SFTTrainer
+    import os
+    # Set working directory
+    os.chdir("/root/code")
+    # Create output directory in the volume
+    output_dir = os.path.join(VOLUME_PATH, "JurisQwen")
+    os.makedirs(output_dir, exist_ok=True)
+    print("Loading dataset...")
+    # Load the dataset
+    ds = load_dataset("viber1/indian-law-dataset")
+    # Format the dataset for instruction fine-tuning
+    def format_instruction(example):
+        return {
+            "text": f"<|im_start|>user\n{example['Instruction']}<|im_end|>\n<|im_start|>assistant\n{example['Response']}<|im_end|>"
+        }
+    # Apply formatting
+    formatted_ds = ds.map(format_instruction)
+    train_dataset = formatted_ds["train"]
+    # A100-optimized parameters
+    max_seq_length = 4096  # Increased for A100's larger memory
+    model_id = "Qwen/Qwen2.5-7B"
+    print("Loading model...")
+    # Initialize model with Unsloth, optimized for A100
+    model, tokenizer = FastLanguageModel.from_pretrained(
+        model_id,
+        max_seq_length=max_seq_length,
+        load_in_4bit=True,  # Quantized training for memory efficiency
+        attn_implementation="flash_attention_2",  # Flash Attention 2 for A100
+        dtype=torch.bfloat16,  # Explicitly use bfloat16 for A100
+    )
+    # Prepare model for training with optimized parameters for A100
+    model = FastLanguageModel.get_peft_model(
+        model,
+        r=32,  # Increased LoRA rank for A100
+        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
+                       "gate_proj", "up_proj", "down_proj"],
+        lora_alpha=64,  # Increased alpha for better training
+        lora_dropout=0.05,
+        bias="none",
+        use_gradient_checkpointing="unsloth",  # Enables efficient training on long sequences
+    )
+    # Set training arguments optimized for A100
+    training_args = TrainingArguments(
+        output_dir=os.path.join(VOLUME_PATH, "checkpoints"),
+        num_train_epochs=3,
+        per_device_train_batch_size=16,  # Increased for A100
+        gradient_accumulation_steps=2,  # Reduced due to larger batch size
+        optim="adamw_8bit",  # 8-bit Adam optimizer for efficiency
+        learning_rate=2e-4,
+        weight_decay=0.001,
+        lr_scheduler_type="cosine",
+        warmup_ratio=0.1,
+        bf16=True,  # Enable bf16 (A100 supports it)
+        fp16=False,  # Disable fp16 when using bf16
+        logging_steps=10,
+        save_strategy="epoch",
+        report_to="tensorboard",
+        tf32=True,  # Enable TF32 for A100
+    )
+    print("Preparing trainer...")
+    # Using SFTTrainer for better performance
+    trainer = SFTTrainer(
+        model=model,
+        tokenizer=tokenizer,
+        train_dataset=train_dataset,
+        dataset_text_field="text",
+        max_seq_length=max_seq_length,
+        args=training_args,
+        packing=True,  # Enable packing for faster training
+    )
+    # Train the model
+    print("Starting training...")
+    trainer.train()
+    print("Training completed!")
+    # Save the fine-tuned model
+    print(f"Saving model to {output_dir}")
+    model.save_pretrained(output_dir)
+    tokenizer.save_pretrained(output_dir)
+    # Test inference with the fine-tuned model
+    print("Testing inference...")
+    FastLanguageModel.for_inference(model)  # Enable faster inference
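+    # End the prompt with the assistant header so the model generates the reply
+    test_prompt = "<|im_start|>user\nWhat are the key provisions of the Indian Contract Act?<|im_end|>\n<|im_start|>assistant\n"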
+    inputs = tokenizer([test_prompt], return_tensors="pt").to("cuda")
+    outputs = model.generate(**inputs, max_new_tokens=512)
+    print("Generated response:")
+    print(tokenizer.decode(outputs[0]))
+    return f"Model successfully trained and saved to {output_dir}"
+@app.function(
+    image=image,
+    gpu="A100-40GB",
+    timeout=60 * 10,  # 10 minute timeout
+    volumes={VOLUME_PATH: volume},
+)
+def test_inference(prompt: str):
+    from unsloth import FastLanguageModel
+    import torch
+    import os
+    # Load the fine-tuned model
+    model_path = os.path.join(VOLUME_PATH, "JurisQwen")
+    print(f"Loading model from {model_path}")
+    model, tokenizer = FastLanguageModel.from_pretrained(
+        model_path,
+        max_seq_length=4096,
+        attn_implementation="flash_attention_2",
+        dtype=torch.bfloat16,
+    )
+    # Enable fast inference
+    FastLanguageModel.for_inference(model)
+    # Format the prompt, ending with the assistant header so the model generates the reply
+    formatted_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
+    inputs = tokenizer([formatted_prompt], return_tensors="pt").to("cuda")
+    # Generate response
+    outputs = model.generate(**inputs, max_new_tokens=512)
+    response = tokenizer.decode(outputs[0])
+    return response
+# Local entrypoint: deploys the app, then runs fine-tuning remotely
+@app.local_entrypoint()
+def main():
+    print("Starting fine-tuning process...")
+    app.deploy()
+    result = finetune_qwen.remote()
+    print(f"Fine-tuning result: {result}")
+if __name__ == "__main__":
+    main()
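Both `finetune_qwen` and `test_inference` pass the output of `tokenizer.decode(outputs[0])` along unprocessed, so the string contains the prompt and the special tokens as well as the generated text. A minimal sketch of pulling out just the assistant's reply follows. `extract_reply` is a hypothetical helper, it assumes the prompt ended with the `<|im_start|>assistant` header, and the sample string is made up for illustration rather than real model output.

```python
ASSISTANT_HEADER = "<|im_start|>assistant\n"
STOP_MARKERS = ("<|im_end|>", "<|endoftext|>")


def extract_reply(decoded: str) -> str:
    """Return the text of the first assistant turn in a decoded ChatML string."""
    _, found, tail = decoded.partition(ASSISTANT_HEADER)
    if not found:
        # No assistant header present: fall back to the whole string
        tail = decoded
    # Cut the reply at the earliest end-of-turn or end-of-text marker
    for marker in STOP_MARKERS:
        tail = tail.split(marker, 1)[0]
    return tail.strip()


# Made-up decoded output, for illustration only
sample = (
    "<|im_start|>user\nWhat is a contract?<|im_end|>\n"
    "<|im_start|>assistant\nA contract is an agreement enforceable by law.<|im_end|>"
)
print(extract_reply(sample))
```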

merges.txt ADDED

The diff for this file is too large to render. See raw diff

special_tokens_map.json ADDED

	@@ -0,0 +1,31 @@

+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|vision_pad|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9c5ae00e602b8860cbd784ba82a8aa14e8feecec692e7076590d014d7b7fdafa
+size 11421896

tokenizer_config.json ADDED

	@@ -0,0 +1,208 @@

+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "151643": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151644": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151645": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151646": {
+      "content": "<|object_ref_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151647": {
+      "content": "<|object_ref_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151648": {
+      "content": "<|box_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151649": {
+      "content": "<|box_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151650": {
+      "content": "<|quad_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151651": {
+      "content": "<|quad_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151652": {
+      "content": "<|vision_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151653": {
+      "content": "<|vision_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151654": {
+      "content": "<|vision_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151655": {
+      "content": "<|image_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151656": {
+      "content": "<|video_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151657": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151658": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151659": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151660": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151661": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151662": {
+      "content": "<|fim_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151663": {
+      "content": "<|repo_name|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151664": {
+      "content": "<|file_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|endoftext|>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "model_max_length": 131072,
+  "pad_token": "<|vision_pad|>",
+  "padding_side": "right",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}

vocab.json ADDED

The diff for this file is too large to render. See raw diff