---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
pretty_name: "Repository Learning Models"
tags:
- code-review
- contrastive-learning
- sentence-transformers
- lora
- fine-tuned
- nextcoder
- faiss-index
- pytorch
- transformers
license: mit
language:
- en
library_name: transformers
pipeline_tag: text-generation
base_model: microsoft/NextCoder-7B
inference: true
---

# Model Card for Repository Learning Models

This model card describes a multi-modal AI system for context-aware code review that combines contrastive learning, fine-tuning, and semantic indexing to understand repository-specific patterns and provide code review assistance.

## Model Details

### Model Description

The Repository Learning Models consist of three specialized components that work together to provide context-aware code review assistance:

1. **Contrastive Learning Model**: A fine-tuned SentenceTransformer that learns semantic relationships between code files based on Git change patterns
2. **Fine-Tuned Review Model**: A LoRA-adapted NextCoder-7B model specialized for generating repository-specific code review comments
3. **Semantic Index**: A FAISS-powered search system with LLM-generated function descriptions for rapid code navigation

- **Developed by:** Milos Kotlar
- **Model type:** Multi-modal (Text Generation + Embedding + Retrieval)
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** microsoft/NextCoder-7B (for review generation), sentence-transformers/all-MiniLM-L6-v2 (for embeddings)

### Model Sources

- **Repository:** https://github.com/kotlarmilos/repository-learning
- **Demo:** https://huggingface.co/spaces/kotlarmilos/repository-learning, https://huggingface.co/spaces/kotlarmilos/repository-grounding
- **Dataset:** https://huggingface.co/datasets/kotlarmilos/repository-learning

## Uses

### Direct Use

The models are designed for:

- **Automated Code Review**: Generate contextual review comments for pull requests
- **Anomaly Detection**: Identify unusual file change patterns that may indicate architectural issues
- **Code Search**: Find relevant functions and documentation using semantic similarity
- **Team Onboarding**: Help new developers understand repository patterns and conventions

### Downstream Use

The models can be integrated into:

- **CI/CD Pipelines**: GitHub Actions, Azure DevOps, and Jenkins workflows
- **IDE Extensions**: VS Code and IntelliJ plugins for real-time review assistance
- **Code Review Tools**: Integration with GitHub, GitLab, and Bitbucket review interfaces
- **Documentation Systems**: Automatic code documentation and explanation generation

### Out-of-Scope Use

The models are **not intended for**:

- **Security Vulnerability Detection**: While they may catch some issues, dedicated security tools should be used
- **Performance Analysis**: The models do not analyze runtime performance or suggest optimizations
- **Cross-Language Translation**: The models are optimized for reviewing code within a single programming language
- **Legal or Compliance Review**: The models cannot assess licensing or regulatory compliance issues

## Bias, Risks, and Limitations

### Technical Limitations

- **Repository Specificity**: Models are trained on specific open-source repositories and may not generalize to very different codebases or proprietary patterns
- **Language Coverage**: Primary focus on seven major programming languages (Python, JavaScript, TypeScript, Java, C++, C#, C)
- **Context Window**: The fine-tuned model is limited to 2048 input tokens
- **Temporal Bias**: Training data reflects repository activity from the 2024-2025 timeframe

### Social and Ethical Considerations

- **Review Style Bias**: Models learn from existing human review patterns, potentially perpetuating team-specific biases or exclusionary language
- **Open Source Bias**: Training primarily on open-source repositories may not reflect enterprise development patterns
- **Developer Experience Bias**: May favor review styles from experienced developers, potentially alienating junior developers

### Recommendations

- **Human Oversight**: Use AI suggestions as guidance, not as a replacement for human code review
- **Bias Monitoring**: Regularly evaluate generated reviews for inclusive language and fair treatment across developer experience levels
- **Continuous Updating**: Retrain models periodically on recent repository activity to maintain relevance
- **Domain Adaptation**: Fine-tune on organization-specific data when deploying in enterprise environments

## How to Get Started with the Model

```python
# Install dependencies first:
#   pip install transformers sentence-transformers faiss-cpu torch

# Load the fine-tuned review model
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("kotlarmilos/repository-learning-models")
model = AutoModelForCausalLM.from_pretrained(
    "kotlarmilos/repository-learning-models",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Generate a code review for a small diff
prompt = """Code diff:
+def calculate_average(numbers):
+    return sum(numbers) / len(numbers)

Please write a code review comment:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
review = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(review)
```
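
The contrastive embedding component can be used in the same environment to score how related two files or snippets are. The snippet below is a minimal sketch: the model id shown is the base SentenceTransformer listed in this card, and the file summaries are hypothetical; substitute the fine-tuned embedding weights published with this project.

```python
from sentence_transformers import SentenceTransformer, util

# Minimal sketch of the embedding component. The base model id is a placeholder;
# load the fine-tuned contrastive weights from this project instead.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical file summaries; in practice these come from repository content.
files = [
    "src/diff_parser.py: tokenizes and parses pull request diffs",
    "tests/test_diff_parser.py: unit tests covering the diff parser",
]

embeddings = embedder.encode(files, normalize_embeddings=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Co-change similarity: {similarity:.3f}")
```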

## Training Details

### Training Data

Training data consists of curated datasets from 15 high-quality open-source repositories:

- **Source**: GitHub repositories with >100 stars and active development

**Linked Dataset**: [kotlarmilos/repository-learning](https://huggingface.co/datasets/kotlarmilos/repository-learning)

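The linked dataset can be pulled with the `datasets` library. This is a minimal sketch; the available configurations and splits are not documented here, so check the dataset card for the actual layout.

```python
from datasets import load_dataset

# Assumes the default configuration loads directly; see the dataset card at
# https://huggingface.co/datasets/kotlarmilos/repository-learning for details.
dataset = load_dataset("kotlarmilos/repository-learning")
print(dataset)
```
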
### Training Procedure

#### Preprocessing

1. **GitHub Data Collection**: GraphQL/REST API extraction of PRs, diffs, and review comments (see the extraction sketch after this list)
2. **Conversation Structuring**: Chronological ordering of review discussions with context
3. **Code Analysis**: Tree-sitter AST parsing for function extraction across several programming languages
4. **Quality Filtering**: Removal of non-constructive comments, bot interactions, and duplicate content

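As an illustration of step 1, the sketch below pulls inline review comments for a single pull request through the GitHub REST API. The owner, repository, and PR number are placeholders, and the actual pipeline also uses the GraphQL API plus additional endpoints for diffs and PR metadata.

```python
import os
import requests

# Placeholders; the real pipeline iterates over many repositories and PRs.
OWNER, REPO, PR_NUMBER = "octocat", "hello-world", 1

response = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/pulls/{PR_NUMBER}/comments",
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    },
    timeout=30,
)
response.raise_for_status()

for comment in response.json():
    # Each inline review comment carries the diff hunk it was attached to.
    print(comment["path"], comment["diff_hunk"][:80], comment["body"][:80])
```
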
#### Training Hyperparameters

**Contrastive Learning Model** (training sketch below):
- **Base Model**: sentence-transformers/all-MiniLM-L6-v2
- **Batch Size**: 32
- **Epochs**: 10
- **Loss Function**: ContrastiveLoss
- **Max Pairs**: 35,000 (positive/negative)

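A minimal sketch of this setup using the `sentence-transformers` training loop is shown below; the example pairs are hypothetical, whereas the real positives and negatives are mined from Git co-change history as described above.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical pairs: label 1 for files that change together, 0 for unrelated files.
train_examples = [
    InputExample(texts=["src/diff_parser.py", "tests/test_diff_parser.py"], label=1),
    InputExample(texts=["src/diff_parser.py", "docs/roadmap.md"], label=0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.ContrastiveLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    warmup_steps=100,
)
```
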
**Fine-Tuned Review Model** (configuration sketch below):
- **Base Model**: microsoft/NextCoder-7B
- **Training Method**: LoRA (Low-Rank Adaptation)
- **LoRA Rank**: 8, Alpha: 16, Dropout: 0.05
- **Target Modules**: ["q_proj", "v_proj"]
- **Quantization**: 4-bit NF4 with BitsAndBytes
- **Learning Rate**: 1e-4 with cosine decay
- **Batch Size**: 4 with 8 gradient accumulation steps
- **Epochs**: 3
- **Training Regime**: bf16 mixed precision

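The listed adapter and quantization settings map onto a standard PEFT + bitsandbytes configuration; a minimal sketch is below (the trainer setup, dataset formatting, and learning-rate schedule are omitted).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the base model, as listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/NextCoder-7B",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapter with rank 8, alpha 16, dropout 0.05 on the attention projections.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```
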
#### Speeds, Sizes, Times

- **Training Time**: 2 hours on an H100 GPU for the complete pipeline

## Technical Specifications

### Model Architecture and Objective

**Multi-Modal Architecture**:
1. **Embedding Component**: SentenceTransformer with contrastive learning objective
2. **Generation Component**: Transformer decoder with causal language modeling
3. **Retrieval Component**: FAISS vector index with dense embeddings (query sketch below)

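A minimal sketch of how the retrieval component can be queried is shown below; the function descriptions are hypothetical stand-ins for the LLM-generated descriptions in the real index, and the base SentenceTransformer is used as the encoder placeholder.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical LLM-generated function descriptions backing the index.
descriptions = [
    "Parses a pull request diff into per-file hunks",
    "Computes the arithmetic mean of a list of numbers",
    "Posts a review comment back to the pull request",
]

# Normalized embeddings + inner-product index gives cosine-similarity search.
embeddings = encoder.encode(descriptions, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

query = encoder.encode(["function that averages numbers"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), 1)
print(descriptions[ids[0][0]], float(scores[0][0]))
```
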
**Training Objectives**:
- Contrastive learning: Maximize similarity of co-changed files, minimize similarity of unrelated files
- Instruction following: Generate helpful review comments given code diff context
- Semantic indexing: Create searchable representations of code functions