---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
pretty_name: "Repository Learning Models"
tags:
- code-review
- contrastive-learning
- sentence-transformers
- lora
- fine-tuned
- nextcoder
- faiss-index
- pytorch
- transformers
license: mit
language:
- en
library_name: transformers
pipeline_tag: text-generation
base_model: microsoft/NextCoder-7B
inference: true
---

# Model Card for Repository Learning Models

This model card describes a multi-modal AI system for context-aware code review that combines contrastive learning, fine-tuning, and semantic indexing to understand repository-specific patterns and provide code review assistance.

## Model Details

### Model Description

The Repository Learning Models consist of three specialized components that work together to provide context-aware code review assistance:

1. **Contrastive Learning Model**: A fine-tuned SentenceTransformer that learns semantic relationships between code files based on Git change patterns
2. **Fine-Tuned Review Model**: A LoRA-adapted NextCoder-7B model specialized for generating repository-specific code review comments
3. **Semantic Index**: A FAISS-powered search system with LLM-generated function descriptions for rapid code navigation

- **Developed by:** Milos Kotlar
- **Model type:** Multi-modal (Text Generation + Embedding + Retrieval)
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** microsoft/NextCoder-7B (for review generation), sentence-transformers/all-MiniLM-L6-v2 (for embeddings)

### Model Sources

- **Repository:** https://github.com/kotlarmilos/repository-learning
- **Demo:** https://huggingface.co/spaces/kotlarmilos/repository-learning, https://huggingface.co/spaces/kotlarmilos/repository-grounding
- **Dataset:** https://huggingface.co/datasets/kotlarmilos/repository-learning

## Uses

### Direct Use

The models are designed for:

- **Automated Code Review**: Generate contextual review comments for pull requests
- **Anomaly Detection**: Identify unusual file change patterns that may indicate architectural issues
- **Code Search**: Find relevant functions and documentation using semantic similarity
- **Team Onboarding**: Help new developers understand repository patterns and conventions

### Downstream Use

The models can be integrated into:

- **CI/CD Pipelines**: GitHub Actions, Azure DevOps, and Jenkins workflows
- **IDE Extensions**: VS Code and IntelliJ plugins for real-time review assistance
- **Code Review Tools**: Integration with GitHub, GitLab, and Bitbucket review interfaces
- **Documentation Systems**: Automatic code documentation and explanation generation

### Out-of-Scope Use

The models are **not intended for**:

- **Security Vulnerability Detection**: While they may catch some issues, dedicated security tools should be used
- **Performance Analysis**: The models do not analyze runtime performance or suggest optimizations
- **Cross-Language Translation**: The models are optimized for reviewing code within a single programming language
- **Legal or Compliance Review**: The models cannot assess licensing or regulatory compliance issues

## Bias, Risks, and Limitations

### Technical Limitations

- **Repository Specificity**: Models are trained on specific open-source repositories and may not generalize to very different codebases or proprietary patterns
- **Language Coverage**: Primary focus on seven major programming languages (Python, JavaScript, TypeScript, Java, C++, C#, C)
- **Context Window**: The fine-tuned model is limited to 2048 input tokens
- **Temporal Bias**: Training data reflects repository activity from the 2024-2025 timeframe

### Social and Ethical Considerations

- **Review Style Bias**: Models learn from existing human review patterns, potentially perpetuating team-specific biases or exclusionary language
- **Open Source Bias**: Training primarily on open-source repositories may not reflect enterprise development patterns
- **Developer Experience Bias**: May favor review styles from experienced developers, potentially alienating junior developers

### Recommendations

- **Human Oversight**: Use AI suggestions as guidance, not as a replacement for human code review
- **Bias Monitoring**: Regularly evaluate generated reviews for inclusive language and fair treatment across developer experience levels
- **Continuous Updating**: Retrain models periodically on recent repository activity to maintain relevance
- **Domain Adaptation**: Fine-tune on organization-specific data when deploying in enterprise environments

## How to Get Started with the Model

```python
# Install dependencies first:
#   pip install transformers sentence-transformers faiss-cpu torch

# Load the fine-tuned review model
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("kotlarmilos/repository-learning-models")
model = AutoModelForCausalLM.from_pretrained(
    "kotlarmilos/repository-learning-models",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Generate a code review for a small diff
prompt = """Code diff:
+def calculate_average(numbers):
+    return sum(numbers) / len(numbers)

Please write a code review comment:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
review = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(review)
```
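
The contrastive embedding component can be used in the same environment to score how related two files or snippets are. The snippet below is a minimal sketch: the model id shown is the base SentenceTransformer listed in this card, and the file summaries are hypothetical; substitute the fine-tuned embedding weights published with this project.

```python
from sentence_transformers import SentenceTransformer, util

# Minimal sketch of the embedding component. The base model id is a placeholder;
# load the fine-tuned contrastive weights from this project instead.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical file summaries; in practice these come from repository content.
files = [
    "src/diff_parser.py: tokenizes and parses pull request diffs",
    "tests/test_diff_parser.py: unit tests covering the diff parser",
]

embeddings = embedder.encode(files, normalize_embeddings=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Co-change similarity: {similarity:.3f}")
```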

## Training Details

### Training Data

Training data consists of curated datasets from 15 high-quality open-source repositories:

- **Source**: GitHub repositories with >100 stars and active development

**Linked Dataset**: [kotlarmilos/repository-learning](https://huggingface.co/datasets/kotlarmilos/repository-learning)

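The linked dataset can be pulled with the `datasets` library. This is a minimal sketch; the available configurations and splits are not documented here, so check the dataset card for the actual layout.

```python
from datasets import load_dataset

# Assumes the default configuration loads directly; see the dataset card at
# https://huggingface.co/datasets/kotlarmilos/repository-learning for details.
dataset = load_dataset("kotlarmilos/repository-learning")
print(dataset)
```
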
### Training Procedure

#### Preprocessing

1. **GitHub Data Collection**: GraphQL/REST API extraction of PRs, diffs, and review comments (see the extraction sketch after this list)
2. **Conversation Structuring**: Chronological ordering of review discussions with context
3. **Code Analysis**: Tree-sitter AST parsing for function extraction across several programming languages
4. **Quality Filtering**: Removal of non-constructive comments, bot interactions, and duplicate content

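As an illustration of step 1, the sketch below pulls inline review comments for a single pull request through the GitHub REST API. The owner, repository, and PR number are placeholders, and the actual pipeline also uses the GraphQL API plus additional endpoints for diffs and PR metadata.

```python
import os
import requests

# Placeholders; the real pipeline iterates over many repositories and PRs.
OWNER, REPO, PR_NUMBER = "octocat", "hello-world", 1

response = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/pulls/{PR_NUMBER}/comments",
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    },
    timeout=30,
)
response.raise_for_status()

for comment in response.json():
    # Each inline review comment carries the diff hunk it was attached to.
    print(comment["path"], comment["diff_hunk"][:80], comment["body"][:80])
```
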
#### Training Hyperparameters

**Contrastive Learning Model** (training sketch below):
- **Base Model**: sentence-transformers/all-MiniLM-L6-v2
- **Batch Size**: 32
- **Epochs**: 10
- **Loss Function**: ContrastiveLoss
- **Max Pairs**: 35,000 (positive/negative)

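A minimal sketch of this setup using the `sentence-transformers` training loop is shown below; the example pairs are hypothetical, whereas the real positives and negatives are mined from Git co-change history as described above.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical pairs: label 1 for files that change together, 0 for unrelated files.
train_examples = [
    InputExample(texts=["src/diff_parser.py", "tests/test_diff_parser.py"], label=1),
    InputExample(texts=["src/diff_parser.py", "docs/roadmap.md"], label=0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.ContrastiveLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    warmup_steps=100,
)
```
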
**Fine-Tuned Review Model** (configuration sketch below):
- **Base Model**: microsoft/NextCoder-7B
- **Training Method**: LoRA (Low-Rank Adaptation)
- **LoRA Rank**: 8, Alpha: 16, Dropout: 0.05
- **Target Modules**: ["q_proj", "v_proj"]
- **Quantization**: 4-bit NF4 with BitsAndBytes
- **Learning Rate**: 1e-4 with cosine decay
- **Batch Size**: 4 with 8 gradient accumulation steps
- **Epochs**: 3
- **Training Regime**: bf16 mixed precision

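The listed adapter and quantization settings map onto a standard PEFT + bitsandbytes configuration; a minimal sketch is below (the trainer setup, dataset formatting, and learning-rate schedule are omitted).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the base model, as listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/NextCoder-7B",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapter with rank 8, alpha 16, dropout 0.05 on the attention projections.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```
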
#### Speeds, Sizes, Times

- **Training Time**: 2 hours on an H100 GPU for the complete pipeline

## Technical Specifications

### Model Architecture and Objective

**Multi-Modal Architecture**:
1. **Embedding Component**: SentenceTransformer with contrastive learning objective
2. **Generation Component**: Transformer decoder with causal language modeling
3. **Retrieval Component**: FAISS vector index with dense embeddings (query sketch below)

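A minimal sketch of how the retrieval component can be queried is shown below; the function descriptions are hypothetical stand-ins for the LLM-generated descriptions in the real index, and the base SentenceTransformer is used as the encoder placeholder.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical LLM-generated function descriptions backing the index.
descriptions = [
    "Parses a pull request diff into per-file hunks",
    "Computes the arithmetic mean of a list of numbers",
    "Posts a review comment back to the pull request",
]

# Normalized embeddings + inner-product index gives cosine-similarity search.
embeddings = encoder.encode(descriptions, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

query = encoder.encode(["function that averages numbers"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), 1)
print(descriptions[ids[0][0]], float(scores[0][0]))
```
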
**Training Objectives**:
- Contrastive learning: Maximize similarity of co-changed files, minimize similarity of unrelated files
- Instruction following: Generate helpful review comments given code diff context
- Semantic indexing: Create searchable representations of code functions