Update README.md
README.md
CHANGED
@@ -1,3 +1,201 @@
|
|
1 |
-
---
|
2 |
-
license: gemma
|
3 |
-
1 |
+
---
|
2 |
+
license: gemma
|
3 |
+
datasets:
|
4 |
+
- georgeck/hacker-news-discussion-summarization-large
|
5 |
+
language:
|
6 |
+
- en
|
7 |
+
base_model:
|
8 |
+
- google/gemma-3-27b-it
|
9 |
+
pipeline_tag: text-generation
|
10 |
+
library_name: transformers
|
11 |
+
tags:
|
12 |
+
- summarization
|
13 |
+
- hacker-news
|
14 |
+
- hn-companion
|
15 |
+
---
|
16 |
+
# Model Card for Hacker-News-Comments-Summarization-gemma-3-27b-it
|
17 |
+
|
18 |
+
This model specializes in generating concise, informative summaries of Hacker News discussion threads.
|
19 |
+
It analyzes hierarchical comment structures to extract key themes, insights, and perspectives while prioritizing high-quality content based on community engagement.
|
20 |
+
|
21 |
+
## Model Details
|
22 |
+
|
23 |
+
### Model Description
|
24 |
+
|
25 |
+
`Hacker-News-Comments-Summarization-gemma-3-27b-it` is a fine-tuned version of `google/gemma-3-27b-it`, optimized for summarizing structured discussions from Hacker News.
|
26 |
+
It processes hierarchical comment threads to identify main themes, significant viewpoints, and high-quality contributions, organizing them into a structured summary format that highlights community consensus and notable perspectives.
|
27 |
+
|
28 |
+
- **Developed by:** George Chiramattel & Ann Catherine Jose
|
29 |
+
- **Model type:** Fine-tuned Large Language Model (google/gemma-3-27b-it)
|
30 |
+
- **Language(s):** English
|
31 |
+
- **License:** gemma
|
32 |
+
- **Finetuned from model:** google/gemma-3-27b-it
|
33 |
+
|
34 |
+
### Model Sources
|
35 |
+
|
36 |
+
- **Repository:** https://huggingface.co/georgeck/Hacker-News-Comments-Summarization-gemma-3-27b-it
|
37 |
+
- **Dataset Repository:** https://huggingface.co/datasets/georgeck/hacker-news-discussion-summarization-large
|
38 |
+
|
39 |
+
## Uses
|
40 |
+
|
41 |
+
### Direct Use
|
42 |
+
|
43 |
+
This model is designed to generate structured summaries of Hacker News discussion threads. Given a thread with hierarchical comments, it produces a well-organized summary with:
|
44 |
+
|
45 |
+
1. An overview of the discussion
|
46 |
+
2. Main themes and key insights
|
47 |
+
3. Detailed theme breakdowns with notable quotes
|
48 |
+
4. Key perspectives including contrasting viewpoints
|
49 |
+
5. Notable side discussions
|
50 |
+
|
51 |
+
The model is particularly useful for:
|
52 |
+
- Helping users quickly understand the key points of lengthy discussion threads
|
53 |
+
- Identifying community consensus on technical topics
|
54 |
+
- Surfacing expert explanations and valuable insights
|
55 |
+
- Highlighting diverse perspectives on topics
|
56 |
+
|
57 |
+
### Downstream Use
|
58 |
+
|
59 |
+
This model was created for the [Hacker News Companion](https://github.com/levelup-apps/hn-enhancer) project.
|
60 |
+
|
61 |
+
|
62 |
+
## Bias, Risks, and Limitations
|
63 |
+
|
64 |
+
- **Community Bias:** The model may inherit biases present in the Hacker News community, which tends to skew toward certain demographics and perspectives in tech.
|
65 |
+
- **Content Prioritization:** The scoring system prioritizes comments with high engagement, which may not always correlate with factual accuracy or diverse representation.
|
66 |
+
- **Technical Limitations:** The model's performance may degrade with extremely long threads or discussions with unusual structures.
|
67 |
+
- **Limited Context:** The model focuses on the discussion itself and may lack broader context about the topics being discussed.
|
68 |
+
- **Attribution Challenges:** The model attempts to properly attribute quotes, but may occasionally misattribute or improperly format references.
|
69 |
+
- **Content Filtering:** While the model attempts to filter out low-quality or heavily downvoted content, it may not catch all problematic content.
|
70 |
+
|
71 |
+
### Recommendations
|
72 |
+
|
73 |
+
- Users should be aware that the summaries reflect community engagement patterns on Hacker News, which may include inherent biases.
|
74 |
+
- For critical decision-making, users should verify important information from the original source threads.
|
75 |
+
- Review the original discussion when the summary highlights conflicting perspectives to ensure fair representation.
|
76 |
+
- When repurposing summaries, maintain proper attribution to both the model and the original commenters.
|
77 |
+
|
78 |
+
## How to Get Started with the Model
|
79 |
+
|
80 |
+
```python
|
81 |
+
from transformers import AutoModelForCausalLM, AutoTokenizer
|
82 |
+
|
83 |
+
# Load model and tokenizer
|
84 |
+
model_name = "georgeck/Hacker-News-Comments-Summarization-gemma-3-27b-it"
|
85 |
+
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
86 |
+
model = AutoModelForCausalLM.from_pretrained(model_name)
|
87 |
+
|
88 |
+
# Format input with the expected structure
|
89 |
+
post_title = "Your Hacker News post title here"
|
90 |
+
comments = """
|
91 |
+
[1] (score: 800) <replies: 2> {downvotes: 0} user1: This is a top-level comment
|
92 |
+
[1.1] (score: 600) <replies: 1> {downvotes: 0} user2: This is a reply to the first comment
|
93 |
+
[1.1.1] (score: 400) <replies: 0> {downvotes: 0} user3: This is a reply to the reply
|
94 |
+
[2] (score: 700) <replies: 0> {downvotes: 0} user4: This is another top-level comment
|
95 |
+
"""
|
96 |
+
|
97 |
+
prompt = f"""You are HackerNewsCompanion, an AI assistant specialized in summarizing Hacker News discussions.
|
98 |
+
Your task is to provide concise, meaningful summaries that capture the essence of the discussion while prioritizing high quality content.
|
99 |
+
Focus on high-scoring and highly-replied comments, while deprioritizing downvoted comments (EXCLUDE comments with more than 4 downvotes),
|
100 |
+
to identify main themes and key insights.
|
101 |
+
Summarize in markdown format with these sections: Overview, Main Themes & Key Insights, [Theme Titles], Significant Viewpoints, Notable Side Discussions.
|
102 |
+
In 'Main Themes', use bullet points. When quoting comments, include the hierarchy path and attribute the author, example '[1.2] (user1).'
|
103 |
+
|
104 |
+
Provide a concise and insightful summary of the following Hacker News discussion, as per the guidelines you've been given.
|
105 |
+
The goal is to help someone quickly grasp the main discussion points and key perspectives without reading all comments.
|
106 |
+
Please focus on extracting the main themes, significant viewpoints, and high-quality contributions.
|
107 |
+
The post title and comments are separated by three dashed lines:
|
108 |
+
---
|
109 |
+
Post Title:
|
110 |
+
{post_title}
|
111 |
+
---
|
112 |
+
Comments:
|
113 |
+
{comments}
|
114 |
+
---
|
115 |
+
"""
|
116 |
+
|
117 |
+
inputs = tokenizer(prompt, return_tensors="pt")
|
118 |
+
# max_new_tokens limits only the generated summary; max_length would also count the prompt tokens
outputs = model.generate(**inputs, max_new_tokens=1024)
|
119 |
+
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
|
120 |
+
print(summary)
|
121 |
+
```
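
The `comments` string above follows a fixed line format. As a minimal sketch of how such a string could be built from nested data, the helper below flattens a comment tree depth-first. The `Comment` class and `format_thread` function are hypothetical names introduced here for illustration and are not part of the model or the dataset tooling. The `<replies: N>` value is assumed to count all descendants, which is inferred from the example above (where `[1]` shows 2 replies) rather than stated in this card.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    """Hypothetical in-memory comment node (illustration only)."""
    author: str
    text: str
    score: int  # normalized score, expected in the range 1-1000
    downvotes: int = 0
    replies: List["Comment"] = field(default_factory=list)


def count_descendants(comment: Comment) -> int:
    """Count every reply below this comment, at any depth (assumed meaning of 'replies')."""
    return sum(1 + count_descendants(reply) for reply in comment.replies)


def format_thread(comments: List[Comment], prefix: str = "") -> List[str]:
    """Flatten nested comments depth-first into '[path] (score: ...) ...' lines."""
    lines = []
    for index, comment in enumerate(comments, start=1):
        # Build the hierarchy path: "1" for top-level, "1.1" for its first reply, and so on
        path = f"{prefix}.{index}" if prefix else str(index)
        lines.append(
            f"[{path}] (score: {comment.score}) <replies: {count_descendants(comment)}> "
            f"{{downvotes: {comment.downvotes}}} {comment.author}: {comment.text}"
        )
        lines.extend(format_thread(comment.replies, path))
    return lines


thread = [
    Comment("user1", "This is a top-level comment", 800, replies=[
        Comment("user2", "This is a reply to the first comment", 600, replies=[
            Comment("user3", "This is a reply to the reply", 400),
        ]),
    ]),
    Comment("user4", "This is another top-level comment", 700),
]
lines = format_thread(thread)
comments = "\n".join(lines)
print(comments)
```

The resulting `comments` string can be placed into the prompt template in the same position as the hand-written example.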
|
122 |
+
|
123 |
+
## Training Details
|
124 |
+
|
125 |
+
### Training Data
|
126 |
+
|
127 |
+
This model was fine-tuned on the [georgeck/hacker-news-discussion-summarization-large](https://huggingface.co/datasets/georgeck/hacker-news-discussion-summarization-large) dataset, which contains 14,531 records of Hacker News front-page stories and their associated discussion threads.
|
128 |
+
|
129 |
+
The dataset includes:
|
130 |
+
- 6,300 training examples
|
131 |
+
- 700 test examples
|
132 |
+
- Structured representations of hierarchical comment threads
|
133 |
+
- Normalized scoring system that represents comment importance
|
134 |
+
- Comprehensive metadata about posts and comments
|
135 |
+
|
136 |
+
Each example includes a post title and a structured representation of the comment thread, with comment scores, reply counts, and downvotes.
|
137 |
+
|
138 |
+
### Training Procedure
|
139 |
+
|
140 |
+
#### Preprocessing
|
141 |
+
|
142 |
+
- The hierarchical comment structure was preserved using a standardized format
|
143 |
+
- A normalized scoring system (1-1000) was applied to represent each comment's relative importance
|
144 |
+
- Comments were organized to maintain their hierarchical relationships
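
This card does not describe how the 1-1000 score is computed. Purely as an illustration of what a normalized score in that range could look like, the sketch below assumes simple min-max scaling of some raw engagement value per comment. The function name `normalize_scores`, its inputs, and the scaling method are all assumptions, not the procedure used to build the dataset.

```python
from typing import List


def normalize_scores(raw_values: List[float], low: int = 1, high: int = 1000) -> List[int]:
    """Min-max scale raw engagement values into integers in [low, high].

    Assumption: this is an illustrative scheme, not the dataset's actual method.
    """
    if not raw_values:
        return []
    smallest, largest = min(raw_values), max(raw_values)
    if largest == smallest:
        # All comments are equally engaged; arbitrarily give each the top score
        return [high] * len(raw_values)
    span = largest - smallest
    return [round(low + (value - smallest) / span * (high - low)) for value in raw_values]


scores = normalize_scores([3, 42, 17, 120])
print(scores)
```

With this scheme the least-engaged comment in a thread maps to 1 and the most-engaged to 1000, so scores are only comparable within a single thread.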
|
145 |
+
|
146 |
+
The model was trained with [Axolotl](https://axolotl-ai-cloud.github.io/axolotl/) on GPUs from [Runpod](https://www.runpod.io/).
|
147 |
+
|
148 |
+
## Evaluation
|
149 |
+
|
150 |
+
### Testing Data, Factors & Metrics
|
151 |
+
|
152 |
+
#### Testing Data
|
153 |
+
|
154 |
+
The model was evaluated on the test split of the georgeck/hacker-news-discussion-summarization-large dataset.
|
155 |
+
|
156 |
+
#### Factors
|
157 |
+
|
158 |
+
Evaluation considered:
|
159 |
+
- Discussions of varying lengths and complexities
|
160 |
+
- Threads with differing numbers of comment hierarchies
|
161 |
+
- Discussions across various technical domains common on Hacker News
|
162 |
+
- Threads with different levels of controversy (measured by comment downvotes)
|
163 |
+
|
164 |
+
|
165 |
+
## Technical Specifications
|
166 |
+
|
167 |
+
### Model Architecture and Objective
|
168 |
+
|
169 |
+
This model is based on `google/gemma-3-27b-it`, used here as a causal language model.
|
170 |
+
The primary training objective was to generate structured summaries of hierarchical discussion threads that capture the most important themes, perspectives, and insights while maintaining proper attribution.
|
171 |
+
|
172 |
+
The model was trained to specifically understand and process the hierarchical structure of Hacker News comments, including their scoring system, reply counts, and downvote information to appropriately weight content importance.
|
173 |
+
|
174 |
+
|
175 |
+
## Citation
|
176 |
+
|
177 |
+
**BibTeX:**
|
178 |
+
|
179 |
+
```
|
180 |
+
@misc{georgeck2025HackerNewsSummarization,
|
181 |
+
author = {George Chiramattel and Ann Catherine Jose},
|
182 |
+
title = {Hacker-News-Comments-Summarization-gemma-3-27b-it},
|
183 |
+
year = {2025},
|
184 |
+
publisher = {Hugging Face},
|
185 |
+
journal = {Hugging Face Hub},
|
186 |
+
howpublished = {https://huggingface.co/georgeck/Hacker-News-Comments-Summarization-gemma-3-27b-it},
|
187 |
+
}
|
188 |
+
```
|
189 |
+
|
190 |
+
|
191 |
+
## Glossary
|
192 |
+
|
193 |
+
- **Hierarchy Path:** Notation (e.g., [1.2.1]) that shows a comment's position in the discussion tree. A single number indicates a top-level comment, while additional numbers represent deeper levels in the reply chain.
|
194 |
+
- **Score:** A normalized value from 1 to 1000 representing a comment's relative importance based on community engagement.
|
195 |
+
- **Downvotes:** Number of negative votes a comment received, used to filter out low-quality content.
|
196 |
+
- **Thread:** A chain of replies stemming from a single top-level comment.
|
197 |
+
- **Theme:** A recurring topic or perspective identified across multiple comments.
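
To make the hierarchy path, score, and downvote terms concrete, here is a minimal sketch that parses one comment line in the input format shown in this card and derives its depth and parent path. The names `ParsedComment`, `LINE_RE`, and `parse_comment_line` are hypothetical, and the regular expression is inferred from the example lines, so it may need adjusting for real data.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from lines such as:
# [1.1] (score: 600) <replies: 1> {downvotes: 0} user2: This is a reply
LINE_RE = re.compile(
    r"^\[(?P<path>\d+(?:\.\d+)*)\] \(score: (?P<score>\d+)\) "
    r"<replies: (?P<replies>\d+)> \{downvotes: (?P<downvotes>\d+)\} "
    r"(?P<author>[^:]+): (?P<text>.*)$"
)


class ParsedComment(NamedTuple):
    path: str
    depth: int             # 1 for a top-level comment
    parent: Optional[str]  # None for a top-level comment
    score: int
    replies: int
    downvotes: int
    author: str
    text: str


def parse_comment_line(line: str) -> Optional[ParsedComment]:
    """Parse one formatted comment line; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    parts = match["path"].split(".")
    return ParsedComment(
        path=match["path"],
        depth=len(parts),
        parent=".".join(parts[:-1]) or None,
        score=int(match["score"]),
        replies=int(match["replies"]),
        downvotes=int(match["downvotes"]),
        author=match["author"],
        text=match["text"],
    )


parsed = parse_comment_line(
    "[1.1.1] (score: 400) <replies: 0> {downvotes: 0} user3: This is a reply to the reply"
)
print(parsed)
```

A caller could use the `downvotes` field to drop comments with more than 4 downvotes before building the prompt, mirroring the exclusion rule stated in the prompt text.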
|
198 |
+
|
199 |
+
## Model Card Authors
|
200 |
+
|
201 |
+
George Chiramattel, Ann Catherine Jose
|