georgeck committed on
Commit
bed494a
·
verified ·
1 Parent(s): fe97bea

Update README.md

  1. README.md +201 -3
README.md CHANGED
@@ -1,3 +1,201 @@
1
- ---
2
- license: gemma
3
- ---
1
+ ---
2
+ license: gemma
3
+ datasets:
4
+ - georgeck/hacker-news-discussion-summarization-large
5
+ language:
6
+ - en
7
+ base_model:
8
+ - google/gemma-3-27b-it
9
+ pipeline_tag: text-generation
10
+ library_name: transformers
11
+ tags:
12
+ - summarization
13
+ - hacker-news
14
+ - hn-companion
15
+ ---
16
+ # Model Card for Hacker-News-Comments-Summarization-gemma-3-27b-it
17
+
18
+ This model specializes in generating concise, informative summaries of Hacker News discussion threads.
19
+ It analyzes hierarchical comment structures to extract key themes, insights, and perspectives while prioritizing high-quality content based on community engagement.
20
+
21
+ ## Model Details
22
+
23
+ ### Model Description
24
+
25
+ The `Hacker-News-Comments-Summarization-gemma-3-27b-it` is a fine-tuned version of `google/gemma-3-27b-it`, optimized for summarizing structured discussions from Hacker News.
26
+ It processes hierarchical comment threads to identify main themes, significant viewpoints, and high-quality contributions, organizing them into a structured summary format that highlights community consensus and notable perspectives.
27
+
28
+ - **Developed by:** George Chiramattel & Ann Catherine Jose
29
+ - **Model type:** Fine-tuned Large Language Model (google/gemma-3-27b-it)
30
+ - **Language(s):** English
31
+ - **License:** gemma
32
+ - **Finetuned from model:** google/gemma-3-27b-it
33
+
34
+ ### Model Sources
35
+
36
+ - **Repository:** https://huggingface.co/georgeck/Hacker-News-Comments-Summarization-gemma-3-27b-it
37
+ - **Dataset Repository:** https://huggingface.co/datasets/georgeck/hacker-news-discussion-summarization-large
38
+
39
+ ## Uses
40
+
41
+ ### Direct Use
42
+
43
+ This model is designed to generate structured summaries of Hacker News discussion threads. Given a thread with hierarchical comments, it produces a well-organized summary with:
44
+
45
+ 1. An overview of the discussion
46
+ 2. Main themes and key insights
47
+ 3. Detailed theme breakdowns with notable quotes
48
+ 4. Key perspectives including contrasting viewpoints
49
+ 5. Notable side discussions
50
+
51
+ The model is particularly useful for:
52
+ - Helping users quickly understand the key points of lengthy discussion threads
53
+ - Identifying community consensus on technical topics
54
+ - Surfacing expert explanations and valuable insights
55
+ - Highlighting diverse perspectives on topics
56
+
57
+ ### Downstream Use
58
+
59
+ This model was created for the [Hacker News Companion](https://github.com/levelup-apps/hn-enhancer) project.
60
+
61
+
62
+ ## Bias, Risks, and Limitations
63
+
64
+ - **Community Bias:** The model may inherit biases present in the Hacker News community, which tends to skew toward certain demographics and perspectives in tech.
65
+ - **Content Prioritization:** The scoring system prioritizes comments with high engagement, which may not always correlate with factual accuracy or diverse representation.
66
+ - **Technical Limitations:** The model's performance may degrade with extremely long threads or discussions with unusual structures.
67
+ - **Limited Context:** The model focuses on the discussion itself and may lack broader context about the topics being discussed.
68
+ - **Attribution Challenges:** The model attempts to properly attribute quotes, but may occasionally misattribute or improperly format references.
69
+ - **Content Filtering:** While the model attempts to filter out low-quality or heavily downvoted content, it may not catch all problematic content.
70
+
71
+ ### Recommendations
72
+
73
+ - Users should be aware that the summaries reflect community engagement patterns on Hacker News, which may include inherent biases.
74
+ - For critical decision-making, users should verify important information from the original source threads.
75
+ - Review the original discussion when the summary highlights conflicting perspectives to ensure fair representation.
76
+ - When repurposing summaries, maintain proper attribution to both the model and the original commenters.
77
+
78
+ ## How to Get Started with the Model
79
+
80
+ ```python
81
+ from transformers import AutoModelForCausalLM, AutoTokenizer
82
+
83
+ # Load model and tokenizer
84
+ model_name = "georgeck/Hacker-News-Comments-Summarization-gemma-3-27b-it"
85
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
86
+ model = AutoModelForCausalLM.from_pretrained(model_name)
87
+
88
+ # Format input with the expected structure
89
+ post_title = "Your Hacker News post title here"
90
+ comments = """
91
+ [1] (score: 800) <replies: 2> {downvotes: 0} user1: This is a top-level comment
92
+ [1.1] (score: 600) <replies: 1> {downvotes: 0} user2: This is a reply to the first comment
93
+ [1.1.1] (score: 400) <replies: 0> {downvotes: 0} user3: This is a reply to the reply
94
+ [2] (score: 700) <replies: 0> {downvotes: 0} user4: This is another top-level comment
95
+ """
96
+
97
+ prompt = f"""You are HackerNewsCompanion, an AI assistant specialized in summarizing Hacker News discussions.
98
+ Your task is to provide concise, meaningful summaries that capture the essence of the discussion while prioritizing high quality content.
99
+ Focus on high-scoring and highly-replied comments, while deprioritizing downvoted comments (EXCLUDE comments with more than 4 downvotes),
100
+ to identify main themes and key insights.
101
+ Summarize in markdown format with these sections: Overview, Main Themes & Key Insights, [Theme Titles], Significant Viewpoints, Notable Side Discussions.
102
+ In 'Main Themes', use bullet points. When quoting comments, include the hierarchy path and attribute the author, example '[1.2] (user1).'
103
+
104
+ Provide a concise and insightful summary of the following Hacker News discussion, as per the guidelines you've been given.
105
+ The goal is to help someone quickly grasp the main discussion points and key perspectives without reading all comments.
106
+ Please focus on extracting the main themes, significant viewpoints, and high-quality contributions.
107
+ The post title and comments are separated by three dashed lines:
108
+ ---
109
+ Post Title:
110
+ {post_title}
111
+ ---
112
+ Comments:
113
+ {comments}
114
+ ---
115
+ """
116
+
117
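+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
118
+ outputs = model.generate(**inputs, max_new_tokens=1024)  # budget applies to the summary only, not the prompt
119
+ summary = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)  # drop the echoed prompt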
120
+ print(summary)
121
+ ```
122
+
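The `comments` string in the example above is written by hand. A minimal sketch of producing it from a nested comment tree follows. The `Comment` dataclass and its field names are hypothetical, and treating `<replies: N>` as the total number of descendants is an assumption inferred from the sample lines (where `[1]` shows 2 replies but has only one direct child); it is not a documented rule.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    # Hypothetical container; field names are illustrative, not the dataset schema.
    author: str
    text: str
    score: int
    downvotes: int = 0
    replies: List["Comment"] = field(default_factory=list)


def count_descendants(comment: Comment) -> int:
    """Count every reply below this comment (assumed meaning of <replies: N>)."""
    return sum(1 + count_descendants(reply) for reply in comment.replies)


def flatten(comments: List[Comment], prefix: str = "") -> List[str]:
    """Depth-first walk that emits one formatted line per comment."""
    lines = []
    for index, comment in enumerate(comments, start=1):
        # Child paths extend the parent path: 1 -> 1.1 -> 1.1.1
        path = f"{prefix}.{index}" if prefix else str(index)
        lines.append(
            f"[{path}] (score: {comment.score}) "
            f"<replies: {count_descendants(comment)}> "
            f"{{downvotes: {comment.downvotes}}} {comment.author}: {comment.text}"
        )
        lines.extend(flatten(comment.replies, path))
    return lines


thread = [
    Comment("user1", "This is a top-level comment", 800, replies=[
        Comment("user2", "This is a reply to the first comment", 600, replies=[
            Comment("user3", "This is a reply to the reply", 400),
        ]),
    ]),
    Comment("user4", "This is another top-level comment", 700),
]

comments = "\n".join(flatten(thread))
print(comments)
```

Under these assumptions the printed block matches the four sample lines used in the prompt example.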
123
+ ## Training Details
124
+
125
+ ### Training Data
126
+
127
+ This model was fine-tuned on the [georgeck/hacker-news-discussion-summarization-large](https://huggingface.co/datasets/georgeck/hacker-news-discussion-summarization-large) dataset, which contains 14,531 records of Hacker News front-page stories and their associated discussion threads.
128
+
129
+ The dataset includes:
130
+ - 6,300 training examples
131
+ - 700 test examples
132
+ - Structured representations of hierarchical comment threads
133
+ - Normalized scoring system that represents comment importance
134
+ - Comprehensive metadata about posts and comments
135
+
136
+ Each example includes a post title and a structured representation of the comment thread, with information about comment scores, reply counts, and downvotes.
137
+
138
+ ### Training Procedure
139
+
140
+ #### Preprocessing
141
+
142
+ - The hierarchical comment structure was preserved using a standardized format
143
+ - A normalized scoring system (1-1000) was applied to represent each comment's relative importance
144
+ - Comments were organized to maintain their hierarchical relationships
145
+
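The card does not say how the 1-1000 scores were computed. As an illustration only, the sketch below rescales raw engagement values linearly into that range. Min-max scaling, the function name `normalize_scores`, and the tie-handling rule are all assumptions, not the authors' documented method.

```python
from typing import List, Sequence


def normalize_scores(raw_scores: Sequence[float], low: int = 1, high: int = 1000) -> List[int]:
    """Linearly rescale raw values into the inclusive range [low, high].

    Assumption: plain min-max scaling; the actual preprocessing may differ.
    """
    if not raw_scores:
        return []
    smallest, largest = min(raw_scores), max(raw_scores)
    if smallest == largest:
        # Every comment ties, so there is no relative ordering to preserve.
        return [high] * len(raw_scores)
    span = largest - smallest
    return [
        low + round((value - smallest) * (high - low) / span)
        for value in raw_scores
    ]


print(normalize_scores([2, 4, 10]))
```

Whatever the real formula is, the property the rest of the card relies on is the same: the least-engaged comment in a thread maps near 1 and the most-engaged maps to 1000.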
146
+ Training was performed with [Axolotl](https://axolotl-ai-cloud.github.io/axolotl/) on GPUs provided by [Runpod](https://www.runpod.io/).
147
+
148
+ ## Evaluation
149
+
150
+ ### Testing Data, Factors & Metrics
151
+
152
+ #### Testing Data
153
+
154
+ The model was evaluated on the test split of the georgeck/hacker-news-discussion-summarization-large dataset.
155
+
156
+ #### Factors
157
+
158
+ Evaluation considered:
159
+ - Discussions of varying lengths and complexities
160
+ - Threads with differing numbers of comment hierarchies
161
+ - Discussions across various technical domains common on Hacker News
162
+ - Threads with different levels of controversy (measured by comment downvotes)
163
+
164
+
165
+ ## Technical Specifications
166
+
167
+ ### Model Architecture and Objective
168
+
169
+ This model is based on `google/gemma-3-27b-it` and is used here as a causal language model for text generation.
170
+ The primary training objective was to generate structured summaries of hierarchical discussion threads that capture the most important themes, perspectives, and insights while maintaining proper attribution.
171
+
172
+ The model was trained to specifically understand and process the hierarchical structure of Hacker News comments, including their scoring system, reply counts, and downvote information to appropriately weight content importance.
173
+
174
+
175
+ ## Citation
176
+
177
+ **BibTeX:**
178
+
179
+ ```
180
+ @misc{georgeck2025HackerNewsSummarization,
181
+ author = {George Chiramattel and Ann Catherine Jose},
182
+ title = {Hacker-News-Comments-Summarization-gemma-3-27b-it},
183
+ year = {2025},
184
+ publisher = {Hugging Face},
185
+ journal = {Hugging Face Hub},
186
+ howpublished = {https://huggingface.co/georgeck/Hacker-News-Comments-Summarization-gemma-3-27b-it},
187
+ }
188
+ ```
189
+
190
+
191
+ ## Glossary
192
+
193
+ - **Hierarchy Path:** Notation (e.g., [1.2.1]) that shows a comment's position in the discussion tree. A single number indicates a top-level comment, while additional numbers represent deeper levels in the reply chain.
194
+ - **Score:** A normalized value between 1 and 1000 representing a comment's relative importance based on community engagement.
195
+ - **Downvotes:** Number of negative votes a comment received, used to filter out low-quality content.
196
+ - **Thread:** A chain of replies stemming from a single top-level comment.
197
+ - **Theme:** A recurring topic or perspective identified across multiple comments.
198
+
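The glossary terms map directly onto the fields of each input line. The sketch below parses one line into those fields and derives depth from the hierarchy path. The regular expression is inferred from the sample lines in this card, and `ParsedComment`, `parse_comment_line`, and `keep_for_summary` are hypothetical names.

```python
import re
from typing import List, NamedTuple, Optional

# Pattern inferred from the sample input lines; not an official grammar.
LINE_PATTERN = re.compile(
    r"^\[(?P<path>\d+(?:\.\d+)*)\]\s+"
    r"\(score:\s*(?P<score>\d+)\)\s+"
    r"<replies:\s*(?P<replies>\d+)>\s+"
    r"\{downvotes:\s*(?P<downvotes>\d+)\}\s+"
    r"(?P<author>[^:]+):\s*(?P<text>.*)$"
)


class ParsedComment(NamedTuple):
    path: str
    depth: int
    score: int
    replies: int
    downvotes: int
    author: str
    text: str


def parse_comment_line(line: str) -> Optional[ParsedComment]:
    """Return the parsed fields, or None if the line does not match the format."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    path = match["path"]
    return ParsedComment(
        path=path,
        depth=path.count(".") + 1,  # a single number means a top-level comment
        score=int(match["score"]),
        replies=int(match["replies"]),
        downvotes=int(match["downvotes"]),
        author=match["author"],
        text=match["text"],
    )


def keep_for_summary(comments: List[ParsedComment], max_downvotes: int = 4) -> List[ParsedComment]:
    """Mirror the prompt rule that excludes comments with more than 4 downvotes."""
    return [c for c in comments if c.downvotes <= max_downvotes]


parsed = parse_comment_line(
    "[1.1.1] (score: 400) <replies: 0> {downvotes: 0} user3: This is a reply to the reply"
)
print(parsed)
```

A parser like this is also useful for checking that a path quoted in a generated summary, such as `[1.2]`, refers to a comment that exists in the input.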
199
+ ## Model Card Authors
200
+
201
+ George Chiramattel, Ann Catherine Jose