+---
+license: gemma
+datasets:
+- georgeck/hacker-news-discussion-summarization-large
+language:
+- en
+base_model:
+- google/gemma-3-27b-it
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- summarization
+- hacker-news
+- hn-companion
+---
+# Model Card for Hacker-News-Comments-Summarization-gemma-3-27b-it
+This model specializes in generating concise, informative summaries of Hacker News discussion threads.
+It analyzes hierarchical comment structures to extract key themes, insights, and perspectives while prioritizing high-quality content based on community engagement.
+## Model Details
+### Model Description
+`Hacker-News-Comments-Summarization-gemma-3-27b-it` is a fine-tuned version of `google/gemma-3-27b-it`, optimized for summarizing structured discussions from Hacker News.
+It processes hierarchical comment threads to identify main themes, significant viewpoints, and high-quality contributions, organizing them into a structured summary format that highlights community consensus and notable perspectives.
+- **Developed by:** George Chiramattel & Ann Catherine Jose
+- **Model type:** Fine-tuned Large Language Model (google/gemma-3-27b-it)
+- **Language(s):** English
+- **License:** gemma
+- **Finetuned from model:** google/gemma-3-27b-it
+### Model Sources
+- **Repository:** https://huggingface.co/georgeck/Hacker-News-Comments-Summarization-gemma-3-27b-it
+- **Dataset Repository:** https://huggingface.co/datasets/georgeck/hacker-news-discussion-summarization-large
+## Uses
+### Direct Use
+This model is designed to generate structured summaries of Hacker News discussion threads. Given a thread with hierarchical comments, it produces a well-organized summary with:
+1. An overview of the discussion
+2. Main themes and key insights
+3. Detailed theme breakdowns with notable quotes
+4. Key perspectives including contrasting viewpoints
+5. Notable side discussions
+The model is particularly useful for:
+- Helping users quickly understand the key points of lengthy discussion threads
+- Identifying community consensus on technical topics
+- Surfacing expert explanations and valuable insights
+- Highlighting diverse perspectives on topics
+### Downstream Use
+This model was created for the [Hacker News Companion](https://github.com/levelup-apps/hn-enhancer) project.
+## Bias, Risks, and Limitations
+- **Community Bias:** The model may inherit biases present in the Hacker News community, which tends to skew toward certain demographics and perspectives in tech.
+- **Content Prioritization:** The scoring system prioritizes comments with high engagement, which may not always correlate with factual accuracy or diverse representation.
+- **Technical Limitations:** The model's performance may degrade with extremely long threads or discussions with unusual structures.
+- **Limited Context:** The model focuses on the discussion itself and may lack broader context about the topics being discussed.
+- **Attribution Challenges:** The model attempts to properly attribute quotes, but may occasionally misattribute or improperly format references.
+- **Content Filtering:** While the model attempts to filter out low-quality or heavily downvoted content, it may not catch all problematic content.
+### Recommendations
+- Users should be aware that the summaries reflect community engagement patterns on Hacker News, which may include inherent biases.
+- For critical decision-making, users should verify important information from the original source threads.
+- Review the original discussion when the summary highlights conflicting perspectives to ensure fair representation.
+- When repurposing summaries, maintain proper attribution to both the model and the original commenters.
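One way to act on the verification advice above is to check every quoted reference in a summary against the input thread. The sketch below assumes the summary cites comments as `[path] (author)`, as the prompt in this card requests, and that thread lines follow the input format shown in this card. `find_bad_citations` and both regular expressions are illustrative names, not part of the model or of the Hacker News Companion code.

```python
import re

# Matches one input line, e.g.
# "[1.1] (score: 600) <replies: 1> {downvotes: 0} user2: some text"
THREAD_LINE = re.compile(
    r"^\[(?P<path>\d+(?:\.\d+)*)\] \(score: \d+\) <replies: \d+> "
    r"\{downvotes: \d+\} (?P<author>[^:]+): "
)
# Matches a citation inside a summary, e.g. "[1.1] (user2)"
CITATION = re.compile(r"\[(\d+(?:\.\d+)*)\] \(([^)]+)\)")


def find_bad_citations(thread: str, summary: str) -> list[tuple[str, str]]:
    """Return (path, author) citations that match no comment in the thread."""
    authors = {}
    for line in thread.splitlines():
        match = THREAD_LINE.match(line.strip())
        if match:
            authors[match["path"]] = match["author"]
    return [
        (path, author)
        for path, author in CITATION.findall(summary)
        if authors.get(path) != author
    ]
```

An empty result means every cited path exists and is attributed to the right author; it does not confirm that the quoted text itself is accurate.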
+## How to Get Started with the Model
+```python
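+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# Load model and tokenizer
+model_name = "georgeck/Hacker-News-Comments-Summarization-gemma-3-27b-it"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+# A 27B model is large: load it in bfloat16 and spread it across the available devices
+# (device_map="auto" requires the `accelerate` package)
+model = AutoModelForCausalLM.from_pretrained(
+    model_name, torch_dtype=torch.bfloat16, device_map="auto"
+)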
+# Format input with the expected structure
+post_title = "Your Hacker News post title here"
+comments = """
+[1] (score: 800) <replies: 2> {downvotes: 0} user1: This is a top-level comment
+[1.1] (score: 600) <replies: 1> {downvotes: 0} user2: This is a reply to the first comment
+[1.1.1] (score: 400) <replies: 0> {downvotes: 0} user3: This is a reply to the reply
+[2] (score: 700) <replies: 0> {downvotes: 0} user4: This is another top-level comment
+"""
+prompt = f"""You are HackerNewsCompanion, an AI assistant specialized in summarizing Hacker News discussions.
+Your task is to provide concise, meaningful summaries that capture the essence of the discussion while prioritizing high quality content.
+Focus on high-scoring and highly-replied comments, while deprioritizing downvoted comments (EXCLUDE comments with more than 4 downvotes),
+to identify main themes and key insights.
+Summarize in markdown format with these sections: Overview, Main Themes & Key Insights, [Theme Titles], Significant Viewpoints, Notable Side Discussions.
+In 'Main Themes', use bullet points. When quoting comments, include the hierarchy path and attribute the author, example '[1.2] (user1).'
+Provide a concise and insightful summary of the following Hacker News discussion, as per the guidelines you've been given.
+The goal is to help someone quickly grasp the main discussion points and key perspectives without reading all comments.
+Please focus on extracting the main themes, significant viewpoints, and high-quality contributions.
+The post title and comments are separated by three dashed lines:
+---
+Post Title:
+{post_title}
+---
+Comments:
+{comments}
+---
+"""
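+# Tokenize the prompt and move the tensors to the model's device
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+# max_new_tokens bounds only the generated summary; max_length would also count the prompt
+outputs = model.generate(**inputs, max_new_tokens=1024)
+# Decode only the newly generated tokens, not the echoed prompt
+summary = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+print(summary)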
+```
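The `comments` string in the example above is written by hand. The sketch below shows one way to produce the same format from a nested comment tree. The `Comment` dataclass, `count_descendants`, and `format_thread` are hypothetical helpers, and scores and downvotes are assumed to be computed already. The sample input shows `<replies: 2>` for a comment with one direct reply and one nested reply, so this sketch assumes the count covers all descendants.

```python
from dataclasses import dataclass, field


@dataclass
class Comment:
    author: str
    text: str
    score: int  # normalized score, assumed to be precomputed
    downvotes: int = 0
    replies: list["Comment"] = field(default_factory=list)


def count_descendants(comment: Comment) -> int:
    """Count every reply below a comment, at any depth (an assumption, see above)."""
    return sum(1 + count_descendants(reply) for reply in comment.replies)


def format_thread(comments: list[Comment], prefix: str = "") -> list[str]:
    """Flatten a comment tree depth-first into '[path] (score: ...) ...' lines."""
    lines = []
    for index, comment in enumerate(comments, start=1):
        # Top-level comments get "1", "2", ...; replies extend the parent's path
        path = f"{prefix}.{index}" if prefix else str(index)
        lines.append(
            f"[{path}] (score: {comment.score}) "
            f"<replies: {count_descendants(comment)}> "
            f"{{downvotes: {comment.downvotes}}} {comment.author}: {comment.text}"
        )
        lines.extend(format_thread(comment.replies, prefix=path))
    return lines
```

Joining the result with `"\n".join(format_thread(tree))` gives a string that can be used as `comments` in the prompt.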
+## Training Details
+### Training Data
+This model was fine-tuned on the [georgeck/hacker-news-discussion-summarization-large](https://huggingface.co/datasets/georgeck/hacker-news-discussion-summarization-large) dataset, which contains 14,531 records of Hacker News front-page stories and their associated discussion threads.
+The dataset includes:
+- 6,300 training examples
+- 700 test examples
+- Structured representations of hierarchical comment threads
+- Normalized scoring system that represents comment importance
+- Comprehensive metadata about posts and comments
+Each example includes a post title and a structured representation of the comment thread, with information about comment scores, reply counts, and downvotes.
+### Training Procedure
+#### Preprocessing
+- The hierarchical comment structure was preserved using a standardized format
+- A normalized scoring system (1-1000) was applied to represent each comment's relative importance
+- Comments were organized to maintain their hierarchical relationships
+Training was done with [Axolotl](https://axolotl-ai-cloud.github.io/axolotl/) on GPUs from [Runpod](https://www.runpod.io/).
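This card does not describe how the 1-1000 score from the preprocessing steps is derived. As an illustration only, one simple option is min-max scaling of a raw engagement value per thread. The function name `normalize_scores` and the choice of linear scaling are assumptions, not the authors' method.

```python
def normalize_scores(raw: list[float], low: int = 1, high: int = 1000) -> list[int]:
    """Linearly rescale raw engagement values into the integer range [low, high]."""
    if not raw:
        return []
    smallest, largest = min(raw), max(raw)
    if largest == smallest:
        # All comments rank equally; give each the top score
        return [high] * len(raw)
    span = largest - smallest
    return [round(low + (value - smallest) * (high - low) / span) for value in raw]
```

With this scheme the least engaged comment in a thread always maps to 1 and the most engaged to 1000, so scores are comparable within a thread but not across threads.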
+## Evaluation
+### Testing Data, Factors & Metrics
+#### Testing Data
+The model was evaluated on the test split of the georgeck/hacker-news-discussion-summarization-large dataset.
+#### Factors
+Evaluation considered:
+- Discussions of varying lengths and complexities
+- Threads with differing numbers of comment hierarchies
+- Discussions across various technical domains common on Hacker News
+- Threads with different levels of controversy (measured by comment downvotes)
+## Technical Specifications
+### Model Architecture and Objective
+This model is based on `google/gemma-3-27b-it`, an instruction-tuned causal language model.
+The primary training objective was to generate structured summaries of hierarchical discussion threads that capture the most important themes, perspectives, and insights while maintaining proper attribution.
+The model was trained to specifically understand and process the hierarchical structure of Hacker News comments, including their scoring system, reply counts, and downvote information to appropriately weight content importance.
+## Citation
+**BibTeX:**
+```
+@misc{georgeck2025HackerNewsSummarization,
+  author = {George Chiramattel and Ann Catherine Jose},
+  title = {Hacker-News-Comments-Summarization-gemma-3-27b-it},
+  year = {2025},
+  publisher = {Hugging Face},
+  journal = {Hugging Face Hub},
+  howpublished = {https://huggingface.co/georgeck/Hacker-News-Comments-Summarization-gemma-3-27b-it},
+}
+```
+## Glossary
+- **Hierarchy Path:** Notation (e.g., [1.2.1]) that shows a comment's position in the discussion tree. A single number indicates a top-level comment, while additional numbers represent deeper levels in the reply chain.
+- **Score:** A normalized value from 1 to 1000 representing a comment's relative importance based on community engagement.
+- **Downvotes:** Number of negative votes a comment received, used to filter out low-quality content.
+- **Thread:** A chain of replies stemming from a single top-level comment.
+- **Theme:** A recurring topic or perspective identified across multiple comments.
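The hierarchy path notation defined above can be handled with plain string operations. The helpers below are a minimal sketch with hypothetical names; they assume paths are dot-separated positive integers without the surrounding brackets.

```python
from typing import Optional


def path_depth(path: str) -> int:
    """Return 1 for a top-level comment, 2 for a direct reply, and so on."""
    return len(path.split("."))


def parent_path(path: str) -> Optional[str]:
    """Return the parent's hierarchy path, or None for a top-level comment."""
    head, separator, _ = path.rpartition(".")
    return head if separator else None


def thread_root(path: str) -> str:
    """Return the path of the top-level comment that starts this thread."""
    return path.split(".")[0]
```

For example, `parent_path("1.2.1")` yields `"1.2"` and `thread_root("1.2.1")` yields `"1"`, which makes it easy to group comments by thread.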
+## Model Card Authors
+George Chiramattel, Ann Catherine Jose