Social Post Explorers

Recent Activity

social-post-explorers's activity

nyuuzyou 
posted an update about 22 hours ago
🌐 Fandom.com Community Dataset - nyuuzyou/fandom

A comprehensive collection of 7.04M wiki pages from Fandom.com communities featuring:
- Full article content and metadata from current pages
- Rich structural data including templates, categories, and links
- Multilingual content across 40+ languages
- Complete metadata including titles and section structure

Content is available under the CC-BY-SA 3.0 license, which allows reuse with attribution and share-alike requirements.

Key contents:
- 7.04M wiki articles with full text
- Metadata including templates, categories, sections
- Internal and external link information
- Multi-language support including major world languages

The dataset provides a valuable resource for:
- Text generation and classification tasks
- Topic modeling and categorization
- Cross-language information retrieval
- Wiki structure analysis

All content comes from public Fandom.com community wikis as of February 2025 and maintains original CC-BY-SA 3.0 licensing.
nyuuzyou 
posted an update 9 days ago
🎓 Educational Text Collection - nyuuzyou/edutexts

A collection of 1.38M educational texts featuring:
- 1.33M educational presentations with full slide content
- 47K academic documents with complete text
- Multilingual content (Russian, Ukrainian, English)
- Full metadata including titles and descriptions

All content is available under the CC0 license, allowing unrestricted use including commercial applications.
singhsidhukuldeep 
posted an update 12 days ago
Fascinating deep dive into Swiggy's Hermes - their in-house Text-to-SQL solution that's revolutionizing data accessibility!

Hermes enables natural language querying within Slack, generating and executing SQL queries with an impressive <2 minute turnaround time. The system architecture is particularly intriguing:

Technical Implementation:
- Built on GPT-4 with a Knowledge Base + RAG approach for Swiggy-specific context
- AWS Lambda middleware handles communication between Slack UI and the Gen AI model
- Databricks jobs orchestrate query generation and execution

Under the Hood:
The pipeline employs a sophisticated multi-stage approach:
1. Metrics retrieval using embedding-based vector lookup
2. Table/column identification through metadata descriptions
3. Few-shot SQL retrieval with vector-based search
4. Structured prompt creation with data snapshots
5. Query validation with automated error correction
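
Stages 3 and 4 above (few-shot SQL retrieval by vector search, then structured prompt creation) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Swiggy's implementation: the toy 3-dimensional embeddings, the `FEW_SHOT` store, the table descriptions, and the helper names `top_k` and `build_prompt` are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, store, k=2):
    """Return the k entries of `store` whose 'vec' is closest to query_vec."""
    return sorted(store, key=lambda e: cosine(query_vec, e["vec"]), reverse=True)[:k]

# Hypothetical few-shot store: (question, SQL) pairs with toy embeddings.
FEW_SHOT = [
    {"vec": [1.0, 0.0, 0.0], "question": "orders per city last week",
     "sql": "SELECT city, COUNT(*) FROM orders GROUP BY city"},
    {"vec": [0.0, 1.0, 0.0], "question": "average delivery time",
     "sql": "SELECT AVG(delivery_minutes) FROM deliveries"},
    {"vec": [0.9, 0.1, 0.0], "question": "orders per restaurant",
     "sql": "SELECT restaurant_id, COUNT(*) FROM orders GROUP BY restaurant_id"},
]

def build_prompt(question, tables, examples):
    """Assemble a structured prompt from table metadata and few-shot SQL."""
    lines = ["Tables:"]
    lines += [f"- {name}: {desc}" for name, desc in tables.items()]
    lines.append("Examples:")
    for ex in examples:
        lines.append(f"Q: {ex['question']}\nSQL: {ex['sql']}")
    lines.append(f"Q: {question}\nSQL:")
    return "\n".join(lines)

# The query embedding would come from an embedding model; here it is made up.
examples = top_k([1.0, 0.05, 0.0], FEW_SHOT, k=2)
prompt = build_prompt("orders per city today",
                      {"orders": "one row per order"}, examples)
```

In a real system the prompt would then go to the LLM, and the generated SQL would be validated and retried on error (stage 5), which this sketch leaves out.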

Architecture Highlights:
- Compartmentalized by business units (charters) for better context management
- Snowflake integration with seamless authentication
- Automated metadata onboarding with QA validation
- Real-time feedback collection via Slack

What's particularly impressive is how they've solved the data context challenge through charter-specific implementations, significantly improving query accuracy for well-defined metadata sets.

Kudos to the Swiggy team for democratizing data access across their organization. This is a brilliant example of practical AI implementation solving real business challenges.
singhsidhukuldeep 
posted an update 15 days ago
Exciting breakthrough in neural search technology!

Researchers from ETH Zurich, UC Berkeley, and Stanford University have introduced WARP - a groundbreaking retrieval engine that achieves remarkable performance improvements in multi-vector search.

WARP brings three major innovations to the table:
- A novel WARP SELECT algorithm for dynamic similarity estimation
- Implicit decompression during retrieval operations
- An optimized two-stage reduction process for efficient scoring

The results are stunning - WARP delivers a 41x reduction in query latency compared to existing XTR implementations, bringing response times down from 6+ seconds to just 171 milliseconds on single-threaded execution. It also achieves a 3x speedup over the current state-of-the-art ColBERTv2 PLAID engine while maintaining retrieval quality.

Under the hood, WARP uses highly optimized C++ kernels and specialized inference runtimes. It employs a compression strategy based on k-means clustering and quantized residual vectors, reducing index sizes by 2-4x compared to baseline implementations.
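
To make the compression idea concrete, here is a minimal pure-Python sketch of centroid-plus-quantized-residual encoding. It is not WARP's code: the centroids are hard-coded instead of learned with k-means, and the uniform buckets over a fixed range (`lo`, `hi`) are a simplifying assumption made for illustration.

```python
def nearest_centroid(vec, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def dist2(c):
        return sum((v - x) ** 2 for v, x in zip(vec, c))
    return min(range(len(centroids)), key=lambda i: dist2(centroids[i]))

def encode(vec, centroids, bits=2, lo=-0.5, hi=0.5):
    """Compress vec to (centroid id, per-dimension residual bucket codes)."""
    cid = nearest_centroid(vec, centroids)
    levels = (1 << bits) - 1  # highest bucket code, e.g. 3 for 2 bits
    codes = []
    for v, c in zip(vec, centroids[cid]):
        r = min(max(v - c, lo), hi)  # clamp the residual to [lo, hi]
        codes.append(round((r - lo) / (hi - lo) * levels))
    return cid, codes

def decode(cid, codes, centroids, bits=2, lo=-0.5, hi=0.5):
    """Approximate the original vector from its centroid and bucket codes."""
    levels = (1 << bits) - 1
    return [c + lo + code / levels * (hi - lo)
            for c, code in zip(centroids[cid], codes)]

# Toy centroids standing in for the output of k-means (an assumption).
centroids = [[0.0, 0.0], [1.0, 1.0]]
vec = [0.9, 1.2]
cid, codes = encode(vec, centroids)
approx = decode(cid, codes, centroids)
```

Each stored vector shrinks to one centroid id plus a few bits per dimension, at the cost of a bounded reconstruction error (at most half a bucket width for residuals inside the clamped range).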

The engine shows excellent scalability, with latency scaling with the square root of dataset size and effective parallelization across multiple CPU threads - achieving 3.1x speedup with 16 threads.

This work represents a significant step forward in making neural search more practical for production environments. The researchers have made the implementation publicly available for the community.
singhsidhukuldeep 
posted an update 16 days ago
Exciting Research Alert: Remining Hard Negatives for Domain Adaptation in Dense Retrieval

Researchers from the University of Amsterdam have introduced R-GPL, an innovative approach to improve domain adaptation in dense retrievers. The technique enhances the existing GPL (Generative Pseudo Labeling) framework by continuously remining hard negatives during the training process.

Key Technical Insights:
- The method leverages domain-adapted models to mine higher quality hard negatives incrementally every 30,000 steps during training
- Uses MarginMSE loss for training with data triplets (Query, Relevant Doc, Hard Negative Doc)
- Implements mean pooling over hidden states for dense representations with a 350-token sequence length
- Combines query generation with pseudo-labels from cross-encoder models
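
For readers unfamiliar with MarginMSE, the loss compares score margins and not raw scores: the student (dense retriever) is trained so that its margin between the relevant and the hard-negative document matches the teacher (cross-encoder) margin. A minimal sketch, with made-up scores and a hypothetical function name:

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins.

    Each argument is a list with one score per (query, positive, negative)
    triplet: *_pos is the score of the relevant document, *_neg the score
    of the hard negative.
    """
    errors = [
        ((sp - sn) - (tp - tn)) ** 2
        for sp, sn, tp, tn in zip(student_pos, student_neg,
                                  teacher_pos, teacher_neg)
    ]
    return sum(errors) / len(errors)

# Two toy triplets; the numbers are illustrative only.
loss = margin_mse(
    student_pos=[5.0, 4.0], student_neg=[2.0, 3.5],
    teacher_pos=[9.0, 7.0], teacher_neg=[2.0, 6.5],
)
```

Here the first triplet contributes (3 - 7)^2 = 16 and the second contributes 0, so the loss is 8.0. Because only margins are matched, the student and teacher scores do not need to live on the same scale.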

Performance Highlights:
- Outperforms baseline GPL in 13/14 BEIR datasets
- Shows significant improvements in 9/12 LoTTE datasets
- Achieves a remarkable 4.4-point gain on the TREC-COVID dataset

Under the Hood:
The system continuously refreshes hard negatives using the model undergoing domain adaptation. This creates a feedback loop where the model gets better at identifying relevant documents in the target domain, leading to higher quality training signals.
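
The feedback loop can be sketched as a training loop that re-mines negatives on a fixed schedule. This is a toy illustration, not the authors' code: the word-overlap scorer stands in for the dense retriever being adapted, the no-op `train_step` stands in for a MarginMSE update, and the interval of 3 steps stands in for the 30,000 steps mentioned above.

```python
def mine_hard_negatives(score, query, positive, corpus, k=2):
    """Top-k highest-scoring documents for `query` that are not the positive."""
    candidates = [d for d in corpus if d != positive]
    return sorted(candidates, key=lambda d: score(query, d), reverse=True)[:k]

def train_with_remining(score_factory, train_step, pairs, corpus,
                        total_steps, remine_every):
    """Run `total_steps` updates, refreshing negatives every `remine_every` steps."""
    negatives = {}
    remine_log = []
    for step in range(total_steps):
        if step % remine_every == 0:
            score = score_factory()  # scorer backed by the current model state
            negatives = {q: mine_hard_negatives(score, q, p, corpus)
                         for q, p in pairs}
            remine_log.append(step)
        for q, p in pairs:
            train_step(q, p, negatives[q])
    return negatives, remine_log

def overlap(query, doc):
    """Toy relevance score: number of shared words (stand-in for a retriever)."""
    return len(set(query.split()) & set(doc.split()))

corpus = ["covid vaccine trial", "covid symptoms list",
          "stock market news", "vaccine side effects"]
pairs = [("covid vaccine", "covid vaccine trial")]  # (generated query, positive)
steps_seen = []
negatives, remine_log = train_with_remining(
    score_factory=lambda: overlap,
    train_step=lambda q, p, negs: steps_seen.append((q, tuple(negs))),
    pairs=pairs, corpus=corpus, total_steps=7, remine_every=3,
)
```

In the real setting `score_factory` would wrap the current checkpoint, so each refresh yields negatives that are harder for the model as it stands at that point in training.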

Analysis reveals that domain-adapted models retrieve documents with higher relevancy scores in top-100 hard negatives compared to baseline approaches. This confirms the model's enhanced capability to identify challenging but informative training examples.

This research opens new possibilities for efficient dense retrieval systems that can adapt to different domains without requiring labeled training data.
nyuuzyou 
posted an update 16 days ago
📱 UI Navigation Corpus - teleren/ui-navigation-corpus

A comprehensive collection of mobile and web UI elements created by a new member of the Hugging Face community, @teleren. I'm glad that I was able to provide a little help together with @its5Q to get this dataset published.

This dataset contains:
- Screenshots and recordings of mobile (iOS/Android) and web interfaces
- UI navigation annotations and metadata
- Screen categorization tags and text extractions
- Navigation paths and screen relationships
- Version control for UI imagery

Perfect for training UI navigation agents and understanding interface patterns. The dataset provides detailed annotations linking screens, sections, and navigation flows together.
singhsidhukuldeep 
posted an update 17 days ago
Exciting breakthrough in Streaming Recommendation Systems! @BytedanceTalk researchers have developed "Long-Term Interest Clock" (LIC), a revolutionary approach to understand user preferences throughout the day.

>> Technical Innovation
The system introduces two groundbreaking modules:
- Clock-based General Search Unit (Clock-GSU): Intelligently retrieves relevant user behaviors by analyzing time patterns and content similarity
- Clock-based Exact Search Unit (Clock-ESU): Employs a time-gap-aware attention mechanism to precisely model user interests

>> Key Advantages
LIC addresses critical limitations of existing systems by:
- Providing fine-grained time perception instead of discrete hour-based recommendations
- Analyzing long-term user behavior patterns rather than just short-term interactions
- Operating at item-level granularity versus broad category-level interests

>> Real-World Impact
Already deployed in the Douyin Music App, the system has demonstrated remarkable results:
- 0.122% improvement in user active days
- Significant boost in engagement metrics including likes and play rates
- Enhanced user satisfaction with reduced dislike rates

>> Under the Hood
The system processes user behavior sequences spanning an entire year, utilizing multi-head attention mechanisms and sophisticated time-gap calculations to understand user preferences. It pre-computes embeddings stored in parameter servers for real-time performance, making it highly scalable for production environments.
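
One way to picture time-gap-aware attention is as ordinary softmax attention whose logits are penalized by how far a past behavior's time of day is from the current time. The sketch below is an assumption-laden illustration, not the paper's formulation: the linear penalty, the `gap_penalty` weight, the single attention head, and the toy embeddings are all hypothetical.

```python
import math

def clock_gap_hours(t1, t2):
    """Circular distance between two times of day, in hours (0 to 12)."""
    diff = abs(t1 - t2) % 24
    return min(diff, 24 - diff)

def time_gap_attention(query_vec, now_hour, behaviors, gap_penalty=0.5):
    """Softmax weights over past behaviors, biased toward similar clock times."""
    logits = []
    for b in behaviors:
        sim = sum(q * x for q, x in zip(query_vec, b["vec"]))
        logits.append(sim - gap_penalty * clock_gap_hours(now_hour, b["hour"]))
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Three past plays with identical content embeddings but different clock times.
behaviors = [
    {"vec": [1.0, 0.0], "hour": 23.5},
    {"vec": [1.0, 0.0], "hour": 12.0},
    {"vec": [1.0, 0.0], "hour": 1.0},
]
weights = time_gap_attention([1.0, 0.0], now_hour=0.5, behaviors=behaviors)
```

With the content similarity held equal, the behavior at 01:00 gets the largest weight for a request at 00:30, the one at 23:30 comes next because the gap wraps around midnight, and the midday behavior is nearly ignored.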

This innovation marks a significant step forward in personalized content delivery, especially for streaming platforms where user preferences vary throughout the day. The research has been accepted for presentation at WWW '25, Sydney.
singhsidhukuldeep 
posted an update 18 days ago
Exciting Research Alert: Revolutionizing Complex Information Retrieval!

A groundbreaking paper from researchers at MIT, AWS AI, and UPenn introduces ARM (Alignment-Oriented LLM-based Retrieval Method), a novel approach to tackle complex information retrieval challenges.

>> Key Innovations

Information Alignment
The method first decomposes queries into keywords and aligns them with available data using both BM25 and embedding similarity, ensuring comprehensive coverage of information needs.
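
As a rough illustration of matching keywords against data objects with both BM25 and embedding similarity, the sketch below scores a few toy table descriptions lexically and semantically and fuses the two. It is not ARM's implementation: the weighted-sum fusion, the `alpha` parameter, the max-normalization of BM25 scores, and the hand-written 2-d embeddings are assumptions made for the example.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for the given query terms."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        s = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            s += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(s)
    return scores

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_rank(query_terms, query_vec, docs, doc_vecs, alpha=0.5):
    """Rank document indices by a weighted sum of BM25 and cosine scores."""
    lexical = bm25_scores(query_terms, docs)
    top = max(lexical) or 1.0  # avoid dividing by zero when nothing matches
    fused = [alpha * (lex / top) + (1 - alpha) * cosine(query_vec, v)
             for lex, v in zip(lexical, doc_vecs)]
    return sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)

docs = ["player stats by season", "team roster and coach", "season ticket prices"]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # toy embeddings
lexical = bm25_scores(["player", "season"], docs)
ranking = hybrid_rank(["player", "season"], [1.0, 0.0], docs, doc_vecs)
```

Using both signals lets an object surface either through exact keyword overlap or through semantic closeness, which is the coverage argument the post makes.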

Structure Alignment
ARM employs a sophisticated mixed-integer programming solver to identify connections between data objects, exploring relationships beyond simple semantic matching.

Self-Verification
The system includes a unique self-verification mechanism where the LLM evaluates and aggregates results from multiple retrieval paths, ensuring accuracy and completeness.

>> Performance Highlights

The results are impressive:
- Outperforms standard RAG by up to 5.2 points in execution accuracy on the Bird dataset
- Achieves 19.3 points higher F1 scores compared to existing approaches on OTT-QA
- Reduces the number of required LLM calls while maintaining superior retrieval quality

>> Technical Implementation

The system uses a three-step process:
1. N-gram indexing and embedding computation for all data objects
2. Constrained beam decoding for information alignment
3. Mixed-integer programming optimization for structure exploration

This research represents a significant step forward in making complex information retrieval more efficient and accurate. The team's work demonstrates how combining traditional optimization techniques with modern LLM capabilities can solve challenging retrieval problems.