Remember when you had a few hundred rows of data that could easily be opened in Excel?
Well, we are far from that with billion-parameter LLMs trained on trillions of tokens.
@Microsoft wants to bridge that gap with "SpreadsheetLLM": Encoding Spreadsheets for Large Language Models.
While it sounds simple, spreadsheets, with their extensive two-dimensional grids, various layouts, and diverse formatting options, present notable challenges for large language models (LLMs).
They initially propose a vanilla serialization approach that incorporates cell addresses, values, and formats. However, this approach is limited by LLMs' token constraints, making it impractical for most applications.
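To make that bottleneck concrete, here is a minimal Python sketch of what such a vanilla serialization could look like. The exact format, function name, and sample data are my assumptions for illustration, not the paper's:

```python
# Hypothetical sketch of vanilla spreadsheet serialization (not the
# paper's exact format): every cell is spelled out individually, so
# token count grows linearly with the number of cells.
def vanilla_serialize(cells: dict[str, tuple[str, str]]) -> str:
    """cells maps a cell address like 'A1' to (value, number_format)."""
    return "\n".join(f"{addr},{val},{fmt}" for addr, (val, fmt) in cells.items())

sheet = {
    "A1": ("Year", "text"), "B1": ("Revenue", "text"),
    "A2": ("2023", "int"),  "B2": ("1204.50", "#,##0.00"),
}
print(vanilla_serialize(sheet))
# A1,Year,text
# B1,Revenue,text
# A2,2023,int
# B2,1204.50,#,##0.00
```

A large sheet serialized this way quickly blows past an LLM's context window, which is exactly what SheetCompressor targets.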
The solution: SheetCompressor, an encoding framework that compresses spreadsheets effectively for LLMs.
It comprises three modules:
1️⃣ Structural-anchor-based compression
2️⃣ Inverse index translation (sketched below)
3️⃣ Data-format-aware aggregation
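Module 2 is the easiest to illustrate: flip the address → value layout into a value → addresses inverted index, so repeated values are stored once and empty cells cost nothing. A minimal sketch under that reading of the paper; the dict-based representation and names are mine:

```python
from collections import defaultdict

# Hedged sketch of inverse index translation: deduplicate identical
# cell values and drop empty cells by mapping value -> list of addresses.
def inverse_index(cells: dict[str, str]) -> dict[str, list[str]]:
    index: dict[str, list[str]] = defaultdict(list)
    for addr, value in cells.items():
        if value == "":                # empty cells are simply omitted
            continue
        index[value].append(addr)      # identical values share one entry
    return dict(index)

sheet = {"A1": "Q1", "A2": "Q1", "A3": "Q2", "B1": "100", "B2": "100", "B3": ""}
print(inverse_index(sheet))
# {'Q1': ['A1', 'A2'], 'Q2': ['A3'], '100': ['B1', 'B2']}
```

The translation stays lossless for non-empty cells, and the savings grow with how repetitive the sheet is.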
It significantly improves performance on the spreadsheet table detection task, outperforming the vanilla approach by 25.6% in GPT-4's in-context learning setting.
Sounds exciting, but sadly no code, models, or datasets have been released.
Moreover, there is a lot of research on encoding 2D positional embeddings, and this work has not been benchmarked against it!
Paper: SpreadsheetLLM: Encoding Spreadsheets for Large Language Models (2407.09025)