Remember when you had a few hundred rows of data that could easily be opened in Excel?
Well, we are far from that with billion-parameter LLMs trained on trillions of tokens.
@Microsoft wants to bridge that gap with "SpreadsheetLLM": Encoding Spreadsheets for Large Language Models.
While it sounds simple, spreadsheets, with their extensive two-dimensional grids, various layouts, and diverse formatting options, present notable challenges for large language models (LLMs).
They initially propose a vanilla serialization approach that incorporates cell addresses, values, and formats. However, this approach is limited by LLMs' token constraints, making it impractical for most applications.
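To make that bottleneck concrete, here is a minimal Python sketch of what such a vanilla serialization could look like. The exact format, function name, and sample data are my assumptions for illustration, not the paper's:

```python
# Hypothetical sketch of vanilla spreadsheet serialization (not the
# paper's exact format): every cell is spelled out individually, so
# token count grows linearly with the number of cells.
def vanilla_serialize(cells: dict[str, tuple[str, str]]) -> str:
    """cells maps a cell address like 'A1' to (value, number_format)."""
    return "\n".join(f"{addr},{val},{fmt}" for addr, (val, fmt) in cells.items())

sheet = {
    "A1": ("Year", "text"), "B1": ("Revenue", "text"),
    "A2": ("2023", "int"),  "B2": ("1204.50", "#,##0.00"),
}
print(vanilla_serialize(sheet))
# A1,Year,text
# B1,Revenue,text
# A2,2023,int
# B2,1204.50,#,##0.00
```

A large sheet serialized this way quickly blows past an LLM's context window, which is exactly what SheetCompressor targets.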
The solution: SheetCompressor, an encoding framework that compresses spreadsheets effectively for LLMs.
It comprises three modules:
1️⃣ Structural-anchor-based compression
2️⃣ Inverse index translation (sketched below)
3️⃣ Data-format-aware aggregation
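Module 2 is the easiest to illustrate: flip the address → value layout into a value → addresses inverted index, so repeated values are stored once and empty cells cost nothing. A minimal sketch under that reading of the paper; the dict-based representation and names are mine:

```python
from collections import defaultdict

# Hedged sketch of inverse index translation: deduplicate identical
# cell values and drop empty cells by mapping value -> list of addresses.
def inverse_index(cells: dict[str, str]) -> dict[str, list[str]]:
    index: dict[str, list[str]] = defaultdict(list)
    for addr, value in cells.items():
        if value == "":                # empty cells are simply omitted
            continue
        index[value].append(addr)      # identical values share one entry
    return dict(index)

sheet = {"A1": "Q1", "A2": "Q1", "A3": "Q2", "B1": "100", "B2": "100", "B3": ""}
print(inverse_index(sheet))
# {'Q1': ['A1', 'A2'], 'Q2': ['A3'], '100': ['B1', 'B2']}
```

The translation stays lossless for non-empty cells, and the savings grow with how repetitive the sheet is.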
It significantly improves performance on the spreadsheet table detection task, outperforming the vanilla approach by 25.6% in GPT-4's in-context learning setting.
Sounds exciting, but sadly no code, models, or datasets have been released.
Moreover, there is a lot of research on encoding 2D positional embeddings, and this work has not been benchmarked against it!
Paper: SpreadsheetLLM: Encoding Spreadsheets for Large Language Models (2407.09025)