Did not know text-splitter yet, thanks!
This bold claim is not my opinion; it was made in a recent "report" by a group whose stance is recognizable in its name, which roughly translates to "Authors' Rights Initiative". They published a report that, according to the LinkedIn post below, was also presented before the EU Parliament.
I am not particularly interested in politics, but as an EU citizen I am of course interested in a reasonable and practical version of the EU AI Act. I am not saying there should be no rules around data and AI, but this report is obviously very biased towards one side.
While I think the report itself does not deserve attention, I am posting it in the hope that you will find more examples where it does not address the issues adequately. Feel free to comment on my LinkedIn post (where the original authors will see it) or here.
[en] Executive summary: https://urheber.info/media/pages/diskurs/ai-training-is-copyright-infringement/3b900058e6-1725460935/executive-summary_engl_final_29-08-2024.pdf
[de] Full report: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4946214
LinkedIn: https://www.linkedin.com/posts/activity-7238912869268959232-6cFx
Liger Kernel "is a [Hugging Face-compatible] collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduces memory usage by 60%."
GitHub: https://github.com/linkedin/Liger-Kernel
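For context, here is a minimal sketch of how one might enable these kernels in a Hugging Face training setup. The patching function and the kernel list follow the Liger-Kernel README; the checkpoint name is only an example, and a CUDA GPU is assumed:

```python
# A minimal sketch, assuming `pip install liger-kernel`, a CUDA GPU,
# and a Llama-style checkpoint (the model id below is just an example).
import torch
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_llama

# Patch the Llama modeling code with Liger's Triton kernels
# (e.g. RMSNorm, RoPE, SwiGLU, fused cross-entropy) before loading the model.
apply_liger_kernel_to_llama()

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# ...then continue with the usual Trainer/accelerate training loop.
```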
Apparently, some papers from ACL 2024 are still not listed in the ACL Anthology. While this issue will hopefully be fixed soon, we should give those papers some additional spotlight.
Some of my favorites:
1. Dolma is an English corpus that encompasses 3 trillion tokens. Additionally, it is accompanied by an exceptional software package that considerably advances the state of the art in preparing data for LLM pretraining. (Source: I am currently using Dolma; see the short streaming sketch at the end of this post.)
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research (2402.00159)
2. In the paper "Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models", the authors show how extending the input length impacts an LLM's reasoning performance. I asked myself a similar question a few months ago, which makes this paper highly interesting to me.
Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models (2402.14848)
This was brought to my attention through a LinkedIn post by @ShayeghB, who is also affected:
Ensemble-Based Unsupervised Discontinuous Constituency Parsing by Tree Averaging (2403.00143)
View all the missing papers here:
https://theshayegh.github.io/ACL2024MissingPapers/
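As mentioned above for Dolma, here is a small sketch of how one might inspect the corpus without downloading all 3 trillion tokens. The dataset id allenai/dolma and the "text" field are assumptions based on the Hub dataset card, and access may require accepting the license there:

```python
# A minimal sketch: stream a few Dolma documents via the `datasets` library.
# Assumptions: `pip install datasets`, the Hub dataset id "allenai/dolma",
# and a "text" field per document; gated access may require logging in first.
# A config/version name may also be required, e.g. load_dataset("allenai/dolma", "v1_7", ...).
from datasets import load_dataset

dolma = load_dataset("allenai/dolma", split="train", streaming=True)

# Streaming avoids materializing the multi-terabyte corpus on disk.
for i, doc in enumerate(dolma):
    print(doc["text"][:200])
    if i >= 2:
        break
```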
I want to start by expressing my appreciation for the incredible work Hugging Face has done for the open-source community. Your contributions have been invaluable, and I'm grateful for the tools and resources you've provided.
Please take the following as constructive feedback. I wouldn't have mentioned these points if you hadn't asked, and I hope they can be seen as suggestions for further improvement.
Software quality: When I first started using transformers, I was thoroughly impressed. The basic "hello world" examples work wonderfully, making the initial experience smooth and enjoyable. Nowadays, however, I regularly dive deeper into the library, and I just as regularly face challenges such as long-standing bugs, undocumented issues, missing API documentation, and occasionally broken functionality. I am only guessing here, but I think the majority of these repos are written by research engineers or researchers, whose focus might be more on methodological correctness (which is of course crucial as well). That said, it might be helpful to include someone who is stronger in software development and less knowledgeable in ML. Such a person would be the first to complain about "clean code" issues, and also the first to notice problems with the software.
Posts: Great feature! However, it could be enhanced by adding basic text formatting options. This would make posts more visually appealing and easier to read.
Papers: Restricting this to arXiv is too limiting. While I understand the rationale in terms of implementation effort, if the goal is to be the "GitHub of ML/AI," it might be worth considering support for at least the high-ranking conferences (or a subset thereof). In many cases, the conference version of a paper supersedes the arXiv version, and this restriction may inadvertently encourage the use of preprints over the finalized versions.
Again, these are just my personal pain points, and I'm sharing them with the intention of helping Hugging Face continue to improve.
The new release contains some minor bugfixes. Check it out!
GitHub: https://github.com/webis-de/small-text
Paper: Small-Text: Active Learning for Text Classification in Python (2107.10314)
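For anyone new to the library, a minimal pool-based active learning loop might look like the following. This is a sketch against the small-text 1.x API with a toy dataset; the model and query strategy choices are placeholders:

```python
# A minimal sketch of a pool-based active learning loop with small-text 1.x.
# Assumptions: `pip install small-text scikit-learn` and a toy dataset;
# in a real setting the oracle labels below come from a human annotator.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from small_text import (
    ConfidenceEnhancedLinearSVC,
    PoolBasedActiveLearner,
    PredictionEntropy,
    SklearnClassifierFactory,
    SklearnDataset,
    random_initialization_balanced,
)

texts = ["good movie", "bad movie", "great plot", "terrible acting"] * 25
labels = np.array([1, 0, 1, 0] * 25)

x = TfidfVectorizer().fit_transform(texts)
train = SklearnDataset(x, labels)

clf_factory = SklearnClassifierFactory(ConfidenceEnhancedLinearSVC(), num_classes=2)
learner = PoolBasedActiveLearner(clf_factory, PredictionEntropy(), train)

# Simulate the initial labeling step, then query/update for a few rounds.
indices_initial = random_initialization_balanced(labels, n_samples=10)
learner.initialize_data(indices_initial, labels[indices_initial])

for _ in range(3):
    queried = learner.query(num_samples=5)
    learner.update(labels[queried])  # replace with real annotations in practice
```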
The new version provides a small-text-compatible implementation of the recent AnchorAL strategy by @pietrolesci.
GitHub: https://github.com/webis-de/small-text
Paper: https://aclanthology.org/2023.eacl-demo.11/
AnchorAL: Computationally Efficient Active Learning for Large and Imbalanced Datasets (2404.05623)
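Conceptually, AnchorAL speeds up each query round by retrieving a small subsample of the pool around class "anchor" examples and running the base query strategy only on that subsample, which keeps active learning tractable on large, imbalanced datasets. A rough sketch of how plugging it in might look; note that the class name `AnchorSubsampling`, its import path, and its parameters are assumptions on my part, so please check the small-text documentation for the exact API:

```python
# A rough sketch only: wrap a base query strategy with AnchorAL-style subsampling.
# `AnchorSubsampling`, its import path, and its arguments are assumptions;
# consult the small-text documentation for the exact class name and signature.
from small_text import PredictionEntropy
from small_text.query_strategies import AnchorSubsampling  # assumed import path

query_strategy = AnchorSubsampling(
    PredictionEntropy(),  # base strategy, applied only to the anchored subsample
    subsample_size=500,   # assumed parameter: size of the subsampled pool
)
# Pass `query_strategy` to a PoolBasedActiveLearner as in a standard setup.
```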