having trouble with auto train hello there this is the first time i am testing auto train with a 1.8k SFT dataset. Howevery i am not quite sure the training is going smooth. Logs seem quite confusing, token did not match can not auth, generates confusing train splits, do you know how i can check my running job properly? what is being used for training as data? any ideas?
Detect hallucinations in answers based on context and questions using ModernBERT with 8192-token context support!
### Model Details - **Model Name**: [lettucedect-large-modernbert-en-v1](KRLabsOrg/lettucedect-large-modernbert-en-v1) - **Organization**: [KRLabsOrg](https://huggingface.co/KRLabsOrg) - **Github**: [https://github.com/KRLabsOrg/LettuceDetect](https://github.com/KRLabsOrg/LettuceDetect) - **Architecture**: ModernBERT (Large) with extended context support up to 8192 tokens - **Task**: Token Classification / Hallucination Detection - **Training Dataset**: [RagTruth](wandb/RAGTruth-processed) - **Language**: English - **Capabilities**: Detects hallucinated spans in answers, provides confidence scores, and calculates average confidence across detected spans.
LettuceDetect excels at processing long documents to determine if an answer aligns with the provided context, making it a powerful tool for ensuring factual accuracy.
๐๐ปโโ๏ธHey there folks , Open LLM Europe just released Lucie 7B-Instruct model , a billingual instruct model trained on open data ! You can check out my unofficial demo here while we wait for the official inference api from the group : Tonic/Lucie-7B hope you like it ๐
if you can record the problem and share it there , or on the forums in your own post , please dont be shy because i'm not sure but i do think it helps ๐ค๐ค๐ค
boomers still pick zenodo.org instead of huggingface ??? absolutely clownish nonsense , my random datasets have 30x more downloads and views than front page zenodos ... gonna write a comparison blog , but yeah... cringe.
scroll down for the datasets, still figuring out how to optimize for discoverability , i do think on that part it will be better than zenodo[dot}org , it would be nice to write a tutorial about that and compare : we already have more downloads than most zenodo datasets from famous researchers !
perhaps the largest single training dataset of high quality text to date of 7.8 trillion tokens in 35 European languages and code.
the best part : the data was correctly licenced so it's actually future-proof!
the completions model is really creative and instruct fine tuned version is very good also.
now you can use such models for multi-lingual enterprise applications with further finetunes , long response generation, structured outputs (coding) also works.