Teaching a 7B Model to Be Just the Right Amount of Snark
Ever wondered if a language model could get sarcasm? I fine-tuned Mistral-7B with LoRA and 4-bit quantisation on just ~720 hand-picked sarcastic prompt–response pairs from Reddit, Twitter, and real-life conversations.
The challenge? Keeping it sarcastic but still helpful.
LoRA rank 16 to avoid overfitting
4-bit NF4 quantisation to fit in limited GPU memory
10 carefully monitored epochs so it didn’t turn into a full-time comedian (rough config sketch below)
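In code, that setup boils down to roughly the following. It's a sketch using peft and bitsandbytes; the dataset handling and any hyperparameters not listed above (alpha, dropout, target modules) are assumptions, not the exact values I used.

```python
# Illustrative sketch: 4-bit NF4 quantisation plus a rank-16 LoRA adapter on
# Mistral-7B-Instruct. Values marked "assumed" are not from the post.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NF4 so a 7B model fits in limited GPU memory
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                   # low rank to limit capacity and overfitting
    lora_alpha=32,                          # assumed scaling factor
    lora_dropout=0.05,                      # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention-only targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # sanity check: only the adapters train
# ...then train for 10 epochs on the ~720 pairs with your preferred Trainer/SFTTrainer.
```

The low rank keeps the trainable parameter count tiny, which is what lets ~720 examples steer the tone without wrecking the base model's helpfulness.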
Result: a model that understands “Oh great, another meeting” exactly as you mean it.
Read the full journey, tech details, and lessons learned on my blog: Fine-Tuning Mistral-7B for Sarcasm with LoRA and 4-Bit Quantisation
Try the model here on Hugging Face: sweatSmile/Mistral-7B-Instruct-v0.1-Sarcasm.
The task is: "Navigate to {random_url} and play the game until you reach a score of 5/5." Each task is set up by having Claude generate a random app from a predefined list of prompts (multiple-choice trivia, form filling, or color matching).
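If you're curious how such a setup might be wired, here's a hypothetical sketch. The Anthropic model id, prompt wording, and hosting step are assumptions, and {random_url} stays a placeholder as in the quote:

```python
# Hypothetical task setup: pick a random app prompt, have Claude generate the app,
# then phrase the navigation task around wherever the app gets hosted.
import random
import anthropic

APP_PROMPTS = [
    "a multiple-choice trivia quiz",
    "a form-filling exercise",
    "a color-matching game",
]

client = anthropic.Anthropic()
app_prompt = random.choice(APP_PROMPTS)

message = client.messages.create(
    model="claude-sonnet-4-20250514",   # assumed model id
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": f"Generate a self-contained HTML/JS app: {app_prompt}. "
                   "The player wins at a score of 5/5.",
    }],
)
app_html = message.content[0].text      # serve this somewhere to obtain random_url

task = "Navigate to {random_url} and play the game until you reach a score of 5/5."
```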
Qwen3 is the latest generation of the Qwen language models. It's smarter, faster, and now understands 119 languages instead of just 29. It can do both deep reasoning and quick answers with a single model, depending on what you need. The models range from small (0.6B) dense models up to a huge 235B mixture-of-experts, with smart ways to save compute. It's trained on 36 trillion tokens and fine-tuned in a four-stage pipeline to boost performance. Qwen3 performs as well as or better than many top models, including some from big companies. It's fully open-source under the Apache 2.0 licence. Amazing!
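The reasoning/quick-answer switch is exposed through the chat template. Here's a minimal sketch following the pattern shown on the Qwen3 model cards; the prompt and generation length are just placeholders:

```python
# Toggle Qwen3 between thinking and non-thinking modes with the same weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"   # smallest of the sizes mentioned above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Explain LoRA in one sentence."}]

# enable_thinking=True -> deep reasoning traces; False -> quick direct answers.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```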
Announcing Artificial Analysis Long Context Reasoning (AA-LCR), a new benchmark that evaluates long-context performance by testing reasoning across multiple long documents (~100k tokens).
The focus of AA-LCR is to replicate real knowledge work and reasoning tasks, testing capabilities critical to modern AI applications spanning document analysis, codebase understanding, and complex multi-step workflows.
AA-LCR consists of 100 hard text-based questions that require reasoning across multiple real-world documents totalling ~100k input tokens. Questions are designed so answers cannot be looked up directly but must be reasoned from multiple information sources, with human testing verifying that each question requires genuine inference rather than retrieval.
Key takeaways:
➤ Today’s leading models achieve ~70% accuracy: the top three places go to OpenAI o3 (69%), xAI Grok 4 (68%) and Qwen3 235B 2507 Thinking (67%)
➤👀 We already have gpt-oss results! 120B performs close to o4-mini (high), in line with OpenAI’s claims about model performance. We will be following up shortly with an Intelligence Index score for the models.
➤ 100 hard text-based questions spanning 7 categories of documents (Company Reports, Industry Reports, Government Consultations, Academia, Legal, Marketing Materials and Survey Reports)
➤ ~100k tokens of input per question, requiring models to support a minimum 128K context window to score on this benchmark
➤ ~3M total unique input tokens spanning ~230 documents to run the benchmark (output tokens typically vary by model)
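For intuition, a single AA-LCR-style question amounts to stuffing several long documents plus one question into one prompt and asking the model to reason over all of them at once. The sketch below is a hypothetical harness, not Artificial Analysis' actual evaluation code; file names, the model choice, and the prompt wording are assumptions:

```python
# Hypothetical harness for one long-context reasoning question (~100k input tokens).
from pathlib import Path
from openai import OpenAI

client = OpenAI()

doc_paths = ["company_report.txt", "industry_report.txt", "survey_report.txt"]
context = "\n\n---\n\n".join(
    f"Document {i + 1} ({p}):\n{Path(p).read_text()}" for i, p in enumerate(doc_paths)
)

question = "Across these documents, which business segment grew fastest year over year, and why?"

# The model needs at least a 128K context window to see all the documents at once.
response = client.chat.completions.create(
    model="o3",
    messages=[{"role": "user", "content": f"{context}\n\nQuestion: {question}"}],
)
print(response.choices[0].message.content)
```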
We’re adding AA-LCR to the Artificial Analysis Intelligence Index, and taking the version number to v2.2. Artificial Analysis Intelligence Index v2.2 now includes: MMLU-Pro, GPQA Diamond, AIME 2025, IFBench, LiveCodeBench, SciCode and AA-LCR.
Our two new papers from SJTU & Huawei: powered by DeepSeek-V3, we've achieved a new SOTA on the SWE-bench benchmark!
We introduce two innovative approaches:
⚔️ SWE-Debate: AI agents compete and "debate" to generate the best code fix.
🧠 SWE-Exp: An AI agent learns from past repair "experience" to solve new issues more efficiently.
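As a loose illustration of the "compete and debate" idea (not the papers' actual pipeline; the prompts, personas, and example issue below are made up), you can picture several debaters proposing patches and a judge arguing for the strongest one:

```python
# Simplified debate loop: multiple debater agents propose patches, a judge picks one.
# DeepSeek's API is OpenAI-compatible, hence the base_url below.
import os
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key=os.environ["DEEPSEEK_API_KEY"])

def propose_patch(issue: str, persona: str) -> str:
    """One 'debater' proposes a fix for the issue from its own perspective."""
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": f"You are a {persona}. Propose a concrete code patch."},
            {"role": "user", "content": issue},
        ],
    )
    return response.choices[0].message.content

def judge(issue: str, patches: list[str]) -> str:
    """A judge compares the competing patches and argues for the best one."""
    numbered = "\n\n".join(f"Patch {i + 1}:\n{p}" for i, p in enumerate(patches))
    prompt = f"Issue:\n{issue}\n\n{numbered}\n\nWhich patch best fixes the issue, and why?"
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

issue = "TypeError raised when parsing an empty config file."   # placeholder issue
patches = [propose_patch(issue, p) for p in ("cautious maintainer", "aggressive refactorer")]
print(judge(issue, patches))
```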