Alibaba_Speech_Lab_SG PRO

alibabasglab

AI & ML interests

speech enhancement, separation, and codec

Recent Activity

Organizations

Alibaba-PAI

alibabasglab's activity

reacted to their post with 🤝❤️🔥 about 23 hours ago
reacted to prithivMLmods's post with 🔥 1 day ago
ChemQwen-vL [ Qwen for Chem Vision ] 🧑🏻‍🔬

🧪 Model: prithivMLmods/ChemQwen-vL

📝 ChemQwen-vL is a vision-language model fine-tuned from the Qwen2-VL-2B-Instruct model. It was trained on chemical compounds represented in the International Chemical Identifier (InChI) format and is optimized for chemical compound identification. The model excels at generating the InChI for a compound and describing the compound from its image. It operates as a multimodal image-text-to-text model. It was fine-tuned using datasets from: https://iupac.org/projects/
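For readers unfamiliar with InChI, a minimal sketch of how an identifier breaks into layers may help. The helper name `splitInchi` is hypothetical, and the example string is the standard InChI for ethanol. This is illustrative parsing only; it is not how ChemQwen-vL produces or validates its output, and it does not handle every InChI variant.

```javascript
// Illustrative only: split a standard InChI string into its main layers.
// `splitInchi` is a hypothetical helper, not part of ChemQwen-vL.
function splitInchi(inchi) {
  const prefix = "InChI=";
  if (!inchi.startsWith(prefix)) {
    throw new Error("not an InChI string");
  }
  // Layers are separated by "/"; here the first two are version and formula.
  const [version, formula, ...rest] = inchi.slice(prefix.length).split("/");
  const layers = { version, formula };
  for (const layer of rest) {
    // Each later layer starts with a one-letter tag, e.g. "c" (connectivity) or "h" (hydrogens).
    layers[layer[0]] = layer.slice(1);
  }
  return layers;
}

const ethanol = splitInchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3");
console.log(ethanol);
```

The `c` layer lists how the heavy atoms connect and the `h` layer lists where hydrogens attach, which is the information the model has to recover from a structure image.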

📒 Colab Demo: https://tinyurl.com/2pn8x6u7, Collection: https://tinyurl.com/2mt5bjju

Inference with documentation is possible with the help of the ReportLab library: https://pypi.org/project/reportlab/

🤗: @prithivMLmods
replied to prithivMLmods's post 1 day ago
reacted to m-ric's post with 👀 1 day ago
๐— ๐—ถ๐—ป๐—ถ๐— ๐—ฎ๐˜…'๐˜€ ๐—ป๐—ฒ๐˜„ ๐— ๐—ผ๐—˜ ๐—Ÿ๐—Ÿ๐—  ๐—ฟ๐—ฒ๐—ฎ๐—ฐ๐—ต๐—ฒ๐˜€ ๐—–๐—น๐—ฎ๐˜‚๐—ฑ๐—ฒ-๐—ฆ๐—ผ๐—ป๐—ป๐—ฒ๐˜ ๐—น๐—ฒ๐˜ƒ๐—ฒ๐—น ๐˜„๐—ถ๐˜๐—ต ๐Ÿฐ๐—  ๐˜๐—ผ๐—ธ๐—ฒ๐—ป๐˜€ ๐—ฐ๐—ผ๐—ป๐˜๐—ฒ๐˜…๐˜ ๐—น๐—ฒ๐—ป๐—ด๐˜๐—ต ๐Ÿ’ฅ

This work from Chinese startup @MiniMax-AI introduces a novel architecture that achieves state-of-the-art performance while handling context windows up to 4 million tokens - roughly 20x longer than current models. The key was combining lightning attention, mixture of experts (MoE), and a careful hybrid approach.

Key insights:

🏗️ MoE with novel hybrid attention:
‣ Mixture of Experts with 456B total parameters (45.9B activated per token)
‣ Combines Lightning attention (linear complexity) for most layers and traditional softmax attention every 8 layers
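The layer pattern above can be sketched as a schedule. This is an illustration under stated assumptions: the 16-layer depth is arbitrary, and placing the softmax layer last in each group of 8 is just one reading of "every 8 layers". The activated fraction uses the parameter counts quoted in the post.

```javascript
// Sketch: assign an attention type per layer, with softmax once every `period` layers.
// Assumes the softmax layer closes each block of `period` layers (hypothetical placement).
function attentionSchedule(numLayers, period = 8) {
  return Array.from({ length: numLayers }, (_, i) =>
    (i + 1) % period === 0 ? "softmax" : "lightning"
  );
}

const schedule = attentionSchedule(16); // 16 layers is an arbitrary example depth
console.log(schedule.join(" "));

// Share of parameters active per token, from the figures in the post.
const activeFraction = 45.9 / 456;
console.log(`${(activeFraction * 100).toFixed(1)}% of parameters active per token`);
```

With these numbers, only about a tenth of the parameters take part in any one token's forward pass, and seven out of every eight layers use the linear-complexity attention.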

๐Ÿ† Outperforms leading models across benchmarks while offering vastly longer context:
โ€ฃ Competitive with GPT-4/Claude-3.5-Sonnet on most tasks
โ€ฃ Can efficiently handle 4M token contexts (vs 256K for most other LLMs)

๐Ÿ”ฌ Technical innovations enable efficient scaling:
โ€ฃ Novel expert parallel and tensor parallel strategies cut communication overhead in half
โ€ฃ Improved linear attention sequence parallelism, multi-level padding and other optimizations achieve 75% GPU utilization (that's really high, generally utilization is around 50%)

๐ŸŽฏ Thorough training strategy:
โ€ฃ Careful data curation and quality control by using a smaller preliminary version of their LLM as a judge!

Overall, not only is the model impressive, but the technical paper is also really interesting! ๐Ÿ“
It has lots of insights including a great comparison showing how a 2B MoE (24B total) far outperforms a 7B model for the same amount of FLOPs.
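One way to read the equal-FLOPs comparison, taking the 2B figure as activated parameters: under the common rule of thumb that training costs about 6 FLOPs per active parameter per token (an approximation brought in here, not a figure from the post), a model with fewer active parameters can train on more tokens for the same budget. The budget below is an arbitrary illustrative number.

```javascript
// Rule-of-thumb training cost: C ≈ 6 * N_active * D (FLOPs). This is an approximation.
function tokensForBudget(flopsBudget, activeParams) {
  return flopsBudget / (6 * activeParams);
}

const budget = 1e21;     // hypothetical FLOPs budget
const moeActive = 2e9;   // MoE: 2B activated parameters (24B total)
const denseActive = 7e9; // dense 7B model: all parameters are active

const moeTokens = tokensForBudget(budget, moeActive);
const denseTokens = tokensForBudget(budget, denseActive);
console.log((moeTokens / denseTokens).toFixed(2)); // how many times more tokens the MoE sees
```

Under this approximation the MoE processes 3.5 times as many training tokens as the dense model for the same compute, while still drawing on a much larger total parameter pool.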

Read it in full here 👉 MiniMax-01: Scaling Foundation Models with Lightning Attention (2501.08313)
Model here, allows commercial use <100M monthly users 👉 MiniMaxAI/MiniMax-Text-01
reacted to Tonic's post with 🔥 1 day ago
🙋🏻‍♂️ Hey there folks,

Facebook AI just released JASCO models that make music stems.

You can try it out here: Tonic/audiocraft

Hope you like it!
reacted to ariG23498's post with 🚀 1 day ago
reacted to singhsidhukuldeep's post with 🚀 1 day ago
Breaking News: LinkedIn's Content Search Engine Gets a Powerful Semantic Upgrade!

Excited to share insights about LinkedIn's innovative approach to content search, recently detailed in a groundbreaking paper by their Mountain View team. This advancement represents a significant shift from traditional keyword-based search to semantic understanding.

>> Technical Architecture

The new search engine employs a sophisticated two-layer architecture:

Retrieval Layer
- Token Based Retriever (TBR) for exact keyword matching
- Embedding Based Retriever (EBR) using a two-tower model with multilingual-e5 embeddings
- Pre-computed post embeddings stored in a dedicated embedding store for efficient retrieval

Multi-Stage Ranking
- L1 Stage: Initial filtering using a lightweight model
- L2 Stage: Advanced ranking with complex features including:
  - Query-post semantic matching
  - Author reputation analysis
  - User engagement metrics
  - Content freshness evaluation
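The retrieval and ranking layers described above can be sketched on a toy corpus. Everything concrete here is an assumption for illustration: the three posts, the 3-dimensional embeddings, the L1 threshold, and the L2 weights are made up, and author reputation and freshness are left out for brevity. It is not LinkedIn's implementation.

```javascript
// Toy corpus; embeddings are made-up 3-d vectors standing in for precomputed post embeddings.
const posts = [
  { id: 1, text: "how to ask for a raise at work", emb: [0.9, 0.1, 0.0], engagement: 0.7 },
  { id: 2, text: "negotiating salary with your manager", emb: [0.8, 0.2, 0.1], engagement: 0.9 },
  { id: 3, text: "raise chickens in your backyard", emb: [0.0, 0.1, 0.9], engagement: 0.4 },
];

const cosine = (a, b) => {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
};

// Token-based retriever: posts containing every query token.
function tbr(queryTokens) {
  return posts.filter((p) => queryTokens.every((t) => p.text.split(" ").includes(t)));
}

// Embedding-based retriever: top-k posts by cosine similarity to the query embedding.
function ebr(queryEmb, k) {
  return [...posts]
    .sort((a, b) => cosine(queryEmb, b.emb) - cosine(queryEmb, a.emb))
    .slice(0, k);
}

function search(queryTokens, queryEmb) {
  // Retrieval layer: union of both retrievers, deduplicated by id.
  const candidates = new Map();
  for (const p of [...tbr(queryTokens), ...ebr(queryEmb, 2)]) candidates.set(p.id, p);
  // L1: lightweight filter on semantic similarity (threshold is hypothetical).
  const l1 = [...candidates.values()].filter((p) => cosine(queryEmb, p.emb) > 0.3);
  // L2: richer score mixing semantic match and engagement (weights are hypothetical).
  const score = (p) => 0.7 * cosine(queryEmb, p.emb) + 0.3 * p.engagement;
  return l1.sort((a, b) => score(b) - score(a)).map((p) => p.id);
}

const results = search(["raise"], [0.85, 0.15, 0.05]);
console.log(results);
```

In this toy run the keyword retriever pulls in the off-topic chicken post, the L1 filter drops it, and the salary post ends up ranked first despite never containing the word "raise", which is the behaviour the semantic layer is meant to add.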

>> Performance Improvements

The system has achieved remarkable results:
- 10%+ improvement in both on-topic rate and long-dwell metrics
- Enhanced ability to handle complex natural language queries
- Significant boost in sitewide engagement

This advancement enables LinkedIn to better serve complex queries like "how to ask for a raise?" while maintaining high performance at scale. The system intelligently balances between exact keyword matching and semantic understanding, ensuring optimal results for both navigational and conceptual searches.

What impresses me most is how the team solved the scale challenge - processing billions of posts efficiently using pre-computed embeddings and approximate nearest neighbor search. This is enterprise-scale AI at its finest.
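The post does not say which approximate nearest neighbor method is used. As one illustrative option, the sketch below uses sign-based locality-sensitive hashing: vectors are bucketed by which side of a few hyperplanes they fall on, so a query is compared only against its own bucket. The hyperplanes, items, and embeddings are all hypothetical.

```javascript
// Fixed hyperplanes (hypothetical); real systems typically draw many at random.
const planes = [
  [1, -1, 0],
  [0, 1, -1],
];

const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);

// Hash a vector to a bit string: one bit per hyperplane, by side of the plane.
const lshKey = (v) => planes.map((p) => (dot(p, v) >= 0 ? "1" : "0")).join("");

// Index: bucket precomputed embeddings by their hash key.
function buildIndex(items) {
  const buckets = new Map();
  for (const item of items) {
    const key = lshKey(item.emb);
    if (!buckets.has(key)) buckets.set(key, []);
    buckets.get(key).push(item);
  }
  return buckets;
}

// Query: score only the items in the query's bucket, not the whole corpus.
function annSearch(buckets, queryEmb) {
  const bucket = buckets.get(lshKey(queryEmb)) || [];
  return [...bucket].sort((a, b) => dot(queryEmb, b.emb) - dot(queryEmb, a.emb));
}

const items = [
  { id: "a", emb: [0.9, 0.1, 0.0] },
  { id: "b", emb: [0.8, 0.2, 0.1] },
  { id: "c", emb: [0.0, 0.1, 0.9] },
];
const index = buildIndex(items);
const hits = annSearch(index, [0.85, 0.15, 0.05]).map((x) => x.id);
console.log(hits);
```

The trade-off is the usual one for approximate search: a true neighbor that happens to fall in a different bucket is missed, in exchange for not scanning every stored embedding.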
reacted to Xenova's post with 🔥 1 day ago
Introducing Kokoro.js, a new JavaScript library for running Kokoro TTS, an 82 million parameter text-to-speech model, 100% locally in the browser w/ WASM. Powered by 🤗 Transformers.js. WebGPU support coming soon!
👉 npm i kokoro-js 👈

Try it out yourself: webml-community/kokoro-web
Link to models/samples: onnx-community/Kokoro-82M-ONNX

You can get started in just a few lines of code!
import { KokoroTTS } from "kokoro-js";

// Load the ONNX model; dtype selects the quantization level.
const tts = await KokoroTTS.from_pretrained(
  "onnx-community/Kokoro-82M-ONNX",
  { dtype: "q8" }, // fp32, fp16, q8, q4, q4f16
);

// Synthesize speech for the text with the chosen voice, then write it to a WAV file.
const text = "Life is like a box of chocolates. You never know what you're gonna get.";
const audio = await tts.generate(text,
  { voice: "af_sky" }, // See `tts.list_voices()`
);
audio.save("audio.wav");

Huge kudos to the Kokoro TTS community, especially taylorchu for the ONNX exports and Hexgrad for the amazing project! None of this would be possible without you all! 🤗

The model is also extremely resilient to quantization. The smallest variant is only 86 MB in size (down from the original 326 MB), with no noticeable difference in audio quality! 🤯
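A quick back-of-the-envelope check of those sizes, as a sketch: it treats 1 MB as 10^6 bytes and ignores any non-weight data in the files, both of which are simplifying assumptions. The post does not say which dtype the 86 MB file uses.

```javascript
const params = 82e6;          // 82 million parameters (from the post)
const fp32Bytes = params * 4; // 4 bytes per fp32 weight
console.log((fp32Bytes / 1e6).toFixed(0) + " MB at fp32 (weights only)");

const smallestMB = 86;  // smallest variant size quoted in the post
const originalMB = 326; // original size quoted in the post
const bitsPerParam = (smallestMB * 1e6 * 8) / params;
console.log(bitsPerParam.toFixed(1) + " bits per parameter");
console.log((originalMB / smallestMB).toFixed(1) + "x smaller");
```

The fp32 estimate lands close to the quoted 326 MB, and the smallest file works out to a little over 8 bits per parameter, which is consistent with roughly 8-bit weights.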
reacted to AdinaY's post with 👍 1 day ago
reacted to RudeBoi's post with 👀 1 day ago
Can someone please explain to me why I am getting this error message? Please see the attached. I subscribe to the pro account and I am still getting this error message. Thanks.
reacted to nroggendorff's post with ❤️ 1 day ago
maybe a page where you can find open orgs to get started collaborating with hf. i see so many people that don't have a direction.

i don't have ulterior motives, so don't ask
reacted to their post with 👍🚀 1 day ago
reacted to prithivMLmods's post with 🚀 1 day ago
reacted to AdinaY's post with 🔥 1 day ago
posted an update 1 day ago
posted an update 4 days ago
We are thrilled to present the improved "ClearerVoice-Studio", an open-source platform designed to make speech processing easy to use for everyone! Whether you're working on speech enhancement, speech separation, speech super-resolution, or target speaker extraction, this unified platform has you covered.

**Why Choose ClearerVoice-Studio?**

- Pre-Trained Models: Includes cutting-edge pre-trained models, fine-tuned on extensive, high-quality datasets. No need to start from scratch!
- Ease of Use: Designed for seamless integration with your projects, offering a simple yet flexible interface for inference and training.

**Where to Find Us?**

- GitHub Repository: ClearerVoice-Studio (https://github.com/modelscope/ClearerVoice-Studio)
- Try Our Demo: Hugging Face Space (alibabasglab/ClearVoice)

**What Can You Do with ClearerVoice-Studio?**

- Enhance noisy speech recordings to achieve crystal-clear quality.
- Separate speech from complex audio mixtures with ease.
- Transform low-resolution audio into high-resolution audio. A fully upscaled LJSpeech-1.1-48kHz dataset can be downloaded from alibabasglab/LJSpeech-1.1-48kHz.
- Extract target speaker voices with precision using audio-visual models.

**Join Us in Growing ClearerVoice-Studio!**

We believe in the power of open-source collaboration. By starring our GitHub repository and sharing ClearerVoice-Studio with your network, you can help us grow this community-driven platform.

**Support us by:**

- Starring it on GitHub.
- Exploring and contributing to our codebase.
- Sharing your feedback and use cases to make the platform even better.
- Joining our community discussions to exchange ideas and innovations.

Together, let's push the boundaries of speech processing! Thank you for your support! 💖
reacted to their post with 👍 5 days ago
🎉 ClearerVoice-Studio New Feature: Speech Super-Resolution with MossFormer2! 🚀
We're excited to announce that ClearerVoice-Studio now supports speech super-resolution, powered by our latest MossFormer2-based model!
What's New?

🔊 Convert Low-Resolution to High-Resolution Audio:
Transform low-resolution audio (effective sampling rate ≥ 16 kHz) into crystal-clear, high-resolution audio at 48 kHz.
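To make the sampling-rate change concrete: a 16 kHz signal can only represent content up to 8 kHz (its Nyquist limit), while 48 kHz audio can represent content up to 24 kHz. The sketch below is a naive linear-interpolation baseline, not the MossFormer2 model; the function name `upsampleLinear` and the sample values are hypothetical. It shows the threefold increase in sample count, but interpolation alone adds no new high-frequency content, which is the part a learned model has to generate.

```javascript
// Naive baseline: upsample by an integer factor with linear interpolation.
// This adds samples but no new high-frequency content, unlike a learned model.
function upsampleLinear(samples, factor) {
  const out = [];
  for (let i = 0; i < samples.length - 1; i++) {
    for (let k = 0; k < factor; k++) {
      // Step from samples[i] toward samples[i + 1] in `factor` equal increments.
      out.push(samples[i] + ((samples[i + 1] - samples[i]) * k) / factor);
    }
  }
  out.push(samples[samples.length - 1]); // keep the final sample
  return out;
}

const inputRate = 16000;
const outputRate = 48000;
const factor = outputRate / inputRate; // 3
console.log(`Nyquist limit: ${inputRate / 2} Hz -> ${outputRate / 2} Hz`);

const upsampled = upsampleLinear([0, 3, 6], factor);
console.log(upsampled); // [0, 1, 2, 3, 4, 5, 6]
```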

🤖 Cutting-Edge Technology:
Leverages the MossFormer2 model plus HiFi-GAN, optimised for generating high-quality audio with enhanced perceptual clarity.

🎧 Enhanced Listening Experience:
Perfect for speech enhancement, content restoration, and high-fidelity audio applications.

🌟 Try It Out!
Upgrade to the latest version of ClearerVoice-Studio (https://github.com/modelscope/ClearerVoice-Studio) to experience this powerful feature. Check out the updated documentation and examples in our repository.

Let us know your thoughts, feedback, or feature requests in the Issues section.