AI & ML interests

🚀 Creators of Spaces with the most cumulative all-time likes at the end of each month (users only, no orgs)

multimodalart 
posted an update 10 days ago
Self-Forcing, a real-time video model by @adobe distilled from Wan 2.1, is out, and they open-sourced it 🐐

I've built a live, real-time demo on Spaces 📹💨

multimodalart/self-forcing
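
If you'd rather poke at the Space programmatically than through the UI, a minimal sketch with gradio_client looks roughly like this. Only the Space id comes from the post; the endpoint name and arguments are assumptions you would confirm with view_api() first.

```python
# Minimal sketch: inspect and call the Self-Forcing Space via gradio_client.
# Only the Space id comes from the post above; the endpoint name and its
# arguments are hypothetical - run view_api() first to see the real signature.
from gradio_client import Client

client = Client("multimodalart/self-forcing")

# Print the Space's actual API (endpoint names, parameters, return types).
client.view_api()

# Hypothetical call - replace api_name/kwargs with what view_api() reports.
# result = client.predict("a corgi surfing a wave", api_name="/generate")
# print(result)
```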
jbilcke-hf 
posted an update 22 days ago
Hi everyone,

I've seen some unsuccessful attempts at running Wan2GP inside a Hugging Face Space, which is a shame as it is a great Gradio app!

So here is a fork that you can use, with some instructions on how to do this:

jbilcke-hf/Wan2GP_you_must_clone_this_space_to_use_it#1

Note: some things like persistent models/storage/custom LoRAs might not work fully out of the box. If you need those, you might have to dig into the Wan2GP codebase and see how to tweak the storage folder. Happy hacking!
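
For the storage part, here is a minimal sketch of the kind of tweak involved, assuming your Space has the persistent storage add-on (which Hugging Face mounts at /data). The OUTPUT_DIR name is hypothetical; the real setting to change lives somewhere in the Wan2GP code or config.

```python
# Sketch: point caches and outputs at the Space's persistent volume (/data),
# which Hugging Face mounts when the persistent storage add-on is enabled.
# OUTPUT_DIR is a hypothetical name - the actual setting is in the Wan2GP codebase.
import os

os.environ["HF_HOME"] = "/data/huggingface"  # model downloads survive restarts
OUTPUT_DIR = "/data/outputs"                 # generated videos / custom LoRAs
os.makedirs(OUTPUT_DIR, exist_ok=True)
```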

fffiloni 
posted an update 5 months ago
Explain like I'm 5: the latest take from @thomwolf on X about Dario's essay on DeepSeek:

→ Open-source AI is like a big cookbook that everyone can read and improve. Instead of a few chefs keeping their recipes secret, anyone can cook, test, and invent new things.

If only one company controls AI, everything stops if they have a problem—like when the internet goes down. With open-source, many people can help, making sure it keeps running smoothly.

AI isn’t just a race between two countries; it’s a team effort around the world. By sharing, we move faster and create safer technology for everyone.

🤗
akhaliq 
posted an update 6 months ago
Google drops Gemini 2.0 Flash Thinking

a new experimental model that unlocks stronger reasoning capabilities and shows its thoughts. The model plans (with its thoughts visible), can solve complex problems at Flash speeds, and more

now available in anychat, try it out: https://huggingface.co/spaces/akhaliq/anychat
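
If you want to call it directly rather than through anychat, a minimal sketch with the google-generativeai SDK looks roughly like this; the experimental model id string is an assumption (these ids rotate), so check Google AI Studio for the current one.

```python
# Minimal sketch: call Gemini 2.0 Flash Thinking via the google-generativeai SDK.
# The model id is an assumption (experimental ids change); check Google AI Studio.
import google.generativeai as genai

genai.configure(api_key="YOUR_GOOGLE_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-2.0-flash-thinking-exp")

response = model.generate_content(
    "A farmer has 17 sheep; all but 9 run away. How many are left?"
)
# Depending on SDK version, the model's thoughts may appear as a separate part
# in response.candidates[0].content.parts rather than in response.text.
print(response.text)
```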
jbilcke-hf 
posted an update 6 months ago
Doing some testing with HunyuanVideo on the Hugging Face Inference Endpoints 🤗

prompt: "a Shiba Inu is acting as a DJ, he wears sunglasses and is mixing and scratching with vinyl discs at a Ibiza sunny sand beach party"

1280x720, 22 steps, 121 frames

There are still some things to iron out regarding speed and memory usage; right now it takes 20 min on an A100 (see attached charts)

but you can check it out here:

https://huggingface.co/jbilcke-hf/HunyuanVideo-for-InferenceEndpoints

There are various things I want to try, like the 100% diffusers version and other models (LTX-Video, ...)
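
For reference, calling a deployed endpoint like this usually boils down to a POST with the generation parameters. The payload below mirrors the settings from the post, but the exact field names depend on the custom handler in the repo, and the endpoint URL and token are placeholders, so treat it as a sketch.

```python
# Sketch: call a HunyuanVideo Inference Endpoint with the settings from the post.
# The URL/token are placeholders and the payload schema depends on the custom
# handler.py in jbilcke-hf/HunyuanVideo-for-InferenceEndpoints - adjust to match.
import requests

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
HF_TOKEN = "hf_..."  # placeholder

payload = {
    "inputs": "a Shiba Inu is acting as a DJ, he wears sunglasses and is mixing "
              "and scratching with vinyl discs at a Ibiza sunny sand beach party",
    "parameters": {
        "width": 1280,
        "height": 720,
        "num_inference_steps": 22,
        "num_frames": 121,
    },
}

resp = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {HF_TOKEN}"},
    json=payload,
    timeout=1800,  # the post reports ~20 min on an A100
)
resp.raise_for_status()

# The handler may return raw video bytes or base64 - adjust accordingly.
with open("shiba_dj.mp4", "wb") as f:
    f.write(resp.content)
```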
fffiloni 
posted an update 9 months ago
Visionary Walter Murch (editor for Francis Ford Coppola), in 1999:

“ So let's suppose a technical apotheosis some time in the middle of the 21st century, when it somehow becomes possible for one person to make an entire feature film, with virtual actors. Would this be a good thing?

If the history of oil painting is any guide, the broadest answer would be yes, with the obvious caution to keep a wary eye on the destabilizing effect of following too intently a hermetically personal vision. One need only look at the unraveling of painting or classical music in the 20th century to see the risks.

Let's go even further, and force the issue to its ultimate conclusion by supposing the diabolical invention of a black box that could directly convert a single person's thoughts into a viewable cinematic reality. You would attach a series of electrodes to various points on your skull and simply think the film into existence.

And since we are time-traveling, let us present this hypothetical invention as a Faustian bargain to the future filmmakers of the 21st century. If this box were offered by some mysterious cloaked figure in exchange for your eternal soul, would you take it?

The kind of filmmakers who would accept, even leap, at the offer are driven by the desire to see their own vision on screen in as pure a form as possible. They accept present levels of collaboration as the evil necessary to achieve this vision. Alfred Hitchcock, I imagine, would be one of them, judging from his description of the creative process: "The film is already made in my head before we start shooting."”

Read "A Digital Cinema of the Mind? Could Be" by Walter Murch: https://archive.nytimes.com/www.nytimes.com/library/film/050299future-film.html

jbilcke-hf 
posted an update 12 months ago
I am working to add automatic speech bubbles to the AI Comic Factory!

This is not finished yet - there are a few things to tweak - but so far results are pretty promising!

(it's available through a "beta" toggle)
akhaliq 
posted an update about 1 year ago
Phased Consistency Model

Phased Consistency Model (2405.18407)

The consistency model (CM) has recently made significant progress in accelerating the generation of diffusion models. However, its application to high-resolution, text-conditioned image generation in the latent space (a.k.a., LCM) remains unsatisfactory. In this paper, we identify three key flaws in the current design of LCM. We investigate the reasons behind these limitations and propose the Phased Consistency Model (PCM), which generalizes the design space and addresses all identified limitations. Our evaluations demonstrate that PCM significantly outperforms LCM across 1-16 step generation settings. While PCM is specifically designed for multi-step refinement, it achieves 1-step generation results superior or comparable to those of previous state-of-the-art methods designed specifically for 1-step generation. Furthermore, we show that PCM's methodology is versatile and applicable to video generation, enabling us to train a state-of-the-art few-step text-to-video generator.
fffiloni 
posted an update about 1 year ago
🇫🇷
What impact is AI having on the film, audiovisual, and video game industries?
A forward-looking study for industry professionals
— CNC & BearingPoint | 09/04/2024

While Artificial Intelligence (AI) has long been used in the film, audiovisual, and video game sectors, the new applications of generative AI are upending our view of what a machine is capable of and carry an unprecedented potential for transformation. The quality of what they produce is impressive and, as a result, sparks many debates, between expectations and apprehensions.

The CNC has therefore decided to launch a new AI Observatory in order to better understand how AI is used and its real impact on the image industry. As part of this Observatory, the CNC wanted to draw up a first assessment by mapping the current and potential uses of AI at each stage of the creation and distribution of a work, identifying the associated opportunities and risks, particularly in terms of professions and employment. This CNC / BearingPoint study presented its main findings on March 6, during the CNC event "Creating, producing, distributing in the age of artificial intelligence."

The CNC is publishing the expanded version of its mapping of AI uses in the film, audiovisual, and video game industries.

Link to the full mapping: https://www.cnc.fr/documents/36995/2097582/Cartographie+des+usages+IA_rapport+complet.pdf/96532829-747e-b85e-c74b-af313072cab7?t=1712309387891
akhaliq 
posted an update about 1 year ago
Chameleon

Mixed-Modal Early-Fusion Foundation Models

Chameleon: Mixed-Modal Early-Fusion Foundation Models (2405.09818)

We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents.
multimodalart 
posted an update about 1 year ago
The first open Stable Diffusion 3-like architecture model is JUST out 💣 - but it is not SD3! 🤔

It is Tencent-Hunyuan/HunyuanDiT by Tencent, a 1.5B parameter DiT (diffusion transformer) text-to-image model 🖼️✨, trained with multilingual CLIP + multilingual T5 text encoders for English 🤝 Chinese understanding

Try it out by yourself here ▶️ https://huggingface.co/spaces/multimodalart/HunyuanDiT
(a bit too slow as the model is chunky and the research code isn't super optimized for inference speed yet)

In the paper, they claim it is the SOTA open-source model based on human preference evaluation!
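
If you'd rather run it locally than wait on the Space, here is a minimal sketch with diffusers, assuming the diffusers-format checkpoint Tencent-Hunyuan/HunyuanDiT-Diffusers and a CUDA GPU with enough VRAM for the DiT plus its two text encoders.

```python
# Minimal sketch: run HunyuanDiT locally via diffusers instead of the Space.
# Assumes the diffusers-format checkpoint Tencent-Hunyuan/HunyuanDiT-Diffusers
# and a CUDA GPU; prompts can be English or Chinese thanks to the bilingual encoders.
import torch
from diffusers import HunyuanDiTPipeline

pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
).to("cuda")

image = pipe(prompt="一只戴着墨镜的柴犬在海滩上打碟").images[0]  # "a Shiba Inu DJ in sunglasses on a beach"
image.save("hunyuan_dit_sample.png")
```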