Any plan to release a vision-enabled version with the same, or nearly the same, base and instruct model?
As you can tell, we all really enjoy the model and the way it sounds, since it doesn't come across as robotic as other SOTA models. Enabling full vision capability later on, in the form of an update/upgrade, would be a very welcome addition and would let it replace even more use cases, especially if you follow the same design philosophy you used to create this amazing model.
Thanks. We do plan to give Kimi K2 vision capability. We already have the relevant technical expertise (see kimi-vl on our homepage), but it will still take some time.
Will it still act like this model, or will it be retrained entirely and act differently?
You can refer to our kimi-vl report: https://arxiv.org/abs/2504.07491. The behavior should be similar, but it cannot be exactly the same.
Thank you for your interest in the vision-enabled version. We will ship it once it meets our expectations at this scale, and we hope that when it arrives, it will not let you down.
As for a prototype of a small vision-language model, please refer to Kimi-VL-A3B: https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking-2506.
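In case it helps anyone who wants to try the prototype before K2 gains vision, here is a minimal sketch of loading Kimi-VL-A3B-Thinking-2506 with the generic transformers `AutoModelForCausalLM`/`AutoProcessor` pattern. This is an assumption-laden sketch, not the official quick-start: the exact preprocessing calls may differ from the model card, and the image path `demo.png` and prompt are placeholders.

```python
# Sketch: try the Kimi-VL-A3B-Thinking-2506 prototype via the generic
# transformers Auto* pattern. Assumes a recent transformers install and a
# local image at "demo.png" (placeholder); exact calls may differ from the
# official model card.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image = Image.open("demo.png")
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "demo.png"},
        {"type": "text", "text": "Describe this image."},
    ]}
]

# Render the chat template to a prompt string, then let the processor
# pack the text and image into model-ready tensors.
prompt = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding the reply.
reply = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(reply)
```

Whether this prototype covers agentic workflows is a separate question, which the next reply addresses.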
Hey @teowu @lsw825, is it feasible in your assessment to finetune moonshotai/Kimi-K2-Instruct or Base for vision? Or should the prototype VL model mentioned above be sufficient for general video tasks as a SOTA open-weights VL model as of today? I'm particularly interested in the agentic capabilities of K2, so I'm not sure whether Kimi-VL-A3B-Thinking-2506 will perform similarly and handle agentic tool usage well, but I have not yet tested it against such tasks.
No, we didn't optimize Kimi-VL-A3B-Thinking-2506 for agentic tool usage.
Are you planning to open-source Kimi's chat platform, with context management, OCR, etc.? Or not at all?