MoshiVis v0.1
Collection
MoshiVis is a Vision Speech Model built as a perceptually augmented version of Moshi v0.1 for conversing about image inputs.
Please refer to the main model card.
This model page contains the Moshika (female voice) model weights for the MLX backend of the MoshiVis repo, in bfloat16, Q8, and Q4 formats. We provide the same model weights for other backends and quantization formats in the associated model collection.
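As a rough guide to choosing between the three formats listed above, the sketch below estimates on-disk weight size from the nominal bits per weight (16 for bfloat16, 8 for Q8, 4 for Q4). It assumes quantized formats carry no extra overhead, whereas real quantized checkpoints also store per-group scales and are somewhat larger. The function name `approx_weight_gib` and the parameter count in the usage example are hypothetical placeholders, not values taken from the MoshiVis repo.

```python
# Nominal storage bits per weight for each format named on this page.
# Assumption: per-group scale/bias overhead of the quantized formats is ignored.
BITS_PER_WEIGHT = {"bfloat16": 16, "Q8": 8, "Q4": 4}


def approx_weight_gib(num_params: int, fmt: str) -> float:
    """Return the approximate size of the weights in GiB for a given format."""
    bits = BITS_PER_WEIGHT[fmt]
    return num_params * bits / 8 / 2**30


if __name__ == "__main__":
    # Hypothetical placeholder, NOT the actual MoshiVis parameter count.
    hypothetical_params = 1_000_000_000
    for fmt in BITS_PER_WEIGHT:
        size = approx_weight_gib(hypothetical_params, fmt)
        print(f"{fmt}: ~{size:.2f} GiB")
```

Under these assumptions, Q8 takes about half the space of bfloat16 and Q4 about a quarter, which is the main trade-off to weigh against any loss in output quality.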
Base model: google/paligemma2-3b-pt-448