# ARIA - Artistic Rendering of Images into Audio
ARIA is a multimodal AI model that generates MIDI music based on the emotional content of artwork. It uses a CLIP-based image encoder to extract emotional valence and arousal from images, then generates emotionally appropriate music using conditional MIDI generation.
## Model Description
- Developed by: Vincent Amato
- Model type: Multimodal (Image-to-MIDI) Generation
- Language(s): English
- License: MIT
- Parent Models: CLIP (image encoding) and midi-emotion (music generation)
- Repository: GitHub
## Model Architecture
ARIA consists of two main components:
- A CLIP-based image encoder fine-tuned to predict emotional valence and arousal from images
- A transformer-based MIDI generation model (midi-emotion) that conditions on these emotional values
The model offers three conditioning modes (illustrated in the sketch after this list):

- `continuous_concat`: emotions as continuous vectors concatenated to all tokens
- `continuous_token`: emotions as a continuous vector prepended to the sequence
- `discrete_token`: emotions quantized into discrete tokens
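For intuition, here is a minimal PyTorch sketch of how the three strategies could inject a (valence, arousal) pair into a transformer's input. The tensor shapes, layer names, and bin count are illustrative assumptions for exposition, not the actual midi-emotion implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: dimensions, module names, and the bin count are
# assumptions for exposition, not the actual midi-emotion implementation.

batch, seq_len, d_model = 1, 16, 512
tok_emb = torch.randn(batch, seq_len, d_model)      # token embeddings
emotion = torch.tensor([[0.8, -0.3]])               # (valence, arousal)

# continuous_concat: project the 2-D emotion vector and concatenate it
# to the embedding of every token in the sequence.
emo_proj = nn.Linear(2, 64)
emo_feat = emo_proj(emotion).unsqueeze(1).expand(-1, seq_len, -1)
x_concat = torch.cat([tok_emb, emo_feat], dim=-1)   # (1, 16, 512 + 64)

# continuous_token: embed the emotion once and prepend it to the
# sequence as an extra conditioning "token".
emo_token = nn.Linear(2, d_model)(emotion).unsqueeze(1)   # (1, 1, 512)
x_prepend = torch.cat([emo_token, tok_emb], dim=1)        # (1, 17, 512)

# discrete_token: quantize valence and arousal into bins and map each
# bin to a special token id in the vocabulary.
n_bins = 5
edges = torch.linspace(-1.0, 1.0, n_bins + 1)[1:-1]       # inner bin edges
valence_id = torch.bucketize(emotion[:, 0], edges)        # 0 .. n_bins - 1
arousal_id = torch.bucketize(emotion[:, 1], edges)
```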
## Usage
The repository contains three variants of the MIDI generation model, each trained with a different conditioning strategy. Each variant includes:

- `model.pt`: the trained model weights
- `mappings.pt`: token mappings for MIDI generation
- `model_config.pt`: model configuration

Additionally, `image_encoder.pt` contains the CLIP-based image emotion encoder.
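A minimal loading sketch, assuming the checkpoints can be read directly with `torch.load` and that the image encoder maps a preprocessed image to a (valence, arousal) pair, is shown below. The per-variant paths, whether each file holds a full module or a state dict, and the generation entry point are all assumptions; consult the repository's scripts for the authoritative usage.

```python
import torch

# Hedged sketch: paths and interfaces below are assumptions based on the
# file list above, not a documented API; see the repository for details.

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP-based image emotion encoder: preprocessed image -> (valence, arousal)
image_encoder = torch.load("image_encoder.pt", map_location=device)

# One of the three MIDI generation variants, e.g. continuous_concat
# (per-variant directory layout assumed for illustration)
model = torch.load("continuous_concat/model.pt", map_location=device)
mappings = torch.load("continuous_concat/mappings.pt")    # token <-> id maps
config = torch.load("continuous_concat/model_config.pt")  # architecture settings

# Predict emotion from an artwork, then condition MIDI generation on it
# (image preprocessing and the generation loop are omitted here).
```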
## Intended Use
This model is designed for:
- Generating music that matches the emotional content of artwork
- Exploring emotional transfer between visual and musical domains
- Creative applications in art and music generation
## Limitations
- Music generation quality depends on the emotional interpretation of input images
- Generated MIDI may require human curation for professional use
- The model's emotional understanding is limited to the two-dimensional valence-arousal space
## Training Data
The model combines:
- Image encoder: Uses ArtBench with emotional annotations
- MIDI generation: Uses the Lakh-Spotify dataset as processed by the midi-emotion project
## Attribution
This project builds upon:
- midi-emotion by Serkan Sulun et al. (GitHub)
- Paper: "Symbolic music generation conditioned on continuous-valued emotions" (IEEE Access)
- Citation: S. Sulun, M. E. P. Davies and P. Viana, "Symbolic Music Generation Conditioned on Continuous-Valued Emotions," in IEEE Access, vol. 10, pp. 44617-44626, 2022
- CLIP by OpenAI for the base image encoder architecture
## License
This model is released under the MIT License. However, usage of the midi-emotion component should comply with its GPL-3.0 license.