ARIA - Artistic Rendering of Images into Audio

ARIA is a multimodal AI model that generates MIDI music based on the emotional content of artwork. It uses a CLIP-based image encoder to extract emotional valence and arousal from images, then generates emotionally appropriate music using conditional MIDI generation.

Model Description

  • Developed by: Vincent Amato
  • Model type: Multimodal (Image-to-MIDI) Generation
  • Language(s): English
  • License: MIT
  • Parent Model: Uses CLIP for image encoding and midi-emotion for music generation
  • Repository: GitHub

Model Architecture

ARIA consists of two main components:

  1. A CLIP-based image encoder fine-tuned to predict emotional valence and arousal from images (a minimal sketch follows this list)
  2. A transformer-based MIDI generation model (midi-emotion) that conditions on these emotional values
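The sketch below shows, purely as an illustration, what a CLIP-based emotion encoder of this kind can look like: a CLIP vision backbone followed by a small regression head that maps pooled image features to a (valence, arousal) pair. The specific CLIP checkpoint, head architecture, and output range are assumptions, not the exact ARIA implementation.

```python
# Hypothetical sketch of a CLIP-based emotion encoder (not the exact ARIA code).
import torch
import torch.nn as nn
from transformers import CLIPVisionModel


class EmotionEncoder(nn.Module):
    def __init__(self, clip_name: str = "openai/clip-vit-base-patch32"):
        super().__init__()
        self.backbone = CLIPVisionModel.from_pretrained(clip_name)
        hidden = self.backbone.config.hidden_size
        # Regression head: pooled CLIP features -> (valence, arousal), assumed in [-1, 1]
        self.head = nn.Sequential(
            nn.Linear(hidden, 256),
            nn.ReLU(),
            nn.Linear(256, 2),
            nn.Tanh(),
        )

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        pooled = self.backbone(pixel_values=pixel_values).pooler_output
        return self.head(pooled)  # shape: (batch, 2)
```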

The model offers three different conditioning modes (see the sketch after this list):

  • continuous_concat: Emotion values as a continuous vector concatenated to every token embedding
  • continuous_token: Emotion values as continuous vectors prepended to the token sequence
  • discrete_token: Emotion values quantized into discrete tokens
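The following toy example illustrates how the three strategies differ mechanically. Tensor shapes, bin counts, and variable names are illustrative assumptions and do not reproduce the midi-emotion implementation.

```python
# Toy illustration of the three conditioning strategies (shapes and names are assumptions).
import torch
import torch.nn as nn

seq_len, d_model = 8, 16
token_emb = torch.randn(seq_len, d_model)   # embedded MIDI tokens
emotion = torch.tensor([0.7, -0.2])         # (valence, arousal), assumed in [-1, 1]

# continuous_concat: broadcast the emotion vector and concatenate it to every token embedding
cond_concat = torch.cat([token_emb, emotion.expand(seq_len, 2)], dim=-1)  # (8, 18)

# continuous_token: project the emotion vector to d_model and prepend it as one extra "token"
emotion_proj = nn.Linear(2, d_model)
cond_token = torch.cat([emotion_proj(emotion).unsqueeze(0), token_emb], dim=0)  # (9, 16)

# discrete_token: quantize valence/arousal into bins whose IDs become extra vocabulary tokens
n_bins = 5
edges = torch.linspace(-1, 1, n_bins + 1)
emotion_ids = (torch.bucketize(emotion, edges) - 1).clamp(0, n_bins - 1)  # two bin IDs in [0, n_bins)
```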

Usage

The repository contains three variants of the MIDI generation model, each trained with a different conditioning strategy. Each variant includes:

  • model.pt: The trained model weights
  • mappings.pt: Token mappings for MIDI generation
  • model_config.pt: Model configuration

Additionally, image_encoder.pt contains the CLIP-based image emotion encoder.
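As a hedged example, the checkpoints can be fetched with huggingface_hub and deserialized with torch.load. The repo_id matches this model page; the per-variant subfolder layout shown below is an assumption and may need adjusting to the actual file paths. Instantiating the models from these objects requires the code from the ARIA and midi-emotion repositories.

```python
# Sketch: download one variant's files from the Hub and load them (paths are assumptions).
import torch
from huggingface_hub import hf_hub_download

repo_id = "vincentamato/ARIA"
variant = "continuous_concat"  # or "continuous_token" / "discrete_token"

weights_path = hf_hub_download(repo_id, filename=f"{variant}/model.pt")
mappings_path = hf_hub_download(repo_id, filename=f"{variant}/mappings.pt")
config_path = hf_hub_download(repo_id, filename=f"{variant}/model_config.pt")
encoder_path = hf_hub_download(repo_id, filename="image_encoder.pt")

# torch.load returns the raw saved objects (state dict, token mappings, config);
# building the actual model classes requires the ARIA / midi-emotion code.
state_dict = torch.load(weights_path, map_location="cpu")
mappings = torch.load(mappings_path, map_location="cpu")
config = torch.load(config_path, map_location="cpu")
```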

Intended Use

This model is designed for:

  • Generating music that matches the emotional content of artwork
  • Exploring emotional transfer between visual and musical domains
  • Creative applications in art and music generation

Limitations

  • Music generation quality depends on the emotional interpretation of input images
  • Generated MIDI may require human curation for professional use
  • The model's emotional understanding is limited to the valence-arousal space

Training Data

The model combines:

  1. Image encoder: Uses ArtBench with emotional annotations
  2. MIDI generation: Uses the Lakh-Spotify dataset as processed by the midi-emotion project

Attribution

This project builds upon:

  • midi-emotion by Serkan Sulun et al. (GitHub)
    • Paper: "Symbolic music generation conditioned on continuous-valued emotions" (IEEE Access)
    • Citation: S. Sulun, M. E. P. Davies, and P. Viana, "Symbolic Music Generation Conditioned on Continuous-Valued Emotions," in IEEE Access, vol. 10, pp. 44617-44626, 2022.
  • CLIP by OpenAI for the base image encoder architecture

License

This model is released under the MIT License. However, usage of the midi-emotion component should comply with its GPL-3.0 license.
