Abstract
A highly effective music synthesizer should deliver high-fidelity audio for a mix of instruments and voices. Current synthesizers often must choose between specialized models that provide detailed control over specific instruments and flexible waveform models that accommodate a variety of music at the expense of precision. To overcome this trade-off, this paper introduces MIAO, a neural music synthesizer for interactive and expressive music synthesis that converts MIDI sequences into rich, dynamic audio. Specifically, MIAO is trained on diverse transcription datasets that pair MIDI with audio, deepening its understanding of MIDI structure and strengthening its representation learning. This approach allows MIAO to offer precise note-level control over composition and instrumentation while handling a wide spectrum of instruments. We evaluate MIAO on six datasets: MAESTROv3 (piano), Slakh2100 (synthetic multi-instrument), Cerberus4 (synthetic multi-instrument), Guitarset (guitar), MusicNet (orchestral multi-instrument), and URMP (orchestral multi-instrument), where it sets a new state of the art.