15-826 Final Report
Dawen Liang,† Haijie Gu,‡ and Brendan O’Connor‡
† School of Music, ‡ Machine Learning Department, Carnegie Mellon University
December 3, 2011
1 Introduction
The field of Music Information Retrieval (MIR) draws on musicology, signal processing, and artificial intelligence. A long line of work addresses problems such as music understanding (extracting musically meaningful information from audio waveforms) and automatic music annotation (e.g., measuring song and artist similarity). However, very little of this work has scaled to commercially sized datasets: both the algorithms and the data are complex, and the extraordinary range of information hidden in music waveforms, from the perceptual to the auditory, inevitably makes large-scale applications challenging. A number of commercially successful online music services exist, such as Pandora, Last.fm, and Spotify, but most of them rely on traditional text IR. Our course project focuses on large-scale data mining of music information with the recently released Million Song Dataset (MSD) (Bertin-Mahieux et al., 2011, http://labrosa.ee.columbia.edu/millionsong/), which consists of
300GB of audio features and metadata. The dataset was released to push the boundaries of Music IR research to commercial scales. In addition, the associated musiXmatch dataset2 provides textual lyrics for many of the MSD songs. Combining these two datasets, we propose a cross-modal retrieval framework that joins the musical and textual data for the task of genre classification: Given N song-genre pairs (S1, G1), . . . , (SN, GN), where Si ∈ F for some feature space F and Gi ∈ G for some genre set G, output the classifier with the highest classification accuracy on the hold-out test set (a minimal sketch of this setup appears below). The raw feature space F contains multiple domains of sub-features, which can be of variable length. The genre label set G is
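To make the setup concrete, the following is a minimal Python sketch of the hold-out evaluation described above, not the system developed in this report. Everything in it is an illustrative assumption: the feature matrix and genre labels are random stand-ins for the real MSD data, and the sizes (n_songs, n_features) and the logistic-regression classifier are arbitrary placeholders.

# Minimal sketch of the genre-classification setup: N (song, genre) pairs,
# a feature vector per song, and accuracy measured on a hold-out split.
# The features and labels below are random stand-ins, not real MSD data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

n_songs, n_features = 1000, 20            # illustrative sizes, not MSD's
genres = ["rock", "jazz", "rap", "pop"]   # stand-in genre set G

X = rng.normal(size=(n_songs, n_features))  # each row: one S_i in F
y = rng.choice(genres, size=n_songs)        # each entry: one G_i in G

# Hold out a test set; the task is to maximize accuracy on this split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("hold-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))

Any classifier could replace the logistic regression here; the sketch only fixes the interface of the task: each song maps to a point in the feature space F, each label comes from the genre set G, and success is measured by accuracy on the held-out split.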