Multi-modal semantics has relied on feature norms or raw image data for perceptual input. In this paper we examine grounding semantic representations in raw auditory data, using standard evaluations for multi-modal semantics, including measuring conceptual similarity and relatedness. We also evaluate cross-modal mappings, through a zero-shot learning task mapping between linguistic and auditory modalities. In addition, we evaluate multi-modal representations on an unsupervised musical instrument clustering task. To our knowledge, this is the first work to combine linguistic and auditory information into multi-modal representations.

Authors: Douwe Kiela, Stephen Clark
Published in: Conference on Empirical Methods in Natural Language Processing (EMNLP) (2015)
URL: http://www.cl.cam.ac.uk/~dk427/papers/emnlp2015a.pdf