Adrian Ulges1 , Christian Schulze2 , Daniel Keysers2 , Thomas M. Breuel1
1 University of Kaiserslautern, Germany
2 German Research Center for Artificial Intelligence (DFKI), Kaiserslautern
{a_ulges,tmb}@informatik.uni-kl.de,
{christian.schulze,daniel.keysers}@dfki.de
Abstract
Despite the increasing economic impact of the online video market, search in commercial video databases is still mostly based on user-generated meta-data. To complement this manual labeling, recent research efforts have investigated the interpretation of the visual content of a video to automatically annotate it. A key problem with such methods is the costly acquisition of a manually annotated training set.
In this paper, we study whether content-based tagging can be learned from user-tagged online video, a vast, public data source. We present an extensive benchmark using a database of real-world videos from the video portal youtube.com. We show that a combination of several visual features improves performance over our baseline system by about 30%.
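The abstract's claim that combining several visual features yields the reported gain suggests some form of fusion of per-feature tag scores. Purely as an illustration, and not as the authors' actual method, the following sketch shows a generic score-level (late) fusion; the feature names, the uniform weighting, and the fuse_tag_scores helper are assumptions made for this example only.

```python
# Illustrative sketch (not the paper's implementation): late fusion of
# per-feature tag scores. Each visual feature (e.g., color, texture, motion)
# is assumed to produce a score in [0, 1] for a tag; the combined score is a
# weighted average of those per-feature scores.

def fuse_tag_scores(feature_scores, weights=None):
    """Combine per-feature scores for a single tag.

    feature_scores: dict mapping feature name -> score from that feature's model
    weights:        optional dict mapping feature name -> fusion weight
    """
    if weights is None:
        weights = {name: 1.0 for name in feature_scores}
    total_weight = sum(weights[name] for name in feature_scores)
    return sum(weights[name] * s for name, s in feature_scores.items()) / total_weight


# Example: hypothetical scores for the tag "soccer" from three feature channels.
scores = {"color_histogram": 0.72, "texture": 0.55, "motion": 0.80}
print(fuse_tag_scores(scores))  # unweighted average: 0.69
```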
1 Introduction
Due to the rapid spread of the web and the growth of its bandwidth, millions of users have discovered online video as a source of information and entertainment. A market of significant economic impact has evolved that is often regarded as a serious competitor to traditional TV broadcasting.
However, accessing the desired pieces of information efficiently is difficult due to the enormous quantity and diversity of the video material published. Most commercial systems organize video access and search via meta-data such as the video title or user-generated tags (e.g., youtube, myspace, clipfish), an indexing method that requires manual work and is time-consuming, incomplete, and subjective.
While commercial systems neglect another valuable source of information, namely the content of a video, research in content-based video retrieval strives to