Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, Cédric Bray
Xerox Research Centre Europe
6, chemin de Maupertuis
38240 Meylan, France
{gcsurka,cdance}@xrce.xerox.com
Abstract. We present a novel method for generic visual categorization: the problem of identifying the object content of natural images while generalizing across variations inherent to the object class. This bag of keypoints method is based on vector quantization of affine invariant descriptors of image patches.
We propose and compare two alternative implementations using different classifiers: Naïve Bayes and SVM. The main advantages of the method are that it is simple, computationally efficient and intrinsically invariant. We present results for simultaneously classifying seven semantic visual categories. These results clearly demonstrate that the method is robust to background clutter and produces good categorization accuracy even without exploiting geometric information.
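For concreteness, the following is a minimal Python sketch (using NumPy and scikit-learn) of the pipeline just described: local descriptors are vector-quantized against a visual vocabulary, each image becomes a histogram of visual-word counts, and the two alternative classifiers are fitted on those histograms. It assumes each image has already been reduced to an array of local descriptors; the vocabulary size, the multinomial Naïve Bayes variant and the linear SVM are illustrative choices, not necessarily the exact settings evaluated in the paper.

    # Minimal sketch of a bag-of-keypoints pipeline, under the assumptions above.
    # Each image is assumed to be given as an (n_i, d) array of local descriptors
    # (e.g. SIFT descriptors on affine-invariant regions); vocabulary size and
    # classifier settings are illustrative, not the paper's exact choices.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    def build_vocabulary(descriptor_sets, n_words=1000, seed=0):
        """Vector-quantize all training descriptors into a visual vocabulary."""
        all_descriptors = np.vstack(descriptor_sets)
        return KMeans(n_clusters=n_words, random_state=seed).fit(all_descriptors)

    def bag_of_keypoints(descriptors, vocabulary):
        """Histogram of visual-word occurrences for one image."""
        words = vocabulary.predict(descriptors)
        return np.bincount(words, minlength=vocabulary.n_clusters).astype(float)

    def train_classifiers(descriptor_sets, labels, vocabulary):
        """Fit the two alternative classifiers on bag-of-keypoints histograms."""
        X = np.array([bag_of_keypoints(d, vocabulary) for d in descriptor_sets])
        nb = MultinomialNB().fit(X, labels)    # Naive Bayes alternative
        svm = LinearSVC().fit(X, labels)       # SVM alternative
        return nb, svm

Because the image representation is an orderless histogram of quantized descriptors, any geometric invariance comes entirely from the local detector and descriptor, which is what makes the method simple and computationally efficient.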
1. Introduction
The proliferation of digital imaging sensors in mobile phones and consumer-level cameras is producing a growing number of large digital image collections. To manage such collections, it is useful to have access to high-level information about the objects contained in each image. Given an appropriate categorization of image contents, one may efficiently search, recommend, react to or reason with new image instances.
We are thus confronted with the problem of generic visual categorization. We would like to identify processes that are sufficiently generic to cope with many object types simultaneously and that are readily extended to new object types. At the same time, these processes should handle the variations in view, imaging, lighting and occlusion typical of the real world, as well as the intra-class variations typical of semantic classes of everyday objects.
The task-dependent and evolving nature of visual categories