IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 6, JUNE 2013
Spatial Approximate String Search
Feifei Li, Member, IEEE, Bin Yao, Mingwang Tang, and Marios Hadjieleftheriou
Abstract—This work deals with approximate string search in large spatial databases. Specifically, we investigate range queries augmented with a string similarity search predicate in both Euclidean space and road networks. We dub this query the spatial approximate string (SAS) query. In Euclidean space, we propose an approximate solution, the MHR-tree, which embeds min-wise signatures into an R-tree. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the subtree of u. We analyze the pruning functionality of such signatures based on the set resemblance between the query string and the q-grams from the subtrees of index nodes. We also discuss how to estimate the selectivity of a SAS query in Euclidean space, for which we present a novel adaptive algorithm to find balanced partitions using both the spatial and string information stored in the tree. For queries on road networks, we propose a novel exact method, RSASSOL, which significantly outperforms the baseline algorithm in practice. RSASSOL combines q-gram-based inverted lists with reference-node-based pruning. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our approaches.
Index Terms—Approximate string search, range query, road network, spatial databases
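The abstract's notions of q-grams, min-wise signatures, and set resemblance can be illustrated with a minimal sketch. It is not the MHR-tree implementation: the function names, the padding characters `#` and `$`, the signature length `k`, and the use of seeded SHA-1 hashes in place of min-wise independent permutations are all assumptions made for illustration.

```python
import hashlib


def qgrams(s, q=2):
    """Return the set of q-grams of s, padded with q-1 sentinel characters per side."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def _hash(gram, seed):
    # Deterministic seeded hash; stands in for one random permutation (assumption).
    data = f"{seed}:{gram}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minwise_signature(grams, k=64):
    """Keep, for each of k hash functions, the minimum hash value over a non-empty set."""
    return [min(_hash(g, seed) for g in grams) for seed in range(k)]


def merge_signatures(sig_a, sig_b):
    """Signature of the union of two sets: the element-wise minimum."""
    return [min(x, y) for x, y in zip(sig_a, sig_b)]


def estimated_resemblance(sig_a, sig_b):
    """Fraction of positions where two signatures agree; estimates |A n B| / |A u B|."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_resemblance(a, b):
    """Exact set resemblance (Jaccard similarity) of two non-empty sets."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    a, b = qgrams("theatre"), qgrams("theater")
    print("exact:", exact_resemblance(a, b))
    print("estimate:", estimated_resemblance(minwise_signature(a), minwise_signature(b)))
```

Because the signature of a union is the element-wise minimum of the signatures of its parts, a parent node's signature can in principle be derived from its children's signatures without revisiting the underlying strings, which is what makes such signatures convenient to store in a tree.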
1 INTRODUCTION
Keyword search over a large amount of data is an important operation in a wide range of domains. Felipe et al. have recently extended its study to spatial databases [17], where keyword search becomes a fundamental building block for an increasing number of real-world applications, and proposed the IR2-Tree. A main limitation of the IR2-Tree is that it only supports exact keyword search.
References: Management of Data, pp. 13-24, 1999. Conf. Advances in Geographic Information Systems (GIS), pp. 61-70, 2010. pp. 322-331, 1990. ACM 30th Symp. Theory of Computing (STOC), pp. 327-336, 1998. SIGMOD Int’l Conf. Management of Data, pp. 805-818, 2008. SIGMOD Int’l Conf. Management of Data, pp. 313-324, 2003. Proc. Int’l Conf. Data Eng. (ICDE), pp. 227-238, 2004. (ICDE), pp. 5-16, 2006. Sciences, vol. 55, no. 3, pp. 441-453, 1997. vol. 2, no. 1, pp. 337-348, 2009. (ICDM), pp. 139-146, 2002. Symp. Discrete Algorithms (SODA), pp. 156-165, 2005. Bases (VLDB), pp. 491-500, 2001. Real Attributes,” The VLDB J., vol. 14, no. 2, pp. 137-154, 2005. pp. 47-57, 1984. pp. 397-408, 2005. vol. 17, no. 5, pp. 1213-1229, 2008. Structure,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 325-336, 2005. Very Large Data Bases (VLDB), pp. 1078-1086, 2004.