June Boltzis
LILAC Centre, School of Library Studies, Clyde College, Elgin, Australia
Abstract
Ranking techniques are used to evaluate natural-language queries on text databases, which are an important component of digital libraries. These information retrieval systems must access large volumes of text, often divided into several collections that may be held on separate machines. Effective ranking can be costly in memory and time: the database may contain millions of documents, and queries can contain large numbers of terms. In many environments, such as current desktop computers, standard CPU speeds and volumes of memory are more than adequate to resolve queries rapidly, even on databases of many gigabytes of text. In other environments, however, both memory and time are limited: examples include Internet search engines, corporate data servers, online product databases, and, at the other extreme, handheld computers with PCMCIA-slot disk drives. Techniques for locating answers to queries must therefore consider identification of probable collections as well as identification of documents that are probable answers, to avoid the situation in which all queries must be answered in full by all servers. In this paper we show that use of centralised blocked indexes, expressly designed for a multi-collection environment, can meet these objectives and simultaneously reduce overall query processing costs.
1 Introduction
The use of information retrieval systems for management of text data is widespread, and their use is likely to accelerate with the advent of the digital library. Newspaper archives, library catalogues, and legislation repositories all require access by record content if they are to be useful and effective. Techniques for efficient ranking reduce the time or memory required to resolve a query, but they do not necessarily bound those costs.
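The two-stage idea stated in the abstract, identifying probable collections before identifying probable answer documents, can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration rather than the method developed in this paper: a "block" is taken to be a fixed-size run of consecutive documents within one collection, the score is a simple log-damped term count, and the names `build_blocked_index`, `rank_blocks`, and `probable_collections` are hypothetical.

```python
from collections import defaultdict
import math


def build_blocked_index(collections, block_size=2):
    """Build one central index: term -> {(collection, block_no): term count}.

    Assumption: a block is `block_size` consecutive documents of a collection.
    """
    index = defaultdict(lambda: defaultdict(int))
    for name, docs in collections.items():
        for doc_no, text in enumerate(docs):
            block = (name, doc_no // block_size)
            for term in text.lower().split():
                index[term][block] += 1
    return index


def rank_blocks(index, query):
    """Score each block by summing log-damped counts of the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        # .get avoids creating empty entries for unseen terms
        for block, count in index.get(term, {}).items():
            scores[block] += 1.0 + math.log(count)
    # Highest score first; ties broken by block identifier for determinism
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def probable_collections(index, query, top_blocks=2):
    """Return the collections that own the highest-scoring blocks."""
    selected = []
    for (name, _block_no), _score in rank_blocks(index, query)[:top_blocks]:
        if name not in selected:
            selected.append(name)
    return selected


# Toy data, invented purely for this illustration
collections = {
    "news": [
        "court rules on water rights",
        "election results announced",
        "water shortage hits farms",
    ],
    "law": ["water act amendment", "tax law reform"],
    "catalogue": ["gardening for beginners", "history of rail"],
}

index = build_blocked_index(collections)
print(probable_collections(index, "water rights"))
```

In this sketch only the servers holding the selected collections would then be asked to rank their own documents, so a collection with no matching block (here `catalogue`) is never contacted. Larger blocks give a smaller central index but coarser selection.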