I. Early Work
• Early Work assumed that files were on tape.
• Access was sequential
– The cost of access grew in direct proportion to the size of the file.
II. The emergence of Disks and Indexes
• As files grew very large, sequential access was not a good solution.
• Disks allowed for direct access.
– Indexes made it possible to keep a list of keys and pointers in a small file that could be searched very quickly.
– With the key and pointer, the user had direct access to the large, primary file.
III. The emergence of Tree Structures
• Indexes also have a sequential flavor
• When they grow too much, they also become difficult to manage.
• The idea of using tree structures to manage the index emerged in the early 60’s.
• However, trees can grow very unevenly as records are added and deleted
• Resulting in long searches requiring many disk accesses to find a record.
IV. Balanced Trees
• In 1963, researchers came up with the idea of AVL trees for data in memory.
• However, AVL trees did not apply to files
• Because they work well when tree nodes are composed of single records rather than dozens or hundreds of them.
• In the 1970’s came the idea of B-Trees which require an O(logkN) access time
• Where N is the number of entries in the file and k is the number of entries indexed in a single block of the B-Tree structure
• B-Trees can guarantee that we can find an entry among millions of others with only 3 or 4 trips to the disk.
V. Hash Tables
• Retrieving entries in 3 or 4 accesses is good
• But it does not reach the goal of accessing data with a single request.
• Hashing was a good way to reach this goal with files that do not change size greatly over time.
• Recently, Extendible Dynamic Hashing guarantees one or at most two disk accesses no matter how big a file becomes.