As is evident from the related work discussed in Section 2, disk utilization is not the bottleneck when small files are stored on HDFS. Rather, the small file problem arises because the NameNode's memory is heavily consumed by the metadata and BlockMap entries of a huge number of files. The NameNode keeps the file system metadata in main memory, and the metadata of one file takes about 250 bytes. By default, three replicas are created for each block, and the metadata of a block takes about 368 bytes [9]. Let α denote the number of memory bytes that the NameNode consumes for itself, let β denote the number of memory bytes consumed by the BlockMap, and let S denote the size of an HDFS block. Further assume that there are N small files to be stored on HDFS.
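As an illustrative sketch, assuming every small file is smaller than one HDFS block (so each file occupies exactly one block) and that the 368-byte per-block cost accounts for the corresponding BlockMap entry (i.e., β = 368·N under this assumption), the memory consumed by the NameNode when the N small files are stored directly on HDFS is approximately

    M_HDFS ≈ α + 250·N + 368·N = α + 618·N bytes.

This linear growth in N is what makes a very large number of small files problematic for the NameNode.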
FIndex: Local index file for a set of merged small files.
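The internal layout of an FIndex record is not detailed here; a minimal sketch, assuming each record simply maps a small file name to its byte range inside the merged file (the class and field names below are illustrative), is:

    // Illustrative sketch of one FIndex record: it maps a small file
    // to its byte range inside the merged file. Names are assumptions.
    public class FIndexRecord {
        public final String smallFileName; // original small file name
        public final long offset;          // start offset within the merged file
        public final long length;          // length of the small file in bytes

        public FIndexRecord(String smallFileName, long offset, long length) {
            this.smallFileName = smallFileName;
            this.offset = offset;
            this.length = length;
        }
    }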
Phase 4: Uploading of files to HDFS: Both the local index file and the merged file are written to HDFS, which avoids the overhead of keeping per-small-file information at the NameNode. The NameNode keeps the metadata of the merged file and the index file only. File correlations are taken into account when storing the files in order to improve access efficiency.
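A minimal sketch of this upload step using the standard Hadoop FileSystem API is shown below; the target path names and the assumption that the merged file and index file are first prepared locally are illustrative, not part of the original design.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MergedFileUploader {
        // Uploads a locally prepared merged file and its local index file to HDFS.
        // After this step the NameNode holds metadata for these two files only,
        // instead of one entry per small file.
        public static void upload(String localMergedFile, String localIndexFile,
                                  String hdfsDir) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            fs.copyFromLocalFile(new Path(localMergedFile),
                                 new Path(hdfsDir, "merged.dat"));    // merged small files
            fs.copyFromLocalFile(new Path(localIndexFile),
                                 new Path(hdfsDir, "merged.findex")); // local index file

            fs.close();
        }
    }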
Phase 5: File caching strategy: A caching strategy is used to cache the local index file and correlated files. With this strategy, communication with HDFS is drastically reduced when downloading files, which improves access efficiency. When a requested file misses the cache, the client queries the NameNode for the metadata of the merged file. According to this metadata, the client connects to the appropriate DataNodes where the blocks are located. The local index file is read first; based on the offset and length recorded there, the requested file is extracted from the block and returned to the client.
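A minimal sketch of this read path is shown below, assuming the cached local index file has already been parsed into an in-memory map of FIndexRecord entries (as sketched earlier); method and variable names are illustrative.

    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SmallFileReader {
        // Extracts one small file from the merged file using its index record.
        // The index map is assumed to have been parsed from the cached FIndex file.
        public static byte[] readSmallFile(Path mergedFile, String smallFileName,
                                           Map<String, FIndexRecord> index) throws Exception {
            FIndexRecord rec = index.get(smallFileName);
            if (rec == null) {
                throw new java.io.FileNotFoundException(smallFileName);
            }
            FileSystem fs = FileSystem.get(new Configuration());
            byte[] buf = new byte[(int) rec.length];
            try (FSDataInputStream in = fs.open(mergedFile)) {
                // Positioned read: fetch only the small file's byte range
                // from the merged file, rather than the whole block.
                in.readFully(rec.offset, buf);
            }
            return buf;
        }
    }

The positioned read returns only the requested small file's bytes, which matches the description above of splitting the file from the block by its offset and length.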
5. Theoretical Validation of the Proposed Technique
Suppose there are N small files, which are merged into K merged files whose lengths are denoted as LM1, LM2, …, and LMK. The formula for the memory consumed by the NameNode under the file merging and caching technique is given below.
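A plausible reconstruction of this formula, under the assumptions stated earlier and additionally assuming that the NameNode stores metadata only for the K merged files and their K local index files, with each index file occupying a single block, is

    M_merge ≈ α + 250·(2K) + 368·(K + Σ_{i=1..K} ⌈LMi / S⌉).

Since K is far smaller than N, this is substantially lower than the roughly α + 618·N bytes estimated above for storing the N small files directly.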