May Phyu Htun
Computer University (Mandalay), mphyutun@gmail.com
Abstract
The large size and dynamic nature of the Web highlight the need for continuous support and updating of Web-based information retrieval systems. Crawlers facilitate this process by following the hyperlinks in Web pages to automatically download a partial snapshot of the Web. Traversing the web graph in breadth-first search order is a good crawling strategy. This work studies a crawling infrastructure and the basic concepts of Web crawling. A web crawler application is then implemented using the breadth-first search technique. Breadth-first crawling checks each link on a page before proceeding to the next page. Thus, it crawls each link on the first page, then each link on the page reached through the first page's first link, and so on, until each level of links has been exhausted. While crawling the links of a URL, the system saves the downloaded HTML pages to a local folder in MHTML (Single File Web Page) format.
Introduction
The Web is a very large collection of pages, and search engines serve as the primary discovery mechanism for its content. To provide search functionality, search engines use crawlers that automatically follow links to web pages and extract their content. Web crawlers are programs that exploit the graph structure of the Web to move from page to page. In their infancy such programs were also called wanderers, robots, spiders, fish, and worms, words that are quite evocative of Web imagery. Crawling can be viewed as a graph search problem. The Web is seen as a large graph with pages as its nodes and hyperlinks as its edges. A web crawler moves from node to node by means of the hyperlinks that each node contains and that define the edges of the web graph. Therefore, many algorithms used in graph searching also appear in web crawling, often in adapted form. Traversing the web graph in breadth-first search
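The graph view described above can be illustrated with a minimal sketch, which is not the implementation used in this system. It assumes the web graph is already available as an in-memory dictionary, so no pages are actually downloaded. The function name `bfs_crawl_order`, the `max_pages` parameter, and the five-page `toy_web` graph are hypothetical and introduced only for illustration.

```python
from collections import deque


def bfs_crawl_order(links, seed, max_pages=None):
    """Return pages in the order a breadth-first crawler would visit them.

    `links` maps each page to the list of pages it links to. It stands in
    for downloading a page and extracting its hyperlinks (an assumption of
    this sketch; a real crawler would fetch and parse each page).
    """
    visited = {seed}          # pages already queued, so none is visited twice
    frontier = deque([seed])  # FIFO queue: pages are visited level by level
    order = []
    while frontier:
        if max_pages is not None and len(order) >= max_pages:
            break
        page = frontier.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in visited:
                visited.add(target)
                frontier.append(target)
    return order


# Hypothetical five-page site: nodes are pages, edges are hyperlinks.
toy_web = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["A"],
    "E": [],
}

print(bfs_crawl_order(toy_web, "A"))  # ['A', 'B', 'C', 'D', 'E']
```

The first-in, first-out queue is what makes the traversal breadth-first: every page linked from the seed is visited before any page two links away. Replacing the queue with a stack would give a depth-first crawl instead.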
References
[3] Pinkerton, B. 1994. “Finding what people want: Experiences with the WebCrawler”. In Proc. 1st International World Wide Web Conference (Geneva).
[4] Najork, M. and Wiener, J. L. 2001. “Breadth-first search crawling yields high-quality pages”. In Proc. 10th International World Wide Web Conference.