IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 20, NO. 10, OCTOBER 2009
A Parameterized Approach to Spam-Resilient Link Analysis of the Web
James Caverlee, Member, IEEE, Steve Webb, Member, IEEE,
Ling Liu, Senior Member, IEEE, and William B. Rouse, Fellow, IEEE
Abstract—Link-based analysis of the Web provides the basis for many important applications, such as Web search, Web-based data mining, and Web page categorization, that bring order to the massive amount of distributed Web content. Due to the overwhelming reliance on these important applications, there is a rise in efforts to manipulate (or spam) the link structure of the Web. In this manuscript, we present a parameterized framework for link analysis of the Web that promotes spam resilience through a source-centric view of the Web. We provide a rigorous study of the set of critical parameters that can impact source-centric link analysis and propose the novel notion of influence throttling for countering the influence of link-based manipulation. Through formal analysis and a large-scale experimental study, we show how different parameter settings may impact the time complexity, stability, and spam resilience of Web link analysis. Concretely, we find that the source-centric model supports more effective and robust rankings in comparison with existing Web algorithms such as PageRank.
Index Terms—Internet search, information search and retrieval, information storage and retrieval, information technology and systems, distributed systems, systems and software, Web search, general, Web-based services, online information services.
1 INTRODUCTION
The Web is arguably the most massive and successful distributed computing application today. Millions of Web servers support the autonomous sharing of billions of Web pages. From its earliest days, the Web has been the subject of intense focus for organizing, sorting, and understanding its massive amount of data.