Providers on the Web
Xiaoxin Yin
UIUC
xyin1@cs.uiuc.edu
Jiawei Han
UIUC
hanj@cs.uiuc.edu
Philip S. Yu
IBM T. J. Watson Res. Center
psyu@us.ibm.com
ABSTRACT
The world-wide web has become the most important information source for most of us. Unfortunately, there is no guarantee of the correctness of information on the web. Moreover, different web sites often provide conflicting information on a subject, such as different specifications for the same product. In this paper we propose a new problem called Veracity, i.e., conformity to truth, which studies how to find true facts from a large amount of conflicting information on many subjects that is provided by various web sites.
We design a general framework for the Veracity problem, and invent an algorithm called TruthFinder, which utilizes the relationships between web sites and their information, i.e., a web site is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy web sites. Our experiments show that TruthFinder successfully finds true facts among conflicting information, and identifies trustworthy web sites better than the popular search engines.
Keywords: data quality, web mining, link analysis.
There is no guarantee of the correctness of information on the web. Even worse, different web sites often provide conflicting information, as shown below.
Example 1: Authors of books. We tried to find out who wrote the book “Rapid Contextual Design” (ISBN: 0123540518). We found many different sets of authors from different online bookstores, and we show several of them in Table 1. From the image of the book cover we found that A1 Books provides the most accurate information. In comparison, the information from Powell’s books is incomplete, and that from Lakeside books is incorrect.
Web site
A1 Books
Powell’s books
Cornwall books
Mellon’s books
Lakeside books
Blackwell
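The mutual dependency stated in the abstract (a web site is trustworthy if it provides many true facts, and a fact is likely to be true if many trustworthy web sites provide it) can be illustrated with a small fixed-point iteration. The following is only a minimal sketch under our own assumptions, not the TruthFinder algorithm itself: the update rules, the function name `iterate_trust`, the parameters `initial_trust`, `dampening`, and `rounds`, and the toy claims (hypothetical sites and values, unrelated to Table 1) are all chosen for illustration.

```python
from collections import defaultdict


def iterate_trust(claims, initial_trust=0.8, dampening=0.5, rounds=10):
    """Alternate between scoring facts and scoring sites (illustrative only).

    claims: dict mapping site name -> dict mapping object -> claimed value.
    Returns (trust, confidence), where confidence is keyed by (object, value).
    """
    # Assumption: every site starts with the same trust score.
    trust = {site: initial_trust for site in claims}
    confidence = {}
    for _ in range(rounds):
        # A fact is more likely true when more trustworthy sites provide it.
        # Assumed rule: multiply the dampened "disbelief" of each provider.
        disbelief = defaultdict(lambda: 1.0)
        for site, facts in claims.items():
            for obj, value in facts.items():
                disbelief[(obj, value)] *= 1.0 - dampening * trust[site]
        confidence = {fact: 1.0 - d for fact, d in disbelief.items()}

        # A site is more trustworthy when its facts have high confidence.
        # Assumed rule: average the confidence of the facts it provides.
        new_trust = {}
        for site, facts in claims.items():
            if facts:
                scores = [confidence[(obj, val)] for obj, val in facts.items()]
                new_trust[site] = sum(scores) / len(scores)
            else:
                new_trust[site] = trust[site]  # no claims: keep the old score
        trust = new_trust
    return trust, confidence


if __name__ == "__main__":
    # Hypothetical claims: three sites, two objects, one conflict on book_1.
    toy_claims = {
        "site_a": {"book_1": "X", "book_2": "P"},
        "site_b": {"book_1": "X", "book_2": "P"},
        "site_c": {"book_1": "Y", "book_2": "P"},
    }
    trust, confidence = iterate_trust(toy_claims)
    for site in sorted(trust, key=trust.get, reverse=True):
        print(site, round(trust[site], 3))
```

In this toy setting the value supported by two sites ends up with higher confidence than the conflicting value, and the site that provided the minority value ends up with the lowest trust. The `dampening` factor is our own device to keep confidences away from 1 when many sites agree.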