Document similarity detection is useful in many areas, such as copyright enforcement and plagiarism discovery. However, it is difficult to test the similarity between documents when the documents cannot be disclosed or when privacy is a concern. This paper suggests a solution built on two metrics: a utility metric and a security metric.
Problem
Suppose that two parties want to find out whether or not they hold related or similar documents. Both parties are concerned about privacy. Their goal is to discover only whether their documents are similar, without disclosing the documents themselves.
Solution
a. Without privacy concerns
If the parties have no concern about their privacy, there are many ways to discover the similarities. One of them is the “similarity of ranked list” utility: given a document D from party A, find the ranked list of the top 10 documents held by party B that are most similar to D.
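The ranked-list retrieval in (a) can be sketched as follows. The text does not say which similarity measure is used, so this sketch assumes TF-IDF weighting with cosine similarity; the function names (build_idf, vectorize, ranked_list) and the whitespace tokenizer are illustrative choices, not part of the suggested solution.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokens; a real system would use a proper tokenizer.
    return text.lower().split()

def build_idf(collection):
    # Smoothed inverse document frequency over party B's collection.
    n = len(collection)
    df = Counter(t for doc in collection for t in set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def vectorize(text, idf):
    # Sparse TF-IDF vector; terms unseen in the collection are dropped
    # because they cannot match any document there.
    counts = Counter(tokenize(text))
    return {t: tf * idf[t] for t, tf in counts.items() if t in idf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def ranked_list(query, collection, k=10):
    # Return the indices of the k documents in `collection` most similar to `query`.
    idf = build_idf(collection)
    q = vectorize(query, idf)
    scored = [(cosine(q, vectorize(doc, idf)), i) for i, doc in enumerate(collection)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by index
    return [i for _, i in scored[:k]]

# Example: party B's documents and a document D from party A.
docs_b = ["the cat sat on the mat", "stock markets fell sharply", "a cat and a dog"]
top = ranked_list("cat sat on mat", docs_b, k=2)
```

Here party B's collection is a plain list of strings and the ranked list is returned as document indices, which keeps the output easy to compare with a second list later.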
b. Privacy is a concern
If the two parties do not want to disclose their documents to each other, a secure solution has to be found. The suggested solution combines the same utility as above, “similarity of ranked list”, with the security metric “t-Plausibility”: given a document D, produce D’, a generalized version of D obtained using t-Plausibility. Pass D’ to party B and retrieve the ranked list of documents similar to it.
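The generalization step in (b) can be illustrated with a toy sketch. It assumes, as a simplification, that D’ is acceptable when at least t fully specific documents could generalize to it; the exact definition should be taken from the t-Plausibility work itself. The ontology PARENT, the greedy lifting strategy, and all function names are hypothetical and exist only for illustration.

```python
# Hypothetical toy ontology: each term maps to its more general parent.
PARENT = {
    "aspirin": "painkiller",
    "ibuprofen": "painkiller",
    "painkiller": "drug",
    "insulin": "drug",
    "migraine": "headache",
    "headache": "symptom",
    "nausea": "symptom",
}

def leaves_under(term):
    # Number of most specific terms that generalize to `term`
    # (a term with no children counts as one leaf: itself).
    children = [c for c, p in PARENT.items() if p == term]
    if not children:
        return 1
    return sum(leaves_under(c) for c in children)

def plausible_sources(tokens):
    # Number of distinct fully specific token sequences that generalize to `tokens`.
    count = 1
    for tok in tokens:
        count *= leaves_under(tok)
    return count

def generalize(tokens, sensitive, t):
    # Greedily lift sensitive terms one ontology level at a time
    # until at least t source documents are plausible.
    out = list(tokens)
    positions = [i for i, tok in enumerate(tokens) if tok in sensitive]
    while plausible_sources(out) < t:
        liftable = [i for i in positions if out[i] in PARENT]
        if not liftable:
            raise ValueError("ontology too shallow to reach the requested t")
        # Lift the currently least generalized sensitive term first.
        i = min(liftable, key=lambda j: leaves_under(out[j]))
        out[i] = PARENT[out[i]]
    return out

d = ["patient", "takes", "aspirin", "for", "migraine"]
d_prime = generalize(d, {"aspirin", "migraine"}, t=4)
```

Party A would then send the text of d_prime, rather than d, to party B as the query. A larger t forces more lifting, which is the privacy and utility trade-off the solution has to balance.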
Analysis and testing
To measure the effectiveness of the solution suggested above, the top 10 ranked list produced by solution (a) is compared with the top 10 ranked list produced by solution (b). If the number of documents common to both lists is close to a given threshold, the solution can be considered sufficient.
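The comparison of the two lists can be sketched as an overlap measure. This is a minimal sketch that reads "close to the threshold" as "the shared fraction meets or exceeds the threshold", which is an assumption; the names overlap_at_k and is_sufficient are illustrative.

```python
def overlap_at_k(baseline, private, k=10):
    # Fraction of the baseline top-k (solution a) that also appears
    # in the private top-k (solution b); list order is ignored.
    base, priv = set(baseline[:k]), set(private[:k])
    if not base:
        return 0.0
    return len(base & priv) / len(base)

def is_sufficient(baseline, private, threshold, k=10):
    # Assumption: the solution is sufficient when the overlap reaches the threshold.
    return overlap_at_k(baseline, private, k) >= threshold

# Example with document ids: 7 of the 10 baseline documents are recovered.
list_a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
list_b = [0, 1, 2, 3, 4, 5, 6, 20, 21, 22]
score = overlap_at_k(list_a, list_b)
```

Set overlap ignores the order of the documents within each list; a rank-aware measure could be substituted if the positions matter.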
Comments and Ideas:
- The more general D’ is, the lower the probability that D’ was generated from D. This may make similarity detection difficult.
- The top-ranked list will contain documents in the same domain as D, not necessarily documents similar to it. On other