[ISSN: 2231-4946]
Development of Data Leakage Detection Using Data Allocation Strategies
Rudragouda G Patil
Dept of CSE, The Oxford College of Engg, Bangalore. patilrudrag@gmail.com

Abstract-A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). If the data distributed to third parties is found in a public or private domain, identifying the guilty party is a nontrivial task for the distributor. Traditionally, data leakage is handled by watermarking, which requires modifying the data. If the watermarked copy is found at some unauthorized site, the distributor can claim ownership. To overcome the disadvantages of watermarking [2], data allocation strategies are used to improve the probability of identifying guilty third parties. In this project, we implement and analyze a guilt model that detects guilty agents using allocation strategies without modifying the original data. A guilty agent is one who leaks a portion of the distributed data. The idea is to distribute the data intelligently to agents, based on sample data requests and explicit data requests, in order to improve the chance of detecting guilty agents. The algorithms implemented using fake objects improve the distributor's chance of detecting guilty agents. It is observed that minimizing the sum objective increases the chance of detecting guilty agents. We also developed a framework for generating fake objects.

Keywords - sensitive data; fake objects; data allocation strategies

I. INTRODUCTION

In the course of doing business, sensitive data must sometimes be handed over to supposedly trusted third parties. For example, a hospital may give patient records to researchers who will devise new treatments. Similarly, a company may have partnerships with other companies that require sharing customer data. We call owner of the data,
References:
[1] P. Papadimitriou and H. Garcia-Molina, "Data leakage detection," IEEE Transactions on Knowledge and Data Engineering, vol. 23, pp. 51-63, 2011.
[2] S. Czerwinski, R. Fromm, and T. Hodes, "Digital music distribution and audio watermarking."
[3] L. Sweeney, "Achieving k-anonymity privacy protection using generalization and suppression," 2002.
[4] S. U. Nabar, B. Marthi, K. Kenthapadi, N. Mishra, and R. Motwani, "Towards robustness in query auditing," in VLDB '06.

Hence, there are different allocations. In every allocation, the distributor can permute the objects of T and keep the same chances of guilty agent detection. The reason is that the guilt probability depends only on which agents have received the leaked objects and not on the identity of the leaked objects. Therefore, from the distributor's perspective there are different allocations. An object allocation that satisfies requests and ignores the distributor's objective is to give each agent a unique subset of T of size m.

The s-max algorithm allocates to an agent the data record that yields the minimum increase of the maximum relative overlap among any pair of agents. Following the notation of [1], R_i denotes the set of objects already allocated to agent U_i, t_k denotes an object of T, and m_i denotes the number of objects requested by U_i. The s-max algorithm is as follows.

Step 1: Initialize min_overlap ← 1, the minimum out of the maximum relative overlaps that the allocations of different objects to U_i yield.
Step 2: for k ∈ {k' | t_k' ∉ R_i} do
    Initialize max_rel_ov ← 0, the maximum relative overlap between R_i and any set R_j that the allocation of t_k to U_i yields.
Step 3: for all j = 1,..., n : j ≠ i and t_k ∈ R_j do
    Calculate absolute overlap as abs_ov ← |R_i ∩ R_j| + 1
    Calculate relative overlap as rel_ov ← abs_ov / min(m_i, m_j)
Step 4: Find maximum relative overlap as max_rel_ov ← MAX(max_rel_ov, rel_ov)
    If max_rel_ov ≤ min_overlap then
        min_overlap ← max_rel_ov
        ret_k ← k
    Return ret_k

It can be shown that algorithm s-max is optimal for the sum-objective and the max-objective in problems where M ≤ |T| and n < |T|. It is also optimal for the max-objective if |T| ≤ M ≤ 2|T| or if all agents request data of the same size.
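The object-selection step of s-max described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes that objects are identified by integer indices, that the inner loop ranges over the other agents (j different from i) that already hold the candidate object, and that the relative overlap is normalized by the smaller of the two agents' request sizes. The names `select_object_s_max`, `allocations`, `request_sizes`, and `num_objects`, and the `-1` sentinel, are hypothetical and do not appear in the paper.

```python
from typing import List, Set


def select_object_s_max(i: int,
                        allocations: List[Set[int]],
                        request_sizes: List[int],
                        num_objects: int) -> int:
    """Pick the object whose allocation to agent i yields the smallest
    maximum relative overlap between agent i and any other agent.

    Returns -1 (a hypothetical sentinel) if agent i already holds every object.
    """
    min_overlap = 1.0   # best (smallest) maximum relative overlap found so far
    ret_k = -1

    for k in range(num_objects):
        if k in allocations[i]:
            continue    # only consider objects agent i does not hold yet

        max_rel_ov = 0.0
        for j in range(len(allocations)):
            # Only other agents that already hold object k gain overlap with i.
            if j == i or k not in allocations[j]:
                continue
            # Overlap between i and j after object k is given to i.
            abs_ov = len(allocations[i] & allocations[j]) + 1
            rel_ov = abs_ov / min(request_sizes[i], request_sizes[j])
            max_rel_ov = max(max_rel_ov, rel_ov)

        # "<=" mirrors the pseudocode: ties go to the later candidate.
        if max_rel_ov <= min_overlap:
            min_overlap = max_rel_ov
            ret_k = k

    return ret_k


if __name__ == "__main__":
    # Agent 0 holds object 0, agent 1 holds objects 0 and 1, agent 2 holds nothing.
    current = [{0}, {0, 1}, set()]
    sizes = [2, 2, 2]
    print(select_object_s_max(0, current, sizes, num_objects=4))
```

In the demo, giving object 1 to agent 0 would make its set identical to that of agent 1, so the selection prefers an object that no other agent holds yet. A full allocation would call this selection repeatedly until every agent's request size is met.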
It is observed that the relative performance of the algorithms and the main conclusions do not change. If p approaches 0, it becomes easier to find guilty agents and the performance of the algorithms converges. On the other hand, if p approaches 1, the relative differences among the algorithms grow, since more evidence is needed to find an agent guilty.

The algorithms presented implement a variety of data distribution strategies that can improve the distributor's chances of identifying a leaker. It is shown that distributing objects judiciously can make a significant difference in identifying guilty agents, especially in cases where there is large overlap in the