Post Graduate Programme – Term IV – AY 2012-13
Business Intelligence And Data Mining
Group Assignment on NGO Donations Maximization
Abstract
The problem is to devise a strategy that maximizes the profit from a direct marketing campaign to a selected group of customers while minimizing costs. The exercise requires the use of Business Intelligence tools and techniques to build a model that is trained and tested on historical data from last year's donation-raising campaign. The model should predict the profitability of a prospective donor, allowing a more targeted campaign at lower cost. The difficulty lies in the extremely imbalanced data and in the inverse correlation between the probability of response and the dollar amount it generates. The data set and problem come from the KDD-CUP-98 challenge. The solution would be applicable to any direct marketing campaign for which historical data is available.
Introduction
The KDD-CUP-98 challenge is to create a model, trained and tested on historical data, that predicts which potential donors to contact so as to maximise profit. The model yields a mailing list that targets only valuable customers. Existing models typically predict only future response behaviour. The historical database records past mailing campaigns, the customers' responses, and the dollar amounts collected. The model should identify the current customers who are likely to respond and maximise the net profit (donation amount minus mailing cost) summed over the contacted customers.
The records come from the 1997 Paralyzed Veterans of America fundraising mailing campaign, and only 5% of them are responders. A classifier that labels every record as a non-responder therefore reaches 95% accuracy without identifying a single donor, so accuracy is a misleading measure here.
One approach ranks customers by their estimated probability of responding and selects the top portion of the list. If, for example, the top 5% of the list contains 30% of the responders, the lift is 30/5 = 6. The drawback is that the donation amount is not used. In this data there is an inverse correlation between the probability of donating and the dollar amount, as donors who give larger amounts are more cautious. Probability-based ranking therefore tends to rank valuable customers low. Another method adapts accuracy-based learning to cost-sensitive learning in order to minimise cost, but because it builds the initial list from the probability of response and considers profitability only afterwards, it tends to ignore valuable customers, who usually donate infrequently.
A tailored use of association rules leads to better results than the methods above. It identifies subsets of attributes that are correlated with the "respond" class and then uses a small subset of the generated association rules to identify potential customers in the current campaign.
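The accuracy and lift figures above can be checked with a minimal Python sketch. The 200-customer toy list, the scores, and the function names baseline_accuracy and lift_at are illustrative assumptions; they are not taken from the KDD-CUP-98 data or from any library.

```python
def baseline_accuracy(responded):
    """Accuracy of a classifier that labels every customer 'not respond'."""
    return sum(1 for r in responded if not r) / len(responded)


def lift_at(scores, responded, fraction):
    """Share of all responders found in the top `fraction` of the ranking,
    divided by `fraction` (the share a random selection would capture)."""
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(fraction * len(ranked)))
    captured = sum(1 for _, r in ranked[:k] if r)
    total = sum(1 for r in responded if r)
    return (captured / total) / (k / len(ranked))


# Toy list (assumption): 200 customers, 10 responders (5%).
# The 10 highest-scored customers contain 3 of the 10 responders.
responded = [True] * 3 + [False] * 7 + [True] * 7 + [False] * 183
scores = [1.0 - i / 200 for i in range(200)]  # strictly decreasing scores

accuracy = baseline_accuracy(responded)
lift = lift_at(scores, responded, 0.05)
print(f"baseline accuracy: {accuracy:.2f}")
print(f"lift at top 5%: {lift:.1f}")
```

On this toy list the always-negative classifier scores 95% accuracy while finding no donors, and the top 5% of the ranking holds 30% of the responders, which is the lift of 6 described in the text. Neither figure says anything about how much those responders donate.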
The solution tries to increase customer value by selecting association rules that raise profitability over the current customers. Negative association rules may also indicate that, given certain attributes, a customer is unlikely to donate. On their own, however, association rules do not say how to maximise an objective function, especially when there is an inverse correlation.
The dataset has 191,779 records of customers contacted in the 1997 mailing campaign. Each record has 479 non-target variables and two target variables: one indicating respond / not_respond and one giving the actual donation in dollars. About 5% of the records are respond records, and the dataset is split 50% for learning and 50% for validation. Customers are to be evaluated and predicted on the basis of a mailing cost of $0.68 per customer.
The inverse correlation can take two forms. It can occur across offers to the same customer, which can be reduced by avoiding multiple mailings within a time period. It can also occur across different customers, meaning many small contributors and few large ones. The second type has to be addressed. One option is a two-step procedure: obtain probability estimates from decision trees, then re-rank the customers using customer value. This still ignores the value in the first step.
The other problem is high dimensionality: with 481 variables and a small target population, it is difficult to identify the features of the respond class. The one-attribute-at-a-time "gain criterion" does not search for correlated variables. It works well for maximising class probability, but not when a non-maximum class probability is also used to rank customers.
Focussed association rules capture features that are typical of the respond class and not of the not_respond class, i.e. subsets of variables that occur in the respond class but only infrequently in the not_respond class. This allows the not_respond data to be pruned, which addresses the scarcity of data in the target class, and removes variables that are frequent in the not_respond class.
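The idea of keeping only attribute subsets that are frequent among responders and rare among non-responders can be sketched as follows. This is a brute-force illustration under stated assumptions, not the rule-generation algorithm used in the KDD-CUP-98 literature: the attribute names, the toy records, the thresholds min_sup and max_sup, and the function names support and focused_itemsets are all hypothetical.

```python
from itertools import combinations


def support(itemset, records):
    """Fraction of records (each a set of items) containing every item in `itemset`."""
    if not records:
        return 0.0
    return sum(1 for rec in records if itemset <= rec) / len(records)


def focused_itemsets(respond, not_respond, min_sup, max_sup, max_size=2):
    """Keep itemsets with support >= min_sup in the respond class
    and support <= max_sup in the not_respond class."""
    items = sorted(set().union(*respond))  # only items seen among responders
    kept = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s_respond = support(itemset, respond)
            s_not = support(itemset, not_respond)
            if s_respond >= min_sup and s_not <= max_sup:
                kept.append((combo, s_respond, s_not))
    return kept


# Toy records (assumption): each customer is a set of attribute=value items.
respond = [
    {"age=senior", "prior_gift=high"},
    {"age=senior", "prior_gift=high", "region=west"},
    {"age=senior", "region=east"},
]
not_respond = [
    {"age=senior", "region=east"},
    {"age=young", "region=west"},
    {"age=young", "prior_gift=low"},
    {"age=senior", "prior_gift=low"},
]

rules = focused_itemsets(respond, not_respond, min_sup=0.6, max_sup=0.25)
for combo, s_respond, s_not in rules:
    print(combo, round(s_respond, 2), round(s_not, 2))
```

In the toy data, age=senior is dropped even though every responder has it, because half of the non-responders have it too. Exhaustive enumeration like this would not scale to 481 variables; it is shown only to make the selection criterion explicit.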
The focussed association rules can then be converted into a model that predicts the donation amount for a customer: customers are covered using the rules, over-fitting rules are pruned, and a donation amount is estimated for each rule. The assumption is that current customers follow the same class and donation distribution as the historical records. The method has three steps: Rule Generation finds a set of good rules that capture the features of responders; Model Building combines the rules into a prediction model for the donation amount; and Model Pruning removes rules that do not generalise to the entire population.
Our Approach