Wine was once viewed as a luxury good, but it is now enjoyed by an increasingly wide range of consumers. Prices vary considerably with quality, so when wine sellers buy from wine makers, they need to understand wine quality, which is affected to some degree by chemical attributes. When sellers receive wine samples, how accurately they can classify or predict quality makes a difference to their profits. Our goal is therefore to model wine quality from physicochemical tests and give wine sellers a reference for selecting high-, moderate-, and low-quality wines.
We downloaded the wine quality data set, which consists of white vinho verde wine samples from the north of Portugal, from the UC Irvine Machine Learning Repository. This white wine data set includes 4898 observations and 12 variables. Quality is the dependent variable, and the other 11 attributes (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol) are independent variables.
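As a rough illustration of this layout, the sketch below reads a semicolon-delimited file with the 12 columns listed above and separates the 11 predictors from quality. It is not the code used in this analysis. The function name `load_wine`, the constant names, and the two sample rows are made up for the example, and the semicolon delimiter is an assumption about how the repository file is stored.

```python
import csv
import io
from typing import List, TextIO, Tuple

# The 11 independent variables and the dependent variable named in the text.
PREDICTORS = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]
TARGET = "quality"


def load_wine(handle: TextIO) -> Tuple[List[List[float]], List[int]]:
    """Split a semicolon-delimited wine file into predictor rows and quality scores."""
    reader = csv.DictReader(handle, delimiter=";")
    features: List[List[float]] = []
    quality: List[int] = []
    for record in reader:
        features.append([float(record[name]) for name in PREDICTORS])
        quality.append(int(record[TARGET]))
    return features, quality


# Two made-up rows in the same layout, standing in for the real file.
HEADER = ";".join(f'"{name}"' for name in PREDICTORS + [TARGET])
SAMPLE = (
    HEADER + "\n"
    "6.8;0.25;0.32;5.1;0.040;30;120;0.9930;3.20;0.50;10.4;6\n"
    "7.4;0.41;0.28;1.8;0.052;18;150;0.9952;3.05;0.44;9.2;5\n"
)

features, quality = load_wine(io.StringIO(SAMPLE))
print(len(features), "rows,", len(features[0]), "predictors, quality =", quality)
```

With the downloaded file on disk, the same function could be called on a handle opened with `open(path, newline="")`, where `path` is wherever the file was saved.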
Technical summary
1. Data pre-process
The first step in analyzing the data is to pre-process it. First, inspecting the data, we found several outliers and removed them. Next, we found that the independent variables are all numerical and that some take values in a narrow range (density, for example, ranges from 0.98 to 1.02), so in the initial analysis we decided not to bin them. We also examined the correlations between variables; because our main aim is prediction, we did not remove any variables even though some are correlated. Overall, the only change we made to the data set was to remove several outliers.
2. Preliminary Models
We use three models for classification and prediction: multiple linear regression, a classification tree, and a neural network.
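For the first of these models, multiple linear regression, the fit reduces to ordinary least squares. The sketch below is illustrative only and is not the code used in this analysis. It solves the normal equations in plain Python so that it runs without extra libraries. The function names `fit_ols` and `predict`, the choice of two predictors, and the toy data are all assumptions made for the example.

```python
from typing import List, Sequence


def fit_ols(rows: Sequence[Sequence[float]], targets: Sequence[float]) -> List[float]:
    """Return [intercept, b1, ..., bk] minimising the sum of squared errors.

    Solves the normal equations (X^T X) b = X^T y by Gauss-Jordan elimination
    with partial pivoting. Adequate for a handful of predictors; a numerical
    library is the better choice for real work.
    """
    design = [[1.0, *row] for row in rows]  # prepend the intercept column
    k = len(design[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(design, targets))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, k), key=lambda idx: abs(aug[idx][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Clear this column in every other row.
        for row_idx in range(k):
            if row_idx != col:
                factor = aug[row_idx][col] / aug[col][col]
                aug[row_idx] = [a - factor * b for a, b in zip(aug[row_idx], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict(coefs: Sequence[float], row: Sequence[float]) -> float:
    """Apply fitted coefficients to one observation."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))


# Toy data generated from quality = 2 + 0.5*alcohol - 3*volatile_acidity.
# The numbers are illustrative, not taken from the wine data set.
X = [[9.0, 0.30], [10.5, 0.25], [11.0, 0.40], [12.5, 0.20], [9.5, 0.50]]
y = [2 + 0.5 * alcohol - 3 * acidity for alcohol, acidity in X]

coefs = fit_ols(X, y)
print([round(c, 6) for c in coefs])
```

Because the toy targets are an exact linear function of the two predictors, the fitted coefficients should match the generating ones up to floating-point error. On real data the residuals would not vanish, and correlated predictors make the normal equations poorly conditioned, which is one reason a library solver is preferable in practice.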
2.1 Multiple linear regressions
Based on