The objective of this study is to build a classification model to predict red wine quality based on given physicochmeical characteristics. The dataset used for this analysis includes 1599 red wine samples with 11 physicochemical attributes like alcohol or pH, and 1 sensory attribute called quality. More details on the dataset can be referred to the original paper Cortez et al., 2009.
In EDA phase, data is mainly explored by graphical visualizations assisted with descriptive statistics to help understand the distribution of univariate as well as the the association relationships between variables. EDA discovery will lay the foundation to help select important attributes to build red wine quality prediction model.
To begin with, let’s load the Red Wine csv file and preview the dataset.
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
It’s also necessary to check the bottom of the dataset to make sure the format consistency and be aware of situations like comment lines at the end of a dataset.
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1594 1594 6.8 0.620 0.08 1.9
## 1595 1595 6.2 0.600 0.08 2.0
## 1596 1596 5.9 0.550 0.10 2.2
## 1597 1597 6.3 0.510 0.13 2.3
## 1598 1598 5.9 0.645 0.12 2.0
## 1599 1599 6.0 0.310 0.47 3.6
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1594 0.068 28 38 0.99651 3.42
## 1595 0.090 32 44 0.99490 3.45
## 1596 0.062 39 51 0.99512 3.52
## 1597 0.076 29 40 0.99574 3.42
## 1598 0.075 32 44 0.99547 3.57
## 1599 0.067 18 42 0.99549 3.39
## sulphates alcohol quality
## 1594 0.82 9.5 6
## 1595 0.58 10.5 5
## 1596 0.76 11.2 6
## 1597 0.75 11.0 6
## 1598 0.71 10.2 5
## 1599 0.66 11.0 6
The data is formated consistently and ready for analysis. Now let’s check overall structure.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Observations:
The focus of this section is to understand the distribution characteristics for each variable, which would help to reveal data central tendency, extreme outliers as well as any needs for data transformation.
We start with creating a bar plot for the dependent variable quality.
Quality in our samples ranges from 3 to 8, and the majority samples have quality score 5 or 6. Noticed that the absence of “Excellent:10” or “Very bad:0” scores probably indicates there is no outliers in terms of red wine quality, although it is also possible due to that people tend to avoid giving extreme judgements.
Let’s also see the statistic summary.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Observations:
Therefore, the next step is to create a new categorical variable “level” to represent wine quality in 3 levels to reduce the perception noise:
“High”: quality score 7 or 8
The categorical variable level will be our new dependent variable. Instead of predicting the exact quality score, our goal is to predict the quality level given certain physicochemcial attributes.
The next step is to check the independent variable group.
Independent variable group includes 11 attributes and all of them are continuous numerical variables. To better visualize the central tendency as well as detect the presence of any outliers, it would be helpful to overlay the key statistics like median and outliers on distribution plots. Since there are 11 of them, we will write functions for outlier computation as well as for univariate distribution plotting.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
There are 3 red dash lines imposed to histogram-density plot to mark the lower-bound outlier, median and upper-bound outlier respectively.
Fixed acidity falling within a range from 4.6 to 15.9 is close to Gaussian distribution. There are a few samples above upper-bound outlier line but no special handling is required for this attribute.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Volatile acidity seems to have bimodal distribution, as the kernel density plot suggested. This attribute might be a promising predictor since two modes might be signal for different classes in our new dependent variable level. There are also a few upper-bound outliers observed, but no special outlier handling is required.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Citric acid data have three peaks clearly observed in both histogram and density smooth line. This attribute also seems to be a promising predictor.
For this attribute, data ranges from 0 to 1, sample mean is very close to median, and data has no outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Residual sugar has a positively skewd distribution with noticable outliers.
## Upper-bound Outlier is: 3.95
As we can see, the upper-bound outlier is about 4, which is far less than the maximum value 15.5 in sample data. The The presence of outliers might affect both assumptions and results of logistic regression model, since outliers could add huge penalty when residual error is squared.
There are a couple of ways to handle outliers:
Both Square root and log transformations can pull in large numbers, so let’s compare them both:
Although outliers can still be observed after both transformation, Log10 transformation provides a better performance–the kernel density plot is close to bell shape, while the distribution produced by square root transformationdistribution still has a relatively obvious tail on the right. So for further analysis, log10 transformation can be applied if residual sugar is selected as predictor.
At this point, we can use log scale to better view the original data range:
From the above log-scaled plot, we can see that data within range 1 to 4 is very close to normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Chlorides distribution is even more positively skewed than residual sugar: the maximum value goes up to 0.6, which is about 60 times larger than the 3rd quartile value 0.09. Similarly, to pull in large numbers, we can apply log10 transformation if this varaible is selected as predictor. For now, we just use log-scaling to better view the distribution.
## Upper-bound Outlier is: 0.129
After transformation, we can see the x-axis label 10^(-1) is in the middle of median and upper bound outlier. And it is also easy to see that the majority data falls within 0.04 to 0.129 (upper-bound outlier), with the median value 0.08.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Free sulfur dioxide is slightly positive skewed, which the maximum value 72 is about 3.5 times of third quartile.For this variable, outlier handling is not needed since the scale can capture the majority of sample data.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The distribution shape for total sulfur dioxide is very similar to free sulfur dioxide, which makes sense since free sulfur dioxide is a subset of total sulfur dioxide. However the ratio of maximum value to 3rd quartile (4.7) is greater than the same ratio observed in free.sulfur dioxide, which also makes sense since the subset only contributes part of the variance. Now let’s compare their variance:
## var.dta.free.sulfur.dioxide. var.dta.total.sulfur.dioxide.
## 1 109.4149 1082.102
Not surprised, total sulfur dioxide does have larger variance than free sulfur dioxide. For total sulfur dioxide, there is no need for transformation since the scale can still depict the majority trend.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
Density looks highly like a normal distribution, with data evenly distributed two sides from median. It is worth noticed that the data range for this variable is very small–from 0.99 to 1. The majority samples with density less than 1 makes sense since the density of ethanol is less than water. One possible cause for wine denser than water could be high sugar level since the density of sucrose is about 1.59 g/cm3. We can further compare relationship between density and residual sugar in bivariate section.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
All wine samples with pH values lie on the acidic side of pH spectrum: from 2.7 to 4. The pH distribution is another one very close to Gaussian/normal bell shape.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
For sulphates, we can see a light tail on the positive side. Data ranges from 0.33 to 2, and the ratio of maximum value to 3rd quartile is about 3, also indicating the slight positive skewed trend, but no need for outlier handling.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
For alcohol, the majority of data within range of 8.8 to 12.5. We can observe the two peaks revealed by the above plot, which might be signal for another promising predictor.
Observations for univariate exploratory:
According to source data description, some variables might be correlated. Therefore it is necessary to their correlation, and if strong correlation observed between independent variable pairs, we must carefully select predictors to reduce the multicolinearity issue.
Since correlation essentially is estimated by linear regression, which is very sensitive to outliers, we need apply log transformation to residual sugar and chlorides.
Though we can use ployserial correlation to quantify the relathionship between continous variable and dichotomous variable, quality instead of level will be used to compute correlation matrix for the below reasons:
Therefore, we will first re-organize the dataset, and then check the correlation. To better view the strength of association patterns, we will customize a correlation heatmap:
According to Evans (1996), correlation value r can be interpreted as below:
Now let’s look at the moderate or strong correlated pairs only:
From above heat map, we can see there are 9 pairs show moderate or strong association patterns in our dataset. Along the y-axis label, let’s summarize the association patterns as below:
Now let’s visualize these correlated pairs.
The first group is fixed.acidity correlated variables:
Among these three pairs, we can infer their possible causal relationship:
Second, let’s check volatile.acidity correlation:
For this particular pair, there should not be any causal effect since citric.acid is non-volatile organic acids. The observed correlation is possible due to the ratio competiton between volatile.acidity and non-volatile acidity.
The 3rd pair is citric.acid and pH:
It is clear that the higher ciric.acid contribute to lower pH value.
The 4th pair is free.sulfur.dioxide vesus total.sulfur.dioxide:
The correlation between free.sulfur.dioxide and total.sulfur.dioxide is not surprising, since one is the subset of the other one.
Next, let’s check the 5th variable density. Other than correlated to fixed.acidity, density is also correlated alcohol and log-transformed residual.sugar:
In the univariate section, we saw some wine samples denser than water, and one of our hypothesis is that those samples with high sulcose content. From the fixed.acidity correlation analysis, we also found density is positive correlated to fixed.acidity. Now let’s find out which one plays a major role in causing denser wine.
Wine samples with density greater than 1 have been defined as Dense Wine, and those with density less or equal than 1 have been defined as Regular Wine.
Observations:
In short, residual.sugar is more likely to be the cause of denser wine.
Next, let’s look at the last pair: alcohol and quality. Among all 11 physicochemical attributes, alcohol is the only attribute show moderate or above correlation with wine quality, which might indicates its significance as a predictor for wine quality.
From y-axis, we can actually see points are roughly grouped at 3 discrete level: quality score 5, 6, and 7.
Let’s add one more comparison by computing the polyserial correlation between alcohol vesus target categorical variable level:
## Correlation Alcohol vs Quality 0.4761663
## Correlation Alcohol vs Level: 0.4856328
## Correlation Quality vs Level: 1
Here we see the two correlations are very close, and the polyserial correlation is even a bit higher.
Let’s also check all correlation r values for dependent variable quality:
From the above graph, we saw that variables positively correlated to quality including:
Let’s check whether the positive correlation observed to quality persist for target response variable level.
The positive trend holds true for variable level.
Now let’s check those variables with negative correlation to wine quality:
## Negative Correlation Variables: volatile.acidity free.sulfur.dioxide total.sulfur.dioxide density pH log_chlorides level
In this group, we can see only volatile.acidity, pH and log_chlorides show negative correlation to level. The other three variables seem no particular trend with respect to target response variable.
Now we can re-organzie the positive, negative and unclear trend groups respectively:
The above 5 physicochemcial attrbutes shown positive correlation to wine quality level can be considered as candidate predictors. But citric.acid and fixed.acidity are strong correlated, so we need to further investigate how are they influencing wine quality level in multivariate section.
Although the lower pH seems contributing higher quality wine, we should not include pH as a predictor, since we know the pH value is an overall measure of acidity, which has neen already represented by fixed.acidity and citric.acid. So in this negative correlated group, only volatile.acidity and log_chloridesto will be considred as condidate predicotors.
The above 3 physicochemcial attrbutes will not be included for further analysis since there’s no clear correlation to wine quality level.
So far, we shrinked down predict candidates from 11 to 7:
Out of these 7 variables, we still need to figure out the multicolinearity among citric.acid, fixed.acidity and voaltile.acidity. So in the next multivariate analysis section, we will isolate them in pairs and see how are they influencing our target response variable wine quality level.
Now we have seven candidate predictors, and three of them seem to have multicolinearity. Therefore the purpose of this section is to figure out how one variable is influenced by another, and most importantly how wine quality level is influenced by their association.
Let’s start with our first pair of correlated candidates: fixed.acidity ~ citric.acid.
As we can see, at same fixed.acidity level, the higher the citri.acid components, the higher the wine quality level.
Now we acknowledged that citric.acid does play an important role elevating wine quality level, but how does it compared to alcohol, which is the only variable observed moderate positive correlation to wine quality.
The trend is not very clear, probably because the medium alcohol in low level wine and medium level wine is too close, and also the large variance observed in medium level wine samples.
So instead of plotting citric.acid against each alcohol value, let’s review this pair by dividing alcohol into the below 4 categories:
Now it is very clear that at each given range of alcohol, wine with higher citric.acid is tend to be evaluated as higher quality.
I wonder if at each given range of citric.acid, would higher alcohol be equaly likely influencing quality level. Let’s divide citric.acid to 4 categories:
Here we see two patterns:
Next, let’s move to the second correlated candidates: citric.acid versus volatile.acidity. Although the correlation was found between volatile.acidity and citric.acid, we know citric.acid itself is a kind of non-voaltile acidity, so the observed correlation is likely due to the composition competition between fixed.acidity and volatile.acidity.
Therefore, let’s check the how wine quality level responds to the composition competition between volatile.acidity and fixed.acidity.
Observations:
Up till now, we are able to identify 6 predictors, in which no significant colinearity observed, and each represents different dimension of wine characteristics.
In next section, we are going to build a prediction model based on these 6 predictors:
Since our target response variable level is categorical data, we need to select muli-class logistic regression model.
The dataset for prediction model should include 6 predictors and 1 target variable.
We first need to split dataset into training and test subsets. Training data for prediction model training, and test data for model performance evaluation. Random sampling will applied within dependent variable wine quality level for total 1599 samples, which will result in 70% training and 30% test data.
For this study, Random Forest algorithm is selected due to its advantage on both accuracy and efficiency.
The first step is to train our model by feeding the training set.
##
## Call:
## randomForest(formula = level ~ citric.acid + log_chlorides + sulphates + volatile.acidity + log_residual.sugar + alcohol, data = candi_train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 13.29%
## Confusion matrix:
## Low Medium High class.error
## Low 3 40 2 0.93333333
## Medium 3 891 30 0.03571429
## High 0 74 78 0.48684211
The overall error rate estimated on training dataset is about 14%.
Now let’s check the prediction performance on test dataset.
## Predicted
## Observed Low Medium High
## Low 1 17 0
## Medium 0 380 15
## High 0 32 33
## Accuracy of Prediction Model is: 0.8661088
The prediction model achieved 88% accuracy rate.
In the univariate plot section, we proposed that volatile.acidity, alcohol and citric.acid would be potential predictors based on observed two or three peaks on historgram.
In the bivariate plot section, boxplots also revealed that citric.acid, sulphates, log_residual.sugar and alcohol all show positive correlation with wine quality level.
Now let’s visualize the rank of all 6 attributes based on the variable importance.
Our previous assumptions on alcohol, volatile.acidity, citric.acid are ranked as No.1, No.2, No.4 respectively. The most important variable alcohol is positively correlated to wine quality level, and the second important variable volatile.acidity is negatively correlated to wine quality level.
By apply random forest algorithm, we achieved 88% accuracy on predicting red wine quality level by 6 physicochemical attributes.
The process that helped us to narrow down the 6 predictors from 11 variable is mainly by EDA phase. The below section includes three final plots developed during EDA phase.
Final Plot 1: The reason to keep this plot is because we can see 3 peaks from histogram and 2 mode from kerney density. Together it suggested volatile.acidity might be a good predictor for wine quality level, and this has been further confirmed as we can see it is the 2nd most important variable in the Random Forest model. Another reason is that the statistic outliers are overlaied in a compact and informative way.
Final Plot 2: this plot is selected because it reveals that residual.sugar is probably the cause of dense wine. In regular wine group, fixed.acidity show very strong positive correlation to density, but the change of regression slope suggested that fixed.acidity is not the main driver causing wine denser than water. For future study topics, t-test can be selected to see whether the difference is statistically significant.
Final Plot 3: this plot is selected because it revealed how wine quality is impacted by the confluence of alcohol and citric.acid.
At first attempt, we plotted citric.acid against each alcohol value, but we found no clear trend because the medium alcohol in low level wine and medium level wine is too close, and also the large variance observed in medium level wine samples.
Then after we dividing alcohol into the below 4 categories:
It is very clear that at each given range of alcohol, wine with higher citric.acid is tend to be evaluated as higher quality.
In this study, we applied random forest algorithm to predict wine quality level. The methodology involves two major phases: EDA phase and prediction model development phase.
EDA phase: it is a highly iterative process and this is also the part I spent most time with. One lesson’s learned is that in EDA phase, it is very tempting to going on and on, but it is also very important to know when to stop because real projects will have time and monetary budget. Fortunately at last this EDA phase reached a relative satisfactory status.
In prediction model section, idealy, we should divide dateset into training, validation and test subsets. By doing so, we can train a couple of prediction models by fitting training data, and then compare error rate in validation dataset, at last apply the model with best performance in validation dataset to the test dataset. In this study, the validation process is not included.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Correlation matrix: An R function to do all you need (http://www.sthda.com/english/wiki/correlation-matrix-an-r-function-to-do-all-you-need#at_pco=smlre-1.0&at_si=589272c605773292&at_ab=per-2&at_pos=3&at_tot=4)
Source code for Variable Importance Plot (https://www.kaggle.com/mrisdal/titanic/exploring-survival-on-the-titanic)
Evans, J. D. (1996). Straightforward statistics for the behavioral sciences. Pacific Grove, CA: Brooks/Cole Publishing.
Correlation Coefficients: Describing Relationships (https://www.rasch.org/rmt/rmt193c.htm)
ggplot2: Quick correlation matrix heatmap (http://www.sthda.com/english/wiki/ggplot2-quick-correlation-matrix-heatmap-r-software-and-data-visualization)