1 Introduction

The objective of this study is to build a classification model to predict red wine quality based on given physicochmeical characteristics. The dataset used for this analysis includes 1599 red wine samples with 11 physicochemical attributes like alcohol or pH, and 1 sensory attribute called quality. More details on the dataset can be referred to the original paper Cortez et al., 2009.

2 Exploratory Data Analysis (EDA)

In EDA phase, data is mainly explored by graphical visualizations assisted with descriptive statistics to help understand the distribution of univariate as well as the the association relationships between variables. EDA discovery will lay the foundation to help select important attributes to build red wine quality prediction model.

To begin with, let’s load the Red Wine csv file and preview the dataset.

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

It’s also necessary to check the bottom of the dataset to make sure the format consistency and be aware of situations like comment lines at the end of a dataset.

##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1594 1594           6.8            0.620        0.08            1.9
## 1595 1595           6.2            0.600        0.08            2.0
## 1596 1596           5.9            0.550        0.10            2.2
## 1597 1597           6.3            0.510        0.13            2.3
## 1598 1598           5.9            0.645        0.12            2.0
## 1599 1599           6.0            0.310        0.47            3.6
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 1594     0.068                  28                   38 0.99651 3.42
## 1595     0.090                  32                   44 0.99490 3.45
## 1596     0.062                  39                   51 0.99512 3.52
## 1597     0.076                  29                   40 0.99574 3.42
## 1598     0.075                  32                   44 0.99547 3.57
## 1599     0.067                  18                   42 0.99549 3.39
##      sulphates alcohol quality
## 1594      0.82     9.5       6
## 1595      0.58    10.5       5
## 1596      0.76    11.2       6
## 1597      0.75    11.0       6
## 1598      0.71    10.2       5
## 1599      0.66    11.0       6

The data is formated consistently and ready for analysis. Now let’s check overall structure.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Observations:

Sample size is 1599, with 13 variables, which all are numerical type.
There are 11 variables with continuous numerical value, which are physicochemical attributes of wine samples.
There are 2 variables represented by integers:
- X: representing the index of sample identifier, which will not be included for further analysis.
- quality: according to source description, quality data is developed based on the median value of sensory evaluation given by at least three wine experts. Wine experts judged those samples using the rule that “0 means very bad and 10 means excellent”, which means the larger the quality number, the higher the wine quality. This is also the dependent variable of interest and the next step is to look at how the quality distributed in our sample data.

2.1 Univariate Plot & Analysis

The focus of this section is to understand the distribution characteristics for each variable, which would help to reveal data central tendency, extreme outliers as well as any needs for data transformation.

2.1.1 Dependent Variable

We start with creating a bar plot for the dependent variable quality.

Quality in our samples ranges from 3 to 8, and the majority samples have quality score 5 or 6. Noticed that the absence of “Excellent:10” or “Very bad:0” scores probably indicates there is no outliers in terms of red wine quality, although it is also possible due to that people tend to avoid giving extreme judgements.

Let’s also see the statistic summary.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Observations:

Quality score 5 and 6 also mark the 1st and 3rd quartile of data respectively.
- 25% of samples with quality score below 5
- 75% of samples with quality score above 6.
The quality score is created by taking the median of at least three evaluations. This measure can be considered as a method to counter the personal bias in terms of flavor preferences. Therefore, we can assume the quality score is unbiased and score is directly related wine quality.
However, it might be safe to interpret that wine samples with quality score 8 are much better than those with quality score 3 , it is hard to differenciate whether the differnce in score 4 group and score 3 group is due to quality variation or due to human interpretation on number categories. For example, one expert might use 8 to represent high quality sensory experience with one sample, and another expert might think 7 already represent high quality wine. In other words, though we know the higher the better, the variance may come from human perception differences on those number categories.
Therefore, the next step is to create a new categorical variable “level” to represent wine quality in 3 levels to reduce the perception noise:
“Low”: quality score 3 or 4
“Medium”: quality score 5 or 6
“High”: quality score 7 or 8

The categorical variable level will be our new dependent variable. Instead of predicting the exact quality score, our goal is to predict the quality level given certain physicochemcial attributes.

The next step is to check the independent variable group.

2.1.2 Independent Variables

Independent variable group includes 11 attributes and all of them are continuous numerical variables. To better visualize the central tendency as well as detect the presence of any outliers, it would be helpful to overlay the key statistics like median and outliers on distribution plots. Since there are 11 of them, we will write functions for outlier computation as well as for univariate distribution plotting.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

There are 3 red dash lines imposed to histogram-density plot to mark the lower-bound outlier, median and upper-bound outlier respectively.

Fixed acidity falling within a range from 4.6 to 15.9 is close to Gaussian distribution. There are a few samples above upper-bound outlier line but no special handling is required for this attribute.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Volatile acidity seems to have bimodal distribution, as the kernel density plot suggested. This attribute might be a promising predictor since two modes might be signal for different classes in our new dependent variable level. There are also a few upper-bound outliers observed, but no special outlier handling is required.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Citric acid data have three peaks clearly observed in both histogram and density smooth line. This attribute also seems to be a promising predictor.

For this attribute, data ranges from 0 to 1, sample mean is very close to median, and data has no outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Residual sugar has a positively skewd distribution with noticable outliers.

## Upper-bound Outlier is: 3.95

As we can see, the upper-bound outlier is about 4, which is far less than the maximum value 15.5 in sample data. The The presence of outliers might affect both assumptions and results of logistic regression model, since outliers could add huge penalty when residual error is squared.

There are a couple of ways to handle outliers:

remove outliers from original dataset: not adopted since we are not sure whether the outliers are part of data characteristics or due to incorrect collection process.
data transformation:
- scale transformation for better view of central tendency
- actual transformaiton to variables: this step might be needed if we decide to use this variable for prediction model.

Both Square root and log transformations can pull in large numbers, so let’s compare them both:

Although outliers can still be observed after both transformation, Log10 transformation provides a better performance–the kernel density plot is close to bell shape, while the distribution produced by square root transformationdistribution still has a relatively obvious tail on the right. So for further analysis, log10 transformation can be applied if residual sugar is selected as predictor.

At this point, we can use log scale to better view the original data range:

From the above log-scaled plot, we can see that data within range 1 to 4 is very close to normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Chlorides distribution is even more positively skewed than residual sugar: the maximum value goes up to 0.6, which is about 60 times larger than the 3rd quartile value 0.09. Similarly, to pull in large numbers, we can apply log10 transformation if this varaible is selected as predictor. For now, we just use log-scaling to better view the distribution.

## Upper-bound Outlier is: 0.129

After transformation, we can see the x-axis label 10^(-1) is in the middle of median and upper bound outlier. And it is also easy to see that the majority data falls within 0.04 to 0.129 (upper-bound outlier), with the median value 0.08.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Free sulfur dioxide is slightly positive skewed, which the maximum value 72 is about 3.5 times of third quartile.For this variable, outlier handling is not needed since the scale can capture the majority of sample data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The distribution shape for total sulfur dioxide is very similar to free sulfur dioxide, which makes sense since free sulfur dioxide is a subset of total sulfur dioxide. However the ratio of maximum value to 3rd quartile (4.7) is greater than the same ratio observed in free.sulfur dioxide, which also makes sense since the subset only contributes part of the variance. Now let’s compare their variance:

##   var.dta.free.sulfur.dioxide. var.dta.total.sulfur.dioxide.
## 1                     109.4149                      1082.102

Not surprised, total sulfur dioxide does have larger variance than free sulfur dioxide. For total sulfur dioxide, there is no need for transformation since the scale can still depict the majority trend.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Density looks highly like a normal distribution, with data evenly distributed two sides from median. It is worth noticed that the data range for this variable is very small–from 0.99 to 1. The majority samples with density less than 1 makes sense since the density of ethanol is less than water. One possible cause for wine denser than water could be high sugar level since the density of sucrose is about 1.59 g/cm³. We can further compare relationship between density and residual sugar in bivariate section.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

All wine samples with pH values lie on the acidic side of pH spectrum: from 2.7 to 4. The pH distribution is another one very close to Gaussian/normal bell shape.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

For sulphates, we can see a light tail on the positive side. Data ranges from 0.33 to 2, and the ratio of maximum value to 3rd quartile is about 3, also indicating the slight positive skewed trend, but no need for outlier handling.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

For alcohol, the majority of data within range of 8.8 to 12.5. We can observe the two peaks revealed by the above plot, which might be signal for another promising predictor.

Observations for univariate exploratory:

There are 2 attributes normallly distributed: density and pH.
- density : there are samples with density greater than 1. In bivariate section, we will further investigate the cause by examing how density is related to residual sugar.
the other 9 attributes are more or less positively skewed:
- residual.sugar and chlorides are highly skewed to the right, log-scalling is applied to both variables to better view the central tendency.
- Volatile acidity, citric.acidity and alcohol seem to be promissing predictor since there are more than one peaks observed on histogram or kernel density plots. This kind of characteristic might be signal for different categories of new response variable level.

2.2 Bivariate Plot & Analysis

According to source data description, some variables might be correlated. Therefore it is necessary to their correlation, and if strong correlation observed between independent variable pairs, we must carefully select predictors to reduce the multicolinearity issue.

Since correlation essentially is estimated by linear regression, which is very sensitive to outliers, we need apply log transformation to residual sugar and chlorides.

Though we can use ployserial correlation to quantify the relathionship between continous variable and dichotomous variable, quality instead of level will be used to compute correlation matrix for the below reasons:

Variable level is derived from variable quality, and intunitively we are interested in whether increasing a certain attribute would contribute to increasing or decreasing in wine quality response. Therefore, if correlation relationship is observed between an independent variable and original dependent variable quality, we can extend the conclusion to level too.
For simpliticity and consistency, correlation matrix for all numerical type is relative easy to read and to interpret.

Therefore, we will first re-organize the dataset, and then check the correlation. To better view the strength of association patterns, we will customize a correlation heatmap:

According to Evans (1996), correlation value r can be interpreted as below:

.00-.19 “very weak”
.20-.39 “weak”
.40-.59 “moderate”
.60-.79 “strong”
.80-1.0 “very strong”

Now let’s look at the moderate or strong correlated pairs only:

From above heat map, we can see there are 9 pairs show moderate or strong association patterns in our dataset. Along the y-axis label, let’s summarize the association patterns as below:

fixed.acidity is correlated to
- citric.acid
- density
- pH
volatile.acidity is correlated to
- citric.acid
citric.acid is correlated to
- pH
free.sulfur.dioxide is correlated to
- total.sulfur.dioxide
density is correlated to
- alcohol
- log_residual.sugar
alcohol is correlated to
- quality

Now let’s visualize these correlated pairs.

The first group is fixed.acidity correlated variables:

Among these three pairs, we can infer their possible causal relationship:

Citric.acid is one of the contributer to fixed.acidity. So it’s likely that the increase of citric.acid caused the increase of fixed.acidity, not the other way around.
It is more likely that fixied.acidity contributing to denser wine, instead of density affecting fixed.acidity.
The causal direction is clear between fixed.acidity and pH: the higher the fixed.acidity, the lower pH value.

Second, let’s check volatile.acidity correlation:

For this particular pair, there should not be any causal effect since citric.acid is non-volatile organic acids. The observed correlation is possible due to the ratio competiton between volatile.acidity and non-volatile acidity.

The 3rd pair is citric.acid and pH:

It is clear that the higher ciric.acid contribute to lower pH value.

The 4th pair is free.sulfur.dioxide vesus total.sulfur.dioxide:

The correlation between free.sulfur.dioxide and total.sulfur.dioxide is not surprising, since one is the subset of the other one.

Next, let’s check the 5th variable density. Other than correlated to fixed.acidity, density is also correlated alcohol and log-transformed residual.sugar:

In the univariate section, we saw some wine samples denser than water, and one of our hypothesis is that those samples with high sulcose content. From the fixed.acidity correlation analysis, we also found density is positive correlated to fixed.acidity. Now let’s find out which one plays a major role in causing denser wine.

Wine samples with density greater than 1 have been defined as Dense Wine, and those with density less or equal than 1 have been defined as Regular Wine.

Observations:

The relationship between residual.sugar and density is consistent in both Regular Wine and Dense Wine groups, as we can see the slope of esimated fitting lines are almost parallel.
Fixed.acidity does show very strong positive correlation to regular wine, but the trend flattend in dense wine group, which indicates this variable is not the main driver causing dense wine.

In short, residual.sugar is more likely to be the cause of denser wine.

Next, let’s look at the last pair: alcohol and quality. Among all 11 physicochemical attributes, alcohol is the only attribute show moderate or above correlation with wine quality, which might indicates its significance as a predictor for wine quality.

From y-axis, we can actually see points are roughly grouped at 3 discrete level: quality score 5, 6, and 7.

Let’s add one more comparison by computing the polyserial correlation between alcohol vesus target categorical variable level:

## Correlation Alcohol vs Quality  0.4761663

## Correlation Alcohol vs Level: 0.4856328

## Correlation Quality vs Level: 1

Here we see the two correlations are very close, and the polyserial correlation is even a bit higher.

Let’s also check all correlation r values for dependent variable quality:

From the above graph, we saw that variables positively correlated to quality including:

log_reisidual.sugar
alcohol
sulphates
citric.acid
fixed.acidity

Let’s check whether the positive correlation observed to quality persist for target response variable level.

The positive trend holds true for variable level.

Now let’s check those variables with negative correlation to wine quality:

## Negative Correlation Variables: volatile.acidity free.sulfur.dioxide total.sulfur.dioxide density pH log_chlorides level

In this group, we can see only volatile.acidity, pH and log_chlorides show negative correlation to level. The other three variables seem no particular trend with respect to target response variable.

Now we can re-organzie the positive, negative and unclear trend groups respectively:

The above 5 physicochemcial attrbutes shown positive correlation to wine quality level can be considered as candidate predictors. But citric.acid and fixed.acidity are strong correlated, so we need to further investigate how are they influencing wine quality level in multivariate section.

Although the lower pH seems contributing higher quality wine, we should not include pH as a predictor, since we know the pH value is an overall measure of acidity, which has neen already represented by fixed.acidity and citric.acid. So in this negative correlated group, only volatile.acidity and log_chloridesto will be considred as condidate predicotors.

The above 3 physicochemcial attrbutes will not be included for further analysis since there’s no clear correlation to wine quality level.

So far, we shrinked down predict candidates from 11 to 7:

log_residual.sugar
alcohol
sulphates
citric.acid
fixed.acidity
volatile.acidity
log_chlorides

Out of these 7 variables, we still need to figure out the multicolinearity among citric.acid, fixed.acidity and voaltile.acidity. So in the next multivariate analysis section, we will isolate them in pairs and see how are they influencing our target response variable wine quality level.

2.3 Multi-variate Plot & Analysis

Now we have seven candidate predictors, and three of them seem to have multicolinearity. Therefore the purpose of this section is to figure out how one variable is influenced by another, and most importantly how wine quality level is influenced by their association.

Let’s start with our first pair of correlated candidates: fixed.acidity ~ citric.acid.

As we can see, at same fixed.acidity level, the higher the citri.acid components, the higher the wine quality level.

Now we acknowledged that citric.acid does play an important role elevating wine quality level, but how does it compared to alcohol, which is the only variable observed moderate positive correlation to wine quality.

The trend is not very clear, probably because the medium alcohol in low level wine and medium level wine is too close, and also the large variance observed in medium level wine samples.

So instead of plotting citric.acid against each alcohol value, let’s review this pair by dividing alcohol into the below 4 categories:

Below 30 percentile: Low Alcohol
30 to 60 percentile: Medium Alcohol
60 to 90 percentile: High Alcohol
Above to 90 percentile: Very High Alcohol

Now it is very clear that at each given range of alcohol, wine with higher citric.acid is tend to be evaluated as higher quality.

I wonder if at each given range of citric.acid, would higher alcohol be equaly likely influencing quality level. Let’s divide citric.acid to 4 categories:

Below 30 percentile: Low Citric
30 to 60 percentile: Medium Citric
60 to 90 percentile: High Citric
Above to 90 percentile: Very High Citric

Here we see two patterns:

In High and Very High Citric groups, wine samples with higher alcohol content are also tend to be evaluated as high quality wine.
While in Low and Medium Citric groups, high quality wine does have high alcohol content, but for medium or low quality wine, alcohol content doesn’t seem to be a decisive factor.

Next, let’s move to the second correlated candidates: citric.acid versus volatile.acidity. Although the correlation was found between volatile.acidity and citric.acid, we know citric.acid itself is a kind of non-voaltile acidity, so the observed correlation is likely due to the composition competition between fixed.acidity and volatile.acidity.

Therefore, let’s check the how wine quality level responds to the composition competition between volatile.acidity and fixed.acidity.

Observations:

The regression lines for for Medium and High level wine are almost parallel, which indicates that in wine making process, the ratio control for volatile.acidity and fixed.acidity is relatively similar and fairly constant.
The regression line Low level wine with a slope steeper than the other two lines probably indicates that the issue causing low level wine is because of the inbalance ratio between volatile.acidity and fixed.acidity, or more preciesely, too much volatile.acidity and not sufficient fixed.acidity to create good sensory experience.
In terms of predictor selection, volatile.acidity should be kept not only because at given level of fixed.acidity, volatile.acidity show consistent influence for different levels of wine samples, but also because volatile providies another dimension of acidity which will provide complementary information for prediction model.

Up till now, we are able to identify 6 predictors, in which no significant colinearity observed, and each represents different dimension of wine characteristics.

In next section, we are going to build a prediction model based on these 6 predictors:

log_residual.sugar
alcohol
sulphates
citric.acid
volatile.acidity
log_chlorides

3 Prediction Model

Since our target response variable level is categorical data, we need to select muli-class logistic regression model.

The dataset for prediction model should include 6 predictors and 1 target variable.

3.1 Data Partition: Training & Test

We first need to split dataset into training and test subsets. Training data for prediction model training, and test data for model performance evaluation. Random sampling will applied within dependent variable wine quality level for total 1599 samples, which will result in 70% training and 30% test data.

3.2 Random Forest Prediction Model

For this study, Random Forest algorithm is selected due to its advantage on both accuracy and efficiency.

The first step is to train our model by feeding the training set.

## 
## Call:
##  randomForest(formula = level ~ citric.acid + log_chlorides +      sulphates + volatile.acidity + log_residual.sugar + alcohol,      data = candi_train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 13.29%
## Confusion matrix:
##        Low Medium High class.error
## Low      3     40    2  0.93333333
## Medium   3    891   30  0.03571429
## High     0     74   78  0.48684211

The overall error rate estimated on training dataset is about 14%.

3.3 Model Performance

Now let’s check the prediction performance on test dataset.

##         Predicted
## Observed Low Medium High
##   Low      1     17    0
##   Medium   0    380   15
##   High     0     32   33

## Accuracy of Prediction Model is: 0.8661088

The prediction model achieved 88% accuracy rate.

In the univariate plot section, we proposed that volatile.acidity, alcohol and citric.acid would be potential predictors based on observed two or three peaks on historgram.

In the bivariate plot section, boxplots also revealed that citric.acid, sulphates, log_residual.sugar and alcohol all show positive correlation with wine quality level.

Now let’s visualize the rank of all 6 attributes based on the variable importance.

Our previous assumptions on alcohol, volatile.acidity, citric.acid are ranked as No.1, No.2, No.4 respectively. The most important variable alcohol is positively correlated to wine quality level, and the second important variable volatile.acidity is negatively correlated to wine quality level.

4 Summary

By apply random forest algorithm, we achieved 88% accuracy on predicting red wine quality level by 6 physicochemical attributes.

The process that helped us to narrow down the 6 predictors from 11 variable is mainly by EDA phase. The below section includes three final plots developed during EDA phase.

4.1 Final Plots

Final Plot 1: The reason to keep this plot is because we can see 3 peaks from histogram and 2 mode from kerney density. Together it suggested volatile.acidity might be a good predictor for wine quality level, and this has been further confirmed as we can see it is the 2nd most important variable in the Random Forest model. Another reason is that the statistic outliers are overlaied in a compact and informative way.

Final Plot 2: this plot is selected because it reveals that residual.sugar is probably the cause of dense wine. In regular wine group, fixed.acidity show very strong positive correlation to density, but the change of regression slope suggested that fixed.acidity is not the main driver causing wine denser than water. For future study topics, t-test can be selected to see whether the difference is statistically significant.

Final Plot 3: this plot is selected because it revealed how wine quality is impacted by the confluence of alcohol and citric.acid.

At first attempt, we plotted citric.acid against each alcohol value, but we found no clear trend because the medium alcohol in low level wine and medium level wine is too close, and also the large variance observed in medium level wine samples.

Then after we dividing alcohol into the below 4 categories:

Below 30 percentile: Low Alcohol
30 to 60 percentile: Medium Alcohol
60 to 90 percentile: High Alcohol
Above to 90 percentile: Very High Alcohol

It is very clear that at each given range of alcohol, wine with higher citric.acid is tend to be evaluated as higher quality.

4.2 Reflection

In this study, we applied random forest algorithm to predict wine quality level. The methodology involves two major phases: EDA phase and prediction model development phase.

EDA phase: it is a highly iterative process and this is also the part I spent most time with. One lesson’s learned is that in EDA phase, it is very tempting to going on and on, but it is also very important to know when to stop because real projects will have time and monetary budget. Fortunately at last this EDA phase reached a relative satisfactory status.
Prediction Model Development: there are a few limitations in this phase.
In bivariate section, the major step to locate our candidate predictors is to select the ones show clear positive or negative trend by plotting variable versus target response variable level. Even though we see the median in boxplots show clear trend, we didn’t actually perform statistic tests to validate whether the differences are statistic significant.
When we created the heat map, we used variable quality for simplicity. One thing we could do is to computer polyserial correlation between each independent variable against target response variable level, then use that together with pearson correlations computed among independent variables to create a heatmap.
In prediction model section, idealy, we should divide dateset into training, validation and test subsets. By doing so, we can train a couple of prediction models by fitting training data, and then compare error rate in validation dataset, at last apply the model with best performance in validation dataset to the test dataset. In this study, the validation process is not included.

5 Reference

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Correlation matrix: An R function to do all you need (http://www.sthda.com/english/wiki/correlation-matrix-an-r-function-to-do-all-you-need#at_pco=smlre-1.0&at_si=589272c605773292&at_ab=per-2&at_pos=3&at_tot=4)
Source code for Variable Importance Plot (https://www.kaggle.com/mrisdal/titanic/exploring-survival-on-the-titanic)
Evans, J. D. (1996). Straightforward statistics for the behavioral sciences. Pacific Grove, CA: Brooks/Cole Publishing.
Correlation Coefficients: Describing Relationships (https://www.rasch.org/rmt/rmt193c.htm)
ggplot2: Quick correlation matrix heatmap (http://www.sthda.com/english/wiki/ggplot2-quick-correlation-matrix-heatmap-r-software-and-data-visualization)

Red Wine Quality Prediction

Wen Yang

February 20, 2017

1 Introduction

2 Exploratory Data Analysis (EDA)

2.1 Univariate Plot & Analysis

2.1.1 Dependent Variable

2.1.2 Independent Variables

2.2 Bivariate Plot & Analysis

2.3 Multi-variate Plot & Analysis

3 Prediction Model

3.1 Data Partition: Training & Test

3.2 Random Forest Prediction Model

3.3 Model Performance

4 Summary

4.1 Final Plots

4.2 Reflection

5 Reference