White Wine Quality Exploration by Tamara Makarova

In this analysis I consider a set of observations on a number of white wine varieties, covering their chemical properties and their ranking by tasters. Wine quality assessment is quite subjective, and therefore it is interesting to know whether there are any significant relations between objective tests (factors like acidity, pH level, presence of sugar and other chemical properties) and subjective quality scores. The results can be interesting and helpful both for wine makers and for wine lovers.

Univariate Plots Section

Dataset dimensions and variable descriptions

## [1] 4898   13
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##                      vars    n   mean    sd  min    max  range   se   IQR  Q0.25  Q0.75
## fixed.acidity           1 4898   6.85  0.84 3.80  14.20  10.40 0.01  1.00   6.30   7.30
## volatile.acidity        2 4898   0.28  0.10 0.08   1.10   1.02 0.00  0.11   0.21   0.32
## citric.acid             3 4898   0.33  0.12 0.00   1.66   1.66 0.00  0.12   0.27   0.39
## residual.sugar          4 4898   6.39  5.07 0.60  65.80  65.20 0.07  8.20   1.70   9.90
## chlorides               5 4898   0.05  0.02 0.01   0.35   0.34 0.00  0.01   0.04   0.05
## free.sulfur.dioxide     6 4898  35.31 17.01 2.00 289.00 287.00 0.24 23.00  23.00  46.00
## total.sulfur.dioxide    7 4898 138.36 42.50 9.00 440.00 431.00 0.61 59.00 108.00 167.00
## density                 8 4898   0.99  0.00 0.99   1.04   0.05 0.00  0.00   0.99   1.00
## pH                      9 4898   3.19  0.15 2.72   3.82   1.10 0.00  0.19   3.09   3.28
## sulphates              10 4898   0.49  0.11 0.22   1.08   0.86 0.00  0.14   0.41   0.55
## alcohol                11 4898  10.51  1.23 8.00  14.20   6.20 0.02  1.90   9.50  11.40

Following the rule that outliers lie outside the interval (Q1 - 1.5 IQR, Q3 + 1.5 IQR), where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1, I printed out the left and right outlier bounds (outl.lbound and outl.rbound) along with the maximum distance to the bounds measured in mean values (outl.rbound.dist and outl.lbound.dist).

##                       min    max outl.lbound outl.rbound outl.rbound.dist outl.lbound.dist
## fixed.acidity        3.80  14.20        4.80        8.80             1.23            -0.15
## volatile.acidity     0.08   1.10        0.04        0.49             3.40             0.13
## citric.acid          0.00   1.66        0.09        0.57             4.34            -0.27
## residual.sugar       0.60  65.80      -10.60       22.20            10.67             1.75
## chlorides            0.01   0.35        0.01        0.07             6.93            -0.13
## free.sulfur.dioxide  2.00 289.00      -11.50       80.50             7.86             0.38
## total.sulfur.dioxide 9.00 440.00       19.50      255.50             2.61            -0.08
## density              0.99   1.04        0.99        1.00             0.05             0.00
## pH                   2.72   3.82        2.80        3.56             0.26            -0.03
## sulphates            0.22   1.08        0.20        0.76             1.51             0.04
## alcohol              8.00  14.20        6.65       14.25             0.54             0.13

Most outliers lie on the right tail of the distribution. Variables such as residual.sugar, free.sulfur.dioxide, chlorides and citric.acid have outliers which can have a significant effect on the analysis and therefore should be taken into account. In the next sections I will consider the variable distributions in more detail.
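
The outlier bounds in the table above can be reproduced from the quartiles alone. Below is a minimal sketch in R, using quartiles copied from the summary table; tukey_fences is a hypothetical helper name chosen for illustration, not a function from the original analysis.

```r
# Tukey's rule: values outside (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR) are flagged.
# tukey_fences() is a hypothetical helper, not part of the original report.
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(outl.lbound = q1 - k * iqr, outl.rbound = q3 + k * iqr)
}

# Quartiles (Q0.25, Q0.75) copied from the summary table above.
fences <- rbind(
  fixed.acidity  = tukey_fences(6.3, 7.3),
  residual.sugar = tukey_fences(1.7, 9.9),
  alcohol        = tukey_fences(9.5, 11.4)
)
print(fences)

# For a raw numeric vector x, the quartiles could be taken from
# quantile(x, c(0.25, 0.75)) and passed to tukey_fences() in the same way.
```

A negative lower bound, as for residual.sugar, simply means that no observation can be a left outlier for that variable.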

Wine quality

Wine quality is defined as a score between 1 and 10, where 1 is very poor and 10 is excellent.

In some cases I will use the quality variable as an integer and calculate, for example, average or median quality, but in other cases I would rather treat quality levels as classes. For this purpose I will add a factor variable quality.cls. The distribution of the new factor variable is presented below.

##      3      4      5       6       7       8      9     
## num  "20"   "163"  "1457"  "2198"  "880"   "175"  "5"   
## perc "0.4%" "3.3%" "29.7%" "44.9%" "18.0%" "3.6%" "0.1%"

Most wines have average quality; only a few have poor or excellent quality. For further analysis I also added a new variable quality.cls.agg, which merges levels 3 and 4 into level 1-4 and levels 8 and 9 into level 8-10. The distribution of this new variable is more balanced than the distribution of the original variable.

##      1-4    5       6       7       8-10  
## num  "183"  "1457"  "2198"  "880"   "180" 
## perc "3.7%" "29.7%" "44.9%" "18.0%" "3.7%"
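
One possible way to build the two class variables is sketched below. The quality vector is rebuilt from the class counts printed above, so the block runs without the data file; the use of cut() with these break points is an assumption about how the aggregation could be done, not necessarily the code used in the report.

```r
# Rebuild the quality scores from the class counts reported above.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# One ordered level per observed score.
quality.cls <- factor(quality, ordered = TRUE)

# Merge the sparse tails: scores up to 4 become "1-4", scores above 7 become "8-10".
quality.cls.agg <- cut(
  quality,
  breaks = c(0, 4, 5, 6, 7, 10),
  labels = c("1-4", "5", "6", "7", "8-10"),
  ordered_result = TRUE
)

counts <- table(quality.cls.agg)
print(counts)
print(round(100 * prop.table(counts), 1))
```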

Acidity

Acids are major wine constituents and contribute greatly to its taste. Traditionally total acidity is divided into two groups, namely the volatile acids and the nonvolatile or fixed acids. Most acids involved with wine are fixed or nonvolatile (they do not evaporate readily).

##               mean Q0.25 Q0.75 min  max outl.lbound outl.rbound outl.rbound.dist outl.lbound.dist
## fixed.acidity 6.85   6.3   7.3 3.8 14.2         4.8         8.8             1.23            -0.15

Fixed acidity is measured in g/dm^3. Distribution of fixed.acidity is quite symmetric although there are a few outliers on the right tail.

##                  mean Q0.25 Q0.75  min max outl.lbound outl.rbound outl.rbound.dist outl.lbound.dist
## volatile.acidity 0.28  0.21  0.32 0.08 1.1        0.04        0.49              3.4             0.13

Volatile acidity refers to the steam distillable acids present in wine, primarily acetic acid but also lactic, formic, butyric, and propionic acids. Too high a level of acetic acid in wine can lead to an unpleasant vinegar taste. Acetic acid bacteria require oxygen to grow; therefore, eliminating air from wine containers and adding sulfur dioxide will limit their growth. Volatile acidity is measured in g/dm^3.

Distribution of volatile.acidity is slightly right skewed with some outliers on the right tail.

##             mean Q0.25 Q0.75 min  max outl.lbound outl.rbound outl.rbound.dist outl.lbound.dist
## citric.acid 0.33  0.27  0.39   0 1.66        0.09        0.57             4.34            -0.27

Citric acid, found in small quantities, can add ‘freshness’ and flavor to wines. It is measured in g/dm^3.

Distribution of citric.acid is symmetric with a few outliers on the right tail. Some outliers are significantly higher than most values: the distance to the right outlier bound is greater than 4 mean values. There is a strange peak at 0.49.

##    mean Q0.25 Q0.75  min  max outl.lbound outl.rbound outl.rbound.dist outl.lbound.dist
## pH 3.19  3.09  3.28 2.72 3.82         2.8        3.56             0.26            -0.03

pH is another measure of acidity. It is related to an acid’s strength in solution and is measured on a logarithmic scale.

Distribution of pH is symmetric with only a few outliers, which are quite close to the outlier bounds.

Alcohol

##          mean Q0.25 Q0.75 min  max outl.lbound outl.rbound outl.rbound.dist outl.lbound.dist
## alcohol 10.51   9.5  11.4   8 14.2        6.65       14.25             0.54             0.13

Alcohol refers to the percent alcohol content of the wine. Distribution of alcohol is right skewed without significant outliers.

Residual Sugar

##                mean Q0.25 Q0.75 min  max outl.lbound outl.rbound outl.rbound.dist outl.lbound.dist
## residual.sugar 6.39   1.7   9.9 0.6 65.8       -10.6        22.2            10.67             1.75

Residual sugar refers to the amount of sugar remaining after fermentation stops; it is measured in g/dm^3. Distribution of residual sugar is not symmetric and has a few significant outliers on the right tail. The distance to some outliers (from the right outlier bound) is greater than 10 means.

After log transformation it is evident that the distribution of residual.sugar is bimodal. One peak is located around 1.5 and the other one around 10; the boundary between them can be assumed to lie near 3. These two peaks probably correspond to different wine types: dry and sweet wines.

Sulfur Dioxide

##                      mean Q0.25 Q0.75 min max outl.lbound outl.rbound outl.rbound.dist outl.lbound.dist
## free.sulfur.dioxide 35.31    23    46   2 289       -11.5        80.5             7.86             0.38

Free sulfur dioxide is the free form of SO2, which exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion. It prevents microbial growth and the oxidation of wine. It is measured in mg/dm^3. Distribution of free.sulfur.dioxide is slightly right skewed with outliers on the right tail. Some outliers are significantly higher than most values: the distance to the right outlier bound is more than 7 means.

##                        mean Q0.25 Q0.75 min max outl.lbound outl.rbound outl.rbound.dist outl.lbound.dist
## total.sulfur.dioxide 138.36   108   167   9 440        19.5       255.5             2.61            -0.08

Total sulfur dioxide refers to the amount of free and bound forms of SO2; it is measured in mg/dm^3. Distribution of total.sulfur.dioxide is quite symmetric with a few single outliers on the right tail.

Chlorides, Sulphates, etc.

##           mean Q0.25 Q0.75  min  max outl.lbound outl.rbound outl.rbound.dist outl.lbound.dist
## chlorides 0.05  0.04  0.05 0.01 0.35        0.01        0.07             6.93            -0.13

Chlorides refers to the amount of salt in the wine; it is measured in g/dm^3. Distribution of chlorides has outliers on the right tail and is right skewed. Log transformed data is shown in the next plot.

##           mean Q0.25 Q0.75  min  max outl.lbound outl.rbound outl.rbound.dist outl.lbound.dist
## sulphates 0.49  0.41  0.55 0.22 1.08         0.2        0.76             1.51             0.04

Sulphate is a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant; it is measured in g/dm^3. Distribution of sulphates is slightly right skewed.

##         mean Q0.25 Q0.75  min  max outl.lbound outl.rbound outl.rbound.dist outl.lbound.dist
## density 0.99  0.99     1 0.99 1.04        0.99           1             0.05                0

Density depends on the percent alcohol and sugar content; it is measured in g/cm^3. Distribution of density is symmetric with a few single outliers.

Univariate Analysis

What is the structure of your dataset?

The dataset contains various characteristics of white wines. It has 4898 observations (wines) and 13 variables. Among these 13 variables one (X) is the wine id number, 11 variables refer to the chemical properties of each wine and one variable (quality) gives wine quality as a rating between 1 (very bad) and 10 (excellent). Wine quality was defined as the median of at least 3 evaluations made by wine experts.

What is/are the main feature(s) of interest in your dataset?

Wine quality will be the main and the target feature in further analysis. The main goal of the analysis is to understand which particular chemical properties or their combinations influence wine quality. Common sense suggests that the right balance of acidity, alcohol and sweetness makes a good wine. In further analysis I will focus mostly on these three factors, each of which is represented by one or more features in the dataset.

Acidity

  • fixed.acidity
  • volatile.acidity
  • citric.acid
  • pH

Alcohol

  • alcohol
  • density

Sweetness

  • residual.sugar

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The amount of salt (chlorides) in a wine can also have an interesting effect on wine taste, especially for sweeter wines. Another group of variables, sulphates, free.sulfur.dioxide and total.sulfur.dioxide, can also be interesting to consider. A moderate amount of free sulfur dioxide can prevent the growth of volatile acidity. Sulfur dioxide is responsible for the words “contains sulfites” found on wine labels, and its presence in wine is widely discussed.

Did you create any new variables from existing variables in the dataset?

Wine quality is a score from 1 to 10, where 1 means very poor and 10 means excellent. The variable quality initially had integer type. To work with quality estimates as discrete classes I created the factor variable quality.cls.
Wine quality classes are very unbalanced, i.e. there are many more average wines than excellent or poor ones. For further analysis a new variable quality.cls.agg was introduced. It has 5 levels: “1-4”, “5”, “6”, “7” and “8-10”.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Distribution of alcohol is right skewed, so a logarithmic transformation can be useful for further analysis. A log transformation makes a right skewed distribution closer to normal and therefore easier to analyze. Many models and tests work properly only with close to normal distributions.

Distribution of residual.sugar is bimodal, which is especially evident after log transformation. The two peaks probably correspond to different types of wines: dry and sweet wines.

A lot of variables have outliers, mostly on the right tail of the distribution. Variables such as residual.sugar, free.sulfur.dioxide, chlorides and citric.acid have outliers which can have a significant effect on the analysis and therefore should be taken into account.

Bivariate Plots Section

Correlation matrix as a heat map

Density is related to alcohol and sugar concentration, therefore the correlation of density with residual.sugar and with alcohol is high. Alcohol has a negative correlation with residual.sugar (~ -0.5).

Free.sulfur.dioxide, being a part of total.sulfur.dioxide, has a high correlation with it (~ 0.6).

Moderate correlation also exists between fixed.acidity and pH (~ -0.4), total.sulfur.dioxide and residual.sugar (~ 0.4), alcohol and chlorides (~ -0.4), alcohol and total.sulfur.dioxide (~ -0.4).

Alcohol has the highest correlation with the quality variable.

Relation between Features

The relation between residual sugar and other features is considered separately for dry wines (residual.sugar < 3) and sweet wines (3 < residual.sugar < 25).

Residual sugar and pH

Correlation coefficient and line coefficients

  • pH vs Residual Sugar (dry wines)
## [1] 0.04243425
##    (Intercept) residual.sugar 
##     3.19477549     0.01339249
  • pH vs Residual Sugar (sweet wines)
## [1] -0.1760285
##    (Intercept) residual.sugar 
##    3.230840959   -0.006446805

For dry wines pH seems to stay the same for all residual.sugar levels, although dispersion is higher for medium values. For sweet wines there is a negative relation between the amount of residual sugar and pH. Sweeter wines probably need more acids (lower pH) to balance the taste. Correlation analysis supports this observation: for sweet wines the correlation between residual sugar and pH is stronger than for dry wines and has a negative sign.
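
The correlation coefficient and the line coefficients for each subset can be obtained with cor() and lm(). The sketch below uses only the first ten rows of the dataset (copied from the structure listing at the top of the report), so it illustrates the mechanics and does not reproduce the values reported above; cor_and_line is a hypothetical helper name.

```r
# First ten rows of the dataset, copied from the structure listing.
# Ten rows only illustrate the mechanics; they do not reproduce the reported values.
wines10 <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22)
)

# Split at the 3 g/dm^3 bound used in the report.
dry   <- subset(wines10, residual.sugar < 3)
sweet <- subset(wines10, residual.sugar > 3 & residual.sugar < 25)

# cor_and_line() is a hypothetical helper: Pearson correlation plus the
# intercept and slope of the least-squares line y ~ x.
cor_and_line <- function(data, x, y) {
  fit <- lm(reformulate(x, response = y), data = data)
  list(cor = cor(data[[x]], data[[y]]), line = coef(fit))
}

# The dry subset is handled the same way.
res.sweet <- cor_and_line(sweet, "residual.sugar", "pH")
print(res.sweet)
```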

Residual sugar and alcohol

Correlation coefficient and line coefficients

  • Alcohol vs Residual Sugar (dry wines)
## [1] 0.2065043
##    (Intercept) residual.sugar 
##     10.2410377      0.4535757
  • Alcohol vs Residual Sugar (sweet wines)
## [1] -0.4626705
##    (Intercept) residual.sugar 
##     11.5531577     -0.1423637

For dry wines the amount of alcohol increases with residual sugar, while for sweet wines the relation is the opposite: alcohol decreases as residual sugar increases. Correlation analysis shows similar results. For dry wines the correlation is nearly 0.2; for sweet wines the correlation is stronger, with a coefficient of -0.46.

The evident decrease of alcohol for sweeter wines can be explained by the wine making process itself. During fermentation yeast consumes sugar and produces ethanol (alcohol) as a by-product. In a dry wine the yeast consumes almost all the sugar, while in a sweet wine the fermentation is stopped (usually by chilling) before all the sugar is consumed. This is why some sweet wines have less alcohol than dry wines.

Volatile.acidity and Free Sulfur Dioxide

Correlation coefficient and line coefficients

## [1] -0.1961609
##                                 (Intercept) I(free.sulfur.dioxide/total.sulfur.dioxide) 
##                                   0.3319993                                  -0.2103405

In the graph above volatile.acidity is plotted against the ratio free.sulfur.dioxide/total.sulfur.dioxide. As the share of free sulfur dioxide grows, the level of volatile.acidity goes down. Free sulfur dioxide serves as an antimicrobial agent and antioxidant, protecting wine from spoilage by bacteria and oxidation. Its antimicrobial action also helps to minimize volatile acidity.
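
The coefficient name in the output above, I(free.sulfur.dioxide/total.sulfur.dioxide), indicates that the ratio was computed inside the model formula. A minimal sketch of that pattern follows, again on the first ten rows from the structure listing only, so the numbers will differ from the reported ones.

```r
# First ten rows of the dataset, copied from the structure listing.
wines10 <- data.frame(
  volatile.acidity     = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# Inside a formula "/" is a formula operator; I() forces arithmetic division.
fit <- lm(volatile.acidity ~ I(free.sulfur.dioxide/total.sulfur.dioxide),
          data = wines10)

# The same ratio computed explicitly, for the correlation coefficient.
so2.ratio <- with(wines10, free.sulfur.dioxide / total.sulfur.dioxide)
print(cor(so2.ratio, wines10$volatile.acidity))
print(coef(fit))
```

Without I(), the formula a ~ b/c would be read as a nesting specification rather than as a regression on the quotient.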

Alcohol and pH

It is interesting to note that low alcohol wines have a lower pH level. For low alcohol values pH increases with alcohol; above an alcohol level of about 10% pH stays roughly the same, with a slight increase for high alcohol levels.

Influence on Wine Quality

Next I will consider relation between wine quality and other factors. For better visualization outliers will be excluded.

## wines$quality.cls.agg: 1-4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.40   10.10   10.17   10.80   13.50 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 8-10
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.65   12.60   14.00

Correlation Coefficient:

## [1] 0.4355747

Higher quality wines seem to contain more alcohol. The correlation coefficient (about 0.44) is moderate, but it is the highest among the correlations with quality.

## wines$quality.cls.agg: 1-4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.350   2.700   4.821   7.500  17.550 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   7.000   7.335  11.500  23.500 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.700   5.300   6.442   9.900  65.800 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.700   3.650   5.186   7.325  19.250 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 8-10
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   2.075   4.300   5.628   8.150  14.800

Correlation Coefficient:

## [1] -0.10066

Considering the influence of residual sugar, I notice that high quality and low quality wines contain less sugar than medium quality wines.

## wines$quality.cls.agg: 1-4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.110   0.260   0.320   0.376   0.460   1.100 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.240   0.280   0.302   0.340   0.905 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2000  0.2500  0.2606  0.3000  0.9650 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.1900  0.2500  0.2628  0.3200  0.7600 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 8-10
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.120   0.200   0.260   0.278   0.330   0.660

Correlation Coefficient:

## [1] -0.194723

Volatile acidity of low quality wines is higher in comparison with other quality groups.

## wines$quality.cls.agg: 1-4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   18.00   26.63   33.50  289.00 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   22.00   35.00   36.43   50.00  131.00 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   24.00   34.00   35.65   46.00  112.00 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   25.00   33.00   34.13   41.00  108.00 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 8-10
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   28.00   34.50   36.63   44.25  105.00

Correlation Coefficient:

## [1] 0.03571914

Amount of free.sulfur.dioxide is significantly lower for low quality wines.

Density has an influence on wine quality, as we also saw in the previous plots for alcohol and residual sugar. Chlorides and citric acid do not seem to have a significant effect on wine quality.

As for pH level, the boxplot did not show any significant relation. On the other hand, I can assume that the relation between pH level and wine quality is not linear: a low pH level as well as a high pH level can be a reason for poor wine taste. In the next plot the average quality is first calculated for different pH levels, then the main trend is shown with a solid blue line.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Alcohol and volatile.acidity seem to have the strongest influence on wine quality. According to the dataset, low quality wines have a lower alcohol level. High volatile.acidity corresponds to low quality wines as well. This is a well known problem for winemakers: a high level of volatile acidity gives wine a vinegar taste. In turn, free sulfur dioxide, being an antimicrobial agent and antioxidant, reduces volatile.acidity, and therefore a lack of it also corresponds to low quality wines.

As for residual sugar, high quality and low quality wines contain less sugar than medium quality wines. High quality wines also have less chlorides. A low pH level as well as a high pH level can also be a reason for poor wine taste.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I have introduced a distinction between dry and sweet wines at the 3 g/dm^3 level of residual sugar. This cut level was selected according to the bimodal distribution of the residual sugar variable. Some differences between dry and sweet wines were investigated. Sweeter wines probably need more acids (lower pH) to balance the taste. There is an evident decrease of alcohol for sweeter wines, which can be explained by the wine making process itself.

As the share of free sulfur dioxide grows, the level of volatile.acidity goes down. Free sulfur dioxide serves as an antimicrobial agent and antioxidant, protecting wine from spoilage by bacteria and oxidation. Its antimicrobial action also helps to minimize volatile acidity.

It is interesting to note that low alcohol wines have a lower pH level. For low alcohol values pH increases with alcohol; above an alcohol level of about 10% pH stays roughly the same, with a slight increase for high alcohol levels.

What was the strongest relationship you found?

Alcohol and volatile.acidity seem to have the strongest influence on wine quality.

Density is highly correlated with alcohol and residual sugar. This agrees with the definition of density as dependent on the percent alcohol and sugar content. In further analysis this correlation should be taken into account. For example, in regression models it is better to include either density, or residual sugar and alcohol, but not all three.

Free.sulfur.dioxide, being a part of total.sulfur.dioxide, has a high correlation with it (~ 0.6).

Multivariate Plots Section

Alcohol and Volatile Acidity

According to the previous section, alcohol and volatile.acidity have a significant effect on wine quality. Next I will consider the combined effect of these factors on the target variable quality.cls.agg.

Low quality wines correspond to high values of volatile.acidity and a low alcohol level. It is interesting to note that at higher alcohol levels there are not many high values of volatile.acidity, and moreover, medium values of volatile.acidity at higher alcohol levels already correspond to high and medium quality wines. This is seen more clearly in the second plot: for high quality wines the relation between alcohol and volatile.acidity is positive, so as the alcohol level grows, the acceptable level of volatile.acidity also grows.

The latter suggests that alcohol neutralizes or balances an excess of volatile.acidity.

Alcohol and Free Sulfur Dioxide Concentration

The next plot shows the relation between quality, alcohol and free sulfur dioxide concentration.

Low free sulfur dioxide concentrations correspond to low quality wines, and this seems to be independent of alcohol level. As the concentration increases, alcohol level becomes more significant for wine quality. Again note that higher alcohol wines normally do not have low sulfur dioxide concentrations.

Volatile.acidity and sulfur.dioxide

Again we see that as the concentration of free.sulfur.dioxide increases, the amount of volatile.acidity decreases. A lot of lower quality wines are located in the area of small free sulfur dioxide concentrations, and they correspond to both high and low levels of volatile.acidity. The latter means that low free sulfur dioxide concentrations can result not only in high volatile acidity, but also in the presence of some other unwanted substances.

Model

First I will exclude outliers for residual.sugar, free.sulfur.dioxide, chlorides and citric.acid.

Next I will scale all continuous variables to have zero mean and unit standard deviation. Then I will build three prediction models: linear regression, assuming that the response quality is a continuous variable; logistic regression with response quality.cls.agg; and ordinal logistic regression with response quality.cls.agg. Note that I will not include density, since it is highly correlated with residual.sugar and alcohol.
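
The scaling step can be sketched with base R's scale(). The example below standardizes two predictors from the first ten rows of the structure listing and leaves the response untouched, which is an assumption consistent with the intercept of the linear model reported below (about 5.9, close to a typical quality score); the object names are illustrative.

```r
# First ten rows of the dataset, copied from the structure listing.
wines10 <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  pH      = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  quality = rep(6L, 10)
)

# Standardize the predictors only: subtract the column mean, divide by the column sd.
predictors <- c("alcohol", "pH")
stnd <- wines10
stnd[predictors] <- as.data.frame(scale(wines10[predictors]))

print(round(colMeans(stnd[predictors]), 10))  # column means after scaling
print(sapply(stnd[predictors], sd))           # column standard deviations after scaling
```

With all predictors on the same scale, the magnitudes of the regression coefficients become directly comparable.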

Linear Regression

## 
## Call:
## lm(formula = quality ~ . - density, data = stnd_wines[1:12])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3170 -0.5069 -0.0255  0.4510  3.0948 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.935469   0.011417 519.896  < 2e-16 ***
## fixed.acidity        -0.019146   0.013371  -1.432  0.15224    
## volatile.acidity     -0.149712   0.012302 -12.170  < 2e-16 ***
## citric.acid          -0.004793   0.012344  -0.388  0.69781    
## residual.sugar        0.125129   0.013993   8.942  < 2e-16 ***
## chlorides            -0.056242   0.014028  -4.009 6.19e-05 ***
## free.sulfur.dioxide   0.107535   0.015226   7.062 1.90e-12 ***
## total.sulfur.dioxide -0.027579   0.017037  -1.619  0.10557    
## pH                    0.040785   0.013098   3.114  0.00186 ** 
## sulphates             0.051826   0.011733   4.417 1.03e-05 ***
## alcohol               0.425189   0.015902  26.737  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.752 on 4328 degrees of freedom
## Multiple R-squared:  0.2501, Adjusted R-squared:  0.2484 
## F-statistic: 144.3 on 10 and 4328 DF,  p-value: < 2.2e-16

Citric.acid, fixed.acidity and total.sulfur.dioxide appear not to be significant.

As was shown in the exploratory analysis, an increase of alcohol increases wine quality; the same is true for residual.sugar, free.sulfur.dioxide, pH and sulphates. An increase of the other variables has a negative effect. The most important feature is alcohol, followed by volatile.acidity, residual.sugar and free.sulfur.dioxide.
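
Because the predictors were scaled to unit standard deviation, the importance ranking can be read off the absolute values of the coefficients. A small sketch, with the estimates copied from the summary above:

```r
# Coefficient estimates copied from the linear regression summary above.
coefs <- c(
  fixed.acidity        = -0.019146,
  volatile.acidity     = -0.149712,
  citric.acid          = -0.004793,
  residual.sugar       =  0.125129,
  chlorides            = -0.056242,
  free.sulfur.dioxide  =  0.107535,
  total.sulfur.dioxide = -0.027579,
  pH                   =  0.040785,
  sulphates            =  0.051826,
  alcohol              =  0.425189
)

# Rank predictors by the absolute size of their standardized effect.
ranked <- coefs[order(abs(coefs), decreasing = TRUE)]
print(round(ranked, 3))
```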

Logistic Regression

## 
## Call:
## glm(formula = quality.cls.agg ~ . - density, family = binomial, 
##     data = stnd_wines[c(1:11, 13)])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9119   0.1023   0.1531   0.2412   1.0117  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           4.29895    0.15085  28.498  < 2e-16 ***
## fixed.acidity        -0.29728    0.09724  -3.057 0.002233 ** 
## volatile.acidity     -0.41304    0.09350  -4.418 9.98e-06 ***
## citric.acid          -0.02240    0.09535  -0.235 0.814263    
## residual.sugar        0.18901    0.12042   1.570 0.116501    
## chlorides            -0.22736    0.11224  -2.026 0.042805 *  
## free.sulfur.dioxide   1.00460    0.15499   6.482 9.08e-11 ***
## total.sulfur.dioxide  0.18358    0.13281   1.382 0.166883    
## pH                   -0.11083    0.11101  -0.998 0.318087    
## sulphates             0.10710    0.10927   0.980 0.327024    
## alcohol               0.48189    0.13486   3.573 0.000352 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1104.86  on 4338  degrees of freedom
## Residual deviance:  924.13  on 4328  degrees of freedom
## AIC: 946.13
## 
## Number of Fisher Scoring iterations: 7

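Citric.acid, residual.sugar, total.sulfur.dioxide, pH and sulphates are not significant. Positive effect: alcohol and free.sulfur.dioxide. Negative effect: volatile.acidity, fixed.acidity and chlorides. Most important: free.sulfur.dioxide, alcohol and volatile.acidity. Note that glm with family = binomial treats the first level of a factor response (“1-4”) as failure and all other levels as success, so this model separates low quality wines from all the rest.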
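
When reading this model it helps to know how glm handles a factor response with more than two levels under family = binomial: the first level counts as failure and every other level as success. The sketch below illustrates this with an intercept-only model; the response is rebuilt from the quality.cls.agg counts reported earlier (before outlier removal), so it is an illustration and not a refit of the model above.

```r
# Class counts of quality.cls.agg as reported in the univariate section.
lvls <- c("1-4", "5", "6", "7", "8-10")
y <- factor(rep(lvls, times = c(183, 1457, 2198, 880, 180)), levels = lvls)

# Intercept-only binomial model on the five-level factor.
fit <- glm(y ~ 1, family = binomial)

# The fitted probability equals the share of wines outside the first level.
p.fit <- unname(plogis(coef(fit)))
p.obs <- mean(y != "1-4")
print(c(fitted = p.fit, observed = p.obs))
```

In other words, the logistic model here estimates the probability that a wine is not in the “1-4” class.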

McFadden's Pseudo R-Squared

## [1] 0.1635724
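
This value can be cross-checked from the deviances printed in the model summary. For a binary logistic model fitted to ungrouped data the deviance equals minus twice the log-likelihood, so McFadden's pseudo R-squared reduces to one minus the ratio of residual to null deviance. The sketch assumes this is how the reported figure was obtained; the small difference in the last digits comes from the rounding of the printed deviances.

```r
# Deviances copied from the logistic regression summary above.
null.dev  <- 1104.86
resid.dev <- 924.13

# McFadden's pseudo R-squared: 1 - logLik(model) / logLik(null model).
mcfadden <- 1 - resid.dev / null.dev
print(round(mcfadden, 4))

# Consistency check on the reported AIC: deviance plus 2 * number of coefficients.
aic <- resid.dev + 2 * 11
print(aic)

# With a fitted glm object `fit`, the same quantity would be
# 1 - fit$deviance / fit$null.deviance.
```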

Ordinal Logistic Regression

## Call:
## polr(formula = quality.cls.agg ~ . - density, data = stnd_wines[c(1:11, 
##     13)])
## 
## Coefficients:
##                          Value Std. Error  t value
## fixed.acidity        -0.048275    0.03408  -1.4164
## volatile.acidity     -0.382912    0.03220 -11.8932
## citric.acid           0.008401    0.03127   0.2687
## residual.sugar        0.312540    0.03595   8.6947
## chlorides            -0.169292    0.03581  -4.7280
## free.sulfur.dioxide   0.273006    0.03910   6.9817
## total.sulfur.dioxide -0.097703    0.04323  -2.2603
## pH                    0.124054    0.03365   3.6861
## sulphates             0.150268    0.02991   5.0234
## alcohol               1.098378    0.04326  25.3907
## 
## Intercepts:
##        Value    Std. Error t value 
## 1-4|5   -4.1332   0.0975   -42.4123
## 5|6     -1.0389   0.0376   -27.6541
## 6|7      1.5110   0.0417    36.2606
## 7|8-10   3.8268   0.0859    44.5577
## 
## Residual Deviance: 9565.667 
## AIC: 9593.667

Most important: alcohol, volatile.acidity, residual.sugar, free.sulfur.dioxide.
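
The polr summary above reports t values but no p-values. The following minimal sketch recovers approximate two-sided p-values, assuming the t statistics can be treated as standard normal (an assumption, not something the output states). The vector `t_values` is typed in by hand from the table above, and the name `not_significant` is introduced here only for illustration.

```r
# t values copied by hand from the polr coefficient table above
t_values <- c(
  fixed.acidity        = -1.4164,
  volatile.acidity     = -11.8932,
  citric.acid          = 0.2687,
  residual.sugar       = 8.6947,
  chlorides            = -4.7280,
  free.sulfur.dioxide  = 6.9817,
  total.sulfur.dioxide = -2.2603,
  pH                   = 3.6861,
  sulphates            = 5.0234,
  alcohol              = 25.3907
)

# Two-sided p-values under a standard normal approximation (assumption)
p_values <- 2 * pnorm(abs(t_values), lower.tail = FALSE)

# Predictors whose approximate p-value is not below 0.05
not_significant <- names(p_values)[p_values >= 0.05]
print(not_significant)

# Odds ratio for alcohol: exp() of its coefficient from the table above
odds_ratio_alcohol <- exp(1.098378)
print(odds_ratio_alcohol)
```

Under this approximation only fixed.acidity and citric.acid fail to reach the 5% level. The odds ratio for alcohol is close to 3, so if the predictors in `stnd_wines` are standardized (as the name suggests, though this is an assumption), a one standard deviation increase in alcohol roughly triples the odds of falling in a higher quality class.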

McFadden's Pseudo R-Squared:

## [1] 0.1224475

A cursory comparison of the models shows quite similar results. The most important factors are alcohol, volatile.acidity and free.sulfur.dioxide.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Multivariate analysis showed that the most important features for wine quality are alcohol and free sulfur dioxide concentration. Low free sulfur dioxide concentrations correspond to low quality wines, and this seems to be independent of alcohol level. As the concentration increases, alcohol level becomes more significant for wine quality. Note again that higher alcohol wines normally do not have low sulfur dioxide concentrations.

Were there any interesting or surprising interactions between features?

free.sulfur.dioxide seems to have a stronger effect than volatile.acidity. It was shown again that as the concentration of free.sulfur.dioxide increases, the amount of volatile.acidity decreases. Many lower quality wines are located in the region of small free sulfur dioxide concentrations, and they correspond to both high and low levels of volatile.acidity. This suggests that a low free sulfur dioxide concentration may lead not only to high volatile acidity but also to the presence of other unwanted substances.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Three prediction models were built: linear regression, treating the response quality as a continuous variable; logistic regression with response quality.cls.agg; and ordinal logistic regression with response quality.cls.agg.

All models used all features for prediction except density. Density was excluded because it is highly correlated with alcohol and residual sugar.

The linear regression model is significant, but its R-Squared is quite low, meaning that it explains a low proportion of the variation. Most features are related to wine quality, but either a linear model is too simple for the relation considered or there are other important factors not included in the model.

Logistic regression and ordinal logistic regression also have quite low R-Squared values (here McFadden's Pseudo R-Squared). However, the Pseudo R-Squared measure is quite controversial and is not always a good indicator of model fit. In further analysis, non-linearities should also be considered, and the models can be compared more accurately using cross-validation.
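
To illustrate the cross-validation idea mentioned above, here is a minimal sketch in base R. It runs on simulated data, not on the wine dataset. The data frame `sim`, its columns, and the helper `cv_log_loss` are all hypothetical names introduced for this example, and out-of-fold log-loss is just one possible comparison metric.

```r
set.seed(42)

# Simulated stand-in for the wine data (hypothetical, not the real dataset)
n <- 500
sim <- data.frame(alcohol = rnorm(n), free.so2 = rnorm(n))
sim$good <- rbinom(n, 1, plogis(-0.5 + 1.0 * sim$alcohol + 0.5 * sim$free.so2))

# Mean out-of-fold log-loss of a logistic regression under k-fold CV
cv_log_loss <- function(formula, data, k = 5) {
  # Randomly assign each row to one of k folds of near-equal size
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  losses <- sapply(seq_len(k), function(i) {
    # Fit on all folds except i, then score on the held-out fold i
    fit <- glm(formula, data = data[folds != i, ], family = binomial)
    held_out <- data[folds == i, ]
    p <- predict(fit, newdata = held_out, type = "response")
    p <- pmin(pmax(p, 1e-12), 1 - 1e-12)  # guard against log(0)
    y <- held_out[[response]]
    -mean(y * log(p) + (1 - y) * log(1 - p))
  })
  mean(losses)
}

loss_full    <- cv_log_loss(good ~ alcohol + free.so2, sim)
loss_alcohol <- cv_log_loss(good ~ alcohol, sim)
print(c(full = loss_full, alcohol_only = loss_alcohol))
```

A lower out-of-fold loss indicates better predictive fit, which allows models with different feature sets to be compared on data they were not trained on.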

Final Plots and Summary

Plot One

## wines$quality.cls.agg: 1-4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.40   10.10   10.17   10.80   13.50 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 8-10
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.65   12.60   14.00

Correlation Coefficient:

## [1] 0.4355747

Description One

Alcohol appeared to be one of the most significant features influencing wine quality. Lower quality wines tend to contain less alcohol, although the lowest class (1-4) has a somewhat higher median than class 5 (10.10 versus 9.50). The difference between quality classes can also be seen in the medians and quartiles.

Plot Two

## wines$quality.cls.agg: 1-4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   18.00   26.63   33.50  289.00 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   22.00   35.00   36.43   50.00  131.00 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   24.00   34.00   35.65   46.00  112.00 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   25.00   33.00   34.13   41.00  108.00 
## --------------------------------------------------------------------------- 
## wines$quality.cls.agg: 8-10
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   28.00   34.50   36.63   44.25  105.00

Description Two

The level of free sulfur dioxide is noticeably lower for low quality wines. Free sulfur dioxide serves as an antimicrobial agent and antioxidant, protecting wine from spoilage by bacteria and oxidation. Its antimicrobial action also helps to minimize volatile acidity. There has been much recent discussion about its presence in wine, since some people have a genuine allergy to sulfites. A number of “natural” wines have appeared on the market, to which few or no sulfites are added. However, the analysis shows that wines with a low free sulfur dioxide level have lower quality on average.

Plot Three

Description Three

Colored scatter plot showing the relation between free sulfur dioxide concentration (in the total amount of sulfur dioxide), amount of alcohol and wine quality. These features appeared to be among the most significant for wine quality. Low free sulfur dioxide concentrations correspond to low quality wines, and this seems to be independent of alcohol level. As the concentration increases, alcohol level becomes more significant for wine quality. Note again that higher alcohol wines normally do not have low sulfur dioxide concentrations.


Reflection

Wine tasting, like tasting anything else, is very subjective. Interesting results were published in the Journal of Wine Economics: an experiment showed that a typical judge’s scores varied by plus or minus four points over three blind tastings. A wine deemed good would be rated acceptable by the same judge minutes later, and then excellent.

The analysis agrees with that picture in some ways. Most of the relations it revealed were not surprising. Volatile acidity and sulfur dioxide are familiar concepts to all wine makers.

Sulfur dioxide is responsible for the words “contains sulfites” found on wine labels, and there is much discussion about its presence in wine. Some people have a genuine allergy to sulfites, and these allergies are often linked with asthma. The amount of sulfites that a wine can contain is highly regulated around the world. Any wine containing more than 10 parts per million (ppm) of sulfur dioxide must carry the words ‘contains sulfites’ on its label. That said, we are beginning to see a number of “natural” wines on the market, to which few or no sulfites are added. It is known that leaving out sulfites is easier with red wines, because the tannin acts as a natural antioxidant. However, the analysis suggests that this does not hold for white wines: a lack of sulfur dioxide quite often corresponds to low quality.

An interesting relation was observed between wine quality and alcohol. Wines with a higher alcohol level seem to be of higher quality. Moreover, low free sulfur dioxide concentration and high volatile acidity normally do not affect the quality of wines with a higher alcohol level.

One of the central difficulties I faced during the analysis was that I knew very little about wine making. I had to spend some time understanding, for example, what volatile acidity and free sulfur dioxide are and why they are important. Understanding the underlying data is very important for producing a good analysis and offering a new view of the problem.

I think it would be interesting to include price information in the database. Would price be correlated with wine quality or not? If not, what would that mean? It would also be interesting to include wines from different regions and countries: would experts value French wines more than wines from the Czech Republic? Another interesting question is how blind and open tastings compare. In a blind tasting experts do not know which wine they are evaluating, while in an open tasting they are given this information. Would there be any bias between the scores obtained in the two settings?