on
Submitted for partial fulfillment of the requirements for the award of the degree
of
BACHELOR OF ENGINEERING
IN
BY
D. Haritha
Assistant Professor
Department of CSE
M.V.S.R.E.C., Hyderabad.
M.V.S.R. ENGINEERING COLLEGE
(Affiliated to Osmania University, Hyderabad)
Nadergul(P.O.), Hyderabad-501510
Certificate
This is to certify that the mini-project work entitled “Prediction of wine quality using
machine learning in Python” is a bonafide work carried out by K.R. Lathasree
(2451-17-733-064), T. Meghana (2451-17-733-065) and B. Sony (2451-17-733-067) in partial
fulfillment of the requirements for the award of the degree of BACHELOR OF ENGINEERING IN
COMPUTER SCIENCE AND ENGINEERING from M.V.S.R. Engineering College, affiliated to
OSMANIA UNIVERSITY, Hyderabad, under our guidance and supervision.
The results embodied in this report have not been submitted to any other university or
institute for the award of any degree or diploma.
Internal Guide
D. Haritha
Assistant Professor
Department of CSE
MVSREC, Hyderabad.

Dr. Akhil Khare
Professor and Head
Department of CSE
MVSREC, Hyderabad.
DECLARATION
This is to certify that the work reported in the present mini-project entitled “Prediction
of Wine Quality using Machine Learning in Python” is a record of bonafide work done by
us in the Department of Computer Science and Engineering, M.V.S.R. Engineering College,
Osmania University. The report is based on the mini-project work done entirely by us and
not copied from any other source.
The results embodied in this mini-project report have not been submitted to any other
University or Institute for the award of any degree or diploma to the best of our
knowledge and belief.
ACKNOWLEDGEMENTS
We are thankful to our principal Dr. G. Kanaka Durga and to Dr. Akhil Khare, Professor
and Head, Department of Computer Science and Engineering, MVSR Engineering College,
Hyderabad for providing excellent infrastructure for completing this mini-project
successfully as a part of our B.E. Degree (CSE). We would like to thank our mini-project
coordinators M. Anupama, Md. Abdul Azeem and P. Subhashini for their constant monitoring,
guidance and support.
We convey our heartfelt thanks to the lab staff for allowing us to use the required
equipment whenever needed.
Finally, we would like to take this opportunity to thank our families for their support
throughout the work. We sincerely acknowledge and thank all those who directly or
indirectly gave their support in the completion of this work.
ABSTRACT
Guide
D. HARITHA,
Assistant Professor,
Dept. of CSE,
MVSREC, Hyderabad.
TABLE OF CONTENTS
Title page
Certificate
Declaration
Acknowledgements
Abstract
Table of contents
List of Figures
List of Tables
LIST OF FIGURES
Figure 1: Coefficients of linear regression model in predicting wine quality
LIST OF TABLES
CONTENTS INSIDE THE DOCUMENT
Chapter 1
1. Introduction
1.1 Problem statement
1.2 Existing system
1.3 Proposed system
1.4 Scope of the mini-project
Chapter 2
Chapter 3
3. System design
3.1 System architecture
Chapter 4
4. System implementation & methodologies
4.1 Methodologies
4.2 Code skeleton
Chapter 5
5. Testing
5.1 Screen shots
5.2 Results (screen shots)
Chapter 6
6. Conclusion & future enhancements
References/ bibliography
Appendix
Steps to install anaconda and launch spyder (windows)
Code
1. Introduction
The aim of this project is to build a regression model that predicts the quality of a wine
from its chemical composition and other measured factors.
Nowadays, industries use product quality certifications to promote their products.
Certification is a time-consuming process and relies on assessments by human experts, which
makes it very expensive. Our project explores the use of machine learning
techniques such as linear regression for predicting wine quality.
Machine learning techniques are used to determine how wine quality depends on the other,
independent variables. First, the wine dataset is pre-processed by removing
duplicate instances and checking for null values. Next, the independent attributes are
selected according to how strongly they depend on each other, which is determined by
checking the correlation between them. The dataset is then
split into two parts, a train dataset and a test dataset.
Linear regression is then applied to the train dataset to determine the dependency of wine
quality on the independent attributes. This gives us the required model. Finally, wine quality is
predicted with the help of the linear regression model.
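The split-and-evaluate part of this workflow can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic (generated with NumPy, not the wine dataset), and the three attributes, the 80/20 split ratio and the random seeds are all chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the wine attributes (not real measurements)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # 3 hypothetical attributes
y = 6.0 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.5, size=500)

# Hold out 20% of the rows for testing (the ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the model on the train part only
reg = LinearRegression().fit(X_train, y_train)

# Root mean square error on seen (train) and unseen (test) rows
rmse_train = mean_squared_error(y_train, reg.predict(X_train)) ** 0.5
rmse_test = mean_squared_error(y_test, reg.predict(X_test)) ** 0.5
print("In-sample RMSE: %.2f" % rmse_train)
print("Out-of-sample RMSE: %.2f" % rmse_test)
```

Comparing the two errors gives a rough check on whether the model generalises: an out-of-sample error far above the in-sample error would suggest overfitting.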
Linear regression:
Linear regression is a linear approach to modelling the relationship between a scalar
response (the dependent variable) and one or more explanatory variables (the independent
variables).
A linear regression model with k explanatory (independent) variables and one
response (dependent) variable can be expressed as (1).
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \epsilon \tag{1}$$
where $Y$ is the response variable and $X_i$ are the explanatory (independent) variables. $\epsilon$ is
the residual term of the model, the part of $Y$ that the explanatory variables do not
account for. $\beta_0$ is the Y-intercept and $\beta_1, \beta_2, \ldots, \beta_k$ are the regression coefficients. Our project
uses 10 independent variables after pre-processing and one dependent variable
(quality).
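To make equation (1) concrete, the sketch below generates data from known coefficients and recovers them by ordinary least squares using NumPy alone. The coefficient values, sample size and noise level are assumptions chosen for the illustration; they are not taken from the wine model.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 300, 2                             # n samples, k explanatory variables
X = rng.normal(size=(n, k))
beta0, beta = 3.0, np.array([1.5, -2.0])  # "true" values, chosen for the example
eps = rng.normal(scale=0.1, size=n)       # residual term

# Equation (1): Y = beta0 + beta1*X1 + beta2*X2 + eps
Y = beta0 + X @ beta + eps

# Prepend a column of ones so the intercept is estimated with the coefficients
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(np.round(beta_hat, 2))  # estimates of [beta0, beta1, beta2]
```

The estimates should land close to the chosen values, with the small differences coming from the residual term.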
Figure 2: Y-intercept
Figure 3: Linear regression model graph between one independent and a dependent
variable
Machine learning is the science of making machines capable of taking decisions by
themselves. Such systems learn from past experience or by analyzing historical
data, and they produce results according to that experience.
Advantages of Machine learning
⮚ Image recognition
⮚ Face recognition
⮚ Medical diagnosis
⮚ Classification
⮚ Prediction
⮚ Extraction
Predictions:
Machine Learning (ML) can generate two types of predictions—batch and real-time.
Regression:
Regression is a statistical measurement used in finance, investing and other disciplines that
attempts to determine the strength of the relationship between one dependent variable
(usually denoted by Y) and a series of other changing variables (known as independent
variables).
The two basic types of regression are simple linear regression and multiple linear
regression. Simple linear regression uses one independent variable to explain or predict the
outcome of the dependent variable Y, while multiple linear regression uses two or more
independent variables to predict the outcome.
Multiple linear regression:
It predicts a continuous dependent variable based on two or more
independent variables.
fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide,
total sulfur dioxide, pH, density, sulphates and alcohol.
About dataset:
The dataset in the above reference was created using white wine samples. The inputs
include objective tests (e.g. pH values) and the output is based on sensory data (median of
at least 3 evaluations made by wine experts). Each expert graded the wine quality between
0 (very bad) and 10 (very excellent).
The dataset is related to the white variant of the Portuguese “Vinho Verde” wine. For more
details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due
to privacy and logistic issues, only physicochemical (input) and sensory (output)
variables are available (e.g. there is no data about grape types, wine brand, wine selling
price, etc.).
Attribute information:
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Description of attributes:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate
readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to
an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops; it is rare to find
wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered
sweet
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as
a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2
is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes
evident in the nose and taste of wine
8 - density: the density of wine is close to that of water, depending on the percent alcohol
and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very
basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which
acts as an antimicrobial and antioxidant
⮚ Linear Regression can be used to predict the sale of products in the future
based on past buying behaviour.
⮚ Economists use Linear Regression to predict the economic growth of a country or
state.
⮚ An organisation can use linear regression to figure out how much to pay
a new joinee based on years of experience.
⮚ Linear regression analysis can help a builder to predict how many houses it would
sell in the coming months and at what price.
⮚ Petroleum prices can be predicted using Linear Regression.
Spyder:
Spyder, the Scientific Python Development Environment, is a free integrated development
environment (IDE) that is included with Anaconda. It includes editing, interactive testing,
debugging and introspection features.
Machine learning:
Machine learning gives systems the ability to automatically learn and improve from
experience without being explicitly programmed. Machine learning's main focus is to
provide algorithms which can be trained to perform a task. Affordable and easy
computational processing and cost-effective data storage options have made it feasible to
develop models that quickly and accurately analyze huge chunks of complex data.
Python:
Simple & easy to learn: The concept of this programming language is as simple as it can be.
That makes it easy for everyone to learn and use. It is easy to understand the syntax.
Open source: Python is completely free, as it is open source software. Several open
source scientific computing libraries provide an API for Python. Users can easily install Python
on their own computers and use the standard and extended libraries.
Scalability: Programmers can write code in C or C++ and call it from Python.
Applications of Python:
Scikit-learn:
Scikit-learn is an open source machine learning library for the Python programming
language. It features various classification, regression, and clustering algorithms and is
designed to interoperate with the Python numerical libraries NumPy and SciPy. Scikit-learn
includes a Python implementation of the k-means algorithm, which helps in understanding how
to use this algorithm in a program.
3. System design
3.1 System architecture:
4. System implementation and methodologies
4.1 Methodologies:
read_csv():
It reads a comma-separated values (csv) file into a DataFrame. The file path is given as a
parameter to the read_csv() function. This function is provided by pandas, so pandas must be
imported to use it.
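A minimal sketch of read_csv() on semicolon-separated text is given here. To keep the example self-contained it reads from an in-memory string instead of a file on disk; the three data rows are made-up values for illustration, not rows copied from the wine dataset.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for a semicolon-separated file (values are made up)
csv_text = (
    '"fixed acidity";"pH";"alcohol";"quality"\n'
    "7.0;3.00;8.8;6\n"
    "6.3;3.30;9.5;6\n"
    "8.1;3.26;10.1;5\n"
)

# read_csv accepts a path or a file-like object; sep=';' sets the delimiter
df = pd.read_csv(io.StringIO(csv_text), sep=";")
print(df.shape)             # (rows, columns)
print(df.columns.tolist())  # column names taken from the header line
```

Without sep=';' pandas would assume commas and read each line as a single column, which is why the separator is passed explicitly.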
duplicated():
This method checks for duplicate instances in our dataset. It returns a Boolean value for
each instance: True if the instance duplicates an earlier one, otherwise False.
drop_duplicates():
It is a predefined function and is used to drop duplicates from dataset.
It returns DataFrame with duplicate rows removed.
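The two duplicate-handling methods can be tried on a small hand-made DataFrame, as sketched below. The values are invented for the example; the third row is deliberately a copy of the first.

```python
import pandas as pd

# Four rows; row 2 repeats row 0 exactly
df = pd.DataFrame({
    "pH": [3.0, 3.3, 3.0, 3.2],
    "alcohol": [8.8, 9.5, 8.8, 10.1],
})

# duplicated() marks a row True when it repeats an earlier row
dup_mask = df.duplicated()
print(dup_mask.tolist())    # [False, False, True, False]
print(int(dup_mask.sum()))  # 1

# drop_duplicates() returns a new DataFrame and keeps the first occurrence
clean = df.drop_duplicates()
print(clean.shape)          # (3, 2)
```

Summing the Boolean mask counts the duplicate rows, because True is treated as 1.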
head():
This method returns the first n rows of the dataset. The default value of n is 5.
corr():
It is used to find the pairwise correlation of all columns in the data frame. Any null values
are automatically excluded. Non-numeric columns are left out of the calculation (in pandas
2.0 and later this requires passing numeric_only=True).
If the correlation between two attributes is close to one, one of the attributes should be
removed, because the two carry almost the same information. From the table below we
can see that density and residual sugar are very strongly correlated, so we dropped the density
attribute.
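The idea of dropping one attribute from a highly correlated pair can be sketched as follows. The data are synthetic: the "density" column is constructed from "residual_sugar" plus a little noise so that the two are strongly correlated by design, and the 0.9 threshold is an assumption made for the example, not a value taken from the project.

```python
import numpy as np
import pandas as pd

# Synthetic columns (not real measurements); density is built from residual_sugar
rng = np.random.default_rng(2)
sugar = rng.normal(6.0, 2.0, 200)
df = pd.DataFrame({
    "residual_sugar": sugar,
    "density": 0.99 + 0.0004 * sugar + rng.normal(0, 0.0001, 200),
    "pH": rng.normal(3.2, 0.15, 200),
})

# Pairwise correlation matrix of all columns
corr = df.corr()
print(corr.round(2))

# Collect each pair of columns whose absolute correlation exceeds the threshold
threshold = 0.9  # assumed cut-off for the illustration
pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

# Drop the second attribute of every flagged pair
reduced = df.drop(columns=[b for _, b, _ in pairs])
print(reduced.columns.tolist())
```

Which attribute of a pair to keep is a judgment call; this sketch simply keeps the one that appears first in the column order.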
Table 3: Correlation between the attributes of wine dataset
LinearRegression():
We are using the LinearRegression algorithm to build our model. LinearRegression is a
predefined class in the linear_model module, so we imported linear_model from the sklearn
library to use it.
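A minimal sketch of fitting LinearRegression is shown here on noise-free toy data, so that the fitted values can be checked by eye. The data follow y = 1 + 2*x1 + 3*x2, a relation chosen for the example and unrelated to the wine attributes.

```python
import numpy as np
from sklearn import linear_model

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise)
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]], dtype=float)
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

# Create the estimator, then fit it to the attributes X and target y
reg = linear_model.LinearRegression()
reg.fit(X, y)

print(reg.coef_)       # approximately [2. 3.], one coefficient per attribute
print(reg.intercept_)  # approximately 1.0, the Y-intercept
```

After fitting, coef_ holds the regression coefficients and intercept_ holds the Y-intercept of the linear equation.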
predict(argument):
This method is used to predict the output of an unknown sample. Here, argument is the
array of attributes of the unknown sample.
The predicted quality for an unknown sample with inputs
[12,0.5,1.4,56,0.2,205,380,3.1,0.3,7.1] is as shown below.
Figure 7: output (quality of unknown sample)
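The shape handling around predict() can be sketched on toy data. predict() expects a two-dimensional array with one row per sample, so a single sample is reshaped with reshape(1, -1) before the call. The model and numbers below are assumptions made for the illustration (the target is y = x1 + 2*x2), not the wine model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model fitted on noise-free data from y = x1 + 2*x2
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = X[:, 0] + 2 * X[:, 1]
reg = LinearRegression().fit(X, y)

sample = np.array([5.0, 5.0])      # one unknown sample, 1-D with shape (2,)
sample_2d = sample.reshape(1, -1)  # 2-D with shape (1, 2): one row, two attributes

pred = reg.predict(sample_2d)      # returns one prediction per row
print(sample_2d.shape, pred.shape)
print(pred)
```

The returned array has one entry per input row, so a single sample yields an array of length one.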
4.2 Code skeleton/Algorithms:
5. Testing:
5.1 Screenshots:
Testing screenshot 1:
Testing screenshot 2:
Testing screenshot 3 (Result):
6. Conclusions and future enhancement
Interest in the wine industry has increased in recent years, which drives
growth in this industry. Companies are therefore investing in new technologies to improve
wine production and selling. Wine quality certification plays a very
important role in both processes, and it requires wine testing by human experts. Our
project explores the use of one machine learning technique, linear regression, which determines
the features that are important for prediction. The experiments show that the value of the dependent
variable can be predicted more accurately if only the important features are considered in
prediction rather than all features. In future, a larger dataset can be used for
experiments, and other machine learning techniques may be explored for wine quality
prediction.
References/bibliography
https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/
https://github.com/auxa/Wine-Quality-Prediction
https://scikit-learn.org/stable/
https://www.geeksforgeeks.org/linear-regression-python-implementation/
https://docs.python.org/3/
Appendix
Steps to install anaconda and launch spyder (windows)
1. Open https://www.anaconda.com/distribution/ in a web browser.
2. Download the Anaconda installer for Windows (select the Python version).
3. To start Spyder, first open Anaconda Navigator (you will find Anaconda Navigator in
the Start menu).
4. Then, click the Launch button below the Spyder icon on the Navigator Home tab.
Code:
"""
Prediction of wine quality using machine learning in python
"""
import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
#reading dataset
raw_data=pd.read_csv('file:///C:/Users/Lathasree Reddy/Downloads/winequality-white.csv',sep=';')
print("Shape of raw data :",raw_data.shape,"\n")
#identifying duplicates
dup_data=raw_data.duplicated()
print("Number of duplicate rows =",sum(dup_data),"\n")
#removing duplicates
data=raw_data.drop_duplicates()
print("Shape of data after removing duplicates :",data.shape,"\n")
#renaming columns so that their names contain no spaces
data.rename(columns={'fixed acidity':'fixed_acidity','volatile acidity':'volatile_acidity',\
                     'citric acid':'citric_acid','residual sugar':'residual_sugar',\
                     'free sulfur dioxide':'free_sulfur_dioxide',\
                     'total sulfur dioxide':'total_sulfur_dioxide'},inplace=True)
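print()
#there are no missing values
#splitting the dataset into train and test parts
#(this step is missing from the listing; the 80/20 ratio and random_state are assumed)
train,test=tts(data,test_size=0.2,random_state=0)
y=train['quality']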
cols=["fixed_acidity","volatile_acidity","citric_acid","residual_sugar","chlorides",\
      "free_sulfur_dioxide","total_sulfur_dioxide","pH","sulphates","alcohol"]
X=train[cols]
#model formation
reg=linear_model.LinearRegression()
model=reg.fit(X,y)
coef=reg.coef_
print("Coefficients of the linear equation : \n",coef,"\n")
intercept=reg.intercept_
print("Y-intercept :",intercept)
print()
y_train_pred=reg.predict(X)
print("In sample Root mean square error: %.2f"%mean_squared_error(y,y_train_pred)**0.5)
print()
y_test=test['quality']
X_test=test[cols]
y_test_pred=reg.predict(X_test)
print("Out sample Root mean square error: %.2f"%mean_squared_error(y_test,y_test_pred)**0.5)
print()
#unknown sample
import numpy as np
a=np.array([12,0.5,1.4,56,0.2,205,380,3.1,0.3,7.1]).reshape(1,-1)
quality1=reg.predict(a)
print(quality1)
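#reading an unknown sample from the user
#(the input loop is missing from the listing; the prompt below is an assumed reconstruction)
arr=[]
for c in cols:
    x=float(input("Enter "+c+" : "))
    arr.append(x)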
ar=np.asarray(arr)
ar=ar.reshape(1,-1)
quality2=reg.predict(ar)
print()
print("Quality of your wine sample is",quality2)