
A Mini-Project Report

on

PREDICTION OF WINE QUALITY USING MACHINE LEARNING IN PYTHON

Submitted for partial fulfillment of the requirements for the award of the degree
of

BACHELOR OF ENGINEERING

IN

COMPUTER SCIENCE AND ENGINEERING

BY

K.R. Lathasree (2451-17-733-064)


T. Meghana (2451-17-733-065)
B. Sony (2451-17-733-067)

Under the guidance of

D. Haritha
Assistant Professor
Department of CSE
M.V.S.R.E.C., Hyderabad.

Department of Computer Science and Engineering


M.V.S.R. ENGINEERING COLLEGE
(Affiliated to Osmania University & Recognized by AICTE)
Nadergul, Saroor Nagar Mandal, Hyderabad – 501 510
2018-19.

M.V.S.R. ENGINEERING COLLEGE
(Affiliated to Osmania University, Hyderabad)
Nadergul(P.O.), Hyderabad-501510

Certificate
This is to certify that the mini-project work entitled “Prediction of wine quality using
machine learning in Python” is a bonafide work carried out by K.R. Lathasree
(2451-17-733-064), T. Meghana (2451-17-733-065), B. Sony (2451-17-733-067) in partial
fulfillment of the requirements for the award of degree of BACHELOR OF ENGINEERING IN
COMPUTER SCIENCE AND ENGINEERING from M.V.S.R. Engineering College, affiliated to
OSMANIA UNIVERSITY, Hyderabad, under our guidance and supervision.

The results embodied in this report have not been submitted to any other university or
institute for the award of any degree or diploma.

Internal Guide
D. Haritha Dr.Akhil Khare
Assistant Professor Professor and Head
Department of CSE Department of CSE
MVSREC, Hyderabad. MVSREC, Hyderabad.

DECLARATION

This is to certify that the work reported in the present mini-project entitled “Prediction
of Wine Quality using Machine Learning in Python” is a record of bonafide work done by
us in the Department of Computer Science and Engineering, M.V.S.R. Engineering College,
Osmania University. The report is based on the mini-project work done entirely by us and
not copied from any other source.
The results embodied in this mini-project report have not been submitted to any other
University or Institute for the award of any degree or diploma to the best of our
knowledge and belief.

K. R. Lathasree T. Meghana B. Sony


(2451-17-733-064) (2451-17-733-065) (2451-17-733-067)

ACKNOWLEDGEMENTS

We would like to express our sincere gratitude and indebtedness to our mini-project
guide D. Haritha for her valuable suggestions and interest throughout the course of this
mini-project.

We are also thankful to our principal Dr. G. Kanaka Durga and Dr. Akhil Khare, Professor
and Head, Department of Computer Science and Engineering, MVSR Engineering College,
Hyderabad for providing excellent infrastructure for completing this mini-project
successfully as a part of our B.E. Degree (CSE). We would like to thank our mini-project
coordinators M. Anupama, Md. Abdul Azeem, P. Subhashini for their constant monitoring,
guidance and support.

We convey our heartfelt thanks to the lab staff for allowing us to use the required
equipment whenever needed.

Finally, we would like to take this opportunity to thank our families for their support
throughout the work. We sincerely acknowledge and thank all those who directly or
indirectly supported the completion of this work.

K.R. Lathasree (2451-17-733-064)


T. Meghana (2451-17-733-065)
B. Sony (2451-17-733-067)

ABSTRACT

PREDICTION OF WINE QUALITY USING MACHINE LEARNING IN PYTHON


Machine learning (ML) is one of the most popular approaches in Artificial Intelligence. It
concerns giving computers the ability to learn without being explicitly programmed to
perform a task. Prediction of wine quality refers to predicting the quality of a wine from
given attributes such as fixed acidity, volatile acidity, density and pH. The wine quality
dataset we use is taken from the UCI Machine Learning Repository. First, we pre-process
the dataset: we check for missing values, duplicates and correlation between attributes.
Next, we split the dataset into train and test datasets. Then, we import a machine learning
algorithm from sklearn. We used the linear regression algorithm for our project. Linear
regression predicts a continuous dependent variable from one or more independent
variables. We train the algorithm on the train dataset, which gives a model, and then
measure its prediction error on the test dataset. Finally, when an unknown sample
with its attributes is given as input to the model, it predicts the quality of that sample.

K.R. Lathasree (2451-17-733-064)


T. Meghana (2451-17-733-065)
B. Sony (2451-17-733-067)

Guide
D. HARITHA,
Assistant Professor,
Dept. of CSE,
MVSREC, Hyderabad.

TABLE OF CONTENTS
PAGE NOS.
Title page ..........................................................................................................i
Certificate ......................................................................................................ii - vi
Declaration .......................................................................................................vii
Acknowledgements ...........................................................................................viii
Abstract.............................................................................................................. ix
Table of contents................................................................................................ x
List of Figures .................................................................................................... xi
List of Tables .................................................................................................... xii

LIST OF FIGURES
Figure 1 Coefficients of linear regression model in predicting wine quality Page no. 02

Figure 2 Y intercept Page no. 03

Figure 3 Linear regression model between one independent and a dependent variable Page no. 03

Figure 4 Structure of machine learning system Page no. 03

Figure 5 A linear regression model with k number of explanatory variables and one response Page no. 05

Figure 6 Input (attributes of unknown sample) Page no. 15

Figure 7 Output (quality of unknown sample) Page no. 15

Figure 8 Anaconda Navigator Page no. 22

LIST OF TABLES

Table 1 Dup data series for wine quality Page no. 12

Table 2 Head elements of wine quality dataset Page no. 13

Table 3 Correlation between the attributes of wine dataset Page no. 14

Table 4 Test and Train dataset instances and attributes Page no. 14

CONTENTS INSIDE THE DOCUMENT

Chapter 1
1. Introduction 01 - 05
1.1 Problem statement 01
1.2 Existing system 01
1.3 Proposed system 01 - 03
1.4 Scope of the mini-project 03 - 05

Chapter 2

2. Tools and technologies 06 - 10


2.1 Literature survey 06 - 08
2.2 Software requirements 08 - 09
2.3 Tools and technologies 09 - 10

Chapter 3

3. System design 11
3.1 System architecture 11

Chapter 4
4. System implementation & methodologies 12 - 16
4.1 Methodologies 12 - 15
4.2 Code skeleton 16

Chapter 5
5. Testing 17 - 19
5.1 Screen shots 17 - 18
5.2 Results (screen shots) 19

Chapter 6
6. Conclusion & future enhancements 20

References/ bibliography 21
Appendix 22 - 26
Steps to install anaconda and launch spyder (windows) 22
Code 23 - 26
1. Introduction

1.1 Problem Statement:

To build a regression model that predicts the quality of a wine from multiple factors of
its chemical composition.

1.2 Existing System:

Nowadays, industries use product quality certifications to promote their products.
Certification is a time-consuming process and requires assessment by human experts, which
makes it very expensive. Our project explores the use of machine learning
techniques such as linear regression for predicting wine quality.

1.3 Proposed Methodology:

Machine learning techniques are used to determine the dependency of wine quality on the
other, independent variables. First, the wine dataset is pre-processed by removing
duplicate instances and checking for null values. Next, the independent attributes are
selected according to their dependency on each other, which is determined by checking the
correlation between them. The dataset is then split into two parts, a train dataset and a
test dataset.
Further, linear regression is applied to determine the dependency of wine quality on
the independent attributes. This gives us the required model. Finally, wine quality is
predicted with the help of the linear regression model.

Linear regression:
Linear regression is a linear approach to modelling the relationship between a scalar
response (dependent variable) and one or more explanatory variables (independent
variables).
A linear regression model with k explanatory (independent) variables and one
response (dependent) variable can be expressed as (1).
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \epsilon \tag{1}$$
where $Y$ is the response variable and $X_i$ are the explanatory variables (independent
variables). $\epsilon$ is the residual term of the model, which is used for inference on the
remaining model parameters. $\beta_0$ is the Y-intercept and $\beta_1, \beta_2, \ldots, \beta_k$ are the
regression coefficients. Our project consists of 10 independent variables after
pre-processing and a dependent variable (quality).
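
As a rough illustration of equation (1), the predicted value for one sample is the intercept plus the sum of each coefficient multiplied by its variable. The intercept, coefficients and sample values below are made-up numbers chosen for easy arithmetic; they are not the fitted values shown in Figure 1 and Figure 2.

```python
import numpy as np

# Hypothetical intercept and coefficients (illustration only, not the fitted model).
beta_0 = 2.0
beta = np.array([0.5, -1.0, 0.25])

# One sample with k = 3 explanatory variables.
x = np.array([4.0, 1.0, 8.0])

# Y = beta_0 + beta_1*X_1 + ... + beta_k*X_k
# (the residual term is unknown for a new sample, so it is left out of the prediction).
y_hat = beta_0 + np.dot(beta, x)
print(y_hat)
```

With 10 independent variables, as in this project, the same expression applies with ten coefficients instead of three.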

Figure 1: Coefficients of the linear regression model in predicting wine quality

Figure 2: Y-intercept

Figure 3: Linear regression model graph between one independent and a dependent
variable

1.4 Scope of the project:

Machine Learning is the science of making a machine capable of taking decisions by
itself. Such systems learn from past experience, that is, by analyzing historical
data, and produce results according to that experience.

Figure 4: Structure of machine learning system

Advantages of Machine learning

⮚ Easily identifies trends and patterns
⮚ No human intervention needed (automation)
⮚ Continuous improvement
⮚ Handling multi-dimensional and multi-variety data

Applications of Machine Learning

⮚ Image recognition
⮚ Face recognition
⮚ Medical diagnosis
⮚ Classification
⮚ Prediction
⮚ Extraction

Predictions:

Machine Learning (ML) can generate two types of predictions: batch and real-time.

A real-time prediction is a prediction for a single observation. Real-time predictions are
ideal for mobile apps, websites, and other applications that need to use results
interactively.

A batch prediction is a set of predictions for a group of observations. ML processes the
records in a batch prediction together, so processing can take some time. Use batch
predictions for applications that require predictions for a set of observations or
that don't use results interactively.
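
The difference between the two kinds of prediction can be sketched with scikit-learn, where the same predict call scores either one observation or many. The tiny training set below is synthetic (it follows y = 2x + 1) and is only an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic training set that follows y = 2*x + 1 exactly.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X_train, y_train)

# Real-time style: a single observation, passed as a one-row 2-D array.
single = model.predict(np.array([[4.0]]))

# Batch style: several observations scored together in one call.
batch = model.predict(np.array([[5.0], [6.0], [7.0]]))

print(single)
print(batch)
```

The project code in the appendix uses the real-time style: it predicts the quality of one user-supplied sample at a time.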

Regression:

Regression is a statistical method used in finance, investing and other disciplines that
attempts to determine the strength of the relationship between one dependent variable
(usually denoted by Y) and a series of other changing variables (known as independent
variables).
The two basic types of regression are simple linear regression and multiple linear
regression. Simple linear regression uses one independent variable to explain or predict
the outcome of the dependent variable Y, while multiple linear regression uses two or more
independent variables to predict the outcome.
Multiple linear regression:
It predicts a continuous dependent variable from two or more independent variables.

Figure 5: A linear regression model with k number of explanatory (independent)
variables and one response (dependent) variable.

2. Tools and technologies used

2.1 Literature survey:


We took our dataset from the UCI Machine Learning Repository. It has 4898 instances and
12 attributes. The 12th attribute, quality, depends on the remaining 11 attributes:
fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide,
total sulfur dioxide, density, pH, sulphates and alcohol.
About dataset:
The dataset referenced above was created using white wine samples. The inputs
include objective tests (e.g. pH values) and the output is based on sensory data (median of
at least 3 evaluations made by wine experts). Each expert graded the wine quality between
0 (very bad) and 10 (very excellent).

The dataset is related to white variant of the Portuguese “Vinho Verde” wine. For more
details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due
to privacy and logistic issues, only physicochemical (inputs) and sensory (the output)
variables are available (e.g. there is no data about grape types, wine brand, wine selling
price, etc.).

Attribute information:

Input variables (based on physicochemical tests):

1 - fixed acidity (tartaric acid - g / dm^3)

2 - volatile acidity (acetic acid - g / dm^3)

3 - citric acid (g / dm^3)

4 - residual sugar (g / dm^3)

5 - chlorides (sodium chloride - g / dm^3)

6 - free sulfur dioxide (mg / dm^3)

7 - total sulfur dioxide (mg / dm^3)

8 - density (g / cm^3)

9 - pH

10 - sulphates (potassium sulphate - g / dm^3)

11 - alcohol (% by volume)

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

Description of attributes:

Input variables (based on physicochemical tests):

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate
readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to
an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find
wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered
sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as
a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2
is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes
evident in the nose and taste of wine

8 - density: the density of wine is close to that of water depending on the percent alcohol
and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very
basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which
acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

Applications of linear regression:

⮚ Linear Regression can be used to predict the sale of products in the future
based on past buying behaviour.
⮚ Economists use Linear Regression to predict the economic growth of a country or
state.
⮚ An organisation can use linear regression to figure out how much to pay
a new hire based on years of experience.
⮚ Linear regression analysis can help a builder to predict how many houses it would
sell in the coming months and at what price.
⮚ Petroleum prices can be predicted using Linear Regression.

2.2 Software requirements:


Anaconda:
The open-source Anaconda distribution is an easy way to perform Python/R data science
and machine learning on Linux, Windows, and Mac OS X. Anaconda bundles the popular
Python libraries used in data science, the most important being scikit-learn,
NumPy, pandas and SciPy. It also comes with Jupyter Notebook and IPython.
So, it saves us from installing numerous libraries separately.

Spyder:
Spyder, the Scientific Python Development Environment, is a free integrated development
environment (IDE) that is included with Anaconda. It includes editing, interactive testing,
debugging and introspection features.

2.3 Tools and technologies:

Machine learning:

Machine learning provides systems the ability to automatically learn and improve from
experience without being explicitly programmed. Machine learning's main focus is to
provide algorithms which can be trained to perform a task. Affordable and easy
computational processing and cost-effective data storage options have made it feasible to
develop models that quickly and accurately analyze large volumes of complex data.

Python:

Python is a programming language whose development Guido van Rossum began in 1989; it was
first released in 1991. Python is an interpreted, object-oriented, dynamically typed,
high-level programming language (Python Software Foundation 2013). Its style is simple and
clear, and it supports powerful classes of different kinds. Moreover, Python can easily be
combined with other programming languages, such as C or C++. As a successful programming
language, it has its own advantages:

Simple & easy to learn: The concept of this programming language is as simple as it can be.
That makes it easy for everyone to learn and use. It is easy to understand the syntax.

Open source: Python is completely free as it is open source software. Several
open source scientific computing libraries provide an API for Python. Users can easily
install Python on their own computer and use the standard library and extension libraries.

Extensibility: Programmers can write code in C or C++ and call it from Python.

Applications of Python:

● GUI based desktop applications
o Image processing and graphic design applications
o Scientific and computational applications
o Games
● Web frameworks and web applications
● Enterprise and business applications
● Operating systems
● Language development

Scikit-learn:
Scikit-learn is an open source machine learning library for the Python programming
language. It features various classification, regression, and clustering algorithms and is
designed to interoperate with the Python numerical libraries NumPy and SciPy. For example,
scikit-learn contains an implementation of the k-means clustering algorithm, and it
provides the LinearRegression class used in this project.

3. System design
3.1 System architecture:

4. System implementation and methodologies
4.1 Methodologies:

read_csv():
It reads a comma-separated values (csv) file into a DataFrame. The file path is given as a
parameter to read_csv(). This function is provided by pandas, so pandas must be imported
to use it. Because the wine quality file separates values with semicolons, we also pass
sep=';'.
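
A minimal sketch of read_csv() on semicolon-separated text follows. To keep it self-contained it reads from an in-memory string rather than from winequality-white.csv; the three columns and all values are made up for illustration.

```python
import io
import pandas as pd

# Made-up rows in a semicolon-separated layout (illustrative values only;
# the real file has twelve columns).
csv_text = (
    '"fixed acidity";"alcohol";"quality"\n'
    '1.0;2.0;3\n'
    '4.0;5.0;6\n'
    '7.0;8.0;9\n'
)

# read_csv accepts a file path or a file-like object; sep=';' sets the delimiter.
sample = pd.read_csv(io.StringIO(csv_text), sep=";")
print(sample.shape)
print(list(sample.columns))
```

Without sep=';' pandas would look for commas and load each row as a single column.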

duplicated():
This method checks for duplicate instances in our dataset. For each instance it returns
True if the instance duplicates an earlier one and False otherwise.
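
As a small illustration, duplicated() can be tried on a hand-made DataFrame (the column names and values are assumptions chosen for the example, not rows from the wine dataset):

```python
import pandas as pd

# Rows 2 and 4 repeat rows 0 and 1 respectively.
df = pd.DataFrame({
    "pH": [3.0, 3.2, 3.0, 3.4, 3.2],
    "quality": [6, 5, 6, 7, 5],
})

# Boolean Series: True where a row repeats an earlier row.
dup_mask = df.duplicated()

# Summing the booleans counts the duplicate rows.
n_dup = int(dup_mask.sum())
print(dup_mask.tolist())
print(n_dup)
```

By default the first occurrence of a row is not flagged; only its later repeats are.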

Table 1: Dup Data series for wine quality

drop_duplicates():
It is a predefined function and is used to drop duplicates from dataset.
It returns DataFrame with duplicate rows removed.
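
A short sketch of drop_duplicates() on made-up data shows that it returns a new DataFrame and leaves the original untouched:

```python
import pandas as pd

# Row 2 is an exact repeat of row 0 (values are illustrative).
raw = pd.DataFrame({
    "pH": [3.0, 3.2, 3.0, 3.4],
    "quality": [6, 5, 6, 7],
})

# Returns a new DataFrame; `raw` itself is not modified.
clean = raw.drop_duplicates()

print(raw.shape)
print(clean.shape)
print(list(clean.index))
```

The surviving rows keep their original index labels, so the index has a gap where the duplicate was removed.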

head():
This method returns the first n rows of the dataset. The default value of n is 5.

Table 2: Head elements of white wine quality dataset

corr():
It is used to find the pairwise correlation of all columns in the data frame. Any null values
are automatically excluded, and columns with a non-numeric data type are
ignored.
If the correlation between two attributes is close to one, one of them should be
removed because they carry almost the same information. From the table below we
can see that density and residual sugar are strongly correlated, so we dropped the density
attribute.
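
One possible way to screen a correlation matrix for such pairs is sketched below. The columns a, b and c and the 0.9 threshold are assumptions made for illustration; the report itself does not state a threshold.

```python
import pandas as pd

# Made-up data: b = 2*a + 1, so a and b are perfectly correlated; c is unrelated.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [3.0, 5.0, 7.0, 9.0, 11.0],
    "c": [2.0, 9.0, 1.0, 7.0, 3.0],
})

corr = df.corr()   # pairwise Pearson correlation of all numeric columns
threshold = 0.9    # assumed cut-off for "highly correlated"

# Walk the upper triangle so each pair is reported once.
cols = corr.columns
high_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]
print(high_pairs)
```

For each reported pair, one of the two columns can then be dropped before fitting the model.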

Table 3: Correlation between the attributes of wine dataset

train,test=train_test_split(argument one,argument two):

We import this function from sklearn.model_selection. It splits our data
into train and test sets based on the splitting fraction we provide. Here, argument one is
the data that we want to split and argument two is test_size, i.e. the fraction of the data
placed in the test set.
We split our dataset 20% / 80%: the test sample contains 793 instances and the
train sample contains 3168 instances, as shown in the table below.
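
The reported sizes can be reproduced on a placeholder DataFrame with 3961 rows (793 + 3168). The column contents and the random_state value are assumptions added for the example; the project code does not fix a random seed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the de-duplicated dataset.
n_rows = 793 + 3168
data = pd.DataFrame({
    "feature": np.arange(n_rows),
    "quality": np.arange(n_rows) % 10,
})

# test_size=0.2 puts 20% of the rows (rounded up) in the test set.
# random_state is an assumption here, used only to make the split repeatable.
train, test = train_test_split(data, test_size=0.2, random_state=0)
print(len(train), len(test))
```

Every row lands in exactly one of the two parts, so the model is never tested on rows it was trained on.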

Table 4 : Test and Train dataset instances and attributes.

LinearRegression():
We use the LinearRegression algorithm to build our model. LinearRegression is a predefined
class in the linear_model module. So, we imported the linear_model module from the sklearn
library to use it.

fit(argument one,argument two):

The fit method trains our algorithm on the train dataset. Here, argument one is the
independent attributes of the train dataset and argument two is its dependent
attribute.
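
A minimal sketch of fit() on synthetic data follows. The column names x1, x2 and target and the generating formula are hypothetical; the target is built without noise so that the fitted coefficients can be checked against the formula.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: target = 3.0 + 1.5*x1 - 2.0*x2 (no noise).
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "x1": rng.uniform(0, 10, 50),
    "x2": rng.uniform(0, 5, 50),
})
frame["target"] = 3.0 + 1.5 * frame["x1"] - 2.0 * frame["x2"]

X = frame[["x1", "x2"]]   # independent attributes (argument one)
y = frame["target"]       # dependent attribute (argument two)

reg = LinearRegression()
reg.fit(X, y)

# coef_ holds one coefficient per column of X; intercept_ is the Y-intercept.
print(reg.coef_, reg.intercept_)
```

On real data such as the wine dataset the relationship is not exact, so the fitted coefficients are least-squares estimates rather than an exact recovery.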

predict(argument):
This method predicts the output for an unknown sample. Here, argument is the
array of attributes of the unknown sample.
The predicted quality for an unknown sample with inputs
[12,0.5,1.4,56,0.2,205,380,3.1,0.3,7.1] is shown below.
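
predict() expects a 2-D array with one row per sample, which is why a single sample is reshaped with reshape(1,-1) before prediction. The sketch below shows this on a small synthetic model with two attributes; the data and the generating formula are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data following y = 1 + 2*x1 + 3*x2 exactly.
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_train = 1.0 + 2.0 * X_train[:, 0] + 3.0 * X_train[:, 1]
reg = LinearRegression().fit(X_train, y_train)

sample = np.array([3.0, 2.0])       # 1-D array, shape (2,)
sample_2d = sample.reshape(1, -1)   # 2-D array, shape (1, 2): one row, two attributes

# predict returns an array with one value per row of the input.
predicted = reg.predict(sample_2d)
print(predicted[0])
```

The result is an array even for one sample, so predicted[0] is the single predicted value.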

Figure 6: Input (attributes of unknown sample)

Figure 7: Output (quality of unknown sample)

4.2 Code skeleton/Algorithms:

This program predicts the wine quality based on the 10 given inputs

Function: raw_data=pd.read_csv(file path)
print shape of raw_data
Function: raw_data.duplicated()
print number of duplicates
Function: data=raw_data.drop_duplicates()
print shape of data
check for null values
Function: data_corr=data.corr()
print the correlated data
Function: train,test=tts(argument one,argument two)
y is quality attribute of train
X is remaining attributes of train
Function: reg=linear_model.LinearRegression()
reg.fit(argument one,argument two)
y_train_predict=reg.predict(argument)
print the root mean square error between y_train_predict and y
y_test is quality attribute of test
X_test is the remaining attributes of test
y_test_predict=reg.predict(argument)
print the root mean square error between y_test and y_test_predict
print prompt “enter 10 attributes”
store them in array as arr
reg.predict(arr)
print the predicted quality
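
The root mean square error used in the skeleton is the square root of the mean squared error. A small sketch with made-up true and predicted values, computed once with sklearn and once by hand:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted quality values (errors are 1, 0 and -2).
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# RMSE as in the project code: square root of sklearn's mean squared error.
rmse = mean_squared_error(y_true, y_pred) ** 0.5

# The same quantity written out with NumPy.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(round(rmse, 4), round(rmse_manual, 4))
```

Computing it on the train set gives the in-sample error and on the test set the out-of-sample error; a much larger out-of-sample value would suggest overfitting.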

5. Testing:

5.1 Screenshots:
Testing screenshot 1:

Testing screenshot 2:

Testing screenshot 3 (Result):

6. Conclusions and future enhancement
Interest in the wine industry has increased in recent years, which demands
growth in this industry. Therefore, companies are investing in new technologies to improve
wine production and selling. In this direction, wine quality certification plays a very
important role for both processes, and it requires wine testing by human experts. Our
project explores the use of a machine learning technique, linear regression. It determines
the important features for prediction. The experiments show that the value of the dependent
variable can be predicted more accurately if only the important features are considered in
prediction rather than all features. In future, a larger dataset can be used for
experiments and other machine learning techniques may be explored for wine quality
prediction.

References/bibliography
https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/
https://github.com/auxa/Wine-Quality-Prediction
https://scikit-learn.org/stable/
https://www.geeksforgeeks.org/linear-regression-python-implementation/
https://docs.python.org/3/

Appendix

Steps to install Anaconda and launch Spyder (Windows)
1. Open https://www.anaconda.com/distribution/ in a web browser.
2. Download the Anaconda installer for Windows (select the Python version) and run it.
3. To start Spyder, first open Anaconda Navigator (you will find Anaconda Navigator in
the Start menu).
4. Then, click the Launch button below the Spyder icon on the Navigator Home tab.

Figure 8: Anaconda Navigator

Code:

"""
Prediction of wine quality using machine learning in python

"""

import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

#reading dataset
raw_data=pd.read_csv('file:///C:/Users/Lathasree Reddy/Downloads/winequality-white.csv',sep=';')
print("Shape of raw data :",raw_data.shape,"\n")

#identifying duplicates
dup_data=raw_data.duplicated()
print("Number of duplicate rows =",sum(dup_data),"\n")

#removing duplicates
data=raw_data.drop_duplicates()
print("Shape of data after removing duplicates :",data.shape,"\n")

#renaming columns so that the names use underscores instead of spaces
#(assigning the result avoids modifying a copy in place)
data=data.rename(columns={'fixed acidity':'fixed_acidity',
                          'volatile acidity':'volatile_acidity',
                          'citric acid':'citric_acid',
                          'residual sugar':'residual_sugar',
                          'free sulfur dioxide':'free_sulfur_dioxide',
                          'total sulfur dioxide':'total_sulfur_dioxide'})

#printing first 5 data instances
print(data.head())
print()

#checking for missing values
print(data.isnull().sum())
print()
#there are no missing values

#data set description
data_info=data.describe()
print(data_info)

#checking the correlation between attributes
data_corr=data.corr()
print(data_corr)

#splitting the data set
train,test=tts(data,test_size=0.2)

y=train['quality']
#density is left out of the independent attributes
cols=["fixed_acidity","volatile_acidity","citric_acid","residual_sugar","chlorides",
      "free_sulfur_dioxide","total_sulfur_dioxide","pH","sulphates","alcohol"]
X=train[cols]

#model formation
reg=linear_model.LinearRegression()
model=reg.fit(X,y)

coef=reg.coef_
print("Coefficients of the linear equation : \n",coef,"\n")
intercept=reg.intercept_
print("Y-intercept :",intercept)
print()

y_train_pred=reg.predict(X)
print("In sample Root mean square error: %.2f"%mean_squared_error(y,y_train_pred)**0.5)
print()

y_test=test['quality']
X_test=test[cols]

y_test_pred=reg.predict(X_test)
print("Out sample Root mean square error: %.2f"%mean_squared_error(y_test,y_test_pred)**0.5)
print()

#unknown sample
import numpy as np
a=np.array([12,0.5,1.4,56,0.2,205,380,3.1,0.3,7.1]).reshape(1,-1)
quality1=reg.predict(a)
print(quality1)

#unknown sample from user
#each entry is (attribute name, suggested input range), in the same order as cols
prompts=[("fixed acidity","3 to 15"),
         ("volatile acidity","0 to 1"),
         ("citric acid","0 to 2"),
         ("residual sugar","0 to 100"),
         ("chlorides","0 to 0.5"),
         ("free sulphur dioxide","0 to 300"),
         ("total sulphur dioxide","0 to 500"),
         ("pH","2 to 4"),
         ("sulphates","0 to 1"),
         ("alcohol","5 to 15")]
arr=[]
for name,value_range in prompts:
    x=float(input("Enter the value of %s(range(%s))"%(name,value_range)))
    arr.append(x)

#predict expects one row per sample, so reshape to a single-row 2-D array
ar=np.asarray(arr).reshape(1,-1)

quality2=reg.predict(ar)
print()
print("Quality of your wine sample is",quality2)
