04-Chapter 1

CHAPTER 1
INTRODUCTION
1.1 BACKGROUND
The availability of massive data has made raw data seldom of direct use. Manual analysis
lags behind the fast accumulation of large databases. Knowledge Discovery and Data
Mining (KDD) is a fast-emerging field that draws on databases, statistics and
machine learning to address this gap. The object of KDD is to turn raw data into a
simple, understandable form that supports analysis. KDD has emerged in the present
competitive world to support the development of science, technology and finance.
Matheus et al. (1993) describe KDD as the nontrivial process of identifying valid,
novel, potentially useful and ultimately understandable patterns in data. It includes
data selection, processing, data mining, interpretation and evaluation. The chief
issue in scaling down data is to select the pertinent data and present it to a data
mining algorithm. Data mining (Chen et al. 1996) (Kriegel et al. 2007) helps to reduce
information overload and supports decision making. It does so by extracting and
refining useful knowledge through a search for relationships and patterns in the
large volumes of data collected by researchers and organizations. The extracted
information is used for prediction, classification and summarization of the data.
Many industries use data mining techniques for classification and pattern
recognition, such as rule induction, neural networks, genetic algorithms and fuzzy
logic. Models or patterns are derived from relationships and summaries using these
techniques. Data mining is a secondary data analysis: the data may have been
collected for a specific purpose or without any particular purpose. In many cases,
some data may be drawn from a general mass of collected data to serve a different
objective. For example, price fluctuation may be correlated with the money supply in
circulation: although the money supply data were collected for a different purpose,
they are used to study price fluctuation. This shows the importance of a reliable
database in a discernible form that can serve many uses, and hence the importance of
data mining and data analysis.
Statistics is a science in which the researcher often deals with the problem of
finding the smallest data size that yields sufficiently confident estimates. Data
mining deals with the opposite problem, namely building a data model that is small
yet significant from data that is large. The core of data mining is developing a
model of the data that is easy to understand and use, which is why it has become
significant in the modern digital world. Fayyad et al. (1996) developed the Knowledge
Discovery Process (KDP) model consisting of nine steps, of which the seventh step is
data mining. The utility of this step is that it yields a representation based on
classification rules, decision trees, trends and regression models. Although the
authors list it as the seventh step, it is not low in priority; it is the very basis
of data analysis.
Machine learning is a subfield of computer science (soft computing) that evolved
from the study of pattern recognition and computational learning theory. Machine
learning borders on data mining; the latter focuses more on exploratory data analysis
and is sometimes described as unsupervised learning. Machine learning is a field of
study that gives computers the capacity to learn without being explicitly programmed.
It covers the study and construction of algorithms that can learn from data and make
predictions on data. These algorithms operate by building a model from example inputs
in order to make data-driven decisions or predictions. Machine learning is closely
associated with computational statistics, which focuses on prediction making with the
use of computers. It also has close ties with mathematical optimization, which
contributes methods, theory and application domains to the field. Machine learning is
used where designing and programming explicit algorithms is not feasible.
In the field of data analysis, machine learning is a method that lends itself to
prediction in commercial use; hence it is also known as predictive analytics. Such
analysis helps researchers, engineers and data analysts to reach reliable decisions
and results. By learning from historical relationships and trends in the data, it
brings out hidden insights.
In machine learning, classification is the problem of identifying to which of a set
of categories a new observation belongs. It is based on a training data set
containing observations whose category membership is known. Classification is
significant in pattern recognition. Classification is an instance of supervised
learning, whereas clustering is unsupervised. In supervised learning, the training
data consists of pairs of input data (vectors) and desired outputs; in unsupervised
learning, there is no a priori output. A classifier is an algorithm that performs
classification, particularly in a concrete implementation. The term also sometimes
refers to the mathematical function implemented by a classification algorithm. Three
terms are important in machine learning: instances, features and classes. Instances
are the observations, features are the explanatory variables and classes are the
possible categories to be predicted.
There are two kinds of classifier, viz., linear and non-linear. A linear classifier
makes its classification decision based on the value of a linear combination of the
features. Linear classifiers help to solve practical problems such as document
classification and are well suited to problems with many variables (features). A
non-linear classifier generally takes more time to train and apply than a linear one.
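The idea of deciding a class from a linear combination of features can be sketched
in a few lines of Python. This is only an illustration: the function name
linear_classify, the two-feature input and the weight values are hypothetical and
are not taken from the thesis.

```python
def linear_classify(features, weights, bias=0.0):
    """Return class 1 if the weighted sum of the features is positive, else class 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


# Hypothetical weights for a two-feature problem (illustrative values only).
weights = [0.8, -0.5]
print(linear_classify([2.0, 1.0], weights))  # score = 1.6 - 0.5 = 1.1, so class 1
print(linear_classify([0.5, 2.0], weights))  # score = 0.4 - 1.0 = -0.6, so class 0
```

In practice the weights are learned from the training data rather than set by hand.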
Classification has various uses. In the medical field, child mortality, the nature of
an ailment, the period of recovery and the survival rate may be predicted from the
characteristics of the patients using classification. Although the examples given
here are from the medical field, classification can be applied in many other domains.
Ensemble data mining methods are sometimes referred to as committee methods. They
are machine learning methods that leverage the power of multiple classifiers to
achieve better prediction accuracy than an individual classifier. The fundamental
object of an ensemble model is to improve prediction accuracy. A committee consisting
of a few persons is a good analogy for an ensemble model. Each member thinks
individually and contributes to the decision, so the composition of the committee is
significant. If all members are 'yes men' or 'no men', the committee is of no use.
Members should maintain their individuality, agree where they have common ground and
differ where they see fit. Only then will the committee function effectively and
purposefully. In the same way, the resulting classifier (ensemble model) is a
combination of different classifiers that together give a good result. Many
researchers (Dietterich 2000) (Grove and Schuurmans 1998) have shown that an ensemble
model is often better than any single model.
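The committee analogy can be made concrete with a minimal majority-voting sketch.
The labels and the three hypothetical members below are invented for illustration;
each member is wrong on a different instance, so the vote corrects every individual
mistake, which is the benefit of members that "differ where they see fit".

```python
from collections import Counter


def majority_vote(predictions):
    """Combine the class labels predicted by several classifiers into one label."""
    label, _count = Counter(predictions).most_common(1)[0]
    return label


# Illustrative data: true labels and the outputs of three hypothetical members.
truth = [1, 0, 1, 1]
member_preds = [
    [0, 0, 1, 1],  # wrong on instance 0
    [1, 1, 1, 1],  # wrong on instance 1
    [1, 0, 0, 1],  # wrong on instance 2
]

# Vote instance by instance across the members.
ensemble = [majority_vote(votes) for votes in zip(*member_preds)]
print(ensemble == truth)  # True: the vote corrects each member's single error
```

If all three members made the same mistake, the vote would repeat it, which is why
the composition of the committee matters.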
Two often-used methods for obtaining accurate ensembles are Bagging (Breiman 1996)
and Boosting (Freund and Schapire 1995). These methods use 'resampling' techniques
to obtain a different training set for each of the classifiers. Besides these,
evolutionary approaches (using the individuals in a population) and multi-objective
approaches (members of different complexity) can also be used to create ensembles.
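As a rough sketch of the resampling idea behind Bagging, the following Python code
draws several bootstrap training sets (sampling with replacement) from one data set.
The function names and the toy data are assumptions made for illustration only.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def make_training_sets(data, n_sets, seed=0):
    """Build one bootstrap training set per classifier in the ensemble."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [bootstrap_sample(data, rng) for _ in range(n_sets)]


# Toy data set of ten instance identifiers (illustrative only).
data = list(range(10))
for training_set in make_training_sets(data, n_sets=3):
    print(sorted(training_set))
```

Each classifier is then trained on its own set and the predictions are combined,
for example by voting.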
A time series (Chatfield 2016) is a set of observations taken at specified times,
usually at equal intervals. Mathematically, a time series is defined by the values
A1, A2, ... of a variable A (temperature, closing price of a share, etc.) at times
X1, X2, .... Thus A is a function of X, symbolized by A = F(X). A time series
therefore consists of data arranged chronologically. If data relating to population,
per capita income, prices, production, etc., are recorded for the last 5, 10, 15 or
20 years or some other time period, the resulting series is called a time series.
Time series analysis (Guralnik and Srivastava 1999) (Esling and Agon 2012)
facilitates the extraction of meaningful statistics and other characteristics from
time series data. The analysis of time series is of great use to economists,
businessmen, scientists, geologists, sociologists, biologists, meteorologists and
research workers. It helps in understanding past behavior, planning future
operations, evaluating current accomplishments and making comparisons.
1.2 OBJECTIVES OF THE WORK
One of the most important tasks before economists, businessmen, meteorologists,
agriculturists, industries, government and planners is to make estimates for the
future. For example, a businessman is interested in finding out his likely sales in the
year 2018 or as a long term planning in the year 2025 so that he could adjust his
production accordingly and avoid the possibility of either unsold stocks or inadequate
production to meet the demand. Similarly, an economist is interested in estimating
the likely production in the coming year so that the proper planning can be carried
out with regard to food supply, jobs for the people, control of inflation, etc. The
incumbent Governor of the Reserve Bank of India (RBI) may be concerned with
containing the prevailing inflation or the movement of inflation over the next five
years. Likewise, maintaining the interest rate in the short term and its limits of
variation in the future can be a subject of enquiry. The researcher feels that the
model he develops will be highly useful in this regard.
The first step in making estimates for the future consists of gathering information
from the past. In this connection one usually deals with statistical data which are
collected, observed or recorded at successive intervals of time. Such data are
generally referred to as time series. Hence in the analysis of time series, time is the
most important factor because the variable is related to time which may be either
year, month, week, day, hour or even minutes or seconds.
The role of the computer specialist is to suggest a suitable model that combines more
than one classifier (an ensemble). The researcher attempts to arrive at a suitable
model in order to analyze different real-time situations. The approach may be useful
for future research under different situations. If the researcher succeeds, it will
provide a direct approach for further development in the field. Needless to say, the
scope of the present study can be extended to a variety of fields.
The objectives of the present enquiry relate to a few real-time problems of great
importance, taken for consideration in this study as examples. They are as follows:
• Predicting stock market movement
• Gauging possible rainfall
• Estimating crop production
Although these are only a few examples, in real-time situations various problems may
arise that warrant suitable prediction. The objectives are also motivated by the
availability of sophisticated machine learning methods such as Support Vector Machine
(SVM) and Naive Bayes (NB), which are extensively used for data classification and
interpretation. This led the researcher to propose ensemble models, namely AdaSVM
and AdaNaive Bayes, in order to make finer predictions. The objective of the ensemble
model is to improve the accuracy of classification by making use of more than one
classifier. The model helps decision making by combining the results of
classification techniques. The Boosting method improves the accuracy of a given
algorithm. One such boosting algorithm is AdaBoost (Freund and Schapire 1995), which
is applied by the researcher in this study. The statistical significance of the
results obtained is analyzed and discussed using ANOVA and Bonferroni methods.
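To indicate how boosting reweights the training examples, a minimal AdaBoost sketch
is given below. It is not the AdaSVM or AdaNaive Bayes model of this thesis: for
brevity it assumes one-dimensional inputs, labels of +1 and -1, and simple threshold
rules (decision stumps) as the base classifier. All function names and the toy data
are hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """Threshold rule: predict polarity when x >= threshold, else the opposite label."""
    return polarity if x >= threshold else -polarity


def train_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for w, x, y in zip(weights, xs, ys)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Train a weighted committee of stumps; returns (alpha, threshold, polarity) triples."""
    n = len(xs)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        error, threshold, polarity = train_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # vote strength of this stump
        model.append((alpha, threshold, polarity))
        # Increase the weight of misclassified examples, decrease the others.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in model)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate (illustrative only).
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

On this toy data the best single stump misclassifies two of the six points, while
the weighted vote of three stumps classifies all six correctly.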
1.3 MOTIVATION OF THE WORK
Any researcher should have a motivation for choosing an area of enquiry, and research
that cannot be used in real situations is futile. The advent of computers and modern
technology has opened new vistas for research. Right from school days, the
researcher was fascinated by the computer and its applications. When the time came
for higher learning and research, the researcher was drawn to finding an easy-to-use
prediction model. This is the factor that motivated the researcher to develop an
ensemble model.
The stock market (Eun and Shim 1989) (Day and Lewis 1992) is both significant and
sensitive. Even members of the general public with an interest in investing and a
little computer knowledge browse stock prices and buy and sell using their DEMAT
accounts. With the emergence and expansion of the corporate world, stocks and shares
of large magnitude have emerged. With investors small and big wishing to buy and sell
securities, stock exchanges have become an important financial market. Government
policies and budgets are reflected in the movement of the stock exchange. Investment
has become global, so much so that the movement of stock prices in one country is
quickly reflected in the share prices of other countries.
Investors closely watch stock movement (French 1987) (Guiso et al. 2008) and purchase
or sell shares on this basis. The stock market may be bullish or bearish. The fall in
share prices in America in the last decade was immediately reflected in all other
countries. Monetary authorities should closely watch for economic recession not only
in their own country but in other countries as well. The mass of data on listed
securities should be closely studied, and hence a suitable model is needed to study
the shares of various sectors.
So long as capital and capital markets exist, as they will, there will remain a need
to study the huge volume of share market prices in a presentable form, for which the
proposed ensemble model is suitable. The present work attempts this and presents an
approach to one of the problems taken for study and scrutiny, viz., stock movement
(Huang and Stoll 1994) (Enke and Thawornwong 2005) (Chen 2009).
Rainfall is a gift of nature and an indispensable input to the world and mankind. The
famous Tamil poet Tiruvalluvar says that the world cannot sustain itself without
rain, and rain is a prerequisite for agricultural production. The world revolves
around the agriculturist. Water gives life to human beings, and without it they will
die. There is no artificial method of creating rainfall. Rainfall (Krajewski and
Smith 2002) (Toth et al. 2000) (Washington and Downing 1999) fails in some years, and
the failure is not foreseen. In the absence of rainfall, rivers and channels dry up.
Men, cattle and other animals become desperate and die for want of water sources.
Wild animals move to nearby towns in search of water and create panic and terror in
habitations. Successive failures of rainfall make ground water scarce.
India is a land of villages. Agriculture is the chief occupation of more than 60% of
the population. Human beings are also to blame: deforestation and pollution of the
surroundings have contributed to failures of rainfall. In recent years, the
government has taken a few positive steps towards afforestation and developing social
forests. The recurrent failure of rainfall and the drying up of water sources is one
of the foremost problems of the economy, as industries depend on agriculture. If
rainfall could be predicted sufficiently early, the government, the agricultural
department and agriculturists could be well prepared to meet the impending difficult
days. The government should always give priority to water conservation and the
preservation of underground water. This vital problem led the researcher to seek a
suitable model for rainfall prediction. Besides, heavy rainfall in some years leads
to floods (Lobell et al. 2008), with their trail of loss of life and standing crops
and depletion of soil fertility. This reinforces the need for rainfall forecasting,
which also motivated the study of this area.
Crop is a general term that includes food crops (Tilman et al. 2011) as well as
commercial crops. Food crops include paddy, wheat, maize, grams and millets.
Commercial crops include cotton, sugarcane, groundnut, tobacco and cashew. Both are
important: food crops for human consumption, and commercial crops for industrial
purposes and for earning money. A study on the estimation and forecasting of crop
production is highly useful for economic growth. For example, an estimate of oil seed
production can forewarn the government of the imports needed and the consequent
requirement of foreign exchange.
On a piece of land, a single crop may be raised once, or the same crop may be raised
more than once, depending on the suitability of the land and weather conditions. In
some fertile wet lands, a short-term crop and a long-term crop (Monteith and Moss
1977) may be grown. Two different kinds of crop may also be grown on the same land.
In some states in India, Kharif and Rabi crops are grown. Kharif crops are monsoon
crops (Lobell and Burke 2010): domesticated plants cultivated and harvested during
the monsoon season. Millet and rice are the main Kharif crops. Rabi crops are
agricultural crops sown in winter and harvested in the spring. The major Rabi crops
in India are wheat, barley, mustard, sesame and peas. In order to maintain the
fertility of the soil, rotation of crops may be followed, as in the delta districts
of Tamil Nadu. After paddy is cultivated and harvested, grams are raised. They are of
short duration and capable of replenishing the lost properties of the soil. This also
helps agriculturists to earn more.
Mixed cropping or mixed farming is a concept in which more than one crop may be
grown. For example, paddy may be raised and, after harvest, cattle rearing may be
undertaken to rejuvenate the soil and make it fit for the next cultivation. The
explanations above bring out the different types of crop and their importance. Crop
selection (Lobell et al. 2011) and raising depend on previous experience and the
fertility of the soil. The researcher felt that trends in crop production may help to
forecast the types of crop that should be chosen for different seasons and different
regions. In an agricultural country with variation in fertility and weather
conditions, few things are more significant than studying the prediction of crop
production. The researcher hails from an agricultural family, and this fascination
with agriculture motivated the selection of the area. The possible benefit of
forecasting is illustrated in Table 1.1.
Table 1.1 Prospective Benefit of Forecasting

ENVIRONMENTAL   WITHOUT PREDICTION            WITH PREDICTION
INFLUENCE
STOCK           • May result in bullish or    • Avoid or be prepared to face bearish situation
                  bearish situation           • Make the most out of the bullish situation
RAINFALL        • May result in drought or    • Avoid or be prepared to face drought situation
                  flood situation             • Save life and avoid property damage in flood situation
CROP YIELD      • May result in less or no    • Avoid or be prepared to face no yield
                  yield                       • Take measures to increase the yield
1.4 SCOPE OF THE WORK
The ensemble model (Kotsiantis et al. 2010) (Pierro et al. 2016) in the present work
is applied to the analysis of time series using machine learning techniques. At the
outset, a time series is formed by collecting a set of observations made sequentially
over a period of time. Examples of such series can be given from different areas,
extending from economics to engineering. Such series are subject to further
scrutiny.
Some examples of time series in certain areas are relevant at this stage. They are
Economics and Financial time series, Physical time series, Marketing time series,
Demographic time series, Process control data and Binary process.
1.4.1 Economics and Financial Time Series
Daily share prices, exports and imports of commodities by variety and value over a
period of time, monthly wholesale and retail prices, monthly household income and
expenditure, and monthly company turnover and profits are a few examples of useful
time series in the economic domain.
1.4.2 Physical Time Series
In physical sciences many instances of time series occur, for example in meteorology,
marine science, earth science and the study of earthquakes and tremors. Rainfall on
successive days of a season, temperature over a few months and flood situations in a
season are some examples. Many phenomena occur in a rhythm, and their time series can
be used to identify the areas prone to earthquakes.
In time series studies, some mechanical recorders take measurements continuously and
give a continuous trace rather than observations at discrete intervals. In some
laboratories, continuous observation of temperature and humidity is essential, and
equipment is kept to measure these variables around the clock. When the trace goes
outside pre-specified limits, action is initiated. In some situations, visual
examination of the trace may be enough. For a detailed analysis, however, the trace
should be converted to a discrete time series by sampling it at appropriate equal
intervals of time.
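Converting a continuous trace into a discrete time series can be sketched as follows.
The trace is modelled here as a Python function of time; the linear temperature
function and the six-hour sampling interval are assumptions chosen for illustration.

```python
def sample_trace(trace, start, stop, step):
    """Evaluate a continuous trace (a function of time) at equally spaced times."""
    n_steps = int(round((stop - start) / step))
    times = [start + i * step for i in range(n_steps + 1)]
    return [(t, trace(t)) for t in times]


def temperature(t_hours):
    """Hypothetical trace: temperature rising by 0.5 degrees per hour from 20 degrees."""
    return 20.0 + 0.5 * t_hours


# Sample the 24-hour trace every 6 hours.
series = sample_trace(temperature, start=0, stop=24, step=6)
print(series)  # [(0, 20.0), (6, 23.0), (12, 26.0), (18, 29.0), (24, 32.0)]
```

The choice of sampling interval decides how much detail of the trace survives in the
discrete series.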
1.4.3 Marketing Time Series
Marketing is an important segment of business where time series analysis is
significant. Sales figures in successive weeks and months, monetary receipts,
advertisement costs, new markets obtained, lost areas of market are only a few
variables warranting time series studies. This would be a useful guide for future
action for increasing receipts and reducing expenditure.
1.4.4 Demographic Time Series
In the study of population growth, various time series are in use. The increase in
population by month or by year is studied in a few countries. Child mortality rates
and longevity curves are also studied as time series.
1.4.5 Process Control Data
In manufacturing, quality is important. It is monitored by measuring a variable that
reveals the quality of the process. These measurements can be plotted against time.
When the measurements deviate too far from the target values, corrective measures
should be taken to control the process.
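A minimal sketch of such a check is shown below. The target value, the tolerance and
the readings are hypothetical; the function simply reports which measurements deviate
from the target by more than the allowed tolerance.

```python
def out_of_control(measurements, target, tolerance):
    """Return (index, value) pairs whose deviation from the target exceeds the tolerance."""
    return [(i, m) for i, m in enumerate(measurements) if abs(m - target) > tolerance]


# Illustrative readings of a quality variable taken at successive times.
readings = [10.1, 9.9, 10.0, 11.2, 10.05, 8.7]
print(out_of_control(readings, target=10.0, tolerance=0.5))  # [(3, 11.2), (5, 8.7)]
```

The reported indices mark the times at which control measures would be triggered.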
1.4.6 Binary Process
A special type of time series arises when observations can take only one of two
values, viz., 0 and 1. To quote a familiar example from computer science, the
position of a switch, either 'on' or 'off', can be recorded as one or zero
respectively. Binary processes occur in several situations, including the study of
communication theory.
Time series are analyzed for several purposes, such as description, explanation,
prediction and control.
• Description: When presented with a time series, the initial step in the analysis is
plotting the observations against time to give what is known as a time plot.
Simple descriptive measures of the main properties of the series are then
obtained. For example, these measures may show a regular seasonal effect, with
sales higher in winter and lower in summer. The time plot can also reveal that
annual sales are increasing, that is, there is an upward trend. A very basic
model that describes trend and seasonal variation may be perfectly adequate to
describe the variation in the time series.
• Explanation: When observations are recorded on two or more variables, it
is possible to use the variation in one time series to explain the
variation in another series. This gives a deeper understanding of the
mechanism that generated a given time series. A linear system converts an
input series to an output series through a linear operation. Given observations
on the input and output of a linear system, the analyst can assess the
properties of the system. For example, one may study how sea level is affected
by temperature and pressure, or how sales are affected by price and economic
conditions. Figure 1.1 represents a linear system.
Fig 1.1 Schematic Representation of a Linear System
• Prediction: From an observed time series, a researcher may wish to predict
the future values of the series. For example, in any trading business sales
forecasting is an important requirement; likewise, the raw materials required
for future demand and the workforce needed for future production can be
predicted on the basis of time series of demand.
• Control: Control is an important aspect of industrial production. Both
quantitative and qualitative controls are significant. There should be neither
over production nor under production. Over production may lead to a glut in
the market, and under production may lead customers to search for alternatives
in the market. So to maintain steady demand and growth, time series study is
needed.
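The description and prediction purposes listed above can be illustrated with a small
sketch. The quarterly sales figures are invented, and the seasonal-naive forecast
(repeat the value observed one period earlier) is only a simple baseline, not the
ensemble method proposed in this thesis.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive observations (smooths out the seasons)."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def seasonal_means(values, period):
    """Mean of the observations at each position within the period."""
    return [sum(values[k::period]) / len(values[k::period]) for k in range(period)]


def seasonal_naive_forecast(values, period):
    """Forecast the next value as the value observed one full period earlier."""
    return values[-period]


# Hypothetical quarterly sales over two years (illustrative only).
sales = [10, 20, 30, 40, 14, 24, 34, 44]

print(moving_average(sales, window=4))            # [25.0, 26.0, 27.0, 28.0, 29.0]: upward trend
print(seasonal_means(sales, period=4))            # [12.0, 22.0, 32.0, 42.0]: seasonal pattern
print(seasonal_naive_forecast(sales, period=4))   # 14
```

The moving average exposes the trend, the seasonal means expose the regular
within-year pattern, and the naive forecast gives a baseline against which a more
elaborate model can be compared.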
All the above examples and statements bring out the vital need for time series
analysis in different situations. Time series analysis, significant in itself, is
strengthened in this work with ensemble models using machine learning techniques.
This is the core of the scope of this thesis.
Machine learning (Goldberg et al. 1988) research explores new ways of instructing
computers, which helps to ease the burden of hand-programming increasingly voluminous
and complex tasks. Besides, the fast expansion of applications and the availability
of computers make this possibility both feasible and desirable. Classification
problems are well suited to supervised learning (Caruana and Niculescu-Mizil 2006).
In recent times, researchers have become interested in producing models of small
expected loss by building and aggregating multiple individual models. Ensemble
methods (Hassan et al. 2007) (Contiu and Groza 2016) (Bhardwaj et al. 2016) very
often outperform single classifiers, and the increase in available computing power
has made their application feasible for large data sets. According to Hansen et al.
(1990), a necessary and sufficient condition for an ensemble of classifiers to be
more accurate than any of its individual members is that the individual classifiers
are accurate and diverse. Ensemble learning generates multiple models. The ensemble
passes each new example to each of the base models, obtains their predictions and
combines them in some appropriate manner, such as averaging or voting. The present
work provides a statistical ensemble procedure in order to make predictions more
accurate. It may be seen that the scope of the work is very wide and deep.
1.5 ORGANIZATION OF THE THESIS
The entire work is presented in 5 chapters. Chapter 1 is the introduction, which
outlines the background, the importance of the study, and machine learning methods
and their significance. Chapter 2 gives an in-depth review of the literature leading
to the present study. Chapter 3 deals with the methodology adopted for the study.
Chapter 4 presents the results and discussion from the various angles of the study.
Chapter 5 gives the conclusion and utility for future use.
1.6 SUMMARY
In this chapter, the basic idea behind data mining techniques and time series data
analysis is elaborated. The main objectives of this work are given in detail. The
factors that influenced this study have been described in the motivation. Various
time series data available for analysis are discussed in the scope. The structure of
the entire thesis is given in the organization of the thesis.