04-Chapter 1
INTRODUCTION
1.1 BACKGROUND
The availability of massive data sets means that raw data is seldom of direct use,
and manual analysis lags behind the rapid accumulation of large databases. Knowledge
Discovery and Data Mining (KDD) is a fast-emerging field that draws on databases,
statistics and machine learning to help close this gap. The object of KDD is to turn
raw data into a simple, understandable form that supports analysis. KDD has emerged
in the present competitive world to support the development of science, technology
and finance. Matheus et al. (1993) have described KDD as the nontrivial process
evaluation are included in it. The chief issue in scaling down data is to select the
pertinent data and present it to a data mining algorithm. Data mining (Chen et al.
1996) (Kriegel et al. 2007) helps reduce information overload and supports the
search for relationships and patterns in the large volumes of data collected by
researchers and organizations. The extracted information is used for classification
and pattern recognition through data mining techniques such as rule induction,
neural networks, genetic algorithms and fuzzy logic. The models or patterns are
derived from relationships and summaries using these techniques. Data mining is a
secondary data analysis: the data may have been collected for a specific purpose or
with no particular purpose at all. In many cases, some data may be drawn from a
general mass of collected data and used for a different objective. For example, price
fluctuation may be correlated with the circulation of money supply; although the
money supply data was collected for a different purpose, it is used here to study
price fluctuation. This underlines the importance of a reliable database in a
discernible form that can serve many uses. This shows the importance
Statistics is a science in which the researcher often faces the problem of finding
the smallest data size that yields sufficiently confident estimates. Data mining, in
contrast, deals with the opposite problem, namely building a data model that is
small yet significant from a data set that is large. The core of data mining is
developing a model of the data that is easy to understand and use, which is why it
has become significant in the modern digital world. Fayyad et al. (1996) developed
the Knowledge Discovery Process (KDP) model consisting of nine steps, of which the
seventh is data mining. The utility of this step is that it produces a representation
based on classification rules, decision trees, trends and regression models.
Though the authors place it seventh, it is not low in priority but is the very
computational learning theory. Machine learning overlaps with data mining, although
the latter focuses more on exploratory data analysis. It may also be known as
the capacity to learn without being explicitly programmed. Machine learning covers
the study and construction of algorithms that can learn from and make predictions
on data. These algorithms operate by creating a model or inputs in
use of computers. It has close ties with mathematical optimization, which yields
methods, theory and application domains. Where programming and designing is not
In the field of data analysis, machine learning is a device that lends itself to
analysis helps researchers, engineers and data analysts to reach reliable decisions
and results. Through learning from historical relationships and trends in the data, it
categories a new observation belongs. It is based on the training data set which
unsupervised learning. In the former, the training data consists of pairs of input data (vectors)
instances, features and classes. Instances are observations, features are explanatory
There are two types of classifier, viz., linear and non-linear. A linear
characteristics. The linear classifier helps to solve the practical problems like
variables (features). Non-linear classifier takes lesser time to train and apply.
Classification has various uses in general. In the medical field, predictions of
child mortality, nature of ailment, period of recovery and survival rate may be made
from the characteristics of patients using classification. Though the examples given here
methods. These are machine learning methods which increase the power of multiple
consisting of a few persons is a good example of an ensemble model. Each one will
significant. Everyone should think and act individually and yet also act in unison.
If all are 'yes men' or 'no men', the committee is of no use. Members should maintain
their individuality, agreeing on some points and differing where they see fit. Only
then will the functioning of the committee be effective and purposeful. Thus the resulting classifier
Many researchers (Dietterich 2000) (Grove and Schuurmans 1998) prove ensemble
Two frequently used methods for obtaining accurate ensembles are Bagging and Boosting,
methods use 'resampling' techniques to get different training sets for each of the
create ensembles.
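The resampling idea behind Bagging can be illustrated with a minimal Python sketch. It assumes bootstrap sampling (drawing with replacement) as the resampling scheme; the function names and the toy data are hypothetical and are not taken from the present study.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def make_training_sets(data, n_sets, seed=0):
    """Build one resampled training set per base classifier (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [bootstrap_sample(data, rng) for _ in range(n_sets)]

# Toy data: ten observations, three base classifiers.
observations = list(range(10))
training_sets = make_training_sets(observations, n_sets=3)
```

Each training set has the same size as the original data but generally contains repeated observations and omits others, which is what gives the base classifiers different views of the data.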
per capita income, prices, production, etc., for the last 5, 10, 15 or 20 years or
some other time period, the resulting series is called a time series. Time series
analysis (Guralnik and Srivastava 1999) (Esling and Agon 2012) facilitates the
analysis of time series data in order to extract meaningful statistics and other
characteristics of the data. The analysis of time series is of great use to economists, businessmen, scientists,
series analysis helps in understanding past behavior. It also helps in planning future
facilitates comparison.
1.2 OBJECTIVES OF THE WORK
future. For example, a businessman is interested in estimating his likely sales in
the year 2018 or, for long-term planning, in the year 2025, so that he can adjust his
production accordingly and avoid the possibility of either unsold stocks or inadequate
the likely production in the coming year so that the proper planning can be carried
out with regard to food supply, jobs for the people, control of inflation, etc. The
containing the prevailing inflation or inflation movement for the next five years.
Likewise, maintaining the interest rate in the short term and its limits of variation
in future can be a subject of enquiry. The researcher feels that the model he develops will be highly
The first step in making estimates for the future consists of gathering information
from the past. In this connection one usually deals with statistical data which are
generally referred to as time series. Hence in the analysis of time series, time is the
most important factor because the variable is related to time which may be either
The role of the computer specialist is to find a way of proposing a suitable model
that combines more than one classifier (an ensemble). The researcher attempts to
arrive at a suitable model in order to analyze different real-time situations. The
approach may be useful for future research under different situations. If the
researcher succeeds, it will offer a direct approach for further development in the
field. Needless to say, the scope of the present study can be extended to a variety
of fields.
The objectives of the present enquiry relate to a few real-time problems of great
importance, and only a few of these are considered in this study as examples. In
real-time situations, however, various other problems may arise that warrant
suitable prediction. The objective is also due to the availability
(SVM) and Naive Bayes (NB). These are extensively used for data classification and
interpretation. This also led the researcher to propose ensemble models such as
AdaSVM and AdaNaive Bayes in order to make finer predictions. The
making use of more than one classifier. This model helps decision making by
accuracy of the given algorithm is improved. One such boosting algorithm is
AdaBoost (Freund and Schapire 1995), which the researcher applies in this study.
The statistical significance of the results obtained is analyzed and discussed
using ANOVA and Bonferroni methods.
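As an illustration only, the reweighting scheme of AdaBoost may be sketched in Python as follows. The sketch assumes one-dimensional inputs, class labels of +1 and -1, and a simple threshold rule (decision stump) as the base classifier in place of the SVM and Naive Bayes classifiers considered in this study; all function names and the toy data are hypothetical.

```python
import math

def stump_predict(x, threshold, polarity):
    """Threshold rule: predict `polarity` when x >= threshold, else the opposite label."""
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump under the current weights."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Train `rounds` stumps, reweighting the examples after each round."""
    n = len(xs)
    weights = [1.0 / n] * n                      # start with uniform weights
    model = []
    for _ in range(rounds):
        error, threshold, polarity = train_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # vote of this stump
        model.append((alpha, threshold, polarity))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, x):
    """Weighted vote of all stumps; the sign of the score is the predicted class."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in model)
    return 1 if score >= 0 else -1

# Toy data (hypothetical): no single threshold separates the two classes.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost(xs, ys, rounds=3)
predictions = [predict(model, x) for x in xs]
```

On this toy set the best single threshold rule misclassifies three of the ten points, whereas the weighted vote of three rules classifies all of them correctly, which is the sense in which boosting improves on a weak base classifier.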
1.3 MOTIVATION OF THE WORK
Any researcher needs motivation to choose an area of enquiry, and research that
cannot be used in real situations is futile. The advent of computers and modern
technology has opened new vistas for research. Right from school days, the
researcher was enamored of computers and their applications. When the time came
for higher learning and research, the researcher was drawn to finding a prediction
model that is easy to use. This is the factor that motivated the researcher to
develop an ensemble model.
The stock market (Eun and Shim 1989) (Day and Lewis 1992) is highly significant and
sensitive. Even members of the general public with an interest in investing and a
little computer knowledge browse stock prices in order to buy and sell through their
DEMAT accounts. With the emergence and expansion of the corporate world, stocks and
shares of large magnitude have emerged. With investors small and big keen to buy
and sell securities, stock exchanges have become an important financial market.
Government policies and budgets are reflected in the movement of the stock
exchange. Investment has become global, so much so that the movement of prices of
countries.
Investors closely watch stock movement (French 1987) (Guiso et al. 2008), and shares
are purchased or sold on this basis. The stock market may be bullish or bearish. The
fall in share prices in America in the last decade was immediately reflected in all
other countries. Economic recession should be closely watched by the monetary
authorities not only in their own country but in other countries as well. The mass
of data on listed securities should be closely studied, and hence a suitable model is needed to
So long as capital and capital markets exist, as they will, the study of the huge
volume of share market prices in a presentable form, and of an agreeable ensemble
model such as the one proposed, will remain relevant. The present work attempts this
and presents an approach to one of the problems, viz., stock movement (Huang and
Stoll 1994) (Enke and Thawornwong 2005) (Chen
Rainfall is a gift of nature, an indispensable and much-needed input to the world
and mankind. The famous Tamil poet Tiruvalluvar says that the world cannot sustain
itself without rain, which is a prerequisite for agricultural production. The world
follows behind the agriculturists. Water gives life to human beings, and without it they will die.
2002) (Toth et al. 2000) (Washington and Downing 1999) fails in some years and
this is not foreseen. In the absence of rainfall, rivers and channels dry up. Men,
cattle and other animals become ferocious and die for want of water sources. Wild
animals move to nearby towns in search of water and create panic and terror in human habitations.
India is a land of villages, and agriculture is the chief occupation of more than
60% of the population. Human beings are also to blame: deforestation and pollution
of the surroundings have contributed to failures of rainfall. In recent years, the
government has taken a few positive steps towards afforestation and the development
of social forests. The recurrent failure of rainfall and the drying up of water
sources is one of the foremost problems of the
sufficiently early, the government and agriculturists can be well prepared to meet
the impending difficult days, and the agricultural department and agriculturists can
plan accordingly. The government should always give priority to water conservation
and groundwater preservation. This vital problem led the researcher to seek a
suitable model for rainfall prediction. Besides, heavy rainfall in some years leads
to floods (Lobell et al. 2008), with the attendant loss of life and standing crops
and depletion of soil fertility. This reinforces the need for rainfall
Crop is a general term that includes food crops (Tilman et al. 2011) as well as
commercial crops. Food crops include paddy, wheat, maize, grams and millets.
Commercial crops include cotton, sugarcane, groundnut, tobacco and cashew. Both are
important: food crops for human consumption, and commercial crops for industrial
purposes and for earning money. A study on the estimation and forecasting of crop
production is highly useful for economic growth. Estimation of oilseed production
can forewarn the government regarding
On a piece of land, a single crop may be raised once, or the same crop may be raised
more than once, depending on the suitability of the land and weather conditions. In
some fertile wetlands, a short-term crop and a long-term crop (Monteith and Moss
1977) are generally grown. Two different kinds of crop may also be grown on the same
land. In some states in India, Kharif and Rabi crops are grown. A Kharif crop is a
monsoon crop (Lobell and Burke 2010), in which domesticated plants are cultivated
and harvested during the monsoon season. Millet and rice are the main Kharif crops.
Rabi crops are agricultural crops sown in winter and harvested in the spring season.
The major Rabi crops in India are wheat, barley, mustard, sesame and peas. In order
to maintain the fertility of the soil, rotation of crops may be followed, as is done
in the delta districts of Tamil Nadu. After paddy is harvested, grams are raised.
They are of short duration and capable of replenishing the lost
Mixed cropping or mixed farming is a practice in which more than one crop may be
grown. For example, paddy may be raised and, after harvest, cattle rearing may be
undertaken to condition and rejuvenate the soil for the next cultivation.
The above explanations are intended to bring out the different types of crop and
their importance. Crop selection (Lobell et al. 2011) and cultivation depend on
previous experience and on the suitability of soil fertility. The researcher felt
that trends in crop production may help forecast which types of crop should be
chosen for different seasons and different regions. What else is significant in an agricultural country with
production prediction. The choice of this area reflects the researcher's fascination
with agriculture, as the researcher hails from an agricultural family. The possible benefit of
Table 1.1 Prospective Benefit of Forecasting
STOCK
STOCK Avoid or be prepared to face
ENVIRONMENTAL INFLUENCE
CROPYIELD
CROPYIELD Avoid or be prepared to face no
May result in less or no yield
yield Take measures to increase the
yield
The ensemble model (Kotsiantis et al. 2010) (Pierro et al. 2016) in the present work
is applied to the analysis of time series using machine learning techniques. At the outset, time
may extend from economics to engineering. Such series are subject to further
scrutiny.
Some examples of time series in certain areas are relevant at this stage. They are
economic and financial time series, physical time series, marketing time series,
Daily share prices, exports and imports of commodities by variety and value over a
period of time, monthly wholesale and retail prices, monthly household income and
expenditure, and monthly company turnover and profits are only a few examples of
useful time series in the economic domain.
Many instances of time series occur in the physical sciences, for example in
meteorology, marine science and earth science, including earthquakes and tremors. Rainfall on
season are some examples. Many phenomena occur in a rhythm, and this can be used to
identify the areas prone to earthquakes and the areas possibly affected as well.
In time series studies, there are a few mechanical recorders which take
continuously is very much needed, for which certain equipment is kept to measure
these variables around the clock. When the trace goes outside pre-specified
limits, action is initiated. In some situations, visual examination of the trace may
be enough. However, for a detailed analysis, it should be converted to discrete time
advertisement costs, new markets gained and market areas lost are only a few
variables warranting time series studies. This would be a useful guide for future
In the study of population growth, various time series are in use. Monthly and
yearly increases in population are studied in a few countries. Child mortality rate,
quality of the process. These measurements can be plotted against time. When the
variation from target values is too large, control measures should be taken to
1.4.6 Binary Process
A special type of time series arises when observations can take only one of two
values, viz., 0 and 1. To quote a familiar example from the field of computer
science, the position of a switch, either 'on' or 'off', can be recorded as one or
zero respectively.
theory.
Description: When presented with a time series, the initial step in the analysis
plotting the observations against time to give what is known as a time plot.
series is needed. The description can yield from descriptive measures of the
series that there may be a regular seasonal effect, with higher sales in the winter
season and lower sales in the summer season. The time plot can also reveal that
the annual sales are increasing, that is, there is an upward trend. This is a very
basic model which describes trend and seasonal variation. This may be
is possible to use the variation in one time series in order to explain the
linear system converts an input series to an output series. The analyst can find
the input and output to a linear system by linear operation and assess the
properties of the linear system. It could thus be found how sea level is affected
by temperature and pressure, and how sales are affected by price
the future values of the series. For example in any trading business an
requirement of raw materials for future demand and the workforce needed for
future production could all be predicted on the basis of time and demand
series.
over production nor under production. Over production may lead to a glut in
the market, and under production may prompt a search for alternatives in the
needed.
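A minimal Python sketch of the descriptive step is given below. It assumes a hypothetical quarterly sales series with high winter and low summer values, and uses a simple moving average over one full year to smooth out the seasonal effect so that the underlying trend becomes visible. The data and the function name are illustrative assumptions, not figures from this study.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive observations."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# Hypothetical quarterly sales for two years (winter, spring, summer, autumn).
sales = [12, 8, 6, 14, 16, 12, 10, 18]

# A window of four quarters covers one full seasonal cycle,
# so the seasonal highs and lows cancel out within each average.
trend = moving_average(sales, window=4)

# The trend is upward if every smoothed value exceeds the one before it.
is_upward = all(later > earlier for earlier, later in zip(trend, trend[1:]))
```

Here the raw series rises and falls with the season, while the smoothed series increases steadily, which corresponds to the upward movement that a time plot of annual sales would reveal.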
All the above examples and statements bring out the vital need for time series
analysis models using machine learning techniques. This is the core of the scope of
this thesis.
Machine learning (Goldberg et al. 1988) research explores the possibility of
instructing computers in new ways, which helps to ease the burden of hand-
applications and availability of computers makes this possibility feasible and desirable.
Classification problems are better handled with the help of supervised learning
individual models. Ensemble methods (Hassan et al. 2007) (Contiu and Groza 2016)
(Bhardwaj et al. 2016) very often outperform single classifiers and increases its
available computing power that has made the application of ensemble methods
feasible for large data sets. According to Hansen et al. (1990), a necessary and
sufficient condition for an ensemble of classifiers to be more accurate than any of
its individual members is that the individual classifiers are accurate and diverse. Ensemble learning
generate multiple models. The ensemble passes the new example to each of the
multiple base models, obtains their predictions and combines them in some appropriate
manner such as averaging or voting. The present work provides a statistical ensemble
procedure in order to make predictions more accurate. It may be seen that the facets
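The combination step may be sketched in Python as follows. The sketch assumes that each base model has already produced its prediction for the new example; the function names and the sample predictions are hypothetical.

```python
from collections import Counter

def majority_vote(labels):
    """Combine class labels from the base models by taking the most common one."""
    return Counter(labels).most_common(1)[0][0]

def average_prediction(values):
    """Combine numeric predictions from the base models by simple averaging."""
    return sum(values) / len(values)

# Hypothetical predictions of three base models for one new example.
voted_label = majority_vote(["rise", "fall", "rise"])
averaged_value = average_prediction([0.2, 0.4, 0.9])
```

Voting suits classification, where each base model outputs a class label, while averaging suits numeric outputs such as forecasts or class probabilities.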
1.5 ORGANIZATION OF THE THESIS
the background, importance of study and machine learning methods and their
importance leading to the present study. Chapter 3 deals with the methodology
adopted for the study. Chapter 4 presents the results and discussion, covering the
various angles of the study. Chapter 5 gives the conclusion and the utility of the
work for future use.
1.6 SUMMARY
In this chapter, the basic ideas behind data mining techniques and time series data
analysis are elaborated. The main objectives of this work are given in detail. The
factors that influenced this study have been described in the motivation. The various
time series data available for analysis are discussed in the scope. The entire thesis work is