
CHAPTER 1

INTRODUCTION

1.1 BACKGROUND

With data now accumulating in massive volumes, raw data is seldom of direct use, and manual
analysis lags behind the fast accumulation of large databases. Knowledge Discovery and Data
Mining (KDD) is a fast-emerging field, drawing on databases, statistics and machine learning,
that has come to help. The objective of KDD is to turn raw data into a simple, understandable
form that supports analysis. KDD has emerged in the present competitive world to serve the
development of science, technology and finance alike. Matheus et al. (1993) describe KDD as
the nontrivial process of identifying valid, novel, potentially useful and ultimately
understandable patterns in data. It includes data selection, processing, data mining,
interpretation and evaluation. The chief issue in scaling down data is to select the pertinent
data and present it to a data mining algorithm. Data mining (Chen et al. 1996) (Kriegel et al.
2007) helps to reduce information overload and supports decision making. It does so by
extracting and refining useful knowledge through a search for relationships and patterns in
the large volumes of data collected by researchers and organizations. The extracted
information is used for prediction, classification and summarization of the data. Many
industries use data mining techniques for classification and pattern recognition, such as rule
induction, neural networks, genetic algorithms and fuzzy logic. Using these techniques, models
or patterns are derived from relationships and summaries in the data. Data mining is a
secondary form of data analysis: the data may have been collected for a specific purpose or
for no particular purpose at all. In many cases, data is pulled out of a general mass of
collected data to serve some other objective. For example, price fluctuation may be correlated
with money supply; although the money supply data was collected for a different purpose, it is
used to study price fluctuation. This underlines the importance of a reliable database in a
discernible form that can serve many uses, and hence the importance of data mining and data
analysis.

In statistics, a researcher often faces the problem of finding the smallest data size that
yields sufficiently confident estimates. Data mining deals with the opposite problem: building
a data model that is small yet significant from data that is large. The core of data mining is
developing a model of the data that is easy to understand and use, which is why it has become
significant in the modern digital world. Fayyad et al. (1996) developed a Knowledge Discovery
Process (KDP) model consisting of nine steps, of which the seventh is data mining. This step
produces patterns in a representational form such as classification rules, decision trees,
trends and regression models. Although the authors place it seventh, it is not low in
priority; it is the very basis of data analysis.

Machine learning should be construed as a subfield of computer science, and of soft computing
in particular, that evolved from the study of pattern recognition and computational learning
theory. Machine learning overlaps with data mining, although the latter focuses more on
exploratory data analysis, which is also known as unsupervised learning. Machine learning is a
field of study that gives computers the capacity to learn without being explicitly programmed.
It explores the study and construction of algorithms that can learn from data and make
predictions on it. Such algorithms operate by building a model from example inputs in order to
make data-driven decisions or predictions. Machine learning is closely associated with
computational statistics, which also focuses on prediction making with the use of computers.
It has close ties with mathematical optimization, which supplies methods, theory and
application domains to the field. Machine learning is used where designing and programming
explicit algorithms is not feasible.

In the field of data analysis, machine learning is used to devise models and algorithms that
lend themselves to prediction; in commercial use this is known as predictive analytics. Such
analysis helps researchers, engineers and data analysts to produce reliable decisions and
results. By learning from historical relationships and trends in the data, it uncovers hidden
insights.

In machine learning, classification is the problem of identifying to which of a set of
categories a new observation belongs. It is based on a training data set containing
observations whose category membership is known. Classification is significant in pattern
recognition. Classification is an instance of supervised learning, whereas clustering is
unsupervised. In supervised learning, the training data consists of pairs of input data
(vectors) and desired outputs; in unsupervised learning, there are no a priori outputs. A
classifier is an algorithm that performs classification, particularly in a concrete
implementation. The term sometimes also refers to the mathematical function implemented by a
classification algorithm. Three terms are important in machine learning: instances, features
and classes. Instances are the observations, features are the explanatory variables, and
classes are the possible categories to be predicted.
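The three terms can be made concrete with a small sketch. The data values, the feature meanings (rainfall and temperature) and the nearest-neighbour rule below are assumptions chosen only for illustration; they are not taken from the study.

```python
# Hypothetical toy data: each instance pairs a feature vector
# (rainfall_mm, temperature_c) with a class label.
training_set = [
    ((120.0, 24.0), "good_yield"),
    ((110.0, 26.0), "good_yield"),
    ((30.0, 35.0), "poor_yield"),
    ((25.0, 33.0), "poor_yield"),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify_nearest(instance, labelled):
    """Supervised step: return the class of the closest labelled instance."""
    _, label = min(labelled, key=lambda pair: squared_distance(instance, pair[0]))
    return label


# The classes are the distinct labels seen in the training data.
classes = sorted({label for _, label in training_set})

# A new, unlabelled instance is assigned to one of the known classes.
prediction = classify_nearest((100.0, 25.0), training_set)
```

Each tuple in `training_set` is one instance, the numbers inside it are its features, and the string is its class. In an unsupervised setting the labels would simply be absent.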

Classifiers are of two kinds: linear and non-linear. A linear classifier makes its
classification decision based on the value of a linear combination of the characteristics
(features). Linear classifiers help to solve practical problems such as document
classification and are well suited to problems with many variables (features). They generally
take less time to train and apply than non-linear classifiers. Classification has various
uses. In the medical field, for example, child mortality, the nature of an ailment, the period
of recovery and the survival rate may be predicted from the characteristics of the patients
using classification. Although these examples are from the medical field, classification can
be applied in many other domains.
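A linear classifier's decision, based on a linear combination of the features, can be sketched as follows. The weights, the bias and the rule of assigning class 1 to a non-negative score are illustrative assumptions; in practice the weights are learned from training data.

```python
def linear_score(weights, bias, features):
    """Linear combination of the feature values: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def linear_classify(weights, bias, features):
    """Assign class 1 when the score is non-negative, otherwise class 0."""
    return 1 if linear_score(weights, bias, features) >= 0 else 0


# Hypothetical parameters for a two-feature problem.
weights = [0.5, -1.0]
bias = 0.25

label_a = linear_classify(weights, bias, [2.0, 1.0])  # score is 0.25
label_b = linear_classify(weights, bias, [1.0, 2.0])  # score is -1.25
```

Because the decision needs only one weighted sum per instance, applying such a classifier is cheap even when the number of features is large.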

Ensemble data mining methods are sometimes also referred to as committee methods. They are
machine learning methods that leverage the power of multiple classifiers to achieve better
prediction accuracy than an individual classifier. The fundamental objective of an ensemble
model is to improve prediction accuracy. A committee consisting of a few persons is a good
analogy for an ensemble model. Each member thinks individually and contributes to the
decision, so the composition of the committee is significant. If all members are 'yes men' or
'no men', the committee is of no use. Members should maintain their individuality, agreeing on
some points and differing where they see fit; only then will the committee function
effectively and purposefully. The resulting classifier (the ensemble model) is thus a
combination of different classifiers that gives a good result. Several researchers (Dietterich
2000) (Grove and Schuurmans 1998) report that an ensemble is often more accurate than any
single model.
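The committee idea can be sketched as a simple majority vote. The three threshold rules below are hypothetical committee members invented for illustration, and plain majority voting is only one possible way of combining their opinions.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most committee members.

    Ties go to the label that appears first in the list.
    """
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(classifiers, instance):
    """Ask every member for its opinion and combine the answers by voting."""
    return majority_vote([classify(instance) for classify in classifiers])


# Three hypothetical members that look at the same instance differently.
committee = [
    lambda x: "up" if x[0] > 0.0 else "down",
    lambda x: "up" if x[1] > 0.0 else "down",
    lambda x: "up" if x[0] + x[1] > 1.0 else "down",
]

decision = ensemble_predict(committee, (2.0, -0.5))
```

For the instance shown, the second member disagrees with the other two, and the committee still returns the majority opinion. If all three members applied the same rule, the vote would add nothing, which mirrors the 'yes men' remark above.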

Two commonly used methods for obtaining accurate ensembles are Bagging (Breiman 1996) and
Boosting (Freund and Schapire 1995). These methods use 'resampling' techniques to obtain a
different training set for each of the classifiers. In addition, evolutionary approaches
(using the individuals in a population) and multi-objective approaches (members of different
complexity) can also be used to create ensembles.
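The resampling step can be illustrated with bootstrap sampling as used in Bagging: every classifier receives a training set of the original size, drawn with replacement. The function names, the seed and the toy data are assumptions of this sketch. Boosting differs in that later training sets favour the instances that earlier classifiers got wrong.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def make_training_sets(data, n_classifiers, seed=0):
    """Build one bootstrap training set per classifier in the ensemble."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [bootstrap_sample(data, rng) for _ in range(n_classifiers)]


# Hypothetical data set of ten instances, identified here by index only.
data = list(range(10))
training_sets = make_training_sets(data, n_classifiers=3)
```

Since sampling is done with replacement, some instances typically appear more than once in a training set and others not at all, which is what makes the resulting classifiers differ from one another.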

A time series (Chatfield 2016) is a set of observations taken at specified times, usually at
equal intervals. Mathematically, a time series is defined by the values A1, A2, ... of a
variable A (temperature, closing price of a share, etc.) at times X1, X2, .... Thus A is a
function of X, symbolized by A = F(X). A time series therefore consists of data arranged
chronologically. If data relating to population, per capita income, prices, production, etc.
are arranged for the last 5, 10, 15 or 20 years, or for some other time period, the resulting
series is called a time series. Time series analysis (Guralnik and Srivastava 1999) (Esling
and Agon 2012) comprises methods for analyzing time series data in order to extract meaningful
statistics and other characteristics of the data. The analysis of time series is of great use
to economists, businessmen, scientists, geologists, sociologists, biologists, meteorologists
and research workers. It helps in understanding past behavior, in planning future operations,
in evaluating current accomplishments, and in making comparisons.

1.2 OBJECTIVES OF THE WORK

One of the most important tasks before economists, businessmen, meteorologists,
agriculturists, industries, government and planners is to make estimates for the future. For
example, a businessman is interested in finding out his likely sales in the year 2018, or, for
long-term planning, in the year 2025, so that he can adjust his production accordingly and
avoid either unsold stocks or production inadequate to meet the demand. Similarly, an
economist is interested in estimating the likely production in the coming year so that proper
planning can be carried out with regard to food supply, jobs for the people, control of
inflation, etc. The incumbent Governor of the Reserve Bank of India (RBI) may be concerned
with containing the prevailing inflation or the movement of inflation over the next five
years. Likewise, maintaining the interest rate in the short run and limiting its variation in
future can be subjects of enquiry. The researcher expects the model developed in this work to
be useful in this regard.

The first step in making estimates for the future consists of gathering information from the
past. In this connection one usually deals with statistical data which are collected, observed
or recorded at successive intervals of time. Such data are generally referred to as time
series. Hence, in the analysis of time series, time is the most important factor, because the
variable is related to time, which may be measured in years, months, weeks, days, hours or
even minutes or seconds.

The role of the computer specialist is to find a suitable model that combines more than one
classifier (an ensemble). The researcher attempts to arrive at such a model in order to
analyze different real-time situations. The approach may be useful for future research under
different situations; if the researcher succeeds, it will provide a direct approach for
further development in the field. The scope of the present study can also be extended to a
variety of fields.

The objectives of the present enquiry relate to a few real-time problems of great importance,
which are taken up in this study as examples. They are as follows:

• Predicting stock market movement

• Gauging possible rainfall

• Estimating crop production

Although these are only a few examples, in real-time situations various problems may arise
that warrant suitable prediction. The work is also motivated by the availability of
sophisticated machine learning methods such as Support Vector Machine (SVM) and Naive Bayes
(NB), which are extensively used for data classification and interpretation. This led the
researcher to propose the ensemble models AdaSVM and AdaNaive Bayes in order to make fine
predictions. The objective of an ensemble model is to improve the accuracy of classification
by making use of more than one classifier. The model supports decision making by combining the
results of the classification techniques. Boosting improves the accuracy of a given learning
algorithm; one such boosting algorithm is AdaBoost (Freund and Schapire 1995), which is the
algorithm applied in this study. The statistical significance of the results obtained is
analyzed and discussed using ANOVA and Bonferroni methods.
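The way boosting improves on a weak learner can be sketched with a minimal AdaBoost for two classes labelled +1 and -1. This is an illustration under stated assumptions and not the implementation used in the study: a one-dimensional threshold rule (a decision stump) stands in for the SVM and Naive Bayes base learners, and the data values are invented.

```python
import math


def stump_predict(threshold, polarity, x):
    """Weak learner: predict +1 or -1 by comparing x with a threshold."""
    return polarity if x >= threshold else -polarity


def best_stump(xs, ys, weights):
    """Pick the (error, threshold, polarity) with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(threshold, polarity, x) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost_train(xs, ys, rounds):
    """Return a list of (alpha, threshold, polarity) weighted weak learners."""
    n = len(xs)
    weights = [1.0 / n] * n  # start with equal weight on every instance
    model = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        model.append((alpha, threshold, polarity))
        # Raise the weight of misclassified instances and lower the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(threshold, polarity, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def adaboost_predict(model, x):
    """Sign of the alpha-weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(t, p, x) for alpha, t, p in model)
    return 1 if score >= 0 else -1


# Hypothetical data that no single threshold can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost_train(xs, ys, rounds=3)
predictions = [adaboost_predict(model, x) for x in xs]
```

On this toy data the best single stump misclassifies two of the six instances, whereas the weighted vote of three stumps reproduces every label. Replacing the stump with another base classifier changes only `best_stump` and `stump_predict`.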

1.3 MOTIVATION OF THE WORK

Any researcher needs a motivation for choosing an area of enquiry, and research that cannot be
used in real situations is futile. The advent of computers and modern technology has opened
new vistas for research. Right from school days, the researcher was enamored of computers and
their applications. When the time came for higher learning and research, the researcher was
drawn to finding a prediction model that is easy to use. This is the factor that motivated the
researcher to develop an ensemble model.

The stock market (Eun and Shim 1989) (Day and Lewis 1992) is both highly significant and
sensitive. Even members of the general public with an interest in investing and a little
computer knowledge browse stock prices and buy and sell through their DEMAT accounts. With the
emergence and expansion of the corporate world, stocks and shares of large magnitude have come
into being. With small and big investors alike wishing to buy and sell securities, stock
exchanges have become an important financial market. Government policies and budgets are
reflected in the movement of the stock exchange. Investment has become global, so much so that
the movement of stock prices in one country is quickly reflected in the price movement of
shares in other countries.

Investors closely watch stock movement (French 1987) (Guiso et al. 2008), and shares are
purchased or sold on this basis. The stock market may be bullish or bearish. The fall in share
prices in America in the last decade was immediately reflected in other countries. Monetary
authorities should closely watch for economic recession not only in their own country but in
other countries as well. The mass of data on listed securities should be closely studied, and
hence a suitable model is needed to study the shares of various sectors.

As long as capital and capital markets exist, as they will, there will be a need to study the
huge volume of share market prices in a presentable and agreeable form, for which the ensemble
model is proposed. The present work attempts this and takes up stock movement (Huang and Stoll
1994) (Enke and Thawornwong 2005) (Chen 2009) as one of the problems for study and scrutiny.

Rainfall is a gift of nature and an indispensable input to the world and mankind. The famous
Tamil poet Tiruvalluvar says that the world cannot sustain itself without rain, and rain is a
prerequisite for agricultural production. The world depends on the agriculturists. Water gives
life to human beings, and without it they will die. There is no reliable artificial method of
creating rainfall. Rainfall (Krajewski and Smith 2002) (Toth et al. 2000) (Washington and
Downing 1999) fails in some years, and the failure is not foreseen. In the absence of
rainfall, rivers and channels dry up. Men, cattle and other animals become desperate and die
for want of water. Wild animals move to nearby towns in search of water and create panic and
terror in habitations. Successive rainfall failures make ground water scarce.

India is a land of villages, and agriculture is the chief occupation of more than 60% of the
population. Human activity is also to blame: deforestation and pollution of the surroundings
have contributed to failures of rainfall. In recent years, the government has taken a few
positive steps towards afforestation and developing social forests. The recurrent failure of
rainfall and the drying up of water sources is one of the foremost problems of the economy, as
industries depend on agriculture. If rainfall could be predicted sufficiently early, the
government, the agricultural department and agriculturists could be well prepared to meet the
impending difficult days. The government should always give priority to water conservation and
the preservation of underground water. This vital problem led the researcher to seek a
suitable model for rainfall prediction. Besides, heavy rainfall in some years leads to floods
(Lobell et al. 2008), with their trail of loss of life and standing crops and depletion of
soil fertility. This reinforces the need for rainfall forecasting, which also motivated the
study of this area.

Crop is a general term that includes food crops (Tilman et al. 2011) as well as commercial
crops. Food crops include paddy, wheat, maize, grams and millets; commercial crops include
cotton, sugarcane, groundnut, tobacco and cashew. Both are important: food crops for human
consumption, and commercial crops for industrial purposes and for earning money. A study on
the estimation and forecasting of crop production is highly useful for economic growth. For
instance, an estimate of oil seed production can forewarn the government of the imports needed
and the consequent requirement of foreign exchange.

On a piece of land, a single crop may be raised once, or the same crop may be raised more than
once, depending on the suitability of the land and weather conditions. In some fertile wet
lands, a short-term crop and a long-term crop (Monteith and Moss 1977) are generally grown.
Two different kinds of crop may also be grown on the same land. In some states in India,
Kharif and Rabi crops are grown. A Kharif crop is a monsoon crop (Lobell and Burke 2010): a
domesticated plant cultivated and harvested during the monsoon season. Millet and rice are the
main Kharif crops. Rabi crops are agricultural crops sown in winter and harvested in the
spring. The major Rabi crops in India are wheat, barley, mustard, sesame and peas. In order to
maintain the fertility of the soil, rotation of crops may be followed, as is done in the delta
districts of Tamil Nadu. After paddy is harvested, grams are raised. They are of short
duration and capable of replenishing the lost properties of the soil. This also helps
agriculturists to earn more.

Mixed cropping, or mixed farming, is a practice in which more than one crop or activity is
taken up on the same land. For example, paddy may be raised and, after harvest, cattle rearing
may be undertaken to rejuvenate the soil for the next cultivation. The above explanations
bring out the different types of crop and their importance. Crop selection (Lobell et al.
2011) and raising depend on previous experience and on the suitability of the soil. The
researcher felt that trends in crop production may help to forecast the types of crop that
should be chosen for different seasons and regions. In an agricultural country with varying
soil fertility and weather conditions, few subjects are more significant than the prediction
of crop production. The area was also chosen because of the researcher's fascination with
agriculture, having come from an agricultural family. The possible benefit of forecasting is
illustrated in Table 1.1.

Table 1.1 Prospective Benefit of Forecasting

ENVIRONMENTAL INFLUENCE | WITHOUT PREDICTION | WITH PREDICTION

STOCK | May result in bullish or bearish situation | Avoid or be prepared to face bearish situation; make the most out of the bullish situation

RAINFALL | May result in drought or flood situation | Avoid or be prepared to face drought situation; save life and property from flood damage

CROP YIELD | May result in less or no yield | Avoid or be prepared to face no yield; take measures to increase the yield

1.4 SCOPE OF THE WORK

The ensemble model (Kotsiantis et al. 2010) (Pierro et al. 2016) in the present work is
applied to the analysis of time series using machine learning techniques. To begin with, a
time series is formed by a set of observations collected sequentially over a period of time.
Examples of such series can be found in many areas, from economics to engineering, and such
series are subject to further scrutiny.

Some examples of time series in particular areas are relevant at this stage: economic and
financial time series, physical time series, marketing time series, demographic time series,
process control data and binary processes.

1.4.1 Economics and Financial Time Series

Share prices on successive days, the variety and value of commodities exported and imported
over a period of time, monthly wholesale and retail prices, monthly household income and
expenditure, and monthly company turnover and profits are only a few examples of useful time
series in the economic domain.

1.4.2 Physical Time Series

Many instances of time series occur in the physical sciences, particularly in meteorology,
marine science and earth science, including the study of earthquakes and tremors. Rainfall on
successive days of a season, temperature over a few months and flood situations in a season
are some examples. Many phenomena occur in a rhythm, and this can be used to identify areas
that are prone, or possibly prone, to earthquakes.

In time series studies, some mechanical recorders take measurements continuously and produce a
continuous trace rather than observations at discrete intervals. In some laboratories,
continuous observation of temperature and humidity is needed, and equipment is kept to measure
these variables round the clock. When the trace goes outside pre-specified limits, action is
initiated. In some situations, visual examination of the trace may be enough. For a detailed
analysis, however, the trace should be converted to a discrete time series by sampling it at
appropriate equal intervals of time.
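Converting a continuous trace into a discrete time series can be sketched as reading the trace at equally spaced times. The temperature function below is a hypothetical stand-in for a recorder's trace, and the hourly sampling interval is likewise an assumption made for illustration.

```python
import math


def sample_trace(trace, start, stop, step):
    """Read a continuous trace (a function of time) at equally spaced times."""
    n_points = int(round((stop - start) / step)) + 1
    return [(start + i * step, trace(start + i * step)) for i in range(n_points)]


def temperature(t_hours):
    """Hypothetical laboratory temperature in degrees Celsius over one day."""
    return 22.0 + 3.0 * math.sin(2.0 * math.pi * t_hours / 24.0)


# Hourly readings from midnight to midnight: a discrete time series
# of (time, value) pairs.
hourly = sample_trace(temperature, start=0.0, stop=24.0, step=1.0)
```

The choice of `step` matters: an interval that is too long can miss short excursions of the trace, while a very short one yields more observations than the analysis needs.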

1.4.3 Marketing Time Series

Marketing is an important segment of business where time series analysis is significant. Sales
figures in successive weeks and months, monetary receipts, advertisement costs, new markets
gained and market areas lost are only a few of the variables that warrant time series studies.
Such studies are a useful guide to future action for increasing receipts and reducing
expenditure.

1.4.4 Demographic Time Series

In the study of population growth, various time series are in use. Month-wise and year-wise
increases in population are studied in a few countries. Child mortality rates and longevity
curves are also studied as time series.

1.4.5 Process Control Data

In manufacturing, quality is important. The quality of a process can be monitored by measuring
a variable that reveals it, and these measurements can be plotted against time. When the
measurements vary too much from their target values, control measures should be taken to
correct the process.
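The check described above can be sketched as flagging every measurement that deviates from the target by more than an allowed tolerance. The measurements, the target and the tolerance are hypothetical values; how the tolerance is set in practice is outside the scope of this sketch.

```python
def control_signals(measurements, target, tolerance):
    """Return (index, value) pairs that deviate from target by more than tolerance."""
    return [(i, value) for i, value in enumerate(measurements)
            if abs(value - target) > tolerance]


# Hypothetical fill weights in grams, recorded in time order.
weights_g = [500.2, 499.8, 500.5, 503.1, 499.9, 496.4]
signals = control_signals(weights_g, target=500.0, tolerance=2.0)
```

Here the fourth and sixth measurements fall outside the tolerance, so control measures would be considered at those points in time.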

1.4.6 Binary Process

A special type of time series arises when observations can take only one of two values, viz.,
0 and 1. To quote a familiar example from computer science, the position of a switch, either
'on' or 'off', can be recorded as one or zero respectively. Binary processes occur in several
situations, including the study of communication theory.

The analysis of time series can serve several objectives, such as description, explanation,
prediction and control.

• Description: When presented with a time series, the initial step in the analysis is to plot
the observations against time to give what is known as a time plot, and then to obtain
simple descriptive measures of the main properties of the series. For a sales series, for
example, these may show a regular seasonal effect, with higher sales in winter and lower
sales in summer. The time plot may also reveal that annual sales are increasing, that is,
that there is an upward trend. A very basic model that describes trend and seasonal
variation may be perfectly adequate to describe the variation in such a time series.

• Explanation: When observations are recorded on two or more variables, it may be possible to
use the variation in one time series to explain the variation in another. This gives a
deeper understanding of the mechanism that generated a given time series. A linear system
converts an input series to an output series by a linear operation. Given observations of
the input and output of a linear system, the analyst can assess the properties of the
system. For example, one can study how sea level is affected by temperature and pressure, or
how sales are affected by price and economic conditions. Figure 1.1 represents a linear
system.

Fig 1.1 Schematic Representation of a Linear System

• Prediction: Given an observed time series, a researcher may wish to predict the future
values of the series. In any trading business, for example, sales forecasting is an
important requirement; the raw materials required for future demand and the workforce needed
for future production can likewise be predicted on the basis of time and demand series.

• Control: Control is an important aspect of industrial production, and both quantitative and
qualitative controls are significant. There should be neither over-production nor
under-production: over-production may lead to a glut in the market, and under-production may
send buyers in search of alternatives. Time series study is therefore needed to maintain
steady demand and growth.
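The descriptive step in the list above, finding trend and seasonal variation, can be sketched with two simple measures: a moving average over one full year to expose the trend, and the mean of each quarter to expose the seasonal effect. The quarterly sales figures are invented for illustration, and these two measures are only one basic way of describing a series.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive values; smooths out the season."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


def seasonal_means(series, period):
    """Average of all observations that share the same position in the cycle."""
    return [sum(series[k::period]) / len(series[k::period])
            for k in range(period)]


# Hypothetical quarterly sales for three years; the first quarter is winter
# and the third quarter is summer.
sales = [12, 8, 6, 10, 14, 10, 8, 12, 16, 12, 10, 14]

trend = moving_average(sales, window=4)   # one value per four-quarter run
season = seasonal_means(sales, period=4)  # one value per quarter of the year
```

For these figures the moving average rises steadily, which corresponds to the upward trend, and the quarterly means are highest in the winter quarter and lowest in the summer quarter, which corresponds to the seasonal effect.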

All the above examples and statements bring out the vital need for time series analysis in
different situations. Time series analysis, significant in itself, is fortified in this work
with ensemble models built using machine learning techniques. This is the core of the scope of
this thesis.

Machine learning (Goldberg et al. 1988) research attempts to instruct computers in new ways,
easing the burden of hand-programming as the information to be handled grows in volume and
complexity. The fast expansion of applications and the availability of computers make this
both feasible and desirable. Classification problems are well served by supervised learning
(Caruana and Niculescu-Mizil 2006). In recent times, researchers have become interested in
producing models of small expected loss by building and aggregating multiple individual
models. Ensemble methods (Hassan et al. 2007) (Contiu and Groza 2016) (Bhardwaj et al. 2016)
very often outperform single classifiers, and the increase in available computing power has
made their application feasible for large data sets. According to Hansen et al. (1990), a
necessary and sufficient condition for an ensemble of classifiers to be more accurate than any
of its individual members is that the individual classifiers are accurate and diverse.
Ensemble learning generates multiple models. The ensemble passes each new example to each of
the base models, obtains their predictions and combines them in some appropriate manner, such
as averaging or voting. The present work provides a statistical ensemble procedure in order to
make predictions more accurate. The scope of the work is thus both wide and deep.
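Why combining accurate base models by voting can raise accuracy may be illustrated with a short calculation. The sketch assumes that the members err independently of one another, which is an idealisation introduced here and not a claim of the text; under that assumption the probability that a strict majority is correct follows the binomial distribution.

```python
from math import comb


def majority_accuracy(n_members, p_correct):
    """Probability that a strict majority of n independent members is correct."""
    needed = n_members // 2 + 1
    return sum(comb(n_members, k)
               * p_correct ** k
               * (1 - p_correct) ** (n_members - k)
               for k in range(needed, n_members + 1))


# Three hypothetical members, each correct 70% of the time.
accurate_committee = majority_accuracy(3, 0.7)

# Three hypothetical members, each correct only 40% of the time.
weak_committee = majority_accuracy(3, 0.4)
```

With three members that are each right 70% of the time, the vote is right more often than any single member, whereas with members that are right only 40% of the time the vote does worse than a single member. Real classifiers are rarely fully independent, so the gain in practice is usually smaller than this calculation suggests.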

1.5 ORGANIZATION OF THE THESIS

The entire work is presented in five chapters. Chapter 1 is the introduction, which outlines
the background, the importance of the study, and machine learning methods and their
significance. Chapter 2 reviews the literature in depth and shows how it leads to the present
study. Chapter 3 deals with the methodology adopted for the study. Chapter 4 presents the
results and discusses them from various angles. Chapter 5 gives the conclusion and the utility
of the work for future use.

1.6 SUMMARY

In this chapter, the basic idea behind data mining techniques and time series data analysis is
elaborated. The main objectives of this work are given in detail. The factors that influenced
this study are described in the motivation. The various kinds of time series data available
for analysis are discussed in the scope. The structure of the thesis is set out in the
organization of the thesis.
