A Brief Survey On Data Mining For Biological and Environmental Problems.
Data mining (the analysis step of the "Knowledge Discovery in Databases"
process, or KDD), an interdisciplinary subfield of computer science, is the
computational process of discovering patterns in large data sets using methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a
data set and transform it into an understandable structure for further use. Aside from the
raw analysis step, it involves database and data management aspects, data pre-
processing, model and inference considerations, interestingness metrics, complexity
considerations, post-processing of discovered structures, visualization, and online
updating. The term is a buzzword, and is frequently misused to mean any form
of large-scale data or information processing (collection, extraction, warehousing,
analysis, and statistics), but it is also generalized to any kind of computer
decision support system, including artificial intelligence, machine learning, and
business intelligence. In the proper use of the word, the key term is discovery,
commonly defined as "detecting something new". Even the popular book "Data Mining:
Practical machine learning tools and techniques with Java" (which covers mostly machine
learning material) was originally to be named just "Practical machine learning", and the
phrase "data mining" was only added for marketing reasons.
Problem description
Finding relevant information in unstructured data is a challenge. The data is
unknown in terms of structure and values. Each piece of data lives in a specific
domain, for which a domain expert is available to supply a priori knowledge.
Domain experts can create structures in the data by hand, but this is a
time-consuming job and it is done for one dataset in one domain at a time.
An additional challenge is connecting data from different domains.
Collaborative environment
All the data is (geographically) spread over multiple domains: the
environment consists of more than one domain, and the domains have the intention to
collaborate. Users are physically located at different places, exchanging knowledge and
sharing information by interacting. Nowadays collaborative environments are
characterized by a complex infrastructure, multiple organizational sub-domains,
constrained information sharing, heterogeneity, and frequent, volatile, dynamic change.
Problem description
During problem analysis, system administrators combine information from
various monitoring tools. Domain administrators try to relate their findings to
administrators from different domains. This process involves manual inspection of log
files. This is a time-consuming job, which sometimes involves the inference of missing
data. The current study aims to find patterns in the observed data, and build models of
the structure of information in the log files. By relating information from multiple
sources (log files) an automated system should be able to infer missing information (1).
Approach
The following approach is used to address the research questions. First, state of
the art literature research is performed to get an overview of learning mechanisms and
agent technology in distributed systems.
Experiment: corpus creation using genetic algorithms. The first experiment uses or
creates an intelligent algorithm for Region of Interest (ROI) extraction. ROIs are
possible parts of the solution domain (the retrievable structure/pattern, see figure 1-1).
The algorithm has to be able to read manifests, to interpret them, and to have a
learning function. The result should be a structured set of regular expressions,
called a corpus. The regular expressions represent the value data of
ROIs. An ROI in a log file can be a sentence, verb, IP address, date, or a combination
of these (see figure 1-3).
The algorithm uses a log file and domain knowledge as input. Domain knowledge
contains descriptive attributes, e.g. {; ,Time = . * date= }. A descriptive attribute is a
data pattern, expressed as a regular expression, that the agent will search for in the
log file. With the manifest and the domain knowledge, the algorithm returns a corpus.
The corpus represents an information pattern in the log file. Information
retrieval is achieved by querying the corpus; these queries return the values of the ROIs.
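As a minimal sketch of this idea (the attribute patterns and the log line below are illustrative assumptions, not taken from the studied data set), descriptive attributes expressed as regular expressions can be matched against a log line to extract ROI values:

```python
import re

# Illustrative descriptive attributes: each regular expression names one
# type of Region of Interest (ROI).
descriptive_attributes = {
    "ip":   r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
    "date": r"\b\d{4}-\d{2}-\d{2}\b",
}

def extract_rois(log_line, attributes):
    """Return every ROI value found in a log line, keyed by attribute name."""
    return {name: re.findall(pattern, log_line)
            for name, pattern in attributes.items()}

line = "2013-05-12 worker node 10.0.3.17 failed to stage file"
rois = extract_rois(line, descriptive_attributes)
# rois["date"] -> ["2013-05-12"], rois["ip"] -> ["10.0.3.17"]
```

Querying the resulting corpus then amounts to running its regular expressions over the manifest and collecting the matches.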
SCOPE:
The scope of the current study is shown in figure 1-4; every chapter will zoom in
on a part of the figure. The scope is retrieving data from manifest(s) using domain
knowledge and machine learning techniques. Figure 1-4 describes the various levels of
the current study and their relations. The top level describes the data that was used, in
this case manifests containing log data. The second level involves pre-processing, that
is, retrieving/creating domain knowledge about the manifests. This domain knowledge
was used in the next level by the algorithm to find patterns in the manifest. Finally,
at the level of the multi-agent system, the found patterns - or ROIs - were exchanged
between agents.
Fig: Text mining executed by Multi-agents
Human rights
Data mining is also applied in the human rights area. Data mining of government
records, particularly records of the justice system, enables the discovery of systemic
human rights violations.
Objectives:
To collect different samples from the different areas of the protein and forest
fire data.
To identify and maintain techniques, and to explore clustering and
classification techniques for this problem.
To perform a comparative analysis between the different areas (protein and forest
fire data).
To explore different techniques for the protein and forest fire data.
To apply data mining algorithms.
To explore the study of environmental data, decision trees, and prediction of data.
CHAPTER 2
LITERATURE SURVEY
Learning through examples. This means learning from training data. For
instance, pairs (Xi, Yi), where Xi is a vector from the domain space D and Yi is a
vector from the solution space S, i = 1, 2, ..., n, are used to train a system to
obtain the goal function F : D → S. This type of learning is typical for neural
networks.
Learning by being told. This is a direct or indirect implementation of a set of
heuristic rules in a system. For example, the heuristic rules to monitor a car can
be directly represented as production rules. Or instructions given to a system in a
text form by an instructor (written text, speech, natural language) can be
transformed into internally represented machine rules.
Learning by doing. This way of learning means that the system starts with nil or
little knowledge. During its functioning it accumulates valuable experience and
profits from it, and performs better over time. This method of learning is typical
for genetic algorithms.
The first two approaches are top-down, because many possible solutions are
available. These approaches can learn from experience or via an instructor (supervised).
The third is a bottom-up strategy, beginning with little knowledge and trying to find
the best possible solution. Here the final or optimal solution is not known; only parts of
the solution domain are given.
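The bottom-up strategy can be sketched with a toy genetic algorithm. The task below (maximizing the number of 1-bits in a string) and all parameter values are illustrative assumptions; the point is that the population starts with no knowledge of the solution and accumulates fitness over generations:

```python
import random

random.seed(42)
GENES = 20

def fitness(individual):
    # Count of 1-bits: the quantity this toy problem maximizes.
    return sum(individual)

def mutate(individual, rate=0.05):
    # Flip each bit independently with a small probability.
    return [bit ^ 1 if random.random() < rate else bit for bit in individual]

def crossover(a, b):
    # Single-point crossover of two parents.
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

# Start with random individuals ("little knowledge") and improve by doing.
population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(30)]
for generation in range(100):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                      # keep the fittest (elitism)
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(20)]
    population = parents + children

best = max(population, key=fitness)
```

Because the fittest individuals are always retained, the best solution found never degrades between generations, which mirrors the "profits from experience" behaviour described above.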
In information technology, an agent is an autonomous software component.
Autonomous, because it operates without the direct intervention of humans or others
and has control over its actions and internal states. It perceives its environment through
sensors and acts upon its environment through effectors. Agents communicate with
other agents. By communication, agents can work together which can result in
cooperative problem solving, whereby agents have their own tasks and goals to fulfil in
the environment. See figure 2-2 (multi-agent system, canonical view), where the agents
interact with each other. An agent perceives a part of the environment, and with its
actions an agent can (partially) influence the environment.
A multi-agent system (MAS) consists of multiple coupled (intelligent) software
agents, which interact with one another through communication and are capable of
perceiving the environment. Multi-agent systems solve problems which are difficult or
impossible for a single-agent system.
FCM is based on fuzzy logic. In (Elena, 2013) the author presented a survey of
fuzzy logic theory as applied in cluster analysis. In this work the author reviewed
the Fuzzy C-Means clustering method in MATLAB.
In this literature (Ester et al., 2001) the authors presented a database-oriented
framework for spatial data mining. A small set of basic primitive operations on
graphs and paths was discussed as database primitives for spatial data mining, and an
implementation on top of a commercial DBMS was presented. The authors covered the
main tasks of spatial data mining: spatial classification, spatial characterization,
spatial clustering and spatial trend detection. For each of these tasks, the authors
presented algorithms and prototypical applications, and indicated interesting directions
for future work. Since the overhead imposed by the DBMS is rather large, techniques
for improving efficiency should be investigated; for example, techniques for processing
sets of objects, which provide more information to the DBMS, can be used to improve
the overall efficiency of mining algorithms.
In (Ghosh and Dubey, 2013) the authors presented two important
clustering algorithms, K-Means and FCM (Fuzzy C-Means), and performed a
comparative study between them.
In (Jain and Gajbhiye, 2012) the authors considered a competitive neural
network, the K-Means clustering algorithm, the Fuzzy C-Means algorithm, and one
original method, Quantum Clustering. The main aim of this research was to
compare the four clustering techniques on multivariate data. The authors
introduced an easy-to-use tool that compares these clustering methods within the
same framework.
In (Kalyankar and Alaspurkar, 2013) the authors presented a number of
data mining concepts and methods, such as classification, clustering, pattern
mining, and the related algorithms. The main aim of this research was to study
weather data using data mining techniques and methods, in particular clustering.
In (Kavitha and Sarojamma, 2012) the authors noted that Disease
Management (DM) programs are beginning to encompass providers across the healthcare
spectrum. The article describes a diabetes monitoring platform that supports diabetes
diagnosis assessment, offering the functions of DM. It is based on the CART
method. The work explained a decision tree for diabetes diagnostics and showed how
to use it as a basis for generating knowledge-base rules.
(Lavrac and Novak, 2013) first outlined relational data mining approaches and
subgroup discovery. The research then described recently developed approaches to
semantic data mining which facilitate the use of domain ontologies as background
knowledge in data analysis. The techniques and tools were illustrated on selected
biological applications.
In (Lai and Tsai, 2012) the authors presented preliminary results of
validation and risk assessment for landslides caused by heavy rains in the Shimen
reservoir watershed of Taiwan. The research used spatial analysis and data mining
algorithms, and focused on eleven factors: Normalized Difference Vegetation
Index (NDVI), Digital Elevation Model (DEM), slope, aspect, curvature,
geology, soil, land use, fault, river and road.
Title: Monitoring of Diabetes with Data Mining via CART Method
Author: Kavitha K, Sarojamma R M
Advantages
It can generate regression trees, where each leaf predicts a real number and not
just a class.
CART can identify the most significant variables and eliminate non-significant
ones by referring to the Gini index.
Disadvantages
CART may produce an unstable decision tree. When the learning sample is modified,
there may be an increase or decrease of tree complexity, and changes in splitting
variables and values.
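The Gini index mentioned above measures the impurity of the class labels at a tree node: CART prefers splits that lower it. A minimal sketch (the class names are illustrative):

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# A pure node has impurity 0; an even 50/50 binary split has impurity 0.5.
pure  = gini_index(["diabetic"] * 8)
mixed = gini_index(["diabetic"] * 4 + ["healthy"] * 4)
```

A candidate split is scored by the weighted Gini impurity of its child nodes, which is how CART identifies the most significant splitting variables.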
Title: Verification and Risk Assessment for Landslides in the Shimen Reservoir
Watershed of Taiwan Using Spatial Analysis and Data Mining
Author: J-S. Lai, F. Tsai
In the arena of software, data mining technology has been considered a useful
means for identifying patterns and trends in large volumes of data. This approach is
basically used to extract unknown patterns from large data sets for business as
well as real-time applications. It is a computational intelligence discipline which has
emerged as a valuable tool for data analysis, new knowledge discovery and autonomous
decision making. The raw, unlabeled data from a large dataset can be
classified initially in an unsupervised fashion by using cluster analysis, i.e. clustering:
the assignment of a set of observations into clusters so that observations in the same
cluster may in some sense be treated as similar. The outcome of the clustering process and
efficiency of its domain application are generally determined through algorithms. There
are various algorithms which are used to solve this problem. In this research work two
important clustering algorithms namely centroid based K-Means and representative
object based FCM (Fuzzy C-Means) clustering algorithms are compared. These
algorithms are applied and their performance is evaluated on the basis of the efficiency
of the clustering output. The number of data points and the number of clusters are the
factors against which the behaviour of both algorithms is analyzed. FCM
produces results close to K-Means clustering, but it requires more computation time
than K-Means clustering.
Advantages
KM is relatively efficient, with low computational time complexity.
Disadvantages
KM may not be successful in finding overlapping clusters.
It is not invariant to non-linear transformations of data.
KM also fails to cluster noisy data and non-linear datasets.
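For reference, plain K-Means can be sketched in a few lines. The data and the deterministic initialization below are illustrative assumptions (real implementations use random or k-means++ seeding):

```python
import math

def kmeans(points, k, iterations=20):
    """Plain K-Means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    # Naive deterministic initialization, chosen only to keep the sketch
    # reproducible.
    step = max(1, len(points) // k)
    centroids = points[::step][:k]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # A cluster that lost all points keeps its previous centroid.
        centroids = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Two well-separated 2-D groups; K-Means recovers one centroid per group.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
        (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
centroids, clusters = kmeans(data, k=2)
```

FCM differs from this sketch by replacing the hard assignment step with fractional membership degrees, which is what allows it to model overlapping clusters at extra computational cost.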
Title: Data Mining Technique to Analyse the Metrological Data
Author: Meghali A. Kalyankar, S. J. Alaspurkar
Data mining is the process of discovering new patterns from large data sets. This
technology is employed to infer useful knowledge from vast amounts of data;
various data mining techniques such as classification, prediction, clustering and
outlier analysis can be used for the purpose. Weather data is meteorological data
that is rich in important knowledge. Meteorological data mining is a form of data
mining concerned with finding hidden patterns inside largely available
meteorological data, so that the information retrieved can be transformed into usable
knowledge. Climate affects human society in all possible ways, and knowledge of
weather or climate data in a region is essential for business, society, agriculture
and energy applications. The main aim of this paper is to give an overview of the
data mining process for weather data and to study weather data using a data mining
technique such as clustering. Using this technique we can acquire weather data and
find the hidden patterns inside the large dataset, so as to transform the retrieved
information into usable knowledge for the classification and prediction of climate
conditions. The authors discussed how to use a data mining technique to analyse
meteorological data such as weather data.
Advantages
Data Logger can automatically collect data on a 24-hour basis
CHAPTER 3
PRE-PROCESSING IN INFORMATION RETRIEVAL
Domain administrators
Knowledge can be gained manually by asking an expert or administrator
in the specific domain. This person can provide the required information about the
manifest: which knowledge is needed, and a priority order for each item.
In this specific case, gaining and retrieving knowledge via domain experts
has some disadvantages. For these experts, providing knowledge to other people or
systems is a time-consuming job. When an expert provides knowledge to another
person, the domain knowledge should be provided in a consumable format, so that a
non-expert can work with the provided knowledge when retrieving rich information
from a manifest. In practice the domain knowledge can change, which is an extra
challenge for the domain experts. This study assumes that a domain expert provides
knowledge in a machine-readable format as input for finding a corpus (structure) of the
manifest. It is important to have readable human-machine communication; an
ontology is very helpful in creating a compromise and an agreement on the
communication and the meaning of the given domain knowledge.
Extracting algorithms
Delegating these processes and responsibilities raises the need for a machine-
processable form. A machine (an extracting algorithm) reads a set of manifests and
tries to extract domain knowledge, which is then used to find a corpus from the set of
manifests. An advantage is that the domain experts have a lower workload in providing
knowledge: their task can change from an executing one to a controlling one, checking
the outcome of these algorithms and giving them feedback so that unseen data will be
handled better in the future. This is only the case in supervised learning,
paragraph 0. Another approach is that domain experts only correct the outcome of
a domain-knowledge-gaining algorithm.
Regular Expressions
The domain knowledge is, in this particular case, represented by regular
expressions. Regular expressions are very powerful, because they represent a search
pattern in a manifest.
This format of domain knowledge is used as input for the genetic algorithms. In
the GRID case, key-value pairs are used for analyzing log files; see table 3-1, regular
expressions. Log files mostly have a hierarchical structure containing many key-value
pairs. Using domain knowledge represented by regular expressions as input for corpus-
creating algorithms is very powerful: the corpus consists of multiple regular
expressions, and since regular expressions are search patterns, the corpus can be
queried.
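A minimal sketch of querying such a corpus, assuming an illustrative key-value log format (the keys `time`, `job` and `status` are not taken from table 3-1):

```python
import re

# An illustrative corpus: a structured set of regular expressions, each
# capturing the value of one key-value pair in the log format.
corpus = {
    "timestamp": re.compile(r"time=(\S+)"),
    "job":       re.compile(r"job=(\S+)"),
    "status":    re.compile(r"status=(\S+)"),
}

def query(corpus, roi, manifest):
    """Query the corpus for one ROI, returning every value in the manifest."""
    return corpus[roi].findall(manifest)

manifest = ("time=12:01:05 job=grid-442 status=FAILED\n"
            "time=12:01:09 job=grid-443 status=OK")
statuses = query(corpus, "status", manifest)   # -> ["FAILED", "OK"]
```

Because each entry is itself a search pattern, querying the corpus reduces to running one of its regular expressions over the manifest.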
Also, some common information retrieval problems are usually not encountered
in conventional database systems, such as unstructured documents, approximate search
based on keywords, and the concept of relevance. Due to the huge quantity of text
information, information retrieval has found many applications. There exist many
information retrieval systems, such as online library catalog systems, online document
management systems, and the more recently developed Web search engines.
Information Extraction
The information extraction method identifies key words and relationships within
the text. It does this by looking for predefined sequences in the text, a process called
pattern matching. The software infers the relationships between all the identified places,
people, and times to provide the user with meaningful information.
This technology is very useful when dealing with large volumes of text.
Traditional data mining assumes that the information being "mined" is already in the
form of a relational database. Unfortunately, for many applications, electronic
information is only available in the form of free natural language documents rather than
structured databases
In addition to the classic stop list, we use three stop-word creation methods
motivated by Zipf's law: removing the most frequent words (TF-High), removing
words that occur once, i.e. singleton words (TF1), and removing words with low
inverse document frequency (IDF).
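A minimal sketch of the three Zipf-motivated stop-word creation methods, on an illustrative toy collection:

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and the dog",
]
tokenized = [d.split() for d in docs]
tf = Counter(w for doc in tokenized for w in doc)

# TF-High: the most frequent words overall become stop words.
tf_high = {w for w, _ in tf.most_common(2)}

# TF1: singleton words (collection frequency 1) become stop words.
tf1 = {w for w, c in tf.items() if c == 1}

# Low-IDF: words appearing in every document have idf = 0 and become stop words.
n = len(tokenized)
df = Counter(w for doc in tokenized for w in set(doc))
idf = {w: math.log(n / df[w]) for w in df}
low_idf = {w for w, v in idf.items() if v == 0.0}
```

The cut-offs (top-2 words, idf exactly 0) are illustrative; in practice they are tuned per collection.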
D. Stemming
This method is used to identify the root/stem of a word. For example, the words
connect, connected, connecting, connections all can be stemmed to the word "connect"
[6]. The purpose of this method is to remove various suffixes, to reduce the number of
words, to have accurately matching stems, and to save time and memory space.
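A toy suffix-stripping stemmer illustrates the idea; the suffix list below is an illustrative assumption, and a real system would use a full algorithm such as the Porter stemmer:

```python
def stem(word, suffixes=("ations", "ation", "ings", "ing", "ions", "ion",
                         "ed", "s")):
    """Toy suffix stripper: remove the first matching suffix, keeping at
    least a three-letter stem. Not the Porter algorithm."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["connect", "connected", "connecting", "connections"]
stems = [stem(w) for w in words]   # all reduce to "connect"
```

Ordering the suffixes longest-first matters: otherwise "connections" would lose only its trailing "s" instead of the full "ions".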
TOKENIZATION
Tokenization is a critical activity in any information retrieval model. It
segregates all the words, numbers, and other characters in a given document;
these identified words, numbers, and other characters are called tokens. Along with
token generation, this process also evaluates the frequency of each token in the
input documents. The phases of the tokenization process are as follows. In pre-
processing, the set of all documents is gathered and passed to the word
extraction phase, in which all words are extracted. In the next phase all infrequent
words are listed and removed, for example words having a frequency of less than
two. The intermediate results are passed to the stop-word removal phase, which
removes those English words that are useless in information retrieval, known as
stop words. Examples of stop words include "the, as, of, and, or, to", etc. This
phase is essential in tokenization because of its advantages: it reduces the size of
the indexing file and improves overall efficiency and effectiveness.
The next phase in tokenization is stemming. The stemming phase is used to extract
the sub-part, called the stem or root, of a given word. For example, the words continue,
continuously, and continued can all be rooted to the word continue. The main role of
stemming is to remove various suffixes, resulting in a reduced number of words,
exactly matching stems, minimized storage requirements, and maximized efficiency
of the IR model.
On completion of the stemming process, the next step is to count the
frequency of each word. Information retrieval works on the output of this tokenization
process to produce the most relevant results for the user. For
example, consider a document containing the text "This is an Information
retrieval model and it is widely used in the data mining application areas. In our project
there is team of 2 members. Our project name is IR". If this document is passed to
the tokenization process, the output separates the words this, is, an, information,
etc. Numbers are likewise separated from other words, and finally the tokens are
returned with their occurrence counts in the given document.
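The phases described above can be sketched as a small pipeline; the stop-word list and the frequency threshold are illustrative assumptions:

```python
from collections import Counter

# Illustrative stop-word list (a real system would use a full list).
STOP_WORDS = {"the", "as", "of", "and", "or", "to", "is", "an", "it", "in"}

def tokenize(document, min_freq=1):
    """Word extraction -> infrequent-word removal -> stop-word removal,
    returning each surviving token with its frequency."""
    words = document.lower().split()
    freq = Counter(words)
    kept = [w for w in words
            if freq[w] >= min_freq and w not in STOP_WORDS]
    return Counter(kept)

doc = ("This is an Information retrieval model and it is widely used "
       "in the data mining application areas")
tokens = tokenize(doc)
```

A stemming step (see the previous section) would normally be inserted between stop-word removal and the final frequency count.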
Doc1 Military is a good option for a career builder for youngsters. Military is not
covering only defense it also includes IT sector and its various forms are
Army, Navy, and Air force. It satisfies the sacrifice need of youth for their
country.
Doc2 Cricket is the most popular game in India. In crocket a player uses a bat to hit
the ball and scoring runs. It is played between two teams; the team scoring
maximum runs will win the game.
Doc3 Science is the essentiality of education, what we are watching with our eyes
Happening non-happening all include science. Various scientists working on
different topics help us to understand the science in our lives. Science is
continuous evolutionary study, each day something new is determined.
Doc4 Engineering makes the development of any country, engineers are
manufacturing beneficial things day by day, as the professional engineers of
software develops programs which reduces man work, civil engineers gives
their knowledge to construction to form buildings, hospitals etc. Everything
can be controlled by computer systems nowadays.
Phase 2:
In this phase, all the words are extracted from these four documents as shown below:
Name: doc1
[Military, is, a, good, option, for, a, career, builder, for, youngsters, Military, is, not,
covering, only, defense, it, also, includes, IT, sector, and, its, various, forms, are, Army,,
Navy,, and, Air, force., It, satisfies, the, sacrifice, need, of, youth, for, their, country.]
Name: doc2
[Cricket, is, the, most, popular, game, in, India., In, crocket, a, player, uses, a,
bat, to, hit, the, ball, and, scoring, runs., It, is, played, between, two, teams;, the, team,
scoring, maximum, runs, will, win, the, game.]
Name: doc3
[Science, is, the, essentiality, of, education,, what, we, are, watching, with, our,
eyes, happening, non-happening, all, include, science., Various, scientists, working, on,
different, topics, help, us, to, understand, the, science, in, our, lives., Science, is,
continuous, evolutionary, study,, each, day, something, new, is, determined.]
Name: doc4
[Engineering, makes, the, development, of, any, country,, engineers, are,
manufacturing, beneficial, things, day, by, day,, as, the, professional, engineers, of,
software, develops, programs, which, reduces, man, work,, civil, engineers, gives, their,
knowledge, to, construction, to, form, buildings,, hospitals, etc., Everything, can, be,
controlled, by, computer, systems, nowadays.]
Name: doc1
[militari, good, option, for, career, builder, for, youngster, militari, not, cover,
onli, defens, it, also, includ, it, sector, it, variou, form, ar,armi, navi, air, forc, it, satisfi,
sacrific, need, youth, for, their, country]
Name: doc2
[Cricket, most, popular, game, in, India, in, crocket, player, us, bat, to, hit, ball,
score, run, it, plai, between, two, team, team, score, maximum, run, win, game]
Fig: Document Tokenization Graph
Name: doc3
[scienc, essenti, educ, what, we, ar, watch,our, ey, happen, non, happen, all,
includ, scienc,variou, scientist, wor, on, differ, topic, help, to, understand, scienc, in,
our, live, scienc, continu, evolutionari, studi, each, dai, someth, new, determin]
Name: doc4
[engine, make, develop, ani, countri, engin, ar, manufactur, benefici, thing, dai,
by, dai, profession, engin, softwar, develop, program, which, reduc, man, work, civil,
engin, give, their, knowledg, to, construct, to, form, build, hospit, etc, everyth, can, be,
control, by, comput, system, nowadai]
CHAPTER 4
BIODIVERSITY
Fig: Types of diversity: Genetic (inner), Species (middle), and Ecosystem (outer)
Genetic diversity refers to the genetic variation and heritable traits within
organisms. All species are related to other species through a genetic network, but the
variety of genetic properties and features makes creatures differ in their morphological
characteristics. Genetic diversity applies to all living organisms having inheritance of
genes, including the amount of DNA per cell and chromosome structures. Genetic
diversity is an important factor for the adaptation of populations to changing
environments and the resistance to certain types of diseases. For species, a higher
genetic variation implies less risk. It is also essential for species evolution.
Species diversity refers to the variety of living organisms within an ecosystem,
a habitat or a region. It is evaluated by considering two factors: species richness and
species evenness. The first corresponds to the number of different species present in a
community per unit area, and the second to the relative abundance of each species in the
geographical area. Both factors are evaluated according to the size of populations or
biomass of each species in the area. Recent studies have shown relationships between
diversity within species and diversity among species. Species diversity is the most
visible part of biodiversity.
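Richness and evenness are commonly combined in the Shannon diversity index H' and Pielou's evenness J = H'/ln(richness); a minimal sketch with illustrative abundance counts:

```python
import math

def shannon_index(abundances):
    """Shannon diversity H' = -sum(p_i * ln p_i) over species proportions."""
    total = sum(abundances)
    return -sum((n / total) * math.log(n / total) for n in abundances if n > 0)

def pielou_evenness(abundances):
    """Evenness J = H' / ln(richness); 1.0 means all species equally abundant."""
    richness = sum(1 for n in abundances if n > 0)
    return shannon_index(abundances) / math.log(richness)

# Same richness (4 species), very different evenness.
even   = pielou_evenness([25, 25, 25, 25])   # -> 1.0
skewed = pielou_evenness([97, 1, 1, 1])      # much lower
```

Both communities have the same species richness, so the difference in the index comes entirely from the relative abundance (evenness) factor.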
Ecosystem diversity refers to the variety of landscape of ecosystems in each
region of the world. An ecosystem is a combination of communities - associations of
species - of living organisms with the physical environment in which they live (e.g., air,
water, mineral soil, topography, and climate). Ecosystems vary in size and in every
geographic region there is a complex mosaic of interconnected ecosystems. Ecosystems
are environments with a balanced state of natural elements (water, plants, animals,
fungi, microbes, molecules, climate, etc.). Ecosystem diversity embraces the variety of
habitats and environmental parameters that occur within a region. Ecologists consider
biodiversity according to the three following interdependent primary characteristics:
Ecosystems composition, i.e., the variety and richness of inhabiting species, ecosystems
structure, i.e., the physical and three dimensional patterns of life forms, and ecosystems
function, i.e., biogeochemical cycles and evolving environmental conditions. Even
though many application tools have been developed, evaluating biodiversity still faces
difficulties due to the complexity of precisely evaluating these parameters. Hence, the
number of species officially identified around the world is only 1.7 to 2 million,
while the total number is estimated at 5 to 30 million.
Environmental Issues
Methodology
This chapter deals with the methodological steps adopted in the research study.
The research procedures followed are described under the following headings:
The most prominent information providers that propose data and knowledge
repositories used for biodiversity and environmental studies are shown in Figure 2. Each
one provides contents related to different domains and categories depicted by the edges
in the schema. These providers propose contents of different types (documents,
databases, meta-data, spatio-temporal, etc.), in different categories (data, knowledge
and documentations) and for different domains of application. Two main categories of
resources are considered in this classification diagram. The Data category corresponds
to resources depicting facts about species (animal and plants) and environmental
conditions in specific areas. The Frameworks category corresponds to resources
depicting both tacit and formal knowledge related to biodiversity and environment
analytical application domains. Several providers, such as ESABII (East and Southeast
Asia Biodiversity Information Initiative), IUCN (International Union for Conservation
of Nature), and OECD (Organisation for Economic Co-operation and Development),
give access to both data and frameworks resources.
In the Biodiversity Policy knowledge domain, information concerns issues of
principles, regulations and agreements on biodiversity. Among these resources, we can
cite BioNET, CBD, ESABII, IUCN and UNEP, which provide information used and
followed up by botanists, biologists and researchers worldwide. For example,
the Agreement of the United Nations Decade on Biodiversity 2011-2020 aims to
support and implement the Strategic Plan for Biodiversity.
The Environment domain refers to research results and repositories in this
area, aimed at scientists or other people who want to know the status of the environment
on Earth, especially biologists working on this issue. The major actors in this category
are BHL, BioNET, CBD, ECNC, IUCN, KNEU and UNEP, organizations which
regularly publish reports and results of studies on domains related to biodiversity and
the environment, such as Protected Areas Presented as Natural Solutions to Global
Environmental Challenges at RIO+20 published by the IUCN, or the Global
Environment Facility (GEF) published by the CBD organization.
The Economics domain corresponds to information from scientists about the
status of economic development, based on the effects and values of ecosystem services.
The foremost information providers in this category are CBD, IUCN, OECD and
TEEB. Among reports and studies published by these organizations, we can cite
Restoring World's Forests Proven to Boost Local Economies and Reduce Poverty
by IUCN and Green Growth and Sustainable Development by OECD for example.
In the Health/Society domain, repositories supply knowledge about the natural
resources provided by ecosystem services and their effects. This information has
been produced, and its validity demonstrated, by researchers from worldwide
organizations such as BHL, CBD, IUCN, KNEU and UNEP. These organizations
provide summaries and proposals such as, for example, Human Health and
Biodiversity, published by the CBD, Towards the Blue Society, published by the IUCN,
and Action for Biodiversity: Towards a Society in Harmony with Nature, also
published by the CBD.
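As a minimal sketch, the four knowledge domains described above can be encoded as a simple mapping from domain to its major information providers. The domain and provider names are taken from the text; the data structure and the helper function are hypothetical illustrations, not part of any of the cited initiatives:

```python
# Hypothetical encoding of the classification described above:
# each knowledge domain maps to the set of its major providers.
BIODIVERSITY_DOMAINS = {
    "Biodiversity Policy": {"BioNET", "CBD", "ESABII", "IUCN", "UNEP"},
    "Environment": {"BHL", "BioNET", "CBD", "ECNC", "IUCN", "KNEU", "UNEP"},
    "Economics": {"CBD", "IUCN", "OECD", "TEEB"},
    "Health/Society": {"BHL", "CBD", "IUCN", "KNEU", "UNEP"},
}

def domains_for(provider):
    """Return, sorted, the knowledge domains a provider contributes to."""
    return sorted(domain for domain, providers in BIODIVERSITY_DOMAINS.items()
                  if provider in providers)

# IUCN contributes to all four domains; OECD only to Economics.
print(domains_for("IUCN"))
print(domains_for("OECD"))
```

A query such as `domains_for("IUCN")` makes the overlap between domains visible: some providers (IUCN, CBD) span the whole classification, while others (OECD, TEEB, ECNC) are specific to one domain.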
Fig: Classification of biodiversity information