Intelligent Decision Support System Based On Data Mining
Abstract-
In this paper, we propose an Intelligent Decision Support System based on Data
Mining (IDSSDM), which integrates several data mining techniques and considers both
structured and semi-structured data. For structured transactional data, Online Analytical
Processing (OLAP) is first used to access the data warehouse for multidimensional analysis
and primary decision support, and KDD* is used to discover association rules from massive
trading data. For semi-structured data, classification and clustering are exploited for contract
document mining, while Web usage mining is used to analyze the behavior of users in order
to extract relationships from the recorded data. Furthermore, Knowledge Discovery in
Knowledge Base (KDK) is used as the primary inference engine. As the main business
intelligence tool, the system has been adopted by the E-Commerce Center of the Ministry of
Commerce of the People's Republic of China.
Keywords- Decision Support System, IDSSDM, KDD*, SOM, KDK, Foreign Trading.
I. INTRODUCTION-
Business Intelligence is the gathering, management, and analysis of large amounts of
data on a company's customers, products, services, operations, suppliers, and partners, and all
the transactions in between. Business Intelligence applications include target marketing,
market basket analysis, customer profiling, and fraud detection in the e-commerce industry,
all of which require analyzing large volumes of shopping transaction data from electronic
storefront sites.
Intelligent Decision Support Systems (IDSS), which are computer-mediated tools that
assist managerial decision making by presenting information and interpretations for various
alternatives, are usually regarded as one of the major implementations of Business Intelligence.
Data Mining (DM), or Knowledge Discovery in Databases (KDD), is an area that has received
an increasing amount of attention in that it develops techniques and tools for the exploration
of databases in an attempt to extract the relevant and interesting hidden relationships that exist
among variables or between causes and effects. The results emerging from the data mining
community can be an important contribution to the IDSS community by providing techniques
that future IDSS will be able to utilize in making a wide range of information available for
decision makers.
Bingru Yang, Wei Song and Linna Li, School of Information Engineering, University of
Science and Technology Beijing, China.
In this paper, by utilizing several data mining techniques and considering both
structured and semi-structured data, we try to integrate DSS and DM more closely. That is,
the system provides one more knowledge source for decision making besides domain
knowledge and the inference engine. As the main Business Intelligence tool, the proposed
system has been adopted by the E-Commerce Center of the Ministry of Commerce of the
People's Republic of China (ECCMCPRC) to deal with foreign trading.
The rest of this paper is organized as follows. Section II briefly describes the proposed
system. Section III discusses the techniques for structured data mining. The techniques for
semi-structured data mining are introduced in Section IV. Section V depicts the inference
engine. Section VI compares the proposed system with related works. Section VII concludes
the paper and outlines future work.
A. Data Preprocessing
Data preprocessing is the foundation for data analysis and mining. Incorrect data integration
inevitably leads to incorrect outputs from data mining algorithms, so decision support
obtained in this way is unreliable. In fact, during the process of data mining, a large part of
the work is devoted to data preparation and the improvement of data quality. Therefore, in
order to improve the accuracy of the mining results, data preprocessing should not be
ignored. Because semi-structured data are collected from the World Wide Web, they often
contain a large amount of noise, such as advertisements, hyperlinks, etc. Data preprocessing
usually means re-processing the data: checking data integrity and consistency, smoothing
noisy data, filling in missing data, eliminating "dirty" data, and eliminating repeated records.
There are a number of common data preprocessing techniques, such as data cleaning, data
integration, data transformation, and data reduction. Data cleaning routines work to "clean"
the data by filling in missing values, smoothing noisy data, identifying or removing outliers,
and resolving inconsistencies. Binning methods and regression can be employed to remove
the noise.
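As a concrete illustration of the binning method mentioned above (a minimal sketch with
illustrative data; the function name and bin count are our own choices, not part of the
system), the following smooths a numeric attribute by replacing each value with the mean of
its equal-frequency bin:

```python
def smooth_by_bin_means(values, n_bins=3):
    # Equal-frequency binning: sort the values, split them into n_bins
    # roughly equal-sized bins, and replace each value by its bin mean.
    s = sorted(values)
    size = len(s) // n_bins
    smoothed = {}
    for b in range(n_bins):
        lo = b * size
        hi = len(s) if b == n_bins - 1 else lo + size  # last bin takes the remainder
        bin_vals = s[lo:hi]
        mean = sum(bin_vals) / len(bin_vals)
        for v in bin_vals:
            smoothed[v] = mean  # map each original value to its bin mean
    return [smoothed[v] for v in values]
```

For example, nine noisy price values split into three bins are each replaced by their bin's
mean, which smooths out small fluctuations while preserving the overall trend.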
B. Web Text Classification
There are thousands of new trading contracts in China, and categorizing them efficiently to
support decision making is far beyond the ability of the existing statistical software used by
the E-Commerce Center of the Ministry of Commerce of the People's Republic of China.
Furthermore, with the rapid development of the World Wide Web, more contract data are
collected via the Internet, so it is urgent to design efficient content-based retrieval, searching,
and filtering for the huge, semi-structured online repositories on the Internet. Text
classification is the assignment of free text documents to one or more predefined categories
based on their content; each document can be in multiple categories, exactly one, or none at
all. Text classification has many application areas, such as information management, real-time
sorting of email or files into folder hierarchies, topic identification to support topic-specific
processing operations, structured search and/or browsing, and finding documents that match
long-term standing interests or more dynamic task-based interests. In the research community
the dominant approach is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning the characteristics of the categories from a set of
pre-classified documents. This system adopts a fusion of K-NN (K Nearest Neighbor) and
SVM (Support Vector Machine). K-NN classification is an instance-based learning algorithm:
to classify a document, the K-NN algorithm ranks the document's neighbors among the
training documents and uses the class labels of the k most similar neighbors to predict the
class of the input document. SVM regards training samples as relevant and non-relevant and
aims to find the hyperplane which minimizes the loss function. A detailed introduction to
SVM can be found in the references. The fused method can be described as follows:
Input: texts to be classified
Output: texts and their corresponding categories
(1) While there are texts which have not been processed
(2) For a text to be processed
(3) K-NN is applied to it first.
(4) If the text can be classified by K-NN without doubt, it is assigned to the
corresponding category.
(5) Otherwise, SVM is applied to it.
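The fusion above can be sketched as follows (a minimal illustration with toy document
vectors, not the system's actual implementation: we interpret "classified without doubt" as a
unanimous vote among the k nearest neighbors, and the SVM is passed in as a fallback
callable rather than implemented here):

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two document vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_vote(doc, train, k):
    # Rank training documents by similarity; count labels of the top k.
    ranked = sorted(train, key=lambda t: cosine(doc, t[0]), reverse=True)
    return Counter(label for _, label in ranked[:k])

def classify(doc, train, svm_fallback, k=3):
    votes = knn_vote(doc, train, k)
    label, count = votes.most_common(1)[0]
    if count == k:            # all k neighbors agree: "classified without doubt"
        return label
    return svm_fallback(doc)  # ambiguous case: defer to the SVM classifier
```

In practice the documents would be term-weight vectors (e.g. TF-IDF) and `svm_fallback`
a trained SVM; here any callable taking a document vector will do.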
C. Web Text Clustering
The above-mentioned text classification technique requires a large number of labeled training
examples. However, in certain cases it is difficult to label them. Clustering can be applied to
these collected texts: hierarchical clusters of the unlabeled documents can be generated, and
categorization can then be carried out on this basis. Clustering via Self-Organizing Maps
(SOM) is adopted. The SOM is one of the major unsupervised artificial neural network
models and is often used to learn useful features found in the learning process. It basically
provides a way to perform cluster analysis by producing a mapping of high-dimensional
input vectors onto a two-dimensional output space while preserving topological relations as
faithfully as possible. After appropriate training iterations, similar input items are grouped
spatially close to each other, so the resulting map is capable of performing the clustering task
in a completely unsupervised fashion. Furthermore, the SOM approach is superior to other
cluster analysis methods in data mining in terms of the power of data visualization. Thus, in
this work the SOM method was adopted to produce a document cluster map for Web text
mining.
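The SOM training procedure described above can be sketched as follows (a minimal
illustration on a tiny map with our own choices of grid size, learning-rate schedule, and
Gaussian neighborhood; a real document map would use high-dimensional term vectors and
a much larger grid):

```python
import math
import random

def train_som(data, rows=2, cols=2, dim=2, iters=500, seed=0):
    # One weight vector per map node, randomly initialized.
    rng = random.Random(seed)
    w = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    for t in range(iters):
        lr = 0.5 * (1 - t / iters)                      # decaying learning rate
        radius = max(rows, cols) / 2 * (1 - t / iters) + 0.5  # shrinking neighborhood
        x = rng.choice(data)
        # Best-matching unit: node whose weights are closest to the input.
        b = min(range(len(w)),
                key=lambda i: sum((x[d] - w[i][d]) ** 2 for d in range(dim)))
        br, bc = divmod(b, cols)
        for i in range(len(w)):
            r, c = divmod(i, cols)
            d2 = (r - br) ** 2 + (c - bc) ** 2
            h = math.exp(-d2 / (2 * radius * radius))   # Gaussian neighborhood weight
            for d in range(dim):
                w[i][d] += lr * h * (x[d] - w[i][d])    # pull node toward the input
    return w

def bmu(w, x):
    # Index of the best-matching unit for input x on the trained map.
    return min(range(len(w)),
               key=lambda i: sum((x[d] - w[i][d]) ** 2 for d in range(len(x))))
```

After training, inputs from distinct clusters map to different nodes, which is what makes the
resulting map usable as a cluster visualization.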
V. INFERENCE ENGINE-
As an important part of a DSS, the common methods used for the inference engine mainly
include rule-based reasoning and case-based reasoning. For IDSSDM, Knowledge Discovery
in Knowledge Base (KDK), proposed in this paper, is used for this task. The basic ideas of
KDK are as follows.
(i) The aim of KDK is the nontrivial process of discovering new knowledge in a huge
knowledge base. This means that the key problem of the discovery process is
induction; deduction is only an aid, since it cannot always ensure new facts.
(ii) KDK can discover knowledge at a deeper level. To be specific, we go further to
discover other relations based on the existing attributes and relations; from the
logical point of view, it is important to discover relations between predicates or
function words.
(iii) Because knowledge itself may have attributes such as uncertainty, non-monotonicity,
incompleteness, etc., the KDK process will be one of complexity and multiple
solutions. It is closely related to the organization of the knowledge base as well as
the types of knowledge a user pursues, and the means of reasoning may be
associated with many different logical domains.
(iv) The knowledge discovered by KDK should be original, potentially useful, effective,
and understandable to users.
From the above description, we can see that KDK is in essence a machine learning process
whose aim is to obtain knowledge. The resource of learning is the knowledge base; the way
of learning is to combine induction with deduction; and the final output includes not only
factual knowledge but also rule knowledge. As a result, in a specific implementation, two
mining methods should be adopted.
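To make the deductive side of the engine concrete, the following minimal forward-chaining
sketch derives every fact entailed by a small rule base (the fact/rule representation is our own
illustrative choice, not the system's actual knowledge-base format; the inductive side of
KDK, which hypothesizes new rules, is not shown):

```python
def forward_chain(facts, rules):
    # facts: a set of known facts; rules: a list of (premises, conclusion) pairs.
    # Repeatedly fire any rule whose premises all hold, until nothing new is derived.
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if set(premises) <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts
```

For example, from the fact `a` and the rules `a -> b` and `b -> c`, the engine deduces both
`b` and `c`, while a rule whose premises never hold fires nothing.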
VII. CONCLUSIONS-
In this paper, an Intelligent Decision Support System based on Data Mining (IDSSDM) was
proposed. It integrates several data mining techniques and considers both structured and
semi-structured data. For structured transactional data, Online Analytical Processing (OLAP)
is first used to access the data warehouse for multidimensional analysis and primary decision
support. To uncover hidden relationships between different attributes, KDD* is used for
discovering association rules from massive trading data. For semi-structured data,
classification and clustering are exploited for contract document mining, while Web usage
mining is used to analyze the behaviors of users in order to extract relationships from the
recorded data. Furthermore, Knowledge Discovery in Knowledge Base (KDK) is used as the
primary inference engine. As the main business intelligence tool, the proposed system has
been adopted by the E-Commerce Center of the Ministry of Commerce of the People's
Republic of China (ECCMCPRC) for foreign trading. How to better integrate the main parts
of IDSSDM for real-time decision support is our future work, and the second stage of the
project between ECCMCPRC and us has just started.
( Fig-1. The mining process: data from the real database (Real DB) is preprocessed and
classified into data sub-bases, from which the mining database is established; knowledge
nodes in the basic knowledge base (Basic KB) are classified according to their attributes to
establish the reasoning structure and create the mining knowledge base. Guided by the user's
interests and needs, a heuristic coordinator performs directional mining (focusing and
derivation), hypothesis rules are acquired and evaluated by directional searching, and the
mined rules are stored in the knowledge base. A maintenance coordinator searches the
associations of knowledge nodes in the mining knowledge base to discover shortages of
knowledge and prioritize them. )