Shumet (Computer Science) Introduction February 2023 1 / 13
Motivation: Why data mining?
Our capacity to generate and collect data has increased rapidly over the last several decades.
Huge amounts of data are available at our fingertips because of:
− The computerization of business & scientific transactions
− Advances in data collection tools, ranging from scanned text & image platforms to satellite remote sensors
− Popular use of the WWW as a global information system
Although all of these have made it easier to create, collect, and store all types of data, the result is a data explosion (i.e., information overload).
Data explosion is the problem of an enterprise holding huge amounts of data, generated by automated data collection tools, in databases, data warehouses and other information repositories.
As the volume of data increases, the proportion of it that people can understand decreases substantially; in other words, the larger the data gets, the harder it becomes to analyze.

Motivation: Why data mining?
The true value is not in storing the data, but in our ability to extract useful reports and to find interesting trends & correlations that support the decisions and policies made by businesses
− We are drowning in data, but starving for knowledge!
− Too much data & too little knowledge!
To bridge the gap between the large volumes of data to be analyzed and the useful information and knowledge needed for decision making, computerized methods known as Data Mining (DM) or Knowledge Discovery in Databases (KDD) have emerged.
Faced with such enormous volumes of data, human analysts with no special tools can no longer make sense of them.
− Data mining can automate the process of finding patterns & relationships in raw data, and the results can be used for decision support.
If we know how to reveal the valuable knowledge hidden in raw data, data might be one of our most valuable assets.
− Data mining is a tool that uses retrospective analysis to extract diamonds of knowledge from historical data & to predict future outcomes.

What is Data Mining?
Different scholars have given different definitions of DM
According to Berry and Linoff (2000) and Han and Kamber (2006), DM is the process of extracting or mining knowledge from large amounts of data in order to discover meaningful patterns and rules
DM is a technology that uses various techniques to discover hidden information or patterns (i.e., non-trivial, novel, valid, understandable, and useful) from data in large databases (e.g., a data warehouse)
The term DM is a misnomer, as it does not directly describe what the process does
Data mining would be better described as knowledge mining from data
− Anyway, we will use the term with this understanding
− Alternative names: knowledge discovery from databases (KDD), knowledge extraction, data/pattern analysis, information harvesting, business intelligence, etc.

What is Data Mining?
Data mining is an interdisciplinary subfield of computer science and
statistics with an overall goal to extract information (with intelligent methods) from a dataset and transform the information into a comprehensible structure for further use.
Data Mining vs Machine Learning:
− The two are different from each other, although they have some commonalities
− They differ in purpose: data mining is used to extract information from data, while machine learning is used to teach a computer to learn and work from past experience without being explicitly programmed.
Data Mining Functionalities/Goals
Data mining functionalities specify the kinds of patterns to be found in a data mining task
Generally, data mining tasks can be broadly classified as
− Descriptive (unsupervised)
− Predictive (supervised)
Descriptive data mining tasks characterize the general properties of the data in a database
Predictive data mining tasks perform inference on the current data in order to make predictions about future data
− they permit the value of one variable to be predicted from the known values of other variables
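A minimal sketch of a predictive task, assuming the simplest possible case: one numeric variable predicted from one other variable with a least-squares line. The variable names and the data (advertising spend vs. sales) are hypothetical and are not taken from the slides.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Predict the unknown variable from the known one."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical historical data (illustration only).
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

model = fit_line(spend, sales)   # learned from the current data
forecast = predict(model, 5.0)   # prediction for a value not seen before
```

A descriptive task, by contrast, would only summarize `spend` and `sales` (for example their means) without producing a model for unseen values.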
Data Mining Functionalities/Goals
The supervised (predictive) data mining functionalities include
− Classification
− Regression
− Time series
− Prediction
The unsupervised (descriptive) data mining functionalities include
− Association rule discovery
− Cluster analysis
− Summarization
− Sequence discovery
Data Mining Functionalities/Goals
Classification
− The DM system learns from examples (the data) how to partition or classify the data, i.e., it formulates classification rules
− Example: customer database in a bank
  Question: Is a new customer applying for a loan a good investment or not?
  Typical rule formulated: if STATUS = married and INCOME > 10000 and HOUSE OWNER = yes then INVESTMENT TYPE = good
Cluster Analysis
− Finds groups of objects such that the objects within a group are similar to each other but different from the objects in other groups
− For example, in a company the classes of items for sale may include electronic and non-electronic equipment.
Association
− Rules that associate one attribute of a relation with another
− Set-oriented approaches are the most efficient means of discovering such rules
− Example: supermarket database
  72% of all the records that contain items A and B also contain item C; the specific percentage of occurrences, 72, is the confidence factor of the rule

Relationship between Data Mining & Data Warehousing
Data Warehouse: a centralized data repository which can be queried for business benefit.
DM is the automated process of analyzing large data sets to find patterns, relationships and trends and ultimately to generate business insights, which are then used to solve challenges and identify new opportunities
A data warehouse, where the data from the various sources is combined and stored, allows data mining to be used throughout the organization
Data warehousing and data mining are the cornerstones of modern business decisions.
Data Warehousing makes it possible to
− extract archived operational data
− overcome inconsistencies between different legacy data formats
− integrate data throughout an enterprise regardless of location or format
− incorporate additional or expert information

The KDD/DM Process Model
The term KDD stands for Knowledge Discovery in Databases.
The main objective of the KDD process is to extract information from data in the context of large databases.
− It does this by using Data Mining algorithms to identify what is deemed knowledge.
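As one concrete instance of an algorithm identifying a pattern, the confidence factor from the supermarket example under Association can be computed as the share of records containing A and B that also contain C. The sketch below is a minimal illustration, not a full association rule miner; the function names and the six baskets are made up, so the result is 75% here and not the 72% quoted on the slide.

```python
def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    n_antecedent = support_count(transactions, antecedent)
    if n_antecedent == 0:
        return 0.0  # rule never applies, so report zero confidence
    n_both = support_count(transactions, set(antecedent) | set(consequent))
    return n_both / n_antecedent


# Hypothetical supermarket baskets (illustration only).
baskets = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B"},
]

# Four baskets contain A and B; three of those also contain C.
rule_confidence = confidence(baskets, {"A", "B"}, {"C"})
```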
Overview of the steps constituting the KDD process
Steps constituting the KDD process
Choosing and creating a dataset on which discovery will be performed
− This involves finding out what data is available, obtaining the important data, and then integrating all the data for knowledge discovery into one dataset, including the attributes that will be considered for the process.
− This step is important because Data Mining learns and discovers from the available data.
− This is the evidence base for building the models. If some significant attributes are missing, the entire study may be unsuccessful; in this respect, the more attributes that are considered, the better.
Preprocessing
− In this step, data reliability is improved.
− It includes data cleaning, for example handling missing values and removing noise or outliers.
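The two cleaning operations named above can be sketched as follows. This is only one possible choice, made for illustration: missing values are filled with the median of the observed values, and outliers are dropped when they lie more than `k` median absolute deviations from the median. The function names, the threshold `k=3.0`, and the income figures are assumptions, not part of the slides.

```python
from statistics import median


def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def drop_outliers(values, k=3.0):
    """Keep values within k median absolute deviations of the median."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return list(values)  # no spread to measure against, keep everything
    return [v for v in values if abs(v - center) <= k * mad]


# Hypothetical monthly incomes with one missing entry and one extreme value.
incomes = [3000, 3200, None, 2900, 3100, 50000]

filled = fill_missing(incomes)   # None is replaced by the median, 3100
cleaned = drop_outliers(filled)  # 50000 is removed as an outlier
```

Whether an extreme value is noise or a genuine, interesting record depends on the project, so in practice the threshold is a decision to revisit during evaluation.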
Steps constituting the KDD process
Data Transformation
− In this stage, appropriate data for Data Mining is prepared and developed.
− Techniques here include dimension reduction (for example, feature selection and extraction, and record sampling).
− This step can be essential for the success of the entire KDD process, and it is typically very project-specific.
Selecting the Data Mining algorithm
− This stage involves choosing the specific technique to be used for searching for patterns, which may involve multiple inducers.
− For example, in the trade-off between precision and understandability, the former is better with neural networks, while the latter is better with decision trees.
Evaluation
− In this step, we assess and interpret the mined patterns and rules, and their reliability, with respect to the objective defined in the first step.
− Here we also consider the preprocessing steps with regard to their impact on the Data Mining algorithm results.

Steps constituting the KDD process
Using the discovered knowledge
− At this stage, we are ready to incorporate the knowledge into another system for further action.
− The knowledge becomes active in the sense that we may make changes to the system and measure the effects.
− The success of this step determines the effectiveness of the whole KDD process.
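As an illustration of building discovered knowledge into another system, the bank loan rule from the classification example could be encoded as a function that a loan application system calls. This is a sketch under assumptions: the `Applicant` type, the field names, and the `"unknown"` fallback are hypothetical, since the slide's rule only says when an applicant is a good investment.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    status: str        # e.g. "married" or "single"
    income: float
    house_owner: bool


def investment_type(applicant):
    """Apply the mined rule:
    if STATUS = married and INCOME > 10000 and HOUSE OWNER = yes
    then INVESTMENT TYPE = good.
    """
    if (applicant.status == "married"
            and applicant.income > 10000
            and applicant.house_owner):
        return "good"
    # Assumption: the rule is silent about everyone else.
    return "unknown"


# Hypothetical new customers applying for a loan.
first = investment_type(Applicant("married", 12000, True))
second = investment_type(Applicant("single", 12000, True))
```

Logging these decisions alongside later repayment outcomes would be one way to measure the effect of the change.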
Exercise
What is Data mining and what is it used for?
What are the main reasons for DM to attract a great deal of attention in the information industry in recent years?
Explain the main functionalities of DM.
Explain the main steps in the KDD process.