Professional Documents
Culture Documents
(Download PDF) Discovering Knowledge in Data An Introduction To Data Mining Second Edition Daniel T Larose Online Ebook All Chapter PDF
(Download PDF) Discovering Knowledge in Data An Introduction To Data Mining Second Edition Daniel T Larose Online Ebook All Chapter PDF
https://textbookfull.com/product/data-mining-a-tutorial-based-
primer-second-edition-roiger/
https://textbookfull.com/product/data-mining-and-big-data-ying-
tan/
https://textbookfull.com/product/introduction-to-java-
programming-and-data-structures-comprehensive-version-y-daniel-
liang/
https://textbookfull.com/product/introduction-to-java-
programming-and-data-structures-comprehensive-version-12th-
edition-y-daniel-liang/
Introduction to Java Programming and Data Structures
Comprehensive Version 12th Edition Y Daniel Liang
https://textbookfull.com/product/introduction-to-java-
programming-and-data-structures-comprehensive-version-12th-
edition-y-daniel-liang-2/
https://textbookfull.com/product/bioinformation-discovery-data-
to-knowledge-in-biology-pandjassarame-kangueane/
https://textbookfull.com/product/data-mining-yee-ling-boo/
https://textbookfull.com/product/mobile-data-mining-yuan-yao/
https://textbookfull.com/product/computational-intelligence-in-
data-mining-himansu-sekhar-behera/
Wiley Series on Methods and
Applications in Data Mining
Daniel T. Larose, Series Editor
Second Edition
DISCOVERING
KNOWLEDGE IN DATA
An Introduction to Data Mining
Daniel T. Larose • Chantal D. Larose
DISCOVERING
KNOWLEDGE IN DATA
WILEY SERIES ON METHODS AND APPLICATIONS
IN DATA MINING
DISCOVERING
KNOWLEDGE IN DATA
An Introduction to Data Mining
DANIEL T. LAROSE
CHANTAL D. LAROSE
Copyright © 2014 by John Wiley & Sons, Inc. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or
by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as
permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior
written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to
the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax
(978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should
be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in
preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials. The advice and strategies contained herein may not be suitable
for your situation. You should consult with a professional where appropriate. Neither the publisher nor
author shall be liable for any loss of profit or any other commercial damages, including but not limited to
special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our
Customer Care Department within the United States at (800) 762-2974, outside the United States at (317)
572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may
not be available in electronic formats. For more information about Wiley products, visit our website at
www.wiley.com.
10 9 8 7 6 5 4 3 2 1
CONTENTS
PREFACE xi
v
vi CONTENTS
8.6 Comparison of the C5.0 and Cart Algorithms Applied to Real Data 180
The R Zone 183
References 184
Exercises 185
Hands-On Analysis 185
INDEX 309
PREFACE
Today, there are a variety of terms used to describe this process, including analyt-
ics, predictive analytics, big data, machine learning, and knowledge discovery in
databases. But these terms all share in common the objective of mining actionable
nuggets of knowledge from large data sets. We shall therefore use the term data
mining to represent this process throughout this text.
Humans are inundated with data in most fields. Unfortunately, these valuable data,
which cost firms millions to collect and collate, are languishing in warehouses and
repositories. The problem is that there are not enough trained human analysts avail-
able who are skilled at translating all of these data into knowledge, and thence up
the taxonomy tree into wisdom. This is why this book is needed.
The McKinsey Global Institute reports:1
There will be a shortage of talent necessary for organizations to take advantage of big
data. A significant constraint on realizing value from big data will be a shortage of talent,
particularly of people with deep expertise in statistics and machine learning, and the
managers and analysts who know how to operate companies by using insights from big
data . . . . We project that demand for deep analytical positions in a big data world could
exceed the supply being produced on current trends by 140,000 to 190,000 positions.
. . . In addition, we project a need for 1.5 million additional managers and analysts in
the United States who can ask the right questions and consume the results of the analysis
of big data effectively.
This book is an attempt to help alleviate this critical shortage of data analysts.
Discovering Knowledge in Data: An Introduction to Data Mining provides readers
with:
r The models and techniques to uncover hidden nuggets of information,
1 Big data: The next frontier for innovation, competition, and productivity, by James Manyika et al.,
Mckinsey Global Institute, www.mckinsey.com, May, 2011. Last accessed March 16, 2014.
xi
xii PREFACE
r The insight into how the data mining algorithms really work, and
r The experience of actually performing data mining on large data sets.
The plethora of new off-the-shelf software platforms for performing data mining has
kindled a new kind of danger. The ease with which these graphical user interface
(GUI)-based applications can manipulate data, combined with the power of the
xiv PREFACE
formidable data mining algorithms embedded in the black box software currently
available, makes their misuse proportionally more hazardous.
Just as with any new information technology, data mining is easy to do badly. A
little knowledge is especially dangerous when it comes to applying powerful models
based on large data sets. For example, analyses carried out on unpreprocessed data
can lead to erroneous conclusions, or inappropriate analysis may be applied to data
sets that call for a completely different approach, or models may be derived that are
built upon wholly specious assumptions. These errors in analysis can lead to very
expensive failures, if deployed.
The best way to avoid these costly errors, which stem from a blind black-box approach
to data mining, is to instead apply a “white-box” methodology, which emphasizes
an understanding of the algorithmic and statistical model structures underlying the
software.
Discovering Knowledge in Data applies this white-box approach by:
r Walking the reader through the various algorithms;
r Providing examples of the operation of the algorithm on actual large data sets;
r Testing the reader’s level of understanding of the concepts and algorithms;
r Providing an opportunity for the reader to do some real data mining on large
data sets; and
r Supplying the reader with the actual R code used to achieve these data mining
results, in The R Zone.
Algorithm Walk-Throughs
Discovering Knowledge in Data walks the reader through the operations and nuances
of the various algorithms, using small sample data sets, so that the reader gets a true
appreciation of what is really going on inside the algorithm. For example, in Chapter
10, Hierarchical and K-Means Clustering, we see the updated cluster centers being
updated, moving toward the center of their respective clusters. Also, in Chapter
11, Kohonen Networks, we see just which kind of network weights will result in a
particular network node “winning” a particular record.
data disk, so that the reader may follow the analytical steps on their own, using data
mining software of their choice.
The R Zone
R is a powerful, open-source language for exploring and analyzing data sets (www.r-
project.org). Analysts using R can take advantage of many freely available packages,
routines, and GUIs, to tackle most data analysis problems. In most chapters of this
book, the reader will find The R Zone, which provides the actual R code needed to
obtain the results shown in the chapter, along with screen shots of some of the output.
The R Zone is written by Chantal D. Larose (Ph.D. candidate in Statistics, University
of Connecticut, Storrs), daughter of the author, and R expert, who uses R extensively
in her research, including research on multiple imputation of missing data, with her
dissertation advisors, Dr. Dipak Dey and Dr. Ofer Harel.
1.6 WHAT TASKS CAN DATA MINING ACCOMPLISH? 9
Consider Figure 1.2, where we have a scatter plot of the graduate GPAs against
the undergraduate GPAs for 1000 students. Simple linear regression allows us to
find the line that best approximates the relationship between these two variables,
according to the least squares criterion. The regression line, indicated as a straight
line increasing from left to right in Figure 1.2 may then be used to estimate the
graduate GPA of a student, given that student’s undergraduate GPA.
Here, the equation of the regression line (as produced by the statistical package
Minitab, which also produced the graph) is ŷ = 1.24 + 0.67 x. This tells us that the
estimated graduate GPA ŷ equals 1.24 plus 0.67 times the student’s undergrad GPA.
For example, if your undergrad GPA is 3.0, then your estimated graduate GPA is
ŷ = 1.24 + 0.67 (3) = 3.25. Note that this point (x = 3.0, ŷ = 3.25) lies precisely on
the regression line, as do all of the linear regression predictions.
The field of statistical analysis supplies several venerable and widely used esti-
mation methods. These include, point estimation and confidence interval estimations,
simple linear regression and correlation, and multiple regression. We examine these
3.25
GPA Graduate
2 3 4
GPA - Undergraduate
Figure 1.2 Regression estimates lie on the regression line.
Another random document with
no related content on Scribd:
paarse en blankhelle, blauwe gebloemten van àl wild
gewas, in ’t heet-zonnige gras. En stiller al, op den
doodstillen Zondag, heilig goud-droomden de
goudene lommerende oprijpaadjes; blakerden de
tuindershoeven, als van den weg af, naar achter
geschoven, brandend tusschen de felle zegeningen,
van zijïg groen, tusschen fluweel malsch-groen, ’t
stugge, geripste, ’t gladde, ijle, zacht-trillende, even
wuivende groen, ’t licht gepolijste, vonkende, ’t
goudlicht-afbrandende groen, ’t groen van al soorten
boomen en bladeren.
Kees wou niet meer dan één borrel. Ouë Gerrit dronk
zoetjes, in vadsige lekkerheid. Maar Dirk alleen
genoot, roerde z’n brandewijntje met suiker, een na
een, kneep half dicht z’n oogen van pure lekkerigheid.
Vijf na elkaar had ie ’r ingeslagen, dat z’n lobbeskijk al
zachtjes-an in guitig spel van licht verfonkelen ging.
Hij lolde met de kerels aan ’t biljard, zei hoe ze stooten
nemen moesten en waar ze heenrobbelen zouden.—
—Nou.. waa’ a ’t? ikke.. ik.. ke bin.. bin d’r ook feur..
feu.. ffeur.. ffeur.. den donder.. feur! feu.. feur de
koning Fillim Drie! Laife.… lai.. fe Fillum dr.. drie.…
daa’s.. daa’s main weut!.… hee?.… enne.. bi.… bi.…
jai.… d’r.… nooit.… nie.… nie iet.… in.… in de wind
weust.… hee?
Kees stapte ’t eerst uit. Hij had ’t benauwd; hij kon die
kroeglucht niet langer inademen, omdat ie ’r nooit
kwam. Ouë Gerrit was lichtelijk draaierig van z’n
dronkje, voelde zich blij weer op straat te staan. Alleen
Dirk was weer teruggestrompeld, bleef plakken bij de
biljarters.
[Inhoud]
II.
—Wà’ nou? hai je je aige weer prikt? paa’s d’r dan òp!
jai ken vast gain bloed misse.… zei Kees norsch.
[Inhoud]
III.
Volgenden dag scheen de zon weer fèl. Ouë Gerrit
was aan ’t frambozen plukken, met Piet en twee
helpertjes. Tusschen de nauwe struiken hurkte de
Ouë, stram en ingebukt vóór de frambozen, één voor
één de vruchten in de mandjes stapelend.—Boven z’n
zilverbaard, die in kronkelig zijïge karteltjes, tusschen
takkengewirwar òpschemerde, bloeiden de
frambozen, rozepurperend, bedauwd met ’n adem van
dof paarsrood, vochtig en donzig. Achter Ouë’s hielen,
rijden mandjes in zonnevlam, wonderteer doorglansd
van licht. Zoeter dan aardbei geurden ze rond, als
rozendroom van sprookjes-prinsesje. Rondom, in de
lucht vervloeide ’n teere geur van rozenzoet, wáár de
mandjes stonden.
—Heé Ouë, je ben d’r in màin regel, nou soek jai d’r
de mooie uit, en ikke hou ’t kriel!
—Brom jai teuge sint Jan! blai da’ tie d’r weust is! de
sjèlotjes binne d’r krek uit! Ook nie te bestig! Is
aldegoar ’n rooi noa niks!… en de rooie kool f’rvloekt!
En nou.. hai je d’r nog spruitkool loate sette tusschen
vaif regels boone! Sel main d’r ’n stelletje worre.—