
FEATURE ENGINEERING FOR
MACHINE LEARNING AND
DATA ANALYTICS
Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series
Series Editor: Vipin Kumar

RapidMiner
Data Mining Use Cases and Business Analytics Applications
Markus Hofmann and Ralf Klinkenberg
Computational Business Analytics
Subrata Das
Data Classification
Algorithms and Applications
Charu C. Aggarwal
Healthcare Data Analytics
Chandan K. Reddy and Charu C. Aggarwal
Accelerating Discovery
Mining Unstructured Information for Hypothesis Generation
Scott Spangler
Event Mining
Algorithms and Applications
Tao Li
Text Mining and Visualization
Case Studies Using Open-Source Tools
Markus Hofmann and Andrew Chisholm
Graph-Based Social Media Analysis
Ioannis Pitas
Data Mining
A Tutorial-Based Primer, Second Edition
Richard J. Roiger
Data Mining with R
Learning with Case Studies, Second Edition
Luís Torgo
Social Networks with Rich Edge Semantics
Quan Zheng and David Skillicorn
Large-Scale Machine Learning in the Earth Sciences
Ashok N. Srivastava, Ramakrishna Nemani, and Karsten Steinhaeuser
Data Science and Analytics with Python
Jesus Rogel-Salazar
Feature Engineering for Machine Learning and Data Analytics
Guozhu Dong and Huan Liu

For more information about this series please visit:


https://www.crcpress.com/Chapman--HallCRC-Data-Mining-and-Knowledge-Discovery-Series/book-series/CHDAMINODIS
FEATURE ENGINEERING FOR
MACHINE LEARNING AND
DATA ANALYTICS

Edited by
Guozhu Dong and Huan Liu
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks
does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion
of MATLAB® software or related products does not constitute endorsement or sponsorship by The
MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2018 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20180301

International Standard Book Number-13: 978-1-1387-4438-7 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
To my family, especially baby Hazel [G. D.]

To my family [H. L.]

To all contributing authors [G. D. & H. L.]


Contents

Preface

Contributors

1 Preliminaries and Overview
Guozhu Dong and Huan Liu
1.1 Preliminaries
1.1.1 Features
1.1.2 Feature Engineering
1.1.3 Machine Learning and Data Analytic Tasks
1.2 Overview of the Chapters
1.3 Beyond this Book
1.3.1 Feature Engineering for Specific Data Types
1.3.2 Feature Engineering on Non-Data-Specific Topics

I Feature Engineering for Various Data Types

2 Feature Engineering for Text Data
Chase Geigle, Qiaozhu Mei, and ChengXiang Zhai
2.1 Introduction
2.2 Overview of Text Representation
2.3 Text as Strings
2.4 Sequence of Words Representation
2.5 Bag of Words Representation
2.5.1 Term Weighting
2.5.2 Beyond Single Words
2.6 Structural Representation of Text
2.6.1 Semantic Structure Features
2.7 Latent Semantic Representation
2.7.1 Latent Semantic Analysis
2.7.2 Probabilistic Latent Semantic Analysis
2.7.3 Latent Dirichlet Allocation
2.8 Explicit Semantic Representation
2.9 Embeddings for Text Representation
2.9.1 Matrix Factorization for Word Embeddings
2.9.2 Neural Networks for Word Embeddings
2.9.3 Document Representations from Word Embeddings
2.10 Context-Sensitive Text Representation
2.11 Summary

3 Feature Extraction and Learning for Visual Data
Parag S. Chandakkar, Ragav Venkatesan, and Baoxin Li
3.1 Classical Visual Feature Representations
3.1.1 Color Features
3.1.2 Texture Features
3.1.3 Shape Features
3.2 Latent Feature Extraction
3.2.1 Principal Component Analysis
3.2.2 Kernel Principal Component Analysis
3.2.3 Multidimensional Scaling
3.2.4 Isomap
3.2.5 Laplacian Eigenmaps
3.3 Deep Image Features
3.3.1 Convolutional Neural Networks
3.3.1.1 The Dot-Product Layer
3.3.1.2 The Convolution Layer
3.3.2 CNN Architecture Design
3.3.3 Fine-Tuning Off-the-Shelf Neural Networks
3.3.4 Summary and Conclusions

4 Feature-Based Time-Series Analysis
Ben D. Fulcher
4.1 Introduction
4.1.1 The Time Series Data Type
4.1.2 Time-Series Characterization
4.1.3 Applications of Time-Series Analysis
4.2 Feature-Based Representations of Time Series
4.3 Global Features
4.3.1 Examples of Global Features
4.3.2 Massive Feature Vectors and Highly Comparative Time-Series Analysis
4.4 Subsequence Features
4.4.1 Interval Features
4.4.2 Shapelets
4.4.3 Pattern Dictionaries
4.5 Combining Time-Series Representations
4.6 Feature-Based Forecasting
4.7 Summary and Outlook

5 Feature Engineering for Data Streams
Yao Ma, Jiliang Tang, and Charu Aggarwal
5.1 Introduction
5.2 Streaming Settings
5.3 Linear Methods for Streaming Feature Construction
5.3.1 Principal Component Analysis for Data Streams
5.3.2 Linear Discriminant Analysis for Data Streams
5.4 Non-Linear Methods for Streaming Feature Construction
5.4.1 Locally Linear Embedding for Data Streams
5.4.2 Kernel Learning for Data Streams
5.4.3 Neural Networks for Data Streams
5.4.4 Discussion
5.5 Feature Selection for Data Streams with Streaming Features
5.5.1 The Grafting Algorithm
5.5.2 The Alpha-Investing Algorithm
5.5.3 The Online Streaming Feature Selection Algorithm
5.5.4 Unsupervised Streaming Feature Selection in Social Media
5.6 Feature Selection for Data Streams with Streaming Instances
5.6.1 Online Feature Selection
5.6.2 Unsupervised Feature Selection on Data Streams
5.7 Discussions and Challenges
5.7.1 Stability
5.7.2 Number of Features
5.7.3 Heterogeneous Streaming Data

6 Feature Generation and Feature Engineering for Sequences
Guozhu Dong, Lei Duan, Jyrki Nummenmaa, and Peng Zhang
6.1 Introduction
6.2 Basics on Sequence Data and Sequence Patterns
6.3 Approaches to Using Patterns in Sequence Features
6.4 Traditional Pattern-Based Sequence Features
6.5 Mined Sequence Patterns for Use in Sequence Features
6.5.1 Frequent Sequence Patterns
6.5.2 Closed Sequential Patterns
6.5.3 Gap Constraints for Sequence Patterns
6.5.4 Partial Order Patterns
6.5.5 Periodic Sequence Patterns
6.5.6 Distinguishing Sequence Patterns
6.5.7 Pattern Matching for Sequences
6.6 Factors for Selecting Sequence Patterns as Features
6.7 Sequence Features Not Defined by Patterns
6.8 Sequence Databases
6.9 Concluding Remarks

7 Feature Generation for Graphs and Networks
Yuan Yao, Hanghang Tong, Feng Xu, and Jian Lu
7.1 Introduction
7.2 Feature Types
7.3 Feature Generation
7.3.1 Basic Models
7.3.2 Extensions
7.3.3 Summary
7.4 Feature Usages
7.4.1 Multi-Label Classification
7.4.2 Link Prediction
7.4.3 Anomaly Detection
7.4.4 Visualization
7.5 Conclusions and Future Directions
7.6 Glossary

II General Feature Engineering Techniques

8 Feature Selection and Evaluation
Yun Li and Tao Li
8.1 Introduction
8.2 Feature Selection Frameworks
8.2.1 Search-Based Feature Selection Framework
8.2.2 Correlation-Based Feature Selection Framework
8.3 Advanced Topics for Feature Selection
8.3.1 Stable Feature Selection
8.3.2 Sparsity-Based Feature Selection
8.3.3 Multi-Source Feature Selection
8.3.4 Distributed Feature Selection
8.3.5 Multi-View Feature Selection
8.3.6 Multi-Label Feature Selection
8.3.7 Online Feature Selection
8.3.8 Privacy-Preserving Feature Selection
8.3.9 Adversarial Feature Selection
8.4 Future Work and Conclusion

9 Automating Feature Engineering in Supervised Learning
Udayan Khurana
9.1 Introduction
9.1.1 Challenges in Performing Feature Engineering
9.2 Terminology and Problem Definition
9.3 A Few Simple Approaches
9.4 Hierarchical Exploration of Feature Transformations
9.4.1 Transformation Graph
9.4.2 Transformation Graph Exploration
9.5 Learning Optimal Traversal Policy
9.5.1 Feature Exploration through Reinforcement Learning
9.6 Finding Effective Features without Model Training
9.6.1 Learning to Predict Useful Transformations
9.7 Miscellaneous
9.7.1 Other Related Work
9.7.2 Research Opportunities
9.7.3 Resources

10 Pattern-Based Feature Generation
Yunzhe Jia, James Bailey, Ramamohanarao Kotagiri, and Christopher Leckie
10.1 Introduction
10.2 Preliminaries
10.2.1 Data and Patterns
10.2.2 Patterns for Non-Transactional Data
10.3 Framework of Pattern-Based Feature Generation
10.3.1 Pattern Mining
10.3.2 Pattern Selection
10.3.3 Feature Generation
10.4 Pattern Mining Algorithms
10.4.1 Frequent Pattern Mining
10.4.2 Contrast Pattern Mining
10.5 Pattern Selection Approaches
10.5.1 Post-Processing Pruning
10.5.2 In-processing Pruning
10.6 Pattern-Based Feature Generation
10.6.1 Unsupervised Mapping Functions
10.6.2 Supervised Mapping Functions
10.6.3 Feature Generation for Sequence Data and Graph Data
10.6.4 Comparison with Similar Techniques
10.7 Pattern-Based Feature Generation for Classification
10.7.1 Problem Statement
10.7.2 Direct Classification in the Pattern Space
10.7.3 Indirect Classification in the Pattern Space
10.7.4 Connection with Stacking Technique
10.8 Pattern-Based Feature Generation for Clustering
10.8.1 Clustering in the Pattern Space
10.8.2 Subspace Clustering
10.9 Conclusion

11 Deep Learning for Feature Representation
Suhang Wang and Huan Liu
11.1 Introduction
11.2 Restricted Boltzmann Machine
11.2.1 Deep Belief Networks and Deep Boltzmann Machine
11.2.2 RBM for Real-Valued Data
11.3 AutoEncoder
11.3.1 Sparse Autoencoder
11.3.2 Denoising Autoencoder
11.3.3 Stacked Autoencoder
11.4 Convolutional Neural Networks
11.4.1 Transfer Feature Learning of CNN
11.5 Word Embedding and Recurrent Neural Networks
11.5.1 Word Embedding
11.5.2 Recurrent Neural Networks
11.5.3 Gated Recurrent Unit
11.5.4 Long Short-Term Memory
11.6 Generative Adversarial Networks and Variational Autoencoder
11.6.1 Generative Adversarial Networks
11.6.2 Variational Autoencoder
11.7 Discussion and Further Readings

III Feature Engineering in Special Applications

12 Feature Engineering for Social Bot Detection
Onur Varol, Clayton A. Davis, Filippo Menczer, and Alessandro Flammini
12.1 Introduction
12.2 Social Bot Detection
12.2.1 Holistic Approach
12.2.2 Pairwise Account Comparison
12.2.3 Egocentric Analysis
12.3 Online Bot Detection Framework
12.3.1 Feature Extraction
12.3.1.1 User-Based Features
12.3.1.2 Friend Features
12.3.1.3 Network Features
12.3.1.4 Content and Language Features
12.3.1.5 Sentiment Features
12.3.1.6 Temporal Features
12.3.2 Possible Directions for Feature Engineering
12.3.3 Feature Analysis
12.3.4 Feature Selection
12.3.4.1 Feature Classes
12.3.4.2 Top Individual Features
12.4 Conclusions
12.5 Glossary

13 Feature Generation and Engineering for Software Analytics
Xin Xia and David Lo
13.1 Introduction
13.2 Features for Defect Prediction
13.2.1 File-level Defect Prediction
13.2.1.1 Code Features
13.2.1.2 Process Features
13.2.2 Just-in-time Defect Prediction
13.2.3 Prediction Models and Results
13.3 Features for Crash Release Prediction for Apps
13.3.1 Complexity Dimension
13.3.2 Time Dimension
13.3.3 Code Dimension
13.3.4 Diffusion Dimension
13.3.5 Commit Dimension
13.3.6 Text Dimension
13.3.7 Prediction Models and Results
13.4 Features from Mining Monthly Reports to Predict Developer Turnover
13.4.1 Working Hours
13.4.2 Task Report
13.4.3 Project
13.4.4 Prediction Models and Results
13.5 Summary

14 Feature Engineering for Twitter-Based Applications
Sanjaya Wijeratne, Amit Sheth, Shreyansh Bhatt, Lakshika Balasuriya,
Hussein S. Al-Olimat, Manas Gaur, Amir Hossein Yazdavar,
Krishnaprasad Thirunarayan
14.1 Introduction
14.2 Data Present in a Tweet
14.2.1 Tweet Text-Related Data
14.2.2 Twitter User-Related Data
14.2.3 Other Metadata
14.3 Common Types of Features Used in Twitter-Based Applications
14.3.1 Textual Features
14.3.2 Image and Video Features
14.3.3 Twitter Metadata-Related Features
14.3.4 Network Features
14.4 Twitter Feature Engineering in Selected Twitter-Based Studies
14.4.1 Twitter User Profile Classification
14.4.2 Assisting Coordination during Crisis Events
14.4.3 Location Extraction from Tweets
14.4.4 Studying the Mental Health Conditions of Depressed Twitter Users
14.4.5 Sentiment and Emotion Analysis on Twitter
14.5 Twitris: A Real-Time Social Media Analysis Platform
14.6 Conclusion
14.7 Acknowledgment

Index

Preface

Feature engineering plays a vital role in big data analytics. Machine learning
and data mining algorithms cannot work without data. Little can be achieved
if there are few features to represent the underlying data objects, and the
quality of results of those algorithms largely depends on the quality of the
available features. Data can exist in various forms, such as images, text, graphs,
sequences, and time series. A common way to represent data for data analytics
is to use feature vectors. Feature engineering addresses the generation and
selection of useful features, along with several related issues.
This book is devoted to feature engineering. It covers various aspects
of feature engineering, including feature generation, feature extraction,
feature transformation, feature selection, and feature analysis and evaluation.
It presents concepts, methods, and examples, as well as applications.
Feature engineering is often data-type specific and application dependent.
This calls for multiple chapters on different data types that require specialized
feature engineering techniques to meet various data analytic needs. Hence, this
book contains chapters on feature engineering for major data types such as
texts, images, sequences, time series, graphs, streaming data, software
engineering data, Twitter data, and social media data. It also contains generic
feature generation approaches, as well as methods for generating
tried-and-tested, hand-crafted, domain-specific features.
This book contains many useful feature engineering concepts and techniques,
which are an important part of machine learning and data analytics.
They can help readers to meet their needs in multiple scenarios: (a) generate
features to represent the data when there are no features, (b) generate
effective features when (one may be concerned that) existing features are
not good/competitive enough, (c) select features when there are too many
features, (d) generate and select effective features for specific types of
applications, and (e) understand the challenges associated with, and the needed
approaches to handle, various data types. This list is certainly not exhaustive.
The first chapter is an introduction, which defines the concepts of features
and feature engineering, offers an overview of the book, and provides
pointers to topics not covered in this book. The next six chapters are devoted
to feature engineering, including feature generation, for specific data types,
namely texts, images, sequences, time series, graphs, and streaming data. The
subsequent four chapters cover generic approaches for feature engineering,
namely feature selection, feature transformation-based feature engineering,
deep learning–based feature engineering, and pattern-based feature generation
and engineering. The last three chapters discuss feature engineering for
social bot detection, software management, and Twitter-based applications,
respectively.
Getting familiar with the concepts and techniques covered in this book will
boost one’s understanding and expertise in machine learning and big data
analytics. This book can be used as a reference for data analysts, big data
scientists, data preprocessing workers, project managers, project developers,
prediction modelers, professors, researchers, graduate students, and
upper-level undergraduate students. This book can be used as the primary text for
courses on feature engineering, and as supplementary materials for courses on
machine learning, data mining, and big data analytics.
We wish to express our profound gratitude to the contributing authors of
the chapters of the book; without their expertise and dedicated efforts, this
book would not have been possible. We are grateful to Randi Cohen and Veronica
Rodriguez, who provided guidance and assistance on the publishing side of this
effort. We are indebted to Jiawei Han, Jian Pei, Nicholas Skapura, Xintao Wu,
and Junjie Zhang, who kindly suggested domain experts as potential authors,
and also to Vineeth Rakesh Mohan, who provided useful feedback
on parts of this book.

Guozhu Dong, Dayton, Ohio

Huan Liu, Phoenix, Arizona


PRINTED BY
MORRISON AND GIBB LIMITED, EDINBURGH.

FOOTNOTES:

[1] More accurately the coffee-house of the dome. Translator’s Note.
[2] The Arabs in this country keep no account of their age. The
most they can remember is that they were born the year this or
that happened.
[3] Chest.
[4] Correspondence in the Paris newspaper, the Journal des
Debats of 5th September 1893.
[5] M. Rouvier is newly appointed to Stockholm as French
Representative.
[6] Anthropologie Criminelle des Tunisiens Musulmans; Les
formes de la famille chez les premiers habitants de l’Afrique du
Nord; Exploration anthropologique de la Khroumirie.
[7] Duveyrier, Les Touareg du Nord; Captain Bissuel, Les
Touareg de l’Ouest; Largeau, Le Sahara Algerien.
[8] From Dr. E. F. Bojesen’s Handbook on Greek Antiquities.
*** END OF THE PROJECT GUTENBERG EBOOK THE CAVE
DWELLERS OF SOUTHERN TUNISIA ***

Updated editions will replace the previous one; the old editions
will be renamed.

Creating the works from print editions not protected by U.S.
copyright law means that no one owns a United States copyright
in these works, so the Foundation (and you!) can copy and
distribute it in the United States without permission and without
paying copyright royalties. Special rules, set forth in the General
Terms of Use part of this license, apply to copying and
distributing Project Gutenberg™ electronic works to protect the
PROJECT GUTENBERG™ concept and trademark. Project
Gutenberg is a registered trademark, and may not be used if
you charge for an eBook, except by following the terms of the
trademark license, including paying royalties for use of the
Project Gutenberg trademark. If you do not charge anything for
copies of this eBook, complying with the trademark license is
very easy. You may use this eBook for nearly any purpose such
as creation of derivative works, reports, performances and
research. Project Gutenberg eBooks may be modified and
printed and given away—you may do practically ANYTHING in
the United States with eBooks not protected by U.S. copyright
law. Redistribution is subject to the trademark license, especially
commercial redistribution.

START: FULL LICENSE


THE FULL PROJECT GUTENBERG LICENSE
PLEASE READ THIS BEFORE YOU DISTRIBUTE OR USE THIS WORK

To protect the Project Gutenberg™ mission of promoting the
free distribution of electronic works, by using or distributing this
work (or any other work associated in any way with the phrase
“Project Gutenberg”), you agree to comply with all the terms of
the Full Project Gutenberg™ License available with this file or
online at www.gutenberg.org/license.

Section 1. General Terms of Use and
Redistributing Project Gutenberg™
electronic works
1.A. By reading or using any part of this Project Gutenberg™
electronic work, you indicate that you have read, understand,
agree to and accept all the terms of this license and intellectual
property (trademark/copyright) agreement. If you do not agree to
abide by all the terms of this agreement, you must cease using
and return or destroy all copies of Project Gutenberg™
electronic works in your possession. If you paid a fee for
obtaining a copy of or access to a Project Gutenberg™
electronic work and you do not agree to be bound by the terms
of this agreement, you may obtain a refund from the person or
entity to whom you paid the fee as set forth in paragraph 1.E.8.

1.B. “Project Gutenberg” is a registered trademark. It may only
be used on or associated in any way with an electronic work by
people who agree to be bound by the terms of this agreement.
There are a few things that you can do with most Project
Gutenberg™ electronic works even without complying with the
full terms of this agreement. See paragraph 1.C below. There
are a lot of things you can do with Project Gutenberg™
electronic works if you follow the terms of this agreement and
help preserve free future access to Project Gutenberg™
electronic works. See paragraph 1.E below.
1.C. The Project Gutenberg Literary Archive Foundation (“the
Foundation” or PGLAF), owns a compilation copyright in the
collection of Project Gutenberg™ electronic works. Nearly all the
individual works in the collection are in the public domain in the
United States. If an individual work is unprotected by copyright
law in the United States and you are located in the United
States, we do not claim a right to prevent you from copying,
distributing, performing, displaying or creating derivative works
based on the work as long as all references to Project
Gutenberg are removed. Of course, we hope that you will
support the Project Gutenberg™ mission of promoting free
access to electronic works by freely sharing Project
Gutenberg™ works in compliance with the terms of this
agreement for keeping the Project Gutenberg™ name
associated with the work. You can easily comply with the terms
of this agreement by keeping this work in the same format with
its attached full Project Gutenberg™ License when you share it
without charge with others.

1.D. The copyright laws of the place where you are located also
govern what you can do with this work. Copyright laws in most
countries are in a constant state of change. If you are outside
the United States, check the laws of your country in addition to
the terms of this agreement before downloading, copying,
displaying, performing, distributing or creating derivative works
based on this work or any other Project Gutenberg™ work. The
Foundation makes no representations concerning the copyright
status of any work in any country other than the United States.

1.E. Unless you have removed all references to Project
Gutenberg:

1.E.1. The following sentence, with active links to, or other
immediate access to, the full Project Gutenberg™ License must
appear prominently whenever any copy of a Project
Gutenberg™ work (any work on which the phrase “Project
Gutenberg” appears, or with which the phrase “Project
Gutenberg” is associated) is accessed, displayed, performed,
viewed, copied or distributed:

This eBook is for the use of anyone anywhere in the United
States and most other parts of the world at no cost and with
almost no restrictions whatsoever. You may copy it, give it
away or re-use it under the terms of the Project Gutenberg
License included with this eBook or online at
www.gutenberg.org. If you are not located in the United
States, you will have to check the laws of the country where
you are located before using this eBook.

1.E.2. If an individual Project Gutenberg™ electronic work is
derived from texts not protected by U.S. copyright law (does not
contain a notice indicating that it is posted with permission of the
copyright holder), the work can be copied and distributed to
anyone in the United States without paying any fees or charges.
If you are redistributing or providing access to a work with the
phrase “Project Gutenberg” associated with or appearing on the
work, you must comply either with the requirements of
paragraphs 1.E.1 through 1.E.7 or obtain permission for the use
of the work and the Project Gutenberg™ trademark as set forth
in paragraphs 1.E.8 or 1.E.9.

1.E.3. If an individual Project Gutenberg™ electronic work is
posted with the permission of the copyright holder, your use and
distribution must comply with both paragraphs 1.E.1 through
1.E.7 and any additional terms imposed by the copyright holder.
Additional terms will be linked to the Project Gutenberg™
License for all works posted with the permission of the copyright
holder found at the beginning of this work.

1.E.4. Do not unlink or detach or remove the full Project
Gutenberg™ License terms from this work, or any files
containing a part of this work or any other work associated with
Project Gutenberg™.
1.E.5. Do not copy, display, perform, distribute or redistribute
this electronic work, or any part of this electronic work, without
prominently displaying the sentence set forth in paragraph 1.E.1
with active links or immediate access to the full terms of the
Project Gutenberg™ License.

1.E.6. You may convert to and distribute this work in any binary,
compressed, marked up, nonproprietary or proprietary form,
including any word processing or hypertext form. However, if
you provide access to or distribute copies of a Project
Gutenberg™ work in a format other than “Plain Vanilla ASCII” or
other format used in the official version posted on the official
Project Gutenberg™ website (www.gutenberg.org), you must, at
no additional cost, fee or expense to the user, provide a copy, a
means of exporting a copy, or a means of obtaining a copy upon
request, of the work in its original “Plain Vanilla ASCII” or other
form. Any alternate format must include the full Project
Gutenberg™ License as specified in paragraph 1.E.1.

1.E.7. Do not charge a fee for access to, viewing, displaying,
performing, copying or distributing any Project Gutenberg™
works unless you comply with paragraph 1.E.8 or 1.E.9.

1.E.8. You may charge a reasonable fee for copies of or
providing access to or distributing Project Gutenberg™
electronic works provided that:

• You pay a royalty fee of 20% of the gross profits you derive from
the use of Project Gutenberg™ works calculated using the
method you already use to calculate your applicable taxes. The
fee is owed to the owner of the Project Gutenberg™ trademark,
but he has agreed to donate royalties under this paragraph to
the Project Gutenberg Literary Archive Foundation. Royalty
payments must be paid within 60 days following each date on
which you prepare (or are legally required to prepare) your
periodic tax returns. Royalty payments should be clearly marked
as such and sent to the Project Gutenberg Literary Archive
Foundation at the address specified in Section 4, “Information
about donations to the Project Gutenberg Literary Archive
Foundation.”

• You provide a full refund of any money paid by a user who
notifies you in writing (or by e-mail) within 30 days of receipt that
s/he does not agree to the terms of the full Project Gutenberg™
License. You must require such a user to return or destroy all
copies of the works possessed in a physical medium and
discontinue all use of and all access to other copies of Project
Gutenberg™ works.

• You provide, in accordance with paragraph 1.F.3, a full refund of
any money paid for a work or a replacement copy, if a defect in
the electronic work is discovered and reported to you within 90
days of receipt of the work.

• You comply with all other terms of this agreement for free
distribution of Project Gutenberg™ works.

1.E.9. If you wish to charge a fee or distribute a Project
Gutenberg™ electronic work or group of works on different
terms than are set forth in this agreement, you must obtain
permission in writing from the Project Gutenberg Literary
Archive Foundation, the manager of the Project Gutenberg™
trademark. Contact the Foundation as set forth in Section 3
below.

1.F.

1.F.1. Project Gutenberg volunteers and employees expend
considerable effort to identify, do copyright research on,
transcribe and proofread works not protected by U.S. copyright
law in creating the Project Gutenberg™ collection. Despite
these efforts, Project Gutenberg™ electronic works, and the
medium on which they may be stored, may contain “Defects,”
such as, but not limited to, incomplete, inaccurate or corrupt
data, transcription errors, a copyright or other intellectual
property infringement, a defective or damaged disk or other
medium, a computer virus, or computer codes that damage or
cannot be read by your equipment.

1.F.2. LIMITED WARRANTY, DISCLAIMER OF DAMAGES -
Except for the “Right of Replacement or Refund” described in
paragraph 1.F.3, the Project Gutenberg Literary Archive
Foundation, the owner of the Project Gutenberg™ trademark,
and any other party distributing a Project Gutenberg™ electronic
work under this agreement, disclaim all liability to you for
damages, costs and expenses, including legal fees. YOU
AGREE THAT YOU HAVE NO REMEDIES FOR NEGLIGENCE,
STRICT LIABILITY, BREACH OF WARRANTY OR BREACH
OF CONTRACT EXCEPT THOSE PROVIDED IN PARAGRAPH
1.F.3. YOU AGREE THAT THE FOUNDATION, THE
TRADEMARK OWNER, AND ANY DISTRIBUTOR UNDER
THIS AGREEMENT WILL NOT BE LIABLE TO YOU FOR
ACTUAL, DIRECT, INDIRECT, CONSEQUENTIAL, PUNITIVE
OR INCIDENTAL DAMAGES EVEN IF YOU GIVE NOTICE OF
THE POSSIBILITY OF SUCH DAMAGE.

1.F.3. LIMITED RIGHT OF REPLACEMENT OR REFUND - If
you discover a defect in this electronic work within 90 days of
receiving it, you can receive a refund of the money (if any) you
paid for it by sending a written explanation to the person you
received the work from. If you received the work on a physical
medium, you must return the medium with your written
explanation. The person or entity that provided you with the
defective work may elect to provide a replacement copy in lieu
of a refund. If you received the work electronically, the person or
entity providing it to you may choose to give you a second
opportunity to receive the work electronically in lieu of a refund.
If the second copy is also defective, you may demand a refund
in writing without further opportunities to fix the problem.

1.F.4. Except for the limited right of replacement or refund set
forth in paragraph 1.F.3, this work is provided to you ‘AS-IS’,
WITH NO OTHER WARRANTIES OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR
ANY PURPOSE.

1.F.5. Some states do not allow disclaimers of certain implied
warranties or the exclusion or limitation of certain types of
damages. If any disclaimer or limitation set forth in this
agreement violates the law of the state applicable to this
agreement, the agreement shall be interpreted to make the
maximum disclaimer or limitation permitted by the applicable
state law. The invalidity or unenforceability of any provision of
this agreement shall not void the remaining provisions.

1.F.6. INDEMNITY - You agree to indemnify and hold the
Foundation, the trademark owner, any agent or employee of the
Foundation, anyone providing copies of Project Gutenberg™
electronic works in accordance with this agreement, and any
volunteers associated with the production, promotion and
distribution of Project Gutenberg™ electronic works, harmless
from all liability, costs and expenses, including legal fees, that
arise directly or indirectly from any of the following which you do
or cause to occur: (a) distribution of this or any Project
Gutenberg™ work, (b) alteration, modification, or additions or
deletions to any Project Gutenberg™ work, and (c) any Defect
you cause.

Section 2. Information about the Mission of
Project Gutenberg™
Project Gutenberg™ is synonymous with the free distribution of
electronic works in formats readable by the widest variety of
computers including obsolete, old, middle-aged and new
computers. It exists because of the efforts of hundreds of
volunteers and donations from people in all walks of life.

Volunteers and financial support to provide volunteers with the
assistance they need are critical to reaching Project
Gutenberg™’s goals and ensuring that the Project Gutenberg™
collection will remain freely available for generations to come. In
2001, the Project Gutenberg Literary Archive Foundation was
created to provide a secure and permanent future for Project
Gutenberg™ and future generations. To learn more about the
Project Gutenberg Literary Archive Foundation and how your
efforts and donations can help, see Sections 3 and 4 and the
Foundation information page at www.gutenberg.org.

Section 3. Information about the Project
Gutenberg Literary Archive Foundation
The Project Gutenberg Literary Archive Foundation is a non-
profit 501(c)(3) educational corporation organized under the
laws of the state of Mississippi and granted tax exempt status by
the Internal Revenue Service. The Foundation’s EIN or federal
tax identification number is 64-6221541. Contributions to the
Project Gutenberg Literary Archive Foundation are tax
deductible to the full extent permitted by U.S. federal laws and
your state’s laws.

The Foundation’s business office is located at 809 North 1500
West, Salt Lake City, UT 84116, (801) 596-1887. Email contact
links and up to date contact information can be found at the
Foundation’s website and official page at
www.gutenberg.org/contact

Section 4. Information about Donations to
the Project Gutenberg Literary Archive
Foundation
Project Gutenberg™ depends upon and cannot survive without
widespread public support and donations to carry out its mission
of increasing the number of public domain and licensed works
that can be freely distributed in machine-readable form
accessible by the widest array of equipment including outdated
equipment. Many small donations ($1 to $5,000) are particularly
important to maintaining tax exempt status with the IRS.

The Foundation is committed to complying with the laws
regulating charities and charitable donations in all 50 states of
the United States. Compliance requirements are not uniform
and it takes a considerable effort, much paperwork and many
fees to meet and keep up with these requirements. We do not
solicit donations in locations where we have not received written
confirmation of compliance. To SEND DONATIONS or
determine the status of compliance for any particular state visit
www.gutenberg.org/donate.

While we cannot and do not solicit contributions from states
where we have not met the solicitation requirements, we know
of no prohibition against accepting unsolicited donations from
donors in such states who approach us with offers to donate.

International donations are gratefully accepted, but we cannot
make any statements concerning tax treatment of donations
received from outside the United States. U.S. laws alone swamp
our small staff.

Please check the Project Gutenberg web pages for current
donation methods and addresses. Donations are accepted in a
number of other ways including checks, online payments and
credit card donations. To donate, please visit:
www.gutenberg.org/donate.

Section 5. General Information About Project
Gutenberg™ electronic works
Professor Michael S. Hart was the originator of the Project
Gutenberg™ concept of a library of electronic works that could
be freely shared with anyone. For forty years, he produced and
distributed Project Gutenberg™ eBooks with only a loose
network of volunteer support.

Project Gutenberg™ eBooks are often created from several
printed editions, all of which are confirmed as not protected by
copyright in the U.S. unless a copyright notice is included. Thus,
we do not necessarily keep eBooks in compliance with any
particular paper edition.

Most people start at our website which has the main PG search
facility: www.gutenberg.org.

This website includes information about Project Gutenberg™,
including how to make donations to the Project Gutenberg
Literary Archive Foundation, how to help produce our new
eBooks, and how to subscribe to our email newsletter to hear
about new eBooks.