R Data Analysis without
Programming
The new edition of this innovative book, R Data Analysis without Programming, prepares readers to quickly analyze data
and interpret statistical results using R. Professor Gerbing has developed lessR, a ground-breaking method for alleviating
the challenges of R programming. The lessR package extends R, removing the need for programming. This edition expands
upon the first edition’s introduction to R through lessR, which enables the readers to learn how to organize data for analysis,
read the data into R, and generate output without performing numerous functions and programming exercises first.
With lessR, readers can select the necessary procedure and change the relevant variables with simple function calls. The
text reviews and explains basic statistical procedures with the lessR enhancements added to the standard R environment.
Using lessR, data analysis with R becomes immediately accessible to the novice user and easier to use for the experienced
user.
Highlights along with content new to this edition include:

• Explanation and Interpretation of all data analysis techniques; much more than a computer manual, this book
shows the reader how to explain and interpret the results.
• Introduces the concepts and commands reviewed in each chapter.
• Clear, relaxed writing style more effectively communicates the underlying concepts than more stilted academic
writing.
• Extensive margin notes highlight, define, illustrate, and cross-reference the key concepts. When readers encounter
a term previously discussed, the margin notes identify the page number for the initial introduction.
• Scenarios that highlight the use of a specific analysis followed by the corresponding R/lessR input, output, and an
interpretation of the results.
• Numerous examples of output from psychology, business, education, and other social sciences that demonstrate
the analysis and how to interpret results.
• Two data sets are analyzed multiple times in the book, providing continuity throughout.
• Comprehensive: A wide range of data analysis techniques are presented throughout the book.
• Integration with machine learning as regression analysis is presented from both the traditional perspective and
from the modern machine learning perspective.
• End of chapter worked problems help readers test their understanding of the concepts.
• A website at www.lessRstats.com that features the data sets referenced in both standard text and SPSS formats so
readers can practice using R/lessR by working through the text examples and worked problems, R/lessR videos to
help readers better understand the program, and more.

This book is ideal for graduate and undergraduate courses in statistics beyond the introductory course, research methods,
and/or any data analysis course, taught in departments of psychology, business, education, and other social and health sciences;
this book is also appreciated by researchers doing data analysis. Prerequisites include basic statistical knowledge,
though the concepts are explained from the beginning in the book. Previous knowledge of R is not assumed.
David Gerbing is Professor in the Applied Data Science program in the School of Business, Portland State University. He has
published extensively in a wide range of social, behavioral, and methodological journals, and contributed basic methodological
concepts to develop the two-step method for the analysis called structural equation modeling.
Taylor & Francis Group
http://taylorandfrancis.com
R Data Analysis without
Programming
Explanation and Interpretation
Second Edition

David W. Gerbing
Portland State University
Designed cover image: David W. Gerbing

Second edition published 2023


by Routledge
605 Third Avenue, New York, NY 10158

and by Routledge
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2023 David W. Gerbing

The right of David W. Gerbing to be identified as authors of this work has been asserted in accordance with sections 77 and 78 of the Copyright,
Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now
known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing
from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without
intent to infringe.

First edition published by Routledge 2014

ISBN: 978-1-032-24402-0 (hbk)


ISBN: 978-1-032-24403-7 (pbk)
ISBN: 978-1-003-27841-2 (ebk)

DOI: 10.4324/9781003278412

Publisher’s note: This book has been prepared from camera-ready copy provided by the author.
To the wonderful woman who remains my wife
even through the second edition

Rachel Maculan Sodré

Eu te amo
Contents

List of Figures

List of Tables

Preface

1 R for Data Analysis
1.1 Introduction
1.1.1 Data Analysis
1.1.2 R with lessR
1.2 Prepare R for Analysis
1.2.1 Download R
1.2.2 Download RStudio
1.2.3 R in the Cloud
1.2.4 Start R
1.2.5 Extend R
1.2.6 Access lessR
1.2.7 Get Help
1.2.8 R Functions for Analysis
1.2.9 Vectors
1.3 Data
1.3.1 Data Example I: Employee Data
1.3.2 Data Example II: Machiavellianism
1.3.3 Create a Data File
1.4 Analysis Problems

2 Read and Write Data
2.1 Quick Start
2.2 Types of Variables
2.2.1 Variables as a Concept
2.2.2 Variables in the Computer
2.3 Read Data
2.3.1 Access the Data
2.3.2 Output
2.3.3 Missing Values
2.3.4 Row Names
2.4 More Data Formats
2.4.1 lessR Data
2.4.2 SPSS, SAS, and Stata Data
2.4.3 Fixed-Width Data
2.4.4 More Options
2.5 Variable Labels
2.5.1 Definition
2.5.2 Variable Labels File
2.5.3 Variable Labels with R Functions
2.6 Write Data
2.6.1 Choose an Output Format
2.6.2 Write a Data Frame to a File
2.7 Analysis Problems

3 Manage Data
3.1 Quick Start
3.2 Categorical Variables as Factors
3.2.1 Order Levels
3.2.2 Value Labels
3.2.3 Add Levels
3.3 Transform Data
3.3.1 Arithmetic Operators
3.3.2 Mathematical Functions
3.4 Recode Data
3.4.1 Reverse Score Items
3.4.2 Missing Data
3.5 Sort Data
3.5.1 Sort by Variables
3.5.2 Sort by Other Criteria
3.6 Subset Data
3.6.1 Select Rows and/or Columns
3.6.2 Randomly Select Rows
3.7 Revise Data
3.7.1 Change an Individual Data Value
3.7.2 Change a Variable Name
3.8 Merge Data
3.8.1 Inner Join
3.8.2 Outer and Full Joins
3.8.3 Add Rows to a Data Frame
3.9 Analysis Problems

4 Categorical Variables
4.1 Quick Start
4.2 One Categorical Variable
4.2.1 Bar Chart
4.2.2 Pie Chart
4.2.3 Customization
4.2.4 Bar Chart from the Summary Table
4.2.5 Bar Chart of Deviation Scores
4.2.6 Stack the Bars across Multiple Variables
4.2.7 Generalize Beyond the One Sample
4.3 Two Categorical Variables
4.3.1 Bar Chart from Joint Frequencies
4.3.2 100% Stacked Bar Chart
4.3.3 Description with Summary Tables
4.3.4 Inferential Analysis
4.4 Analysis Problems

5 Continuous Variables
5.1 Quick Start
5.2 Histogram
5.2.1 Bins
5.2.2 Default Histogram
5.2.3 Customize the Bins
5.2.4 Smooth the Bins
5.2.5 Bandwidth
5.2.6 Cumulative Histogram
5.2.7 Histograms for All
5.3 Histogram Alternatives
5.3.1 Box Plot and Outliers
5.3.2 Violin-Box-Scatter Plot
5.4 Visualize Data over Time
5.4.1 Run Chart
5.4.2 Time series
5.5 Analysis Problems

6 Statistics
6.1 Quick Start
6.2 Types of Summary Statistics
6.2.1 Parametric Statistics
6.2.2 Order Statistics
6.2.3 Obtain the Statistics
6.2.4 Data Aggregation
6.3 Evaluate a Single Mean
6.3.1 Description
6.3.2 Basis of Inference
6.3.3 Application
6.3.4 One-Tailed vs. Two-Tailed Tests
6.4 Evaluate a Proportion
6.5 Analysis Problems

7 Compare Two Samples
7.1 Quick Start
7.2 Independent-Samples
7.2.1 Research Design for Independent-Samples
7.2.2 Example 1: Two Existing Groups
7.2.3 Description
7.2.4 Inference
7.2.5 Nonparametric Alternative
7.2.6 Example 2: Two Experimental Groups
7.3 Dependent Samples
7.3.1 Dependent-Samples t-test
7.3.2 Nonparametric Comparison
7.4 Multiple Proportions
7.5 Analysis Problems

8 Compare Multiple Samples
8.1 Quick Start
8.2 Experimental Design
8.3 One-Way Design
8.3.1 Variability
8.3.2 Example
8.3.3 Data and Input
8.3.4 Description
8.3.5 Inference
8.3.6 Search for Outliers
8.3.7 Nonparametric Alternative
8.4 Randomized Block Design
8.4.1 Example
8.4.2 Data
8.4.3 Input
8.4.4 Description
8.4.5 Inference
8.4.6 Other Output
8.4.7 Nonparametric Alternative
8.4.8 Advantage of Blocking
8.5 Analysis Problems

9 Factorial Designs
9.1 Quick Start
9.2 Two-Way Factorial Design
9.2.1 Example
9.2.2 Data
9.2.3 Input
9.2.4 Description
9.2.5 Inference
9.3 More Advanced Designs
9.3.1 Randomized Block Factorial Design
9.3.2 Split-Plot Factorial Design
9.3.3 Unbalanced Designs
9.4 Analysis Problems

10 Correlation
10.1 Quick Start
10.2 Relation of Two Numeric Variables
10.2.1 Scatterplot
10.2.2 Correlation Coefficient
10.2.3 Two Unrelated Variables
10.2.4 Two Variables Positively Related
10.2.5 Scatterplot Classification Variable
10.2.6 Bubble Plot
10.3 Correlation Matrix
10.3.1 All Numeric Variables
10.3.2 List of Variables
10.3.3 Missing Data
10.3.4 Visualizations
10.3.5 Save the Correlations
10.3.6 Cluster Analysis
10.4 Nonparametric Correlation Coefficients
10.5 Analysis Problems

11 Regression Analysis
11.1 Quick Start
11.2 Regression Models
11.2.1 Supervised Machine Learning
11.2.2 Functions
11.3 Model Estimation
11.3.1 Analysis
11.3.2 Standardization
11.3.3 Inference for the Slope
11.4 Model Fit
11.4.1 Residuals
11.4.2 Fit Indices
11.5 Prediction Intervals
11.5.1 Prediction Error
11.5.2 Predict from Existing Data
11.5.3 Predict from New Data
11.6 Outliers and Diagnostics
11.6.1 Bivariate Outliers
11.6.2 Case-Deletion Statistics
11.6.3 Predictive Residuals
11.7 Model Assumptions
11.7.1 Properties of the Residuals
11.7.2 Curvilinear Relationships
11.8 Analysis Problems

12 Multiple Regression
12.1 Quick Start
12.2 Multiple Regression Model
12.2.1 Multiple Predictor Variables
12.2.2 Partial Slope Coefficients
12.3 Model Estimation
12.3.1 Total Effects
12.3.2 Net Effects
12.4 Model Fit
12.4.1 Fit Indices
12.4.2 Outliers and Assumptions
12.5 Prediction
12.5.1 Predictive Precision
12.5.2 Training vs. Testing Data
12.5.3 Data Splitting
12.6 Model Selection
12.6.1 Collinearity
12.6.2 Best Subsets
12.6.3 Nested Models
12.7 Analysis of Covariance
12.7.1 Covariates
12.7.2 Homogeneity of Regression
12.7.3 Group Differences
12.7.4 Conclusion
12.7.5 More Advanced Designs
12.8 Analysis Problems

13 Categorical Regression Variables
13.1 Quick Start
13.2 Indicator Variables
13.2.1 Dummy Variables
13.2.2 Dummy Variable Regression
13.2.3 General Linear Model
13.3 Custom Indicator Variables
13.3.1 Contrast Matrix
13.3.2 Effects Coding Regression
13.4 Binary Logistic Regression
13.4.1 Motivation
13.4.2 Logic
13.4.3 Estimation
13.4.4 Odds Ratio
13.4.5 Fit Indices
13.4.6 Classification
13.4.7 Outliers
13.4.8 Multiple Predictors
13.5 Analysis Problems

14 Causality
14.1 Quick Start
14.2 Correlation is not Causation
14.2.1 Example
14.2.2 Real Life Consequences
14.3 Moderation
14.3.1 The Concept
14.3.2 Example
14.3.3 Manual Analysis
14.4 Mediation
14.4.1 The Concept
14.4.2 Example
14.4.3 The Indirect Effect
14.5 Path Analysis
14.6 Analysis Problems

15 Item and Factor Analysis
15.1 Quick Start
15.2 Overview of Factor Analysis
15.2.1 Latent Variables
15.2.2 Measurement Models
15.3 Exploratory Factor Analysis
15.3.1 Extraction then Rotation
15.3.2 Exploratory Analysis of Mach IV Items
15.4 Confirmatory Factor Analysis
15.4.1 Covariance Structure
15.4.2 Analysis of a Population Model
15.4.3 Proportionality
15.5 Confirmatory Analysis of Mach IV Items
15.5.1 Analysis of Model from Exploratory Analysis
15.5.2 Revised Model
15.5.3 Scale Reliability
15.5.4 Total Score Correlations
15.5.5 Beyond the Basics
15.6 Analysis Problems

References
Index
List of Figures

1.1 Three default RStudio windowpanes and optional R Source pane for
R script. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Run an R data analysis from the RStudio Source windowpane. . . . . 7
1.3 The lessR vignettes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 General procedure for functions that process data, either data analysis
or data modification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Variable names in the first row and the first eight rows of data from
the employee data table as stored in an Excel worksheet. . . . . . . . 13
1.6 (a) Choose the data validation option or circle invalid data, (b) Excel
data validation options to specify the acceptable data format in the
designated cells. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.7 Gender, Height and Weight of eight motorcyclists. . . . . . . . . . . . 20

2.1 Variable labels for Variables m01 to m04, the first four Mach IV items
stored in an Excel worksheet. . . . . . . . . . . . . . . . . . . . . . . . 35
2.2 First six rows of data for the Excel output file written by Write(). . . 38
2.3 Gender, Weight and Height of eight motorcyclists. . . . . . . . . . . . 39

3.1 Sequence of steps in a data analysis project. . . . . . . . . . . . . . . . 41

4.1 Default grayscale bar chart for a single categorical variable. . . . . . . 64


4.2 Default grayscale pie chart in the form of a ring chart for categorical
variable Dept. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3 Horizontal bar chart with custom interior and exterior colors. . . . . . 66
4.4 Worksheet data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.5 Bar chart of the deviation score of Salary for each department from
the mean Salary of all the departments. . . . . . . . . . . . . . . . . . 68
4.6 Stacked single-bar bar charts for the 20 items of the Mach IV scale. . 69
4.7 Default grayscale bar charts for two categorical variables, stacked (left)
and unstacked (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.8 Worksheet data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.9 Default grayscale bar charts for two categorical variables, stacked (left)
and 100% stacked (right). . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.10 General syntax of a call to pivot(). . . . . . . . . . . . . . . . . . . . 77

5.1 Example of a bin defined over the range of data values from $50,000
to $60,000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2 The assignment of data value $57,358 to its corresponding bin. . . . . 84
5.3 Default grayscale histogram. . . . . . . . . . . . . . . . . . . . . . . . 85
5.4 Histogram with specified bins and default starting value. . . . . . . . 86

5.5 Histogram with specified bin widths and starting point for the bins. . 87
5.6 Histogram with superimposed general density curve and customized
histogram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.7 Cumulative and regular histogram with specified bins. . . . . . . . . . 89
5.8 Generalized box plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.9 Default grayscale lessR box plot. . . . . . . . . . . . . . . . . . . . . 91
5.10 Integrated violin, box, and scatterplot, called here the VBS plot. . . . 92
5.11 Distribution of the mean student ratings over the 20 terms. . . . . . . 93
5.12 Run chart of the mean rating each term for a course. . . . . . . . . . 94
5.13 Apple stock price. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.14 Stock price of three companies on one panel. . . . . . . . . . . . . . . 96
5.15 Trellis plot of stock price of three companies on three distinct panels. 96

6.1 The mean as the balance point of a distribution. . . . . . . . . . . . . 101
6.2 Two distributions of n = 7 with the same range, 180, and mean, 100,
but different standard deviations, 53.5 (A) vs 85.1 (B). . . . . . . . . 101
6.3 Perfect normal distribution of Y with standard scores Z showing rela-
tionship to the population standard deviation, σ, and the population
mean, µ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.4 Three distributions with varying amounts of positive skew assessed by
G1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.5 Histograms of Height and Weight from the BodyMeas data set. . . . . 109
6.6 Histogram for m07, the 7th Mach IV item, not reverse scored. . . . . . 111
6.7 Standard error of the sample mean as a function of sample size, for
an arbitrary standard deviation of the data equal to 10. . . . . . . . . 114
6.8 Two hypothetical normal sampling distributions of m with the same
population mean, µ, but different standard errors (deviations). . . . . 114
6.9 Density plot from ttest() of m07, the 7th item on the Mach IV scale,
with sample mean and hypothesized mean and two effect sizes, not
reverse scored. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

7.1 Density curves for men and women on Mach IV subscale Deceit. . . . 132
7.2 Density curves for distributed practice vs. massed practice on test Score. . . 140
7.3 Density plot of the differences of weight loss, in pounds. . . . . . . . . 144

8.1 Scatterplot and means for each group of task completion Time data. . 155
8.2 Tukey family-wise confidence intervals for each of three pairwise com-
parisons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.3 Plot of the four data values for each person. . . . . . . . . . . . . . . . 164
8.4 Scatter plot of the data with the values fitted by the model. . . . . . . 167

9.1 Grayscale plot of the three cell means of task completion Time, each
for Dosage and Difficulty. . . . . . . . . . . . . . . . . . . . . . . . . . 174
9.2 A split-plot design with one between-groups factor (2 levels) and one
within-groups factor (4 levels) with 7 different participants for each of
the two levels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
9.3 Cell means of response latency for two Food levels across four different
Supplements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

10.1 Default grayscale scatterplot of Years Employed with Annual Salary. . 195
10.2 Scatterplot of Years Employed and Annual Salary and a loess curve
with a dark background. . . . . . . . . . . . . . . . . . . . . . . . . . . 195
10.3 Scatterplot of mean-deviated Years and Salary, centered at <0,0>. . . 196
10.4 The scatterplot with the 0.95 data ellipse for 250 values of variables X
and Y that exhibit almost no relationship, correlating only r = 0.016
in this sample. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
10.5 Strong linear relationship demonstrated with the scatterplot of Years
Employed and Annual Salary for r = 0.85 with the 0.95 data ellipse. . 200
10.6 Scatterplot of Years and Salary with different plot symbols for men
and women and respective least-squares fit lines. . . . . . . . . . . . . 202
10.7 A Trellis scatterplot of Years and Salary plotted on separate panels
for each value of Gender. . . . . . . . . . . . . . . . . . . . . . . . . . 203
10.8 Scatterplot in the form of a bubble plot of Mach IV items m06 and m07. . . 204
10.9 Annotated correlation matrix of six Mach IV items from Correlation(). . . 207
10.10 Scatterplot matrix of six Mach IV items. . . . . . . . . . . . . . . . . 209
10.11 Heat map of the correlation matrix of six Mach IV items. . . . . . . . 209
10.12 Hierarchical cluster dendrogram for the first 16 Mach IV items, anno-
tated to suggest a cluster structure. . . . . . . . . . . . . . . . . . . . 211
10.13 Heat map of reordered correlation matrix for the first 16 Mach IV
items, annotated to suggest a cluster structure. . . . . . . . . . . . . . 212
10.14 Pearson and Spearman correlation coefficients for a perfect exponential
distribution (left) and the same distribution with added noise (right). 214

11.1 Predict from information entered into a prediction equation, the model. . . 219
11.2 Two linear functions with different slopes, $b_1 = 2$ (left) and $b_1 = 0.5$
(right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
11.3 Scatterplot with best-fitting least-squares regression line from the brief
output form of the Regression() function, reg_brief(). . . . . . . . 223
11.4 Logic of the confidence interval based on the hypothetical distribution
of $b_1$ across repeated samples, where this one obtained $b_1$ is less than $\beta_1$. . . 226
11.5 Scatterplot with data value <10, 92681>, fitted value <10, 75206>,
and associated residual, 17475, for $X_i = 10$. . . . . . . . . . . . . . . . 228
11.6 Scatterplot with the null regression line and highlighted residuals. . . 230
11.7 Default Regression() scatterplot of Salary from Years experience
plus the regression line, the prediction intervals of Salary (light gray),
and the confidence intervals of the fitted value (dark gray). . . . . . . 234
11.8 Regression line with the created outlier removed (dashed) compared
to the original regression line with all the points (solid). . . . . . . . . 237
11.9 Scatterplot of fitted values and residuals. . . . . . . . . . . . . . . . . 241
11.10 Display of severe heterogeneity. . . . . . . . . . . . . . . . . . . . . . . 242
11.11 Distribution of the residuals. . . . . . . . . . . . . . . . . . . . . . . . 243
11.12 Typical non-linear functional relationships. . . . . . . . . . . . . . . . 244
11.13 Linear fit to a quadratic relationship. . . . . . . . . . . . . . . . . . . 244
11.14 Quadratic fit function for Y as a function of X. . . . . . . . . . . . . . 245
11.15 Fit to the linearized data (left) and residuals (right). . . . . . . . . . . 246

12.1 Scatterplot matrix of response Reading ability with predictors Verbal
aptitude, days Absent, and family Income. . . . . . . . . . . . . . . . 256
12.2 Diagnostic plot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
12.3 Tension between overfitting and underfitting a model: A too simple
model never fits well, but an overly complex model only fits well on
training data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
12.4 Analysis of covariance regression lines for Salary in terms of Years
for each level of Gender based on the assumption of parallel lines. . . 270
12.5 Regression lines separately computed for Salary and Years at each
level of Gender as the relationships exist in the data. . . . . . . . . . . 271

13.1 Scatter plot and regression line from Regression() of the categorical
variable Gender with Salary. . . . . . . . . . . . . . . . . . . . . . . . 280
13.2 Least-squares regression fit and scatter plot of Hand circumference
and binary response variable Gender scored 0 and 1. . . . . . . . . . . 286
13.3 The sigmoid curve for the probability of Male for various Hand sizes
imposed over the scatter plot of Hand circumference in inches and
Gender. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
13.4 Classification status overlaid on the scatterplot of Gender with Hand
size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292

14.1 Fire data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
14.2 Path diagram that specifies Severity of the Fire causing the Number
of Fire Trucks at the fire as well as the total Fire Damage. . . . . . . 304
14.3 Fire data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
14.4 Path diagram that specifies hormone replacement therapy directly
diminishes coronary heart disease. . . . . . . . . . . . . . . . . . . . . 305
14.5 Path diagram that specifies hormone replacement therapy increases
coronary heart disease. . . . . . . . . . . . . . . . . . . . . . . . . . . 305
14.6 Interaction plot from Regression() for Task Anxiety as a function of
Task Importance for three levels of the moderator academic self-efficacy. . . 309
14.7 Path diagram of the direct influence of predictor X on response Y. . . 311
14.8 Partial mediation path diagram for predictor X, response Y, and
mediator M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
14.9 Path diagram of full mediation of predictor X on response Y via
mediator M. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
14.10 Path diagram of predictor VAC on response DVB without mediation. 314
14.11 Analysis of mediation model consistent with data from Jessor and
Jessor (1977). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315

15.1 For the 20-item Mach IV correlation matrix, scree plots for the succes-
sive eigenvalues (left) and for the difference of successive eigenvalues
(right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
15.2 Specified population two-factor multiple indicator measurement model,
three items per factor. . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
15.3 Heat map of 6-variable correlation matrix with communalities in the
diagonal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
15.4 Annotated output correlation matrix with communalities in the diagonal
of all measured and latent variables in the analysis. . . . . . . . . 336
15.5 Annotated proportionality matrix. . . . . . . . . . . . . . . . . . . . . 338
15.6 Annotated proportionality coefficients for the items that define the
first two factors in the confirmatory factor analysis of Mach IV. . . . 340
List of Tables

1.1 The Christie and Geis (1970) Mach IV scale. . . . . . . . . . . . . . . 16

2.1 Median read speed of three file formats over 100 trials of each Read()
function call, with the file size in MB. . . . . . . . . . . . . . . . . . . 37

3.1 Some R mathematical functions applied to the variable x. . . . . . . . 47
3.2 Logical operators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

6.1 A distribution of values, $Y$, with $m_Y = 6$ and deviation scores, $Y - m_Y$. . . 101
6.2 Conceptual expression of the variance, $s^2$, and the standard deviation, $s$. . . 102
6.3 Computation of the sample mean, $m_Y = 6$, sample variance, $s^2 = 6.5$,
and sample standard deviation, $s_Y = 2.55$. . . . . . . . . . . . . . . . . 103

9.1 Subset of maze running data organized by block. . . . . . . . . . . . . 181

11.1 Synonymous names for the response and predictor variables, respec-
tively, in a regression analysis. . . . . . . . . . . . . . . . . . . . . . . . 219

13.1 Examples of categorical variables and one associated level (category). . 278
13.2 Data table of four people for categorical variable Gender and its two
corresponding indicator variables, here named GenderM and GenderW. . . 278
13.3 Probabilities of being a Man, fitted, and of being a Woman, 1 -
fitted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299

14.1 The three variables in the mediation analysis with higher ATD scores
indicating greater intolerance of deviant behavior. . . . . . . . . . . . 312

15.1 Respective correlations of Item X1 and also Item X2 across all six
items in the model and their corresponding ratios, proportions. . . . . 337

Preface

Content Overview
Chapter 1 introduces R and then lessR and shows how to download them from
the worldwide network of R servers. The chapter then provides an example of data
analysis, shows how to structure data for analysis, and discusses general issues for
using R and lessR. Chapter 2 shows how to read the data from a computer file into R.
Then the data often must be modified before analysis begins, the topic of Chapter 3.
This editing can include changing individual data values, assigning missing data codes,
transforming, recoding, and sorting the data values for a variable, and sub-setting
and merging data tables.

Chapters 4 and 5 show how to obtain the most basic data analyses, counting how
often each data value or groups of similar data values occurred. Chapter 4 explains
bar charts and related techniques for counting the values of a categorical variable, one
with non-numeric data values. Also included is the analysis of two or more variables
in the form of bar charts for two variables and the associated cross-tabulation tables.
Chapter 5 does the same for numeric variables, which include the histogram and
related analyses, the scatter plot for one variable, the box plot, the density plot, and
the time plot, plus the associated summary statistics.

Chapter 6 explains the analysis of a single sample’s mean and other statistics. Chapter
7 focuses on the mean difference across two samples, or the direct analysis of differences
across two related samples. These analyses are based on the t-test of the mean, the
independent samples t-test, and the dependent samples t-test. The non-parametric
alternatives are also provided. Chapter 8 extends this material to compare many
groups with the analysis of variance, ANOVA. The designs considered are the one-way
independent groups and the randomized blocks analysis, and their associated post-hoc
analyses.

Chapter 9 introduces factorial designs, the two-way ANOVA, and the concept of inter-
action. Also illustrated are the randomized block factorial design and the split-plot
factorial. Effect sizes are an integral part of each discussion. The analysis of a two-way
unbalanced design is also presented.

Chapter 10 presents correlational analysis and scatterplots, and the correlation matrix
and hierarchical cluster analysis for organizing the variables that form the correlation
matrix. Chapters 11 through 14 introduce various aspects of regression analysis. The
subject of Chapter 11 is regression analysis of a linear model of a response variable
with a single predictor variable. The discussion includes estimation of the model,
inferential evaluation of the estimated coefficients, evaluation of fit, outliers and
influential observations, and prediction intervals. The chapter ends with a discussion
of assumptions and curvilinear relationships. Chapter 12 extends this discussion to
multiple regression, the analysis of models with multiple predictor variables, including
the analysis of covariance, ANCOVA. Chapter 13 extends the analysis to models with
categorical predictor variables or a binary categorical response variable, the topic
of logistic regression. Chapter 14 shows some aspects of using regression models
to investigate causality, moderation regression models, and models to analyze the
indirect effects of mediation, extending to an introduction to path analysis.

The topic of Chapter 15 is factor analysis, both its exploratory and confirmatory
versions. The primary examples are item analysis, the analysis of items that form
a scale, such as from an attitude survey, and the corresponding scale reliabilities.
The concepts of factor extraction and factor rotation are presented for exploratory
factor analysis. Within this context, a linkage between exploratory and confirmatory
factor analysis for item analysis is provided, as well as a discussion of the covariance
structure that underlies the confirmatory analysis. Analysis of scale development
from published data on Machiavellianism appears throughout the chapter, with the
final outcome of the sub-scales derived with confirmatory factor analysis.

Personal Acknowledgments
I thank Jason T. Newsom, a colleague and Professor at the Department of Psychology,
Portland State University. Already a successful author at Routledge/Taylor &
Francis, Jason recommended me to the Routledge team for the first edition. From
that introduction, this book developed. Jason also provided two different forums for
presenting my work on lessR at Portland State University, his informal seminars
on quantitative topics for faculty and graduate students, and his annual (pre-Covid)
June workshops on quantitative topics.

Also, I thank Carlos Martin-Vide, of Rovira i Virgili University, Tarragona, Spain, for
inviting me to participate in several sessions of the International School on Big Data.
Presenting the seminars provided further incentive to develop lessR, and served as a
source of stimulation for perceptive feedback from students and faculty. In particular,
I am grateful for the input from Paul Bliese, of the University of South Carolina,
Columbia, South Carolina, USA, who recommended using the lattice package from
Deepayan Sarkar as a basis for Trellis graphics upon which lessR now relies.

My students at Portland State University also deserve recognition for their role in the
development of lessR and this book. During the last 14 years many undergraduate
and graduate students have used and contributed much feedback to the project from
the beginning of its development through the current version.

Finally, I would like to thank the three reviewers who provided comprehensive,
insightful reviews of the first edition: J. Patrick Gray, University of Wisconsin
– Milwaukee, Agnieszka Kwapisz, Montana State University, and Bertolt Meyer,
University of Zurich, Switzerland. The quality of their reviews illustrates how
whatever one person does is facilitated by the thoughtful critiques of others who are
also knowledgeable in the subject. Their reviews shaped the format of this book.
Chapter 1

R for Data Analysis

1.1 Introduction

1.1.1 Data Analysis


We ask many questions for which we seek answers. Some of our inquiries are about
how things work in our world. On average, do men make more than women managers
at a particular company? Which of the two pain relievers is the most effective? What
type of career should the counselor suggest based on the client’s replies to a job
aptitude survey? Do people trust each other in general? This book explains how to
do and interpret analyses for answering questions such as these.

empirical: Information based on observations acquired from our five senses.

We seek empirical answers to these questions, answers based on our observations
of the world around us: What we see, hear, touch, taste, and smell. Encode these
observations as data, the measurements of different people, organizations, places,
things, and events. Data is the bedrock, the foundation, of our scientific knowledge
of the world and the lives that we live.

data: Measurements of different people or whatever the unit of analysis.

The natural variation of measurements motivates their analysis. Two different people
have different heights, place differing amounts of trust in others, have different blood
pressures, and earn different salaries. Height, trustfulness, blood pressure, and income
are some of the many variables amenable to measurement. GPA and number of credits
completed provide additional examples of measured variables for college students.

data analysis: Application of statistical methods to transform data into usable
information.

What to do with all the data that result from these varying measurements? Data
analysis applies statistical methods to transform data into usable information, sometimes
vast amounts of data. This derived information then informs conclusions regarding
the people, organizations, places, or whatever is the topic of interest. The emerging
field of data science, now a growth engine for jobs, demonstrates how essential data
analysis is becoming in our modern world.

data science: Application of a wide range of data analytic procedures to realize
insight and prediction.

There are many different forms of data analysis that focus on understanding the
values of a single variable and how different variables relate to each other. Yet a small
number of general categories describe the purpose of many analyses. You may not at
this moment understand what each of the following types of analyses does, but you
will find examples and explanations of these analyses throughout this book.


Data Analysis. General categories of data analysis


• Data visualizations (also referred to as computer graphics)
• Statistics
◦ Descriptive statistics that summarize characteristics of the sample data
◦ Inferential statistics that generalize results from the sample to the popula-
tion from which the sample was obtained
• Supervised machine learning, predictive models constructed from the relation-
ships among variables
◦ Relate variables that are observed directly or are simple transformations,
such as converting yards to meters
◦ Relate variables that are latent, not observed, their existence inferred
from patterns of relationships among the observed variables
• Causal inference to establish patterns of causality

This book demonstrates how to analyze data and interpret the results using the
computer software system R, enhanced with a set of data analysis procedures known as
lessR. Use this software to perform a specific analysis with a single, simple instruction.
Anyone with access to a computer can now easily analyze data using free software.

lessR: Analyze data with only simple instructions, usually the name of the
analysis and the variable(s) analyzed.

1.1.2 R with lessR

lessR Package
From the beginning of widely available software for data analysis in the 1970s until
the early 21st century, data analysis was generally done with expensive, proprietary
statistical software running on costly computers. The cost and availability of statistical
applications for data analysis become increasingly less relevant as computer capability
and the R system’s popularity increase. In terms of pure statistical power to analyze
data, R compares favorably to the most expensive commercial applications available,
delivering all analyses that nearly any analyst might desire. The price is precisely $0.00.
R is free for you on any computer that runs Windows, Macintosh, or Linux/Unix.

The bad news? R’s capabilities and price are great, but mastering a relatively difficult
system with a steep learning curve may require much time – time not spent analyzing
data or even hanging out with friends. R is a proper programming language that
provides enormous flexibility for processing and analyzing data. Sometimes you write
a little code, and sometimes much code to analyze your data. Standard R is mainly
for those who enjoy, or at least endure, reading manuals, programming, and then
debugging the resulting code. Get your code working right, and you have harnessed
the power. Or, you might find yourself staring at some cryptic error message that
you are clueless about how to resolve.

Fortunately, R is designed so that anyone can contribute additional functionality
beyond the hundreds of functions already downloaded with R. Your author developed
one such extension to R, the lessR package (Gerbing, 2021, 2022). Compared to R,
lessR requires much less code to perform basic data analysis. Only simple instructions,
more precisely, function calls, are needed. All statistical software, including R and
Excel, performs a particular data analysis by invoking a specific function. Each call
to a lessR function contains the minimum necessary information to proceed: The
name of the analysis, and the name of the variable(s) to be analyzed, sometimes
referencing additional options.

function: Procedure for a specific task such as a data analysis.
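
As a minimal sketch of what a function call looks like, the following statements apply the standard R function mean() to a small set of values. The variable name x and its values are invented for illustration. Each call supplies the name of the function, the data to analyze and, optionally, an additional option, here na.rm.

```r
# Hypothetical data values, invented for illustration (NA marks a missing value)
x <- c(52, 61, NA, 47, 58)

# A function call: the name of the function followed by the variable to analyze
m1 <- mean(x)                # the result is NA because one value is missing

# The same call with an additional option that drops missing values first
m2 <- mean(x, na.rm = TRUE)  # mean of 52, 61, 47, and 58

print(m1)
print(m2)
```

The lessR function calls shown in this chapter follow the same pattern of a function name, one or more variable names, and optional settings.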

lessR Analysis
Organize your data as a rectangular table. Each column contains the name of a
variable in the first row followed by the corresponding data values. Store your data
in one of many different formats, including Excel worksheet files, OpenDocument
Spreadsheet files (ODS), and standard text files. The same lessR function call for
reading data also reads data stored in formats unique to specific data analysis software
systems, including R and data from the commercial systems: SPSS, Stata, and SAS.

data table, Section 1.3.1, p. 13
Download R, Section 1.2.1, p. 5
install.packages() function, Section 1.2.5, p. 8

Download R and install the lessR package according to the instructions on the
following pages. Begin each R session by entering the following two instructions.

Input. To begin each data analysis session: access lessR, read data into R
library("lessR")
d <- Read("")
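
To make the rectangular layout concrete, the following sketch builds a small data table, writes it to a temporary text file with the variable names in the first row, and reads it back. The variable names echo those used in this chapter, but the data values and the object names d_example, path, and d_back are invented for illustration. The base R function read.csv() stands in for Read() only so that the sketch runs without lessR installed.

```r
# Hypothetical data table: one column per variable, one row per person
d_example <- data.frame(
  Years  = c(7, 2, 15),
  Dept   = c("Sales", "Admin", "Finance"),
  Salary = c(53000, 41000, 99000)
)

# Write the table to a temporary comma-separated text file;
# the first row of the file contains the variable names
path <- tempfile(fileext = ".csv")
write.csv(d_example, path, row.names = FALSE)

# Display the first line of the file, the variable names
header <- readLines(path, n = 1)
print(header)

# Read the file back into a data frame with base R
d_back <- read.csv(path)
print(d_back)
```

With lessR loaded, the same file would presumably be read by placing its pathname between the quotes in the call to Read().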

library() function, Section 1.1, p. 9
Read() function, Section 2.3.1, p. 25
Read from the web, Section 1.3.1, p. 14

The R function library() accesses the lessR functions stored in your R library,
created when you installed R. Next, transfer your data from a data file on your
computer or the web into a working R session. Enter the lessR function Read() with
the empty quotes, "", to browse your computer’s file system to locate your data file.
Or, between the two quotes when calling Read(), locate your data with a specific
pathname, web address (URL), or name of a lessR data file. When located, Read()
reads the data from that file into a data table within R, what R calls a data frame,
typically named d as in this example.

data frame: A data table within an R session, ready for analysis.

bar chart example, Section 4.1, p. 64
histogram example, Section 5.3, p. 85
scatterplot example, Section 10.1, p. 195

After the calls to the library() and Read() functions, data analysis begins. The
following examples of lessR instructions (function calls) are essential to almost every
data analysis endeavor. For the variables of interest in the lessR default data frame
named d, create a color-coordinated bar chart of one variable (Dept), a bar chart of
two variables (Dept and Gender), a histogram (of Salary), and a scatterplot of two
variables (Years employed with Salary).

Input. Histogram, bar chart, and scatter plot function calls
BarChart(Dept)
BarChart(Dept, by=Gender)
Histogram(Salary)
Plot(Years, Salary)

These analyses also compute and display each visualization’s relevant statistics, such
as cross-tabulation tables, inferential tests, and correlation coefficients.

Other core data analytic procedures – t-tests, analysis of variance and covariance,
regression analysis, and factor/item analysis – are also straightforward to perform
with lessR. You may not yet know the meaning of these analyses but the lessR
instructions to produce them are straightforward. Simplicity is the theme. Data
analysis should involve minimal effort learning how to code. Instead, focus on the
meaning and interpretation of the results. Learn data analysis, not computer coding.

Creating a histogram with the statistics directly from an Excel worksheet is much
faster, easier, and more comprehensive with lessR than with Excel. This conclusion
holds throughout the material presented in this book: Data analysis with lessR is
simultaneously more accessible and more comprehensive than analysis with Excel.
Moreover, “Excel worksheets exhibit a fundamental flaw, the confounding of the data
with the instructions to process that data” (Gerbing, 2021, p. 251). By separating the
instructions to process the data from the data itself, data analysis generally
becomes more straightforward, much easier to develop, and easier to debug.

If you are familiar with menu-based commercial systems such as SPSS, you recognize
the equivalents of SPSS procedures but without the menu pull-downs. Multiple
windows and menu options are replaced by a single reproducible, brief instruction.
Like Excel, SPSS costs money, whereas R is free and runs identically on all standard
computers. There is no longer any need to pay for data analytics software.

regression analysis, Chapter 11, p. 217, Chapter 12, p. 250, Chapter 13, p. 277

And no need for everyone to repeat the same programming to achieve a basic set of
standard, indispensable analyses. A comprehensive regression analysis with standard
R begins with about a dozen separate R statements and programming multiple lines of
R code to organize the results. Replace the many R function calls, as well as the extra
programming to organize the output, by a single call to the lessR Regression()
function. The lessR procedures, such as for regression, rely upon R’s statistical and
visualization capabilities while also providing internal programming to obtain and
organize the results for you. Let lessR do the extra programming for you.
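
To suggest what the separate statements in standard R look like, here is a partial sketch that uses only base R functions. The data are simulated, and the variable names, the simulated values, and the choice of steps are assumptions for illustration, not a listing from this book.

```r
# Simulated data for illustration only: Salary as a noisy linear function of Years
set.seed(1)
Years  <- 1:20
Salary <- 40000 + 3000 * Years + rnorm(20, sd = 5000)
d_sim  <- data.frame(Years, Salary)

fit     <- lm(Salary ~ Years, data = d_sim)  # estimate the model
est     <- coef(summary(fit))                # estimates, standard errors, t, p
ci      <- confint(fit)                      # confidence intervals of coefficients
aov_tab <- anova(fit)                        # analysis of variance table
r2      <- summary(fit)$r.squared            # fit of the model, R-squared
res     <- residuals(fit)                    # residuals
cd      <- cooks.distance(fit)               # influence of each row of data

# Prediction intervals for two new values of the predictor
new  <- data.frame(Years = c(5, 10))
pred <- predict(fit, newdata = new, interval = "prediction")

# Display each piece of output separately
print(est)
print(ci)
print(aov_tab)
print(r2)
print(head(sort(cd, decreasing = TRUE), 3))
print(pred)
```

Each statement returns one piece of the analysis, which the analyst must then assemble. With lessR loaded and these variables in the data frame d, a single call such as Regression(Salary ~ Years) is intended to replace this sequence.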

Two primary objectives underlie the lessR project to minimize the needed program-
ming to use R for data analysis.

◦ A data analysis procedure should produce needed text and visualization output
without additional instructions or information other than the name of the
analysis procedure and the relevant variable name or names.
◦ If changes to the default output are desired, such as choosing a new background
color for a visualization, then scan a list of the available options in the manual
to understand how to provide all the information needed to proceed. Instead of
lines of code, simply add one or more options to the instruction that generates
the analysis.

Let’s get started!

1.2 Prepare R for Analysis

Download and install R on your computer, or run in the cloud for free or minimal
cost with any device that runs a web browser such as a Chromebook or an iPad. The
choice is yours. R works the same on your computer or in the cloud.

the cloud: Computer servers, usually in locations apart from the user, that run
applications accessed via a web browser.

1.2.1 Download R

CRAN: World-wide network of servers and information for the R system.

The best way to learn R is to start using R, available on many Internet servers
around the world. These servers and the information on them comprise CRAN, the
Comprehensive R Archive Network. Obtain the latest version of R at:

Web Address. Download R


http://cloud.r-project.org

Select an operating system near the top of the resulting web page.

• Download R for Linux
• Download R for MacOS
• Download R for Windows

Windows: On the next web page, on the first line, click base. On the subsequent
page, on the first line again, click the Download R for Windows link that includes
the version number.

Mac: About 15 or so lines down the page, where x.y.z is the version number such as
4.2.2, click either the R-x.y.z-arm64.pkg link for the Apple M-series CPU version
or, further down the web page, the R-x.y.z.pkg link for the older Intel CPU version.
If you do not know the type of CPU in your Macintosh computer, on the Apple
menu at the top-left of the screen, choose the About this Mac option, then locate
the information for Chip.

Linux: On the subsequent web page, choose your distribution. Or, for a Debian
version of Linux, or Debian based versions such as Ubuntu and Mint, instead download
R from the usual software repository available with the Debian package system.

After downloading R, follow the instructions from the installer app. For each prompt
from the installer, choose the default. During the installation process the following
question may appear.
Would you like to use a personal library instead?
If asked, usually respond with a y, for yes, so that you have permission to access the
files created for and needed by R.

1.2.2 Download RStudio


RStudio: Feature-rich environment for running R.

A favored option runs R from within an app called RStudio because of the additional
features that RStudio provides. You can download now, or later, or never, but
RStudio provides compelling advantages. Obtain RStudio from the following link.

Web Address. Download RStudio for running R


https://posit.co/download/rstudio-desktop/

Within the RStudio environment you are running the same R app. As shown in
Figure 1.1, RStudio presents several windowpanes, all resizable to customize for a
specific analysis. If desired, change the RStudio default color theme. From the Tools
menu, choose Global Options... and then Appearance. Your author prefers the
iPlastic theme, but there are many choices.

Figure 1.1: Three default RStudio windowpanes and optional R Source pane for R script.

The primary windowpane is the R console, the same display available from running R
by itself. RStudio directs data visualizations into a second windowpane labeled Plots.
Depending on the chosen tab, this windowpane displays other information, such as your
file directory with the Files tab. A third windowpane displays your data from the
Environment tab or your history of entered R instructions from the History tab.

console: The window for entering R commands and for text output.

The first three windowpanes appear by default. A potential fourth windowpane,
shown in Figure 1.1, provides for text files of R code, ready for analysis. Request
this fourth windowpane within RStudio by creating a new R script file.

Input. Open a new R script file in RStudio
File menu –> New File –> R Script

An analysis of saved R instructions is reproducible. You can then save the R script file
for later access. You, or someone else in your organization, can then repeat or extend
the analysis without having to re-enter the R instructions.

reproducibility: Analyses can be re-run in the future to reproduce previously obtained
results.

By itself, or within RStudio, R processes instructions at the R console’s command
prompt, illustrated in Figure 1.1. RStudio enhances this process by running stored
instructions from the command prompt. Enter R instructions into the script window,
select one or more instructions, and press the Run button at the top-right of the
windowpane. RStudio will then copy the selected information to the command prompt
and run the instructions as if you had directly entered them into the console. Or,
click on the Compile Report button and send the input and output to an HTML file
for reading with a web browser or a Word document (a pdf option is also available,
but requires LaTeX software installed).

command prompt: The > symbol, which signals that R awaits an instruction.

Figure 1.2 shows the two function calls required to access lessR and retrieve the
data, as well as the single function call required to generate a bar chart of the variable
Dept from that data. Find the variable in the lessR data file called Employee, read
into the default d data table.

first two functions, Section 1.1.2, p. 3

Figure 1.2: Run an R data analysis from the RStudio Source windowpane.

The RStudio Source windowpane is a text editor that saves information as a standard
text file.¹ Editing R script files is not limited to RStudio. Any text editor such as the
popular vim editor (used by your author to edit R instructions and write text such as
this book) can edit text files and, therefore, R script files accessed with RStudio.

text editor: A simplified word processor for editing a text file.
text file: A file that consists only of letters, digits, and punctuation plus a few
control codes.

The Source windowpane provides for saving your R instructions over time to gradually
build a collection of instructions to perform a variety of analyses. Reproduce the
output of any one analysis or modify to obtain a related analysis. Saving your
instructions in a separate file allows you to examine the logic of your underlying
computations more straightforwardly than scattering those instructions over multiple
cells and perhaps multiple worksheets with an app such as Excel.

1.2.3 R in the Cloud


An important company in the R ecosystem, Posit (formerly RStudio, Inc.), offers a
free cloud account for running R at rstudio.cloud. However, the account is free
only for the first 25 hours per month of connect time and 1 hour of execution time.
The monthly accumulated time limit accrues against the time the computer requires
to perform the data analysis computations and the time a cloud project is open. Wait
to log into your account until you are ready to enter the instructions needed to do an
analysis, and then log out of your account when an analysis is complete. Restricting
the time you access the cloud account to the time entering commands and analysis
may allow completion of data analysis projects within the free time limits.

R and RStudio are pre-installed on a cloud account, so you only need to install
lessR. Start a new project by clicking on the New Project button for each analysis.
Running in the cloud, R will not read data files directly from your computer. Instead,
upload a data file to the cloud with the Upload button under the Files tab on the
bottom-right RStudio windowpane.

install lessR, Section 1.2.5, p. 8

¹ Word processors, such as MS Word or the free and MS Word compatible LibreOffice Write,
generally save their files in a non-text format with many hidden codes such as for formatting and
paging the document. They also tend to change aspects of the file, such as MS Word’s propensity to
change straight quotes to curly quotes, which do not work with R.

1.2.4 Start R
Windows and Macintosh users begin an R session the same as any application, of
which there are several possibilities. For example, if you chose to have the R icon
placed on your desktop during the installation process, double-click that icon to start
R. A new R session opens a window called the console. Or, if running within RStudio,
the R console opens in a windowpane, illustrated in Figure 1.1.

An R session is interactive. The last line of the information from the R console in
Figure 1.1 contains only a >, the R command prompt. Enter each R instruction in
response to this prompt. Press the Enter/Return key, and R immediately processes
that instruction. The entire session consists of sequential function calls, each entered
in response to another command prompt, followed by any subsequent R output.

Sometimes the entered function call is incomplete when Enter/Return is pressed,
such as missing the closing parenthesis. R responds with the continuation
prompt, +. Enter the missing information to continue as usual, or press ESC to cancel.

continuation prompt: The + sign, which indicates the entered instruction is incomplete.

A function call previously entered at the command line can be re-run or edited
without re-entering. One way to re-run previous instructions is to push the up-arrow
key, ↑, to retrieve the instructions. Or, click the History tab in the upper-right
RStudio windowpane. Then select one or more instructions and click on either the
To Source button to copy the instructions to the Source or script window, or click
on To Console to copy to the command prompt, ready to re-run.

1.2.5 Extend R

R organizes its many functions into groups of related functions called packages. The
initially installed version of R includes six different packages, including the stats
package, the graphics package, and the base package for various utilities. These
packages are installed as part of R on your computer system and are automatically
loaded into memory when an R session begins.

package: A set of related R functions.

Many more functions that considerably extend the functionality of R are available
from contributed packages, such as lessR, written by users in the R community.
To access the functions within a contributed package, download the package from
the CRAN servers. To download the lessR package, start up R and invoke the
install.packages() function at the R command prompt, >.

contributed package: An R package provided by the user community.
install.packages() function, R: Download a package from the R servers.

Input. Install lessR and related packages
install.packages("lessR")

The lessR functions rely upon R functions as well as functions from other contributed
packages. R downloads these additional packages with lessR.

Sometimes one or more of these dependent packages is in the process of being updated
by its developer, which means that the package is not yet in a form ready to run on
Windows or Macintosh computers, what is called a compiled version. When installing
lessR, R may prompt with the following question.

Do you want to install from sources the packages which need compilation?
(Yes/no/cancel)

Generally, answer no. Avoid compilation since your computer likely is not equipped
with the needed tools. By answering no you download the slightly older Windows or
Macintosh ready-to-go (binary) versions of the packages, where the compilation has
already been done for you.

Use the update.packages() function to update one or more packages.

update.packages() function, R: Update the versions of the installed R packages.

Input. Update all installed packages
update.packages()

RStudio also provides the option of updating from its Tools menu. Select the Check
for Package Updates... option.
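
As a minimal sketch of how the install step might be made repeatable, the following base R code checks whether a package is already present before downloading it. The helper name needs_install() is hypothetical, introduced only for this illustration, and is not part of R or lessR.

```r
# Hypothetical helper: TRUE when the named package is not yet installed.
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# The stats package ships with R, so it never needs installing.
needs_install("stats")

# Install lessR only when it is missing (requires an Internet connection).
if (needs_install("lessR")) {
  message("lessR is not installed; run install.packages(\"lessR\")")
  # install.packages("lessR")   # uncomment to download from CRAN
}
```

Placing such a check at the top of a saved R script allows the script to be shared with someone who may not yet have the package installed.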

1.2.6 Access lessR


After installation, begin each new R session with the library() function shown in
Listing 1.1 to access the over 60 lessR functions for data analysis, plus the included
data sets.

install R, p. 8
library() function, R: Activate the functions of the specified package.

Input. First entry for a new R session after lessR is installed
library("lessR")
lessR 4.2.5 feedback: gerbing@pdx.edu
--------------------------------------------------------------
> d <- Read("") Read text, Excel, SPSS, SAS, or R data file
d is default data frame, data= in analysis routines optional

Learn about reading, writing, and manipulating data, graphics,
testing means and proportions, regression, factor analysis,
customization, and descriptive statistics from pivot tables.
Enter: browseVignettes("lessR")

Listing 1.1: Access lessR with the R library() function.

To learn more about what to do next, get some help.

1.2.7 Get Help


R provides two ways to obtain help. A great way to learn more about R functions
is the R function browseVignettes(). A vignette provides explanations and
examples to illustrate various data analyses. To access a package’s vignettes, enter the
package name enclosed in quotes in the function call. Figure 1.3 applies the function
to lessR, which organizes the vignettes by theme: Data, Visualize, Models, Factor
Analysis, and Customize. Click on an HTML link to view the primary information,
the corresponding vignette displayed as a web page. Also, click on a source link to
view the input that created the HTML, or a code link to extract the R code from the
vignette.

browseVignettes() function, R: Present a list of documents that illustrate different
data analyses.

Input. Access the lessR vignettes


browseVignettes("lessR")

Figure 1.3: The lessR vignettes.

Another help system provides the details of a specific function. Use the R function
help() by referencing a specific function by its name. Or, use the abbreviation ?
followed by the function name. The output is the manual for the specified function.

help() function, R: Help to obtain the user manual for a specified function.

Input. Access the manual for a specified function


?BarChart

Each required manual follows a specific structure, though somewhat geekier than a
vignette. The resulting explanation includes a definition of all the parameter options
for the function, a more detailed discussion of how to use the function, and examples
of using the function. However, if the package offers vignettes, they generally provide
more detailed explanations with output.

The next step explains how to invoke the various R, lessR, and other functions.

1.2.8 R Functions for Analysis

Function Parameters

Functions do all the work in R. The values of a function’s parameters contain the
information provided to the function for analysis. Specify any parameter values
between the parentheses of the function call. If no parameter values are needed, still
include the parentheses after the function name but with no content, ().

parameters of a function: Information input into the calculations of the function.

In our bar chart example, the one parameter value passed to BarChart() is the
variable name, Dept. Because it is listed first in the function call, and is also the
position of the first parameter in the definition of the function found in its manual,
the specified value does not require the parameter name. The parameter name that
specifies the first variable to analyze with BarChart() is x, so the following two
function calls are equivalent.

Input. Do not need the parameter name for the variable analyzed if listed first
BarChart(Dept)
BarChart(x=Dept)

function manual, Section 1.2.7, p. 10

Many function parameters are defined with default values, values that are assumed
but can be overridden. For example, lessR functions such as BarChart() reference the
name of the relevant data frame, R’s name for the data table, with the data parameter.
Unlike R functions, lessR provides a default value for the input data frame, d, that
contains the variable(s) to analyze.

default value: Assumed value for a function’s parameter that can be explicitly changed.
data frame, Section 1.1.2, p. 3

Input. Access the default d data frame with or without the data parameter
BarChart(Dept)
BarChart(Dept, data=d)

data parameter: The name of the data frame that contains the variables for analysis,
with default name d.

You only need to invoke the data parameter when the input data frame is not d. If
all the parameters in a function call have default values, then no information needs
to be passed to the function, just empty parentheses in the function call. For any
parameter without a default value, you must specify a value for that parameter.
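
The same rules for positional values, named values, and defaults apply to any R function. As a small sketch in base R, the function describe() below is hypothetical, written only for this illustration. Its first parameter, x, has no default value, while digits and label do.

```r
# Hypothetical function: x has no default value, digits and label do.
describe <- function(x, digits = 2, label = "mean") {
  paste(label, "=", round(mean(x), digits))
}

y <- c(2.5, 3.25, 4)

# Positional and named forms of the first parameter are equivalent.
describe(y)
describe(x = y)

# Override one default by name; the other default stays in effect.
describe(y, digits = 0)

# formals() lists the parameters of a function in their defined order.
names(formals(describe))
```

Calling describe() with empty parentheses would fail because x has no default value, which mirrors the rule stated above.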

For the primary lessR visualization functions, lessR also provides an interactive
display. You point-and-click to choose the data file and select the variable(s) of
interest. lessR then displays the visualization and associated statistics. Point-and-
click to choose different parameter values, such as changing the color of the bars or
points. Not even a need to enter the simple lessR function calls!

Use the interact() function to initiate an interactive display. Pass no values to the
function to display a list of available displays, shown in Listing 1.2.

interact()
Run interact() with a one of the following app names, such as:
interact("BarChart")

Valid names (enclose in quotes):


"BarChart", "Histogram", "PieChart", "ScatterPlot", "Trellis"

Listing 1.2: Available interactive displays of parameter values for data visualization.

When finished, lessR writes the function calls for you to generate the same output,
complete with explanatory comments.

Function Types

Find at least three different types of functions within the R system. Utility functions
either provide information or set characteristics of the overall environment to facilitate
data processing. An example of a utility function is head(), which lists the variable
names and the first six lines of data from a specified data frame.

utility function: Inform or set characteristics of the data processing environment.
head() function, Section 1.3, p. 15

A data analysis function applies statistical methods to process the data to obtain the
requested analysis, the function’s output, as illustrated in Figure 1.4.

[Diagram: Input (data values for one or more variables) -> Function (to process data:
transform input into output) -> Output Possibilities (text: console; graphics: window
or file; object: e.g., data table)]

Figure 1.4: General procedure for functions that process data, either data analysis or data
modification.

An example of a data analysis function is BarChart(). Many lessR functions, such as
BarChart(), direct their resulting text output to the R console and any visualizations
to the plot window.

data analysis function: Procedure to access and analyze data.

Another way to process data is to create or modify the data in preparation for
subsequent data analysis. One example of a data modification function is Read(),
which reads data from an external data file and then usually directs the data into
an R data frame. Other examples of data modification are sorting the data values
or transforming them, such as taking their logarithms. The modified data can then
either be directed back to overwrite the original data table, d, or to a new data table,
leaving the original d unmodified.

data modification function: Procedure to create or modify data.
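
The three types of functions can be sketched with base R alone. The small data frame below is made up for this illustration and is not the employee data. The name d2 for the modified copy is also an assumption.

```r
# A small illustrative data frame (values are made up for this sketch).
d <- data.frame(
  Dept   = c("SALE", "ADMN", "SALE", "FINC"),
  Salary = c(94000, 54000, 111000, 57000)
)

# Utility function: report on the data without analyzing it.
head(d, n = 2)

# Data analysis function: compute a statistic from the data.
mean(d$Salary)

# Data modification: add a log-transformed variable and sort by Salary,
# saving the result to a new data frame so that d is left unmodified.
d2 <- d
d2$logSalary <- log(d2$Salary)
d2 <- d2[order(d2$Salary), ]
d2
```

Directing the result to d2 instead of d corresponds to the second choice described above, a new data table with the original preserved.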

1.2.9 Vectors

A common task when calling a data analysis function is to specify either a list of
variables or a list of constants, either numbers or character strings. Such a list is
called a vector. Suppose you wish to fill the bars of a bar chart with two alternating
colors, "darkred" and "darkblue". The fill parameter manually specifies colors for the
lessR data visualization functions that fill the interior of an object, bars or points.

vector: List of variables or constants defined as one entity.
fill parameter, lessR, Section 4.2.3, p. 66

Input. Bar chart with bars of alternating color for data in data frame d
BarChart(Dept, fill=c("darkred", "darkblue"))

Separate the individual values in a list with commas, embedding the entire set in the
combine c() function. This function delineates a list of values, a vector, from the rest
of the information that surrounds the vector. In the above example, the value for the
parameter fill is a vector of two values, separated from the rest of the information
in the function call by the c() function.

c() function, R: Combine a list of values to define a vector.

Another example is the list of the first eight integers specified with the c() function.

c(1,2,3,4,5,6,7,8)

If the items in the list are sequential integers, the : notation can be used instead.

1:8

: notation: Specify a sequential set of integers.

Or, combine the types of expressions. The c() function encloses the individual
components of the vector, with commas separating the components.

c(1:4,5,6,7:8)

Enter any of the three preceding expressions at the command prompt to obtain the
following output.

[1] 1 2 3 4 5 6 7 8

The [1] indicates that the first value displayed on that line is the first element of
the vector. This example has only one line of output.

When referring to character constants, all analysis apps such as R and Excel enclose
these constants within quotes. Here create a character vector of three letters.

c("c", "a", "t")

The concept of a vector applies not only to a sequence of numbers or characters as
illustrated here but also to lists of variables.

list of variables, Section 10.3.2, p. 206
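
The following sketch checks that the three ways of writing the first eight integers yield the same values. The object names a, b, ab, and word are assumptions chosen only for this illustration.

```r
# Three ways to write the first eight integers.
a  <- c(1, 2, 3, 4, 5, 6, 7, 8)
b  <- 1:8
ab <- c(1:4, 5, 6, 7:8)

# Element-by-element comparison: every position holds the same value.
all(a == b)
all(a == ab)

# length() reports the number of values in a vector.
length(b)

# A character vector, with each constant enclosed in quotes.
word <- c("c", "a", "t")
paste(word, collapse = "")
```

The comparison uses == on each element because R stores 1:8 as integers and c(1, 2, ...) as decimal numbers, so the values are equal even though the storage types differ.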

1.3 Data

Data analysis begins, naturally enough, with data. Organize the data for analysis into
a rectangular table, the form required by R and virtually every other data analysis
system such as Python or SPSS. This section presents two sets of data analyzed
throughout this book, and a discussion of how to create your own data table.

1.3.1 Data Example I: Employee Data


A company’s human resources department recorded the following for each employee:
Name, Years of employment, Gender, Department employed, annual Salary, Job
Satisfaction, and Health Plan. Consider a data file with measurements organized into
a table for 37 employees, available on the web.

http://lessRstats.com/data/employee.xlsx

data file: A file on a computer system that contains data, usually organized into a
table of rows and columns.

Figure 1.5 displays the first nine lines of the Excel version of this data file.

Figure 1.5: Variable names in the first row and the first eight rows of data from the employee
data table as stored in an Excel worksheet.

Organize the Data

Each row of a data table contains the data for one instance of the unit of analysis,
the object of study. In this example, the unit of analysis is a person, an employee at
a specific company. Other potential units of analysis include organizations, places,
events, or things in general.

unit of analysis: The class of people, organizations, things or places from which
measurements are obtained.

The worksheet, Excel or similar, stores either a variable name or a data value in a
cell, the intersection of a row and a column. A data value is the value of a variable
for a specific instance of the unit of the analysis, here a specific person. For example,
the person listed in the first row of data, the second row of the worksheet, is Darnell
Ritchie. Darnell’s annual Salary is $53,788.26. The data value for the variable Dept
is ADMN, which indicates employment in general administration. The data value
for the Years employed at the company is not available for the second person, James
Wu, so the corresponding cell is empty.

data value: The value of a single measurement or classification.

Data analysis depends upon the concept of a variable. A variable is a characteristic
of an object or event with different values for different people, organizations, etc.
Each variable name is concise, usually less than 10 or so characters, and serves as the
reference for the data values of the variable in any subsequent data analysis.

variable: Attribute that varies from unit to unit (e.g., different people).

The first row of the data table usually contains the variable names. The data values
for each variable are within the same column. However, the data values of the first
column in this particular data table are not for a variable but are instead unique ID
values that identify the employee for each row of data. For example, the first row of
data in Figure 1.5 consists of the data values for employee Darnell Ritchie.

The data values in a single row of the data table in Figure 1.5 are the data for a
single person. Unfortunately, there is no standard term for a row of data. Various
authors use terms such as case, observation, instance, example, and sample.

case (observation, instance): Data values for a single unit such as a person,
organization, thing or region.

The form of the data table in Figure 1.5 is the wide format. We explore another data
format in Chapter 8, but the wide format is more commonly encountered in data
analysis. All the data values for one case (e.g., person) are in one row, and all the
data values for a variable are in one column.

data table, wide: Data values for each case are in a row and for each variable in a
column.

What does a data table in Excel have to do with R? If R did not exist, the data file
would still be an Excel worksheet. The data table has nothing intrinsically to do with
R. However, R can analyze the data in the worksheet. Read the data into R using the
lessR function Read(), such as into the R data frame named d.

Read() function, lessR, Section 2.3, p. 25
data frame, Section 1.2.8, p. 11

Input. Read a data file from the web into the d data frame
d <- Read("http://lessRstats.com/data/employee.xlsx")

When read into the d data frame, the data continues to exist as an Excel worksheet
file but now also exists within the working R session.

View the Data


One reason for the popularity of worksheet apps such as Excel is that the data are
always visible. Working within the command-line environment of R makes it possible
never to view the data. However, not viewing your data is a mistake. Always be
aware of your data, what it looks like, the variable names, the number of cases (rows),
the number of variables (columns) of the data table, etc.

Viewing your data from within R is easy, accessed with simple function calls from the
command line, or, in RStudio, from the Environment tab in the upper-right
windowpane. Listing 1.3 shows how easy it is to view a useful excerpt of your data in
R by using the head() function. The name of the relevant data frame is the first
parameter value.

head() function, R: View the variable names and first several rows of the specified
data table within R.

head(d)

                 Years Gender Dept    Salary JobSat Plan Pre Post
Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92
Wu, James           NA      M SALE  94494.58    low    1  62   74
Downs, Deborah       7      W FINC  57139.90   high    2  90   86
Hoang, Binh         15      M SALE 111074.86    low    3  96   97
Jones, Alissa        5      W <NA>  53772.58   <NA>    1  65   62
Afshari, Anbar       6      W ADMN  69441.93   high    2 100  100

Listing 1.3: The R function head() displays the variable names and first six lines of data,
here for data frame d.

The head() function is your friend as you proceed with a data analysis. Access the
function regularly to see what you are modifying and analyzing. The corresponding
function tail() displays the variable names and the last six rows of the data table.

tail() function, R: View the variable names and last several rows of the specified
data table within R.

To see the entire data set, enter only the name of the data frame, here d. The R
function names() displays the variable names in the specified data table.

names() function, R: View the variable names of the specified data table within R.

To retrieve the dimensions of your data frame, R provides three functions. The
function dim() displays the number of rows and then the number of columns. Or,
obtain each separately with nrow() and ncol(). Listing 1.4 illustrates these functions
applied to the data frame d. Specify the data frame for each function call.

> dim(d)
[1] 37 8
> nrow(d)
[1] 37
> ncol(d)
[1] 8

Listing 1.4: Three R functions to display the dimensions of the data frame, here d.
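
To try these functions without downloading the data file, one option is to rebuild a small excerpt by hand. The sketch below recreates the six rows of Listing 1.3 for four of the eight variables. The data frame name d6 is an assumption, chosen so that the excerpt does not overwrite d.

```r
# Rebuild the six rows shown in Listing 1.3 (four of the eight variables).
d6 <- data.frame(
  Years  = c(7, NA, 7, 15, 5, 6),
  Gender = c("M", "M", "W", "M", "W", "W"),
  Dept   = c("ADMN", "SALE", "FINC", "SALE", NA, "ADMN"),
  Salary = c(53788.26, 94494.58, 57139.90, 111074.86, 53772.58, 69441.93),
  row.names = c("Ritchie, Darnell", "Wu, James", "Downs, Deborah",
                "Hoang, Binh", "Jones, Alissa", "Afshari, Anbar")
)

head(d6, n = 3)      # variable names and first three rows
tail(d6, n = 2)      # variable names and last two rows
names(d6)            # variable names only
dim(d6)              # number of rows, then number of columns

# Count the missing data values, NA, for each variable.
colSums(is.na(d6))
```

The count of missing values is a base R addition to the functions discussed in the text, included because the excerpt contains missing values for both Years and Dept.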

Always know your data when working in R or any other data analysis system. What
you thought was there is not always there. You cannot avoid seeing your data with a
worksheet, but only because worksheets mix up the instructions to process the data
with the data. R overcomes this worksheet deficiency while simultaneously offering
simple ways to show your data, allowing you to reap the benefits of decoupling the
data from the instructions for processing the data.

1.3.2 Data Example II: Machiavellianism


The data values described in this section are the responses to an attitude survey
from Hunter, Gerbing, and Boster (1982). The topic is Machiavellianism, an essential
component of the Dark Triad (Paulhus & Williams, 2002) personality configuration.
Christie and Geis (1970) introduced the concept of Machiavellianism and its mea-
surement to psychological researchers. Their Mach IV scale, a set of 20 items for
assessing endorsement of Machiavellian beliefs, appears in Table 1.1.

Mach IV Item
1. Never tell anyone the real reason you did something unless it is useful to do so.
2. The best way to handle people is to tell them what they want to hear.
3. One should take action only when sure it is morally right.
4. Most people are basically good and kind.
5. It is safest to assume that all people have a vicious streak and it will come out
when they are given a chance.
6. Honesty is the best policy in all cases.
7. There is no excuse for lying to someone else.
8. Generally speaking, people won’t work hard unless they’re forced to do so.
9. All in all, it is better to be humble and honest than to be important and dishonest.
10. When you ask someone to do something for you, it is best to give the real reasons
for wanting it rather than giving reasons which carry more weight.
11. Most people who get ahead in the world lead clean, moral lives.
12. Anyone who completely trusts anyone else is asking for trouble.
13. The biggest difference between most criminals and other people is that the criminals
are stupid enough to get caught.
14. Most people are brave.
15. It is wise to flatter important people.
16. It is possible to be good in all respects.
17. Barnum was wrong when he said that there’s a sucker born every minute.
18. It is hard to get ahead without cutting corners here and there.
19. People suffering from incurable diseases should have the choice of being put
painlessly to death.
20. Most people forget more easily the death of a parent than the loss of their property.

Table 1.1: The Christie and Geis (1970) Mach IV scale.


To understand Machiavellianism, consider two different people with two radically
different perspectives on life and human nature. One person believes in the inherent
goodness of people, in living an ethical life, treating others with respect, trust, and
honesty. The other person believes that a primary goal of life is to achieve as much
power and material possessions as possible, by any means possible, that the “end
justifies the means”, to do whatever it takes. To “get ahead” may include lying and
cheating. This second person tends to believe that other people are like themselves
or are naive and gullible, almost willing subjects for their manipulation.

Machiavellianism: Beliefs and values that endorse the cynical manipulation of others.

This second perspective aligns with the writings of Niccolo Machiavelli from the
1500s, who wrote extensively on the need for political leaders to do whatever is
necessary to defend their political power and success, even if their methods are morally
contemptible. Being successful for good or bad requires cunning sophistication and
tactics to achieve the desired goal. His most famous writing, The Prince (1902/1513),
has become a manual of sorts for those seeking to implement Machiavelli’s advice.
The concept applies to both political power, in terms of government and organizations,
and personal power, in terms of interpersonal relationships.

As with thousands of other such scales devised by psychologists and social scientists,
assess each item on a Likert scale. The respondent considers the extent of agreement
for each item along a continuum of Disagreement/Agreement, assessed by a small
number of categories that vary from Disagree to Agree, usually 5, 6, or 7 categories.
An odd number of categories allows for a neutral point in the middle, separating the
Disagreement side from the Agreement side. In contrast, an even number forces the
respondent to choose one side or the other, even if only slightly favored. The Mach IV
data were collected with each of the 20 items responded to on the following 6-point
Likert scale.

Likert scale: Measurement of usually 4 to 7 scale points along the Disagree-Agree
continuum.

Strongly Disagree   Disagree   Slightly Disagree   Slightly Agree   Agree   Strongly Agree

Hunter et al. (1982) administered the Mach IV items to college students as part of
a longer attitude survey. All items on the survey were randomized for presentation
on the resulting questionnaire. Measurements were collected for 351 respondents.
The responses to each item on the 6-pt scale were numerically coded. A Strongly
Disagree was coded as a 0, stepping through the integers for each successive category
until reaching Strongly Agree, coded as a 5.
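
The coding scheme just described can be sketched in base R. This is only an illustration: the four responses are hypothetical, and the object names likert_levels, responses, and codes are assumptions, not names used by Hunter et al. or by lessR.

```r
# The six response categories, ordered from disagreement to agreement
likert_levels <- c("Strongly Disagree", "Disagree", "Slightly Disagree",
                   "Slightly Agree", "Agree", "Strongly Agree")

# Hypothetical responses of four people to one item
responses <- c("Agree", "Strongly Disagree", "Slightly Agree", "Strongly Agree")

# match() returns the 1-based position of each response in likert_levels;
# subtract 1 so that the codes run from 0 to 5
codes <- match(responses, likert_levels) - 1L
print(codes)
```

Strongly Disagree maps to 0 and Strongly Agree to 5, with the intermediate categories stepping through the integers in between.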

These data provide the basis for showing how to read and analyze survey data in
lessR, referenced throughout this book. Find the data file at the following location
on the web.

http://lessRstats.com/data/Mach4.fwd

Listing 1.5 shows the first six rows of the data file. How to read the data into R is
covered in the next chapter.

read fixed width data, Section 2.4.3, p. 33

0100004150541540000401324
0127001440330440111244310
0134121054405341400202401
0222105240444520001115440
0264123230222533101312320
0282022332323141312223321

Listing 1.5: First six rows of data for the 351 row Mach IV data file.

For each respondent, a four-digit ID number occupies the first four columns of each
row. The data values are sorted by the values of this ID field. The 5th column is
Gender, encoded with 0 for Man and 1 for Woman. The last 20 columns are the
responses to the 20 Mach IV items, each response a single digit ranging from 0 to 5.
Accordingly, there are 25 columns of data.
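
To make the column layout concrete, the following sketch parses the six rows of Listing 1.5 with the base R function read.fwf(). It is not the lessR approach covered in the next chapter. The column names ID, Gender, and m01 to m20 are assumptions chosen for illustration.

```r
# First six rows of the Mach IV file, copied from Listing 1.5
rows <- c("0100004150541540000401324",
          "0127001440330440111244310",
          "0134121054405341400202401",
          "0222105240444520001115440",
          "0264123230222533101312320",
          "0282022332323141312223321")

# Write the rows to a temporary file to stand in for the downloaded data file
tmp <- tempfile(fileext = ".fwd")
writeLines(rows, tmp)

# Column names are illustrative assumptions: ID, Gender, then m01 to m20
item_names <- sprintf("m%02d", 1:20)

# Widths: 4 columns for the ID, 1 for Gender, 1 for each of the 20 items.
# Reading ID as character keeps its leading zeros.
mach <- read.fwf(tmp,
                 widths = c(4, 1, rep(1, 20)),
                 col.names = c("ID", "Gender", item_names),
                 colClasses = c("character", rep("integer", 21)))

print(mach[, 1:6])
```

The 25 columns of characters in the file become 22 variables: one ID, one Gender, and 20 item responses.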

The Mach IV fixed-width data file is available on the web.



http://lessRstats.com/data/Mach4.fwd

The Mach IV data set is also included as part of lessR with the name Mach4, ready
for analysis. The internal version does not include the four-digit ID number.

1.3.3 Create a Data File

Worksheet Apps
Data values for analysis are usually not directly entered into the data analysis
application, such as R. Data typically originate instead from some other application or
source, perhaps a file saved in the form of a worksheet. Typical worksheet applications
that open and save worksheet files in the Excel format, xlsx, include:
a. Microsoft Excel: Costs money and runs on Windows, Macintosh, and Android
and iOS devices
b. LibreOffice Calc (libreoffice.org): Free of cost, open source, runs on Windows,
Macintosh, and Linux
c. Apple Numbers: Free of cost, runs only on Apple devices, Macintosh and iOS
The lessR function Read() reads data from a worksheet in the Excel or OpenDocu-
ment Spreadsheet (ODS) format, so all three of the above choices of a worksheet app
work equally well to store data for analysis in R. All three applications save in Excel
format and Excel and LibreOffice can save in ODS format.

Perhaps your IT department provided the data in worksheet form, or some app
collected the data such as from an online survey. Or, maybe you entered your data
into a worksheet. To enter data directly into a worksheet, open the corresponding
worksheet app, then enter the variable names across the cells in the first row, as
in Figure 1.5. Enter the corresponding data row by row, aligning each column to
contain the data values for a single variable.

The worksheet is a convenient container that matches the tabular structure of data. A
worksheet wonderfully organizes the data into the proper format of rows and columns,
providing an excellent means to enter and view the data. Do not underestimate the
importance of the phrase “wonderfully organizes”. Typically, much of the work for
data analysis consists of organizing and cleaning the data for subsequent analysis.

A worksheet application and R are complementary tools for data analysis. With the
lessR functions Read() and Write(), easily exchange data stored in a worksheet
with R. For example, it is generally much faster and less work, with more flexibility and
more extensive results, to read Excel worksheet data into R to compute a histogram
rather than doing the histogram from within Excel.

Read() function: Section 2.3.1, p. 25
Write function, Section 2.6.2, p. 37
histogram, Section 5.2, p. 83

Data Validation
The data values within each column of a data table should all share the same format
and the values should lie within an acceptable range. For example, the data values
for the Salary column in Figure 1.5 are all positive numbers stored in the currency
format. It is not meaningful to include a value such as −3000.82 for a Salary, or
values of abc, or >100,000. Nor is it meaningful to represent Gender with multiple
encodings such as "W", "w", "Woman", and "woman". Instead, consistently code each
level of a categorical variable with only one encoding.
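
If inconsistent encodings have already been entered, one possible cleanup in base R is a lookup table that maps every known variant to a single code. This is a sketch under stated assumptions: the data values are hypothetical, and the names gender_raw and recode_map are invented for illustration.

```r
# Hypothetical Gender column with inconsistent encodings of the same level
gender_raw <- c("W", "w", "Woman", "woman", "M", "Man")

# Map every known variant, in lower case, to a single code.
# A value that is not in the map becomes NA, which flags it for review.
recode_map <- c(w = "W", woman = "W", m = "M", man = "M")
gender <- unname(recode_map[tolower(gender_raw)])

print(table(gender, useNA = "ifany"))
```

Converting to lower case first halves the number of variants the map has to list.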

In data analytics, messy, incorrectly recorded data such as inconsistently coded data
values are all too prevalent. Someone has to clean and better organize such data
before analysis can begin. Data validation is a valuable tool for improving the quality
of the initial organization of the data. Activate the data validation procedure when
entering data values into a spreadsheet to ensure that the data values in each column
are formatted consistently.

data validation: Verify that all data values within the same column have the same format and
conform to the same specifications.

Data validation can, for example, verify that all data values in a column are between
0 and 100 inclusive, that all zip codes have five characters, and that all names have
only alphabetical characters. If a data value entered into a worksheet does not meet
the specified valid attributes, it is rejected and cannot be inserted into the relevant
cell.
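
The same three rules can also be checked after the fact in R. The sketch below uses base R only and hypothetical data values. Unlike worksheet data validation, it does not reject a value on entry; it only flags the values that fail a rule.

```r
# Hypothetical data values, some of which violate the rules
score <- c(87, 100, 104, -2)
zip   <- c("97201", "9720", "97201-1234", "10001")
name  <- c("Garcia", "O1sen", "Smith", "Lee")

# Rule 1: values between 0 and 100 inclusive
score_ok <- score >= 0 & score <= 100

# Rule 2: zip codes with exactly five characters
zip_ok <- nchar(zip) == 5

# Rule 3: names with only alphabetical characters
name_ok <- grepl("^[A-Za-z]+$", name)

print(data.frame(score, score_ok, zip, zip_ok, name, name_ok))
```

Each logical vector marks with FALSE the rows that need attention before analysis begins.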

To activate data validation in Excel, go to the Data ribbon, choose the Validate
pull-down menu, and then the Data Validation... option. Select Circle Invalid
Data as shown in Figure 1.6a. The resulting dialog box, illustrated in Figure 1.6b,
provides several options for specifying the specific type of data that Excel can accept
into the column of designated cells.

Figure 1.6: (a) Choose the data validation option or circle invalid data, (b) Excel data
validation options to specify the acceptable data format in the designated cells.

What if an erroneous data value was entered into a cell before data validation was
used? Excel can flag invalid numbers by creating a red circle around the cells in
question. Although R data analysis has far more possibilities than Excel data analysis,
Excel or similar spreadsheets are ideal for storing small to medium-sized data files.
R and worksheets work well together. Data can be cleaned in R, but it can also
be cleaned in a worksheet with data validation, then re-saved and read into R for
analysis.

In LibreOffice Calc, activate data validation by selecting the relevant column of data
values. From the Data menu, choose the Validity... option. Choose the resulting
Criteria tab and specify the desired property of valid data values in the specified
cell range.

1.4 Analysis Problems


1. Access. Get and access R and lessR.
a. Download the latest version of R for your computer.
b. Start an R session and download lessR.
c. Verify that lessR is downloaded and properly working by displaying the infor-
mation lessR writes to the console when successfully loaded.

2. Vectors. Create a vector of . . .
a. Consecutive integers from 10 to 16.
b. The numbers -5, 0, 5.
c. The letters d, o, and g.

3. Data. Consider the data in Figure 1.7, randomly selected from a data file of the
body measurements of thousands of motorcyclists.

Figure 1.7: Gender, Height and Weight of eight motorcyclists.

a. Enter this data table into a worksheet.
b. Indicate any missing data.
c. List each variable and classify each as continuous or categorical. If categorical,
state if nominal or ordinal and specify the categories.
d. Read the data into R. Display the output.
e. Display the entire contents of the data frame into which you read the data.
f. How does the way in which the data values for each variable are encoded within
the computer match your assignment as continuous or categorical? Explain
why the computer encoding is consistent with the continuous and categorical
distinction.
g. Which value in R is encoded as NA? Why?

Chapter 2

Read and Write Data

2.1 Quick Start

How to access lessR functions was addressed in Section 1.2.5. After R and lessR are
installed, load the package into memory with library("lessR") each time R is run.

Store data in a rectangular data table with variables in columns and people or
whatever the unit of analysis in rows. It is usual, though not necessary, to place the
variable names in the first row of the data file. Store the data in various formats,
including Excel and OpenDocument Spreadsheet formats, plain text, and R and SPSS
native formats.

data table, Section 1.3.1, p. 13

Read the data file into an R data table, called a data frame, to analyze the data.
Usually read into a data frame named d, the default data name for the lessR data
analysis functions. One call to the lessR function Read() reads the data from one
of many formats and then displays relevant information regarding the data. In the
call to Read(), enclose the reference to the data’s location within quotes, "". If the
quotes are empty, Read() has you browse your file directory to locate the data file.

data frame, Section 2.3, p. 25

d <- Read("")

If the data file exists on the web, enclose the full web address (URL) within the
quotes, including the http://, as the first value passed to Read(). Or, include the
path name, the file location and name, within the quotes.

d <- Read("http://lessRstats.com/data/employee.csv")

Optionally, include row names by identifying a column of the data file that consists
of unique row IDs with the row_names parameter, such as the following if the IDs
are in the first column.

row_names parameter, Section 2.3.4, p. 29

d <- Read("", row_names=1)
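
To see what row names do, here is a sketch with the base R function read.csv(), whose row.names argument plays a comparable role. It is offered as an analogy, not as a description of how Read() works internally. The three data rows are hypothetical.

```r
# Hypothetical csv text with unique IDs in the first column
csv_text <- "ID,Gender,Salary
A17,W,53788
B02,M,94494
C33,W,51036"

# row.names = 1 uses the first column as the row names of the data frame,
# so ID is no longer one of the variables
d <- read.csv(text = csv_text, row.names = 1)

print(d)
print(d["B02", "Salary"])   # look up a data value by its row name
```

With row names in place, a row can be referenced by its ID instead of its position in the data frame.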

By default, missing data values may be literally missing, that is, no data value is
present for a specific cell in the data table. Or, if a missing value is identified by one
or more codes, such as -99 or "XX", invoke the missing parameter.

missing parameter, Section 2.3.3, p. 27

d <- Read("", missing=c(-99,"XX"))
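
The effect of declaring missing-value codes can be sketched with the base R function read.csv() and its na.strings argument. This is a base R analogy with hypothetical data, not the lessR implementation.

```r
# Hypothetical csv text: -99 and XX are missing-value codes, and one cell is empty
csv_text <- "Name,Years,Dept
Ana,7,ADMN
Ben,-99,SALE
Cy,12,XX
Di,,FINC"

# na.strings lists every code that marks a missing value; "" covers empty cells
d <- read.csv(text = csv_text, na.strings = c("-99", "XX", ""))

print(d)
print(colSums(is.na(d)))   # number of missing values for each variable
```

After reading, every declared code appears as NA, the R missing-value indicator, so Years remains a numeric variable.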

If you use a comma as the decimal separator instead of a period and a semicolon
instead of a comma for the value separator, invoke the Read2() function to read a
csv data file.

Read2() function, Section 2.4.4, p. 33

d <- Read2("")
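
Base R provides a comparable function, read.csv2(), for this convention. The sketch below uses it with hypothetical data to show what such a file looks like and how it is read. It is an analogy to Read2(), not a description of it.

```r
# Hypothetical csv text: semicolons separate the values, commas mark the decimals
csv_text <- "Name;Salary;Years
Ana;53788,26;7
Ben;94494,58;12"

# read.csv2() defaults to sep = ";" and dec = ","
d <- read.csv2(text = csv_text)

print(d)
print(mean(d$Salary))
```

Because the decimal comma is recognized, Salary is read as a numeric variable and not as text.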

Read variable labels from a separate text or Excel file to provide more informative
output than shorter variable names. Read the variable labels into the l data frame,
the required name as of this writing.

variable labels, Section 2.5.2, p. 34

l <- Read("", var_labels=TRUE)

Variable labels are optional but enhance the interpretability of the output.

2.2 Types of Variables

Data analysis revolves around the concept of a variable. Data analysis is the analysis
of the values of one or more variables, done today using the computer to perform
the computations. Yet people were doing data analysis well before computers were
invented. The meaning of a variable exists apart from the computer. Accordingly, we
define variables at two different levels: conceptual and computational. We need to
understand the meaning of our variables, and we also need to know how the data
values for a variable are represented digitally in computer storage. This distinction
applies to all data analysis software, including R, Excel, Python, SPSS, and others.

2.2.1 Variables as a Concept


Consider the conceptual meaning of variables. Data analysis distinguishes between two
types of variables, categorical and continuous. Variables are analyzed in different ways
depending on their type, so it is important to know which variables are continuous
and which are categorical before beginning any research. These variable types can
then be subdivided further, as described below.

Continuous Variables
continuous variable: A variable with numerical values.

The values for continuous variables are ordered along a quantitative continuum, the
abstraction of the infinitely dense real number line. Choose any two values and find
that an unlimited number of numeric values lie between them. Examples of continuous
variables for a person are Age, Salary, and extent of Agreement with an opinion
about some political issue; for a car, MPG and Weight; and for a light bulb, Mean
Number of Hours until Failure and Electrical Consumption per Hour (kilowatt hours).
A continuous variable is sometimes called a quantitative variable.

Distinguish between a continuous variable’s actual values and the data values that
emerge from measuring those values. Measurement categorizes data values into
specific groups. The value of the variable as it exists always differs from the value of
its measurement, the data value. Nothing, for example, weighs exactly 2 pounds, 2.01
pounds, or even 2.0000000001 pounds. The real weight may theoretically be stated as
an indefinitely large number of decimal digits. In contrast, indicate a measurement to
a specific level of precision, such as, for weight, to the nearest pound, ounce, or gram.
Measurement groups all similar weights together, approximating the true weight to
the nearest pound or whatever.
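
A short base R sketch illustrates how measurement groups similar values together. The weights are hypothetical, chosen only to show the rounding.

```r
# A hypothetical "true" weight in pounds, written here to seven decimal digits
true_weight <- 2.0437191

# Measurement records the value only to a chosen level of precision
nearest_pound <- round(true_weight)
nearest_ounce <- round(true_weight * 16) / 16   # 16 ounces per pound

print(c(pound = nearest_pound, ounce = nearest_ounce))

# Many different true weights collapse to the same measured value
weights <- c(1.6, 1.93, 2.0437191, 2.4)
print(round(weights))
```

All four hypothetical weights are recorded as 2 pounds when measured to the nearest pound.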

ratio data: Numerical scale with a fixed zero point and equal intervals.

Interpret the data values measured on a numeric scale for a continuous variable
according to one of two types. Ratio data follow a numeric scale with the usual
properties that are assigned to numbers. There is a fixed zero point and values on
either side of zero scale proportionally. In particular, two different values can be
compared by their ratios: 20 is twice as much as 10. Equal intervals of measurement
separate values that are an equal distance from each other. For example, the distance
between 21 and 22 represents the same underlying difference as the distance between
22 and 23.

interval data: Numerical scale without a fixed zero point but equal intervals.

A weaker numerical scale applies to interval data, which maintains the equal interval
property of ratio data, but does not have a fixed, natural zero point. The classic
example of two alternative interval scales compares Fahrenheit and Celsius
temperatures. Each has a different value of zero in terms of the actual magnitude of the
temperature. Because the value of zero is arbitrary in either scale, ratio comparisons
are not valid. For example, 20°F is not twice as warm as 10°F.
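
The arithmetic behind this claim can be checked in base R. The sketch converts the same two temperatures to Celsius and, as an addition not in the text above, to Kelvin, a temperature scale that does have a natural zero point. The function names f_to_c and f_to_k are assumptions for illustration.

```r
f_to_c <- function(f) (f - 32) * 5 / 9     # Fahrenheit to Celsius
f_to_k <- function(f) f_to_c(f) + 273.15   # Fahrenheit to Kelvin

f <- c(10, 20)

# Ratio of the two readings on each scale
ratio_f <- f[2] / f[1]
ratio_c <- f_to_c(f[2]) / f_to_c(f[1])
ratio_k <- f_to_k(f[2]) / f_to_k(f[1])
print(c(Fahrenheit = ratio_f, Celsius = ratio_c, Kelvin = ratio_k))

# Equal intervals, by contrast, remain equal after conversion
print(diff(f_to_c(c(10, 20, 30))))
```

The ratio of the same two temperatures changes with the scale, so the ratio of 2 on the Fahrenheit scale says nothing about the underlying warmth. The equal 10-degree Fahrenheit steps remain equal steps in Celsius.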

Categorical Variables

categorical variable: A variable with a relatively small number of unique values, called
categories or levels.
levels: Values of a categorical variable.

The primary type of variable other than continuous is the categorical variable. The
values of a categorical variable form a relatively small number of categories called levels.
Each level represents a distinct group. For example, the values of the categorical
variable Gender define groups of Men, Women, and Other. Other categorical variables
are Cola Preference, State of Residence, or Football Jersey Number. Yes, the number
on the jersey consists of numeric digits, but those digits are labels, not subject
to arithmetic operations such as computing an average. A categorical variable is
sometimes also called a qualitative variable or a grouping variable.

labeling: The process of classifying an observation into a group described by a label.

The values of categorical variables are labeled rather than measured. We do not
measure the state of the USA in which a person resides, but rather assign a person to
that state based on self-report or an examination of public records. The classification
into a group assigns a label, such as Oregon or Texas.

nominal data: Data values grouped into unordered categories.

One type of categorical data are classifications into discrete, unordered categories. The
resulting data values are called nominal data. Data for Gender, State of Residence,
and Phone Manufacturer are examples of nominal data.
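
As a brief base R sketch with hypothetical data values, nominal data support counting the members of each group, while labels that happen to be digits, such as jersey numbers, are safest stored as text.

```r
# Hypothetical State of Residence values for six people
state <- c("Oregon", "Texas", "Oregon", "Ohio", "Texas", "Oregon")

# Counting the members of each group is meaningful for nominal data
counts <- table(state)
print(counts)

# Jersey numbers are digits used as labels; storing them as character strings
# keeps arithmetic such as mean() from being applied by accident
jersey <- as.character(c(12, 80, 7))
print(jersey)
```

The table lists the categories alphabetically only for display. The categories themselves have no inherent order.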

Or, the measured value of a continuous variable is so imprecise that, instead of a
numerical scale, only a few categories exist in which the measured values can be
placed. Suppose that persons admitted to the emergency room are swiftly placed into
one of only three severity categories: mild, moderate, or severe. This simple rating
scale recognizes that some injuries are more severe than others, but the severity is
classified into one of only three categories. The underlying variable for Injury Severity
is continuous. This underlying progression of severity is assumed, but equal
intervals of severity separating the levels are not assumed. The rater’s interpretation
of Moderate Severity of Injury may be closer to Mild Severity than Severe Severity of Injury.

ordinal data: Ordered categories, rankings.

Data values grouped into ordered non-numeric categories, rankings, are ordinal data.
As another example, suppose the top three sprinters are ranked in order of finish
in the 100 meter dash: 1st, 2nd, and 3rd. The finish times represent a continuous
variable, but simply ranking contestants by order of their finish does not convey if
the race was extremely close or if the winner finished well ahead of their nearest
competitor.
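
In base R, one way to represent ordinal data is an ordered factor. The sketch below uses the three severity categories from the emergency room example with hypothetical ratings for five patients.

```r
# Hypothetical severity ratings for five emergency room patients
severity <- c("moderate", "mild", "severe", "mild", "moderate")

# ordered = TRUE records the order mild < moderate < severe
sev <- factor(severity, levels = c("mild", "moderate", "severe"), ordered = TRUE)

print(sev)
print(sev >= "moderate")   # order comparisons are defined
print(max(sev))            # the most severe rating observed
```

Comparisons of order are available, but R does not compute a mean of an ordered factor, consistent with the absence of equal intervals between the levels.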

2.2.2 Variables in the Computer


The data analyst has a conceptual understanding of how a variable’s data values are
structured. Are they continuous or categorical? Are they numeric or non-numeric?
What is their valid response range? The data values for the variables are analyzed on
the computer, so how the data values are conceptually defined should align with how
the computer stores them, such as within an R data frame.

After reading the data values into any data analysis app, before data analysis begins,
verify that the data were read correctly and represented correctly in the resulting
data frame. Many things can go wrong. Perhaps errors occurred as the data values
were entered into the data file. Maybe the data values were not correctly read into
the data analysis app, such as R. Perhaps there is too much missing data to permit
meaningful analysis.

data storage type: How the data values of a variable are physically stored in the computer.

The data storage type is the computer’s digital representation of a data value in its
memory. The storage type should match the conceptual definition of the variable. The
common R data storage types for numeric variables are type integer, for numbers
without decimal digits, and double, for numbers with decimal digits, with each double
value represented in computer memory with a long storage unit called double precision.
Dates and times can also be directly represented.¹

Date variables, Section 5.4.2, p. 94
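
The storage type of any R object can be inspected with the base R function typeof(). The values below are hypothetical.

```r
years  <- c(7L, 12L, 3L)          # the L suffix requests type integer
salary <- c(53788.26, 94494.58)   # numbers with decimal digits are type double
hired  <- as.Date("2021-03-15")   # a calendar date

print(typeof(years))    # "integer"
print(typeof(salary))   # "double"
print(class(hired))     # "Date"

# A number typed without the L suffix is stored as double, even if whole
print(typeof(7))        # "double"
```

Checking the storage type after reading data is a quick way to confirm that a variable was read as intended.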

Different data storage types may be used for categorical variables, both in the data
file that holds the data and in the data frame that R uses to store the data during a
data processing session. For example, the data values of a categorical variable can
be integers with a different integer assigned to each level, such as 0 for Man, 1 for
Woman, and 2 for Other. Or, the values could be stored as non-numeric characters,
what R calls type character, such as M for Man, W for Woman, and O for Other.
factor: R data storage type for categorical variables.

Regardless of type integer or character as read into an R data frame, transform
a categorical variable to a variable of type factor before analysis begins. The
transformation provides information beyond that contained in the data, such as
meaningfully ordering the levels of the categorical variable. Data of type factor are
internally stored as integers but displayed as descriptive labels.

factor function, Section 3.2, p. 43
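
A minimal sketch of this transformation with the base R function factor(), using the integer coding for Gender described above and five hypothetical data values:

```r
# Gender as read from a data file, coded 0 for Man, 1 for Woman, 2 for Other
gender_code <- c(0L, 1L, 1L, 2L, 0L)

# levels gives the codes found in the data, labels the text to display
gender <- factor(gender_code, levels = 0:2, labels = c("Man", "Woman", "Other"))

print(gender)               # displayed as descriptive labels
print(as.integer(gender))   # internal integer codes, which start at 1
print(levels(gender))
```

The internal codes of a factor begin at 1 regardless of the original coding, so the original 0, 1, 2 become 1, 2, 3 in storage.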

¹ Another data type sometimes encountered is logical, a variable with only two possible values,
TRUE and FALSE. When reading data, however, this data type is not typically encountered. The two
data values for a logical variable are usually encoded with character codes or the integers 0 and 1
instead of TRUE and FALSE.
SWEET CARROTS. (ENTREMETS.)

Boil quite tender some fine highly-flavoured carrots, press the
water from them, and rub them through the back of a fine hair-sieve;
put them into a clean saucepan or stewpan, and dry them thoroughly
over a gentle fire; then add a slice of fresh butter, and when this is
dissolved and well mixed with them, strew in a dessertspoonful or
more of powdered sugar, and a little salt; next, stir in by degrees
some good cream, and when this is quite absorbed, and the carrots
again appear dry, dish and serve them quickly with small sippets à la
Reine (see page 5), placed round them.
Carrots, 3 lbs., boiled quite tender: stirred over a gentle fire 5 to 10
minutes. Butter, 2 oz.; salt, 1/2 teaspoonful; pounded sugar, 1
dessertspoonful; cream, 1/2 pint, stewed gently together until quite
dry.
Obs.—For excellent mashed carrots omit the sugar, add a good
seasoning of salt and white pepper, and half a pint of rich brown
gravy; or for a plain dinner rather less than this of milk.
MASHED (OR BUTTERED) CARROTS.

(A Dutch Receipt.)
Prepare some finely flavoured carrots as above, and dry them
over a gentle fire like mashed turnips; then for a dish of moderate
size mix well with them from two to three ounces of good butter, cut
into small bits, keeping them well stirred. Add a seasoning of salt
and cayenne, and serve them very hot, garnished or not at pleasure
with small sippets (croutons) of fried bread.
CARROTS AU BEURRE, OR BUTTERED CARROTS.

(French.)
Either boil sufficient carrots for a dish quite tender, and then cut
them into slices a quarter of an inch thick, or first slice, and then boil
them: the latter method is the most expeditious, but the other best
preserves the flavour of the vegetable. Drain them well, and while
this is being done just dissolve from two to three ounces of butter in
a saucepan, and strew in some minced parsley, some salt, and white
pepper or cayenne; then add the carrots, and toss them very gently
until they are equally covered with the sauce, which should not be
allowed to boil: the parsley may be omitted at pleasure. Cold carrots
may be re-warmed in this way.
CARROTS IN THEIR OWN JUICE.

(A simple but excellent Receipt.)


By the following mode of dressing carrots, whether young or old,
their full flavour and all the nutriment they contain are entirely
preserved; and they are at the same time rendered so palatable by it
that they furnish at once an admirable dish to eat without meat, as
well as with it. Wash the roots very clean, and scrape or lightly pare
them, cutting out any discoloured parts. Have ready boiling and
salted, as much water as will cover them; slice them rather thick,
throw them into it, and should there be more than sufficient to just
float them (and barely that), pour it away. Boil them gently until they
are tolerably tender, and then very quickly, to evaporate the water, of
which only a spoonful or so should be left in the saucepan. Dust a
seasoning of pepper on them, throw in a morsel of butter rolled in
flour, and turn and toss them gently until their juice is thickened by
them and adheres to the roots. Send them immediately to table.
They are excellent without any addition but the pepper; though they
may be in many ways improved. A dessertspoonful of minced
parsley may be strewed over them when the butter is added, and a
little thick cream mixed with a small proportion of flour to prevent its
curdling, may be strewed amongst them, or a spoonful or two of
good gravy.
TO BOIL PARSNEPS.

These are dressed in precisely the same manner as carrots, but
require much less boiling. According to their quality and the time of
year, they will take from twenty minutes to nearly an hour. Every
speck or blemish should be cut from them after they are scraped,
and the water in which they are boiled should be well skimmed. They
are a favourite accompaniment to salt fish and boiled pork, and may
be served either mashed or plain.
20 to 25 minutes.
FRIED PARSNEPS.

Boil them until they are about half done, lift them out, and let them
cool; slice them rather thickly, sprinkle them with fine salt and white
pepper, and fry them a pale brown in good butter. Serve them with
roast meat, or dish them under it.
JERUSALEM ARTICHOKES.

Wash the artichokes, pare them quickly, and throw them as they
are done into a saucepan of cold water, or of equal parts of milk and
water; and when they are about half boiled add a little salt to them.
Take them up the instant they are perfectly tender: this will be in from
fifteen to twenty-five minutes, so much do they vary in size and as to
the time necessary to dress them. If allowed to remain in the water
after they are done, they become black and flavourless. Melted
butter should always be sent to table with them.
15 to 25 minutes.
TO FRY JERUSALEM ARTICHOKES. (ENTREMETS.)

Boil them from eight to twelve minutes; lift them out, drain them on
a sieve, and let them cool; dip them into beaten eggs, and cover
them with fine bread-crumbs. Fry them a light brown, drain, pile them
in a hot dish, and serve them quickly.
JERUSALEM ARTICHOKES, À LA REINE.

Wash and wipe the artichokes, cut off one end of each quite flat, and trim the
other into a point; boil them in milk and water, lift them out the
instant they are done, place them upright in the dish in which
they are to be served, and
sauce them with a good béchamel, or with nearly half a pint of cream
thickened with a dessertspoonful of flour, mixed with an ounce and
a half of butter, and seasoned with a little mace and some salt. When
cream cannot be procured use new milk, and increase the proportion
of flour and butter.

[Illustration: Artichokes à la Reine.]
MASHED JERUSALEM ARTICHOKES.

Boil them tender, press the water well from them, and then
proceed exactly as for mashed turnips, taking care to dry the
artichokes well, both before and after the milk or cream is added to
them; they will be excellent if good white sauce be substituted for
either of these.
HARICOTS BLANCS.

The haricot blanc is the seed of a particular kind of French bean,
of which we find some difficulty in ascertaining the English name, for
though we have tried several which resembled it in appearance, we
have found their flavour, after they were dressed, very different, and
far from agreeable. The large white Dutch runner, is, we believe, the
proper variety for cooking; at least we have obtained a small quantity
under that name, which approached much more nearly than any
others we had tried to those which we had eaten abroad. The
haricots, when fresh may be thrown into plenty of boiling water, with
some salt and a small bit of butter; if dry, they must be previously
soaked for an hour or two, put into cold water, brought to boil gently,
and simmered until they are tender, for if boiled fast the skins will
burst before the beans are done. Drain them thoroughly from the
water when they are ready, and lay them into a clean saucepan over
two or three ounces of fresh butter, a small dessertspoonful of
chopped parsley, and sufficient salt and pepper to season the whole;
then gently shake or toss the beans until they are quite hot and
equally covered with the sauce; add the strained juice of half a
lemon, and serve them quickly. The vegetable thus dressed, is
excellent; and it affords a convenient resource in the season when
the supply of other kinds is scantiest. In some countries the dried
beans are placed in water, over-night, upon a stove, and by a very
gentle degree of warmth are sufficiently softened by the following
day to be served as follows:—they are drained from the water,
spread on a clean cloth and wiped quite dry, then lightly floured and
fried in oil or butter, with a seasoning of pepper and salt, lifted into a
hot dish, and served under roast beef, or mutton.
TO BOIL BEET ROOT.

Wash the roots delicately clean, but neither scrape nor cut them,
for should even the small fibres be taken off before they are cooked,
their beautiful colour would be much injured. Throw them into boiling
water, and, according to their size, which varies greatly, as they are
sometimes of enormous growth, boil them from one hour and a half
to two and a half, or longer if requisite. Pare and serve them whole,
or cut into thick slices and neatly dished in a close circle: send
melted butter to table with them. Cold red beet root is often
intermingled with other vegetables for winter salads; and it makes a
pickle of remarkably brilliant hue. A common mode of serving it at
the present day is in the last course of a dinner with the cheese: it is
merely pared and sliced after having been baked or boiled tender.
1-1/2 to 2-1/2 hours, or longer.
TO BAKE BEET ROOT.

Beet root if slowly and carefully baked until it is tender quite
through, is very rich and sweet in flavour, although less bright in
colour than when it is boiled: it is also, we believe, remarkably
nutritious and wholesome. Wash and wipe it very dry, but neither cut
nor break any part of it; then lay it into a coarse earthen dish, and
bake it in a gentle oven for four or five hours: it will sometimes
require even a longer time than this. Pare it quickly if it be served
hot; but leave it to cool first, when it is to be sent to table cold.
In slow oven from 4 to 6 hours.
STEWED BEET ROOT.

Bake or boil it tolerably tender, and let it remain until it is cold, then
pare and cut it into slices; heat and stew it for a short time in some
good pale veal gravy (or in strong veal broth for ordinary occasions),
thicken this with a teaspoonful of arrow-root, and half a cupful or
more of good cream, and stir in, as it is taken from the fire, from a
tea to a tablespoonful of chili vinegar. The beet root may be served
likewise in thick white sauce, to which, just before it is dished, the
mild eschalots of page 128 may be added.
TO STEW RED CABBAGE.

(Flemish Receipt.)
Strip the outer leaves from a fine and fresh red cabbage; wash it
well, and cut it into the thinnest possible slices, beginning at the top;
put it into a thick saucepan in which two or three ounces of good
butter have been just dissolved; add some pepper and salt, and stew
it very slowly indeed for three or four hours in its own juice, keeping it
often stirred, and well pressed down. When it is perfectly tender add
a tablespoonful of vinegar; mix the whole up thoroughly, heap the
cabbage in a hot dish, and serve broiled sausages round it; or omit
these last, and substitute lemon-juice, cayenne pepper, and a half-
cupful of good gravy.
The stalk of the cabbage should be split in quarters and taken
entirely out in the first instance.
3 to 4 hours.
BRUSSELS SPROUTS.

These delicate little sprouts, or miniature cabbages, which at their
fullest growth scarcely exceed a large walnut in size, should be quite
freshly gathered. Free them from all discoloured leaves, cut the
stems even, and wash the sprouts thoroughly. Throw them into a
pan of water properly salted, and boil them quickly from eight to ten
minutes; drain them well, and serve them upon a rather thick round
of toasted bread buttered on both sides. Send good melted butter to
table with them. This is the Belgian mode of dressing this excellent
vegetable, which is served in France with the sauce poured over it,
or it is tossed in a stewpan with a slice of butter and some pepper
and salt: a spoonful or two of veal gravy (and sometimes a little
lemon-juice) is added when these are perfectly mixed.
8 to 10 minutes.
SALSIFY.

We are surprised that a vegetable so excellent as this should be
so little cared for in England. Delicately fried in batter—which is a
common mode of serving it abroad—it forms a delicious second
course dish: it is also good when plain-boiled, drained, and served in
gravy, or even with melted butter. Wash the roots, scrape gently off
the dark outside skin, and throw them into cold water as they are
done, to prevent their turning black; cut them into lengths of three or
four inches, and when all are ready put them into plenty of boiling
water with a little salt, a small bit of butter, and a couple of spoonsful
of white vinegar or the juice of a lemon: they will be done in from
three quarters of an hour to an hour. Try them with a fork, and when
perfectly tender, drain, and serve them with white sauce, rich brown
gravy, or melted butter.
3/4 to 1 hour.
FRIED SALSIFY. (ENTREMETS.)

Boil the salsify tender, as directed above, drain, and then press it
lightly in a soft cloth. Make some French batter (see Chapter V.),
throw the bits of salsify into it, take them out separately, and fry them
a light brown, drain them well from the fat, sprinkle a little fine salt
over them after they are dished, and serve them quickly. At English
tables, salsify occasionally makes its appearance fried with egg and
bread-crumbs instead of batter. Scorzonera is dressed in precisely
the same manner as the salsify.
BOILED CELERY.

This vegetable is extremely good dressed like sea-kale, and
served on a toast with rich melted butter. Let it be freshly dug, wash
it with great nicety, trim the ends, take off the coarse outer-leaves,
cut the roots of equal length, tie them in bunches, and boil them in
plenty of water, with the usual proportion of salt, from twenty to thirty
minutes.
20 to 30 minutes.
STEWED CELERY.

Cut five or six fine roots of celery to the length of the inside of the
dish in which they are to be served; free them from all the coarser
leaves, and from the green tops, trim the root ends neatly, and wash
the vegetable in several waters until it is as clean as possible; then,
either boil it tender with a little salt, and a bit of fresh butter the size
of a walnut, in just sufficient water to cover it quite, drain it well,
arrange it on a very hot dish, and pour a thick béchamel, or white
sauce over it; or stew it in broth or common stock, and serve it with
very rich, thickened, Espagnole or brown gravy. It has a higher
flavour when partially stewed in the sauce, after being drained
thoroughly from the broth. Unless very large and old, it will be done
in from twenty-five to thirty minutes, but if not quite tender, longer
time must be allowed for it. A cheap and expeditious method of
preparing this dish is to slice the celery, to simmer it until soft in as
much good broth as will only just cover it, and to add a thickening of
flour and butter, or arrow-root, with some salt, pepper, and a small
cupful of cream.
25 to 30 minutes, or more.
