B.Sc., B.Tech, PGDCA, PGDM, MBA, PhD

EXPLORATORY DATA ANALYSIS

Exploratory data analysis is both a data analysis perspective and a set of techniques. We will present unique and conventional techniques, including graphical and tabular devices, to visualize the data. In exploratory data analysis (EDA) the researcher has the flexibility to respond to the patterns revealed in the preliminary analysis of the data. Thus, patterns in the collected data guide the data analysis or suggest revisions to the preliminary data analysis plan. This flexibility is an important attribute of this approach.

Data Exploration, Examination, and Analysis in the Research Process

Exploratory data analysis techniques

Frequency Tables, Bar Charts, and Pie Charts

Nominal Displays of Data (Minimum Age for Social Networking)

Histograms

The histogram is a conventional solution for the display of interval-ratio data. Histograms are used when it is possible to group the variable’s values into intervals. Histograms are constructed with bars (or asterisks) that represent data values, where each value occupies an equal amount of area within the enclosed area. Data analysts find histograms useful for (1) displaying all intervals in a distribution, even those without observed values, and (2) examining the shape of the distribution for skewness, kurtosis, and the modal pattern. When looking at a histogram, one might ask: Is there a single hump (a mode)? Are subgroups identifiable when multiple modes are present? Are straggling data values detached from the central concentration?

Pareto Diagrams

In quality management, J. M. Juran first applied this concept by noting that only a vital few defects account for most problems evaluated for quality and that the trivial many explain the rest.
Historically, this has come to be known as the 80/20 rule: an 80 percent improvement in quality or performance can be expected by eliminating 20 percent of the causes of unacceptable quality or performance. The Pareto diagram is a bar chart whose percentages sum to 100 percent. The data are derived from a multiple-choice, single-response scale; a multiple-choice, multiple-response scale; or frequency counts of words (or themes) from content analysis. The respondents’ answers are sorted in decreasing importance, with bar height in descending order from left to right. The pictorial array that results reveals the highest concentration of improvement potential in the fewest number of remedies. The cumulative frequency line in this exhibit shows that the top two problems (the repair did not resolve the customer’s problem, and the product was returned multiple times for repair) accounted for 80 percent of the perceptions of inadequate repair service.

Boxplots

The boxplot, or box-and-whisker plot, is another technique used frequently in exploratory data analysis.
A boxplot reduces the detail of the stem-and-leaf display and provides a different visual image of the distribution’s location, spread, shape, tail length, and outliers. Boxplots are extensions of the five-number summary of a distribution. This summary consists of the median, the upper and lower quartiles, and the largest and smallest observations. The median and quartiles are used because they are particularly resistant statistics.

z-scores

A z-score allows us to measure the relative location of a value in the data set. More specifically, a z-score helps us determine how far a particular value is from the mean relative to the data set’s standard deviation. Suppose we have a sample of n observations, with the values denoted by x_1, x_2, ..., x_n. In addition, assume that the sample mean, x̄, and the sample standard deviation, s, are already computed. Associated with each value, x_i, is another value called its z-score, computed as z_i = (x_i - x̄) / s.

Box Plots

A box plot is a graphical summary of the distribution of data. A box plot is developed from the quartiles for a data set. Figure 14 is a box plot for the home sales data. Here are the steps used to construct the box plot:

What can we learn from these box plots? The most expensive houses appear to be in Shadyside and the cheapest houses in Hamilton. The median home selling price in Groton is about the same as the median home selling price in Irving. However, home sales prices in Irving have much greater variability. Homes appear to be selling in Irving for many different prices, from very low to very high. Home selling prices have the least variation in Groton and Hamilton. Unusually expensive home sales (relative to the respective distribution of home sales values) have occurred in Fairview, Groton, and Irving, which appear as outliers. Groton is the only location with a low outlier, but note that most homes sell for very similar prices in Groton, so the selling price does not have to be too far from the median to be considered an outlier.
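
The five-number summary behind a box plot and the z-score described earlier can both be computed in a few lines. The following is a minimal sketch, not the procedure used to produce Figure 14. It assumes Python's standard `statistics` module, the "inclusive" quartile convention, and the common 1.5 × IQR rule for flagging outliers (the source does not state which rule its box plots use). The function names and the price list are hypothetical and purely illustrative.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a sequence of numbers."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three quartile cut points. The "inclusive"
    # method is one of several quartile conventions (an assumption here);
    # other conventions can give slightly different Q1 and Q3.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def iqr_outliers(values, k=1.5):
    """Return the values lying beyond k * IQR from the quartiles.

    k = 1.5 is a widely used convention, assumed here for illustration.
    """
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


def z_scores(values):
    """Return z_i = (x_i - mean) / s for each value, with s the sample
    standard deviation (n - 1 in the denominator).

    Requires at least two values that are not all identical.
    """
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - mean) / s for v in values]


if __name__ == "__main__":
    # Made-up selling prices (in thousands); not taken from the home sales data.
    prices = [100, 110, 120, 130, 140, 150, 160, 170, 400]
    print(five_number_summary(prices))
    print(iqr_outliers(prices))  # [400]
    print([round(z, 2) for z in z_scores(prices)])
```

In this made-up sample only the value 400 falls beyond the upper fence. Because the fences scale with the interquartile range, a tightly clustered distribution flags values that sit only modestly far from the median, which is the behavior noted for Groton above.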