A Report On Statistical Analysis With MS Excel

Submitted in partial fulfilment of the requirements for the degree of
Bachelor of Commerce (Hons.)
to
MAHARSHI DAYANAND UNIVERSITY, ROHTAK
(2017-2020)

Submitted To:
Government College, Sector – 9, Gurugram

Submitted By:
Name: Anil Bisht
Class: B.Com (Hons.), 4th Semester
College Roll No.: 1924
Registration No.: 17GU301062
University Roll No.: –

GOVERNMENT COLLEGE, SECTOR – 9, GURUGRAM, HARYANA
(Affiliated To Maharshi Dayanand University)
BIO – DATA

Name: Anil Bisht
Father's Name: Mr. Khimendra Singh
Date of Birth: 01/09/1999
Address: H.No. 120, Gali No. 8, Block B, Sheetla Colony, Gurgaon
Contact No.: +91 9911268826
Nationality: Indian
Religion: Hindu
Topic of Research: Statistical and sampling analysis
DECLARATION

I, the undersigned Anil Bisht, student of B.Com (Hons.) 4th semester, declare that the project report on "STATISTICAL ANALYSIS WITH MS EXCEL" has been personally done by me under the guidance of Prof. Sanjeev Khurana, in partial fulfilment of the bachelor's degree, during the academic session 2017-2020. All the data presented in the project is true and correct to the best of my knowledge and belief.
I also declare that this project report is my own preparation and has not been copied from anywhere.

Anil Bisht
ACKNOWLEDGEMENT

I would like to express my special thanks of gratitude to my teacher, Alankrita, who gave me the golden opportunity to do this wonderful report on this topic, which also helped me do a great deal of research and learn about so many new things. I am really thankful to her.
Secondly, I would also like to thank my friends and teachers who helped me in finalizing this project report within the limited time frame.

Anil Bisht
STATISTICAL ANALYSIS

Introduction :- Statistical analysis is the science of collecting, exploring and presenting large amounts of data to discover underlying patterns and trends. Statistics are applied every day – in research, industry and government – to become more scientific about decisions that need to be made.
For example:

• Manufacturers use statistics to weave quality into beautiful fabrics, to bring lift to the airline industry and to help guitarists make beautiful music.
• Researchers keep children healthy by using statistics to analyze data from the production of viral vaccines, which ensures consistency and safety.
• Communication companies use statistics to optimize network resources, improve service and reduce customer churn by gaining greater insight into subscriber requirements.
• Government agencies around the world rely on statistics for a clear understanding of their countries, their businesses and their people.
Around us, from the tube of toothpaste in your bathroom
to the planes flying overhead, you see hundreds of
products and processes every day that have been
improved through the use of statistics.
Statistical Computing
Traditional methods for statistical analysis – from sampling data to interpreting
results – have been used by scientists for thousands of years. But today’s data
volumes make statistics ever more valuable and powerful. Affordable storage,
powerful computers and advanced algorithms have all led to an increased use of
computational statistics.

Whether you are working with large data volumes or running multiple permutations of
your calculations, statistical computing has become essential for today’s statistician.
Popular statistical computing practices include:

• Statistical programming – From traditional analysis of variance and linear regression to exact methods and statistical visualization techniques, statistical programming is essential for making data-based decisions in every field.
• Econometrics – Modeling, forecasting and simulating business processes for improved strategic and tactical planning. This method applies statistics to economics to forecast future trends.
• Operations research – Identifying the actions that will produce the best results, based on many possible options and outcomes. Scheduling, simulation and related modeling processes are used to optimize business processes and management challenges.
• Matrix programming – Powerful computer techniques for implementing your own statistical methods and exploratory data analysis using row operation algorithms.
• Statistical visualization – Fast, interactive statistical analysis and exploratory capabilities in a visual interface can be used to understand data and build models.
• Statistical quality improvement – A mathematical approach to reviewing the quality and safety characteristics for all aspects of production.

Simple Statistical Analysis


Once we have collected quantitative data, we will have a lot of
numbers. It’s now time to carry out some statistical analysis to
make sense of, and draw some inferences from, our data. There is
a wide range of possible techniques that we can use.

Summarising Data: Grouping and Visualising

The first thing to do with any data is to summarise it, which means to
present it in a way that best tells the story.

The starting point is usually to group the raw data into categories, and/or to
visualise it. For example, if you think you may be interested in differences
by age, the first thing to do is probably to group your data in age categories,
perhaps ten- or five-year chunks.

One of the most common techniques used for summarising is using graphs, particularly bar charts, which show every data point in order, or histograms, which are bar charts grouped into broader categories.

An example is shown below, which uses three sets of data, grouped by four categories. This might, for example, be men, women, and ‘no gender specified’, grouped by age categories 20–29, 30–39, 40–49 and 50–59.
Visualise Your Data

The important thing about drawing a graph is that it gives you an immediate ‘picture’ of the data. This is important because it shows us straight away whether our data are grouped together, spread about, tending towards high or low values, or clustered around a central point. It will also show us whether we have any ‘outliers’, that is, very high or very low data values, which we may want to exclude from the analysis, or at least revisit to check that they are correct.

It is always worth drawing a graph before you start any further analysis, just to have a look at your data.

We can also display grouped data in a pie chart, such as this one.
Pie charts are best used when you are interested in the relative size of
each group, and what proportion of the total fits into each category, as
they illustrate very clearly which groups are bigger.

Measures of Location: Averages

The average gives us information about the size of the effect of whatever we are testing, in other words, whether it is large or small. There are three measures of average: mean, median and mode.

When most people say average, they are talking about the mean. It has the advantage that it uses all the data values obtained and can be used for further statistical analysis. However, it can be skewed by ‘outliers’, values which are atypically large or small.

As a result, researchers sometimes use the median instead. This is the mid-point of all the data. The median is not skewed by extreme values, but it is harder to use for further statistical analysis.

The mode is the most common value in a data set. It cannot be used for further statistical analysis.

The values of mean, median and mode are not necessarily the same, which is why it is really important to be clear which ‘average’ you are talking about.
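All three averages are available as Excel worksheet functions. A minimal sketch, assuming the values of interest sit in a hypothetical range A2:A21:

=AVERAGE(A2:A21) – returns the mean
=MEDIAN(A2:A21) – returns the median
=MODE.SNGL(A2:A21) – returns the mode (in Excel 2007 and earlier the equivalent function is =MODE)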

Measures of Spread:
Range, Variance and Standard Deviation
Researchers often want to look at the spread of the data, that is, how
widely the data are spread across the whole possible measurement
scale.

There are three measures which are often used for this:

The range is the difference between the largest and smallest values.
Researchers often quote the interquartile range, which is the range of
the middle half of the data, from 25%, the lower quartile, up to 75%,
the upper quartile, of the values (the median is the 50% value). To find
the quartiles, use the same procedure as for the median, but take the
quarter- and three-quarter-point instead of the mid-point.
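Excel can also return the quartiles directly. A minimal sketch, again assuming the data are in a hypothetical range A2:A21:

=QUARTILE.INC(A2:A21,1) – returns the lower quartile (25% point)
=QUARTILE.INC(A2:A21,3) – returns the upper quartile (75% point)
=QUARTILE.INC(A2:A21,3)-QUARTILE.INC(A2:A21,1) – returns the interquartile range

In Excel 2007 and earlier the equivalent function is =QUARTILE.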

The standard deviation measures the average spread around the mean,
and therefore gives a sense of the ‘typical’ distance from the mean.

The variance is the square of the standard deviation. They are calculated by:

• calculating the difference of each value from the mean;
• squaring each one (to eliminate any difference between those above and below the mean);
• summing the squared differences;
• dividing by the number of items minus one.

This gives the variance. To calculate the standard deviation, take the square root of the variance.
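In Excel these do not have to be built up step by step; the sample versions of both measures are available as worksheet functions. A minimal sketch, assuming the values are in a hypothetical range A2:A21:

=VAR.S(A2:A21) – returns the sample variance (dividing by the number of items minus one, as described above)
=STDEV.S(A2:A21) – returns the sample standard deviation (the square root of the variance)

In Excel 2007 and earlier the equivalent functions are =VAR and =STDEV.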

Skew - The skew measures how symmetrical the data set is, or
whether it has more high values, or more low values. A sample with
more low values is described as negatively skewed and a sample with
more high values as positively skewed.

Generally speaking, the more skewed the sample, the less the mean,
median and mode will coincide.
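Excel also provides a worksheet function for this measure; assuming the same hypothetical range A2:A21:

=SKEW(A2:A21) – returns the skewness: negative for a negatively skewed sample, positive for a positively skewed one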

More Advanced Analysis

Once you have calculated some basic values of location, such as mean or median, and spread, such as range and variance, and established the level of skew, you can move to more advanced statistical analysis, and start to look for patterns in the data.

MICROSOFT – EXCEL
Microsoft Excel is a spreadsheet developed by Microsoft for
Windows, macOS, Android and iOS. It features calculation,
graphing tools, pivot tables, and a macro programming language
called Visual Basic for Applications. It has been a very widely
applied spreadsheet for these platforms, especially since version
5 in 1993, and it has replaced Lotus 1-2-3 as the industry
standard for spreadsheets. Excel forms part of Microsoft Office.
A single Excel worksheet now has 16,384 columns by 1,048,576 rows, which gives 17,179,869,184 cells; the last cell is named XFD1048576.

Microsoft Excel has the basic features of all spreadsheets, using a grid of
cells arranged in numbered rows and letter-named columns to organize data
manipulations like arithmetic operations. It has a battery of supplied
functions to answer statistical, engineering and financial needs. In addition,
it can display data as line graphs, histograms and charts, and with a very
limited three-dimensional graphical display.
It allows sectioning of data to view its dependencies on various factors for
different perspectives (using pivot tables and the scenario manager). It has
a programming aspect, Visual Basic for Applications, allowing the user to
employ a wide variety of numerical methods, for example, for solving
differential equations of mathematical physics, and then reporting the results
back to the spreadsheet.
It also has a variety of interactive features allowing user interfaces that can
completely hide the spreadsheet from the user, so the spreadsheet presents
itself as a so-called application, or decision support system (DSS), via a
custom-designed user interface, for example, a stock analyzer, or in
general, as a design tool that asks the user questions and provides answers
and reports.
In a more elaborate realization, an Excel application can automatically poll
external databases and measuring instruments using an update schedule,
analyze the results, make a Word report or PowerPoint slide show, and e-
mail these presentations on a regular basis to a list of participants. Excel
was not designed to be used as a database.

History - From its first version Excel supported end user programming
of macros (automation of repetitive tasks) and user defined functions
(extension of Excel's built-in function library). In early versions of Excel
these programs were written in a macro language whose statements had
formula syntax and resided in the cells of special purpose macro sheets
(stored with file extension .XLM in Windows.) XLM was the default macro
language for Excel through Excel 4.0. Beginning with version 5.0 Excel
recorded macros in VBA by default but with version 5.0 XLM recording was
still allowed as an option. After version 5.0 that option was discontinued. All
versions of Excel, including Excel 2010 are capable of running an XLM
macro, though Microsoft discourages their use.

Charts - Excel supports charts, graphs, or histograms generated from specified groups of cells. The generated graphic component can either be embedded within the current sheet, or added as a separate object.
These displays are dynamically updated if the content of cells change. For
example, suppose that the important design requirements are displayed
visually; then, in response to a user's change in trial values for parameters,
the curves describing the design change shape, and their points of
intersection shift, assisting the selection of the best design.

Data storage and communication


Number of rows and columns - Versions of Excel up to 7.0 had a limitation in the size of their data sets of 16K (2^14 = 16,384) rows. Versions 8.0 through 11.0 could handle 64K (2^16 = 65,536) rows and 256 columns (2^8, last column labelled 'IV'). Version 12.0 can handle 1M (2^20 = 1,048,576) rows and 16,384 (2^14, last column labelled 'XFD') columns.
File formats - Microsoft Excel up until 2007 version used a proprietary binary
file format called Excel Binary File Format (.XLS) as its primary format. Excel 2007
uses Office Open XML as its primary file format, an XML-based format that followed
after a previous XML-based format called "XML Spreadsheet" ("XMLSS"), first
introduced in Excel 2002.
Although supporting and encouraging the use of new XML-based formats as
replacements, Excel 2007 remained backwards-compatible with the traditional,
binary formats. In addition, most versions of Microsoft Excel can
read CSV, DBF, SYLK, DIF, and other legacy formats. Support for some older file
formats was removed in Excel 2007. The file formats were mainly from DOS-based
programs.

Binary - OpenOffice.org has created documentation of the Excel format. Since then, Microsoft has made the Excel binary format specification available for free download.

Current file extensions - Microsoft Excel 2007, along with the other products in the Microsoft Office 2007 suite, introduced new file formats. The first of these (.xlsx) is defined in the Office Open XML (OOXML) specification.

Excel 2007 formats

• Excel Workbook (.xlsx) – The default Excel 2007 and later workbook format. In reality a ZIP compressed archive with a directory structure of XML text documents. Functions as the primary replacement for the former binary .xls format, although it does not support Excel macros for security reasons.
• Excel Macro-enabled Workbook (.xlsm) – As Excel Workbook, but with macro support.
• Excel Binary Workbook (.xlsb) – As Excel Macro-enabled Workbook, but storing information in binary form rather than XML documents for opening and saving documents more quickly and efficiently. Intended especially for very large documents with tens of thousands of rows, and/or several hundred columns.
• Excel Macro-enabled Template (.xltm) – A template document that forms a basis for actual workbooks, with macro support. The replacement for the old .xlt format.
• Excel Add-in (.xlam) – Excel add-in to add extra functionality and tools. Inherent macro support because of the file purpose.

Old file extensions

• Spreadsheet (.xls) – Main spreadsheet format which holds data in worksheets, charts, and macros.
• Add-in (VBA) (.xla) – Adds custom functionality; written in VBA.
• Toolbar (.xlb) – The file extension where Microsoft Excel custom toolbar settings are stored.
• Chart (.xlc) – A chart created with data from a Microsoft Excel spreadsheet that only saves the chart. To save the chart and spreadsheet, save as .XLS. XLC is not supported in Excel 2007 or in any newer versions of Excel.
• Dialog (.xld) – Used in older versions of Excel.
• Archive (.xlk) – A backup of an Excel spreadsheet.
• Add-in (DLL) (.xll) – Adds custom functionality; written in C++/C, Visual Basic, Fortran, etc. and compiled into a special dynamic-link library.
• Macro (.xlm) – A macro created by the user or pre-installed with Excel.
• Template (.xlt) – A pre-formatted spreadsheet created by the user or by Microsoft Excel.
• Module (.xlv) – A module written in VBA (Visual Basic for Applications) for Microsoft Excel.
• Library (.DLL) – Code written in VBA may access functions in a DLL; typically this is used to access the Windows API.
• Workspace (.xlw) – Arrangement of the windows of multiple workbooks.

Using other Windows applications - Windows applications such as Microsoft Access and Microsoft Word, as well as Excel, can communicate with each other and use each other's capabilities. The most common mechanism is Dynamic Data Exchange (DDE): although strongly deprecated by Microsoft, this is a common method to send data between applications running on Windows, with official MS publications referring to it as "the protocol from hell". As the name suggests, it allows applications to supply data to others for calculation and display. It is very common in financial markets, being used to connect to important financial data services such as Bloomberg and Reuters.

OLE (Object Linking and Embedding) allows a Windows application to control another to enable it to format or calculate data. This may take on the form of "embedding", where an application uses another to handle a task that it is more suited to; for example, a PowerPoint presentation may be embedded in an Excel spreadsheet or vice versa.
Using external data

Password protection - Microsoft Excel protection offers several types of passwords:

• Password to open a document
• Password to modify a document
• Password to unprotect a worksheet
• Password to protect a workbook
• Password to protect a shared workbook

All passwords except the password to open a document can be removed instantly, regardless of the Microsoft Excel version used to create the document. These types of passwords are used primarily for shared work on a document. Such password-protected documents are not encrypted, and the data for a set password are saved in a document's header.

Statistical Analysis With MS - Excel

At A Glance - We used Excel to do some basic data analysis tasks to see whether it is a reasonable alternative to using a statistical package for the same tasks. We concluded that Excel is a poor choice for statistical analysis beyond textbook examples, the simplest descriptive statistics, or for more than a very few columns. The problems we encountered that led to this conclusion are in four general areas:

• Missing values are handled inconsistently, and sometimes incorrectly.
• Data organization differs according to analysis, forcing you to reorganize your data in many ways if you want to do many different analyses.
• Many analyses can only be done on one column at a time, making it inconvenient to do the same analysis on many columns.
• Output is poorly organized, sometimes inadequately labeled, and there is no record of how an analysis was accomplished.

Excel is convenient for data entry, and for quickly manipulating rows and columns prior to statistical analysis. However, when you are ready to do the statistical analysis, we recommend the use of a statistical package such as SAS, SPSS, Stata, Systat or Minitab.

Introduction - Excel is probably the most commonly used spreadsheet for PCs. Newly purchased computers often arrive with Excel already loaded. It is easily used to do a variety of calculations, includes a collection of statistical functions, and a Data Analysis ToolPak. As a result, if you suddenly find you need to do some statistical analysis, you may turn to it as the obvious choice. We decided to do some testing to see how well Excel would serve as a Data Analysis application.

To present the results, we will use a small example. The data for this
example is fictitious. It was chosen to have two categorical and two
continuous variables, so that we could test a variety of basic statistical
techniques. Since almost all real data sets have at least a few missing
data points, and since the ability to deal with missing data correctly is
one of the features that we take for granted in a statistical analysis
package, we introduced two empty cells in the data:

We used this data to do some simple analyses and compared the results with a
standard statistical package. The comparison considered the accuracy of the results as
well as the ease with which the interface could be used for bigger data sets - i.e. more
columns. We used SPSS as the standard, though any of the statistical packages OIT
supports would do equally well for this purpose. In this article when we say "a statistical
package," we mean SPSS, SAS, STATA, SYSTAT, or Minitab.
Most of Excel's statistical procedures are part of the Data Analysis ToolPak, which is in the Tools menu. It includes a variety of choices, including simple descriptive statistics, t-tests, correlations, 1- or 2-way analysis of variance, regression, etc. If you do not have a Data Analysis item on the Tools menu, you need to install the Data Analysis ToolPak. Search in Help for "Data Analysis Tools" for instructions on loading the ToolPak.

Two other Excel features are useful for certain analyses, but the Data Analysis tool pack
is the only one that provides reasonably complete tests of statistical significance. Pivot
Table in the Data menu can be used to generate summary tables of means, standard
deviations, counts, etc. Also, you could use functions to generate some statistical
measures, such as a correlation coefficient. Functions generate a single number, so
using functions you will likely have to combine bits and pieces to get what you want.
Even so, you may not be able to generate all the parts you need for a complete
analysis.

Unless otherwise stated, all statistical tests using Excel were done with the Data
Analysis ToolPak. In order to check a variety of statistical tests, we chose the following tasks:

• Get means and standard deviations of X and Y for the entire group, and for each treatment group.
• Get the correlation between X and Y.
• Do a two-sample t-test to test whether the two treatment groups differ on X and Y.
• Do a paired t-test to test whether X and Y are statistically different from each other.
• Compare the number of subjects with each outcome by treatment group, using a chi-squared test.

All of these tasks are routine for a data set of this nature, and all of them could be easily done using any of the above-listed statistical packages.

General Issues
Enable the Analysis ToolPak - The Data Analysis ToolPak is not installed with the
standard Excel setup.  Look in the Tools menu.  If you do not have a Data Analysis
item, you will need to install the Data Analysis tools.  Search Help for "Data Analysis
Tools" for instructions.

Missing Values - A blank cell is the only way for Excel to deal with missing data.  If
you have any other missing value codes, you will need to change them to blanks.
Data Arrangement - Different analyses require the data to be arranged in various
ways.  If you plan on a variety of different tests, there may not be a single
arrangement that will work.  You will probably need to rearrange the data several
ways to get everything you need.

Dialog Boxes -Choose Tools/Data Analysis, and select the kind of


analysis you want to do.  The typical dialog box will have the following
items:
            Input Range:  Type the upper left and lower right corner cells.
e.g. A1:B100.  You can only choose adjacent rows and columns.  Unless
there is a checkbox for grouping data by rows or columns (and there
usually is not), all the data is considered as one glop.
            Labels - There is sometimes a box you can check off to indicate
that the first row of your sheet contains labels.  If you have labels in the
first row, check this box, and your output MAY be labeled with your
label.  Then again, it may not. 
            Output location - New Sheet is the default.  Or, type in the cell
address of the upper left corner of where you want to place the output in
the current sheet.  New Worksheet is another option, which I have not
tried.  Ramifications of this choice are discussed below.
            Other items, depending on the analysis.

Output location - The output from each analysis can go to a new sheet
within your current Excel file (this is the default), or you can place it
within the current sheet by specifying the upper left corner cell where
you want it placed.  Either way is a bit of a nuisance.  If each output is in
a new sheet, you end up with lots of sheets, each with a small bit of
output.  If you place them in the current sheet, you need to place them
appropriately; leave room for adding comments and labels; changes you
need to make to format one output properly may affect another output
adversely.  Example:  Output from Descriptives has a column of labels
such as Standard Deviation, Standard Error, etc.  You will want to make
this column wide in order to be able to read the labels.  But if a simple
Frequency output is right underneath, then the column displaying the
values being counted, which may just contain small integers, will also be
wide.  
 
Results of Analyses

Descriptive Statistics - The quickest way to get means and standard deviations for an entire group is using Descriptives in the Data Analysis tools. We can choose
several adjacent columns for the Input Range (in this case the X and Y columns),
and each column is analyzed separately. The labels in the first row are used to label
the output, and the empty cells are ignored. If you have more, non-adjacent columns
you need to analyze, you will have to repeat the process for each group of
contiguous columns. The procedure is straightforward, can manage many columns
reasonably efficiently, and empty cells are treated properly.

Correlations - Using the Data Analysis tools, the dialog for correlations is much
like the one for descriptives - you can choose several contiguous columns, and get
an output matrix of all pairs of correlations. Empty cells are ignored appropriately.
The output does NOT include the number of pairs of data points used to compute
each correlation (which can vary, depending on where you have missing data), and
does not indicate whether any of the correlations are statistically significant. If you
want correlations on non-contiguous columns, you would either have to include the
intervening columns, or copy the desired columns to a contiguous location.
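If only a single correlation is needed rather than a whole matrix, the CORREL worksheet function can be used instead of the Data Analysis tool. A minimal sketch, assuming the X and Y values sit in hypothetical ranges B2:B11 and C2:C11:

=CORREL(B2:B11,C2:C11) – returns the Pearson correlation between X and Y

Like the tool, however, this returns only the coefficient itself, not the number of pairs used or any test of significance.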

Two-Sample T-test - This test can be used to check whether the two treatment
groups differ on the values of either X or Y. In order to do the test you need to enter
a cell range for each group. Since the data were not entered by treatment group, we
first need to sort the rows by treatment. Be sure to take all the other columns
along with treatment, so that the data for each subject remains intact. After the
data is sorted, you can enter the range of cells containing the X measurements for
each treatment. Do not include the row with the labels, because the second group
does not have a label row. Therefore your output will not be labeled to indicate that
this output is for X. If you want the output labeled, you have to copy the cells
corresponding to the second group to a separate column, and enter a row with a
label for the second group. If you also want to do the t-test for the Y measurements, you'll need to repeat the process. The empty cells are ignored, and other than the problems with labeling the output, the results are correct.
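In recent versions of Excel the same test is also available as a worksheet function, which avoids the dialog but shares the labelling limitations. A minimal sketch, assuming the X values of the two treatment groups have been placed in hypothetical ranges B2:B11 and B12:B21:

=T.TEST(B2:B11,B12:B21,2,2) – returns the two-tailed p-value of a two-sample t-test assuming equal variances
=T.TEST(B2:B11,B12:B21,2,3) – returns the same test assuming unequal variances

In Excel 2007 and earlier the function is called =TTEST, with the same arguments.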

Paired t-test - The paired t-test is a method for testing whether the difference
between two measurements on the same subject is significantly different from 0. In
this example, we wish to test the difference between X and Y measured on the same
subject. The important feature of this test is that it compares the measurements
within each subject. If you scan the X and Y columns separately, they do not look
obviously different. But if you look at each X-Y pair, you will notice that in every case,
X is greater than Y. The paired t-test should be sensitive to this difference. In the two
cases where either X or Y is missing, it is not possible to compare the two measures
on a subject. Hence, only 8 rows are usable for the paired t-test.
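The same worksheet function performs the paired version of the test when its last argument is set to 1. A minimal sketch, assuming X is in a hypothetical range B2:B11 and Y in C2:C11:

=T.TEST(B2:B11,C2:C11,2,1) – returns the two-tailed p-value of a paired t-test

The function expects the two ranges to contain the same number of data points, so it is safest to remove rows with a missing X or Y value first.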

Crosstabulation and Chi-Squared Test of Independence - Our final task is to count the two outcomes in each treatment group, and use a chi-square test of independence to test for a relationship between treatment and outcome. In order to count the outcomes by treatment group, you need to use Pivot Tables. In the Pivot
Table Wizard's Layout option, drag Treatment to Row, Outcome to Column and also
to Data.  The Data area should say "Count of Outcome" – if not, double-click on it
and select "Count". If you want percents, double-click "Count of Outcome", and click
Options; in the “Show Data As” box which appears, select "% of row".  If you want
both counts and percents, you can drag the same variable into the Data area twice,
and use it once for counts and once for percents.
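Excel also offers a worksheet function for the test itself, but it expects you to supply the expected counts as well as the observed ones; the expected count for each cell is its row total multiplied by its column total, divided by the grand total. A minimal sketch, assuming the observed counts from the pivot table have been copied to a hypothetical range B2:C3 and the expected counts computed in B6:C7:

=CHISQ.TEST(B2:C3,B6:C7) – returns the p-value of the chi-squared test of independence

In Excel 2007 and earlier the function is called =CHITEST.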

Additional Analyses
The remaining analyses were not done on this data set, but some comments about
them are included for completeness.

Simple Frequencies –
We can use Pivot Tables to get simple frequencies.  (see
Crosstabulations for more about how to get Pivot Tables.)  Using Pivot Tables, each
column is considered a separate variable, and labels in row 1 will appear on the
output.  You can only do one variable at a time.
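For a quick count of a single value, the COUNTIF function is a lightweight alternative to a Pivot Table. A minimal sketch, assuming the outcome codes are in a hypothetical range D2:D31 and "A" is one of the codes:

=COUNTIF(D2:D31,"A") – returns how many cells in the range equal "A"
=COUNTA(D2:D31) – returns how many cells in the range are non-empty, which gives the denominator for a percentage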
 
Linear Regression - Since regression is one of the more frequently used statistical
analyses, we tried it out even though we did not do a regression analysis for this
example. The Regression procedure in the Data Analysis tools lets you choose one
column as the dependent variable, and a set of contiguous columns for the
independents. However, it does not tolerate any empty cells anywhere in the input
ranges, and you are limited to 16 independent variables. Therefore, if you have any
empty cells, you will need to copy all the columns involved in the regression to new
columns, and delete any rows that contain any empty cells. Large models, with more
than 16 predictors, cannot be done at all.
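For a simple regression with a single predictor, the basic results can also be obtained from worksheet functions rather than the Regression procedure. A minimal sketch, assuming the dependent values are in a hypothetical range C2:C11 and the predictor in B2:B11:

=SLOPE(C2:C11,B2:B11) – returns the slope of the fitted line
=INTERCEPT(C2:C11,B2:B11) – returns the intercept
=RSQ(C2:C11,B2:B11) – returns the R-squared of the fit

For several predictors the array function =LINEST can be used, but the Data Analysis Regression procedure produces a more complete, labelled output.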

Analysis of Variance
In general, Excel's ANOVA features are limited to a few special cases rarely found outside textbooks, and require lots of data re-arrangement.
One-way ANOVA - Data must be arranged in separate and adjacent columns (or
rows) for each group.  Clearly, this is not conducive to doing 1-ways on more than
one grouping.  If you have labels in row 1, the output will use the labels. 

Two-Factor ANOVA Without Replication - This only does the case with one
observation per cell (i.e. no Within Cell error term).  The input range is a
rectangular arrangement of cells, with rows representing levels of one factor,
columns the levels of the other factor, and the cell contents the one value in that
cell. 

Two-Factor ANOVA with Replicates - This does a two-way ANOVA with equal


cell sizes.  Input must be a rectangular region with columns representing the levels
of one factor, and rows representing replicates within levels of the other factor.  The
input range MUST also include an additional row at the top, and column on the left,
with labels indicating the factors.  However, these labels are not used to label the
resulting ANOVA table.  Click Help on the ANOVA dialog for a picture of what the
input range must look like.

Requesting Many Analyses - If you had a variety of different statistical


procedures that you wanted to perform on your data, you would almost certainly find
yourself doing a lot of sorting, rearranging, copying and pasting of your data. This is
because each procedure requires that the data be arranged in a particular way, often
different from the way another procedure wants the data arranged. In our small test,
we had to sort the rows in order to do the t-test, and copy some cells in order to get
labels for the output. We had to clear the contents of some cells in order to get the
correct paired t-test, but did not want those cells cleared for some other test. And we
were only doing five tasks. It does not get better when you try to do more. There is
no single arrangement of the data that would allow you to do many different analyses
without making many different copies of the data. The need to manipulate the data in
many ways greatly increases the chance of introducing errors.

Using a statistical program, the data would normally be arranged with the rows
representing the subjects, and the columns representing variables (as they are in our
sample data). With this arrangement you can do any of the analyses discussed here,
and many others as well, without having to sort or rearrange your data in any way.
Only much more complex analyses, beyond the capabilities of Excel and the scope
of this article would require data rearrangement.

Working with Many Columns - What if your data had not 4, but
40 columns, with a mix of categorical and continuous measures? How
easily do the above procedures scale to a larger problem?

At best, some of the statistical procedures can accept multiple


contiguous columns for input, and interpret each column as a different
measure. The descriptives and correlations procedures are of this type,
so you can request descriptive statistics or correlations for a large
number of continuous variables, as long as they are entered in adjacent
columns. If they are not adjacent, you need to rearrange columns or use
copy and paste to make them adjacent.

Descriptive Statistics in MS – Excel

Excel - To open Excel in Windows go Start - Programs - Microsoft Office - Excel.

When it opens we will see a blank worksheet, which consists of alphabetically titled columns and numbered rows. Each cell is referenced by its coordinates of columns and rows; for example, A1 is the cell located in column A and row 1, and B7 is the cell in column B and row 7. We can reference a range of cells, for example C1:C5 are the cells in column C and rows 1 to 5. We can also reference a matrix, A10:C15, which are the cells in columns A, B and C and rows 10 to 15.

Older versions of Excel (up to 2003) have 256 columns and 65,536 rows per sheet.
 

 
There are some shortcuts to move within the current sheet:

• Home - moves to the first column in the current row
• End - Right Arrow - moves to the last filled cell in the current row
• End - Down Arrow - moves to the last filled cell in the current column
• Ctrl-Home - moves to cell A1
• Ctrl-End - moves to the last cell in your document (not the last cell of the current sheet)
• Ctrl-Shift-End - selects everything between the active cell and the last cell in the document
 

 
Entering data - We can type anything in a cell; in general you can enter text (or labels), numbers, formulas (starting with the = sign), and logical values (such as TRUE or FALSE).

Click on a cell and start typing; once we finish typing, press Enter (to move to the next cell below) or Tab (to move to the next cell to the right).

We can write long sentences in one single cell but we may see them only partially, depending on the column width of the cell (and whether the adjacent column is full). To adjust the width of a column go to Format - Column - Width or select AutoFit Selection.

Numbers are assumed to be positive; if we need to enter a negative value, use the minus sign (-) or enclose the number in parentheses (number).

If we need to enter percentages, a dollar sign, or any other symbol to identify the number, just add the % or $. We can also enter the number and change its format using the menu: Format - Cell and select the Number tab, which has all the different formats.

Dates are automatically stored as mm/dd/yyyy (or the default format if changed), but there is some flexibility here. Enter the month and day and Excel will enter the date in the default format. If we press Ctrl and ; (Ctrl-;), Excel will enter the current date.

Time is also entered in a default format. Enter 5 pm, and Excel will write 5:00 PM. To enter the current time press Ctrl and : (Ctrl-:).
 
To practice enter the following table (these data are made-up, not real)
 
 
 
 
Each column has a list of items. Column A has IDs, column B has last names of students, and so on.

Let's say, for example, we do not want capital letters in the columns Last Name and First Name: we do not want SMITH, we want Smith. There are two options: we can re-type all the names, or we can use the following formula (IMPORTANT: all formulas start with the equal (=) sign):

=PROPER(cell with the text you want to change)
 
The full table should look like this. This is a made up table, it is just a
collection of random info and data.
 

Exploring data in Excel

Descriptive statistics (using Excel's Data Analysis tool)

Generally one of the first things to do with new data is to get to know it by asking some general questions, such as (but not limited to) the following:

• What variables are included? What information are we getting?
• What is the format of the variables: string, numeric, etc.?
• What type of variables: categorical, continuous, discrete?
• Is this sample or population data?

After looking at the data you may want to know:

• How many males/females?
• What is the average age?
• How many undergraduate/graduate students?
• What is the average SAT score? Is it the same for graduates and undergraduates?
• Who reads the newspaper more frequently: men or women?

 
We can start answering some of these questions by looking directly at the table; for some other questions we may have to do some calculations by obtaining a set of descriptive statistics. These statistics are a collection of measurements of two things: location and variability. Location tells you the central value (the mean is the most common measure of this) of your variables. Variability refers to the spread of the data from the center value (i.e. variance, standard deviation). Statistics is basically the study of what causes such variability.

Location: mean, mode, median
Variability: variance, standard deviation, range
 
 
 
Let's check this window:

Input Range: This is to select the data you want to analyze.

Once you click in the input range you need to select the cells you want to analyze.

Back to the window:

Since we included the labels in the first row, make sure to check that option. For the output option, which is the place where Excel will enter the results, select O1, or you can select a new worksheet or even a new workbook.

Check Summary statistics and then press OK. We will get the following:
 

 
While the whole descriptive statistics cells are selected go to Format Cells
to change all numbers to have one decimal point. When we get the format
cells window, select the following:
 

 
Click OK. All numbers should now have one decimal as follows:
 

 
Now we know something about our data.
 
The average student in this sample is 25.2 years, has a SAT score of
1848.9, got a grade of 80.4, is 66.4 inches tall and reads the newspaper
4.9 times a week. We know this by looking at the mean value on each
variable.
 
The mean is the sum of the observations divided by the total number of observations. It is the most common indicator of the central tendency of a variable. If we look at the last two rows, Sum and Count, we can estimate the mean by dividing Sum by Count (sum/count). We can also calculate the mean using the function below (IMPORTANT: all functions start with the equal (=) sign):

=AVERAGE(range of cells with the values of interest)

For age: =AVERAGE(J2:J31)
 
Sum refers to the sum of all the values in a range of values. For age it means the sum of the ages of all students. The Excel function for sum is:

=SUM(range of cells with the values of interest)

Count refers to the count of cells that contain values (numbers). The function is:

=COUNT(range of cells with the values of interest)

Min is the lowest value in an array of values. The function is:

=MIN(range of cells with the values of interest)

Max is the largest value in an array of values. The function is:

=MAX(range of cells with the values of interest)
 
 

The pivot wizard will walk you through the process; this is the first window.

Press Next. In step 2 select the range covering all the values, as in the following picture:

In step 3 select New worksheet and press Layout.

This is where you make the pivot table:

On the right side of the wizard layout we can see the list of all variables in the data. Click and drag Gender into the ROW area. Click and drag Major into the COLUMN area, and click and drag Sat score into the DATA area. The wizard layout should look like this:

In the DATA area double-click on Sum of Sat score; a new window will pop up; select Average and click OK.

The wizard layout should look like this. Click OK, and in the wizard window step 3 click Finish.
 

In a new worksheet you will see the following (the pivot table window was moved to save some space).

This is a crosstabulation between gender and major. Each cell represents the average SAT score for a student according to gender and major. For example, a female student with an econ major has an average SAT score of 1952 (cell B5 in the picture), while a male student also with an econ major has 1743 (B6). Overall, econ major students have an average SAT score of 1806 (B7). In general, female students in this sample have an average SAT score of 1871.8 (E5), while male students have 1826 (E6).
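The same conditional means can be reproduced without a pivot table using the AVERAGEIFS function (available in Excel 2007 and later), which averages one range subject to criteria on other ranges. A minimal sketch, assuming the SAT scores sit in a hypothetical range L2:L31, gender in E2:E31 and major in F2:F31:

=AVERAGEIFS(L2:L31,E2:E31,"Female",F2:F31,"Econ") – returns the average SAT score of female Econ majors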
 

One-way ANOVA using Excel


 
Let's say we want to explore whether there is a relationship between the average score (grade) of each student and his/her major. In the sample we have three majors: Econ, Math and Politics. The grades are the final grades for the entire academic year.
To do this we use one-way ANOVA, which stands for analysis of variance. ANOVA is a broad class of techniques for identifying and measuring the various sources of variation within a collection of data. It is closely related to regression analysis, but with the following difference: we can think of the analysis of variance technique as testing hypotheses about the presence of relationships between predictor and criterion variables, regression analysis as describing the nature of those relationships, and r² as measuring the strength of the relationships.

SAMPLING AND STATISTICAL INFERENCE
SAMPLING
Sampling is a process used in statistical analysis in which a predetermined
number of observations are taken from a larger population. The methodology
used to sample from a larger population depends on the type of analysis being
performed but may include simple random sampling or systematic sampling.

In business, a CPA performing an audit uses sampling to determine the accuracy of account balances in the financial statements.

METHODS OF SAMPLING
It would normally be impractical to study a whole population, for example when doing a
questionnaire survey. Sampling is a method that allows researchers to infer information
about a population based on results from a subset of the population, without having to
investigate every individual. Reducing the number of individuals in a study reduces the
cost and workload, and may make it easier to obtain high quality information, but this
has to be balanced against having a large enough sample size with enough power to
detect a true association. (Calculation of sample size is addressed in section 1B
(statistics) of the Part A syllabus.)
If a sample is to be used, by whatever method it is chosen, it is important that the
individuals selected are representative of the whole population. This may involve
specifically targeting hard to reach groups. For example, if the electoral roll for a town
was used to identify participants, some people, such as the homeless, would not be
registered and therefore excluded from the study by default.

There are several different sampling techniques available, and they can be subdivided
into two groups: probability sampling and non-probability sampling. In probability
(random) sampling, you start with a complete sampling frame of all eligible individuals
from which you select your sample. In this way, all eligible individuals have a chance of
being chosen for the sample, and you will be more able to generalise the results from
your study. Probability sampling methods tend to be more time-consuming and
expensive than non-probability sampling. In non-probability (non-random) sampling, you
do not start with a complete sampling frame, so some individuals have no chance of
being selected. Consequently, you cannot estimate the effect of sampling error and there
is a significant risk of ending up with a non-representative sample which produces non-
generalisable results. However, non-probability sampling methods tend to be cheaper
and more convenient, and they are useful for exploratory research and hypothesis
generation.
 

Probability Sampling Methods


1. Simple random sampling

In this case each individual is chosen entirely by chance and each member of the
population has an equal chance, or probability, of being selected. One way of obtaining a
random sample is to give each individual in a population a number, and then use a table
of random numbers to decide which individuals to include.1 For example, if you have a
sampling frame of 1000 individuals, labelled 0 to 999, use groups of three digits from
the random number table to pick your sample. So, if the first three numbers from the
random number table were 094, select the individual labelled “94”, and so on.
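A spreadsheet can play the role of the random number table. A minimal sketch, assuming the sampling frame is listed in a hypothetical column A: place =RAND() in the adjacent column for every member, copy the results and paste them back as values, sort the frame by that column, and take the first n rows as the sample.

=RAND() – returns a uniform random number between 0 and 1
=RANDBETWEEN(0,999) – returns a random integer label between 0 and 999 (duplicates are possible, so repeated labels must be discarded)

Both functions are volatile, meaning the numbers change every time the sheet recalculates, which is why they are pasted as values before sorting.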

As with all probability sampling methods, simple random sampling allows the sampling
error to be calculated and reduces selection bias. A specific advantage is that it is the
most straightforward method of probability sampling. A disadvantage of simple random
sampling is that you may not select enough individuals with your characteristic of
interest, especially if that characteristic is uncommon. It may also be difficult to define a
complete sampling frame and inconvenient to contact them, especially if different forms
of contact are required (email, phone, post) and your sample units are scattered over a
wide geographical area.
 

2. Systematic sampling

Individuals are selected at regular intervals from the sampling frame. The intervals are
chosen to ensure an adequate sample size. If you need a sample size n from a
population of size x, you should select every x/nth individual for the sample.  For
example, if you wanted a sample size of 100 from a population of 1000, select every
1000/100 = 10th member of the sampling frame.
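The same idea can be sketched in a spreadsheet. Assuming, hypothetically, that the sampling frame of 1,000 members is listed in A2:A1001 and the interval is 10, enter the formula below in a new column and copy it down 100 rows; it pulls out the 10th, 20th, 30th, ... member in turn (a random starting point between 1 and 10 can be added for a more rigorous design):

=INDEX($A$2:$A$1001,10*ROW(A1)) – returns every 10th member of the frame as the formula is copied down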
Systematic sampling is often more convenient than simple random sampling, and it is
easy to administer. However, it may also lead to bias, for example if there are
underlying patterns in the order of the individuals in the sampling frame, such that the
sampling technique coincides with the periodicity of the underlying pattern. As a
hypothetical example, if a group of students were being sampled to gain their opinions
on college facilities, but the Student Record Department’s central list of all students was
arranged such that the sex of students alternated between male and female, choosing an even interval (e.g. every 20th student) would result in a sample of all males or all females. Whilst in this example the bias is obvious and should be easily corrected, this
may not always be the case.
 

3. Stratified sampling

In this method, the population is first divided into subgroups (or strata) who all share a
similar characteristic. It is used when we might reasonably expect the measurement of
interest to vary between the different subgroups, and we want to ensure representation
from all the subgroups. For example, in a study of stroke outcomes, we may stratify the
population by sex, to ensure equal representation of men and women. The study sample
is then obtained by taking equal sample sizes from each stratum. In stratified sampling,
it may also be appropriate to choose non-equal sample sizes from each stratum. For
example, in a study of the health outcomes of nursing staff in a county, if there are
three hospitals each with different numbers of nursing staff (hospital A has 500 nurses,
hospital B has 1000 and hospital C has 2000), then it would be appropriate to choose
the sample numbers from each hospital proportionally (e.g. 10 from hospital A, 20 from
hospital B and 40 from hospital C). This ensures a more realistic and accurate estimation
of the health outcomes of nurses across the county, whereas simple random sampling
would over-represent nurses from hospitals A and B. The fact that the sample was
stratified should be taken into account at the analysis stage.
Stratified sampling improves the accuracy and representativeness of the results by
reducing sampling bias. However, it requires knowledge of the appropriate
characteristics of the sampling frame (the details of which are not always available), and
it can be difficult to decide which characteristic(s) to stratify by.
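The proportional allocation used in the example is simply the overall sample size multiplied by each stratum's share of the population. A minimal sketch, assuming the stratum sizes are in a hypothetical range B2:B4 and the total sample size in cell E1:

=ROUND($E$1*B2/SUM($B$2:$B$4),0) – returns the number to sample from the stratum in row 2; copy it down for the other strata

With E1 = 70 and stratum sizes of 500, 1000 and 2000, this reproduces the 10, 20 and 40 used for the three hospitals above.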
 

4. Clustered sampling

In a clustered sample, subgroups of the population are used as the sampling unit, rather
than individuals. The population is divided into subgroups, known as clusters, which are
randomly selected to be included in the study. Clusters are usually already defined, for
example individual GP practices or towns could be identified as clusters. In single-stage
cluster sampling, all members of the chosen clusters are then included in the study. In
two-stage cluster sampling, a selection of individuals from each cluster is then randomly
selected for inclusion. Clustering should be taken into account in the analysis. The
General Household survey, which is undertaken annually in England, is a good example
of a (one-stage) cluster sample. All members of the selected households (clusters) are
included in the survey.1

Cluster sampling can be more efficient than simple random sampling, especially where a study takes place over a wide geographical region. For instance, it is easier to contact
lots of individuals in a few GP practices than a few individuals in many different GP
practices. Disadvantages include an increased risk of bias, if the chosen clusters are not
representative of the population, resulting in an increased sampling error.
 

Non-Probability Sampling Methods

1. Convenience sampling

Convenience sampling is perhaps the easiest method of sampling, because participants


are selected based on availability and willingness to take part. Useful results can be
obtained, but the results are prone to significant bias, because those who volunteer to
take part may be different from those who choose not to (volunteer bias), and the
sample may not be representative of other characteristics, such as age or sex. Note:
volunteer bias is a risk of all non-probability sampling methods.
 

2. Quota sampling

This method of sampling is often used by market researchers. Interviewers are given a
quota of subjects of a specified type to attempt to recruit. For example, an interviewer
might be told to go out and select 20 adult men, 20 adult women, 10 teenage girls and
10 teenage boys so that they could interview them about their television viewing. Ideally
the quotas chosen would proportionally represent the characteristics of the underlying
population.

Whilst this has the advantage of being relatively straightforward and potentially
representative, the chosen sample may not be representative of other characteristics
that weren’t considered (a consequence of the non-random nature of sampling). 2
 

3. Judgement (or Purposive) Sampling

Also known as selective, or subjective, sampling, this technique relies on the judgement
of the researcher when choosing who to ask to participate. Researchers may thus implicitly choose a “representative” sample to suit their needs, or specifically approach
individuals with certain characteristics. This approach is often used by the media when
canvassing the public for opinions and in qualitative research.

Judgement sampling has the advantage of being time-and cost-effective to perform


whilst resulting in a range of responses (particularly useful in qualitative research).
However, in addition to volunteer bias, it is also prone to errors of judgement by the
researcher and the findings, whilst being potentially broad, will not necessarily be
representative.
 

4. Snowball sampling

This method is commonly used in social sciences when investigating hard-to-reach


groups. Existing subjects are asked to nominate further subjects known to them, so the
sample increases in size like a rolling snowball. For example, when carrying out a survey
of risk behaviours amongst intravenous drug users, participants may be asked to
nominate other users to be interviewed.

Snowball sampling can be effective when a sampling frame is difficult to identify.


However, by selecting friends and acquaintances of subjects already investigated, there
is a significant risk of selection bias (choosing a large number of people with similar
characteristics or views to the initial individual identified).
 

Bias in sampling

There are five important potential sources of bias that should be considered when
selecting a sample, irrespective of the method used. Sampling bias may be introduced
when:1

1. Any pre-agreed sampling rules are deviated from
2. People in hard-to-reach groups are omitted
3. Selected individuals are replaced with others, for example if they are difficult to contact
4. There are low response rates
5. An out-of-date list is used as the sample frame (for example, if it excludes people who have recently moved to an area)

Advantages of sampling
Sampling ensures convenience, collection of intensive and
exhaustive data, suitability in limited resources and better rapport.
In addition to this, sampling has the following advantages also.

1. Low cost of sampling


If data were to be collected for the entire population, the cost will be
quite high. A sample is a small proportion of a population. So, the
cost will be lower if data is collected for a sample of population
which is a big advantage.

2. Less time consuming in sampling


Use of sampling takes less time also. It consumes less time than
census technique. Tabulation, analysis etc., take much less time in
the case of a sample than in the case of a population.
3. Scope of sampling is high
The investigator is concerned with the generalization of data. To
study a whole population in order to arrive at generalizations would
be impractical.

Some populations are so large that their characteristics could not be


measured. Before the measurement has been completed, the
population would have changed. But the process of sampling makes
it possible to arrive at generalizations by studying the variables
within a relatively small proportion of the population.

4. Accuracy of data is high


Having drawn a sample and computed the desired descriptive statistics, it is possible to determine the stability of the obtained sample value. A sample represents the population from which it is drawn. It permits a high degree of accuracy due to a limited area of operations. Moreover, careful execution of field work is possible. Ultimately, the results of sampling studies turn out to be sufficiently accurate.

5. Organization of convenience
Organizational problems involved in sampling are very few. Since
sample is of a small size, vast facilities are not required. Sampling is
therefore economical in respect of resources. Study of samples
involves less space and equipment.

6. Intensive and exhaustive data


In sample studies, measurements or observations are made of a
limited number. So, intensive and exhaustive data are collected.

7. Suitable in limited resources


The resources available within an organization may be limited.
Studying the entire universe is not viable. The population can be
satisfactorily covered through sampling. Where limited resources
exist, use of sampling is an appropriate strategy while conducting
marketing research.

8. Better rapport
An effective research study requires a good rapport between the
researcher and the respondents. When the population of the study
is large, the problem of rapport arises. But manageable samples
permit the researcher to establish adequate rapport with the
respondents.

Disadvantages of sampling
The reliability of the sample depends upon the appropriateness of
the sampling method used. The purpose of sampling theory is to
make sampling more efficient. But the real difficulties lie in
selection, estimation and administration of samples.

Disadvantages of sampling may be discussed under the following heads:

• Chances of bias
• Difficulties in selecting a truly representative sample
• Need for subject-specific knowledge
• Changeability of sampling units
• Impossibility of sampling
1. Chances of bias

The most serious limitation of the sampling method is that it may involve
biased selection and thereby lead to erroneous conclusions. Bias arises when
the method used to select the sample is faulty. Relatively small samples
properly selected may be much more reliable than large samples poorly
selected.

2. Difficulties in selecting a truly representative sample

A sampling study produces reliable and accurate results only when the sample
is truly representative of the whole group. Selecting a truly representative
sample is difficult when the phenomena under study are complex in nature, so
obtaining good samples is not easy.

3. Need for subject-specific knowledge

Use of the sampling method requires adequate subject-specific knowledge of
sampling techniques. Sampling involves statistical analysis and the
calculation of the probable error. When the researcher lacks specialized
knowledge of sampling, he may commit serious mistakes, and consequently the
results of the study will be misleading.
4. Changeability of units

When the units of the population are not homogeneous, the sampling technique
will be unscientific. Although the number of cases in a sample is small, it
is not always easy to stick to the selected cases, and the units of the
sample may be widely dispersed.

Some of the selected cases may not cooperate with the researcher and others
may be inaccessible. Because of these problems, all the cases may not be
covered, and the selected cases may have to be replaced by others. This
changeability of units stands in the way of reliable results.

5. Impossibility of sampling

Deriving a representative sample is difficult when the universe is too small
or too heterogeneous; in such cases a census study is the only alternative.
Moreover, in studies requiring a very high standard of accuracy, the sampling
method may be unsuitable, because there will be chances of error even if the
samples are drawn most carefully.

SAMPLING VS NON-SAMPLING ERROR

Sampling error is an error which occurs due to the unrepresentativeness of
the sample selected for observation. Conversely, non-sampling error is an
error arising from human error, such as errors in problem identification or
in the method or procedure used.

An ideal research design seeks to control the various types of error, but
there are some potential sources which may affect it. In sampling theory,
total error can be defined as the variation between the mean value of the
population parameter and the mean value observed in the research. Total error
can be classified into two categories: sampling error and non-sampling error.

The important differences between sampling and non-sampling error are
discussed in detail below.

Comparison Chart

Basis for comparison: Meaning
Sampling error: A type of error which occurs because the sample selected does
not perfectly represent the population of interest.
Non-sampling error: An error which occurs due to sources other than sampling
while conducting survey activities.

Basis for comparison: Cause
Sampling error: Deviation between the sample mean and the population mean.
Non-sampling error: Deficiency and inappropriate analysis of data.

Basis for comparison: Type
Sampling error: Random.
Non-sampling error: Random or non-random.

Basis for comparison: Occurs
Sampling error: Only when a sample is selected.
Non-sampling error: Both in a sample survey and in a census.

Basis for comparison: Sample size
Sampling error: The possibility of error reduces as the sample size
increases.
Non-sampling error: Has nothing to do with the sample size.

Definition of Sampling Error

Sampling error denotes a statistical error that arises when the sample
selected is unrepresentative of the population of interest. In simple terms,
it is an error which occurs when the sample selected does not reflect the
true characteristics, qualities or figures of the whole population.

The main reason behind sampling error is that the sampler draws various
sampling units from the same population, but the units may have individual
variances. Sampling errors can also arise out of a defective sample design,
faulty demarcation of units, a wrong choice of statistic, or substitution of
sampling units by the enumerator for his or her own convenience. Therefore,
sampling error is regarded as the deviation between the mean value obtained
from the sample and the true mean value of the population.
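
As a rough, purely illustrative sketch (not taken from this report's data), the following standard-library Python code builds a made-up population with a known mean, draws one simple random sample, and reports the sampling error as the difference between the sample mean and the population mean. The population values, sample size and random seed are all assumptions chosen only for demonstration.

```python
# Illustrative sketch only: sampling error as the deviation between a sample
# mean and the population mean, using made-up data.
import random
import statistics

random.seed(7)

# Hypothetical population of 10,000 values with a known mean.
population = [random.gauss(50, 12) for _ in range(10_000)]
population_mean = statistics.mean(population)

# Draw one simple random sample and compute its mean.
sample = random.sample(population, 100)
sample_mean = statistics.mean(sample)

sampling_error = sample_mean - population_mean
print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
print(f"Sampling error:  {sampling_error:+.2f}")
# Different random samples give different sampling errors; the error is a
# property of the particular sample drawn, not a mistake in procedure.
```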
Definition of Non-Sampling Error

Non-sampling error is an umbrella term which comprises all errors other than
the sampling error. Such errors arise for a number of reasons, e.g. errors in
problem definition, questionnaire design, approach, coverage, information
provided by respondents, data preparation, collection, tabulation and
analysis.

There are two types of non-sampling error:

 Response Error: arises when respondents give inaccurate answers, or when
  their answers are misinterpreted or recorded wrongly. It consists of
  researcher error, respondent error and interviewer error, which are further
  classified as under.

  o Researcher Error
     Surrogate Error
     Sampling Error
     Measurement Error
     Data Analysis Error
     Population Definition Error
  o Respondent Error
     Inability Error
     Unwillingness Error
  o Interviewer Error
     Questioning Error
     Recording Error
     Respondent Selection Error
     Cheating Error

 Non-Response Error: arises when some respondents who are part of the
  sample do not respond.

Key Differences Between Sampling and Non-Sampling Error

The significant differences between sampling and non-sampling error are set
out in the following points:

1. Sampling error is a statistical error that occurs because the sample
selected does not perfectly represent the population of interest, whereas
non-sampling error occurs due to sources other than sampling while
conducting survey activities.
2. Sampling error arises because of the variation between the mean value of
the sample and the true mean value of the population. On the other hand,
non-sampling error arises because of deficiency and inappropriate analysis
of data.
3. Non-sampling error can be random or non-random, whereas sampling error
occurs only in a random sample.
4. Sampling error arises only when a sample is taken as representative of a
population, as opposed to non-sampling error, which arises both in sampling
and in complete enumeration.
5. Sampling error is closely associated with the sample size: as the sample
size increases, the possibility of error decreases (the sketch below
illustrates this). On the contrary, non-sampling error is not related to the
sample size, so it will not be reduced by increasing the sample size.
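
As a rough illustration of the fifth point, the hedged sketch below (again with made-up numbers and only standard-library Python) compares the average absolute sampling error at several sample sizes against a non-sampling error modelled, purely as an assumption, as a fixed measurement bias of +5 units; the former shrinks as the sample grows while the latter does not.

```python
# Illustrative sketch only: average absolute sampling error shrinks as the
# sample size grows, while a constant non-sampling error (a systematic
# over-recording of every value by 5 units) is unaffected.
import random
import statistics

random.seed(1)
population = [random.gauss(50, 12) for _ in range(10_000)]
population_mean = statistics.mean(population)

def avg_abs_sampling_error(n, repeats=200):
    """Average |sample mean - population mean| over many samples of size n."""
    errors = []
    for _ in range(repeats):
        sample = random.sample(population, n)
        errors.append(abs(statistics.mean(sample) - population_mean))
    return statistics.mean(errors)

for n in (25, 100, 400, 1600):
    sampling = avg_abs_sampling_error(n)
    non_sampling = 5.0   # assumed constant measurement bias, independent of n
    print(f"n = {n:5d}  avg sampling error = {sampling:5.2f}  "
          f"measurement bias = {non_sampling:4.1f}")
```

Under these assumptions the sampling error falls roughly in proportion to one over the square root of the sample size, so quadrupling the sample about halves it, while the fixed bias stays untouched.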
Conclusion

To end this discussion, it is fair to say that sampling error is closely
related to the sampling design and can be reduced by expanding the sample
size. Conversely, non-sampling error is a basket that covers all errors other
than the sampling error, and it is unavoidable by nature because it is not
possible to remove it completely.
