
MATH4E

ENGINEERING DATA ANALYSIS


LECTURE 1
A STEP-BY-STEP GUIDE TO DATA
COLLECTION
• Data collection is a systematic process of gathering observations or measurements.
Whether you are performing research for business, governmental or academic
purposes, data collection allows you to gain first-hand knowledge and original
insights into your research problem.
• While methods and aims may differ between fields, the overall process of data
collection remains largely the same. Before you begin collecting data, you need to
consider:
• The aim of the research
• The type of data that you will collect
• The methods and procedures you will use to collect, store, and process the data
• To collect high-quality data that is relevant to your purposes, follow these four steps.
STEP 1: DEFINE THE AIM OF YOUR
RESEARCH
• Before you start the process of data collection, you need to identify exactly
what you want to achieve. You can start by writing a problem statement:
what is the practical or scientific issue that you want to address and why does
it matter?
• Next, formulate one or more research questions that precisely define what
you want to find out. Depending on your research questions, you might need
to collect quantitative or qualitative data:
• Quantitative data is expressed in numbers and graphs and is analyzed through
statistical methods.
• Qualitative data is expressed in words and analyzed through interpretations
and categorizations.
• If your aim is to test a hypothesis, measure something precisely, or gain large-scale
statistical insights, collect quantitative data.
• If your aim is to explore ideas, understand experiences, or gain detailed insights into
a specific context, collect qualitative data.
• If you have several aims, you can use a mixed methods approach that collects both
types of data.
• Example of quantitative and qualitative research aims: you are researching employee
perceptions of their direct managers in a large organization.
• Your first aim is to assess whether there are significant differences in perceptions of
managers across different departments and office locations.
• Your second aim is to gather meaningful feedback from employees to explore new
ideas for how managers can improve.
• You decide to use a mixed-methods approach to collect both quantitative and
qualitative data.
STEP 2: CHOOSE YOUR DATA COLLECTION
METHOD
• Based on the data you want to collect, decide which method is best suited for
your research.
• Experimental research is primarily a quantitative method.
• Interviews/focus groups and ethnography are qualitative methods.
• Surveys, observations, archival research and secondary data collection can be
quantitative or qualitative methods.
• Carefully consider what method you will use to gather data that helps you
directly answer your research questions.
Data collection methods
• Experiment. When to use: to test a causal relationship. How to collect data: manipulate variables and measure their effects on others.
• Survey. When to use: to understand the general characteristics or opinions of a group of people. How to collect data: distribute a list of questions to a sample online, in person or over the phone.
• Interview/focus group. When to use: to gain an in-depth understanding of perceptions or opinions on a topic. How to collect data: verbally ask participants open-ended questions in individual interviews or focus group discussions.
• Observation. When to use: to understand something in its natural setting. How to collect data: measure or survey a sample without trying to affect them.
• Ethnography. When to use: to study the culture of a community or organization first-hand. How to collect data: join and participate in a community and record your observations and reflections.
• Archival research. When to use: to understand current or historical events, conditions or practices. How to collect data: access manuscripts, documents or records from libraries, depositories or the internet.
• Secondary data collection. When to use: to analyze data from populations that you can't access first-hand. How to collect data: find existing datasets that have already been collected, from sources such as government agencies or research organizations.
STEP 3: PLAN YOUR DATA COLLECTION
PROCEDURES
• When you know which method(s) you are using, you need to plan
exactly how you will implement them.
• What procedures will you follow to make accurate observations or
measurements of the variables you are interested in?
• For instance, if you’re conducting surveys or interviews, decide what
form the questions will take; if you’re conducting an experiment,
make decisions about your experimental design.
OPERATIONALIZATION
• Operationalization means turning abstract conceptual ideas into measurable observations. When
planning how you will collect data, you need to translate the conceptual definition of what you
want to study into the operational definition of what you will actually measure.
• Example of operationalization: You have decided to use surveys to collect quantitative data. The
concept you want to measure is the leadership of managers. You operationalize this concept in
two ways:
• You ask managers to rate their own leadership skills on 5-point scales assessing the ability to
delegate, decisiveness and dependability.
• You ask their direct employees to provide anonymous feedback on the managers regarding the
same topics.
• Using multiple ratings of a single concept can help you cross-check your data and assess the test
validity of your measures.
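As a rough illustration, here is a minimal Python sketch (hypothetical column names and made-up ratings, not part of the original example) of turning the operationalized 5-point items into one leadership score per manager and cross-checking self-ratings against employee ratings:

```python
# A minimal sketch (hypothetical column names, made-up ratings) of turning the
# operationalized 5-point items into one leadership score per manager and
# cross-checking self-ratings against employee ratings.
import pandas as pd

ratings = pd.DataFrame({
    "manager_id":    [1, 1, 2, 2],
    "rater":         ["self", "employee", "self", "employee"],
    "delegation":    [4, 3, 5, 4],
    "decisiveness":  [5, 4, 4, 4],
    "dependability": [4, 4, 5, 3],
})

# Composite leadership score = mean of the three operationalized items.
items = ["delegation", "decisiveness", "dependability"]
ratings["leadership_score"] = ratings[items].mean(axis=1)

# Compare self-ratings with employee ratings, manager by manager.
print(ratings.pivot_table(index="manager_id", columns="rater",
                          values="leadership_score"))
```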
SAMPLING
• You may need to develop a sampling plan to obtain data systematically.
• This involves defining a population, the group you want to draw conclusions
about, and a sample, the group you will actually collect data from.
• Your sampling method will determine how you recruit participants or obtain
measurements for your study.
• To decide on a sampling method you will need to consider factors like the
required sample size, accessibility of the sample, and timeframe of the data
collection.
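For instance, a minimal sketch of a simple random sampling plan in Python; the file name, its columns and the sample size here are assumptions for illustration only:

```python
# A minimal sketch of a simple random sampling plan; the file name, its columns
# and the sample size are assumptions for illustration only.
import pandas as pd

population = pd.read_csv("employee_frame.csv")  # sampling frame: the population of interest
sample_size = 300                               # required sample size from the sampling plan

# Draw a simple random sample without replacement; random_state makes it reproducible.
sample = population.sample(n=sample_size, random_state=42)
sample.to_csv("survey_sample.csv", index=False)
```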
STANDARDIZING PROCEDURES
• If multiple researchers are involved, write a detailed manual to standardize
data collection procedures in your study.
• This means laying out specific step-by-step instructions so that everyone in
your research team collects data in a consistent way – for example, by
conducting experiments under the same conditions and using objective
criteria to record and categorize observations.
• This helps ensure the reliability of your data, and you can also use it to
replicate the study in the future.
CREATING A DATA MANAGEMENT PLAN
• Before beginning data collection, you should also decide how you will
organize and store your data.
• If you are collecting data from people, you will likely need to anonymize and
safeguard the data to prevent leaks of sensitive information (e.g. names or
identity numbers).
• If you are collecting data via interviews or pencil-and-paper formats, you will
need to perform transcriptions or data entry in systematic ways to minimize
distortion.
• You can prevent loss of data by having an organization system that is routinely
backed up.
STEP 4: COLLECT THE DATA
• Finally, you can implement your chosen methods to measure or observe the
variables you are interested in.
• Example of collecting qualitative and quantitative data: to collect data
about perceptions of managers, you administer a survey with closed- and
open-ended questions to a sample of 300 company employees across
different departments and locations.
• The closed-ended questions ask participants to rate their manager’s
leadership skills on scales from 1–5. The data produced is numerical and can
be statistically analyzed for averages and patterns.
• The open-ended questions ask participants for examples of what the manager
is doing well now and what they can do better in the future. The data
produced is qualitative and can be categorized through content analysis for
further insights.
• To ensure that high quality data is recorded in a systematic way, here are
some best practices:
• Record all relevant information as and when you obtain data. For example,
note down whether or how lab equipment is recalibrated during an
experimental study.
• Double-check manual data entry for errors.
• If you collect quantitative data, you can assess the reliability and validity to
get an indication of your data quality.
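One common reliability indicator for multi-item rating scales is Cronbach's alpha. Below is a minimal Python sketch; the item names and responses are invented for illustration, and alpha is only one of several possible reliability checks, not a prescribed procedure:

```python
# A minimal sketch of one possible reliability check, Cronbach's alpha, for the
# 1-5 scale items; item names and responses are invented for illustration.
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """items: one column per scale item, one row per respondent."""
    k = items.shape[1]
    item_variances = items.var(ddof=1)               # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

responses = pd.DataFrame({
    "delegation":    [4, 3, 5, 2, 4],
    "decisiveness":  [5, 3, 4, 2, 4],
    "dependability": [4, 4, 5, 1, 3],
})
print(round(cronbach_alpha(responses), 2))
```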
BIAS IN RESEARCH AND ITS THREE TYPES

• Bias in research is anything that contaminates or compromises the research. The three types of bias in research, according to Silva (2016), are:

a) Researcher bias – researchers can be biased in favor of a particular result or finding. They can hold a particular view and want it to be confirmed, and they can influence the findings of the research through the design of the study.

b) Sampling bias – occurs when the sampling procedure used in the research is flawed or compromised in some way.

c) Respondent bias:
i. Response set – respondents answer questions in a patterned way.
ii. Acquiescence bias – happens when a respondent agrees with everything the researcher says.
iii. Social desirability bias – occurs when a respondent gives the socially desirable or politically correct response instead of an honest response.
iv. Prestige bias – occurs when the respondent's responses are influenced by their perception of the prestige of a group or individual.
Define data management
According to the University Library, research data management (RDM) refers to
the organization, storage and preservation of data created during a research project.
It covers initial planning, day-to-day processes and long-term archiving and 
sharing.

Research data can take many different forms, but is essentially the evidence used
to inform or support research conclusions.
Some examples of research data are:

 Video and voice recordings


 Questionnaires and interview transcripts
 Test results held in text files and spreadsheets
 Archive materials and handwritten notes
 Code and software
 Photographs and slides
 Laboratory notebooks
There are many potential benefits of good research data management for you, other
researchers and the wider community:

• Efficiency and ease of data control, with reduced risk of data loss
• Greater visibility of data, leading to increased citations and future collaborations
• Demonstration of research integrity and validation of research results
• Compliance with funder and institutional policies and expectations
• Greater impact of research through knowledge transfer
• Research advances through reuse of data by researchers around the world
As much as possible, a researcher should create copies or backups of their
digital data in case of unforeseen events (Whitemire, 2013). In the past, researchers
would create copies of their data on flash drives or external drives.

As technology has advanced, cloud services such as Google Drive
and Dropbox offer another method of backing up digital data. Uploading
data to a cloud service is a good way to back it up, since the data
can then be accessed anytime, anywhere, on any device. It is not ideal to keep only one
copy of a digital file on a PC or laptop, because several scenarios
can damage such files: corruption of the file, corruption of the disk drive
or, worst of all, corruption of the entire operating system.
One backup plan is better than none.
• Physical data, such as completed questionnaires, surveys and interview notes, should
also be managed carefully. These materials may be lost or damaged, especially
while being transported: for example, papers carried outdoors in a sudden
downpour may be damaged and rendered illegible. If possible, keep the data
well secured and treat it with great care. If transporting the data cannot be
avoided, it is best to take precautionary measures and note the risks that may
arise while moving it from one place to another.
Common software used when analyzing quantitative data,
and their strengths and weaknesses
On a website maintained by the New York University Library, six
different software packages are listed:

• SPSS
SPSS, also known as the Statistical Package for the Social Sciences, was first
developed by Bent, Hull and Nie in 1968; IBM acquired SPSS in July 2009.
The users of this software commonly come from the social and health sciences,
marketing and academia.
One of the highlights of the program is that it is easy to use and has an
intuitive user interface. It is also comparable to Microsoft Excel because of the
cells in its interface.
The application also computes the standard error of the mean (SEM), which
gives the researcher an idea of the accuracy of the mean (Hopkins, 2000). It also has
a feature that easily excludes data, and SPSS can handle missing data.
But, as perfect as it sounds, there are also limitations to this program.

Two limitations are: (1) There are no robust methods included. According to
Arshanapalli et al. (2014), robust statistics addresses problems by producing
estimates that are not sensitive to small departures from the basic assumptions of
the statistical models employed. (2) Another limitation when using SPSS for
analyzing your data is that it is unable to perform many-to-many merges.
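As a quick illustration of the standard error of the mean mentioned above, here is a minimal Python sketch of the usual formula, SEM = sample standard deviation / sqrt(n), on made-up scores:

```python
# A quick sketch of the standard error of the mean,
# SEM = sample standard deviation / sqrt(n), on made-up scores.
import math
import statistics

scores = [3, 4, 4, 5, 2, 4, 3, 5]
sem = statistics.stdev(scores) / math.sqrt(len(scores))
print(round(sem, 3))
```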
• JMP

JMP, or 'John's Macintosh Program', is a software package that was developed by SAS
(Statistical Analysis System) in the 1980s. John Sall created the software
to take advantage of the graphical user interface (GUI) introduced by the Mac.
Its users include professionals in engineering specializing in Six Sigma or
quality control, and researchers in the fields of biology and pharmaceutics.
• JMP

JMP is known for its interactive graphics and its scripting language, JSL. JSL can
also be used together with SAS, R and MATLAB, which gives JMP a degree of
universality: analyses can be moved to other software in case the program does
not run. Last among the highlights, this software has a great set of online
resources that researchers can use. The only limitation when using this program
is that, as with SPSS, the software cannot perform all robust methods.
• Stata

Stata, the third software package that can be used for analyzing quantitative data, was first
released in January 1985. It was introduced by Bill Gould and Sean Becketti, who coined
the name from the words statistics and data. It was originally designed as a regression
and data management package with exactly 44 commands. It was later developed
further, and its GUI was released in 2003. This software is used in disciplines
related to institutional research, medicine and political science.
• Unlike the previous software packages, this software is mainly driven by syntax. But not
to worry, because there are still user-friendly menus. One of its features,
Mata, offers matrix programming. It also works well with survey and
time-series data. However, there are also some disadvantages to this software.
First, the program can only handle one dataset at a time. Additionally, the size
of the datasets that can be used is limited: the researcher may need to sacrifice
a number of variables, depending on the package being used. This may be
detrimental, since the reliability of the data would be at stake.
• SAS

SAS was first developed in 1966 and released six years later. The program name
stands for 'Statistical Analysis System'. It was developed by Anthony Barr and James
Goodnight with financial assistance from the National Institutes of Health. The goal of the
project was to create software that could use agricultural data to improve crop
production. In 2012, this program was the largest market-share holder in
advanced analytics, comprising 36.2 percent of the market. The users of this software
come from government and from fields related to finance, manufacturing and the
health and life sciences. The good side of this software is that it can handle
extremely large datasets, but its graphics are more difficult to manipulate than
those of other software. It is also not user friendly for researchers who are
new to SAS.
• R
This software was created by two professionals from the University of Auckland, Ross
Ihaka and Robert Gentleman; hence the name 'R'. According to the New York
University Library (n.d.), R was implemented from the programming language 'S', which
was developed at Bell Labs. Highlights of this software are that it is free and open
source, that there are over 6,000 user-contributed packages available, and that it
can interact with other programming software. Limitations found when using
R are as follows:

The program relies on its large online community when questions are asked,
mainly because there is no formal technical support for this program.
Another is that, since R can interact with different programming software, the person
using it should already have a good understanding of the different types of data used
in order to use R with ease.
Finally, it can be very hard to examine packages that are written by other users.
• MATLAB

MATLAB, according to Wright and Sandberg (n.d.), is a program that is widely
used in the field of applied mathematics. It is also used in research and education
at different universities and is found in industry. The program name stands
for Matrix Laboratory. They also note that this software was built around vectors
and matrices. It is used to solve equations related to algebra and calculus, and it
is rich in graphics tools that can produce two-dimensional and three-dimensional
illustrations.
Stages of Data Analysis and give a description for each stage.
 

In general, there are four stages a researcher has to undergo to conduct data analysis:
description, interpretation, conclusion and theorization. These stages follow a cyclical
process.

During the description stage, the researcher provides a descriptive analysis of all the data
gathered for their research. After giving a thorough description of all the data, the researcher
gives their take on the meaning of each piece of data; this is the interpretation stage. From the
interpretation stage, the researcher then draws some kind of conclusion. This conclusion may
be of major or minor importance to the data. Usually the conclusions drawn here are minor
ones, which add up and lead to the major conclusion found in the final chapter of the research
project. Lastly, there is the theorization stage. This is where the researcher compares the
findings of their data analysis to similar literature found in the literature review; these findings
might coincide with those of other researchers, or they may not. The purpose of this is to
contribute knowledge to the area of research.
For example, the data gathered clearly show that customers aren't happy after eating at
McDonald's (description). The researcher then interprets this as being due to the service the
crew is giving them (interpretation). The researcher then draws a conclusion, which states
that the crew should improve the service they are giving to the customers (conclusion).
The researcher should then check similar literature found in the literature review to see
whether or not the same conclusion has been drawn by other researchers about different
restaurants (theorization). The conclusion may be minor, as other conclusions may also be
drawn, for example about food quality or the cleanliness of the place. The process then goes
back to the first stage, and another conclusion is drawn, until a major conclusion is
formulated for the entire research.
What is Quantitative Data Analysis?

Quantitative data analysis uses statistics for two purposes, namely
descriptive statistics and inferential statistics. Descriptive statistics are used to
describe the data being gathered. According to Patel (2009), before starting any
kind of analysis, researchers perform a descriptive analysis.
What are the measures of central tendency?
• Median
The median is the middle value of a range of values; it can be used with ordinal, interval or
ratio measurements, and no assumptions need to be made. The median is also called a
resistant measure, in that it is not strongly affected by changes in the data: outliers have
less impact on the median than on the mean (GAO, 1992).

• Mean
The mean is commonly known as the arithmetic average. It can be calculated by adding up the
observation values and dividing by the number of observations. It is commonly used as the
measure of central tendency for interval variables. If the distribution is markedly asymmetric
or there are numerous outliers, the mean may not give reliable results. According to GAO
(1992), the mean is strongly influenced by the presence of a few extreme values, which may
give a distorted view of central tendency.
What are the measures of central tendency?

• Mode

The mode is the most frequently occurring value in a set of values. It is less used
at other levels of measurement and is most commonly employed with nominal
variables.
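A minimal Python sketch, on made-up values, showing all three measures and how a single outlier pulls the mean while barely moving the median or mode:

```python
# A minimal sketch, on made-up values, of the three measures of central tendency,
# showing that one outlier pulls the mean but barely moves the median or mode.
import statistics

values = [2, 3, 3, 4, 5]
with_outlier = values + [40]  # add one extreme value

for label, data in [("without outlier", values), ("with outlier", with_outlier)]:
    print(label,
          "mean =", round(statistics.mean(data), 2),
          "median =", statistics.median(data),
          "mode =", statistics.mode(data))
```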
Differentiate between nominal variables and ordinal variables

A variable is the element of measurement in quantitative analysis; it may take more
than one value. The cause and the effect are called the independent and dependent variables,
respectively. A variable through which the independent variable affects the dependent
variable is called an intervening variable.

Nominal variables, ordinal variables and interval variables are the different levels of variables
used in data analysis. Levels are necessary to determine what type of analysis can be
carried out with each variable present in the analysis.
• Nominal Variables

No assumptions can be made about relations among the values. The values are
distinct categories that cannot be ranked and serve only as labels. Nominal
variables are considered the lowest level of measurement.

• Ordinal Variables

In this level of measurement, values are ranked according to some criteria,
using parameters or scales. Ordinal variables are similar to nominal variables;
the only difference is that there is a clear ordering of the values.
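A minimal Python (pandas) sketch of how the two levels of measurement can be represented; the variables and values are invented for illustration:

```python
# A minimal sketch (invented variables and values) of how the two levels of
# measurement can be represented with pandas categoricals.
import pandas as pd

df = pd.DataFrame({
    "department":   ["Sales", "IT", "HR", "Sales"],    # nominal: labels only, no order
    "satisfaction": ["low", "high", "medium", "high"],  # ordinal: ranked levels
})

df["department"] = pd.Categorical(df["department"])     # unordered categories
df["satisfaction"] = pd.Categorical(df["satisfaction"],
                                    categories=["low", "medium", "high"],
                                    ordered=True)       # clear ordering of values

# Order comparisons only make sense for the ordinal variable.
print(df["satisfaction"] >= "medium")
```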
What are the four stages in data analysis?

On the Pluralsight website, the four stages in data analysis are briefly explained:

1. Descriptive analytics

Descriptive analytics (also known as observation and reporting) is the most basic level of analytics.
Many times, organizations find themselves spending most of their time at this level. Think about
dashboards and why they exist: to build reports and present what happened in the past. This is a
vital step in the world of analytics and decision making, but it's really only the first step. It's
important to get beyond the initial observations and dive into insights, which is the second level of
analytics.
2. Diagnostic analytics

• Diagnostic analytics is where we get to the why. We move beyond an observation (like whether
the chart is trending up or down) and get to the “what” that is making it happen. This is where the
ability to ask questions about the data and tie those questions back to objectives and business
imperatives is most important. 
3. Predictive analytics

Predictive analytics allows organizations to predict different decisions, test them for
success, find areas of weakness in the business, make more predictions—and so forth. 
This flow allows organizations to see how the first three levels can work together.

Predictive analytics involves technologies like machine learning, algorithms, and artificial
intelligence, which give it power because this is where data science comes in. When we
incorporate not just predicting, but data science, statistics and the third level of analytics
combined with the first two levels, organizations can truly see success with their data and
analytical strategies.
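A minimal Python sketch of the predictive level (the dataset and variables are invented): fit a simple model on past observations and use it to predict an outcome that has not happened yet.

```python
# A minimal sketch of the predictive level (invented data and variables):
# fit a simple model on past observations and predict a future outcome.
import numpy as np
from sklearn.linear_model import LinearRegression

# Past observations: monthly ad spend (in $1000s) and the resulting sales (units).
ad_spend = np.array([[10], [15], [20], [25], [30]])
sales = np.array([120, 150, 200, 230, 260])

model = LinearRegression().fit(ad_spend, sales)

# Predict sales for a planned ad spend of $35k.
print(model.predict(np.array([[35]])))
```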
4. Prescriptive analytics

Prescriptive analytics exists at a very advanced level; it is the most powerful and final
phase, and it truly encompasses the "why" of analytics. It is when the data itself prescribes
what should be done. Data-driven decision making is tied most closely to predictive and
prescriptive analytics, even though these are the most advanced levels.

Think of prescriptive analytics as taking all other levels of analytics to prescribe things you
should be doing; the data and analytics show you the way. Thomas Matthew, chief product
officer at Zoomph, describes it well:

"Prescriptive analytics builds on predictive by informing decision makers about different decision
choices with their anticipated impact on specific key performance indicators. Think of the traffic
navigation app Waze. Pick an origin and destination, and a multitude of factors get mashed
together, and it advises you on different route choices, each with a predicted ETA. This is everyday
prescriptive analytics at work."

Think of the first three levels of analytics: you have your description of what has happened,
followed by diagnosing why, and then you end with predicting what will happen. Now, imagine
you allow the data and analytics to inform you what action to take. That is powerful and why it
matters for businesses.  
All four levels create the puzzle of analytics: describe, diagnose, predict, prescribe.  When all four
work together, you can truly succeed with a data and analytical strategy. If the four aren’t working
well together or one part is completely missing, the organization’s data and analytical strategy isn’t
complete. 
These four levels of analytics need to permeate throughout an organization in order for data literacy
to be effective. Additionally, teams need to have better skills which allow them to tap into each
level as best they can. The ultimate hope is that those decisions tie back to the most important
business objectives and goals.
 
HOMEWORK: (Handwritten)

1. What are the differences between ANOVA and MANOVA?

2. What are the measures of central tendency?


Next Meeting Topic:

Understanding Population, Sampling Method and Data


Analysis

THANK YOU
AND STUDY WELL
FUTURE ENGINEERS
