
GREAT ZIMBABWE UNIVERSITY

SCHOOL OF NATURAL SCIENCES


DEPARTMENT OF MATHS AND COMPUTER SCIENCE

MODULE TITLE: INTRODUCTION TO RESEARCH METHODS AND


STATISTICS FOR COMPUTER SCIENCE
MODULE CODE: HCS211
DATE :
TIME : 3 HOURS

INSTRUCTIONS TO CANDIDATES

Answer a total of three (3) questions, one (1) question from SECTION A, one (1) question
from SECTION B and any other question.

Marks are indicated in brackets [ ] at the end of each question or part question.

Each question carries 100 marks.

ADDITIONAL MATERIALS
Answer papers/ booklet

Graph paper

Plain paper

Ruler

Sharp HB pencil

Calculator

Formulae List and Statistical Tables


SECTION A: RESEARCH METHODS

A1 Probability and non-probability sampling methods play a pivotal role in research.
Discuss any four (4) non-probability sampling methods, citing when each could be used in
your field of study. [100]

4 Non-probability Sampling Methods in Computer Science

1. Convenience sampling: This method involves selecting subjects who are easily
accessible to the researcher. For example, a researcher might survey students in their
computer science class or interview colleagues at their workplace. Convenience sampling
is often used in exploratory research or pilot studies, where the goal is to get a
preliminary understanding of a topic.

Example in computer science: A researcher might use convenience sampling to survey
students in their computer science class to get their feedback on a new programming
language.

2. Purposive sampling (or judgmental sampling): This method involves selecting subjects
who the researcher believes are representative of the population of interest. For example,
a researcher might interview experts in a particular field of computer science to get their
insights on a new technology. Purposive sampling is often used in qualitative research,
where the goal is to gain in-depth understanding of a topic.

Example in computer science: A researcher might use purposive sampling to interview
experts in the field of artificial intelligence to get their insights on the ethical implications
of machine learning.

3. Snowball sampling: This method involves selecting subjects and then asking them to
refer other potential subjects. For example, a researcher might interview a few software
developers and then ask them to refer other software developers in their network.
Snowball sampling is often used in hard-to-reach populations, such as people who use
niche technologies or who have rare medical conditions.

Example in computer science: A researcher might use snowball sampling to interview
open source software developers to learn about their motivations and challenges.

4. Quota sampling: This method involves selecting subjects until certain quotas are met.
For example, a researcher might want to have a sample that is 50% male and 50% female,
so they would select subjects until they have reached that quota. Quota sampling is often
used in surveys to ensure that the sample is representative of the population of interest in
terms of certain demographic characteristics.

Example in computer science: A researcher might use quota sampling to survey computer
science students to ensure that their sample is representative of the student body in terms
of gender, race, and ethnicity.
Advantages and Disadvantages of Non-probability Sampling Methods

Non-probability sampling methods are often less expensive and time-consuming than
probability sampling methods. However, they are also less rigorous and less likely to
produce representative samples. It is important to be aware of the limitations of non-
probability sampling methods when interpreting the results of research studies.

When to Use Non-probability Sampling Methods in Computer Science

Non-probability sampling methods can be used in a variety of research studies in
computer science. For example, they can be used in exploratory research, pilot studies,
qualitative research, and research on hard-to-reach populations. However, it is important
to carefully consider the limitations of non-probability sampling methods before using
them in a research study.

Here are some additional examples of when non-probability sampling methods might be
used in computer science:

To get feedback on a new prototype or software application
To learn about the experiences of users of a particular technology
To identify the needs of a specific user group
To explore new research directions
To generate hypotheses for future research studies
It is important to note that non-probability sampling methods should not be used in
research studies where the goal is to make generalizations about the population of
interest. In these cases, it is important to use probability sampling methods to ensure that
the sample is representative of the population.

A2 Reliability and validity issues play a critical role in research. Discuss in detail the
different ways you can validate your results. [100]

Reliability and validity are two important concepts in research. Reliability refers to the
consistency of research results, while validity refers to the accuracy of research results.
There are a number of ways to validate research results, including:

Replication: Replication is the process of repeating a research study to see if the same
results are obtained. Replication is one of the most important ways to validate research
results, as it helps to ensure that the results are reliable and not due to chance.
External validation: External validation involves comparing the results of a research study
to other relevant research studies. This can be done by conducting a literature review or
by conducting a meta-analysis. External validation helps to ensure that the results of a
research study are consistent with other research on the same topic.
Internal validation: Internal validation involves examining the design and methods of a
research study to ensure that they are sound. This includes checking for potential biases
and errors in the data collection and analysis. Internal validation helps to ensure that the
results of a research study are valid.
In addition to these general methods of validation, there are a number of specific methods
that can be used to validate research results depending on the type of research being
conducted. For example, in quantitative research, statistical tests can be used to assess the
significance of the results. In qualitative research, triangulation can be used to validate the
results by using multiple data sources and methods.

Here are some additional tips for validating research results:

Be transparent about your methods and data: This will allow other researchers to evaluate
your study and assess its validity.
Use multiple data sources and methods: This will help to reduce bias and increase the
validity of your results.
Pilot test your study: This will help to identify any potential problems with your design or
methods before you collect your data.
Seek feedback from other researchers: This can help you to identify any potential
problems with your study and to improve your design and methods.
It is important to note that there is no single perfect way to validate research results. The
best approach will depend on the type of research being conducted and the specific goals
of the study. However, by using the methods described above, researchers can increase
the likelihood that their results are reliable and valid.

Here are some specific examples of how the different methods of validation can be used
in computer science research:

Replication: A researcher might replicate a study that found a new algorithm for machine
learning outperforms existing algorithms. If the researcher is able to replicate the results,
this would provide evidence that the new algorithm is indeed more effective.
External validation: A researcher might compare the results of a study on the
effectiveness of a new educational software program to other studies on similar software
programs. If the researcher finds that the results of their study are consistent with other
studies, this would provide evidence that the new software program is effective.
Internal validation: A researcher might conduct a thorough review of the design and
methods of their study to ensure that they are sound. This would include checking for
potential biases and errors in the data collection and analysis. If the researcher finds that
their study is well-designed and executed, this would provide evidence that the results are
valid.
By using the different methods of validation, researchers in computer science can
increase the likelihood that their results are reliable and valid. This is important because it
helps to ensure that the research findings can be trusted and used to inform real-world
decisions.

A3 Explore any five ethical considerations that you need to adhere to when carrying out
research in your area of specialisation. [100]

Five ethical considerations that I need to adhere to when carrying out research in my
area of specialization (computer science) are:
Privacy: I need to protect the privacy of the people who participate in my research. This
means collecting and using data only in ways that have been agreed to by the
participants, and taking steps to ensure that the data is not compromised.

Informed consent: I need to obtain informed consent from all participants in my
research. This means providing them with clear and concise information about the
research, including its purpose, risks, and benefits. Participants should also be given the
opportunity to ask questions and to withdraw from the research at any time.

Confidentiality: I need to keep all participant data confidential. This means that I
should not share the data with anyone without the participant's consent, and that I
should take steps to protect the data from unauthorized access.

Conflicts of interest: I need to disclose any potential conflicts of interest. This means
disclosing any financial or other relationships that I have with any organizations or
individuals that could influence my research.

Responsible use of technology: I need to use technology responsibly in my research.
This means taking steps to minimize the potential for harm to participants, such as by
avoiding the use of invasive or harmful technologies.

In addition to these general ethical considerations, there are a number of specific ethical
considerations that are relevant to computer science research. For example, I need to be
careful about how I collect and use personal data, and I need to be aware of the
potential for bias in my research algorithms. I also need to be mindful of the potential
impact of my research on society, and I need to take steps to ensure that my research is
used for good.

Here are some specific examples of how the different ethical considerations can be
applied in computer science research:

Privacy: A researcher who is collecting data from social media users needs to make sure
that the users have agreed to their data being collected and used for research purposes.
The researcher also needs to take steps to protect the data from unauthorized access,
such as by encrypting the data and storing it on a secure server.

Informed consent: A researcher who is developing a new medical diagnostic software
program needs to obtain informed consent from all participants in their clinical trials.
The participants need to be provided with clear and concise information about the
software program, including its purpose, risks, and benefits. Participants should also be
given the opportunity to ask questions and to withdraw from the study at any time.

Confidentiality: A researcher who is studying the online behavior of children needs to
keep all participant data confidential. This means that the researcher should not share
the data with anyone without consent from the child's parent or guardian, and that the
researcher should take steps to protect the data from unauthorized access, such as by
anonymizing the data.

Conflicts of interest: A researcher who is developing a new social media platform needs
to disclose any financial or other relationships that they have with any organizations or
individuals that could influence their research. For example, if the researcher has a
financial stake in the company that is developing the social media platform, they need to
disclose this conflict of interest to their research participants.

Responsible use of technology: A researcher who is developing a new artificial
intelligence algorithm needs to be careful about how the algorithm is used. For example,
the researcher needs to make sure that the algorithm is not used to discriminate against
people or to violate their privacy.

By adhering to these ethical considerations, computer science researchers can help to
ensure that their research is conducted in a responsible and ethical manner. This is
important because it helps to build public trust in computer science research and to
ensure that the benefits of computer science research are shared equitably.

SECTION B: STATISTICS

B4 A C++ lecturer conducted a test on his students and the following scores were
obtained and recorded below.

36 74 25 50 40 39 62 39 41 65 55 66 59 48 55 57 71 49 42 44
50 40 45 50 61 45 21 58 56 54 56 63 70 39 38 49 53 64 56 34

(a) Represent this information on a stem and leaf diagram [12]

(b) Use the information to find the

(i) Mode [4]

(ii) Median [6]

(iii) the inter-quartile range. [8]


(iv) Give the merits and demerits of using the values above as measures of
central tendency. [10]
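
The model answer below addresses part (iv) only. As a supplementary check on parts (i)
to (iii), here is a hedged Python sketch that computes the mode(s), median, and
inter-quartile range from the scores given above; note that quartile conventions differ, so
the IQR from np.percentile may differ slightly from the value read off a stem-and-leaf
diagram.

```python
# A sketch (not part of the original model answer) for checking parts (i)-(iii)
# against hand calculations, using the standard library and NumPy.
import statistics
import numpy as np

scores = [36, 74, 25, 50, 40, 39, 62, 39, 41, 65, 55, 66, 59, 48, 55, 57, 71, 49, 42, 44,
          50, 40, 45, 50, 61, 45, 21, 58, 56, 54, 56, 63, 70, 39, 38, 49, 53, 64, 56, 34]

# (i) Mode: multimode returns every value with the highest frequency,
# which matters here because the data set may have more than one mode.
print("Mode(s):", statistics.multimode(scores))

# (ii) Median: with n = 40, the average of the 20th and 21st ordered values.
print("Median:", statistics.median(scores))

# (iii) Inter-quartile range: Q3 - Q1. Different quartile conventions
# can give slightly different values for small data sets.
q1, q3 = np.percentile(scores, [25, 75])
print("IQR:", q3 - q1)
```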

The merits and demerits of using the mean, median, and mode as measures of central
tendency are as follows:
Mean

Merits:

Easy to calculate and understand.

Takes into account all of the values in the data set.

Can be used to calculate other statistical measures, such as standard deviation and
variance.

Demerits:

Can be skewed by outliers.

Not suitable for qualitative data.

Median

Merits:

Not affected by outliers.

Easy to calculate and understand.

Suitable for quantitative and ordinal data.

Demerits:

Does not take into account all of the values in the data set.

Can be unstable for small data sets.


Mode

Merits:

Very easy to calculate and understand.

Not affected by outliers.

Suitable for both quantitative and qualitative data.

Demerits:

Does not take into account all of the values in the data set.

Can be unstable for small data sets.

May not be representative of the central tendency of the data set, especially if there
are multiple modes.

Overall, the mean is the most commonly used measure of central tendency, as it is
easy to calculate and understand, and it takes into account all of the values in the data set.
However, it is important to be aware of the potential for outliers to skew the mean. The
median is a good alternative to the mean when outliers are present, or when the data is
ordinal. The mode is not as commonly used as the mean or median, but it can be useful
for certain types of data, such as nominal or categorical data.

Which measure of central tendency is most appropriate to use will depend on the
specific data set and the purpose of the analysis.
(c) A university follows up with its Computer Science graduating students and
collects data on salary earned and the degree classification obtained by the
students. The following data was obtained as shown in Table 1.

Table 1

                    Salary being earned
Degree class      Low     Medium     High
First class        15       39        50
Second class       12       35        55
Third class        20       33        45

Conduct a chi-square test at a 5% significance level to determine if there is a dependence
between the salary being earned and the degree class obtained. [60]
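
No worked answer is given for this question. The exam expects the expected frequencies
and the statistic to be derived by hand, but as a hedged sketch the test can be checked
with SciPy's chi2_contingency, assuming SciPy is available:

```python
# A sketch of the chi-square test of independence for Table 1.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[15, 39, 50],    # First class:  Low, Medium, High
                     [12, 35, 55],    # Second class: Low, Medium, High
                     [20, 33, 45]])   # Third class:  Low, Medium, High

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, df = {dof}, p-value = {p:.3f}")
# Decision rule: reject independence at the 5% level if p < 0.05
# (equivalently, if chi-square exceeds the critical value with df = (3-1)(3-1) = 4).
```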

B5 (a) A research student was investigating the Gross Domestic Product (GDP) of a
country during the years 2015 to 2021. From the information obtained in the
table below, draw an appropriate diagram to classify this data.

Year    Amount of Export    Amount of Import

2015           60                 120
2016          120                 200
2017          150                 250
2018           90                 150
2019           30                  50
2020           20                 200
2021           80                 120

Give the reason(s) for your choice of diagram above. [10]

(b) The table below shows the temperature received in winter and the sales of
coffee on 10 consecutive days.

Day                       1    2    3    4    5    6    7    8    9   10

Temperature (°C)          8   12    9   10   12    8    9   14   10   14

Number of coffees sold   80   65  100   68   70   90   82   30   74   40
(i) Draw a scatter diagram and comment on it. [10]
(ii) Compute the product-moment correlation coefficient (r) and comment. [40]
(iii) Compute the Spearman Rank correlation coefficient. [20]
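
No worked answer is given for this part. The following sketch checks parts (ii) and (iii)
against hand calculations, assuming SciPy is available; note that the textbook Spearman
formula 1 - 6Σd²/(n(n² - 1)) assumes untied ranks, while this data contains ties, so
spearmanr's tie-corrected value may differ slightly:

```python
# A sketch for verifying the two correlation coefficients.
from scipy.stats import pearsonr, spearmanr

temperature = [8, 12, 9, 10, 12, 8, 9, 14, 10, 14]
coffees_sold = [80, 65, 100, 68, 70, 90, 82, 30, 74, 40]

r, _ = pearsonr(temperature, coffees_sold)      # product-moment (Pearson) r
rho, _ = spearmanr(temperature, coffees_sold)   # Spearman rank rho (tie-corrected)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
# A strongly negative value would indicate that coffee sales fall as the
# temperature rises, which should be consistent with the scatter diagram.
```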

(c) Distinguish between the following:

(i) type I and type II errors,
(ii) descriptive and inferential statistics,
(iii) quantitative and qualitative variables,
(iv) non-parametric test and parametric test. [20]

(i) Type I and type II errors

Type I and type II errors are two types of errors that can occur in hypothesis testing.

Type I error is the error of rejecting a true null hypothesis. This is also known as a false
positive.
Type II error is the error of failing to reject a false null hypothesis. This is also known as a
false negative.
The probability of making a type I error is denoted by alpha (α), and the probability of
making a type II error is denoted by beta (β).

Here is a table that summarizes the four possible outcomes of hypothesis testing:

True state of H0    Decision     Outcome

H0 is true          Accept H0    Correct decision
H0 is true          Reject H0    Type I error (false positive)
H0 is false         Accept H0    Type II error (false negative)
H0 is false         Reject H0    Correct decision

Which type of error is more serious depends on the specific context. For example, in medical
research, a type I error could lead to a patient receiving unnecessary treatment, while a type II
error could lead to a patient not receiving the treatment they need.

(ii) Descriptive and inferential statistics

Descriptive statistics are used to summarize the characteristics of a data set. They can be used
to calculate measures of central tendency, such as the mean, median, and mode, as well as
measures of dispersion, such as the standard deviation and range.

Inferential statistics are used to make inferences about a population based on a sample. They
can be used to test hypotheses, such as whether the mean of one population is different from
the mean of another population.

Here is an example of the difference between descriptive and inferential statistics:

Suppose we have a data set of the heights of all 10 students in our class. We can use
descriptive statistics to calculate the mean height of the students in our class. This would give
us an estimate of the central tendency of the data.

However, we cannot use descriptive statistics to infer that the mean height of all students in
our school is the same as the mean height of students in our class. To do this, we would need
to use inferential statistics to test the hypothesis that the mean height of students in our class
is equal to the mean height of all students in our school.
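
To make the contrast concrete, here is a hedged sketch of that inference in Python; the
class heights and the school-wide mean below are invented for illustration only, and
SciPy's one-sample t-test is used:

```python
# Hypothetical sketch: descriptive summary of a sample, then an inferential
# test of whether the class mean equals an assumed school-wide mean.
from scipy.stats import ttest_1samp

class_heights_cm = [160, 172, 168, 175, 158, 170, 165, 180, 162, 169]  # hypothetical sample
school_mean_cm = 170                                                   # hypothetical population mean

# Descriptive statistics summarize the sample itself...
sample_mean = sum(class_heights_cm) / len(class_heights_cm)
print(f"Sample mean: {sample_mean:.1f} cm")

# ...while inferential statistics test a claim about the wider population.
t_stat, p_value = ttest_1samp(class_heights_cm, school_mean_cm)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # reject H0 at the 5% level if p < 0.05
```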

(iii) Quantitative and qualitative variables

Quantitative variables are variables that can be measured numerically. Examples of
quantitative variables include height, weight, and age.

Qualitative variables are variables that cannot be measured numerically. Examples of
qualitative variables include gender, hair color, and eye color.

Here is an example of the difference between quantitative and qualitative variables:

Suppose we are conducting a survey of students to learn about their study habits. We ask
students the following two questions:

How many hours do you study each week?

What is your favorite subject?

The first question is a quantitative question because the answer can be measured numerically
(e.g., 10 hours). The second question is a qualitative question because the answer cannot be
measured numerically.

(iv) Nonparametric test and parametric test

Nonparametric tests are statistical tests that do not require the data to be normally distributed.
Parametric tests are statistical tests that require the data to be normally distributed.

Here is a table that summarizes the key differences between nonparametric tests and
parametric tests:

Characteristic   Nonparametric test                        Parametric test

Assumptions      Do not require the data to be             Require the data to be
                 normally distributed                      normally distributed

Power            Less powerful than parametric tests       More powerful than nonparametric
                 when the assumptions of the               tests when the assumptions of the
                 parametric test are met                   parametric test are met

Robustness       More robust to violations of              Less robust to violations of
                 the assumptions                           the assumptions

Here are some examples of nonparametric tests and parametric tests:

Nonparametric tests: Chi-squared test, Mann-Whitney U test, Wilcoxon signed-rank test

Parametric tests: Student's t-test, ANOVA, F-test

Which type of test to use depends on the specific context. If the data is normally distributed
and the assumptions of the parametric test are met, then the parametric test should be used.
Otherwise, a nonparametric test should be used.
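
As an illustration of this choice, the following sketch runs a parametric and a
nonparametric test on the same two samples; the data is invented for illustration, and
SciPy's ttest_ind and mannwhitneyu are assumed to be available:

```python
# Sketch: the same comparison done parametrically and nonparametrically.
from scipy.stats import ttest_ind, mannwhitneyu

group_a = [23, 25, 28, 30, 31, 27]   # hypothetical measurements
group_b = [20, 22, 24, 26, 21, 23]

# Parametric: Student's t-test assumes approximately normal data.
t_stat, t_p = ttest_ind(group_a, group_b)

# Nonparametric: the Mann-Whitney U test compares ranks and does not
# require normality.
u_stat, u_p = mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test:       t = {t_stat:.3f}, p = {t_p:.3f}")
print(f"Mann-Whitney: U = {u_stat:.1f}, p = {u_p:.3f}")
```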
B6 (a) Two interviewers, A and B, award points to prospective candidates for
a job in an interview. The following table illustrates the performance of the
candidates in the interview.

Interviewer A   14   12   20   18   22   13   15   20   11   16
Interviewer B   20   18   17   16   21   17   20   17   25   10

Test at a 5% significance level if there is a significant difference in how the
interviewers awarded points. [35]
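
No model answer is given for this part. One plausible reading is that both interviewers
scored the same ten candidates, making this a paired design; the sketch below runs a
paired t-test and, as a nonparametric alternative, the Wilcoxon signed-rank test, assuming
SciPy is available. The exam itself would expect the statistic to be computed by hand.

```python
# A sketch of one plausible approach under the paired-design assumption.
from scipy.stats import ttest_rel, wilcoxon

interviewer_a = [14, 12, 20, 18, 22, 13, 15, 20, 11, 16]
interviewer_b = [20, 18, 17, 16, 21, 17, 20, 17, 25, 10]

# Parametric: paired t-test on the score differences.
t_stat, t_p = ttest_rel(interviewer_a, interviewer_b)
print(f"Paired t-test: t = {t_stat:.3f}, p = {t_p:.3f}")

# Nonparametric alternative if the differences are not assumed normal.
w_stat, w_p = wilcoxon(interviewer_a, interviewer_b)
print(f"Wilcoxon signed-rank: W = {w_stat:.1f}, p = {w_p:.3f}")
# Reject H0 (no difference between interviewers) at the 5% level if p < 0.05.
```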

(b) State any five (5) characteristics of a normal distribution [10]

Symmetric: The normal distribution is symmetrical around its mean, which means
that the left and right halves of the distribution are mirror images of each other.
Unimodal: The normal distribution has a single mode, or peak, which corresponds to
the mean.
Bell-shaped: The normal distribution is bell-shaped, with the majority of values
falling near the mean and fewer values falling in the tails of the distribution.
Empirical rule: The empirical rule, also known as the 68-95-99.7 rule, states that
approximately 68% of the values in a normal distribution fall within one standard deviation
of the mean, approximately 95% of the values fall within two standard deviations of the
mean, and approximately 99.7% of the values fall within three standard deviations of the
mean.
Continuous: The normal distribution is a continuous distribution, which means that
any value within the range of the distribution is possible.
In addition to these five characteristics, normal distributions also have the following
properties:

The mean, median, and mode of a normal distribution are all equal.
A normal distribution is completely determined by its two parameters, the
mean and standard deviation.
By the central limit theorem, the sampling distributions of many test statistics, such
as the t-statistic and the z-statistic, are approximately normal, which is why tests based
on them are more accurate when the data is normally distributed.
Normal distributions are very common in nature and in many different fields of study,
including statistics, physics, biology, and economics.
(c) A manufacturing plant produces bags of cement with a mean weight of 60
kilogrammes and a variance of 25. If a bag of cement is picked at random, find
the probability that the weight of the bag of cement is

(i) more than 50 kilogrammes [6]

(ii) less than 65 kilogrammes [4]

(iii) between 55 and 70 kilogrammes [10]
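
No worked answer is given for this part. Since the variance is 25, the standard deviation
is 5, and each probability reduces to a standard normal calculation. A sketch for checking
the hand calculations, assuming SciPy is available:

```python
# Weight X ~ N(mean = 60, variance = 25), so the standard deviation is 5.
from scipy.stats import norm

mu, sigma = 60, 5  # sigma = sqrt(25)

p_more_than_50 = norm.sf(50, mu, sigma)                        # P(X > 50) = P(Z > -2)
p_less_than_65 = norm.cdf(65, mu, sigma)                       # P(X < 65) = P(Z < 1)
p_between = norm.cdf(70, mu, sigma) - norm.cdf(55, mu, sigma)  # P(55 < X < 70)

print(f"P(X > 50)      = {p_more_than_50:.4f}")   # approx 0.9772
print(f"P(X < 65)      = {p_less_than_65:.4f}")   # approx 0.8413
print(f"P(55 < X < 70) = {p_between:.4f}")        # approx 0.8186
```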

(d) The data below represents the height of plants recorded to the nearest
centimetre.

Height (cm)   26-30   31-35   36-40   41-45   46-50   51-55   56-60   61-65
Frequency       4       5      23      58      61      30       3       3

(i) Construct a histogram for the distribution. [15]
(ii) Superimpose a frequency polygon on the histogram. [5]
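
A hedged sketch of how the histogram and superimposed frequency polygon could be
produced, assuming matplotlib is available; the class boundaries of 25.5, 30.5, and so on
are an assumption based on heights recorded to the nearest unit:

```python
# Histogram of the grouped data with a frequency polygon on top.
import matplotlib.pyplot as plt

boundaries = [25.5, 30.5, 35.5, 40.5, 45.5, 50.5, 55.5, 60.5, 65.5]
frequencies = [4, 5, 23, 58, 61, 30, 3, 3]
midpoints = [(boundaries[i] + boundaries[i + 1]) / 2 for i in range(len(frequencies))]

# Histogram: one bar per class, spanning the class boundaries.
plt.bar(midpoints, frequencies, width=5, edgecolor="black")

# Frequency polygon: join the class midpoints, closing to zero at each end.
poly_x = [midpoints[0] - 5] + midpoints + [midpoints[-1] + 5]
poly_y = [0] + frequencies + [0]
plt.plot(poly_x, poly_y, marker="o", color="red")

plt.xlabel("Height (cm)")
plt.ylabel("Frequency")
plt.title("Height of plants")
plt.show()
```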
(e) Using examples define
(i) Type II error
(ii) Significance level
(iii) Critical value [15]

(i) Type II error

A type II error is the error of failing to reject a false null hypothesis. This is also
known as a false negative.

Example:

A medical researcher is testing a new drug to treat a particular disease. The null
hypothesis is that the drug has no effect on the disease. The researcher
conducts a clinical trial and finds that the drug does not appear to be effective.
However, the drug may actually be effective, but the sample size of the
clinical trial was too small to detect the effect. In this case, the researcher has
made a type II error.

(ii) Significance level

The significance level is the probability of rejecting a true null hypothesis. This is
denoted by the symbol alpha (α).

Example:
A researcher is testing the hypothesis that the mean height of men is greater than the
mean height of women. The researcher sets the significance level to alpha =
0.05. This means that the researcher is willing to accept a 5% chance of
rejecting the true null hypothesis that the mean height of men is equal to the
mean height of women.

(iii) Critical value

The critical value is the value of the test statistic that must be exceeded in order to
reject the null hypothesis. The critical value depends on the significance level
and the distribution of the test statistic.

Example:

A researcher is using a t-test to test the hypothesis that the mean height of men is
greater than the mean height of women. The researcher has set the significance
level to alpha = 0.05. The critical value for a one-tailed t-test with alpha = 0.05 and 20
degrees of freedom is 1.725. This means that if the t-statistic is greater than
1.725, then the researcher will reject the null hypothesis.

[END]
