Predictive Analysis Midterm - F22

DIVISION OF CRIME, JUSTICE & SECURITY STUDIES PROGRAM
MS in Homeland Security Program
Professor Anthony M. Mazza, JD
anthony.mazza@UDC.edu
Predictive Analysis HLSC-560
INSTRUCTIONS: This Examination is to be completed by midnight October 15, 2022 and submitted
by email to me at anthony.mazza@udc.edu AND to mazza.tony@gmail.com.
Student Name:
MIDTERM EXAM
Part 1: EXCEL Application (20%)
1. In an Excel spreadsheet, calculate (i) the Final Grade for each student (NOTE: each exam is
equally weighted, but the Final Exam is the equivalent of two exams, i.e., it is weighted
double); and the final population (ii) Mean, (iii) Median, (iv), and (v) Standard Deviation for
the following data set from a graduate-level statistics course:
Student   Exam 1   Exam 2   Exam 3   Exam 4   Final Exam
Lonny       81       83       88       90       89
Charlie     87       80       87       91       89
Doris       90       88       92       92       93
Rita        76       71       85       90       87
Timmy       81       77       86       91       91
Ray         77       69       77       88       92
Donnie      64       60       71       89       87
Xavier      88       85       89       77       89
Millie      85       89       94       89       88
Alice       91       87       95       96       93
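As a cross-check on the spreadsheet work, the calculation in Question 1 can be sketched in Python. This is only an illustrative sketch, not the required Excel solution. It assumes the Final Exam counts as two exams (so each grade is six exam-equivalents divided by 6) and uses the population standard deviation, since the question asks for population statistics. All variable and function names are made up for the example.

```python
from statistics import mean, median, pstdev

# Exam 1 to Exam 4 scores followed by the Final Exam score, copied from the table.
scores = {
    "Lonny":   [81, 83, 88, 90, 89],
    "Charlie": [87, 80, 87, 91, 89],
    "Doris":   [90, 88, 92, 92, 93],
    "Rita":    [76, 71, 85, 90, 87],
    "Timmy":   [81, 77, 86, 91, 91],
    "Ray":     [77, 69, 77, 88, 92],
    "Donnie":  [64, 60, 71, 89, 87],
    "Xavier":  [88, 85, 89, 77, 89],
    "Millie":  [85, 89, 94, 89, 88],
    "Alice":   [91, 87, 95, 96, 93],
}


def final_grade(row):
    """Weighted average in which the Final Exam counts twice (assumption)."""
    *exams, final_exam = row
    return (sum(exams) + 2 * final_exam) / 6


final_grades = {name: final_grade(row) for name, row in scores.items()}
grades = list(final_grades.values())

grade_mean = mean(grades)
grade_median = median(grades)
grade_sd = pstdev(grades)  # population standard deviation

for name, grade in final_grades.items():
    print(f"{name:8s} {grade:6.2f}")
print(f"mean={grade_mean:.2f} median={grade_median:.2f} sd={grade_sd:.2f}")
```

In Excel the same weighting could be written as a sum of the four exams plus twice the final, divided by 6, with AVERAGE, MEDIAN and STDEV.P applied to the resulting column.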
2. Using the same data set, create (i) a scatterplot of all FINAL calculated grades; (ii) a single,
clustered bar graph illustrating EACH student’s grade performance; and (iii) a simple line
graph depicting student performance over time (i.e., collective progression from Exam 1
through the Final).
3. Using the same data set, create an ILLUSTRATED regression analysis comparing (i) the Top
Performing Student (Student A) to the Mean, and (ii) the Lowest Performing Student
(Student B) against the Mean. These may be illustrated in one graph (Student A & B v. Mean)
or separate graphs.
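One possible reading of Question 3 is to fit a least-squares trend line through each series (Student A, Student B, and the class mean) across the five exams and compare the slopes. The sketch below follows that reading only. Treating the exam number (1 to 5) as the x variable and picking Student A and B by the double-weighted final grade are assumptions, and the helper names are hypothetical. The chart itself would still be drawn in Excel.

```python
scores = {
    "Lonny":   [81, 83, 88, 90, 89],
    "Charlie": [87, 80, 87, 91, 89],
    "Doris":   [90, 88, 92, 92, 93],
    "Rita":    [76, 71, 85, 90, 87],
    "Timmy":   [81, 77, 86, 91, 91],
    "Ray":     [77, 69, 77, 88, 92],
    "Donnie":  [64, 60, 71, 89, 87],
    "Xavier":  [88, 85, 89, 77, 89],
    "Millie":  [85, 89, 94, 89, 88],
    "Alice":   [91, 87, 95, 96, 93],
}


def weighted_grade(row):
    """Final grade with the Final Exam counted twice (assumption)."""
    return (sum(row[:4]) + 2 * row[4]) / 6


def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


exam_index = [1, 2, 3, 4, 5]  # Exam 1..4, then the Final
top = max(scores, key=lambda name: weighted_grade(scores[name]))
lowest = min(scores, key=lambda name: weighted_grade(scores[name]))
class_mean = [sum(col) / len(col) for col in zip(*scores.values())]

trend = {
    top: least_squares(exam_index, scores[top]),
    lowest: least_squares(exam_index, scores[lowest]),
    "Mean": least_squares(exam_index, class_mean),
}
for label, (slope, intercept) in trend.items():
    print(f"{label:7s} slope={slope:.2f} intercept={intercept:.2f}")
```

A steeper slope than the class mean indicates a student who improved faster than the class as a whole over the term.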
Part 2: Short Answer (50%)
Please provide the briefest explanation that FULLY answers the following Questions.
1. In data analysis, what do the terms (i) “frequency” and (ii) “mode” mean?
2. What are the three ways to link quantitative and qualitative data?
3. Define the following notations in data science:

a. α: alpha
b. µ: mu
c. β: beta
d. σ: sigma
e. σ²: sigma squared
4. How can we confidently generalize information of a population, gleaned from a sample?
5. What is the “gambler’s fallacy” in probabilistic reasoning?
6. Define the term “probability”.
7. Define “confidence level” in data science.

The confidence level represents the theoretical ability of the analysis to produce
accurate intervals if you were able to compute many intervals and knew the value of the
population parameter. For a specific confidence interval from a single study, the
interval either contains the population value or it does not; there is no room for
probabilities other than 0 or 1. And you cannot choose between these two possibilities
because the value of the population parameter is unknown.
8. What is the difference between parametric and non-parametric tests?
Parametric tests assume statistical distributions underlying the data. Certain validity conditions must therefore be met
for the result of a parametric test to be reliable. For example, Student’s t-test for two independent samples is reliable
only if each sample follows a normal distribution and if the variances are homogeneous.
Non-parametric tests do not have to conform to any distribution. They can therefore be applied even when the parametric
validity conditions are not met.
4200 Connecticut Avenue NW | Washington, DC 20008 | 202.274.5869 | udc.edu
9. What is a “t-test”?
A t-test (also known as Student’s t-test) is a tool for evaluating the means of one or two
groups using hypothesis testing. A t-test can be used to determine whether a single group
differs from a known value (a one-sample t-test), whether two groups differ from each other
(an independent two-sample t-test), or whether there is a significant difference in paired
measurements (a paired or dependent-samples t-test).
How are t-tests used?
First, define the hypothesis you want to test and decide on an acceptable risk of drawing a
wrong conclusion. For example, when comparing two groups, you might hypothesize that their
means are equal and set an acceptable probability of concluding that a difference exists
when it does not. Next, calculate the test statistic from your data and compare it with a
theoretical value from the t-distribution. Depending on the result, you either reject or
fail to reject your null hypothesis.
What if I have more than two groups?
You cannot use a t-test.
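To make the "calculate the test statistic" step concrete, here is a minimal sketch of an independent two-sample t statistic with a pooled variance. The two groups are hypothetical numbers invented for the illustration, not exam data, and the equal-variance (pooled) form is an assumption. Other forms of the test exist.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Two-sample t statistic assuming equal variances; returns (t, degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / dof
    std_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_error, dof


# Hypothetical measurements for two independent groups.
group_a = [5, 6, 7, 8, 9]
group_b = [1, 2, 3, 4, 5]

t_stat, dof = pooled_t_statistic(group_a, group_b)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

The resulting statistic would then be compared with the critical value of the t-distribution for the chosen risk level and degrees of freedom (about 2.306 for a two-sided 5% test with 8 degrees of freedom).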
10. Define “p-value” & “z score” and their relationship.

11. Most statistical tests begin by identifying a null hypothesis. The null
hypothesis for the pattern analysis tools (the Analyzing Patterns toolset
and the Mapping Clusters toolset) is Complete Spatial Randomness (CSR),
either of the features themselves or of the values associated with those
features. The z-scores and p-values returned by the pattern analysis tools
tell you whether or not you can reject that null hypothesis. Typically, you
run one of the pattern analysis tools hoping that the z-score and p-value
will indicate that you can reject the null hypothesis, because that would
indicate that, rather than a random pattern, your features (or the values
associated with your features) exhibit statistically significant clustering
or dispersion. The presence of spatial structure such as clustering in the
landscape (or in the spatial data) means that some underlying spatial
processes are at work, and as a geographer or GIS analyst, this is usually
what interests you most.
12. The p-value is a probability. For the pattern analysis tools, it is the
probability that the observed spatial pattern was created by some random
process. When the p-value is very small, it is very unlikely (small
probability) that the observed spatial pattern is the result of random
processes, so you can reject the null hypothesis. You might ask: how small
is small enough? Good question. See the table and discussion below.
13. Z-scores are standard deviations. If, for example, a tool returns a
z-score of +2.5, you would say that the result is 2.5 standard deviations.
Both z-scores and p-values are associated with the standard normal
distribution as shown below.
14. Very high or very low (negative) z-scores, associated with very small
p-values, are found in the tails of the normal distribution. When you run a
feature pattern analysis tool and it yields small p-values and either a
very high or a very low z-score, this indicates that the observed spatial
pattern is unlikely to reflect the theoretical random pattern represented
by your null hypothesis (CSR).
15. To reject the null hypothesis, you must make a subjective judgment about
the degree of risk you are willing to accept for being wrong (for falsely
rejecting the null hypothesis). Consequently, before you run the spatial
statistic, you select a confidence level. Typical confidence levels are 90,
95, or 99 percent. A confidence level of 99 percent would be the most
conservative in this case, indicating that you are unwilling to reject the
null hypothesis unless the probability that the pattern was created by
random chance is really small (less than a 1 percent probability).
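The relationship between a z-score and its p-value can be sketched with Python's standard library. The sketch assumes a two-sided test against the standard normal distribution, and the function name is hypothetical. It converts the z-score of +2.5 used as an example in this answer into a p-value and checks it against the three typical confidence levels.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a result at least this extreme under the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


z = 2.5
p = two_sided_p_value(z)
print(f"z = {z:+.1f} gives p = {p:.4f}")

# Reject the null hypothesis when p is smaller than 1 - confidence level.
decisions = {}
for confidence in (0.90, 0.95, 0.99):
    decisions[confidence] = p < (1 - confidence)
    verdict = "reject" if decisions[confidence] else "do not reject"
    print(f"{confidence:.0%} confidence: {verdict} the null hypothesis")
```

Under these assumptions the same z-score leads to rejection at the 90 and 95 percent levels but not at the more conservative 99 percent level.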
16. Define the term “degrees of freedom”.

What are degrees of freedom in statistics? Degrees of freedom are the number of independent
values that a statistical analysis can estimate. You can also think of it as the number of values that
are free to vary as you estimate parameters. DF encompasses the notion that the amount of
independent information you have limits the number of parameters that you can estimate. Typically,
the degrees of freedom equals your sample size minus the number of parameters you need to
calculate during an analysis. It is usually a positive whole number.
Degrees of freedom is a combination of how much data you have and how many parameters you
need to estimate. It indicates how much independent information goes into a parameter estimate. In
this vein, it’s easy to see that you want a lot of information to go into parameter estimates to obtain
more precise estimates and more powerful hypothesis tests. So, you want many DF!
Degrees of freedom are the number of observations in a data set that vary randomly and
independently, minus the observations that are conditioned on those arbitrary values.
In other words, degrees of freedom are the number of purely free observations (those that
can vary) when we estimate parameters.
To determine their degrees of freedom, we mainly distinguish between statistics that use
population parameters and those that use sample parameters. We discuss the differences between
the mean and the standard deviation when the parameters are population or sample parameters:
17. In each of these statements, tell whether descriptive or inferential statistics have been used.
a. By 2040 at least 3.5 billion people will run short of water (World Future Society). Inferential
b. Nine out of ten on-the-job fatalities are men (Source: USA TODAY Weekend). Descriptive
c. Expenditures for the cable industry were $5.66 billion in 1996 (Source: USA TODAY). Descriptive
d. The median household income for people aged 25–34 is $35,888 (Source: USA TODAY). Descriptive
e. Allergy therapy makes bees go away (Source: Prevention). Inferential
18. Classify each sample as random, systematic, stratified, or cluster.

a. In a large school district, all teachers from two buildings are interviewed to determine whether
they believe the students have less homework to do now than in previous years. Cluster
Sampling.
b. Every seventh customer entering a shopping mall is asked to select her or his favorite store.
Systematic Sampling.
c. Nursing supervisors are selected using random numbers to determine annual salaries. Simple
random sampling.
d. Every 100th hamburger manufactured is checked to determine its fat content. Systematic Sampling.
e. Mail carriers of a large city are divided into four groups according to gender (male or female) and
according to whether they walk or ride on their routes. Then 10 are selected from each group and
interviewed to determine whether they have been bitten by a dog in the last year. Stratified
sampling.
19. The average quantitative GRE scores for the top 30 graduate schools of engineering are listed. Construct a
grouped frequency distribution and a cumulative frequency distribution with 5 classes.
767 770 761 760 771 768 776 771 756 770 763 760 747 766 754 771 771 778 766 762 780 750 746 764
769 759 757 753 758 746
https://www.chegg.com/homework-help/questions-and-answers/760-11-gre-scores-top-ranked-engineering-schools-average-quantitative-gre-scores-top-30-gr-q88567273
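A grouped and cumulative frequency distribution for the 30 GRE scores can be sketched as follows. The choices made here are assumptions rather than the only valid answer: the first class starts at the minimum score, and the class width is the range divided by 5, rounded up to a whole number. The variable names are hypothetical.

```python
from itertools import accumulate
from math import ceil

gre_scores = [
    767, 770, 761, 760, 771, 768, 776, 771, 756, 770, 763, 760,
    747, 766, 754, 771, 771, 778, 766, 762, 780, 750, 746, 764,
    769, 759, 757, 753, 758, 746,
]

num_classes = 5
low = min(gre_scores)
# Range 780 - 746 = 34; 34 / 5 = 6.8, rounded up to a width of 7.
# Note: if the range divided evenly, the width would need to be one larger.
width = ceil((max(gre_scores) - low) / num_classes)

counts = [0] * num_classes
for score in gre_scores:
    counts[(score - low) // width] += 1
cumulative = list(accumulate(counts))

for i, (count, running) in enumerate(zip(counts, cumulative)):
    lower = low + i * width
    print(f"{lower}-{lower + width - 1}: frequency {count}, cumulative {running}")
```

With these choices the class limits are 746-752, 753-759, 760-766, 767-773 and 774-780.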
20. In a recent survey, 3 in 10 people indicated that they are likely to leave their jobs when the economy
improves. Of those surveyed, 34% indicated that they would make a career change, 29% want a new job
in the same industry, 21% are going to start a business, and 16% are going to retire. Make a pie chart for
the data.
21. Classify each as nominal-level, ordinal-level, interval-level, or ratio-level measurement.

a. Pages in the 25 best-selling mystery novels. Ratio
b. Rankings of golfers in a tournament. Ordinal
c. Temperatures inside 10 pizza ovens. Interval
d. Categories of magazines in a physician’s office (sports, women’s, health, men’s, news). Nominal
e. Salaries of the coaches in the NFL. Ratio
22. What is a “coefficient of variation”?
23. Choose one of the 50 states at random.

a. What is the probability that it begins with M?
b. What is the probability that it doesn’t begin with a vowel?
https://www.chegg.com/homework-help/questions-and-answers/question-3-selecting-us-state-choose-one-50-states-random--sample-space-type-answer-b-prob-q98498288
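Both parts of the state question reduce to counting names in an equally likely sample space of 50 outcomes. The sketch below lists the states and counts them. The list and variable names are written out for this example only.

```python
from fractions import Fraction

STATES = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
    "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",
    "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New Hampshire", "New Jersey", "New Mexico", "New York",
    "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
    "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
    "West Virginia", "Wisconsin", "Wyoming",
]

starts_with_m = sum(1 for state in STATES if state.startswith("M"))
starts_with_vowel = sum(1 for state in STATES if state[0] in "AEIOU")

# Each state is equally likely, so probability = favorable / total.
p_m = Fraction(starts_with_m, len(STATES))
p_not_vowel = 1 - Fraction(starts_with_vowel, len(STATES))

print(f"P(begins with M) = {p_m} = {float(p_m):.2f}")
print(f"P(does not begin with a vowel) = {p_not_vowel} = {float(p_not_vowel):.2f}")
```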
24. In building new homes, a contractor finds that the probability of a home buyer selecting a two-car garage
is 0.70 and of selecting a one-car garage is 0.20. Find the probability that the buyer will select no garage.
The builder does not build houses with three-car or more garages.
https://www.chegg.com/homework-help/mathzone-access-card-for-math-in-our-world-2nd-edition-chapter-11.7-problem-30e-solution-9780077314118
25. The composition of the Senate of the 111th Congress is 41 Republicans 2 Independent 57 Democrats. A
new committee is being formed to study ways to benefit the arts in education.
a. If 3 Senators are selected at random to head the committee, what is the probability that they will
all be Republicans?
b. What is the probability that they will all be Democrats?
c. What is the probability that there will be 1 from each party, including the Independent?
https://www.chegg.com/homework-help/questions-and-answers/help-3-questions-please-sorry-shotty-photoshop-skills-think-get-point-q19045788
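The Senate committee question can be worked with combinations, assuming the 3 Senators are drawn without replacement and that every group of 3 is equally likely. The variable names below are hypothetical.

```python
from math import comb

republicans, independents, democrats = 41, 2, 57
total = republicans + independents + democrats  # 100 Senators

all_groups = comb(total, 3)  # number of ways to choose any 3 Senators

p_all_republican = comb(republicans, 3) / all_groups
p_all_democrat = comb(democrats, 3) / all_groups
# One Republican, one Independent and one Democrat, in any order.
p_one_of_each = (republicans * independents * democrats) / all_groups

print(f"a. P(all Republicans)     = {p_all_republican:.4f}")
print(f"b. P(all Democrats)       = {p_all_democrat:.4f}")
print(f"c. P(one from each party) = {p_one_of_each:.4f}")
```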
26. Approximately 10.3% of American high school students drop out of school before graduation. Choose 10
students entering high school at random. Find the probability that
a. No more than two drop out
b. At least 6 graduate
c. All 10 stay in school and graduate
https://www.chegg.com/homework-help/questions-and-answers/approximately-103-american-high-school-students-drop-school-graduation-choose-10-students--q87719579
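The dropout question fits a binomial model if each of the 10 students is assumed to drop out independently with probability 0.103. Under that assumption, a short sketch of the three parts looks like this (the function and variable names are hypothetical):

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p_drop = 10, 0.103

# a. No more than two drop out: 0, 1 or 2 dropouts.
p_at_most_two_drop = sum(binomial_pmf(k, n, p_drop) for k in range(3))
# b. "At least 6 graduate" is the same event as "at most 4 drop out".
p_at_least_six_graduate = sum(binomial_pmf(k, n, p_drop) for k in range(5))
# c. All 10 graduate: zero dropouts.
p_all_graduate = binomial_pmf(0, n, p_drop)

print(f"a. {p_at_most_two_drop:.4f}")
print(f"b. {p_at_least_six_graduate:.4f}")
print(f"c. {p_all_graduate:.4f}")
```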
27. Find the probability for each.

a. P(0<z<2.32)
b. P(z<1.65)
c. P(z >1.91)
https://www.chegg.com/homework-help/questions-and-answers/find-probability-p-0-q38586535
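The three standard normal probabilities can be checked against a z table with the cumulative distribution function in Python's standard library. This is a sketch for verification only, and the variable names are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

p_a = std_normal.cdf(2.32) - std_normal.cdf(0)  # P(0 < z < 2.32)
p_b = std_normal.cdf(1.65)                      # P(z < 1.65)
p_c = 1 - std_normal.cdf(1.91)                  # P(z > 1.91)

print(f"a. {p_a:.4f}")
print(f"b. {p_b:.4f}")
print(f"c. {p_c:.4f}")
```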
28. In the standard normal distribution, find the values of z for the 75th, 80th, and 92nd percentiles.
https://www.chegg.com/homework-help/questions-and-answers/need-answers-important-minitab-application-minitab-please-answer-find-z-value-minitab-stan-q27069865?trackid=a04fcc798c3e&strackid=55831424652b
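Finding the z value for a percentile is the inverse of the lookup in the previous question. As a sketch, the standard library's inverse cumulative distribution function returns the z value below which a given fraction of the standard normal distribution lies (the dictionary name is hypothetical):

```python
from statistics import NormalDist

std_normal = NormalDist()

# Map each percentile to the z value with that much area to its left.
z_for_percentile = {
    pct: std_normal.inv_cdf(pct / 100) for pct in (75, 80, 92)
}

for pct, z in z_for_percentile.items():
    print(f"{pct}th percentile: z = {z:.2f}")
```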
29. The average credit card debt for college seniors is $3262. If the debt is normally distributed with a
standard deviation of $1100, find these probabilities.
a. That the senior owes at least $1000
b. That the senior owes more than $4000
c. That the senior owes between $3000 and $4000
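Here we have a normal distribution with mean µ = 3262 and standard deviation σ = 1100, so z = (X - µ)/σ.
(a) The z-score for X = 1000 is z = (1000 - 3262)/1100 ≈ -2.06.
So the probability that the senior owes at least $1000 is P(z ≥ -2.06) ≈ 0.980.
(b) The z-score for X = 4000 is z = (4000 - 3262)/1100 ≈ 0.67.
So the probability that the senior owes more than $4000 is P(z > 0.67) ≈ 0.251.
(c) The z-score for X = 4000 is 0.67, and the z-score for X = 3000 is z = (3000 - 3262)/1100 ≈ -0.24.
So the required probability is P(-0.24 < z < 0.67) ≈ 0.343.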
https://www.chegg.com/homework-help/elementary-statistics-9th-edition-chapter-6.2-problem-11e-solution-9780078136337
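The credit card debt probabilities can be verified directly from the stated mean of $3262 and standard deviation of $1100. This sketch uses unrounded z-scores, so the results may differ from table lookups in the third or fourth decimal place. The variable names are hypothetical.

```python
from statistics import NormalDist

debt = NormalDist(mu=3262, sigma=1100)

p_at_least_1000 = 1 - debt.cdf(1000)                 # a. owes at least $1000
p_more_than_4000 = 1 - debt.cdf(4000)                # b. owes more than $4000
p_between_3000_4000 = debt.cdf(4000) - debt.cdf(3000)  # c. between $3000 and $4000

for label, value in [
    ("a", p_at_least_1000),
    ("b", p_more_than_4000),
    ("c", p_between_3000_4000),
]:
    print(f"{label}. {value:.4f}")
```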
49. For the following, find the variance and the standard deviation:
Eighty randomly selected batteries were tested to determine their lifetimes (in hours). The following
frequency distribution was obtained.
Class boundaries Frequency
62.5–73.5 5
73.5–84.5 14
84.5–95.5 18
95.5–106.5 25
106.5–117.5 12
117.5–128.5 6
https://www.chegg.com/homework-help/questions-and-answers/class-frequency-625-735-5-735-845-14-845-955-18-955-1065-25-1065-1175-12-1175-1285-6-find--q79140215
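For grouped data, the variance is usually computed from class midpoints weighted by frequency. The sketch below assumes the 80 batteries are a sample (so it divides by n - 1) and that each class is represented by its midpoint. The variable names are hypothetical.

```python
from math import sqrt

class_boundaries = [
    (62.5, 73.5), (73.5, 84.5), (84.5, 95.5),
    (95.5, 106.5), (106.5, 117.5), (117.5, 128.5),
]
frequencies = [5, 14, 18, 25, 12, 6]

midpoints = [(low + high) / 2 for low, high in class_boundaries]
n = sum(frequencies)

sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
sum_fx2 = sum(f * x**2 for f, x in zip(frequencies, midpoints))

# Shortcut formula for the sample variance of grouped data.
variance = (n * sum_fx2 - sum_fx**2) / (n * (n - 1))
std_dev = sqrt(variance)

print(f"n = {n}, variance = {variance:.1f}, standard deviation = {std_dev:.1f}")
```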
Part 3. Statistical Application 1 (15%)
In a recent study of 35 ninth-grade students, the mean number of hours per week that they played video games was 16.6.
The standard deviation of the population was 2.8.
1. Find the 95% confidence interval of the mean of the time playing video games.
2. Calculate the margin of error.
a. The Confidence interval is (15.67, 17.53)
b. The margin of error is 0.93
https://www.chegg.com/homework-help/questions-and-answers/recent-study-35-ninth-grade-students-mean-number-hours-per-week-played-video-games-166-sta-q103046161
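The interval and margin of error above follow from the z-based formula for a mean with a known population standard deviation: margin = z × σ / √n. A short sketch, with hypothetical variable names, reproduces both figures:

```python
from math import sqrt
from statistics import NormalDist

n = 35
sample_mean = 16.6
population_sd = 2.8
confidence = 0.95

# Critical z value that leaves 2.5% in each tail (about 1.96).
z_critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin_of_error = z_critical * population_sd / sqrt(n)
lower = sample_mean - margin_of_error
upper = sample_mean + margin_of_error

print(f"margin of error = {margin_of_error:.2f}")
print(f"95% confidence interval = ({lower:.2f}, {upper:.2f})")
```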
10
Part 4: Statistical Application 2 (15%)
Imagine a scenario where you’ve been asked to be the gatekeeper (and paid very well for your duty) for a
series of scientific experiments to determine “truthfulness” in humans. A series of Subjects will approach you
and make a single statement; you are to determine the veracity of that statement (for simplicity, assume that
the only options available to you are to determine if a subject’s statement is “true” or “false”, i.e., there are
ONLY TWO OUTCOMES). REVIEW the Probability Equations!
Suppose that you are pre-equipped with knowledge of the following probabilities:
• 82% of all subjects you will interview are “liars”
• 75% of all “liars” will be identified
• 40% of the time “truth-tellers” will be wrongly identified as liars
Please answer the following questions:
(a) What proportion of all subjects will you identify as “liars”?

(b) Of the people you determine to be “liars”, what is the probability that the person is, in fact, lying?
(c) What is the probability that you will make a correct decision?
(d) What is the probability that you will misidentify a “liar” (i.e., you mistakenly believed a “liar”)?
We are given here that:

P( liar ) = 0.82
P( + | liar ) = 0.75 and
P( + | not liar ) = 0.4
a) Using law of total probability,

P(+) = 0.75*0.82 + 0.4*0.18 = 0.687 is the required proportion here.
b) Using Bayes theorem, we have here:

P( liar | +) = 0.75*0.82 /0.687 = 0.8952 is the probability here.
c) P( correct decision ) here is computed as:

= 0.82*0.75 + 0.18*0.6 = 0.723 is the required probability here.
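d) P( - and liar ), i.e., a “liar” who is not identified and is therefore believed:
= (1 - 0.75)*0.82 = 0.25*0.82 = 0.205 is the required probability here.
(By contrast, P( + and not liar ) = 0.4*0.18 = 0.072 is the probability of wrongly flagging a
truth-teller; the two errors together give 0.205 + 0.072 = 0.277 = 1 - 0.723.)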
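The calculations in this part can be reproduced from the three given probabilities with the law of total probability and Bayes' theorem. In this sketch "+" means "identified as a liar". The variable names are hypothetical, and both kinds of error are computed so they can be compared.

```python
p_liar = 0.82              # P(liar)
p_pos_given_liar = 0.75    # P(+ | liar)
p_pos_given_truth = 0.40   # P(+ | not liar)
p_truth = 1 - p_liar

# (a) Law of total probability: share of all subjects flagged as liars.
p_pos = p_pos_given_liar * p_liar + p_pos_given_truth * p_truth

# (b) Bayes' theorem: probability a flagged subject really is lying.
p_liar_given_pos = p_pos_given_liar * p_liar / p_pos

# (c) Correct decisions: liars flagged plus truth-tellers believed.
p_correct = p_pos_given_liar * p_liar + (1 - p_pos_given_truth) * p_truth

# The two ways of being wrong.
p_missed_liar = (1 - p_pos_given_liar) * p_liar  # a liar is believed
p_false_alarm = p_pos_given_truth * p_truth      # a truth-teller is flagged

print(f"(a) P(+) = {p_pos:.3f}")
print(f"(b) P(liar | +) = {p_liar_given_pos:.4f}")
print(f"(c) P(correct) = {p_correct:.3f}")
print(f"missed liar = {p_missed_liar:.3f}, false alarm = {p_false_alarm:.3f}")
```

The two error probabilities and the probability of a correct decision add up to 1, which is a quick consistency check on the arithmetic.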