Predictive Analysis Midterm - F22


DIVISION OF CRIME, JUSTICE & SECURITY STUDIES PROGRAM

MS in Homeland Security Program

Professor Anthony M. Mazza, JD
anthony.mazza@UDC.edu
Predictive Analysis HLSC-560

INSTRUCTIONS: This Examination is to be completed and emailed to me at anthony.mazza@udc.edu AND to
mazza.tony@gmail.com by midnight on October 15, 2022.

Student Name:

MIDTERM EXAM

Part 1: EXCEL Application (20%)

1. In an Excel spreadsheet, calculate the (i) Final Grade for each student (NOTE: each exam is
equally weighted, but the Final Exam is the equivalent of two exams, i.e., it is weighted
double); and the (ii) Mean, (iii) Median, (iv), and (v) Standard Deviation of the Final Grade
population for the following data set from a graduate-level statistics course:
Student   Exam 1  Exam 2  Exam 3  Exam 4  Final Exam

Lonny     81      83      88      90      89
Charlie   87      80      87      91      89
Doris     90      88      92      92      93
Rita      76      71      85      90      87
Timmy     81      77      86      91      91
Ray       77      69      77      88      92
Donnie    64      60      71      89      87
Xavier    88      85      89      77      89
Millie    85      89      94      89      88
Alice     91      87      95      96      93

2. Using the same data set, create (i) a scatterplot of all FINAL calculated grades; (ii) a single,
clustered bar graph illustrating EACH student's grade performance; and (iii) a simple line
graph depicting student performance over time (i.e., collective progression from Exam 1
through the Final).

3. Using the same data set, create an ILLUSTRATED regression analysis comparing (i) the Top
Performing Student (Student A) to the Mean, and (ii) the Lowest Performing Student
(Student B) to the Mean. These may be illustrated in one graph (Students A & B v. Mean)
or in separate graphs.
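
As a cross-check on the spreadsheet work for Question 1, the weighted Final Grade and the summary statistics can be sketched in Python. This is only a sketch under stated assumptions: the variable names are illustrative, and the standard deviation is taken as the population form because the question speaks of the "population" of final grades.

```python
import statistics

# Scores from the Part 1 table: Exam 1-4, then the Final Exam.
scores = {
    "Lonny":   (81, 83, 88, 90, 89),
    "Charlie": (87, 80, 87, 91, 89),
    "Doris":   (90, 88, 92, 92, 93),
    "Rita":    (76, 71, 85, 90, 87),
    "Timmy":   (81, 77, 86, 91, 91),
    "Ray":     (77, 69, 77, 88, 92),
    "Donnie":  (64, 60, 71, 89, 87),
    "Xavier":  (88, 85, 89, 77, 89),
    "Millie":  (85, 89, 94, 89, 88),
    "Alice":   (91, 87, 95, 96, 93),
}


def final_grade(exam_scores):
    """Weighted average in which the Final Exam counts as two exams."""
    *exams, final_exam = exam_scores
    return (sum(exams) + 2 * final_exam) / (len(exams) + 2)


final_grades = {student: final_grade(s) for student, s in scores.items()}
grades = list(final_grades.values())

summary = {
    "mean": statistics.mean(grades),
    "median": statistics.median(grades),
    # Assumption: population standard deviation (divides by N).
    "population_sd": statistics.pstdev(grades),
}

# Candidates for "Student A" and "Student B" in Question 3.
top_student = max(final_grades, key=final_grades.get)
lowest_student = min(final_grades, key=final_grades.get)

for student, grade in final_grades.items():
    print(f"{student:8s} {grade:6.2f}")
print(summary, top_student, lowest_student)
```

In Excel the same weighting would be written as a six-term average, with the Final Exam cell counted twice.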
Part 2: Short Answer (50%)
Please provide the briefest explanation that FULLY answers the following Questions.

1. In data analysis, what do the terms (i) “frequency” and (ii) “mode” mean?

2. What are the three ways to link quantitative and qualitative data?

3. Define the following notations in data science:


a. α: alpha
b. µ: mu
c. β: beta
d. σ: sigma
e. σ²: sigma squared

4. How can we confidently generalize information about a population that was gleaned from a sample?

5. What is the “gambler’s fallacy” in probabilistic reasoning?

6. Define the term “probability”.

7. Define “confidence level” in data science.


The confidence level represents the theoretical ability of the analysis to produce
accurate intervals if you could compute many intervals and knew the value of the
population parameter. For a specific confidence interval from a single study, the
interval either contains the population value or it does not; there is no room for
probabilities other than 0 or 1. You cannot choose between these two possibilities
because the value of the population parameter is unknown.

8. What is the difference between parametric and non-parametric tests?

Parametric tests assume a statistical distribution underlying the data. Certain validity conditions must therefore be met for
the result of a parametric test to be reliable. For example, Student's t-test for two independent samples is reliable only if
each sample follows a normal distribution and the variances are homogeneous.
Non-parametric tests do not rely on any particular distribution. They can therefore be applied even when the parametric
validity conditions are not met.

4200 Connecticut Avenue NW | Washington, DC 20008 | 202.274.5869 | udc.edu
9. What is a “t-test”?

A t-test (also known as Student's t-test) is a tool for evaluating the means of one or two
groups through hypothesis testing. A t-test can be used to determine whether a single group
differs from a known value (a one-sample t-test), whether two groups differ from each other
(an independent two-sample t-test), or whether there is a significant difference in paired
measurements (a paired, or dependent-samples, t-test).

How are t-tests used?

First, define the hypothesis you want to test and set an acceptable risk of drawing the
wrong conclusion. For example, when comparing two groups, you might hypothesize that their
means are equal and set an acceptable probability of concluding that there is a difference
when there is none. Next, calculate the test statistic from your data and compare it with a
theoretical value from the t-distribution. Depending on the result, you either reject or
fail to reject your null hypothesis.

What if I have more than two groups?

You cannot use a t-test.
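
To make the procedure above concrete, here is a minimal sketch of a paired t-test. The choice of data is an assumption made purely for illustration: it compares the Exam 1 and Exam 4 columns from Part 1, which the exam itself does not ask for. The critical value 2.262 is the standard two-sided table value for 9 degrees of freedom at a 0.05 significance level, hard-coded because the Python standard library has no t-distribution.

```python
from math import sqrt
from statistics import mean, stdev

# Illustrative data only: Exam 1 and Exam 4 scores from Part 1, same student order.
exam_1 = [81, 87, 90, 76, 81, 77, 64, 88, 85, 91]
exam_4 = [90, 91, 92, 90, 91, 88, 89, 77, 89, 96]

# Null hypothesis: the mean paired difference is zero.
differences = [after - before for before, after in zip(exam_1, exam_4)]
n = len(differences)
degrees_of_freedom = n - 1

# Test statistic: mean difference divided by its standard error.
standard_error = stdev(differences) / sqrt(n)
t_statistic = mean(differences) / standard_error

# Two-sided critical value for df = 9, alpha = 0.05 (from a t table).
T_CRITICAL_DF9_ALPHA_05 = 2.262
reject_null = abs(t_statistic) > T_CRITICAL_DF9_ALPHA_05

print(f"t = {t_statistic:.3f}, df = {degrees_of_freedom}, reject H0: {reject_null}")
```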

10. Define “p-value” & “z score” and their relationship.


11. Most statistical tests begin by identifying a null hypothesis. The null
hypothesis for the pattern analysis tools (the Analyzing Patterns toolset
and the Mapping Clusters toolset) is Complete Spatial Randomness (CSR),
either of the features themselves or of the values associated with those
features. The z-scores and p-values returned by the pattern analysis tools
tell you whether you can reject that null hypothesis or not. Typically, you
run one of the pattern analysis tools hoping that the z-score and p-value
will indicate that you can reject the null hypothesis, because that would
indicate that, rather than a random pattern, your features (or the values
associated with your features) exhibit statistically significant clustering
or dispersion. The presence of spatial structure, such as clustering, in the
landscape (or in spatial data) means that some underlying spatial process
is at work, and as a geographer or GIS analyst, this is usually what
interests you most.

12. The p-value is a probability. For the pattern analysis tools, it is the
probability that the observed spatial pattern was created by some random
process. When the p-value is very small, it is very unlikely (small
probability) that the observed spatial pattern is the result of random
processes, so you can reject the null hypothesis. You might ask: how small
is small enough? See the discussion below.

13. Z-scores are standard deviations. If, for example, a tool returns a
z-score of +2.5, you would say that the result is 2.5 standard deviations.
Both z-scores and p-values are associated with the standard normal
distribution.

14. Very high or very low (negative) z-scores, associated with very small
p-values, are found in the tails of the normal distribution. When you run a
feature pattern analysis tool and it yields small p-values and either a very
high or a very low z-score, this indicates that the observed spatial pattern
is unlikely to reflect the theoretical random pattern represented by your
null hypothesis (CSR).

15. To reject the null hypothesis, you must make a subjective judgment about
the degree of risk you are willing to accept of being wrong (of falsely
rejecting the null hypothesis). Therefore, before you run the spatial
statistic, you select a confidence level. Typical confidence levels are 90,
95, or 99 percent. A confidence level of 99 percent would be the most
conservative in this case, indicating that you are unwilling to reject the
null hypothesis unless the probability that the pattern was created by
random chance is really small (less than a 1 percent probability).
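
The relationship between a z-score, its p-value, and the chosen confidence level can be sketched with the standard normal distribution from the Python standard library. This is a sketch under one stated assumption: the test is two-sided, so both tails of the distribution count toward the p-value. The function names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def two_sided_p_value(z: float) -> float:
    """Probability of a z-score at least this extreme in either tail."""
    return 2.0 * (1.0 - std_normal.cdf(abs(z)))


def critical_z(confidence_level: float) -> float:
    """Smallest |z| that allows rejecting the null hypothesis (two-sided)."""
    alpha = 1.0 - confidence_level
    return std_normal.inv_cdf(1.0 - alpha / 2.0)


# The z-score of +2.5 used as the example in the text.
p_for_z_2_5 = two_sided_p_value(2.5)

# Critical z-scores for the typical confidence levels named in the text.
critical_values = {level: critical_z(level) for level in (0.90, 0.95, 0.99)}

print(f"p-value for z = 2.5: {p_for_z_2_5:.4f}")
for level, z in critical_values.items():
    print(f"{level:.0%} confidence: reject when |z| > {z:.3f}")
```

Under this assumption, a z-score of 2.5 clears the 95 percent threshold but not the 99 percent threshold, which is why the higher confidence level is the more conservative choice.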

16. Define the term “degrees of freedom”.


Degrees of freedom are the number of independent values that a statistical analysis can estimate.
You can also think of them as the number of values that are free to vary as you estimate parameters.
Degrees of freedom (DF) capture the idea that the amount of independent information you have limits
the number of parameters that you can estimate. Typically, the degrees of freedom equal your sample
size minus the number of parameters you need to calculate during an analysis. The result is usually
a positive whole number.

Degrees of freedom is a combination of how much data you have and how many parameters you
need to estimate. It indicates how much independent information goes into a parameter estimate. In
this vein, it’s easy to see that you want a lot of information to go into parameter estimates to obtain
more precise estimates and more powerful hypothesis tests. So, you want many DF!

Degrees of freedom are the number of observations in a data set that vary randomly and
independently, minus the observations that are constrained by those arbitrary values.
In other words, degrees of freedom are the number of observations that are purely free (free to
vary) when we estimate parameters.

To determine degrees of freedom, we mainly distinguish between statistics that use population
parameters and those that use sample parameters. For the mean and standard deviation, the
distinction is this: with population parameters all N observations are free to vary, whereas a
sample standard deviation is computed around the estimated sample mean and so has n - 1
degrees of freedom.
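
A short sketch can show how degrees of freedom change the divisor in a variance calculation. The data choice is an assumption for illustration only: it reuses the Final Exam column from Part 1. Treated as a population, the sum of squared deviations is divided by N; treated as a sample, it is divided by n - 1 because one degree of freedom is spent estimating the mean.

```python
import statistics

# Illustrative data only: Final Exam scores from the Part 1 table.
final_exam_scores = [89, 89, 93, 87, 91, 92, 87, 89, 88, 93]

n = len(final_exam_scores)
mean_score = statistics.fmean(final_exam_scores)
sum_of_squares = sum((x - mean_score) ** 2 for x in final_exam_scores)

# Population variance: all N observations are free to vary.
population_variance = sum_of_squares / n

# Sample variance: the mean was estimated from the data, leaving n - 1 DF.
degrees_of_freedom = n - 1
sample_variance = sum_of_squares / degrees_of_freedom

print(f"N = {n}, DF = {degrees_of_freedom}")
print(f"population variance = {population_variance:.3f}")
print(f"sample variance     = {sample_variance:.3f}")
```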

17. In each of these statements, tell whether descriptive or inferential statistics have been used.

a. By 2040 at least 3.5 billion people will run short of water (World Future Society). Inferential
b. Nine out of ten on-the-job fatalities are men (Source: USA TODAY Weekend). Descriptive
c. Expenditures for the cable industry were $5.66 billion in 1996 (Source: USA TODAY). Descriptive
d. The median household income for people aged 25–34 is $35,888 (Source: USA TODAY). Descriptive
e. Allergy therapy makes bees go away (Source: Prevention). Inferential

18. Classify each sample as random, systematic, stratified, or cluster.


a. In a large school district, all teachers from two buildings are interviewed to determine whether
they believe the students have less homework to do now than in previous years. Cluster
Sampling.
b. Every seventh customer entering a shopping mall is asked to select her or his favorite store.
Systematic Sampling.
c. Nursing supervisors are selected using random numbers to determine annual salaries. Simple
random sampling.
d. Every 100th hamburger manufactured is checked to determine its fat content. Systematic Sampling.
e. Mail carriers of a large city are divided into four groups according to gender (male or female) and
according to whether they walk or ride on their routes. Then 10 are selected from each group and
interviewed to determine whether they have been bitten by a dog in the last year. Stratified
sampling.

19. The average quantitative GRE scores for the top 30 graduate schools of engineering are listed. Construct a
grouped frequency distribution and a cumulative frequency distribution with 5 classes.

767 770 761 760 771 768 776 771 756 770 763 760 747 766 754 771 771 778 766 762 780 750 746 764
769 759 757 753 758 746
https://www.chegg.com/homework-help/questions-and-answers/760-11-gre-scores-top-ranked-engineering-schools-average-quantitative-gre-scores-top-30-gr-q88567273
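
The grouped and cumulative frequency distributions for the GRE scores can be sketched as follows. The class-width rule is an assumption: the number of distinct integer values in the range is divided by the number of classes and rounded up, which gives a width of 7 here. Other conventions for choosing class limits would give a different but equally valid table.

```python
import math
from itertools import accumulate

gre_scores = [
    767, 770, 761, 760, 771, 768, 776, 771, 756, 770,
    763, 760, 747, 766, 754, 771, 771, 778, 766, 762,
    780, 750, 746, 764, 769, 759, 757, 753, 758, 746,
]

num_classes = 5
lowest, highest = min(gre_scores), max(gre_scores)

# Assumed rule: width covers every integer value from lowest to highest.
class_width = math.ceil((highest - lowest + 1) / num_classes)

# Inclusive class limits, e.g. (746, 752), (753, 759), ...
class_limits = [
    (lowest + i * class_width, lowest + (i + 1) * class_width - 1)
    for i in range(num_classes)
]

frequencies = [
    sum(1 for score in gre_scores if lower <= score <= upper)
    for lower, upper in class_limits
]
cumulative_frequencies = list(accumulate(frequencies))

print("Class limits  Frequency  Cumulative")
for (lower, upper), f, cf in zip(class_limits, frequencies, cumulative_frequencies):
    print(f"{lower}-{upper}       {f:5d}      {cf:5d}")
```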

20. In a recent survey, 3 in 10 people indicated that they are likely to leave their jobs when the economy
improves. Of those surveyed, 34% indicated that they would make a career change, 29% want a new job
in the same industry, 21% are going to start a business, and 16% are going to retire. Make a pie chart for
the data.

21. Classify each as nominal-level, ordinal-level, interval-level, or ratio-level measurement.


a. Pages in the 25 best-selling mystery novels. Ratio
b. Rankings of golfers in a tournament. Ordinal
c. Temperatures inside 10 pizza ovens. Interval
d. Categories of magazines in a physician’s office (sports, women’s, health, men’s, news). Nominal
e. Salaries of the coaches in the NFL. Ratio

22. What is a “coefficient of variation”?

23. Choose one of the 50 states at random.


a. What is the probability that it begins with M?
b. What is the probability that it doesn’t begin with a vowel?
https://www.chegg.com/homework-help/questions-and-answers/question-3-selecting-us-state-choose-one-50-states-random--sample-space-type-answer-b-prob-q98498288

24. In building new homes, a contractor finds that the probability of a home buyer selecting a two-car garage
is 0.70 and of selecting a one-car garage is 0.20. Find the probability that the buyer will select no garage.
The builder does not build houses with three-car or more garages.
https://www.chegg.com/homework-help/mathzone-access-card-for-math-in-our-world-2nd-edition-chapter-11.7-problem-30e-solution-9780077314118

25. The composition of the Senate of the 111th Congress is 41 Republicans, 2 Independents, and 57 Democrats. A
new committee is being formed to study ways to benefit the arts in education.
a. If 3 Senators are selected at random to head the committee, what is the probability that they will
all be Republicans?
b. What is the probability that they will all be Democrats?
c. What is the probability that there will be 1 from each party, including the Independent?

https://www.chegg.com/homework-help/questions-and-answers/help-3-questions-please-sorry-shotty-photoshop-skills-think-get-point-q19045788
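
The Senate committee question is a counting problem: each probability is the number of favorable 3-person selections divided by the number of all possible 3-person selections. A minimal sketch, assuming selection without replacement and that order does not matter (variable names are illustrative):

```python
from fractions import Fraction
from math import comb

republicans, independents, democrats = 41, 2, 57
total_senators = republicans + independents + democrats
committee_size = 3

# Every unordered group of 3 senators is equally likely.
all_committees = comb(total_senators, committee_size)

p_all_republican = Fraction(comb(republicans, committee_size), all_committees)
p_all_democrat = Fraction(comb(democrats, committee_size), all_committees)

# One from each group: choose 1 Republican, 1 Independent, 1 Democrat.
p_one_from_each = Fraction(republicans * independents * democrats, all_committees)

for label, p in [
    ("all Republicans", p_all_republican),
    ("all Democrats", p_all_democrat),
    ("one from each party", p_one_from_each),
]:
    print(f"P({label}) = {p} = {float(p):.4f}")
```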

26. Approximately 10.3% of American high school students drop out of school before graduation. Choose 10
students entering high school at random. Find the probability that
a. No more than two drop out
b. At least 6 graduate
c. All 10 stay in school and graduate
https://www.chegg.com/homework-help/questions-and-answers/approximately-103-american-high-school-students-drop-school-graduation-choose-10-students--q87719579
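
The dropout question fits a binomial model if each of the 10 students is assumed to drop out independently with probability 0.103. That independence is an assumption of the sketch below, and the helper name is illustrative.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_students = 10
p_drop = 0.103  # probability that one student drops out

# a. No more than two drop out: X = 0, 1, or 2 dropouts.
p_at_most_two_drop = sum(binomial_pmf(k, n_students, p_drop) for k in range(3))

# b. At least 6 graduate is the same event as at most 4 dropping out.
p_at_least_six_graduate = sum(binomial_pmf(k, n_students, p_drop) for k in range(5))

# c. All 10 graduate: zero dropouts.
p_all_graduate = binomial_pmf(0, n_students, p_drop)

print(f"a. {p_at_most_two_drop:.4f}")
print(f"b. {p_at_least_six_graduate:.4f}")
print(f"c. {p_all_graduate:.4f}")
```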

27. Find the probability for each.


a. P(0 < z < 2.32)
b. P(z < 1.65)
c. P(z > 1.91)

https://www.chegg.com/homework-help/questions-and-answers/find-probability-p-0-q38586535

28. In the standard normal distribution, find the values of z for the 75th, 80th, and 92nd percentiles.
https://www.chegg.com/homework-help/questions-and-answers/need-answers-important-minitab-application-minitab-please-answer-find-z-value-minitab-stan-q27069865?trackid=a04fcc798c3e&strackid=55831424652b
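
Both standard normal questions, the three probabilities and the three percentiles, can be checked with the cumulative distribution function and its inverse. This is a sketch using `statistics.NormalDist` from the Python standard library in place of a printed z table; results may differ from table lookups in the last decimal place.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Areas under the curve, from the cumulative distribution function.
p_between_0_and_2_32 = z.cdf(2.32) - z.cdf(0)
p_below_1_65 = z.cdf(1.65)
p_above_1_91 = 1 - z.cdf(1.91)

# Percentiles go the other way: from an area back to a z value.
percentile_z = {p: z.inv_cdf(p / 100) for p in (75, 80, 92)}

print(f"P(0 < z < 2.32) = {p_between_0_and_2_32:.4f}")
print(f"P(z < 1.65)     = {p_below_1_65:.4f}")
print(f"P(z > 1.91)     = {p_above_1_91:.4f}")
for p, value in percentile_z.items():
    print(f"{p}th percentile: z = {value:.2f}")
```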

29. The average credit card debt for college seniors is $3262. If the debt is normally distributed with a
standard deviation of $1100, find these probabilities.
a. That the senior owes at least $1000
b. That the senior owes more than $4000
c. That the senior owes between $3000 and $4000

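30. Here we have
31. μ = 3262 and σ = 1100, so z = (X - μ)/σ = (X - 3262)/1100

32. (a)
33. z-score for X = 1000 is
34. z = (1000 - 3262)/1100 = -2.06
35. So the probability that the senior owes at least $1000 is
36. P(X ≥ 1000) = P(z ≥ -2.06) = 1 - 0.0197 = 0.9803

37. (b)
38. z-score for X = 4000 is
39. z = (4000 - 3262)/1100 = 0.67
40. So the probability that the senior owes more than $4000 is
41. P(X > 4000) = P(z > 0.67) = 1 - 0.7486 = 0.2514

42. (c)
43. z-score for X = 4000 is
44. z = (4000 - 3262)/1100 = 0.67
45. z-score for X = 3000 is
46. z = (3000 - 3262)/1100 = -0.24
47. So required probability is
48. P(3000 < X < 4000) = P(-0.24 < z < 0.67) = 0.7486 - 0.4052 = 0.3434
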

https://www.chegg.com/homework-help/elementary-statistics-9th-edition-chapter-6.2-problem-11e-solution-9780078136337
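
The credit card debt probabilities can also be computed directly from a normal distribution with mean 3262 and standard deviation 1100. This is a sketch using `statistics.NormalDist`; because it does not round z to two decimal places, its results can differ in the third or fourth decimal from answers read off a printed z table.

```python
from statistics import NormalDist

debt = NormalDist(mu=3262, sigma=1100)  # senior credit card debt, in dollars


def z_score(x: float) -> float:
    """Number of standard deviations that x lies from the mean."""
    return (x - debt.mean) / debt.stdev


# a. Owes at least $1000.
p_at_least_1000 = 1 - debt.cdf(1000)

# b. Owes more than $4000.
p_more_than_4000 = 1 - debt.cdf(4000)

# c. Owes between $3000 and $4000.
p_between_3000_and_4000 = debt.cdf(4000) - debt.cdf(3000)

for x in (1000, 3000, 4000):
    print(f"z for X = {x}: {z_score(x):.2f}")
print(f"a. {p_at_least_1000:.4f}")
print(f"b. {p_more_than_4000:.4f}")
print(f"c. {p_between_3000_and_4000:.4f}")
```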
49. For the following, find the variance and the standard deviation:

Eighty randomly selected batteries were tested to determine their lifetimes (in hours). The following
frequency distribution was obtained.

Class boundaries Frequency

62.5–73.5 5
73.5–84.5 14
84.5–95.5 18
95.5–106.5 25
106.5–117.5 12
117.5–128.5 6

https://www.chegg.com/homework-help/questions-and-answers/class-frequency-625-735-5-735-845-14-845-955-18-955-1065-25-1065-1175-12-1175-1285-6-find--q79140215
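
For grouped data, the variance is estimated from class midpoints weighted by their frequencies. The sketch below assumes the 80 batteries are a sample, so it uses the n - 1 divisor; with a population divisor the results would be slightly smaller.

```python
from math import sqrt

# (lower boundary, upper boundary, frequency) from the battery lifetime table.
classes = [
    (62.5, 73.5, 5),
    (73.5, 84.5, 14),
    (84.5, 95.5, 18),
    (95.5, 106.5, 25),
    (106.5, 117.5, 12),
    (117.5, 128.5, 6),
]

# Each class is represented by its midpoint.
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
frequencies = [f for _, _, f in classes]

n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
sum_fx2 = sum(f * x**2 for f, x in zip(frequencies, midpoints))

# Shortcut formula for the sample variance of grouped data (assumes a sample).
sample_variance = (n * sum_fx2 - sum_fx**2) / (n * (n - 1))
sample_sd = sqrt(sample_variance)

print(f"n = {n}, sum f*x = {sum_fx}, sum f*x^2 = {sum_fx2}")
print(f"variance = {sample_variance:.1f}, standard deviation = {sample_sd:.1f}")
```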

Part 3. Statistical Application 1 (15%)

In a recent study of 35 ninth-grade students, the mean number of hours per week that they played video games was 16.6.
The standard deviation of the population was 2.8.

1. Find the 95% confidence interval of the mean of the time playing video games.
2. Calculate the margin of error.
a. The Confidence interval is (15.67, 17.53)

b. The margin of error is 0.93

https://www.chegg.com/homework-help/questions-and-answers/recent-study-35-ninth-grade-students-mean-number-hours-per-week-played-video-games-166-sta-q103046161
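
Because the population standard deviation is given, the interval in Part 3 is a z interval: the margin of error is the critical z value times σ/√n. A minimal sketch of that calculation, with illustrative variable names:

```python
from math import sqrt
from statistics import NormalDist

n = 35                  # ninth-grade students in the study
sample_mean = 16.6      # mean hours per week playing video games
population_sd = 2.8     # known population standard deviation
confidence_level = 0.95

# Two-sided critical value: about 1.96 for 95% confidence.
alpha = 1 - confidence_level
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

margin_of_error = z_critical * population_sd / sqrt(n)
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)

print(f"margin of error = {margin_of_error:.2f}")
print(f"95% CI = ({confidence_interval[0]:.2f}, {confidence_interval[1]:.2f})")
```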

Part 4: Statistical Application 2 (15%)
Imagine a scenario where you’ve been asked to be the gatekeeper (and paid very well for your duty) for a
series of scientific experiments to determine “truthfulness” in humans. A series of Subjects will approach you
and make a single statement; you are to determine the veracity of that statement (for simplicity, assume that
the only options available to you are to determine if a subject’s statement is “true” or “false”, i.e., there are
ONLY TWO OUTCOMES). REVIEW the Probability Equations!
Suppose that you are pre-equipped with knowledge of the following probabilities:
• 82% of all subjects you will interview are “liars”
• 75% of all “liars” will be identified
• 40% of the time “truth-tellers” will be wrongly identified as liars
Please answer the following questions:

(a) What proportion of all subjects will you identify as “liars”?


(b) Of the people you determine to be “liars”, what is the probability that the person is, in fact, lying?
(c) What is the probability that you will make a correct decision?
(d) What is the probability that you will misidentify a “liar” (i.e., you mistakenly believed a “liar”)?

We are given here that:


P( liar ) = 0.82
P( + | liar ) = 0.75 and
P( + | not liar ) = 0.4

a) Using law of total probability,


P(+) = 0.75*0.82 + 0.4*0.18 = 0.687 is the required proportion here.

b) Using Bayes theorem, we have here:


P( liar | +) = 0.75*0.82 /0.687 = 0.8952 is the probability here.

c) P( correct decision ) here is computed as:


= 0.82*0.75 + 0.18*0.6 = 0.723 is the required probability here.

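d) A “liar” is mistakenly believed when the subject is lying but is not identified, so we need P( liar and not + ):

= 0.82*(1 - 0.75) = 0.82*0.25 = 0.205 is the required probability here.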
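
The Part 4 calculations all follow from the four joint probabilities of (liar or truth-teller) and (flagged or believed). The sketch below lays them out explicitly; the variable names are illustrative, and "flagged" stands for the "+" outcome used above. It reports both kinds of mistake separately: a truth-teller flagged as a liar, and a liar who is believed.

```python
p_liar = 0.82
p_flag_given_liar = 0.75        # P(+ | liar)
p_flag_given_truthful = 0.40    # P(+ | not liar)
p_truthful = 1 - p_liar

# Joint probabilities of the four possible outcomes.
p_liar_flagged = p_liar * p_flag_given_liar
p_liar_believed = p_liar * (1 - p_flag_given_liar)
p_truthful_flagged = p_truthful * p_flag_given_truthful
p_truthful_believed = p_truthful * (1 - p_flag_given_truthful)

# (a) Law of total probability: everyone who gets flagged.
p_flagged = p_liar_flagged + p_truthful_flagged

# (b) Bayes' theorem: probability a flagged subject really is lying.
p_liar_given_flagged = p_liar_flagged / p_flagged

# (c) Correct decisions: liars flagged plus truth-tellers believed.
p_correct = p_liar_flagged + p_truthful_believed

print(f"(a) P(+)           = {p_flagged:.3f}")
print(f"(b) P(liar | +)    = {p_liar_given_flagged:.4f}")
print(f"(c) P(correct)     = {p_correct:.3f}")
print(f"liar believed      = {p_liar_believed:.3f}")
print(f"truthful flagged   = {p_truthful_flagged:.3f}")
```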
