
Benjamin Disraeli: There are lies, damned lies, and statistics.

Heck found that 65% of people believed they have above-average intelligence, and others found
that over 80% of drivers considered themselves to be better than average.
Over 90% of professors thought they were above average.
- This is called illusory superiority
Harris = the medical science of homicide
- Advances in medical science are responsible for the declining murder rate rather than any decline in
violence: the same number of people are assaulted, but more of them survive
Things can easily become political.
Something to consider is that the murder rate and the aggravated assault rate have very different
values, ranging roughly from 5 to 10 and from 100 to 4000 respectively. Plotted together, they can look
a lot more different than they actually are.
Stats show that murder rate and aggravated assault have very similar changes over time.
What does this all mean?
Researchers weren’t necessarily wrong and some populations have indeed benefited from
medical science but not enough to modify the national trend.
In terms of vaccines and autism, the studies used very small sample sizes, and as we know, a bad
sample isn't representative of the whole population.
We cannot grab/hold onto one research study that says what we want.
Sometimes repeated studies are not reasonable because they may be expensive or take a long
time.
Chapter 2
Statistics can be broadly defined as the collection, organization, presentation, analysis, and
interpretation of numerical data. A little boring, so another definition could be the activity of
making sense out of the numerical information that is important to our lives.
Most often, we will be reading/listening/interpreting stats other people are providing us.
If you drink and smoke, you increase your risk of colon cancer by about 400 percent.
Baseline stats matter because if the baseline probability was 0.1 percent, then a 400 percent
increase makes it only 0.4 percent, etc.
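A minimal sketch of that baseline-risk arithmetic in Python (my own illustration, not from the textbook); the numbers are the hypothetical ones from the notes, and "400 percent" is read the same way the notes read it, as the new risk being 4 times the baseline.

# Hypothetical numbers for illustration only.
baseline_risk = 0.001        # 0.1% chance of the outcome without the exposure
relative_factor = 4          # "400 percent" read as: new risk is 4x the baseline

new_risk = baseline_risk * relative_factor
absolute_change = new_risk - baseline_risk

print(f"baseline risk:      {baseline_risk:.1%}")    # 0.1%
print(f"risk with exposure: {new_risk:.1%}")         # 0.4%
print(f"absolute increase:  {absolute_change:.1%}")  # 0.3%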
Data don’t lie, but they don’t always represent what people think they do.
What are data anyway?
Data is plural for datum.
- Data are simply information
- Data doesn’t always mean numbers, it can be from interviews, etc.
For this course, data represent numbers. Trick: data can be anything that can be represented by
numbers.
- Therefore, data can be people, crime, opinions, etc.
How we measure things changes over time.
A variable is something that can be measured and that can change depending on which observation
you are looking at.
Observations are what you are studying: people, places, organizations, things, movements of
things, etc.
Gender can vary from observation to observation, so it is a variable. If it cannot vary, then it is
not a variable, and we are not interested.
If we have two or more variables, we have a data set
By looking at the census data set, people who have post-secondary education have a lower
percentage of those with low income, lower unemployment, and higher average dwelling values.
Population and samples:
Population is the entire group of things/people/places that you want to learn stuff about.
Most of the time, you are gathering data from a sample that you are confident represents
the population of interest.
A sample is just a portion of the population, but you need to know what your population is.
We define what our population is and subsequently how to sample from it.
Descriptive versus inferential statistics
Descriptive statistics literally just describe or provide characteristics of your data, whereas inferential
statistics have the purpose of making inferences from your data set.
DS - find out about your data (averages, medians, modes, etc.)
IS can be used to test hypotheses regarding your data and to make generalizations from your sample to
the population.
DS can be interesting on their own, and sometimes this is all you need to effectively answer
research questions, but IS allow you to test relationships within your data, which lets you
answer questions that you can't answer with DS and then generalize your findings.
Reliability and validity.
Reliability refers to the ability to get the same answers to the same questions. It is not the same
as validity.
Validity is the concept that refers to how well your method measures what it is actually supposed
to measure.
Four primary types of reliability and five primary types of validity.
Four types of reliability: test-retest, inter-rater, parallel forms, and internal consistency
Test-retest: Measuring the same thing repeatedly over time.
- Ex. A questionnaire, followed by the same questionnaire some time later, to see if the group gives the
same answers
Inter-rater reliability measures consistency across assessors.
- Ex. Multiple teachers marking the same paper and giving similar grades.
- Critical in mental health assessments
Parallel forms of reliability measures the similarity of two different forms of the same test.
- Ex. Multiple tests are supposed to measure knowledge in the same way.
- Re-taking an exam in a course and the instructor gives them a different exam
- Need to have the same person write all of the exams and see if they score similarly on
each one
Internal consistency measures how well different parts of a test provide the same information
- Ex. You want to understand overall measure of poverty in neighborhoods so you
measure education, home ownership, etc. These variables can all measure different
aspects of poverty but one alone may not be able to do it.
5 types of validity:
Content validity- Considers the extent to which your measure represents all known dimensions of what
you're measuring
- If you're measuring a trait that has 5 known components but only measure 3 of them,
you don't have content validity
Construct validity- Degree to which your measurement of a concept actually captures that
concept
- If you want to measure misogyny, you must show that your measure corresponds to misogyny itself
and not just to a characteristic of it, such as violence
Criterion validity - When your measure corresponds to another measure of the same concept
- Ex. Being able to use your measure to predict the concept you are measuring. If researching
recidivism, create a measure of it and use it to predict who will reoffend.
Can also be achieved if your measure is highly correlated with another measure that is
known to be valid.
Internal validity- How confident you are that there is a causal link between your concept and
what you are researching. Ex. Neighborhood unemployment leads to an increase in crime.
External validity- Refers to how well your research findings can be applied in other contexts.
Are your results generalizable to other people? Are they relevant to all places, or just specific to your
study location?
- It is usually shown later when more research gives more results.
Assumptions and causality
Ontological assumptions are our assumptions regarding the nature of reality. We have views of how
the world works; these can be religious, scientific, or a combination of both. It is an
assumption about the nature of the world.
Epistemological assumptions relate to what we believe can be known. Can we actually know
anything? What can we really know? Can we measure the things we want to know?
- In many cases, we cannot see or measure how things actually are, but only how they
appear or manifest themselves.
Axiological assumptions really begin to get into the subjectivity of research. This is when we
make explicit assumptions about what we think is important or valuable enough to do research on
- Combines ontological and epistemological assumptions
Methodological assumptions are bigger assumptions of what methods can or should be used
and can be based on the methods you value or what you think will work.
Causality
We need to show that causality is at least possible from our research design
The concept of causality (sometimes referred to as statistical conclusion validity) means we can
manipulate the world and how we do things, outcomes, situations, etc.: we can show that if
we change X, Y will change in a predictable way.
Three important things to show you have causality: covariation, temporal predictability and
eliminating alternative explanations. Covariation and temporal predictability are easier to prove
than ruling out alt. explanations.
Covariation simply means that you can show that two or more variables move in a predictable
way. You can show two or more things move together: if one variable goes up, the other(s) go up
(a positive relationship), or if one goes down, the others go down.
Temporal predictability: If one variable, X, causes another variable, Y, to change, then X has to
change before Y changes. Sometimes statistical results make it seem as though things
happened in temporal order, but you have to double-check with your data and with data
visualization.
Alt. explanations- Is there something else that causes the outcome? There are usually alternative
explanations for any temporal change in crime.
- For the example provided in the textbook, the international crime drop has been shown to exist
around the world and has its beginnings around 1990
- Crime has been falling in most places for 20-30 years
- So if you're trying to evaluate an intervention for a crime drop, you need to consider that crime has
already been dropping.
Moral of the story is to always plot your data.
Research questions, hypotheses, theories, and more!
A research question is the question that guides your research. What is the relationship between
unemployment rate and criminal activity?
How do we come up with research questions? Coming up with a set of research questions is a
process: start with what you're interested in, see what data are out there, consider
gathering your own data, adapt the research question in terms of easier data collection, etc.
Hypothesis: a statement regarding some phenomenon in the world that you want to
understand; that statement, however, has not been proven or substantiated yet. It is simply
your research question phrased as a statement.
Even though hypotheses can be general, it is best when they have some specificity so that they can
be tested fairly easily.
A critical aspect of hypotheses is that they are falsifiable. In order for research to actually be
scientific, any hypotheses, theories, and models must be testable and need to be falsifiable.
It is important to note that hypotheses must be falsifiable from the moment they are first stated.
We can’t say a hypothesis is found to be true, but rather it’s either we cannot reject, or fail to
reject the hypothesis
- This is an acknowledgement that our research study may be wrong but also a statement
of modesty
A law is a piece of scientific knowledge that has repeatedly withstood attempts to falsify it and is usually
expressed as a mathematical equation: the law of gravity, etc. Laws tend to cover more specific situations
Theories are sets of scientific laws that are brought together to explain what is happening.
A theory explains something in the world and has been supported through repeated
testing of the theory.
Theories often begin with a hypothesis that is tested over and over again in multiple contexts.
Theories in the social sciences are much more difficult to show to be general. A good theory will
provide us with information as to how and why things
happen: a change in X will lead to a change in Y
A model is a more simplified version of a theory; the simplification is done to actually
test the theory. For example, a theory may be conceptually simple but difficult to operationalize.
In summary, models do not have to be completely accurate to be useful. When we are able to
measure things better, we can test our theories better. A model can be a simplification of a theory.
Having a theory to work with matters when you want to develop policy to help people or the
environment. If you want to make a change in our physical world, you need to have an idea of
how the world works so you can find something to change (X) in order to impact something else
(Y).
Another important dimension of theory is for policy evaluation.
Chapter 3:
Data, what are they good for?
Theories start from hypotheses that are not rejected; laws are formulated, and these are
brought together to form theories. We then need to take our constructs and measure them
so we can test hypotheses.
Conceptualization is the way in which we produce definitions of constructs in a way that we
can measure in research.
- This is an iterative process within the research world in which researchers go back and
forth and argue how to measure certain concepts
- Researchers may conceptualize a concept differently depending on the person and their
experience
Then we go from conceptualization to operationalization. This is when researchers develop a
set of ways that they will measure theoretical constructs. These could be variables.
Where does data come from?
Two types of data: Primary and secondary data.
Primary data are data that you collect and secondary data are data that someone else has
collected.
Primary data are collected and used by researchers and can be time-consuming and expensive.
A lot of things to consider and a lot of work to be done. However, this type of data gives the
most richness.
- Can help come up with new ways to measure constructs that no existing data has
attempted
Secondary data just means that they were collected by someone else.
- These data can be from someone's primary data collection. If you are working in the
same research area but answering different questions, this is useful.
- Usually deemed easier.
- If it’s disorganized it’ll take a long time to study just like primary data
- Much more ready to be used when downloaded though
- A disadvantage is that you are at the mercy of previous researchers for variables and
constructs and may not measure everything you need
Two types of spatial data: explicitly spatial data and implicitly spatial data
Explicitly spatial data not only have the location of each observation known, but those
locations are directly analyzed or considered in the analysis. The data are described or
analyzed from a spatial perspective.
Implicitly spatial data are also measured in various places, but the location pattern is not
considered in the research
- E.g. crime rates for a city over time, water flow volume in rivers, etc.
Spatial scale of data, or the units of analysis
We can think of this distinction in the context of spatial scale in analyses as well. This can be
broken down into individual level data and aggregated data
Individual level data are the easiest to think about with people.
Each observation of your data represents a person
- E.g. surveys of individual people in any context: social views, victimization, etc.
Individual observations can also be things like a building, as long as the characteristics are characteristics
of the building itself and not about what is inside of it.
- Number of stories, distance from the main road, age, and so on
Spatially aggregated data are very common in social sciences.
- Census data are based on individual level data, but the data are commonly released
aggregated to census tracts
These spatially aggregated data measure the number of people, the number of houses, the average
income of the people who live there, and so on.
SAD are great but there are dangers to this.
- Ecological fallacy: occurs when you make inferences at a level of aggregation smaller
than the one analyzed (e.g. unemployment increases crime at the neighborhood level, so you infer
that individuals who are unemployed will commit crimes)
- The relationship may have nothing to do with the individuals living there and may have more
complicated reasons
- Atomistic fallacy: You assume that what is true of an individual is true of the entire
group
- The important thing is simply not to make inferences at a level of aggregation different from what
you analyzed
The modifiable areal unit problem is the second danger. When you change your units of analysis,
your results can change. If you shift a neighbourhood boundary so that it includes some of both
rich and poor areas, it may come across as a middle-class neighborhood because the values get
averaged out. The problem here is that you can generate any result you want by changing the
boundaries
What can be done about the modifiable areal unit problem? Nothing, really. Just make sure the
spatial aggregations of data you use make sense.
This is difficult because oftentimes the appropriate spatial aggregation may change for each analysis. As
well, quite often when you want to use one spatial aggregation because it makes the most sense,
your data are only available at another aggregation. Two choices here: analyze the data at the
aggregation available to you, or go out and gather your own primary data so you can aggregate
your data the way you want.
Different types of numbers:
Discrete versus continuous.
Discrete numbers can be thought of as counts: the number of people, the number of political
ridings, the number of houses. There is no such thing as half a person.
Continuous variables exist on a continuum or an interval: temperature, height, rainfall, and
weight.
Quantitative vs Qualitative numbers
Qualitative numbers represent constructs such as gender, marriage, land use, political affiliation,
crime hot spot, and so on.
Quantitative numbers measure things/people/places we can count. This is the usual way we
think of numbers. We can then use these numbers to create other numbers.
When using spatially aggregated data, with agreement over definitions, you can count
qualitative numbers
E.g. with individual level data and a social construct for gender, you have a qualitative number for that
person. If you have spatially aggregated data, you can add up the people who
identified with a certain gender, etc.
Point of the story is that it is context dependent.
Different levels of measurements: nominal, ordinal, interval, and ratio (NOIR).
Nominal variables are the qualitative numbers we discussed above.
- Represent different categories
- Gender, marital status, job industry, political affiliation, and so on.
- They are exhaustive: nothing is left out (e.g. are you married?)
- The point is that everything can be assigned to some category with nominal variables, or at least
it should be
Ordinal variables are based on an order or ranking of categories. As the numbers increase, there
will be some way things are better or worse.
- E.g. Likert scale (1-5)
- However, it is important to note that you cannot know how much better one thing is
than another
- Strongly ordered: every observation has its own rank
- Weakly ordered: observations share ranked categories (e.g. income ranges)
Interval variables: The magnitude of the difference between two observations means something, but
the placement of zero is arbitrary.
- Fahrenheit and Celsius
Ratio variables are similar to interval variables but have a true zero.
- A value of zero is actually possible and meaningful
Let’s talk about some actual variables!
We need to normalize variables so we can compare them across different census tracts and see if that
matters for anything
Important to understand the context of your data.
In trying to calculate rates, you can't just take the number of crimes divided by the number of
people in a census tract. We need to take the number of theft-of-vehicle crimes divided by the
population at risk, multiplied by a scalar.
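A small sketch of that rate calculation in Python (my own illustration); the counts and the per-1,000 scalar are made-up values, not from the textbook.

# Made-up numbers for one hypothetical census tract.
vehicle_thefts = 45          # count of theft-of-vehicle crimes
vehicles_at_risk = 12_500    # population at risk (vehicles, not residents)
scalar = 1_000               # report the rate per 1,000 vehicles

rate = vehicle_thefts / vehicles_at_risk * scalar
print(f"{rate:.1f} thefts per {scalar} vehicles at risk")   # 3.6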
Chapter 4: Sampling
One way of gathering data is sampling! We sample because it can be really expensive in terms
of time and money to get information from a population.
We sample, then, to be able to make statements about the population with some given level of
certainty. We want our data to be representative of the population.
Sampling is efficient and cost-effective. We can get a lot of information from our samples rather
quickly and then analyze those data way before we would even get close to completing a
census of the entire population.
We expect to have data that represent the population, but sometimes data can have some
odd or unexpected results. So, you gather a second data set to double-check. If you do
sampling correctly, the chances of getting two strange samples are really low
Accuracy vs precision
Accuracy is hitting what you are aiming at when you throw a dart. Precision is having your darts land in
the same place all the time.
We save time and money by sampling instead of using the entire population; we just get a few
samples.
Sometimes if you know how your sampling gets it wrong, you can fix it. You can adjust the numbers to
make sure the result ends up being accurate as well.
Sampling error:
The main goal of sampling is to have a sample that is representative of the population. That
means the sample has the same characteristics as the population.
Samples are samples, so it's quite likely that errors are going to be present. What matters is whether
those errors are a big enough problem to matter. If you sample correctly, they won't be. Larger
samples minimize sampling error.
Small samples are not just imprecise but also potentially inaccurate. We need a sample size that is
beyond a specific threshold, depending on the population you're sampling from. Even though
your precision will increase with a larger sample, your accuracy may not. There might be bias
built into your sampling method, even with random sampling, or you may simply be missing
under-represented groups.
- E.g. Homeless people not included in censuses
Types of probability sampling:
Random sampling: every entity has the same probability of being selected into the sample. The
larger the sample, the more precise any set of random samples will be, and the larger your sample the
more representative your sample will be of the rest of the population.
Systematic sampling is a fairly common method of doing sampling. You sample every kth entity
and then obtain information from that entity, so there is a 1 in k chance of being selected
- E.g. Every 50th person in a phone book
Stratified sampling is also a fairly common sampling method.
- Strata: identifiable groups of places, people, things, and so on in a target population
- In the context of areas and the census: neighborhoods, etc.
- Then you divide the data into the corresponding groups, and within each of these
strata you randomly or systematically sample.
Why would you want to do stratified sampling? Larger samples may not be enough. You can
miss people even with a big sample. This is where proportional and disproportional
sampling come in.
Disproportional sampling -> you are interested in a particular group and want more of them in your
data set
Proportional sampling -> you want the same proportions of all sub-groups in your sample as in the population
The great thing about disproportional sampling is that by having more people from the subgroup of
interest, even though each is weighted less, you get so much more information about that subgroup
(see the sketch below)
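A rough sketch of proportional stratified sampling plus an oversample, using only Python's standard library; the strata names, sizes, and sample sizes are all invented for illustration.

import random

random.seed(42)

# Hypothetical population split into strata (e.g. neighbourhoods).
strata = {
    "Neighbourhood A": list(range(0, 6000)),      # 6,000 people
    "Neighbourhood B": list(range(6000, 9000)),   # 3,000 people
    "Neighbourhood C": list(range(9000, 10000)),  # 1,000 people
}
total = sum(len(people) for people in strata.values())
sample_size = 500

# Proportional: each stratum contributes according to its share of the population.
proportional_sample = []
for name, people in strata.items():
    k = round(sample_size * len(people) / total)
    proportional_sample.extend(random.sample(people, k))
print(len(proportional_sample))                    # 500, split 300 / 150 / 50

# Disproportional: oversample the small stratum you care about,
# then weight it back down at analysis time.
oversample_c = random.sample(strata["Neighbourhood C"], 200)
print(len(oversample_c))                           # 200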
Cluster sampling - You select strata at random and then only sample within those. This is typically used
when you have a large study region/population and a limited budget
- This can be problematic if the strata you picked are not representative
- E.g. you can either sample from all minority groups (stratified sampling) or you could
randomly select a handful of minority groups and then sample within those. The
problem is that a minority group that is randomly selected may have experienced more or
less discrimination than others.
Spatial sampling is used for spatially continuous variables. The best examples are precipitation,
pollution, temperature, and so on. You can measure temperature in one place and in another.
There are also hybrid approaches to sampling. You can combine sampling methods (like stratified
and cluster sampling) in a way that works for your research and budget.
Some types of non-probability sampling:
For populations that are difficult to collect data on, you may collect data using non-probability sampling
Opportunity sampling: Sample from easily accessible individuals.
- Go to a parking lot and talk to people who will talk to you
Using this type of sampling can be instructive for preliminary research but you must be cautious
if you are making inferences as you cannot say that your data represents a larger population.
Snowball sampling:
More common in qualitative research but can help gain more quantitative data as well.
The researcher makes initial contact with a small group of individuals by building trust. These
populations tend to be small and/or unwilling to fill out surveys without first establishing trust.
- Sex workers, people involved in drugs, organized crime, etc.
It is called snowball sampling because you start small (even if it's one individual) and then
move on to more and more people through referrals until you either speak with everyone or have
enough information.
Quota sampling - Targets individuals who appear to be friendly and accessible, willing to answer
a quick survey. Used in market research and opinion polls; it can be proportionately
representative across different strata but is subject to bias toward more accessible people.
The non-probability sampling methods listed above are not generalizable to the population
A last important note: when you make generalizations to a population, make sure it is
the population that the sample came from.
It is important to note that you cannot generally make generalizations about the population today.
Rather, you are making generalizations regarding the population at the time the data were collected. So
we need to continue to update research and do replication.
Chapter 5:
Important step is to look at your data before analyzing it.
Visualizing your data in a numerical fashion
Oftentimes people accidentally mis-enter their data and that creates errors, sometimes big and
sometimes small.
You can check your results with other results that have been used for the data.
What can you do if you have an error? If you have the raw data, go back and check that.
Knowing the median, minimum, maximum, and average
Side Note: High crime areas are rare, and most places have low crime.
Visualizing your data in a graphical fashion:
The two most common forms of graphing data are graphing one variable at a time and graphing
two variables at a time to see if there is a relationship.
The first form of graphing, one variable at a time, is a visual method to point out any errors or
strange values and to understand what exactly you’re looking at
- Histogram is probably one of the most common graphs you will see.
How do you decide how many vertical bars (or bins) you have in a histogram? Which
ranges should you use for each of the bars in the histogram? (See the sketch after this list.)
- How smooth do you want the histogram to look?
- If there are too many bins, the histogram looks noisy and cluttered
- If there are too few, we won't get enough information regarding the distribution of our
variable
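A quick sketch of how the bin count changes a histogram, using numpy (assumed available); the "crime counts" are randomly generated, not the Vancouver data the textbook uses.

import numpy as np

rng = np.random.default_rng(0)
crime_counts = rng.poisson(lam=200, size=500)   # fake census-tract crime counts

for bins in (5, 20, 100):
    counts, edges = np.histogram(crime_counts, bins=bins)
    print(f"{bins:>3} bins -> bin width {edges[1] - edges[0]:.1f}, "
          f"tallest bar holds {counts.max()} tracts")
# Few bins: smooth but coarse; many bins: detailed but noisy.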
Almost all census tracts in Vancouver have total property crime counts less than 605. Important
to note is that we are using counts of crime here to visualize our data. Most often, we would
use crime rates, normalized with the population, for analyses.
Histograms are great when we want to visualize continuous variables, or at least interval or
ratio variables.
What if we have ordinal and nominal data?
- These graphs measure counts or percentages of individual values, rather than ranges.
- There will typically not be as many categories with these techniques
- The first is a bar graph and the second is a pie chart
Bar chart represents property crime rate classifications
Each slice of pie in the pie chart represents the frequency of each category. It's good for
representing categorical data (nominal or ordinal), but if the number of categories is too high,
it gets hard to read. Making too many categories and making them too specific can only complicate
your analyses
Graphs that show relationships between two variables are called scatterplots
The scatterplot shows some type of correlation between low income and governmental assistance.
It is important to remember that government assistance may be unemployment insurance itself;
however, most often, government assistance does represent social support and
corresponding poverty. So it makes sense for low income and governmental support to be
correlated
Positive relationship -> Two or more variables that move in the same direction
You can also describe one variable decreasing as another variable decreases as a positive
relationship.
Negative relationship - Values of the variables move in opposite directions
E.g. as the unemployment rate increases, the average dwelling value in a census tract decreases
Why doesn’t the average dwelling value fall bellow 200,000 for two of them and is close to
600,000 for the other?
- Vancouver is an expensive city even in the poorest neighborhoods
- Going through gentrification and being turned into residential neighborhoods
- People get displaced. If the costs of everything around you increases substantially, you
may not be able to afford to live in your neighborhood anymore
Relationships can start off positive and then become negative, and vice versa: this would be
called a quadratic relationship, shaped like the letter U or an upside-down U
We need to be careful when we plot data and find relationships like this. It may be fine, but we may
be trying too hard to find a pattern that simply isn't there.
Areas with highest percentages of post-secondary education are in the neighborhoods of
Kitsilano and in the West end.
We need to approach these explanations with caution. Just because an explanation makes
sense does not mean that it is true, and vice versa.
Relationships are sometimes complex. A lot of statistical techniques that are covered will
assume linear relationships, but it is very easy to find nonlinear relationships.
Key takeaway is to know your data and relationships.
How to lie and mislead with your data
It is possible to frame things in a way that most people won’t question what you have said
because you have numbers behind what you are stating.
First example: discussing a change in the values of a variable. If a study found that consuming
product X leads to an increase in the risk of Y by 400 percent, what would you like to do? Probably
stop using X. But you need to ask at least two questions. First, what is the BASELINE PROBABILITY of Y?
More specifically, you would ask: what are the chances of the horrible thing happening without X?
What if the risk of Y is 0.5 percent? You are only at a 2% risk now.
Question 2: Are there any benefits of consuming X that I should consider?
- You want to know if the good outweighs the bad
It is important to remember that this applies to good outcomes as well. Many medications
provide health benefits but have terrible side effects and are just not worth the risk.
Another lie: Let’s say you want convince people that your product is the one people should
choose. How do you do it? 4 out of 5 doctors recommend product X! This may be misleading
depending on how you ask the question. Did the company really ask a bunch of experts?
Probably not. Go out and ask a bunch of experts to list their top 5 favorite products. If your
product makes it into that list of 5, 4 out of 5 times, then you can make that claim. How many
product choices are there to even choose from? It isn’t lying but it defs is misleading
Another one: correlation is not causation! Correlation means that two variables move together
in a predictable way: if one goes up, the other predictably goes up (or down).
The first claim about the whole government assistance-unemployment correlation is that giving
people social assistance makes them poor. Many people assume this because of the correlation
between these two variables.
The second claim is that government assistance payments given to those who need social
supports lead to a cycle of poverty.
One reason why we may have a spurious correlation is that both of the variables we are
looking at are really effects of something else. Maybe there is even a third variable there!
Key takeaway is to keep asking questions until the answer makes sense.
Misleading with the modifiable areal unit problem (changing the spatial scale of your analysis): you can
change it to get any aggregation you want. Gerrymandering is a method of changing the
aggregation/scale to give a different result. It isn't a lie, but it kind of is.
A similar and last aspect of lying is changing a scale on a graph. For example, changing the scale
on the vertical axis may, at a quick glance, make things appear more likely to be happening or not
to be happening as much.
Purpose of this chapter is to be aware of what you hear and what you see when you are
presented data.
Chapter 6: Mapping data
Some characteristics of good maps:
Every map needs a title. It also needs a north arrow, a scale (bar), and a legend. The north arrow is
simply so you know which way is north. The scale bar is so people can know the size of the area you
are looking at. Most people use a scale bar. The legend is critical because otherwise who knows
what is going on? Legends can represent categories or nominal variables and can also represent
ordinal, interval, and ratio data.
What happens when you change your legend?
The legend has great importance and can even impact your results. High crime (red) is in the
downtown area and the east side. From this high crime area, crime decreases as you get further
away into the residential neighborhoods of the city. This is usually referred to as distance decay.
If you changed the red color to different variations of green, with light green now representing higher
crime, the map looks less dramatic than what it actually is.
You don’t want to have too many legend categories in the legend. This is because it makes it
difficult for us to meaningfully interpret the map.
The nice thing with 7 legend categories is that you can see some variation in the residential
neighborhoods. All the maps listed are technically correct. The only thing that changes is the
representation.
Modifiable area unit problem in action!
MAUP is a generic term for the potentially problematic changes that occur when you use
different scales of analysis in your research. You can switch from census tracts to official
neighborhood scales, which would generate different maps that tell different stories.
Why does this happen?
Even though the legend wasn't manipulated, the legend categories will be different and will have
different thresholds. Crime that was spread over small census tracts is now aggregated into
official neighborhoods. So if there was a census tract with a lot of crime that was surrounded by
tracts with little crime, now the entire neighborhood could look like it has a lot of crime.
- This is ecological and atomistic fallacy territory
It’s all about changing the scale of analysis: census tracts to official city neighborhoods.
Cartograms change the shape of each spatial unit based on other variables.
The interesting thing with cartograms is that you can see the variation across spatial units with the
same legend categories. They are quite common in political science and the reporting of
elections, to show how smaller places with large populations impact things.
Anyway, changing the spatial unit of analysis, changing the number of legend categories, and
changing the ranges within the legend categories can alter the representation to tell different
stories.
- Ask questions about why data are presented in the manner that they are presented.
Different levels of measurement: you can map nominal and ordinal variables as well. If you do
not have many categories in your ordinal variable, it makes things like the number of legend
categories less subjective and harder to manipulate.
Nominal variables can be mapped as well. Because the variable is nominal, it is not really
susceptible to manipulation with legend categories, but changing the scale of analysis can still impact
the presentation of nominal variables. Manipulation of maps can be done with any variable that
is explicitly spatial (you know its location)
Hot spot Maps
These dumbass maps are everywhere. Hot spot maps are often instructive but are often
misused.
A value is calculated for every area on your map. First, the study area is divided into a number
of different grid cells: you place a grid over your research area. It's similar to selecting the size of bins
for a histogram. You take your location (point or area) and specify the bandwidth. You select all
the events within the bandwidth of your location, and they are used to calculate a value called a kernel.
Important to notice the legend categories labelled low density to high density.
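A minimal single-kernel density sketch using scipy's gaussian_kde (assumed available); the event locations are randomly generated, and the bandwidth is scipy's default rather than a hand-picked value like the one described above.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Fake event locations (e.g. crimes), clustered around one spot in the study area.
events = rng.normal(loc=[5.0, 5.0], scale=1.0, size=(200, 2))

kde = gaussian_kde(events.T)   # single kernel: events only, no population at risk

# Evaluate the density surface on a grid of cells covering the study area.
xs, ys = np.meshgrid(np.linspace(0, 10, 50), np.linspace(0, 10, 50))
density = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)

print(density.min(), density.max())   # the "low density" to "high density" legend range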
Problems with hot spots:
These maps are most often calculated only using the data of interest (e.g. crime events).
Downtown is obviously filled with a lot of people who go there at night and on the weekend,
so obviously more crime will happen there. You have to normalize the hot spot with the
equivalent of a population at risk, to make it like a crime rate.
Simply basing a map on one set of events is called a single kernel density. Remember that if
you're looking at a hot spot map, you are likely looking at a single kernel density, so interpret it
that way unless it says it's a dual kernel.
A second issue is that kernel density generates a surface from a set of points. So, you need to ask
if you are studying something that can happen everywhere.
It is quite probable that if you're making a hot spot map, you will assign a positive value to an area
that has no crime, but has some crime nearby.
In conclusion, hot spot maps are not good for a detailed analysis of places.
Data visualization is an important aspect of any statistical analysis.

Chapter 7:
Probability:
If you have ever said ‘the chances of’, ‘how likely is’, or ‘I bet’, you were talking about probability. Sometimes
predictions are almost certain: you drive or take the bus to campus because you expect it will
be faster than walking.
The further you get away from campus, the more likely this is to be true.
Fundamental question: What is the likelihood that one outcome will occur, relative to all other
possible outcomes?
A probability is a number between zero and one, where zero refers to an event that never
occurs, and one to an event that definitely occurs.
Probabilities are central to both sampling and hypothesis testing.
Three components of Probability Theory:
We often, but not always, choose the option with the highest probability. We will figure out the
probability of the thing we want to happen and then decide if it is worth it to proceed.
The first term is deterministic process: a pattern (over time, space, and individuals) that has
certainty. If you know the predictor variables, you can use them to predict what will happen perfectly.
- Some people argue that deterministic processes do not exist and we aren't clever
enough to measure things properly
The second process is a random process: each possible outcome has an equal probability. You
can't do better than a random guess from the possible values when predicting the outcome.
- E.g. a six-sided die has an equal probability of landing on each side
- 1/6 no matter what (often called theoretical probability)
A stochastic process is a combination of the deterministic process and the random process.
- E.g. a loaded die. It has extra mass on one of the sides to make one of the numbers
come up more often. Therefore, while you think you are working with a random process,
there is actually an underlying factor that makes one of the numbers more likely to
happen
Working with stochastic processes is what most people who practice statistics do. If
things were purely deterministic or random, there would be less work for statisticians.
Trying to get better predictions can lead to the development of new statistical techniques.
The medical and health literatures are particularly concerned with risk: smoking increases the
risk of disease by this much. They will never say that some activity will give you a disease,
because there is a random component to it.
If the probability of something happening is close to zero, doubling it won’t be a big deal. If the
probability is already 5 percent, doubling the probability is much scarier
An experiment is an action or process that generates observations, or data we can analyze.
Continually rolling a die 50 times to get a set of numbers, for example.
A sample space is the set of all possible outcomes of the experiment. An event is any collection of
outcomes from the sample space. This could be rolling a die 20 times, or it could be going to a hospital
and recording 3000 births. An outcome is one of the observations obtained from the experiment.
Think of the sample space as your population, an event as a sample, and an outcome as one of your
individual observations.
A little bit of set theory and probability
Set theory is just a branch of mathematics that defines relationships between sets of
things/numbers/and so on. A set is yet another word for an event.
S refers to the sample space. 4 terms we need to define:
Union- Joining two or more sets together. The mathematical notation for this is
- A ∪ B
- Includes all of the observations from both sets, whether they overlap or not
An intersection is the set of outcomes that are in all the sets considered
- A ∩ B refers to all the outcomes that are in BOTH A and B. This is the area where both
the circles overlap.
The complement is defined relative to the sample space.
- Complement of set A is the set of all outcomes in the sample space that are not
contained in A.
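A tiny sketch of union, intersection, and complement using Python sets; the die-roll events are my own example, not from the textbook.

S = {1, 2, 3, 4, 5, 6}   # sample space: one roll of a six-sided die
A = {2, 4, 6}            # event A: roll an even number
B = {4, 5, 6}            # event B: roll at least a 4

print(A | B)   # union, A U B          -> {2, 4, 5, 6}
print(A & B)   # intersection, A n B   -> {4, 6}
print(S - A)   # complement of A in S  -> {1, 3, 5}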
Two types of probability: Theoretical and empirical
Theoretical probability is calculated through deduction, or the math of things
- For a fair coin, the theoretical probability of heads is 0.5 (number of ways the outcome can happen /
number of possible outcomes)
Empirical probability is different, as it is based on data gathered in an experiment.
- Flipping a coin 500 times and it lands on heads 262 times: 262/500 = 0.524
As the number of outcomes in the experiment increases, the empirical probability will converge
to the theoretical probability. This is called the Law of Large Numbers.
As the number of outcomes increases, the sample average converges to the population average
(theoretical probability)
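A quick simulation of the law of large numbers with a fair coin, using only Python's standard library; the flip counts are arbitrary choices for illustration.

import random

random.seed(7)

for n_flips in (10, 500, 50_000):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    print(f"{n_flips:>6} flips: empirical P(heads) = {heads / n_flips:.3f}")
# The empirical probability drifts toward the theoretical value of 0.5 as the number of flips grows.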
General rules of probability
First, for any given event A, P(A) >= 0. This means that probabilities are not negative. Second, the
probability of something occurring in the sample space is 1. Third, if the sets are mutually exclusive,
the probability of the union of those events occurring is the sum of the probabilities of the individual
events.
If two sets do not overlap, you can simply add up the probabilities because you don't have to worry
about double counting. Fourth, the probability of an event that cannot occur is zero. Fifth, the
probability of an event that must occur is 1. Sixth, every probability is a number between 0 and 1.
Probability distributions:
Normal distributions
It is associated with Karl Pearson and with Gauss (which is why it is also called the Gaussian
distribution). It is simply a distribution of numbers that are normally distributed.
The central limit theorem states that if you repeatedly take samples of a variable from its distribution,
the distribution of the sample averages will increasingly take the form of a normal distribution.
- The sampling distribution is not dependent on the population distribution
It also has a number of characteristics that are useful to know. It is bell shaped, unimodal,
symmetrical, and the mode, median, and mean are equal
Unimodal means that it only has one mode.
Bimodal distribution is not uncommon and usually occurs when you have two groups of
observations in one data set.
Level of kurtosis- how tall (peaked) the normal distribution is
Standard normal distribution: we can transform every normal distribution into a standard
normal distribution to make statistical testing easier to interpret.
We do this by converting every observation into a z-score; what we get is the
standard normal distribution. The average is converted to zero, and every other value
indicates how far away from the average (mean) each observation is. When the z-score is equal to 1,
the observation is said to be one standard deviation away from the mean.
Because we have standardized each observation as a z-score, we can easily figure out how
common a value is based on where it is located.
68% of observations are within one standard deviation of the mean. If the mean is the most
common value, then we know the majority of observations are not that far away from the
mean.
Z-scores between -2 and +2 cover about 95% of all the observations in your data set.
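A short sketch that computes z-scores and checks the 68%/95% figures by simulation; numpy is assumed available, and the mean and standard deviation used to generate the data are arbitrary.

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=100, scale=15, size=100_000)   # any normally distributed variable

z = (x - x.mean()) / x.std()    # z-score: distance from the mean in standard deviations

print(round((np.abs(z) <= 1).mean(), 3))   # ~0.683: within one standard deviation
print(round((np.abs(z) <= 2).mean(), 3))   # ~0.954: within two standard deviations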
The standard normal distribution is the default because statistical life is much easier when these
conditions are met.
In order to identify the percentage of values in between z-scores this way, we are supposed to know the
population standard deviation and have a sample size greater than 30.
If your sample is less than 30, you should use the t distribution. You should also always use it if you are
using an estimated standard deviation.
The t value is dependent on the sample size.
Chapter 8: Descriptive statistics:
Going to talk about measures of central tendency (mean, mode, median), measures of
dispersion (range, inter-quartile range, variance, standard deviation) and measures of
distribution shape (skew and kurtosis)
Let's say we have f(x). As we know, values occur most frequently in the middle of the distribution.
Meaures of central tendency:
Because distributions are shaped like this, we can calculate a number of
measures of central tendency. A measure of central tendency is defined as representing a typical, or
expected, value: the mode, median, and mean.
THE MODE! The mode is simply the value that occurs most frequently (the value that has the
highest FREQUENCY).
Using the mode doesn’t really make sense for a bimodal distribution.
If you’re going to report the mode, there should really be only one.
If the data you are describing is nominal, the category containing the greatest number of
observations is the mode. Mode is the ONLY central tendency you can use for nominal data.
Sometimes people gather data that are ordinal, interval, or ratio and put them into
categories.
This is sometimes done for confidentiality reasons. It is also done because those who gather the data
want to know information about different groups of people.
Even though the data you are describing is usually considered interval or ratio, it is represented
in categories: measuring how many places/people/things fall within each of the categories.
In these cases, the best measure of central tendency is the modal class: see which category
appears most frequently.
Sometimes the mode doesn’t provide much information about the data.
Consider a land use variable with 8 categories: residential, commercial, industrial, government,
institutional, mixed, parks, and agricultural.
Age ranges could be constructed such that the same number of people are in each category.
So, when should you use the mode? First, when variables are measured at the nominal level.
Second, use the mode when you want to report the most common
value/score/type/classification. And third, probably as important as anything else, when you look
at your data and know that one classification is occurring more frequently than others.
THE MEDIAN!!!
The median is used for interval and ratio data. The median is just the middle value from a set of
ranked observations.
This is easy to find if you have an odd number of observations, but what about an even number
of observations? Find the middle point between the two middle numbers.
When is the best time to use the median? You should NOT use it for nominal data. You can use
it for ordinal data, but only if there is an odd number of observations, because as we know, we don't
know the difference between rankings, only that there are rankings.
So, interval and ratio data is probably best for this.
Another time when you would want to use the median is when you have highly skewed
observations.
In a skewed distribution, the number of observations to the left of the mode/peak is not the
same as the number to the right. Therefore, the median, mode, and mean are all different.
When data are positively skewed, there are more observations to the right of the mode. With
more observations to the right of the mode, the middle value is now to the right of the mode.
For negatively skewed, the median is to the left of the mode because there are more values
with smaller numbers.
The mean (technically the arithmetic mean, because we use arithmetic to calculate it rather than
transforming numbers) is the average in the way most people use the term. It is the most widely
used measure of central tendency. It is the expected value because it is the most frequent.
It should only be used for ratio or interval data.
As we discussed, the mean and median are going to be the same for symmetrical and unimodal
distributions, but generally they are going to be very close when data are fairly evenly dispersed.
If data is put into categories, you can still calculate the mean with a different formula.
You can also use the formula to calculate a weighted mean, with a simple
modification. This is for cases with disproportional sampling, and we have to do it because we're
interested in the subpopulation.
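A small sketch of a weighted mean with numpy (assumed available), the kind you would use after disproportional sampling; the subgroup values and population shares are invented.

import numpy as np

# Hypothetical subgroup means and each subgroup's share of the population.
subgroup_means = np.array([42_000, 55_000, 61_000])   # e.g. average income per stratum
population_shares = np.array([0.60, 0.30, 0.10])      # weights must reflect the population

weighted_mean = np.average(subgroup_means, weights=population_shares)
plain_mean = subgroup_means.mean()

print(weighted_mean)   # 47800.0 -- pulled toward the large subgroup
print(plain_mean)      # ~52666.7 -- ignores how big each subgroup is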
You can even get data in categories that are ranges, but when you do this you assume that the data
are evenly distributed within each category.
Another situation where you can use it, though you probably shouldn't, is when there are
precisely defined ranges.
- E.g. Age and income (0-20, 21-40, 41-60, etc.)
In short, be transparent regarding the type of measure of central tendency.
Measures of dispersion:
Measures of dispersion will tell us where individuals/places/things are positioned in the distribution
relative to a measure of central tendency.
We need measures of dispersion: measures describing how dispersed the data are, capturing the
variety or diversity in the data.
The greater the dispersion of a variable, the greater the range of scores will be and the greater the
differences, which will generate less useful data. Topics will include the range, quantiles, deviation
from the mean, variance, and standard deviation
The range is the crudest measure of variability: it simply measures the distance between the
lowest and highest value observations in your variable. It is heavily influenced by extreme
values.
However, sometimes with data, there are outliers with extreme values that can mess up
the range.
One method to deal with outliers is through quantiles: equal portions of data. We will see
quartiles (4 groups) or quintiles (5 groups).
25th percentile = 25 percent of the observations are below this number
50th percentile = 50 percent of the observations are below this number
75th percentile = 75% of the observations are below this number.
The 25th percentile is the lower quartile, the 50th percentile is the median, and the 75th percentile is
the upper quartile.
A statistic that is often reported in this context is the interquartile range (75th percentile minus the
25th percentile); a greater interquartile range means more variability.
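A quick sketch of quartiles and the interquartile range with numpy (assumed available); the values are random, with one extreme outlier added to show that the IQR barely moves while the range explodes.

import numpy as np

rng = np.random.default_rng(5)
values = rng.normal(loc=50, scale=10, size=999)
values_with_outlier = np.append(values, 10_000)   # one extreme observation

for data in (values, values_with_outlier):
    q25, q50, q75 = np.percentile(data, [25, 50, 75])
    print(f"range = {data.max() - data.min():8.1f}, "
          f"IQR = {q75 - q25:5.1f}, median = {q50:5.1f}")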
Deviations from the mean:
Di is the distance between observation i and the mean. It is useful in determining how typical an
observation is. It is obtained by subtracting an individual observation's value from the mean. This
concept can help us calculate the average or mean deviation from the mean
We can calculate the sum of the absolute values of the distances from the mean, divided by the total
number of observations. In other words, we don't care whether the distance between an observation
and the mean is positive or negative, because we only care about how much dispersion
there is.
Once we calculate the average deviation from the mean, we can't really tell if it is big yet, so
we have to compare it to other deviations from the mean.
From a statistical standpoint, if we are basing all of these calculations on a normal distribution,
the sum of all the deviations (before squaring) must sum to zero: because the normal distribution is
symmetrical, every deviation on one side of the distribution is matched on the other side. This is
connected to the idea of degrees of freedom.
These measures of dispersion, coupled with the mean, provide a lot of information regarding
the variable we are considering. For example, if a variable we are looking at has a mean of 100
and a standard deviation of 5, we can be quite confident that any given observation will have a
value that is close to 100. But if the standard deviation is 25, we cannot be so confident.
The standard deviation is a unit of measurement used to describe observations on the normal
distribution, allowing us to compare two observations to each other using their standardized scores.
How do we get normalized standard deviations? We calculate the z-score.
Coefficient of variation:
CV is the coefficient of variation.
The CV shows the standard deviation relative to the mean, and multiplying the CV by 100 allows
us to interpret the statistic a little more easily. It isn't used very much.
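A small sketch tying together the standard deviation, z-scores, and the coefficient of variation using Python's standard library; the observations are invented numbers.

import statistics

observations = [96, 102, 99, 105, 110, 93, 101, 104]

mean = statistics.mean(observations)
sd = statistics.stdev(observations)     # sample standard deviation

z_scores = [(x - mean) / sd for x in observations]   # distance from the mean in SD units
cv = sd / mean * 100                    # coefficient of variation, as a percentage

print(round(mean, 1), round(sd, 1))
print([round(z, 2) for z in z_scores])
print(round(cv, 1))                     # small CV: observations sit close to the mean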
Measures of distribution shape:
The last descriptive statistics we will discuss here are skewness and kurtosis. We rarely report these
statistics, but we will test for them, especially skewness, because we often assume our
distributions do not have any skew. Skewness is the degree of symmetry in the distribution of
observations. Kurtosis is the degree of flatness or peaked-ness of the distribution. In short, skew is
important to us, but kurtosis is not.
Skewness occurs when the distribution is not symmetrical.
Death rates are an example of a left-skewed variable.
When the distribution is positively skewed, the mean is greater than the median. If a
distribution is negatively skewed, the mean is less than the median.
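A quick demonstration that the mean gets pulled past the median in a positively skewed distribution, using numpy and scipy (assumed available); the income-like values are randomly generated.

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(9)
income_like = rng.exponential(scale=40_000, size=10_000)   # right-skewed, like income

print(round(float(skew(income_like)), 2))     # positive skewness
print(round(float(np.mean(income_like))))     # the mean ...
print(round(float(np.median(income_like))))   # ... sits above the median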
We care about skewness because a lot of the time our data aren't normally distributed.
In conclusion, descriptive statistics are literally used to describe your data and provide a
summary of it.
An example using the textbook data:
Let’s go over some of these concepts using the data provided in the textbook.
