Lecture 3 - What Is Big Data Analytics - GEOG 3226 - 2023


Lecture 3

What is big data analytics?: Overview

GEOG 3226
Urban Data Science

Instructor: Professor Jinhyung Lee


Overview

• Part 1: What is big data analytics?

• Part 2: Four types of big data analytics

• Part 3: Popular analysis methods for big data analytics


Part 1:

What is big data analytics?


What is big data analytics?
• Until now, we have learned where to find big data and how to collect and store it

• Nonetheless, simply possessing big data doesn’t help solve real-world problems

• Big data only becomes useful when we do something with it

• Therefore, the question now becomes:
How do we analyze big data to extract useful information, knowledge, and insights?

Source: utimaco.com

Urban Data Science GEOG 3226


How do we analyze big data to extract useful information, knowledge, and insights?

The DIKW hierarchy, from top to bottom:
• Wisdom: Placing knowledge into a framework that allows the knowledge to be applied to different contexts
• Knowledge: Information being understood, useful for explaining something or providing insights
• Information: Data related to each other through a context, providing a useful story
• Data: Basic, discrete facts about something (We are HERE!)

Source: Wikipedia
Rowley, J., 2007. The wisdom hierarchy: representations of the DIKW hierarchy. Journal of information science, 33(2), pp.163-180.



FOUR important questions

• You have big urban data, but you don’t know where to start.
These four questions can be good cornerstones!

Example: Crime hotspot

What happened? → Descriptive analytics
Why did this happen? → Diagnostic analytics
What might happen in the future? → Predictive analytics
What should we do next? → Prescriptive analytics



Part 2:

Four types of big data analytics


Type 1: Descriptive analytics

• Question: What happened?

Example: Crime hotspot

• Descriptive analytics is the process of using current and historical data to identify trends and relationships

• It is commonly called the simplest form of data analysis since it describes trends and relationships but doesn’t dig deeper

• Descriptive analytics is relatively accessible and easy to use

Source: techtarget.com



Type 1: Descriptive analytics
• Example
Statistics Canada presents simple, readable summary statistics and key indicators from the 2021 Canadian Census



BTW – where can I get Canadian census data?
• CHASS – Canadian Census Analyser
https://datacentre.chass.utoronto.ca/census/

• cancensus R package (recommended)
https://github.com/mountainMath/cancensus

BTW – what if I am interested in the US and other countries?
• Social Explorer (you have to access it on campus)
This is also useful for quick “visual” descriptive analytics
https://www.socialexplorer.com/explore-maps

Benefits of descriptive analytics

• Understanding Past Behavior: By analyzing historical data, organizations can understand past performance, customer behavior, and other key metrics over different time frames (census data in 1996 versus 2006 versus 2021)

• Basis for Planning and Strategy: Understanding past trends and patterns provides valuable insights that can be the foundation for strategic planning and decision-making

• Before moving to more advanced analytics, it's essential to ensure that the data being used is accurate and reliable. Descriptive analytics can help identify anomalies or inconsistencies in data

• Visual Understanding: Descriptive analytics tools often include visualization capabilities, allowing data to be represented in graphs and other visual formats that make trends and patterns more evident and comprehensible (Social Explorer)

• BASICALLY - Descriptive analytics is a crucial first step in data analytics and serves as the foundation for more in-depth analysis

Type 2: Diagnostic analytics

• Question: Why did this happen?

Example: Crime hotspot

• Diagnostic analytics is one of the more advanced types of big data analytics. It is the process of using data to determine the causes of trends and correlations between variables

• It can be viewed as a logical next step after using descriptive analytics to identify trends

• Diagnostic analysis can be done manually, using an algorithm, or with statistical software

• You can turn to diagnostic analytics when you want to know the reason behind a particular problem

Source: techtarget.com



Type 2: Diagnostic analytics

• There are three important concepts related to Diagnostic analytics

1. Hypothesis testing

2. Correlation versus causation

3. Diagnostic regression analysis (we will revisit this in Week 11)



Concept 1: Hypothesis testing

• Hypothesis testing is the statistical process of using data to assess whether an assumption is supported or should be rejected

• When applied in diagnostic analytics, hypotheses are oriented toward the past
• Example: “I assume this month’s decline in COVID-19 infections was caused by the recent increase in vaccination.”

• But hypotheses can be oriented toward the future when applied in predictive or prescriptive analytics
• Example: “If we build more parks around a housing estate, the mental health of residents living there will improve.”
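
A past-oriented hypothesis like the COVID-19 example could be checked with a permutation test, sketched below. This is only one possible test; the slides do not name a specific one. The weekly counts are made-up numbers, and the name `permutation_test` is hypothetical.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Idea: if group membership did not matter, shuffling the labels
    should produce differences as large as the observed one fairly often.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Hypothetical weekly infection counts (illustrative numbers only).
before_vaccination_increase = [120, 135, 128, 140, 132, 138]
after_vaccination_increase = [98, 105, 110, 95, 102, 99]

p_value = permutation_test(before_vaccination_increase, after_vaccination_increase)
print(f"Estimated p-value: {p_value:.4f}")
```

A small p-value only says the drop is unlikely to be chance alone. It does not by itself show that vaccination caused the decline.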



Concept 2: Correlation versus causation
• When exploring relationships between variables, it’s important to be aware of
the distinction between correlation and causation

• If two or more variables are correlated, their directional movements are related. If two variables are positively correlated, it means that as one goes up or down, so does the other

• Alternatively, if two variables are negatively correlated, one variable goes up while the other goes down

[Figure: Income versus Education]
Source: www.simplypsychology.org
Concept 2: Correlation versus causation

• IMPORTANT: The key in diagnostic analytics is remembering that just because two variables are correlated, it doesn’t necessarily mean one caused the other to occur

• Correlation is when two factors (or variables) are related, but one does not necessarily cause the other;
Causation is when one factor (or variable) causes another

• Example: Next slide



Correlation versus causation
Money spent on pets and the number of lawyers in California each year are highly correlated!

However, neither money spent on pets nor the number of lawyers causes the other.

😳?
Source: vox.com

• While determining causation is ideal, correlation can still offer the insight needed to make sense of your data and use it to make good decisions

Diagnostic regression analysis (Week 11)
• Some relationships between variables require in-depth examination by regression analysis, which can be used to determine the relationship between
1) two variables (simple linear regression) or 2) three or more variables (multiple regression)

• The relationship is expressed by a mathematical equation for the line that best fits the variables’ relationship; the slope of that line shows how one variable changes with the other

[Figure: variables Crime, Median household income, Population density, Education]

• Regression can provide insights into the structure of the relationship between variables and provides measures of how well the data fit that relationship

• Such insights can prove extremely valuable for analyzing historical trends (diagnostic analytics) and developing forecasts (predictive analytics)
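
To make the best-fit line concrete, here is a minimal sketch of simple linear regression by ordinary least squares. The income and crime figures are invented for illustration, and the helper names `fit_line` and `r_squared` are hypothetical. R² is used here as one common measure of how well the data fit the line.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Share of the variation in y that the fitted line accounts for."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical neighbourhoods: median household income (thousand $)
# and crime rate (incidents per 1,000 residents).
income = [40, 50, 60, 70, 80]
crime_rate = [30, 26, 21, 18, 15]

slope, intercept = fit_line(income, crime_rate)
fit_quality = r_squared(income, crime_rate, slope, intercept)

# Diagnostic use: the sign and size of the slope describe the past relationship.
# Predictive use: plug a new income value into the fitted equation.
forecast_at_90 = intercept + slope * 90

print(f"slope={slope:.2f}, intercept={intercept:.1f}, R^2={fit_quality:.3f}")
print(f"forecast for income 90: {forecast_at_90:.1f}")
```

Multiple regression extends the same idea to several explanatory variables at once, which is usually done with a statistics library rather than by hand.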
Example of diagnostic analytics: Correlation
• Example: Starbucks

• Based on the correlation found between the “feels-like” temperature and the number of Frappuccinos sold, Starbucks predicts sales volume and runs localized marketing efforts
https://mediaspace.esri.com/media/t/1_wcl6obys (9:20)

Benefits of diagnostic analytics

• Root Cause Analysis: Diagnostic analytics allows analysts to identify the root causes of observed outcomes, helping them understand the underlying factors contributing to a specific result

• Improved Decision-Making: By understanding the reasons behind specific outcomes, organizations can make more informed decisions in the future (Starbucks)

• Identifying Patterns and Relationships: Diagnostic tools often allow users to spot patterns and correlations in complex datasets, which can be valuable in understanding interdependencies

• Enhanced Insights: While descriptive analytics provides an overview, diagnostic analytics dives deeper, offering a more granular understanding of events and trends

Type 3: Predictive analytics

• Question: What might happen in the future?

Example: Crime hotspot

• Predictive analytics is the use of data to predict future trends and events. It uses historical data to forecast potential scenarios that can help drive strategic decisions

• Predictive analysis can be conducted based on data from the past either manually or by machine-learning algorithms

Source: techtarget.com



Example of predictive analytics: Soccer

• XThreat (expected threat): the likelihood of a shot occurring within the next 10 seconds if the pass is completed
• XPass (expected pass completion)

• Clubs use analytics to predict future performance, simulate game scenarios, and make strategic decisions.

• Big data has become a significant part of soccer analytics!

• Clubs are starting to use artificial intelligence and machine learning for data analysis, providing deeper insights and more accurate predictions.

Source: theathletic.com, soccertake.com

Benefits of predictive analytics

• Predictive analytics can help provide reliable and accurate forecasts of the future

• By using predictive analytics, it is possible to take preemptive actions to stave off bigger problems that may cause greater social & economic losses

• Governments can also rely on predictive analytics to better understand residents’ needs and create a more livable urban environment for everyone



Type 4: Prescriptive analytics

• Question: What should we do next?

Example: Crime hotspot

• By considering all relevant variables, prescriptive analytics yields recommendations for the future

• Machine-learning algorithms are often used in prescriptive analytics to parse through big data

• Despite the recommendations made by algorithms, human discernment is still required

Source: techtarget.com



Example:
Algorithmic recommendation

• On social media, if you finish a video, that’s a strong indicator that you’re interested. Videos are then ranked to determine how likely you’ll be interested in each video and delivered to each unique ‘For You’ feed.

• This prescriptive analytics use case can make for higher customer engagement rates, increased customer satisfaction, and the potential to retarget customers with ads based on their behavioral history.

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf
https://newsroom.tiktok.com/en-us/how-tiktok-recommends-videos-for-you
Source: blog.hootsuite.com

Benefits of prescriptive analytics

• Actionable Insights: Unlike other forms of analytics that might stop at providing insights, prescriptive analytics goes a step further by suggesting concrete actions to achieve desired results

• Improved Decision-Making: By offering actionable recommendations based on data, decision-makers can make more informed choices backed by analytical evidence

• Proactive Approach: Instead of just reacting to events, businesses/organizations can proactively manage situations by acting on the recommendations provided by prescriptive analytics



Summary
• Four types of big data analytics: from the perspective of questions being asked

Source: medium.com



Part 3:

Popular analysis methods for big data analytics


The workflow of a big data analytics project

[Figure: workflow diagram with the annotations “We are talking about this part!” and “After Exam 1”]

Source: Big Data and Social Science: Data Science Methods and Tools for Research and Practice, edited by Ian Foster, et al., CRC Press LLC, 2020



Five most important data analytics approaches

• 1. Exploratory data analysis (EDA)

• 2. Visualization (our main topic for Weeks 7 and 8)

• 3. Text analysis

• 4. Machine learning

• 5. Network analysis



Method 1: Exploratory data analysis (EDA)

• Exploratory data analysis (EDA) is the process of conducting preliminary exploration of the collected data in order to reveal patterns and anomalies through the use of summary statistics and data visualization methods

• Two main interests:

1. Understanding Data Characteristics

2. Discovering Patterns and Relationships



Understanding data characteristics
• Distribution of Data: EDA often involves visualizing the distribution of data for
individual variables using histograms, box plots, or kernel density plots.
This helps identify the central tendency, spread, and shape of the data distribution

• Identification of Outliers: Outliers can heavily influence statistical tests and models.
EDA helps in detecting and, if necessary, addressing these outliers

[Figure: examples of outliers]
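
One common way to flag outliers is the 1.5 × IQR rule that box plots use. The sketch below assumes that rule, which the slides do not prescribe, and uses invented commute times. The name `find_outliers` is hypothetical.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (box-plot rule)."""
    q1, _median, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical commute times in minutes; one trip is unusually long.
commute_minutes = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

outliers = find_outliers(commute_minutes)
print("Flagged as outliers:", outliers)
```

Whether a flagged value should be removed, corrected, or kept is a judgment call: an extreme value may be a data error or a real and important observation.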



Understanding data characteristics
• Summary Statistics: Calculating measures like mean, median, mode, variance, standard deviation, and other statistics provides insights into the dataset's characteristics
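
The measures named above can be computed with Python's built-in `statistics` module, as in this minimal sketch. The travel times are invented for illustration.

```python
from statistics import mean, median, mode, stdev, variance

# Hypothetical transit travel times in minutes.
travel_minutes = [20, 25, 25, 30, 35, 45]

summary = {
    "mean": mean(travel_minutes),          # central tendency
    "median": median(travel_minutes),      # middle value, robust to outliers
    "mode": mode(travel_minutes),          # most frequent value
    "variance": variance(travel_minutes),  # sample variance (divides by n - 1)
    "std_dev": stdev(travel_minutes),      # sample standard deviation
}

for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```

Comparing the mean and the median is a quick check on skew: here the mean is higher than the median because the longest trip pulls it up.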

[Figure labels: Colonial rulers; Colonial subjects; Transport poverty]

Kim, Y., Lee, J., Kim, J., & Nakajima, N. (2021). The Disparity in Transit Travel Time
between Koreans and Japanese in 1930s Colonial Seoul. Findings.



Discovering patterns and relationships
• Correlations: EDA often investigates the relationships between pairs of variables
to identify if any significant correlations exist

• To quantify the relationships among variables, we can use Pearson’s correlation coefficient, which is a measure of the linear association between two variables

-1 indicates a perfectly negative linear correlation between two variables
0 indicates no linear correlation between two variables
1 indicates a perfectly positive linear correlation between two variables
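
The three reference values can be reproduced with a short sketch of Pearson's coefficient. All numbers are invented, and `pearson_r` is a hypothetical helper name.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(spread_x * spread_y)


# Hypothetical data: years of education and income (thousand $).
education_years = [10, 12, 14, 16, 18]
income = [30, 38, 46, 54, 62]            # rises in a straight line
unemployment = [62, 54, 46, 38, 30]      # falls in a straight line

r_positive = pearson_r(education_years, income)
r_negative = pearson_r(education_years, unemployment)

# A U-shaped relationship: clearly related, but not linearly.
x = [-2, -1, 0, 1, 2]
y = [4, 1, 0, 1, 4]
r_u_shape = pearson_r(x, y)

print(r_positive, r_negative, r_u_shape)
```

The U-shaped case is a reminder that a coefficient near 0 rules out a linear association only, not every kind of relationship.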



What is ESDA?

• Exploratory spatial data analysis (ESDA) is a set of techniques to 1) visualize, 2) summarize, and 3) understand spatial patterns and relationships in spatial data

• Main purpose
– Seeing what the data can tell us before diving into the formal modeling or hypothesis testing task

• Simple mapping, graphs, and descriptive statistics are used

Source: https://towardsdatascience.com/exploratory-spatial-data-analysis-esda-spatial-autocorrelation-7592edcbef9a



Method 2: Data visualization (Weeks 7 & 8)
• EDA and data visualization are closely related

• Data visualization: Represents data in a meaningful, visual way that users can interpret and easily comprehend. It provides an accessible way to see and understand trends, outliers, and patterns in data

Stiles, J., Kar, A., Lee, J., & Miller, H. J. (2023). Lower volumes, higher speeds: Changes to crash type, timing, and severity on urban roads from COVID-19 stay-at-home policies. Transportation research record, 2677(4), 15-27.

Method 3: Text analysis

• First of all, we need to understand that most big data is unstructured

• Unstructured data includes information from sources like documents, e-mails, SNS posts, blogs, videos and satellite imagery



Method 3: Text analysis
• Text analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information that can then be leveraged in various ways

• The analysis and extraction processes take advantage of techniques that originated in computational linguistics, statistics, and other computer science disciplines
• Natural language processing (NLP)
• Knowledge discovery
• Data mining, information retrieval

Source: https://datasciencedojo.com/
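
A very small sketch of the "unstructured text in, structured information out" idea: raw posts become rows of tokens and a table of term counts. The posts and the stopword list are made up, and real text analysis pipelines do much more than this (for example stemming, entity recognition, or topic modeling).

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {"the", "is", "a", "and", "to", "in", "on", "of", "so", "was", "are", "now"}


def tokenize(text):
    """Lowercase the text, keep alphabetic words, and drop stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]


# Hypothetical social media posts about urban transport.
posts = [
    "The bus was late again!",
    "Bus lanes are great, the bus is so fast now.",
    "New bike lanes on Main Street.",
]

# Structured output 1: one record per post.
records = [{"post_id": i, "tokens": tokenize(p)} for i, p in enumerate(posts)]

# Structured output 2: how often each term appears across all posts.
term_counts = Counter(tok for rec in records for tok in rec["tokens"])

print(records[0])
print(term_counts.most_common(3))
```

Once text is in this tabular form, the EDA and visualization methods from earlier in the lecture can be applied to it.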



Example: Understanding Place Identity with Generative AI
[Figure panels: Methods; Results]
Jang, K. M., Chen, J., Kang, Y., Kim, J., Lee, J., & Duarte, F. (2023). Understanding Place Identity with Generative AI. arXiv preprint arXiv:2306.04662.



Method 4: Machine learning



Artificial intelligence (AI)
• Definition:
AI refers to the capability of a machine to imitate intelligent human behavior. It encompasses a wide range of technologies, including machine learning and deep learning

• Real-world example:
Voice assistants like Apple's Siri.

When you ask Siri about the weather, you're interacting with AI

These assistants understand and process your voice commands, search for information, or control other smart devices in your home

[Figure: Imitating human assistant]
Source: apple.com



Machine learning (ML)
• Definition:
ML is a subset of AI where machines are trained to learn from data.
Instead of being explicitly programmed to perform a task, they use data to make predictions or decisions

• Real-world example:
Platforms like Netflix or Spotify suggest movies, shows, or songs based on your previous choices. Machine learning algorithms analyze your viewing or listening habits (and those of similar users) to recommend content you might enjoy
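
The "similar users" idea can be sketched as a toy recommender: find the user whose ratings look most like yours, then suggest what they rated that you have not seen. This is an illustration only, not how Netflix or Spotify actually work. The ratings, user names, and function names are all hypothetical.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ana": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "ben": {"Movie A": 5, "Movie B": 5, "Movie D": 4},
    "cara": {"Movie C": 5, "Movie E": 4},
}


def cosine_similarity(u, v):
    """Similarity of two rating dicts; 1.0 means identical taste direction."""
    shared = set(u) & set(v)
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = sqrt(sum(score ** 2 for score in u.values()))
    norm_v = sqrt(sum(score ** 2 for score in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(user, all_ratings):
    """Suggest unseen items from the most similar other user, best first."""
    others = [name for name in all_ratings if name != user]
    most_similar = max(
        others, key=lambda o: cosine_similarity(all_ratings[user], all_ratings[o])
    )
    unseen = {
        item: score
        for item, score in all_ratings[most_similar].items()
        if item not in all_ratings[user]
    }
    return sorted(unseen, key=unseen.get, reverse=True)


suggestions = recommend("ana", ratings)
print("Suggestions for ana:", suggestions)
```

Nothing in the code says which movies go together. The suggestion comes entirely from patterns in the rating data, which is the point of the ML definition above.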



Source: datasciencecentral.com
Deep learning (DL)
• Definition:
Deep learning is a subset of ML inspired by the structure and function of the human brain, specifically neural networks. It uses multiple layers (hence "deep") of these networks to analyze various factors of data

• Real-world example:
Photo tagging features on platforms like Facebook. When you upload a photo, you might notice the platform automatically recognizing and tagging people in the image

This is due to deep learning models trained on millions of images to recognize facial features and associate them with the right individuals

Source: https://medium.com/hackernoon/building-a-facial-recognition-pipeline-with-deep-learning-in-tensorflow-66e7645015b8



Method 5: Network analysis
• A network refers to a structure representing a group of objects/people and relationships between them. It is also known as a graph in mathematics

• Networks are composed of nodes, which represent entities that can be connected to one another, and of edges, which represent the relationships connecting nodes

Nodes can be people, countries, cities, airports, etc.

Edges are the connections between nodes



Method 5: Network analysis
• For instance, if we are studying social relationships between Facebook users, nodes are the target users and edges are relationships such as friendships between users or group memberships. On Twitter, edges can be following/follower relationships

• Centrality measures (the importance of a node) are particularly useful!

• Network centrality measures are used to assess:
- How well-used a road is in a transport network
- How influential a person is in a social network
- How important a web page is
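
As a minimal sketch, the simplest centrality measure is degree centrality: the share of other nodes a node is directly connected to. The slides do not say which measure is meant, and others (such as betweenness or PageRank) capture importance differently. The friendship list and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical friendships: each pair is an undirected edge between two nodes.
friendships = [
    ("Ann", "Bob"),
    ("Ann", "Cho"),
    ("Ann", "Dee"),
    ("Bob", "Cho"),
    ("Dee", "Eli"),
]


def degree_centrality(edges):
    """Degree centrality: a node's neighbour count divided by (n - 1)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


centrality = degree_centrality(friendships)
most_central = max(centrality, key=centrality.get)

for person, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{person}: {score:.2f}")
print("Most central:", most_central)
```

The same structure works for other networks: replace people with intersections and friendships with road segments to ask which intersections are the most connected.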



Next week

• Application of big data analytics

– Dive deeper into network analysis

– How to use big, real-time data for urban human mobility and accessibility analysis?
