EECS6893 BigDataAnalytics Lecture1
www.stanford.edu/~cdel/2014.asplos.quasar.pdf
• Massive Parallelism
• Huge Data Volumes Storage
• Data Distribution
• High-Speed Networks
• High-Performance Computing
• Task and Thread Management
• Data Mining and Analytics
• Data Retrieval
• Machine Learning
• Data Visualization
7 E6893 Big Data Analytics – Lecture 1: Overview © 2017 CY Lin, Columbia University
Contrasting Approaches in Adopting High-Performance Capabilities
http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2013-2017
9 © CY Lin, Columbia University
E6895 Advanced Big Data Analytics – Lecture 1
Why Big Data now?
• High-Volume
• High-Velocity
• High-Variety
➔ Artificial Intelligence
1997
Jeopardy
2011 — 1997
2015
Human brain is a graph/network of 100B nodes and 1T edges.
memory
• Graph Database:
• Large-Scale Native Store
• Sapphire Big Data Analytics Open Source Applications: Create Big Data open-source toolsets for various industries (and disciplines)
▪ Textbook:
-- None, but reference book(s) and/or articles/papers will be provided each lecture.
Course Grading
▪ 3 Homeworks: 40%
-- Individual work; Language Requirement: C/C++, Java, JavaScript, Python
-- Report and source code
Other Issues
▪ Professor Lin:
▪ Office Hours:
Thursday after the class: 9:40pm – 10:00pm (SIPA 417, lecture room)
▪ Contact: c.lin@columbia.edu
Reading Reference for Lecture 1
Reference Book
5 Example Big Data Use Case Categories
Big Data Examples -- Application Use Cases
1. Expertise Location
2. Recommendation
3. Commerce
4. Financial Analysis
5. Social Media Monitoring
6. Telco Customer Analysis
7. Healthcare Analysis
8. Data Exploration and Visualization
9. Personalized Search
10. Anomaly Detection
11. Fraud Detection
12. Cybersecurity
13. Sensor Monitoring (Smarter another Planet)
14. Cellular Network Monitoring
15. Cloud Monitoring
16. Code Life Cycle Management
17. Traffic Navigation
18. Image and Video Semantic Understanding
19. Genomic Medicine
20. Brain Network Analysis
21. Data Curation
22. Near Earth Object Analysis
Category 1: 360º View
[Figure: dynamic networks of 400,000+ IBMers. Labels: Recommendation (user, item); Enhancing: Graph Visualizations, Shortest Paths, Social Capital, Bridges and Hubs, Expertise Search, Graph Search, Graph Recomm.]
– Featured in BusinessWeek four times, including as the Top Story of the Week, April 2009
– Helped IBM earn the 2012 Most Admired Knowledge Enterprise Award
– Wharton School study: $7,010 gain per user per year using the tool
– In 2012, contributed about 1/3 of the GBS Practitioner Portal's $228.5 million in savings and benefits
– APQC (WW leader in Knowledge Practice), April 2013: “The Industry Leader and Best Practice in Expertise Location”
Finding and Ranking Expertise – Social Network Analysis
▪ Decades of social science studies demonstrate that (social) network structure is the key indicator determining a person's influence, organizational operation efficiency, social capital to get help, potential to be successful, etc.
▪ Who are the key bridges? Who has the most connections? How do these experts cluster?
▪ Analogy – Google's founders applied network analysis to webpages to create a ranking.
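The questions above map onto standard graph measures. The following is a minimal sketch, assuming a small hypothetical collaboration graph (all names and edges are invented for illustration): degree counts direct connections, and a PageRank-style power iteration ranks people in the way links rank webpages. It is not the method used by the tool described in these slides.

```python
# Hypothetical toy network; names and edges are made up for illustration.
EDGES = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
         ("cat", "dan"), ("dan", "eve"), ("dan", "fay"), ("eve", "fay")]


def build_adjacency(edges):
    """Return an undirected adjacency map: node -> set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree(adj):
    """Number of direct connections per person."""
    return {node: len(nbrs) for node, nbrs in adj.items()}


def pagerank(adj, damping=0.85, iterations=100):
    """Plain power-iteration PageRank on an undirected graph."""
    n = len(adj)
    rank = {node: 1.0 / n for node in adj}
    for _ in range(iterations):
        new_rank = {}
        for node in adj:
            # Each neighbour shares its rank equally among its own links.
            incoming = sum(rank[nbr] / len(adj[nbr]) for nbr in adj[node])
            new_rank[node] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank


ADJ = build_adjacency(EDGES)
DEGREES = degree(ADJ)
RANKS = pagerank(ADJ)
# The two people joining the two triangles come out on top.
TOP_RANKED = sorted(RANKS, key=RANKS.get, reverse=True)[:2]
```

In this toy graph, "cat" and "dan" are the only link between two tightly knit groups, so they act as bridges and also receive the highest rank.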
[Figure: SmallBlue screenshots. Callouts: independent experts on healthcare; his self-described expertise; the public interest groups he is in.]
My various paths to Tom. SmallBlue can show the paths to any colleague up to 6 degrees away.
How many people are in my personal networks?
What types of unique colleagues can my friend Chris help me connect to?
Analyzing the existing social networks of every employee makes it possible to find the shortest path to any colleague.
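Finding the shortest path to a colleague within a hop limit can be sketched with a breadth-first search. This is an illustrative sketch only: the contact graph, the names, and the `shortest_path` helper are hypothetical and are not taken from SmallBlue.

```python
from collections import deque

# Hypothetical contact graph; names are placeholders, not SmallBlue data.
CONTACTS = {
    "me": ["chris", "pat"],
    "chris": ["me", "lee"],
    "pat": ["me", "lee"],
    "lee": ["chris", "pat", "tom"],
    "tom": ["lee"],
    "zoe": [],
}


def shortest_path(graph, start, goal, max_degrees=6):
    """Breadth-first search; returns the shortest path as a list of names,
    or None if the goal is unreachable within max_degrees hops."""
    if start == goal:
        return [start]
    parents = {start: None}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_degrees:
            continue  # do not expand beyond the hop limit
        for nbr in graph.get(node, []):
            if nbr in parents:
                continue
            parents[nbr] = node
            if nbr == goal:
                # Walk parent links back to the start.
                path = [nbr]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append((nbr, depth + 1))
    return None


PATH_TO_TOM = shortest_path(CONTACTS, "me", "tom")
```

Because breadth-first search explores contacts level by level, the first time the goal is reached is guaranteed to be via a shortest path.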
[Figure: recommendation experiment comparing CF + SP, IF, and TIF over the Network and Info Flow graphs, for innovators, early adopters, and late adopters; x-axis: number of recommended users.]
Tests:
– 1 month
– 586 new docs
– 1,170 users
▪ Data Source:
– Relationships among 7,594 companies, data mined from NYT, 1981 ~ 2009
[Figure: multi-level network of Team (account team, design team) and Person nodes; labels: Sociology, Healthcare, CS, Info, EE, Sensor, SNA.]
Detected as the top 1 anomaly in Sandy Tweets
Outperforms existing approaches by up to 180% (IJCAI 13)
Dynamics of Information Graphs in Social Media
Visual Sentiment and Semantic Analysis
First work in the literature on automatic visual sentiment analysis
Build Sentiment Ontology: Discover sentiment words → Select Adj-Noun Pairs → Train Classifiers → Performance Filtering
Example adjective-noun pairs: MISTY WOODS, SAD EYES
Training from 6 million tags
Detection results of “crazy car” (100% accuracy, 5 out of 5 correct)
Experiment on Sentiment Detection Accuracy on Twitter:
Text    0.43
Visual  0.70
T+V     0.72
profiles
− Build analytics applications (e.g., personalized advertisement) based on the extracted customer social profiles
[Figure labels: System G Analysis, BigInsights]
[Figure: communities of symptom terms (headache, chill, migraine, high fever, stomachache, cough). Enhancing: Graph Communities]
http://systemg.ibm.com/apps/whisper/index.html
SocialHelix: Visualization of Sentiment Divergence in Social Media
http://systemg.ibm.com/apps/socialhelix/index.html
Use Case 9: Graph Search
[Figure labels: query, context, graph analysis, info-socio networks, ranking, re-ranking, interest / social network based content recommendations]
[Figure: behavior-monitoring pipeline with graph visualizations.
Social sensors capture: emails, instant messaging, web access, click streams, executed processes, feed subscription, printing, copying, database access, log on/off.
Analyses: graph analysis, behavior analysis, multimodality semantics analysis, psychological analysis.
Outputs: detection, prediction, analysis & exploration interface.
Example findings: unstable mental status, planning attack.]
Multi-Modality Multi-Layer Understanding of Human
● Structure Learning
● Evolutionary Behavioral Modeling & Prediction
Layers, top to bottom: Cognition Layer, Semantics Layer, Concept Layer, Feature Layer, Sensor Layer
Available existing data: HR records, travel records, badge/location records, phone records, mobile records
Future additions?: transmitted images, speech content, video content
Legend: observations, hidden states
Example of Graphical Analytics and Provenance
Markov Network, Latent Network, Bayesian Network
1 in Top #21-#50, and 2 in Top #51-#100. Performer 2 did not report results. Performer 3 reported: 3 of the 12 cases Top
#50-#100, 6 cases Top #101-#500, and 3 cases beyond Top #501.
Use Case 11: Fraud Detection for Bank
Network Info Flow: Ego Net Features
Normal: (1) Clique-like, (2) Two-way links
Attacker: Near-Star
Detecting DoS attack
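The clique-like versus near-star contrast can be expressed as two simple ego-net features. This is a rough sketch under stated assumptions: the edge lists, host names, and the `ego_features` helper are hypothetical, and the two features (neighbour density and link reciprocity) are one plausible reading of the labels above, not the detector used in the lecture.

```python
# Hypothetical directed traffic edges (source -> destination), for illustration only.
NORMAL_EDGES = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"),
                ("b", "c"), ("c", "b")]
ATTACK_EDGES = [("x", "h1"), ("x", "h2"), ("x", "h3"), ("x", "h4")]


def ego_features(edges, ego):
    """Return (density, reciprocity) of the ego net around `ego`.

    density: fraction of neighbour pairs that are connected to each other
             (1.0 = clique-like, 0.0 = star).
    reciprocity: fraction of the ego's links that are two-way.
    """
    edge_set = set(edges)
    nbrs = ({v for u, v in edge_set if u == ego}
            | {u for u, v in edge_set if v == ego})
    nbrs.discard(ego)
    nbr_list = sorted(nbrs)
    pairs = [(p, q) for i, p in enumerate(nbr_list) for q in nbr_list[i + 1:]]
    linked = sum(1 for p, q in pairs
                 if (p, q) in edge_set or (q, p) in edge_set)
    density = linked / len(pairs) if pairs else 0.0
    two_way = sum(1 for n in nbrs
                  if (ego, n) in edge_set and (n, ego) in edge_set)
    reciprocity = two_way / len(nbrs) if nbrs else 0.0
    return density, reciprocity


NORMAL_FEATURES = ego_features(NORMAL_EDGES, "a")
ATTACK_FEATURES = ego_features(ATTACK_EDGES, "x")
```

In this toy data the normal host scores high on both features, while the attacking host, which contacts many unrelated targets that never reply, scores zero on both.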
[Figure: Bayesian Network causality analyzer. Input: KPI time series (e.g., server performance/load, network performance/load). Output: (potential) pairwise relationships (e.g., causality) between KPIs (each a time series), varying over time.]
Graph Visualizations
Bayesian Network
* 3 timesteps
* 63 variables
* 3.9 avg states
* 4.0 avg indegree
* 16,858 CPT entries
Junction Tree
* 67 cliques
* 873,064 PT entries in cliques
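The CPT entry count depends on both the number of states per variable and the indegree: a node's conditional probability table has one entry per combination of its own state and its parents' states. The sketch below works this out for a hypothetical three-node network (node names, state counts, and parent sets are invented); it does not attempt to reproduce the totals listed above.

```python
from math import prod

# Hypothetical three-node network; state counts and parents are illustrative.
STATES = {"load": 3, "latency": 4, "alarm": 2}
PARENTS = {"load": [], "latency": ["load"], "alarm": ["load", "latency"]}


def cpt_entries(node, states, parents):
    """Entries in one node's CPT: its own state count times the
    product of its parents' state counts."""
    return states[node] * prod(states[p] for p in parents[node])


CPT_SIZES = {node: cpt_entries(node, STATES, PARENTS) for node in STATES}
TOTAL_CPT_ENTRIES = sum(CPT_SIZES.values())
```

Each extra parent multiplies a node's table size by that parent's state count, which is why a higher average indegree makes the tables grow so quickly.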
Select KPI pairs (sampling) → Test link existence → Estimate unsampled links based on history → Overall graph
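The sampling pipeline can be sketched as one update round. This is a sketch under loud assumptions: the KPI names and values are invented, the `update_graph` and `link_exists` helpers are hypothetical, and a lagged Pearson correlation is swapped in as a stand-in link test because the slides do not specify the causality test that is actually used.

```python
import itertools
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def link_exists(cause, effect, lag=1, threshold=0.8):
    """Stand-in link test: does `cause` at time t track `effect` at t + lag?"""
    return abs(pearson(cause[:-lag], effect[lag:])) >= threshold


def update_graph(kpis, history, sample_size, rng):
    """One round: sample ordered KPI pairs, test them, keep history for the rest."""
    pairs = list(itertools.permutations(sorted(kpis), 2))
    sampled = set(rng.sample(pairs, min(sample_size, len(pairs))))
    graph = {}
    for pair in pairs:
        if pair in sampled:
            graph[pair] = link_exists(kpis[pair[0]], kpis[pair[1]])
        else:
            # Unsampled link: fall back to the previous estimate, if any.
            graph[pair] = history.get(pair, False)
    return graph


# Hypothetical KPI series: latency follows cpu with a one-step delay.
KPIS = {
    "cpu":     [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.0, 7.0],
    "latency": [0.0, 1.1, 2.9, 2.1, 5.2, 3.9, 6.1, 2.0],
    "disk":    [4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0],
}
FULL_GRAPH = update_graph(KPIS, history={}, sample_size=6, rng=random.Random(0))
```

Testing only a sample of pairs per round keeps the cost well below the quadratic number of KPI pairs; the price is that unsampled links are only as fresh as the last round in which they were tested.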
Category 5: Data Warehouse Augmentation
[Figure labels: Graph application, Graph objects, ARG s, ARG t, Vertex Correspondence, Attribute Transformation]
Use Case 19: Graph Matching for Genomic Medicine
Advanced Topic 2: Robo-Advisor
Market Data Analysis and Investment Targets
Advanced Dynamic ‘Know Your Customer’
Optimized Personalized Investment Strategy
Bank-Customer Interaction Strategy
High: High End Customers (Private Bank / Special Investment Services)
Mass Affluent, Upper Middle, Middle: Targeted Customers (Consumer Bank Services): $15K - $1M (Customer #: 30M~50M in China)
What is Robo-Advisor?
Robo-Advisor is a new type of wealth management service. Based on the risk level and investment goals provided by the investor, it uses a series of ‘smart algorithms’ to calculate the optimal investment suggestions.
Robo-advisors directly managed about $19 billion as of December 2014. By 2020 the global assets under management of robo-advisers are forecast to grow to an estimated US$255B.
▪ Non-biased
▪ Low investment threshold
▪ Low starting entry money
▪ Low agent fee
Features:
• Strongly depend on technology, algorithm and financial theory
Harry Markowitz's Modern Portfolio Theory
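Modern Portfolio Theory trades expected return against variance. As a minimal sketch, the textbook two-asset minimum-variance portfolio has a closed form. The function names and the variance and covariance figures below are hypothetical, and this special case is not presented as the algorithm any particular robo-advisor uses.

```python
def min_variance_weights(var_a, var_b, cov_ab):
    """Weights (w_a, w_b) of the two-asset portfolio with the lowest variance.

    Closed form: w_a = (var_b - cov_ab) / (var_a + var_b - 2 * cov_ab).
    Short positions are not ruled out, so a weight may fall outside [0, 1].
    """
    denom = var_a + var_b - 2.0 * cov_ab  # equals Var(a - b)
    if denom <= 0:
        raise ValueError("degenerate inputs: Var(a - b) must be positive")
    w_a = (var_b - cov_ab) / denom
    return w_a, 1.0 - w_a


def portfolio_variance(w_a, var_a, var_b, cov_ab):
    """Variance of a two-asset portfolio with weights w_a and 1 - w_a."""
    w_b = 1.0 - w_a
    return w_a ** 2 * var_a + w_b ** 2 * var_b + 2.0 * w_a * w_b * cov_ab


# Hypothetical annualised figures: a volatile stock fund and a calmer bond fund.
VAR_STOCK, VAR_BOND, COV = 0.04, 0.01, 0.002
W_STOCK, W_BOND = min_variance_weights(VAR_STOCK, VAR_BOND, COV)
MIN_VAR = portfolio_variance(W_STOCK, VAR_STOCK, VAR_BOND, COV)
```

With these made-up inputs the combined portfolio has a lower variance than either fund alone, which is the diversification effect at the core of the theory.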
Advanced Topic 3: Knowledge Graphs
Advanced Topic 4: Advanced Visualization and Platforms
• Visual Exploration of Large Graph in Immersive Environment
• Computer Vision Enhanced Immersive Environment
• Mobile Vision on iOS devices
• Behavior Analysis on iOS devices
• Explainable ML: Visualization of Training Process of Deep Learning
• Explainable ML: Visual Analytics of Interactive Machine Learning
• Autonomous Learning: from Text to Vision
• Autonomous Learning: from Vision and Text to Knowledge
• Machine Reasoning with Large-Scale Bayesian Networks
• Strategic Planning with Game Theoretic Machines
• ML translation to an AI accelerator platform (TensorFlow)
• ML translation to an AI accelerator platform (Caffe)
• Software Tools on Neurosynaptic Chip
• Mapping Suitable Applications on Quantum Computing