Professional Documents
Culture Documents
MITRES LL 005F12 Lec3
MITRES LL 005F12 Lec3
MITRES LL 005F12 Lec3
Jeremy Kepner
This work is sponsored by the Department of the Air Force under Air Force Contract
#FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are
those of the authors and are not necessarily endorsed by the United States Government.
D4M-1
Outline
• Introduction
– Webolution
– As is, is OK
– D4M
• Technologies
• Results
• Demo
• Summary
D4M-2
Primordial Web
Kepner & Beaudry 1992, Visual Intelligence Corp (now GE Intelligent Platforms)
Language:
• Browser GUI? HTTP for files? Perl for analysis? SQL for data?
• A lot of work just to view data.
• Won’t catch on.
D4M-3
Cambrian Web
Browser (html): http put Server (http): SQL Database (sql):
Language:
• Browser GUI? HTTP for files? Perl for analysis? SQL for data?
• A lot of work to view a little data.
• Won’t catch on.
D4M-4
Modern Web
Game (data): http put Server (http): java Database (triples):
Language:
• Game GUI! HTTP for files? Perl for analysis? Triples for data!
• A lot of work to view a lot of data.
• Great view. Massive data.
D4M-5
Modern Web
Game (data): http put Server (http): java Database (triples):
Language:
• Game GUI! HTTP for files? Perl for analysis? Triples for data!
• A lot of work to view a lot of data. Missing middle.
• Great view. Massive data.
D4M-6
Future Web?
Game (data): graphs Server (files): graphs Database (triples):
graphs graphs
portal portal
Language:
• Game GUI! Fileserver for files! D4M for analysis! Triples for data!
• A little work to view a lot of data. Securely.
• Great view. Massive data.
D4M-7
D4M: “Databases for Matlab”
Query:
Alice
Bob E
Cathy D
David
Earl
A D4M query returns a sparse matrix
or graph from Cloudbase…
D4M-8
Outline
• Introduction
• Technologies
– Hardware
– Cloud software
– Associative Arrays
• Results
• Demo
• Summary
D4M-9
What is LL Grid?
Service Nodes Compute Nodes Cluster Switch
Network
Users Storage
Resource Manager
FAQs
Configuration
Server
Web Site
LLAN
Courtesy of The MathWorks, Inc. Used with permission. MATLAB and the L-shaped membrane logo are registered trademarks of
The MathWorks, Inc. Other product or brand names may be trademarks or registered trademarks of their respective holders.
D4M-10
Why is LLGrid easier to use?
Universal Parallel Matlab programming
Amap = map([Np 1],{},0:Np-1); Jeremy Kepner
D4M-12
The Big Four Cloud Ecosystems
Enterprise Supercomputing
IaaS PaaS
- Interactive - High performance
- On-demand - Parallel Languages
- Elastic - Scientific computing
PaaS SaaS
- Java - Indexing
- Map/Reduce - Search
- Easy admin - Security
Big Data DBMS
IaaS PaaS
- Interactive - High performance
- On-demand - Parallel Languages
- Elastic - Scientific computing
PaaS SaaS
- Java - Indexing
- Map/Reduce - Search
- Easy admin - Security
Big Data DBMS
IaaS PaaS
- Interactive - High performance
- On-demand - Parallel Languages
- Elastic - Scientific computing
MapReduce
PaaS SaaS
- Java - Indexing
- Map/Reduce - Search
- Easy admin - Security
Big Data DBMS
B
A
High Level Composable API: Array
C
Algebra
D4M (“Databases for Matlab”)
E
Distributed
Distributed Database: Database/
Distributed File
Accumulo/HBase (triple store) System
• Combining Big Compute and Big Data enables entirely new domains
D4M-16
Hadoop Architecture Overview
Hadoop cluster
D4M-17
Hadoop: Strengths and Weaknesses
• What works well
– Distributed processing of large data
Indexing log files
Sorting data
– Multi-user environments
Not easy to provide fair-share control on their use of Hadoop cluster
– Non-Java programmers
Takes time to learn the parallel programming API for Java
D4M-18
LLGrid_MapReduce Architecture
2
Scheduler
D4M-19
Outline
• Introduction
• Technologies
– Hardware
– Cloud software
– D4M
• Results
• Demo
• Summary
D4M-20
Multi-Dimensional Associative Arrays
• Extends associative arrays to 2D and mixed data types
A('alice ','bob ') = 'cited '
or A('alice ','bob ') = 47.0
cited
alice alice bob
D4M-21
Composable Associative Arrays
• Key innovation: mathematical closure
– all associative array operations return associative arrays
• Introduction
• Technologies
• Results
– Benchmark performance
– Facet search
– Management and monitoring
• Demo
• Summary
D4M-24
Stats Diagram
D4M-25
UNCLASSIFIED // FOR OFFICIAL USE ONLY
Stats Diagram
A = double(logical(T(r,:)))
A = A(:,'src_ip/ *,domain/ *,dest_ip/ *,')
D4M-27
Count
www.llbean.com
www.cellstores.com
nepatriots.mpl.miisolutions.net
i.cdn.turner.com
app.pohmtk.com
ad.doubleclick.net
ac.stapleslink.com
• Very easy to get elementary count info necessary for finding clutter and anomalies
D4M-28
UNCLASSIFIED // FOR OFFICIAL USE ONLY
Covariance
dest_ip domain src_ip
dest_ip
domain
src_ip
D4M-29
D4M-30
Facet Search Algorithm
Alice
Carl
Bob
IMF
UN
DC
NY
A(x,y) : SNxM R
a.txt
b.doc • Facets y1=UN, y2=Carl
c.pdf
d.htm • Documents that contain both
e.ppt
f.txt A(:,y1) & A(:,y2)
g.doc
1 2 12 • Entity counts in the above set of
documents obtained via matrix multiply
D4M-31
Outline
• Introduction
• Technologies
• Results
• Demo
• Summary
D4M-32
Reuters Corpus V1 (NIST)
1996-08-20 to 1997-08-19 (Released 2000-11-03
• 810,000 Reuters news blurbs
• Picked 70,000 and found 13,000 entities
• A is a 70Kx13K associative array with
500K entries
D4M-35
Example Code & Assignment
• Assignment
– None
D4M-36
MIT OpenCourseWare
https://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.