EUC1502 Module5 Big-Data

Analysis of Big Data
and other sources

Outline
1. Introduction to big data

2. A survey on tools
3. Data storage in depth
4. Data processing
5. Practice:
a. Word count with Spark
b. Graph analysis with Neo4J
Introduction to Big Data
There are different working areas in big data:
● Data storage
● Data processing
● Data mining
● Data visualisation
● Business Intelligence Systems
A Survey on Tools
- Data storage
DOCUMENTS: MongoDB, CouchDB, Riak
KEY/VALUE: Riak, Voldemort, Redis, Memcached, Membase, DynamoDB
COLUMNS: Google Bigtable, HBase, Cassandra, Sybase IQ, Hypertable
GRAPHS: FlockDB, OrientDB, AllegroGraph, Neo4J
A Survey on Tools
- Data processing
BATCH
● Acquisition: HDFS commands, Sqoop, Flume
● Storage: HDFS, HBase
● Analysis: MapReduce, Spark, Spark SQL, Hive, Pig, Cascading
STREAMING
● Acquisition: Flume
● Storage: Kafka, Kestrel, RabbitMQ, AWS SQS
● Analysis: Storm, Trident, Spark Streaming, Samza
HYBRID
● Lambda, Kappa, Summingbird, Lambdoop, Apache Flink

A Survey on Tools
- Data mining
OPEN: SPSS, Weka, RapidMiner, Mahout, Gate, NLTK, KNIME, OpenNN, Scikit-learn, Carrot2, R, Torch
PROPRIETARY: RapidMiner, IBM Watson, SAS Enterprise Miner, Statistica Data Miner, Oracle Data Miner, Microsoft Analysis Services, LIONsolver, Clarabridge
A Survey on Tools
- Data visualisation
Vis.js D3.js
CartoDB Plot.ly
Tableau QlikView
R HighCharts
A Survey on Tools
- Business Intelligence
Pentaho Actuate
SpagoBI JasperReports
Tableau QlikView
Palo Tactic
IBM Cognos MicroStrategy
Microsoft PowerBI Plot.ly

Data Storage in Depth
- SQL vs. NoSQL
Limitations of SQL databases:
● Fixed structure and integrity restrictions
● Inefficiency with large numbers of insertions, modifications and deletions
● High complexity when modelling real-life relationships
NoSQL databases:
● NoSQL = Not only SQL
● Store large volumes of data in short periods of time
- NoSQL types
There are basically four types of NoSQL databases, although some of
them share characteristics of more than one type:
● Document oriented: The basic unit is the document (e.g.
XML, JSON, …)
● Key/Value: Any object identified by a key and described by
a set of attributes (values). Also known as hash stores
● Column oriented: Data are stored in tables with
families of predefined columns, which facilitates OLAP operations
● Graph databases: They store not only objects but also the
relationships among them, forming graphs of information
- Document oriented
● The basic unit is the document

● A document can have an arbitrary number of fields
● Each field can be of different type and size
● Each field can store multiple values
● Examples of documents are XML, JSON, or similar
● Document databases do not require a fixed document schema
● Each document can have different fields from the other documents in
the database
● Security is assigned at document level
● Full-text search capabilities with high performance
- Document oriented
● JSON document example

● Unlike the key/value model, the id is
part of the document
● Full-text search is provided over
the whole document
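The points above can be illustrated with a minimal Python sketch. It is not the API of any particular document database: the `collection` list, the `_id` field name and the `full_text_search` helper are assumptions made for illustration only.

```python
import json

# Hypothetical in-memory "collection". Each document is a dict, the id is
# stored inside the document, and the two documents have different fields.
collection = [
    {"_id": 1, "title": "Graph Databases",
     "authors": ["Robinson", "Webber", "Eifrem"]},
    {"_id": 2, "title": "Hadoop in Action", "year": 2011,
     "tags": ["batch", "mapreduce"]},
]


def full_text_search(docs, term):
    """Return the ids of documents whose serialised text contains the term.

    The whole document (field names and values) is searched, ignoring case.
    A real engine would use an inverted index instead of a linear scan.
    """
    term = term.lower()
    return [doc["_id"] for doc in docs if term in json.dumps(doc).lower()]
```

For example, `full_text_search(collection, "webber")` finds the first document through a value nested inside its `authors` field.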
- Key/value stores
● Stores that can hold any kind of information of any type
● Objects are identified by a unique key
● Objects are defined by an arbitrary set of attributes
● There is neither structure nor restrictions
● They are also known as hash stores
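As a rough illustration of the key/value model, the sketch below wraps a Python dictionary. The `KeyValueStore` class and the key names are hypothetical and do not correspond to the client of any specific product.

```python
class KeyValueStore:
    """Toy key/value store: values are opaque and have no schema."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Any value type is accepted; the store never inspects it.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        # Deleting a missing key is silently ignored.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:42", {"name": "Alice", "langs": ["en", "es"]})
store.put("session:abc", b"\x00\x01")  # raw bytes are just as valid
```

The only way to reach an object is through its key, which is what makes the model simple to distribute but unsuitable for queries over the values.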
- Column oriented
● Unlike SQL databases, which are organised as rows, column-oriented
databases are organised around columns
● Tables are defined as families of columns
● It is easy to implement OLAP operations
○ Drill-down, roll-up, slice and dice, pivot
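A small Python sketch can show why a column layout suits OLAP operations. The `sales` table and the `roll_up` and `slice_rows` helpers are invented for illustration and greatly simplify what a real column store does.

```python
from collections import defaultdict

# Column layout: one list per column, aligned by position.
sales = {
    "region": ["north", "north", "south", "south"],
    "product": ["a", "b", "a", "b"],
    "amount": [10, 20, 30, 40],
}


def roll_up(table, by, measure):
    """Aggregate a measure by one dimension, reading only two columns."""
    totals = defaultdict(int)
    for key, value in zip(table[by], table[measure]):
        totals[key] += value
    return dict(totals)


def slice_rows(table, column, value):
    """Keep only the positions where the given column equals the value."""
    keep = [i for i, v in enumerate(table[column]) if v == value]
    return {name: [col[i] for i in keep] for name, col in table.items()}
```

Here `roll_up(sales, "region", "amount")` never touches the `product` column, which is the main benefit of storing columns separately.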
- Graph databases
Relational databases lack relationships
● Bob’s friends
● Alice’s friends-of-friends
What about big data?
- Graph databases
NoSQL databases also lack relationships
Relationships can be emulated by aggregated fields, but:
- They must be maintained (updated and deleted)
programmatically.
- Aggregated links are not reflexive: there is no pointer
backwards (e.g. to know who bought a product).
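The second limitation can be sketched in Python. The `customers` data and both helper functions are hypothetical. The forward link is a direct lookup, whereas the backward question has to scan every document.

```python
# Each customer document aggregates the products it bought (forward link).
customers = {
    "alice": {"purchases": ["book", "lamp"]},
    "bob": {"purchases": ["book"]},
}


def products_bought_by(customer):
    """Forward direction: a single lookup by key."""
    return customers[customer]["purchases"]


def buyers_of(product):
    """Backward direction: no stored link, so every document is scanned."""
    return sorted(
        name for name, doc in customers.items()
        if product in doc["purchases"]
    )
```

Keeping a second, reversed index would avoid the scan, but then both structures have to be kept consistent by application code.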
- Graph databases
A graph is a collection of vertices representing entities and
edges representing the relationships among them.
In a property graph both nodes and relationships can have
properties.
A graph data model means that data are modelled as such a graph.
A (property) graph database is an online database management
system with Create, Read, Update and Delete methods that
expose a (property) graph data model.
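The definitions above can be made concrete with a minimal in-memory sketch in Python. The `PropertyGraph` class, its method names and the sample nodes are assumptions for illustration, not the interface of Neo4J or any other product.

```python
class PropertyGraph:
    """Toy property graph with Create, Read, Update and Delete methods."""

    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = []  # (source id, relationship type, target id, properties)

    def create_node(self, node_id, **props):
        self.nodes[node_id] = dict(props)

    def create_edge(self, source, rel_type, target, **props):
        self.edges.append((source, rel_type, target, dict(props)))

    def read_neighbours(self, source, rel_type):
        """Ids of nodes reached from source through rel_type edges."""
        return [t for s, r, t, _ in self.edges if s == source and r == rel_type]

    def update_node(self, node_id, **props):
        self.nodes[node_id].update(props)

    def delete_node(self, node_id):
        # Removing a node also removes every edge attached to it.
        del self.nodes[node_id]
        self.edges = [e for e in self.edges if node_id not in (e[0], e[2])]


graph = PropertyGraph()
graph.create_node(1, name="Harry")
graph.create_node(2, name="Sally")  # hypothetical second node
graph.create_edge(1, "FOLLOWS", 2, since=2015)
```

Both the nodes and the edge carry their own properties, which is what distinguishes a property graph from a plain graph.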
- Graph databases
Property graph
● Relationship with a property whose value is “Follows”
● Node with a property whose value is “Harry”
- Graph databases
Cypher is an expressive graph database query language.
Cypher is designed to be easily read and understood by
developers, database professionals and business stakeholders.
The key to Cypher is that it lets us find data that matches a
specific pattern, following our intuition of describing graphs with
diagrams.
- Graph databases
[Figure: a Cypher pattern annotated with its nodes, the relation type and direction, and the separation among subgraphs]
- Graph databases
The simplest query:
- a START clause followed by MATCH and RETURN clauses

- Graph databases
- START: specifies the starting point(s) in the graph (e.g.
nodes or relationships).
- MATCH: describes the pattern by example, using
characters to represent nodes and relationships, in order to
draw the data we are interested in.
- RETURN: defines the nodes, relationships and/or attributes
that should be returned.
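To make the three clauses tangible without a running database, the following Python sketch emulates them over a plain adjacency dictionary. It is not Cypher and does not use any graph database driver. The `friends` data and the `friends_of_friends` function are hypothetical.

```python
# Hypothetical FRIEND relationships, stored as directed adjacency lists.
friends = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Dave"],
    "Carol": ["Dave", "Erin"],
    "Dave": [],
    "Erin": [],
}


def friends_of_friends(graph, start):
    # START: choose the node where the traversal begins.
    a = start
    results = []
    # MATCH: follow the pattern (a)-[:FRIEND]->(b)-[:FRIEND]->(c).
    for b in graph[a]:
        for c in graph[b]:
            # RETURN: each distinct c, in the order it is first reached.
            if c != a and c not in results:
                results.append(c)
    return results
```

The nested loops play the role of the two relationship hops drawn in the pattern. A graph database performs the same traversal by following stored relationships instead of joining tables.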
- Graph databases
OTHER CYPHER CLAUSES
- WHERE: provides criteria for filtering.
- CREATE (UNIQUE): creates nodes and relationships.
- DELETE: removes nodes, relationships and properties.
- SET: sets property values on nodes and relationships.
- FOREACH: performs an updating action for each element of a
list.
- UNION: merges results from different queries.
- WITH: pipes results from one part of a query to the next.
Data Processing
- Types
BATCH (volume), STREAMING (velocity), HYBRID (both)
● Batch processing for large volumes of information (e.g. DNA
sequencing)
● Streaming processing for rapidly generated data (e.g. Twitter)
● Hybrid processing for large volumes of rapidly generated data (e.g. in-depth
analysis of tweets)
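The difference between the first two types can be sketched in Python. Both functions are illustrative assumptions: the batch version needs the complete data set before it answers, while the streaming version updates its answer after every incoming item.

```python
from collections import Counter


def batch_count(hashtags):
    """Batch: the whole data set is available, one result at the end."""
    return Counter(hashtags)


def streaming_count(hashtag_stream):
    """Streaming: yield an updated snapshot after each incoming item."""
    running = Counter()
    for tag in hashtag_stream:
        running[tag] += 1
        yield dict(running)
```

A hybrid design would combine both, for instance by serving the streaming snapshot while a periodic batch job recomputes exact totals.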
Data Processing
- Processing steps
DATA ACQUISITION → DATA STORAGE → DATA ANALYSIS
Data Processing
- Types
In-depth analysis of a Twitter stream
https://www.youtube.com/watch?v=YrqMEn-5Pi8
- Retrieve and store
- Evolution: tweets/second, tweets/minute, tweets/hour, tweets/day
- Words and topics
- Labelling
- Hashtags
- People
- Locations
- Brands
- Polarity, stance
- Users, relationships
- Gender, age
- Author profile
- ...
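The evolution of tweet volume per second, minute, hour and day can be approximated by bucketing timestamps. This is only a sketch: the `tweets_per` function and its string bucket keys are assumptions, and a real pipeline would read the timestamps from the stored tweets.

```python
from collections import Counter
from datetime import datetime

# Bucket key format for each time unit.
FORMATS = {
    "second": "%Y-%m-%d %H:%M:%S",
    "minute": "%Y-%m-%d %H:%M",
    "hour": "%Y-%m-%d %H",
    "day": "%Y-%m-%d",
}


def tweets_per(timestamps, unit):
    """Count tweets per time unit, keyed by the truncated timestamp."""
    return Counter(ts.strftime(FORMATS[unit]) for ts in timestamps)
```

Truncating the timestamp to a string keeps the sketch short. Coarser or sliding windows would need explicit window arithmetic instead.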

Data Processing
- Batch processing
Map/Reduce paradigm:
● Map: The Map process divides the data into subsets and sends them to the
processing nodes in key-value format <K, V>
● Reduce: Each node returns its result in key/list-of-values format <K, L(V)>,
and the results are combined to produce the final result
Example of counting words in a text:
● Map: A line of text is sent to each node, where the key K is the line number
and the value V is the line of text, <nline, text>. The result of the task is a list
of pairs <word, 1>, one for each word in the text.
● Reduce: It collects the outputs of all the Map processes as <key, value> pairs,
here <word, 1>, and groups them into pairs <word,
occurrences> by adding up the ones for each word
Data Processing
- Batch processing
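function Map (key, values) {
    // key: line number; values: the text of that line
    for each word w in values {
        emit (w, 1)
    }
}
function Reduce (word, list_of_values) {
    // list_of_values: all the 1s emitted for this word
    total = 0
    for each value v in list_of_values {
        total += v
    }
    return (word, total)
}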
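The pseudocode above can be tried out with a single-process Python sketch. It only simulates the paradigm: the `shuffle` step, which a framework such as Hadoop performs between the two phases, is written out by hand, and all function names are assumptions.

```python
from collections import defaultdict


def map_phase(line_number, text):
    """Map: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group the emitted values by key, giving <K, L(V)>."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, values):
    """Reduce: add up the ones collected for a single word."""
    return word, sum(values)


def word_count(lines):
    pairs = []
    for line_number, text in enumerate(lines):
        pairs.extend(map_phase(line_number, text))
    return dict(reduce_phase(w, vs) for w, vs in shuffle(pairs).items())
```

In a real cluster each call to `map_phase` and `reduce_phase` could run on a different node, because neither depends on shared state.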
Data Processing
- Batch processing
ACQUISITION → STORAGE → PROCESSING

Data Processing
- Stream processing
[Figure © autoritas Cosmos-intelligence]
ACQUISITION → STORAGE → PROCESSING
Kestrel, Trident
Data Processing
- Hybrid processing
SUMMINGBIRD
