1. Big Data Intro
AcademyOfData.com
My favourite definition…
HBase in Action, Nick Dimiduk & Amandeep Khurana
“Big Data is a fundamentally different way of thinking about data and how it’s used to
drive business value. In the past we had OLTP systems recording transactions, and OLAP systems.
Traditionally not much was done to understand the reasons behind the transactions,
what factors contributed to these taking place the way they did, or to come up with insights.”
A short timeline of Big Data milestones:
- 1997: Michael Lesk estimates 12k PB of data in the world, increasing 10X each year
- 1998: Google search is launched
- 2000: Peter Lyman & Hal Varian estimate 1.5 billion GB of storage created, equivalent to 250 MB/person
- 2004: Oren Etzioni launches Farecast
- 2007: Wired brings Big Data to the masses: "The Data Deluge Makes the Scientific Model Obsolete"
- 2008-2009: Google Flu Trends
- 2011: Eric Schmidt: every 2 days the amount of data created equals the amount created until 2003; 12 million RFID tags sold; 13 billion connected devices
Big Data: high (and affordable) storage and computation are needed, but also a mindset shift.
“Data became a raw material of business, a vital economic input, used to create a new form
of economic value. Data was no longer regarded as static or stale, its usefulness
finished once the purpose for which it was collected was achieved.”
Big Data: A Revolution That Will Transform How We Live, Work, and Think, Kenneth Cukier &
Viktor Mayer-Schönberger
“Big Data refers to things one can do at a large scale that cannot be done at a smaller one, to
extract new insights or create new forms of value, in ways that change markets,
organizations, the relationship between citizens and governments, and more.”
The V’s
Velocity:
In 1 minute: 100 hours of video uploaded to YouTube, 200 million
emails sent, 20 million photos viewed & 30k uploaded on Flickr,
300k tweets sent, 2.5 million queries on Google.
Volume:
By 2020, we will have 50 times the amount of data we had in 2011.
Airplanes: 2.5 billion TB each year.
Self-driving cars: 2 PB/year.
- Trend forecasting: social media posts and web browsing habits, fraud analysis
- Sentiment analysis
Data sources
The New York Stock Exchange generates about 4-5 terabytes of data per day.
Facebook hosts more than 240 billion photos, growing at 7 petabytes per month.
The Large Hadron Collider near Geneva produces around 30 petabytes of data per year.
Digital streams produced by individuals: e.g. Microsoft Research’s MyLifeBits project, an experiment where
an individual’s interactions (email, photos, calls, ..) were captured electronically and stored for later access.
The data gathered was 1GB/month.
Machine generated data: logs, RFID readers, sensors, vehicle GPS traces, retail transactions.
Semi-structured data (CSV, XML, JSON, NoSQL): 5-10% of the data
Unstructured data: ~80% of the data (e-mail messages, word processing documents, videos, photos, audio
files, presentations, webpages, logs, ..)
- may have an internal structure, but is still considered « unstructured » because the data it contains
doesn’t fit neatly in a database
- “Unstructured” actually means, most of the time, flexibly structured data. Data with a flexible structure
can have fields added, changed or removed over time, and these can even vary amongst concurrently ingested
records.
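A minimal sketch of what "flexibly structured" means in practice, using hypothetical JSON records (the field names are invented for illustration): each record is self-describing, fields may appear or disappear between records, and a schema-on-read step discovers the structure instead of enforcing it up front.

```python
import json

# Hypothetical flexibly structured records: one JSON document per line,
# with fields varying from record to record (names invented for this sketch).
raw_records = [
    '{"user": "alice", "action": "login", "ts": 1700000000}',
    '{"user": "bob", "action": "upload", "ts": 1700000042, "file": "a.png"}',
    '{"user": "carol", "ts": 1700000099}',  # "action" field missing entirely
]

records = [json.loads(line) for line in raw_records]

# Schema-on-read: discover the union of fields actually present,
# rather than requiring every record to match a fixed table definition.
all_fields = sorted({key for rec in records for key in rec})
print(all_fields)  # ['action', 'file', 'ts', 'user']
```

This is exactly the kind of data that does not fit neatly in a fixed relational schema but is far from structureless.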
Column based storage:
- Eliminates I/O for columns that are not part of the query, so it works well for queries which require only
a subset of columns.
- Buffers rows in memory, so even with MapReduce, memory is needed for writers to buffer each block.
- Provides better compression, as similar data is grouped together in the columnar format.
- File sizes are smaller than row-oriented formats (e.g. a timestamp column could record a start value and deltas).
In a row based storage format (e.g. CSV), loading specific columns can be done in two ways, and neither is
optimal:
- Loading complete rows into memory from disk and then applying functions that filter out
the unneeded columns from the loaded data.
- Specifically seeking to the blocks where the desired column’s data is available for each row.
Since disk seeks are very expensive in terms of performance, this can be really undesirable.
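The difference can be sketched with a toy in-memory example (illustrative only, not a real columnar file format): the same table stored row-wise and column-wise, plus the start-plus-deltas timestamp trick mentioned above.

```python
# The same tiny table in both layouts (invented data for illustration).
rows = [
    {"ts": 1000, "user": "a", "amount": 5},
    {"ts": 1003, "user": "b", "amount": 7},
    {"ts": 1007, "user": "a", "amount": 2},
]

# Row layout: summing one column still touches every full record.
total_row = sum(r["amount"] for r in rows)

# Column layout: each column is contiguous, so a query that needs only
# "amount" reads just that one list (no I/O for the other columns).
columns = {
    "ts": [1000, 1003, 1007],
    "user": ["a", "b", "a"],
    "amount": [5, 7, 2],
}
total_col = sum(columns["amount"])

# Delta encoding of a sorted timestamp column: store the start value and
# small deltas, which compress far better than the raw values.
ts = columns["ts"]
encoded = [ts[0]] + [b - a for a, b in zip(ts, ts[1:])]
print(total_row, total_col, encoded)  # 14 14 [1000, 3, 4]
```

Real columnar formats add column chunks, encodings and statistics on top, but the I/O and compression argument is the same.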
Data Analytics
Batch: running queries in a scheduled way. You already know what your questions are; these are
standard questions you want answered regularly. Batch is a constraint created by the type of data,
mostly a huge volume of data sitting still. Extremely useful for exploratory analytics.
Interactive: you don’t know your questions in advance (interactive relates to the speed of
asking questions). You ask ad-hoc questions; then, based on the answer, you ask the next question. Here,
speed of execution is a must.
Streaming: processing data as it comes in (streaming relates to the speed of data movement). Data
is continuously flowing, and you want to get insights as events are happening.
Still data, but sparse (a.k.a. NoSQL): needed where data is large, but arrives in very small
increments, is spread out, and is constantly updated. NoSQL is what fits the need here.
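The batch vs. streaming distinction above can be sketched in plain Python with hypothetical click events (no real streaming framework involved): batch runs one known query over data sitting still, while streaming keeps a running result as each event arrives.

```python
# Invented event data for illustration.
events = [{"user": "a", "clicks": 3},
          {"user": "b", "clicks": 1},
          {"user": "a", "clicks": 2}]

# Batch: the full dataset sits still; run the known query over all of it.
def batch_total(dataset):
    return sum(e["clicks"] for e in dataset)

# Streaming: process each event as it comes in, keeping a running result
# so an insight is available while the data is still flowing.
def streaming_totals(stream):
    running = 0
    for event in stream:
        running += event["clicks"]
        yield running  # updated answer after every event

print(batch_total(events))             # 6
print(list(streaming_totals(events)))  # [3, 4, 6]
```

Both end at the same total; the difference is when the answer becomes available.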
Big Data questions
1. Different formats of data - how do we unify them?
2. Where do we store the data, and should we store all of it?
3. How do we store it?
4. How do we search for info in all the data?
5. How do we make sense of it?
6. How do we take decisions in real time based on data?
How much money do we have, and how critical is this data? => Hadoop
Consistency needed at all times? => SQL => too expensive and not so scalable? => NoSQL
Lambda Architecture
Big Data and the change in the data models
- Data is unstructured
- Data is not correlated
- No schema for the entire data
- Huge amount of data, sometimes unpredictable in growth
Data-driven models are not OK any more => query-driven models =>
NoSQL = more than one storage mechanism that can be used based on the needs of the application:
key/value stores, document databases, column family databases, graph databases.
NoSQL database categories
Document databases: an advanced form of key/value store where the value part is not an opaque blob but a
well-known document format, usually XML, JSON or BSON.
- Documents are self-describing, hierarchical tree data structures which can consist of maps,
collections, and scalar values.
- Examples: RavenDB, Apache CouchDB, MarkLogic and MongoDB (rich query language).
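A minimal sketch of the document model, using an invented in-memory dict as the store (not the API of any real document database): the key maps to a self-describing document made of maps, collections and scalar values.

```python
# Toy document store: key -> document, as in an advanced key/value store.
# All names and fields here are invented for illustration.
store = {}

store["user:42"] = {
    "name": "Ada",
    "emails": ["ada@example.com"],               # collection
    "address": {"city": "London", "zip": "N1"},  # nested map
    "active": True,                              # scalar value
}

# Because documents are self-describing, a query can inspect the structure
# at read time instead of relying on a fixed, up-front schema.
doc = store["user:42"]
city = doc.get("address", {}).get("city")
print(city)  # London
```

A real document database adds indexing and a query language on top of exactly this key-to-document shape.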
Column family databases: an evolution of the key/value store where
the value part contains a collection of columns. Each row in a column family has a key and associates an
arbitrary number of columns with it. This is useful for accessing related data together.
Graph databases: a graph database is one which uses a graph structure to store data. Graph databases
enable you to store entities and establish relationships between those entities. They are designed for data whose
relations are well represented as a graph: elements which are interconnected, with an
undetermined number of relations between them.
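A toy adjacency-list sketch of the graph model (invented names, not a real graph database API): entities are nodes, and relationships are stored explicitly as typed edges, so a relationship query walks edges directly instead of joining over foreign keys.

```python
# Entities (nodes) with properties, and typed relationships (edges).
nodes = {"alice": {"kind": "person"},
         "bob": {"kind": "person"},
         "acme": {"kind": "company"}}
edges = [
    ("alice", "KNOWS", "bob"),
    ("alice", "WORKS_AT", "acme"),
    ("bob", "WORKS_AT", "acme"),
]

# Relationship-first query: "who works at acme?" scans the stored edges
# directly; no join over foreign keys as a relational model would need.
def neighbours(rel, target):
    return sorted(src for src, r, dst in edges if r == rel and dst == target)

print(neighbours("WORKS_AT", "acme"))  # ['alice', 'bob']
```

Real graph databases index these edges so such traversals stay fast even with an undetermined, growing number of relations.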
CAP Theorem
(Brewer’s theorem for distributed systems)