
Big Data Intro

AcademyOfData.com

What is Big Data?


Big Data Borat: everything that doesn’t fit in Excel (2013)
AcademyOfData.com

My favourite definition…
HBase in Action, Nick Dimiduk & Amandeep Khurana

“Big Data is a fundamentally different way of thinking about data and how it’s used to
drive business value. In the past we had OLTP systems for recording transactions and OLAP
systems for analyzing them. Traditionally, not much was done to understand the reasons behind
the transactions, what factors contributed to them taking place the way they did, or to come
up with insights.”

To come up with these insights, though, you need loads of data!


AcademyOfData.com

History of Big Data

1997: Michael Lesk estimates 12,000 PB of data in the world, increasing 10x each year. Google search is launched.

1999: The term “Big Data” is mentioned: huge amounts of data are being stored with no idea how to analyze them. The term “IoT” appears, as the number of devices online keeps growing.

2000: Peter Lyman & Hal Varian estimate 1.5 billion GB of storage in the world, equivalent to 250 MB/person.

2001: Doug Laney of Gartner defines the 3 V’s (volume, velocity, variety) that will become the basic definition of Big Data.

2003: Michael Lewis publishes Moneyball: The Art of Winning an Unfair Game.

2004: Oren Etzioni launches Farecast.

2007: Wired brings Big Data to the masses with “The Data Deluge Makes the Scientific Method Obsolete”.

2008: The world’s servers process 9.57 zettabytes (9.57 trillion gigabytes) of information, i.e. 12 GB/person/day; 14.7 exabytes of new information are produced this year.

2008-2009: Google Flu Trends.

2011: Eric Schmidt: every 2 days we create as much data as was created up until 2003. 12 million RFID tags sold; 13 billion connected devices.

2013: The amount of stored information in the world is estimated at 1,200 exabytes, less than 2% of it non-digital.

Big Data:
high (and affordable) storage and computation are needed, but also a mindset shift

2003 Moneyball, 2004 Farecast, 2008 Google Flu Trends

“Data became a raw material of business, a vital economic input, used to create a new form
of economic value. Data was no longer regarded as static or stale, something whose usefulness
was finished once the purpose for which it was collected was achieved.”

Big Data: A Revolution That Will Transform How We Live, Work, and Think, Kenneth Cukier &
Viktor Mayer-Schönberger

“Big Data refers to things one can do at a large scale that cannot be done at a smaller one,
to extract new insights or create new forms of value, in ways that change markets,
organizations, the relationships between citizens and governments, and more.”
AcademyOfData.com

The V’s

Velocity:
In 1 minute: 100 hours of video uploaded to YouTube, 200 million
emails sent, 20 million photos viewed & 30k uploaded on Flickr,
300k tweets sent, 2.5 million queries on Google.

Volume:
By 2020, we will have 50 times the amount of data we had in 2011.
Airplanes: 2.5 billion TB each year.
Self-driving cars: 2 PB/year.

Variety:
90% of the data generated by organisations is unstructured.

Big Data examples


Predicting trends (apply math to huge quantities of data in order to infer probabilities):

- Trend forecasting: social media posts and web browsing habits, fraud analysis
- Sentiment analysis

Personalization: web site personalization & recommendations for shoppers

Reducing product waste

Tesco, Walmart, Macy’s… but is it only for the big players?

Honest Coffee (Revive Vending): unmanned coffee shops in the UK

TfL (Transport for London) personalizes subway advertising based on mobile traveller profiles.



Big Data examples


- What do stores sell when Hurricane Sandy approaches? Flashlights and emergency equipment, but also
  Strawberry Pop-Tarts.
- Sales of a product declining? No time to sit and analyze 1-2 months’ worth of data and customer
  behavior. Real-time analytics showed the product had a pricing error.
- Novelty cookies: monitoring sales showed one store with no sales => the cookies were not even on display.
- Forecasting which trees are going to fall in New York.

Data sources
New York Stock Exchange generates about 4-5 terabytes of data/day.

Facebook hosts more than 240 billion photos, growing at 7 petabytes per month.

Ancestry.com, the genealogy site, stores around 10 petabytes of data.

The Internet Archive stores around 18.5 petabytes of data.

The Large Hadron Collider near Geneva produces around 30 petabytes of data/year.

Digital streams produced by individuals: e.g. Microsoft Research’s MyLifeBits project, an experiment where
an individual’s interactions (email, photos, calls…) were captured electronically and stored for later access.
The data gathered was about 1 GB/month.

Machine generated data: logs, RFID readers, sensors, vehicle GPS traces, retail transactions.

Data formats - Unstructured/Semi-structured/Structured


Structured data (SQL format) : 5-10% of the data

Semi-structured data (CSV, XML, JSON, NoSQL): 5-10% of the data

- does have some organizational properties that make it easier to analyze.

Unstructured data: ~80% of the data (e-mail messages, word processing documents, videos, photos, audio
files, presentations, webpages, logs, ..)

- while it may have an internal structure, it is still considered “unstructured” because the data it
  contains doesn’t fit neatly in a database
- “unstructured” most of the time actually means flexibly structured data: data with a flexible
  structure can have fields added, changed or removed over time, and can even vary amongst
  concurrently ingested records.
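The flexible structure described above can be sketched in a few lines. This is a toy illustration (the records, field names, and defaults are made up): three concurrently ingested JSON records share no fixed schema, and a small normalizer projects them onto a common set of fields.

```python
import json

# Flexibly structured records, as one might pull from an event stream:
# fields appear, disappear, and differ from record to record.
raw_records = [
    '{"user": "ana", "action": "click"}',
    '{"user": "bob", "action": "view", "device": "mobile"}',
    '{"user": "eve"}',
]

def normalize(line, defaults=None):
    """Parse one semi-structured JSON record, filling in absent fields."""
    if defaults is None:
        defaults = {"action": "unknown", "device": "unknown"}
    # Record values win over defaults; missing fields fall back to defaults.
    return {**defaults, **json.loads(line)}

rows = [normalize(r) for r in raw_records]   # all rows now share the same keys
```

This is exactly why such data doesn’t fit neatly in a fixed relational table: the set of columns is only known record by record.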

File Formats - it’s not just CSV any more


Self describing formats

- schema or structure is embedded in the data itself.


- metadata such as element names, data types, compression/encoding scheme used (if any), statistics,
and a lot more.
- Databases: Cassandra, HBase
- Data formats: Parquet, Avro, JSON, XML
Schema Evolution

- What it is: adding/renaming/removing columns


- In real life, data is always in flux
- New data is added to an event stream
- The business has changed a field/column name
- Data formats that can evolve: Avro & Parquet (Parquet only allows adding columns at the end, though)
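A minimal sketch of how schema evolution plays out at read time. This is a hypothetical pure-Python helper, not Avro’s actual API, but it mimics Avro’s schema-resolution rule: when a record was written before a column existed, the reader fills in that column’s declared default.

```python
# Reader schema: ordered (field, default) pairs. 'country' was added later,
# so it carries a default; old records won't have it.
READER_SCHEMA = [
    ("id", None),
    ("name", None),
    ("country", "unknown"),   # column added after the first records were written
]

def resolve(record):
    """Project a record written with an older schema onto the reader schema."""
    out = {}
    for field, default in READER_SCHEMA:
        if field in record:
            out[field] = record[field]
        elif default is not None:
            out[field] = default          # added column: use its default
        else:
            raise ValueError(f"missing required field: {field}")
    return out

old_record = {"id": 1, "name": "ana"}     # written before 'country' existed
new_record = {"id": 2, "name": "bob", "country": "RO"}
```

Old and new records resolve to the same shape, which is what lets a stream keep flowing while the business adds fields.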

File Formats - Row/Columnar stores


Columnar storage format (e.g. Parquet)

- Eliminates I/O for columns that are not part of the query, so it works well for queries that
  require only a subset of columns.
- Buffers rows in memory, so even with MapReduce it needs memory for the writers to buffer each block.
- Provides better compression, as similar data is grouped together in the columnar format.
- File sizes are smaller than row-oriented formats (e.g. a timestamp column can record a start value and deltas).

With a row-based storage format (e.g. CSV), loading specific columns can be done in two ways, and
neither is optimal:

- Loading complete rows into memory from disk and then filtering out the desired columns.
- Seeking specifically to the blocks where the desired column’s data lives for each row.
  Since disk seeks are very expensive in terms of performance, this can be really undesirable.
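The row-vs-columnar contrast above can be shown with plain in-memory structures (the data is invented for illustration): summing one column over a row store touches every full row, while the columnar layout touches only the one list. The delta-encoding aside is sketched too.

```python
# Row store: each row is contiguous; reading one column still drags
# every other field of every row through memory/disk.
row_store = [
    ("2024-01-01", "ana", 19.99),
    ("2024-01-01", "bob", 5.00),
    ("2024-01-02", "ana", 12.50),
]

# Columnar store: each column is contiguous; a query over 'amount'
# never touches 'date' or 'user' -- the eliminated I/O from the slide.
col_store = {
    "date":   ["2024-01-01", "2024-01-01", "2024-01-02"],
    "user":   ["ana", "bob", "ana"],
    "amount": [19.99, 5.00, 12.50],
}

total_row = sum(r[2] for r in row_store)   # scans whole rows
total_col = sum(col_store["amount"])       # scans exactly one column

# Similar values grouped together also compress well, e.g. sorted
# timestamps stored as a start value plus small deltas:
ts = [1000, 1005, 1007, 1012]
deltas = [ts[0]] + [b - a for a, b in zip(ts, ts[1:])]   # start + deltas
```

Both layouts give the same answer; only the amount of data touched differs.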

Data Analytics
Batch: running queries on a schedule. You already know what your questions are; these are standard
questions you want answered regularly. Batch is a constraint created by the type of data: mostly a
huge volume of data sitting still. Extremely useful for exploratory analytics.

Interactive: you don’t know your questions in advance (interactive relates to the speed of asking
questions). You ask ad-hoc questions and, based on the answer, you ask the next question. Here,
speed of execution is a must.

Streaming: processing data as it comes in (relates to the speed of data movement). Data is
continuously flowing and you want to get insights as events are happening.

Still data, but sparse (a.k.a. NoSQL): needed where the data is large but arrives in very small
increments, spread out, and is constantly updated. NoSQL is what fits the need here.
Big Data questions
1. Different formats of data: how do we unify them?
2. Where to store the data and should we store all of it?
3. How to store it?
4. How to search info in all the data?
5. How to make sense of it?
6. How to take decisions in real time based on data?

How much money do we have and how critical is this data? => Hadoop

Real time analytics => Spark

Consistency at all times? => SQL. Too expensive and not scalable enough? => NoSQL
Lambda Architecture
Big Data and the change in the data models
- Data is unstructured
- Data is not correlated
- No schema for the entire data
- Huge amount of data, unpredictable in growth sometimes

Data-driven models are not OK any more => query-driven model =>

Build a model to answer the questions, then decide on the data needed


NoSQL - not another SQL / not only SQL / no SQL / ?
It all started with a hashtag #nosql and a meetup to discuss new databases (2009);

NoSQL = more than one storage mechanism that could be used based on the needs of the application:
key/value stores, document databases, graph databases;

● Not using the relational model


● Running well on clusters, distributed database system => simpler horizontal scaling
● Mostly open-source
● Commodity hardware
● Schema-less
NoSQL databases categories
Key-value databases: pairs of keys and values (blobs), on disk or memory storage

- Memcache, Riak, Redis, Amazon DynamoDB, Couchbase


- Key-value stores work well for shopping cart contents, or individual values like a landing
  page URI or a default account number.
- Great performance; can be easily scaled
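A toy sketch of the key-value access pattern named above (keys and values invented for illustration; real stores like Redis or Memcache speak this same get/put shape over the network): the store sees only opaque values under string keys, e.g. a whole shopping cart as one value.

```python
# Toy in-memory key-value store: the database knows nothing about
# the structure of the values -- they are opaque blobs to it.
store = {}

def put(key, value):
    store[key] = value

def get(key, default=None):
    return store.get(key, default)

put("cart:user:42", ["sku-101", "sku-205"])   # the whole cart is one value
put("landing:user:42", "/promo/summer")       # an individual value, as on the slide

cart = get("cart:user:42", [])                # read-modify-write of the blob
cart.append("sku-999")
put("cart:user:42", cart)
```

Because each operation is a single hash lookup by key, this model scales horizontally with very little coordination, which is where the performance claim comes from.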

Document databases: an advanced form of key-value store where the value is not an opaque blob but a
well-known document format, usually XML, JSON or BSON.

- Documents are self-describing, hierarchical tree data structures which can consist of maps,
collections, and scalar values.
- RavenDB, Apache CouchDB, MarkLogic and MongoDB (rich query language)
NoSQL databases categories
Column family databases: an evolution of the key-value store where the value part contains a collection
of columns. Each row in a column family has a key and associates an arbitrary number of columns with it.
This is useful for accessing related data together.

- Apache Cassandra, HBASE and Hypertable


- Works well for data warehouses and business intelligence, customer relationship management (CRM),
  library card catalogs, time series, etc.
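A toy sketch of the column-family layout described above (row keys, family names, and values are invented): row key -> family -> {column: value}, with each row free to carry a different, arbitrary number of columns, and one family fetched as a unit.

```python
# Toy column-family table, HBase/Cassandra style: rows are keyed,
# columns live inside named families, and rows need not agree on columns.
table = {
    "user#42": {
        "profile": {"name": "Ana", "city": "Iasi"},
        "metrics": {"2024-01-01": 3, "2024-01-02": 7},  # time series: dates as columns
    },
    "user#43": {
        "profile": {"name": "Bob"},                     # fewer columns: perfectly fine
    },
}

def read(row_key, family):
    """Fetch all columns of one family together -- related data in one read."""
    return table.get(row_key, {}).get(family, {})
```

The time-series fit is visible in the `metrics` family: new timestamps simply become new columns on the same row.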

Graph databases: a graph database uses a graph structure to store data, letting you store entities
and establish relationships between them. Designed for data whose relations are well represented as
a graph: interconnected elements with an undetermined number of relations between them.
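A toy property-graph sketch of the idea above (the nodes, relation names, and traversal helper are all invented for illustration; a real graph database such as Neo4j stores and indexes this shape natively): entities are nodes, relationships are explicitly stored edges, and queries are traversals.

```python
# Entities (nodes) with properties, and typed relationships (edges).
nodes = {
    "ana":  {"type": "person"},
    "bob":  {"type": "person"},
    "acme": {"type": "company"},
}
edges = [
    ("ana", "WORKS_AT", "acme"),
    ("bob", "WORKS_AT", "acme"),
    ("ana", "KNOWS", "bob"),
]

def neighbors(node, relation=None):
    """Follow outgoing edges, optionally filtered by relation type."""
    return [dst for src, rel, dst in edges
            if src == node and (relation is None or rel == relation)]
```

A relational database would need join tables and multi-way joins for the same traversal; here the relationships are first-class data.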
CAP Theorem
(Brewer’s theorem for distributed systems)
