Big Data and Storage Systems


Module 1

Introduction to Big Data & Distributed Storage
Mandatory Reading before Next Class


1. Scalable problems and memory-bounded speedup, Sun and Ni, JPDC, 1993
2. The Google File System, Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, ACM SOSP, 2003
Big Data Concepts
What, Where, Why?

What is Big Data?

Image credits: http://www.seekbig.in/1128-tnpsc-economics-questions/
The term is fuzzy … Handle with care!

Wordle of “Thought Leaders’” definition of Big Data, © Jennifer Dutcher, 2014


https://datascience.berkeley.edu/what-is-big-data/
Data Generation View


“Big data refers to the approach to data of
‘collect now, sort out later’…The low cost of
storage and better methods of analysis mean
that you generally don’t need to have a specific
purpose for the data in mind before you collect
it.”

Rohan Deuskar, CEO and Co-Founder, Stylitics

Image Credits: https://community.uservoice.com/wp-content/uploads/benefits-of-effective-questions-800x448-300x168.jpg
Data Systems View


“Big data is when your business wants to use data to
solve a problem, answer a question, produce a
product, etc., but the standard, simple methods
break down on the size of the data set, causing time,
effort, creativity, and money to be spent crafting a
solution to the problem that leverages the data
without simply sampling or tossing out records.”
John Foreman, Chief Data Scientist, MailChimp

Image Credits: http://heae.tk/question/
Data Analysis View


“While the use of the term is quite nebulous …
I’ve understood “big data” to be about analysis
for data that’s really messy or where you don’t
know the right questions or queries to make —
analysis that can help you find patterns,
anomalies, or new structures amidst otherwise
chaotic or complex data points.”

Philip Ashlock, Chief Architect, Data.gov


Image Credits:http://www.clipartpanda.com/categories/question-and-answer-clipart
So…What is Big Data?

Data whose characteristics exceed the capabilities of conventional algorithms, systems and techniques to derive useful value.

https://www.oreilly.com/ideas/what-is-big-data
Image Credits: https://community.uservoice.com/wp-content/uploads/benefits-of-effective-questions-800x448-300x168.jpg

So, where does Big Data come from?
Desktop & Mobile Web Users

https://gs.statcounter.com/platform-market-share/desktop-mobile-tablet/worldwide/#monthly-200901-202111
Facebook Active Users

Almost 3 billion users each month


https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/
Instagram Users across the World (Oct 2021)

July 2020

https://www.statista.com/statistics/578364/countries-with-most-instagram-users/
Internet Activity during COVID

The Global Internet Phenomena Report: COVID-19 Spotlight. May 2020, Sandvine
Big Data and Science

https://www.sciencemag.org/news/2017/07/ai-changing-how-we-do-science-get-glimpse
FinTech: UPI Transactions

Figure: Monthly UPI Transactions, Dec-14 to Jun-20, Amount (Rs. in Billion) and Volume (in Millions), with Demonetization marked.
Figure: Monthly NPCI Retail Payment Transactions, Volume (Millions) and Value (Billions of Rs.), with COVID marked.

https://www.npci.org.in/statistics
Government: GSTN

https://www.gstn.org/
IoT and Big Data

Number of IoT devices at 7 billion in 2018


(excludes smartphones, tablets, laptops or fixed line phones)
https://iot-analytics.com/
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Big Data Landscape
Resilience and Vibrancy: The 2020 Data & AI Landscape, Matt Turck,
https://mattturck.com/data2020/
The 2021 Machine Learning, AI and Data (MAD) Landscape,
https://mattturck.com/data2021/

Data Science

Inter-disciplinary domain at the intersection of data analysis methods, Big Data Systems and data-driven applications.

Figure: Venn diagram with Data Science at the intersection of Methods (AI, ML), Big Data Systems (Systems, Platforms) and Big Data Applications.
Data Analysis Lifecycle

• Acquire: Acquire Data
    • Sensors, Web logs & crawls, Transactions
• Goal: Define Analysis & Analytics
    • Trends, Clusters, Outliers, Classification
• Process: Translate to Scalable Applications
    • Develop algorithms, Map to abstractions, Implement on Platforms
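
The three lifecycle stages can be sketched end to end on a toy scale. This is only an illustration under stated assumptions: the transaction amounts are made up, "outlier" is assumed to mean a value more than two population standard deviations from the mean, and the function names `acquire`, `find_outliers` and `process` are hypothetical. A real pipeline would map the Process step onto a scalable platform rather than plain Python.

```python
import statistics


def acquire():
    # Acquire: stand-in for sensors, web logs or transactions.
    # These amounts are invented for the example.
    return [120, 95, 110, 105, 5000, 98, 102, 115]


def find_outliers(values, threshold=2.0):
    # Goal: flag outliers, assumed here to be values whose distance from
    # the mean exceeds `threshold` population standard deviations.
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


def process(values):
    # Process: run the analysis. On real data volumes this step would be
    # translated to a platform abstraction instead of a local function call.
    return find_outliers(values)


outliers = process(acquire())
print(outliers)
```

Keeping the goal (`find_outliers`) separate from the execution step (`process`) mirrors the slide: the analysis is defined first, and how it is scaled is a separate decision.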

OPEN SOURCE PLATFORMS
Top Paying Technologies ☺
The topics you are learning in this module are the top-3 paying databases and frameworks globally!

https://survey.stackoverflow.co/2022/#top-paying-technologies-other-frameworks-and-libraries
Big Data Platform Stack, Spark Flavor

Different Abstractions

Process Data

Manage System

Store Data

https://www.oreilly.com/library/view/data-analytics-with/9781491913734/ch04.html
Additional Reading



▷ A Survey of Big Data Research, H. Fang, et al., IEEE Network, September/October 2015, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4617656/
▷ Beyond the hype: Big data concepts, methods, and analytics, A. Gandomi and M. Haider, International Journal of Information Management, Volume 35, Issue 2, 2015, https://doi.org/10.1016/j.ijinfomgt.2014.10.007
▷ Uncertainty in big data analytics: survey, opportunities, and challenges, R.H. Hariri, et al., J Big Data 6, 44 (2019), https://doi.org/10.1186/s40537-019-0206-3

Distributed Systems
Helping Data Science Scale

Vertical and Horizontal Scaling

▷ How do you get better performance for an application?


○ Better algorithm
○ Better programming
○ Better hardware
▷ Vertical Scaling (Scale up)
○ Adding more hardware resources to an existing machine
■ Faster CPU, more cores, more memory
○ Simpler: little to no constraint on the application, whether multi-threaded or loosely coupled
○ Bounded above by the hardware limits of a single machine
▷ Horizontal Scaling (Scale Out)
○ Adding more machines
○ Cumulatively more CPU cores and memory
○ Applications need to be designed to use multiple machines
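
What "designed to use multiple machines" means can be sketched with a word-count style example. This is a minimal single-process simulation, not a distributed implementation: each list in `partitions` stands in for the local data of one machine, and the names `machine_for` and `scale_out_count` are hypothetical. The point is the shape of the design: partition the data, compute on each partition independently, then merge.

```python
import zlib
from collections import Counter


def machine_for(key, num_machines):
    # A stable hash (crc32) so the same key always lands on the same machine.
    return zlib.crc32(key.encode("utf-8")) % num_machines


def scale_out_count(records, num_machines):
    # Step 1: partition the records; each list simulates one machine's data.
    partitions = [[] for _ in range(num_machines)]
    for key in records:
        partitions[machine_for(key, num_machines)].append(key)

    # Step 2: every "machine" counts only its own partition.
    partial_counts = [Counter(part) for part in partitions]

    # Step 3: merge the partial results into the final answer.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return partitions, total


records = ["upi", "web", "iot", "upi", "web", "upi"]
partitions, total = scale_out_count(records, num_machines=3)
print(dict(total))
```

Adding machines here only changes `num_machines`; the application logic stays the same because it was written around partitions from the start.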
Scale Up (Faster CPUs/Clockspeed)

Figure: Moore’s Law and Dennard Scaling. Credit: Jeffrey Funk, NUS
Scale Up (More CPU cores)
Figure labels: AMD EPYC Zen, ARM big.LITTLE, Nvidia CUDA cores

▷ Multi/Many-core processing
○ Multiple cores on single machine
■ Shared local memory space
■ Cores may be heterogeneous
○ Independent execution on each core
■ A single thread/process does not run faster
○ Multiple processes on a desktop/laptop
○ Multi-threaded applications on a server
▷ Has a flavor of scale out
○ Applications may need to be redesigned to exploit parallelism
○ May be able to increase to 100s of cores
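
The redesign needed to exploit multiple cores can be sketched as follows. This is a hedged example, not a benchmark: the function names `split_into_chunks`, `sum_of_squares` and `parallel_sum_of_squares` are hypothetical, and the workload is deliberately trivial. A single call to `sum_of_squares` runs no faster on a many-core machine; the gain only appears once the work is split into chunks that separate processes can execute independently.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(chunk):
    # The serial kernel: runs on one core regardless of how many exist.
    return sum(x * x for x in chunk)


def split_into_chunks(n, workers):
    # Split range(n) into at most `workers` contiguous chunks.
    step = max(1, (n + workers - 1) // workers)
    return [range(start, min(start + step, n)) for start in range(0, n, step)]


def parallel_sum_of_squares(n, workers):
    # Each chunk is handed to a separate process, then results are combined.
    chunks = split_into_chunks(n, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))


if __name__ == "__main__":
    n = 1_000_000
    workers = os.cpu_count() or 1
    # The parallel version must agree with the serial one.
    assert parallel_sum_of_squares(n, workers) == sum_of_squares(range(n))
```

Processes are used rather than threads because, in standard CPython, threads running pure Python code do not execute on several cores at once. The partition-compute-combine structure is the same one used when scaling out across machines, which is why multi-core scale up "has a flavor of scale out".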

https://www.nextplatform.com/2017/06/20/competition-returns-x86-servers-epyc-fashion/
https://www.macrumors.com/guide/m1/
https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/arm-dynamiq-technology-for-the-next-era-of-compute
Apple M1
