
By Dr. Monica Apte
Ph.D., MCA, PGDCSA, B.Sc. (Maths)
Cambridge University certified,
Oracle WDP, IBM DB2, Big Data certified
Syllabus of Big Data

Unit Name  Topic Name         Topic Description
UNIT-I     Introduction       1.1 Types of Digital Data
UNIT-I     Introduction       1.2 Introduction to Big Data
UNIT-I     Introduction       1.3 Big Data Analytics
UNIT-I     Introduction       1.4 History of Hadoop
UNIT-I     Introduction       1.5 Apache Hadoop
UNIT-I     Introduction       1.6-1.7 Analyzing Data with Unix tools
UNIT-I     Introduction       1.8 Analyzing Data with Hadoop
UNIT-I     Introduction       1.9 Hadoop Streaming
UNIT-II    HDFS               2.1 The Design of HDFS
UNIT-II    HDFS               2.2 HDFS Concepts
UNIT-II    HDFS               2.3 Command Line Interface
UNIT-II    HDFS               2.4 VLAN Trunking Protocol
UNIT-II    HDFS               2.5 Hadoop file system interfaces
UNIT-II    HDFS               2.6 Scope resolution operator
UNIT-II    HDFS               2.7 Data flow
UNIT-II    HDFS               2.8 Data structures
UNIT-III   Map Reduce         3.1 Anatomy of a Map Reduce
UNIT-III   Map Reduce         3.2 Failures, Job Scheduling
UNIT-III   Map Reduce         3.3 Shuffle and Sort
UNIT-III   Map Reduce         3.4 Task Execution
UNIT-III   Map Reduce         3.5 Map Reduce Types
UNIT-III   Map Reduce         3.6 Map Reduce Features
UNIT-IV    Hadoop Eco System  4.1 Introduction to PIG
UNIT-IV    Hadoop Eco System  4.2 Execution Modes of Pig
UNIT-IV    Hadoop Eco System  4.3 Comparison of Pig with Databases
UNIT-IV    Hadoop Eco System  4.4 Grunt
UNIT-IV    Hadoop Eco System  4.5 Pig Latin
UNIT-V     HIVE               5.1 Hive Shell
UNIT-V     HIVE               5.2 Hive Services
UNIT-V     HIVE               5.3 Hive Metastore
UNIT-V     HIVE               5.4 HiveQL
UNIT-V     HIVE               5.5 Tables


• Assignment 1
• Assignment 2
• (Surprise tests I & II)
• Surprise Test 1
• Surprise Test 2
• Quiz
• Case study 
• End-term Examination (Question Paper: NA)



INTRODUCTION

• What is big data?
• How much data does it take to be called big data?
• Where does big data come from?
• Where is the big data trend going?
• Who are the users of big data?


• Big data is a term that describes a large volume of data.
• The data can be:
1. Structured
2. Semi-structured
3. Unstructured


Structured data
Table: Employee_details

Emp_id  Emp_name      Emp_age
101     M. Raghu      34
102     Rakesh Pund   31
103     Meera Bagchi  45


Table: Employee_address

Emp_id  Emp_houseno                          Emp_city  Emp_pincode
101     Shastri colony                       Pune      411096
102     Flat 304, Pethe road                 Nashik    411075
103     House no. 123, near oyster bungalow  Mumbai    445687


Structured data
Table: Student

Roll No (Primary key)  Name      Division  Marks
1                      Suhani    A         78
2                      Neha      A         69
3                      Shobhit   B         64
4                      Khushi    C         71
5                      Manasvi   D         63
6                      Samuel    A         81


Structured data, e.g. an RDBMS
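
As a minimal sketch of how an RDBMS holds structured data like the employee tables, the following uses Python's built-in sqlite3 module with an in-memory database. The choice of SQLite, the column types, the omission of the Emp_houseno column, and the age filter in the query are all illustrative assumptions, not part of the original slides.

```python
import sqlite3

# In-memory SQLite database (an assumption for illustration; any RDBMS would do).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Structured data: every row follows a fixed schema of typed columns.
cur.execute("""
    CREATE TABLE Employee_details (
        Emp_id   INTEGER PRIMARY KEY,
        Emp_name TEXT NOT NULL,
        Emp_age  INTEGER
    )
""")
cur.execute("""
    CREATE TABLE Employee_address (
        Emp_id      INTEGER PRIMARY KEY REFERENCES Employee_details(Emp_id),
        Emp_city    TEXT,
        Emp_pincode TEXT
    )
""")

cur.executemany(
    "INSERT INTO Employee_details VALUES (?, ?, ?)",
    [(101, "M. Raghu", 34), (102, "Rakesh Pund", 31), (103, "Meera Bagchi", 45)],
)
cur.executemany(
    "INSERT INTO Employee_address VALUES (?, ?, ?)",
    [(101, "Pune", "411096"), (102, "Nashik", "411075"), (103, "Mumbai", "445687")],
)

# Both tables share Emp_id, so one query can combine them.
rows = cur.execute("""
    SELECT d.Emp_name, d.Emp_age, a.Emp_city
    FROM Employee_details AS d
    JOIN Employee_address AS a ON a.Emp_id = d.Emp_id
    WHERE d.Emp_age > 32
    ORDER BY d.Emp_id
""").fetchall()

print(rows)  # [('M. Raghu', 34, 'Pune'), ('Meera Bagchi', 45, 'Mumbai')]
```

The fixed schema and the shared key are what make the data "structured": the database knows the type and meaning of every column before any row is stored.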



Semi-structured data
Text data files ready for analysis, stored in Excel and XML files
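
To illustrate what makes XML semi-structured, here is a small sketch using Python's standard xml.etree.ElementTree module. The XML document and its tag names are hypothetical, made up for this example: the tags describe the data, but the records do not all carry the same fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: the second employee has no <age>,
# and the third adds a nested <skills> list the others lack.
xml_text = """
<employees>
  <employee id="101"><name>M. Raghu</name><age>34</age></employee>
  <employee id="102"><name>Rakesh Pund</name></employee>
  <employee id="103"><name>Meera Bagchi</name><age>45</age>
    <skills><skill>Hadoop</skill><skill>Hive</skill></skills>
  </employee>
</employees>
"""

root = ET.fromstring(xml_text)

records = []
for emp in root.findall("employee"):
    records.append({
        "id": int(emp.get("id")),
        "name": emp.findtext("name"),
        # findtext returns None when the element is absent.
        "age": emp.findtext("age"),
        "skills": [s.text for s in emp.findall("skills/skill")],
    })

for record in records:
    print(record)
```

Unlike a database table, nothing forces each record to have the same shape, so the reading code has to cope with missing or extra fields.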



Unstructured data

• In earlier decades, computer data was stored on:
1. Floppy disks
2. CDs
3. Hard disks
• RAM capacity was 512 MB
• Storage capacity is now very high,
• because the amount of data we have is very large


• Storage capacity is now measured in:
1. Petabytes
2. Zettabytes

• We also have cloud storage


How we measure big data
• Units of data, in increasing order of size:
1. Bit: 0 or 1
2. Byte = 8 bits
3. Kilobyte (KB) = 1024 bytes
4. Megabyte (MB) = 1024 KB
5. Gigabyte (GB) = 1024 MB
6. Terabyte (TB) = 1024 GB
7. Petabyte (PB) = 1024 TB
8. Exabyte (EB) = 1024 PB
9. Zettabyte (ZB) = 1024 EB
10. Yottabyte (YB) = 1024 ZB
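
The ladder of units above, where each step is 1024 times the previous one, can be sketched as a pair of small conversion helpers. The function names human_readable and to_bytes are hypothetical, chosen for this example.

```python
# Units in increasing order; each one is 1024 times the one before it.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the last unit.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024

def to_bytes(amount: float, unit: str) -> int:
    """Convert an amount in the given unit back to bytes."""
    return int(amount * 1024 ** UNITS.index(unit))

print(human_readable(1024))               # 1.00 KB
print(human_readable(1536))               # 1.50 KB
print(to_bytes(1, "TB"))                  # 1099511627776
print(human_readable(to_bytes(1, "PB")))  # 1.00 PB
```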



• In general terms, 1 grain of dal is 1 bit
• 8 grains of dal = 1 byte
• 1 cup of dal = 1 kilobyte
• Megabyte = 8 bags of dal
• Gigabyte = 3 trucks of dal
• Terabyte = 2 shipping containers
• Petabyte = one city
• Exabyte = the whole of Asia
• Zettabyte = fills all the oceans
• Yottabyte = the size of the Earth
• And data is increasing every day


Big Data
• Analysts predict that there will be more than 5,200 gigabytes of data
for every person in the world
• On average, people send 5,000 lakh tweets per day
• Walmart processes one million customer transactions per hour
• Amazon sells 600 items per second
• On average, each person who uses email receives 88 emails per day
and sends 34
• This adds up to more than 20,000 crore emails each day
• Mastercard processes 7,400 crore transactions per year
• Commercial airlines make about 5,800 flights per day
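
Several of the figures above are quoted in lakh and crore. As a small worked sketch, the arithmetic below converts them to plain numbers and derives per-second rates. It uses only the figures quoted on this slide; the variable names are hypothetical.

```python
LAKH = 100_000       # 1 lakh  = 10^5
CRORE = 10_000_000   # 1 crore = 10^7

tweets_per_day = 5_000 * LAKH
emails_per_day = 20_000 * CRORE
card_txns_per_year = 7_400 * CRORE

SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400

# Per-second rates derived from the hourly and daily figures.
walmart_txns_per_second = 1_000_000 / SECONDS_PER_HOUR
tweets_per_second = tweets_per_day / SECONDS_PER_DAY

print(f"{tweets_per_day:,} tweets per day")                # 500,000,000
print(f"{emails_per_day:,} emails per day")                # 200,000,000,000
print(f"{card_txns_per_year:,} card transactions a year")  # 74,000,000,000
print(f"{walmart_txns_per_second:.0f} Walmart transactions per second")
print(f"{tweets_per_second:.0f} tweets per second")
```

In international units, that is 500 million tweets and 200 billion emails per day, and 74 billion card transactions per year.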

Volume
• Organisations collect data from a variety of sources,
• including business transactions, social media, and information from
sensors or machine-to-machine data.
• To store this voluminous data we have a technology called Hadoop
• The prominent feature of any dataset is its size.
• Volume refers to the size of data generated and stored in a Big Data
system.
• The size of the data is in the petabyte and exabyte range.
• These massive amounts of data necessitate the use of advanced
processing technology,
• far more powerful than a typical laptop or desktop CPU.
• As examples of massive-volume datasets,
• think about
• LinkedIn, Instagram or Twitter.
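
To get a rough feel for why petabyte-scale data outgrows a single machine, the sketch below estimates how long it would take just to read 1 PB once. The read rate of 200 MB/s per machine and the cluster size of 1000 machines are assumptions picked for illustration, and the even split across machines is an idealisation.

```python
MB = 1024 ** 2
PB = 1024 ** 5

def scan_seconds(num_bytes, bytes_per_second, machines=1):
    """Time to read num_bytes once, split evenly across machines (ideal case)."""
    return num_bytes / (bytes_per_second * machines)

# Assumption: each machine reads sequentially at 200 MB/s.
ASSUMED_READ_RATE = 200 * MB

single = scan_seconds(1 * PB, ASSUMED_READ_RATE)
cluster = scan_seconds(1 * PB, ASSUMED_READ_RATE, machines=1000)

print(f"one machine  : {single / 86_400:.1f} days")   # 62.1 days
print(f"1000 machines: {cluster / 3_600:.1f} hours")  # 1.5 hours
```

Under these assumptions a single machine needs about two months for one pass over the data, which is the motivation for spreading storage and processing over many machines.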



• People spend a lot of time posting pictures, commenting, liking posts,
playing games, etc.
• With this ever-growing data,
• there is huge potential for analysis
• and for finding patterns


Variety
• Variety refers to the different types of data, which vary in format
and in how they are organized and prepared for processing.
• Big names such as Facebook, Twitter, Pinterest, Google Ads, and CRM
systems produce data that can be
1. Collected
2. Stored
3. Subsequently analyzed


Velocity
• The rate at which data accumulates also influences whether the data is
classified as big data or regular data.
• Much of this data must be evaluated in real time; therefore, systems
must be able to handle the pace and amount of data created.
• The speed at which data is generated means that more and more data
becomes available than before,
• which also implies that the velocity of data processing needs to be
just as high.


Value
• Value is another major aspect worth considering.
• It is not only the amount of data that we keep or process that is
important.
• The data must also be
• Valuable
• Reliable
• Such data must be saved, processed, and evaluated to get insights.


Veracity
• Veracity refers to the trustworthiness and quality of the data.
• If the data is not trustworthy and/or reliable, then the value of Big
Data becomes questionable.
• This is especially true when working with data that is updated in
real time.
• Therefore, data authenticity requires checks and balances at every
level of Big Data collection and processing.
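
One way to picture such "checks and balances" is a validation step applied to records as they arrive. The following is only a sketch: the field names reuse the Employee_details columns, while the rules themselves (no missing or duplicate Emp_id, age between 18 and 100) and the names validate and ValidationReport are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ValidationReport:
    accepted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)  # (record, reason) pairs

def validate(records):
    """Apply simple quality checks and separate trusted from suspect records."""
    report = ValidationReport()
    seen_ids = set()
    for rec in records:
        emp_id = rec.get("Emp_id")
        age = rec.get("Emp_age")
        if emp_id is None:
            report.rejected.append((rec, "missing Emp_id"))
        elif emp_id in seen_ids:
            report.rejected.append((rec, "duplicate Emp_id"))
        elif not isinstance(age, int) or not 18 <= age <= 100:
            report.rejected.append((rec, "implausible Emp_age"))
        else:
            seen_ids.add(emp_id)
            report.accepted.append(rec)
    return report

incoming = [
    {"Emp_id": 101, "Emp_age": 34},
    {"Emp_id": 101, "Emp_age": 34},  # same id seen twice
    {"Emp_id": 102, "Emp_age": -5},  # age outside the assumed range
    {"Emp_age": 45},                 # no id at all
]

report = validate(incoming)
print(len(report.accepted), "accepted")
for rec, reason in report.rejected:
    print("rejected:", reason, rec)
```

Keeping the rejected records together with a reason, rather than silently dropping them, lets the quality of each data source be reviewed later.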



• The world around us is continuously changing; we now live in a data-
driven era.
• From social media posts to the pictures we upload, big data
applications are everywhere.
• Since Big Data is being created on a massive scale, it could become an
important asset for many companies and organizations, helping them
to come up with new insights and enhance their businesses.
