Introduction To Big Data and Hadoop


Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. The challenges include capture, storage, search, sharing, transfer, analysis and visualization.

Current social and economic changes create big data. Sharing data spontaneously, instantaneously and constantly through social networking lets us connect across boundaries, and more and more applications are used by individuals and organizations to extract value from data in pursuit of personal and professional goals.

Big data is any data attribute that challenges the constraints of a system's capability or a business need. A simple example: a 10 MB presentation that cannot be shared with our team via email is big data for us.

Some examples of scale:

Google processes 20 PB a day (2008)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN's Large Hadron Collider (LHC) generates 15 PB a year
By 2050, the data generated will be 50 times the current volume

Big data can be divided into three types:

Structured data
Transaction details, system logs, etc.

Unstructured data
Social networking data, weather data, etc.

Semi-structured data
XML files
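The three types above can be illustrated with small samples. The records below are made-up examples, not from any real system; the point is how addressable the data is in each case.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed schema, every record has the same named fields
# (e.g. transaction details in a table or CSV export)
structured = io.StringIO("txn_id,amount\n1001,250.00\n1002,99.95\n")
rows = list(csv.DictReader(structured))
print(rows[0]["amount"])  # fields are addressable by name

# Semi-structured: self-describing tags, but no rigid table schema (e.g. XML)
xml_doc = "<orders><order id='1001'><amount>250.00</amount></order></orders>"
root = ET.fromstring(xml_doc)
print(root.find("order/amount").text)  # navigable, but shape may vary per record

# Unstructured: free text with no schema at all (e.g. a social media post)
post = "Loving the weather today! #sunny"
print(len(post.split()))  # only crude processing, like tokenizing, applies directly
```
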

Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo at the time, named it after his son's toy elephant. Its key properties:

Scalable: It can reliably store and process petabytes.
Economical: It distributes the data and processing across clusters of commonly available computers (in the thousands).
Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.
Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks on failure.
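The "Reliable" property can be sketched in miniature: if each block of a file is replicated across several nodes (HDFS uses a default replication factor of 3), losing a single node loses no data. The node and block names below are invented for illustration; this is a toy model of the placement idea, not the actual HDFS placement policy.

```python
REPLICATION_FACTOR = 3

def place_blocks(blocks, nodes, rf=REPLICATION_FACTOR):
    """Assign each block to rf distinct nodes, round-robin style."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(rf)]
    return placement

def surviving_blocks(placement, failed_node):
    """Blocks still readable after one node fails (at least one copy remains)."""
    return {block for block, replicas in placement.items()
            if any(node != failed_node for node in replicas)}

nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(["blk_a", "blk_b", "blk_c"], nodes)
# With 3 copies per block, any single node failure leaves every block readable.
print(surviving_blocks(placement, "node2"))
```
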

GOOGLE                   APACHE HADOOP
Google MapReduce         Hadoop MapReduce
BigTable                 HBase
Google File System       Hadoop Distributed File System (HDFS)
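The MapReduce model that Hadoop borrows from Google can be sketched in plain Python: a map phase emits key-value pairs, a shuffle groups pairs by key, and a reduce phase aggregates each group. This is a single-process word-count illustration of the model, not the Hadoop API.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in an input line
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate all counts emitted for one word
    return (key, sum(values))

lines = ["big data big hadoop", "hadoop processes big data"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])  # → 3
```

In a real Hadoop job, map and reduce tasks run in parallel on the nodes that hold the data, and the shuffle moves grouped pairs across the network between them.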
