LAKIREDDY BALI REDDY COLLEGE OF ENGINEERING
(AUTONOMOUS)
Accredited by NAAC with 'A' Grade & NBA (Under Tier-I), ISO 9001:2015 Certified Institution
Approved by AICTE, New Delhi and Affiliated to JNTUK, Kakinada
L.B. REDDY NAGAR, MYLAVARAM, NTR DIST., A.P.-521 230.
hodcse@lbrce.ac.in, cseoffice@lbrce.ac.in, Phone: 08659-222933, Fax: 08659-222931
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
UNIT-1
Introduction to Big Data
TOPICS:-
Types of Digital Data, Classification of Digital Data, Characteristics of Data, Evolution of Big Data, Definition of Big Data, Challenges with Big Data, What is Big Data?, Other Characteristics of Data Which are not Definitional Traits of Big Data, Why Big Data?, Analyzing Data with Unix Tools, Analyzing Data with Hadoop, Hadoop Streaming, Hadoop Ecosystem.
Semi-Structured Data:-
Semi-structured data is data that does not conform to a data model but has some structure. It lacks a fixed or rigid schema. It is data that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, we can store it in a relational database.
Characteristics of semi-structured Data:
Data does not conform to a data model but has some structure.
Data cannot be stored in the form of rows and columns as in databases.
Semi-structured data contains tags and elements (metadata), which are used to group data and describe how the data is stored.
Similar entities are grouped together and organized in a hierarchy.
Entities in the same group may or may not have the same attributes or properties.
It does not contain sufficient metadata, which makes automation and management of data difficult.
The size and type of the same attribute may differ across entities in a group.
Due to the lack of a well-defined structure, it cannot be used by computer programs easily.
Sources of semi-structured Data:
E-mails
XML and other markup languages
Binary executables
TCP/IP packets
Zipped files
Integration of data from different sources
Web pages
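For illustration, here is a minimal Python sketch (the records and field names are made up) showing why such data is called semi-structured: each record is self-describing through its tags/keys, yet no two records are forced to share exactly the same attributes.

    import json

    # Two records with some shared structure but no rigid schema: the second
    # record has extra fields and omits another -- typical semi-structured data.
    records = '''
    [
      {"name": "Asha", "dept": "CSE", "email": "asha@example.com"},
      {"name": "Ravi", "dept": "CSE", "phone": "9999999999", "skills": ["Java", "Hadoop"]}
    ]
    '''

    for rec in json.loads(records):
        # The keys act as metadata, so we can still analyze each record even
        # though the attributes differ from record to record.
        print(rec.get("name"), "->", rec.get("email", "no email on file"))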
Unstructured Data:-
Unstructured data is data that does not conform to a data model and has no easily identifiable structure, so it cannot be used by a computer program easily. Unstructured data is not organized in a pre-defined manner and does not have a pre-defined data model; thus, it is not a good fit for a mainstream relational database.
Characteristics of Unstructured Data:
Data neither conforms to a data model nor has any structure.
Data cannot be stored in the form of rows and columns as in databases.
Data does not follow any semantics or rules.
Data lacks any particular format or sequence.
Data has no easily identifiable structure.
Due to the lack of identifiable structure, it cannot be used by computer programs easily.
Sources of Unstructured Data:
Web pages
Images (JPEG, GIF, PNG, etc.)
Videos
Memos
Reports
Word documents and PowerPoint presentations
Surveys
Advantages of Unstructured Data:
It supports data that lacks a proper format or sequence.
The data is not constrained by a fixed schema.
It is very flexible due to the absence of a schema.
Data is portable.
It is very scalable.
It can deal easily with heterogeneous sources.
This type of data has a variety of business intelligence and analytics applications.
Disadvantages Of Unstructured data:
It is difficult to store and manage unstructured data due to the lack of schema and structure.
Indexing the data is difficult and error-prone due to the unclear structure and absence of pre-defined attributes, so search results are not very accurate.
Ensuring the security of the data is a difficult task.
Problems faced in storing unstructured data:
It requires a lot of storage space.
It is difficult to store videos, images, audio, etc.
Due to the unclear structure, operations like update, delete, and search are very difficult.
Storage cost is high compared to structured data.
Indexing unstructured data is difficult.
Classification of Digital Data:-
1. Structured Data :
It is the data that conforms to a fixed, pre-defined schema and can be stored in rows and columns, as in a relational database table.
2. Unstructured Data :
It is defined as data that does not follow a pre-defined standard or any organized format. This kind of data is also not a good fit for a relational database, because a relational database expects data in a pre-defined, organized form. Unstructured data is also very important for the big data domain, and there are many platforms, such as NoSQL databases, to manage and store it.
Examples –
Word, PDF, text, media logs, etc.
3. Semi-Structured Data :
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, you can store it in a relational database, though this is very hard for some kinds of semi-structured data.
Example –
XML data.
Features of Data Classification :
The main goal of organizing data is to arrange it in such a form that it becomes readily available to users. Its basic features are as follows.
Homogeneity – The data items in a particular group should be similar to each other.
Clarity – There must be no confusion about the positioning of any data item in a particular group.
Stability – The data item set must be stable, i.e., repeated investigation should not change the classification.
Elastic – One should be able to change the basis of classification as the purpose of classification changes.
Characteristics of Data (5 V's):-
Volume
Volume refers to the unimaginable amounts of information generated every second from social media, cell phones, cars, credit cards, M2M sensors, images, video, and more. We currently use distributed systems to store data in several locations, brought together by a software framework like Hadoop.
Facebook alone generates billions of messages, records around 4.5 billion clicks of the "like" button, and sees over 350 million new posts uploaded each day. Such a huge amount of data can only be handled by big data technologies.
Variety
As discussed before, big data is generated in multiple varieties. Compared to traditional data like phone numbers and addresses, the latest trend of data is in the form of photos, videos, audio, and much more, making about 80% of the data completely unstructured. Structured data is just the tip of the iceberg.
Veracity
Veracity basically means the degree of reliability that the data has to offer. Since a major part of the data is unstructured and irrelevant, big data systems need a way to filter such data out or transform it, as trustworthy data is crucial for business decisions.
Value
Value is the major issue that we need to concentrate on. It is not just the amount of data that we store or process; it is the amount of valuable, reliable, and trustworthy data that needs to be stored, processed, and analyzed to find insights.
Velocity
Last but not least, velocity plays a major role compared to the others; there is no point in investing so much only to end up waiting for the data. So, a major aspect of big data is to provide data on demand and at a faster pace.
KEY TAKEAWAYS
Big data is a great quantity of diverse information that arrives in increasing volumes and
with ever-higher velocity.
Big data can be structured (often numeric, easily formatted and stored) or unstructured
(more free-form, less quantifiable).
Nearly every department in a company can utilize findings from big data analysis, but
handling its clutter and noise can pose problems.
Big data can be collected from publicly shared comments on social networks and websites,
voluntarily gathered from personal electronics and apps, through questionnaires, product
purchases, and electronic check-ins.
Big data is most often stored in computer databases and is analyzed using software
specifically designed to handle large, complex data sets.
Challenges in Handling Big Data:-
Challenge #1: Insufficient understanding and acceptance of big data
Solution: To ensure big data understanding and acceptance at all levels, IT departments need to organize numerous training sessions and workshops.
Challenge #2: Confusing variety of big data technologies
Solution: If you are new to the world of big data, seeking professional help is the right way to go.
Challenge #3: Paying loads of money
Solution: Resorting to data lakes, optimizing algorithms (if done properly), and migrating to cloud platforms can all save money.
Challenge #4: Complexity of managing data quality (will all the data be useful for analysis?)
Solution: Compare data to the single point of truth; match records and merge them if they relate to the same entity.
Challenge #5: Dangerous big data security holes
Solution: Consider data security from the very start, at the stage of designing the solution's architecture.
Challenge #6: Tricky process of converting big data into valuable insights
(There is a dearth of skilled professionals who possess a high level of proficiency in data science.)
Solution: The idea here is that you need to create a proper system of factors and data sources, whose analysis will bring the needed insights, and ensure that nothing falls out of scope.
Challenge #8: Another challenge is deciding on the retention period for big data.
Hadoop Streaming:-
Hadoop Streaming is a utility or feature that comes with a Hadoop distribution and allows developers or programmers to write MapReduce programs using different programming languages like Ruby, Perl, Python, C++, etc. We can use any language that can read from standard input (STDIN), such as keyboard input, and write using standard output (STDOUT). We all know that the Hadoop framework is completely written in Java, but programs for Hadoop do not necessarily need to be coded in the Java programming language. The Hadoop Streaming feature has been available since Hadoop version 0.14.1.
How Hadoop Streaming Works
The flow shown in the dotted block of the Hadoop Streaming diagram is a basic MapReduce job. In it, we have an input reader, which is responsible for reading the input data and producing the list of key-value pairs. We can read data in .csv format, in delimited format, from a database table, as image data (.jpg, .png), as audio data, etc. The only requirement for reading all these types of data is that we have to create a particular input format for that data with these input readers. The input reader contains the complete logic about the data it is reading. Suppose we want to read an image: then we have to specify that logic in the input reader so that it can read the image data and finally generate key-value pairs for that image data.
If we are reading image data, then we can generate a key-value pair for each pixel, where the key is the location of the pixel and the value is its color value (0-255 per channel for a colored image). This list of key-value pairs is fed to the Map phase, and the Mapper works on each of these per-pixel key-value pairs to generate intermediate key-value pairs, which are fed to the Reducer after shuffling and sorting; the final output produced by the Reducer is then written to HDFS. This is how a simple MapReduce job works.
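As an illustrative sketch only (not Hadoop code), the per-pixel key-value idea can be written in a few lines of Python. The Pillow imaging library and the file name photo.jpg are assumptions made for the example.

    from PIL import Image  # Pillow, assumed installed (pip install Pillow)

    # key = (x, y) location of the pixel, value = its (R, G, B) color, 0-255 each.
    # A real Hadoop input reader would produce these pairs inside a custom
    # input format; here we just materialize them in plain Python.
    img = Image.open("photo.jpg").convert("RGB")
    width, height = img.size

    pixel_pairs = []
    for y in range(height):
        for x in range(width):
            pixel_pairs.append(((x, y), img.getpixel((x, y))))

    print(pixel_pairs[0])  # e.g. ((0, 0), (142, 87, 203))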
Now let's see how we can use different languages like Python, C++, and Ruby with Hadoop for execution. We can run such arbitrary languages by running them as separate processes. For that, we create our external mapper and run it as a separate external process. These external map processes are not part of the basic MapReduce flow. The external mapper takes input from STDIN and produces output on STDOUT. As the key-value pairs are passed to the internal mapper, the internal mapper process sends these key-value pairs over STDIN to the external mapper, where we have written our code in some other language, such as Python. These external mappers then process the key-value pairs, generate intermediate key-value pairs on STDOUT, and send them back to the internal mappers.
Similarly, the Reducer does the same thing. Once the intermediate key-value pairs have passed through the shuffle and sort process, they are fed to the internal reducer, which sends these pairs via STDIN to the external reducer processes working separately, gathers the output generated by the external reducers via STDOUT, and finally stores the output in HDFS.
This is how Hadoop Streaming works; it is available by default in Hadoop, and we simply use the feature by supplying our own external mappers and reducers. Now we can see how powerful a feature Hadoop Streaming is: anyone can write code in the language of their choice.
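To make this concrete, below is the classic word-count job written as an external mapper and reducer in Python, reading from STDIN and writing to STDOUT exactly as described above. This is a minimal sketch: the file names and HDFS paths are illustrative, and the exact path of the streaming jar depends on the installed Hadoop version.

mapper.py:

    #!/usr/bin/env python3
    # Reads raw text lines from STDIN and emits one "word<TAB>1" pair per word.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

reducer.py:

    #!/usr/bin/env python3
    # Receives the mapper output sorted by key and sums the counts per word.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        if not line.strip():
            continue
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

The pipeline can be tested locally, since sort approximates Hadoop's shuffle-and-sort step:

    cat input.txt | python3 mapper.py | sort | python3 reducer.py

A typical cluster submission (after making both scripts executable with chmod +x) looks like:

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /user/student/input.txt \
        -output /user/student/wordcount-out \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py -file reducer.py

The -file options ship the scripts to the cluster nodes (newer Hadoop versions prefer the generic -files option).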
Hadoop Ecosystem:-
Note: Apart from the components described below, there are many other components that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one term, i.e., data. That is the beauty of Hadoop: it revolves around data, which makes its synthesis easier.
HDFS:
HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing
large data sets of structured or unstructured data across various nodes and thereby maintaining
the metadata in the form of log files.
HDFS consists of two core components i.e.
1. Name node
2. Data Node
The Name Node is the prime node and contains metadata (data about data), requiring comparatively fewer resources than the Data Nodes, which store the actual data. These Data Nodes are commodity hardware in the distributed environment, undoubtedly making Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of the system.
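For a feel of how HDFS is used in practice, a few common hdfs dfs shell operations are shown below; the paths and file names are illustrative.

    hdfs dfs -mkdir -p /user/student            # create a directory in HDFS
    hdfs dfs -put marks.csv /user/student       # copy a local file into HDFS
    hdfs dfs -ls /user/student                  # list directory contents
    hdfs dfs -cat /user/student/marks.csv       # print a file stored in HDFS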
YARN:
YARN (Yet Another Resource Negotiator), as the name implies, helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components, i.e.,
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine, and later acknowledge the Resource Manager. The Application Manager works as an interface between the Resource Manager and Node Managers and performs negotiations as per the requirements of the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce makes it possible to carry over the processing logic and helps to write applications which transform big data sets into a manageable one.
MapReduce makes use of two functions, i.e., Map() and Reduce(), whose tasks are (see the sketch below):
1. Map() performs sorting and filtering of data, thereby organizing it in the form of groups. Map generates a key-value-pair-based result which is later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
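The following is a minimal pure-Python sketch of this Map()/shuffle/Reduce() data flow, using word counting as the task. It runs entirely outside Hadoop and is only meant to show how the two functions fit together; the input lines are made up.

    from collections import defaultdict

    def map_fn(line):
        # Map(): filter/organize one input record into (key, value) pairs
        return [(word, 1) for word in line.split()]

    def reduce_fn(key, values):
        # Reduce(): aggregate all values sharing a key into one smaller tuple
        return (key, sum(values))

    lines = ["big data is big", "data is data"]

    # Map phase: every input record becomes a list of key-value pairs
    mapped = [pair for line in lines for pair in map_fn(line)]

    # Shuffle-and-sort phase: group values by key (Hadoop does this in between)
    groups = defaultdict(list)
    for key, value in sorted(mapped):
        groups[key].append(value)

    # Reduce phase: one aggregated tuple per key
    print([reduce_fn(k, v) for k, v in groups.items()])
    # [('big', 2), ('data', 3), ('is', 2)]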
PIG:
Pig was basically developed by Yahoo and works on Pig Latin, a query-based language similar to SQL.
It is a platform for structuring the data flow, processing and analyzing huge data sets.
Pig does the work of executing commands, and in the background, all the activities of MapReduce are taken care of. After processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework and runs on Pig Runtime, just the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the
Hadoop Ecosystem.
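For illustration, a small word-count script in Pig Latin might look like the following; the input and output paths are hypothetical.

    -- load raw text, one line per record
    lines  = LOAD '/data/input.txt' AS (line:chararray);
    -- split every line into words, one word per record
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    -- group identical words and count each group
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
    STORE counts INTO '/data/wordcount-out';

Behind the scenes, Pig compiles this script into MapReduce jobs, as described above.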
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
It is highly scalable, as it allows both real-time and batch processing. Also, all the SQL data types are supported by Hive, making query processing easier.
Similar to the Query Processing frameworks, HIVE too comes with two components: JDBC
Drivers and HIVE Command Line.
JDBC, along with ODBC drivers, works on establishing data storage permissions and connections, whereas the HIVE Command Line helps in the processing of queries.
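A small illustrative HQL session is shown below; the table, columns, and path are hypothetical.

    -- define a table over comma-delimited text files
    CREATE TABLE students (roll_no INT, name STRING, marks INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- load a file that already resides in HDFS into the table
    LOAD DATA INPATH '/data/students.csv' INTO TABLE students;

    -- ordinary SQL-style analytics; Hive compiles the query into jobs
    SELECT name, AVG(marks) AS avg_marks
    FROM students
    GROUP BY name;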
Mahout:
Mahout allows machine learnability to a system or application. Machine learning, as the name suggests, helps a system to develop itself based on patterns, user/environment interaction, or algorithms.
It provides various libraries and functionalities, such as collaborative filtering, clustering, and classification, which are concepts of machine learning. It allows invoking algorithms as per our need with the help of its own libraries.
Apache Spark:
It's a platform that handles all the process-consumptive tasks like batch processing, interactive or iterative real-time processing, graph conversions, and visualization.
It consumes in-memory resources, thus being faster than the prior (MapReduce) in terms of optimization.
Spark is best suited for real-time data, whereas Hadoop is best suited for structured data or batch processing; hence both are used interchangeably in most companies.
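A minimal PySpark sketch of the same word-count idea, assuming a working Spark installation; the input path is hypothetical.

    from pyspark.sql import SparkSession

    # The SparkSession is the entry point to the Spark API.
    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    # Classic RDD word count: Spark keeps intermediate data in memory,
    # which is what makes iterative processing faster than plain MapReduce.
    counts = (spark.sparkContext.textFile("hdfs:///data/input.txt")
              .flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

    print(counts.take(5))
    spark.stop()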
Apache HBase:
It's a NoSQL database which supports all kinds of data and is thus capable of handling anything within a Hadoop database. It provides the capabilities of Google's BigTable, and is thus able to work on big data sets effectively.
At times when we need to search for or retrieve a few occurrences of something small in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a tolerant way of storing limited data.
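A few illustrative HBase shell commands; the table name, column family, and values are hypothetical.

    create 'students', 'info'                    # table with one column family
    put 'students', 'row1', 'info:name', 'Asha'  # write a cell
    put 'students', 'row1', 'info:marks', '91'
    get 'students', 'row1'                       # fetch a single row by key
    scan 'students'                              # iterate over the whole table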