
BASIC OF BIG DATA ANALYTICS

UNIT-1
Data:

Data is a collection of details in the form of figures, text, symbols, descriptions, etc.
Data contains raw figures and facts. Information, unlike data, provides insights obtained by analysing the collected data.

Data has 3 characteristics:

1. Composition: The composition of data deals with the structure of data, i.e. the sources of data, the granularity, the types, and the nature of data (whether it is static or real-time streaming).
2. Condition: The condition of data deals with the state of data, i.e. “Can one use this data as is for analysis?” or “Does it require cleansing for further enhancement and enrichment?”
3. Context: The context of data deals with “Where has this data been generated?”, “Why was this data generated?”, “How sensitive is this data?”, “What are the events associated with this data?”

Digital data and different types of digital data


Data that is stored in machine-readable form and can be interpreted by various technologies is called digital data.
E.g. audio, video or text information.

Digital Data is classified into three types:


1. Structured Data:
This is data which is in an organized form, for example in rows and columns.
The number of rows is called the cardinality of a relation and the number of columns is called its degree (a small sketch follows below).
Sources: databases, spreadsheets, OLTP systems.
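To make cardinality and degree concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the table, column names and values are invented purely for illustration.

import sqlite3

# In-memory database purely for illustration; the table and values are made up.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A structured relation: every row follows the same fixed schema.
cur.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT, marks REAL)")
cur.executemany("INSERT INTO student VALUES (?, ?, ?)",
                [(1, "Asha", 81.5), (2, "Ravi", 74.0), (3, "Meena", 90.0)])

cardinality = cur.execute("SELECT COUNT(*) FROM student").fetchone()[0]  # number of rows
degree = len(cur.execute("SELECT * FROM student LIMIT 1").description)   # number of columns

print("Cardinality (rows):", cardinality)   # 3
print("Degree (columns):", degree)          # 3
conn.close()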

Working with Structured data:


- Storage: Defined and user-defined data types help with the storage of structured data.
- Update/delete: Updating, deleting, etc. are easy due to the structured form.
- Security: Security can be provided easily in an RDBMS.
- Indexing/Searching: Data can be indexed not only on a text string but on other attributes as well. This enables streamlined search.
- Scalability (horizontal/vertical): Scalability is generally not an issue with increasing data, as resources can be added easily.
- Transaction processing: ACID properties (Atomicity, Consistency, Isolation, Durability) are supported.

Fig. Sample representation of types of digital data

2. Semi-Structured Data:

This is data which does not conform to a formal data model but has some structure.
Metadata for this data is available but is not sufficient.
Sources: XML, JSON, E-mail

Characteristics (a small JSON sketch follows below):
- Inconsistent structure
- Self-describing (label/value pairs)
- Schema information is blended with the data values
- Data objects may have different attributes that are not known beforehand
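As an illustration of these characteristics, the following sketch parses two invented JSON records: each record is self-describing through its label/value pairs, and the second record carries attributes that the first one does not.

import json

# Two records from the "same" feed: self-describing label/value pairs,
# but the second object carries attributes the first one does not.
raw = '''
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phone": "+91-9000000000", "address": {"city": "Hyderabad"}}
]
'''

records = json.loads(raw)
for rec in records:
    # The schema is blended with the data: we discover the keys at read time.
    print(rec["id"], sorted(rec.keys()))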

Challenges:
 Storage cost: Storing data with their schemas increases cost
 RDBMS: Semi-structured data cannot be stored in existing RDBMS as
data cannot be mapped into tables directly
 Irregular and partial structure: Some data elements may have extra information while others have none at all
 Implicit structure: In many cases the structure is implicit
 Interpreting relationships and correlations is very difficult
 Flat files: Semi-structured data is usually stored in flat files, which are difficult to index and search
 Heterogeneous sources: Data comes from varied sources, which makes it difficult to tag and search

3. Unstructured Data:

 This is the data which does not conform to a data model or is not in a
form which can be used easily by a computer program.
 About 80–90% of an organization’s data is in this format.
 Sources: memos, chat rooms, PowerPoint presentations, images, videos, letters, research papers, white papers, the body of an email, etc.

Characteristics:
 Does not conform to any data model
 Cannot be stored in the form of rows and columns
 Not in any particular format or sequence
 Not easily usable by a program
 Does not follow any rules or semantics
Challenges:
 Storage space: The sheer volume of unstructured data and its unprecedented growth make it difficult to store. Audio, video, images, etc. require huge amounts of storage space
 Scalability: Scalability becomes an issue as unstructured data grows
 Retrieving information: Retrieving and recovering unstructured data is cumbersome
 Security: Ensuring security is difficult due to the varied sources of data (e.g. e-mail, web pages)
 Update/delete: Updating, deleting, etc. are not easy due to the unstructured form
 Indexing and searching: Indexing becomes difficult as data grows, and searching is difficult for non-text data
 Interpretation: Unstructured data is not easily interpreted by conventional search algorithms
 Tags: As the data grows it is not possible to put tags manually
 Indexing: Designing algorithms that understand the meaning of a document and then tag or index it accordingly is difficult.

Big Data:
To get a handle on the challenges of big data, you first need to know what the term "Big Data" means. When we hear "Big Data," we might wonder how it differs from the more common "data." The term "data" refers to any unprocessed character or symbol that can be recorded on media or transmitted via electronic signals by a computer. Raw data, however, is useless until it is processed in some way.

The Five ‘V’s of Big Data:


Big Data is simply a catchall term used to describe data too large and complex to store in traditional
databases. The “five ‘V’s” of Big Data are:

 Volume - The amount of data generated
 Velocity - The speed at which data is generated, collected and analyzed
 Variety - The different types of structured, semi-structured and unstructured data
 Value - The ability to turn data into useful insights
 Veracity - Trustworthiness in terms of quality and accuracy

Big Data Case Study


As the number of Internet users grew throughout the last decade, Google was challenged with how to
store so much user data on its traditional servers. With thousands of search queries raised every second,
the retrieval process was consuming hundreds of megabytes and billions of CPU cycles. Google needed
an extensive, distributed, highly fault-tolerant file system to store and process the queries. In response,
Google developed the Google File System (GFS).

Google File System (GFS):


GFS architecture consists of one master and multiple chunk servers or slave machines. The master
machine contains metadata, and the chunk servers/slave machines store data in a distributed fashion.
Whenever a client on an API wants to read the data, the client contacts the master, which then responds
with the metadata information. The client uses this metadata information to send a read/write request to
the slave machines to generate a response.

The files are divided into fixed-size chunks and distributed across the chunk servers or slave machines.
Features of the chunk servers include:

 Each chunk holds 64 MB of data (the equivalent HDFS block size is 128 MB from Hadoop version 2 onwards)
 By default, each chunk is replicated three times across multiple chunk servers
 If any chunk server crashes, the data file is still available from the other chunk servers

 GFS is a file system designed to handle batch workloads with lots of data. The system is
distributed: multiple machines store copies of every file, and multiple machines try to
read/write the same file. GFS was originally designed for Google’s use case of searching and
indexing the web.

Components of GFS
A group of computers makes up GFS. A cluster is just a group of connected computers. There could be
hundreds or even thousands of computers in each cluster. There are three basic entities included in any
GFS cluster as follows:

 GFS Clients: They can be computer programs or applications which may be used to request files.
Requests may be made to access and modify already-existing files or add new files to the system.
 GFS Master Server: It serves as the cluster’s coordinator. It preserves a record of the cluster’s actions in an operation log. Additionally, it keeps track of metadata, i.e. the data that describes chunks. The metadata tells the master server which files the chunks belong to and where they fit within the overall file.
 GFS Chunk Servers: They are the GFS’s workhorses. They store file chunks of 64 MB in size. The chunk servers do not send chunks to the master server; instead, they deliver the requested chunks directly to the client. To ensure reliability, GFS makes multiple copies of each chunk and stores them on different chunk servers; the default is three copies. Each copy is referred to as a replica.

Overall Design
The system consists of a single master machine and many chunkservers. Appropriately named, these
chunkservers store parts of files called chunks (64MB data blocks), which are identified by a 64-bit
chunk handle.

The following diagram shows how the client interacts with the master and other chunkservers.
The 4 steps illustrate a simple read operation. In step 1, the client asks the master for metadata: on which replicas, and at which locations, is my data stored? The metadata is returned in step 2; in step 3 the client asks one of the replicas (usually the closest one) for the actual data, which is transferred in step 4.
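The following is a highly simplified Python sketch of that read path, not Google’s actual API: the client turns a byte offset into a chunk index, asks the master only for metadata, and then fetches the bytes from a chunk server. All class, variable and file names here are hypothetical.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

class Master:
    """Holds only metadata: (file name, chunk index) -> (chunk handle, replica locations)."""
    def __init__(self, chunk_table):
        self.chunk_table = chunk_table

    def lookup(self, filename, chunk_index):            # steps 1-2: metadata only, no file data
        return self.chunk_table[(filename, chunk_index)]

class ChunkServer:
    def __init__(self, name, chunks):
        self.name, self.chunks = name, chunks

    def read(self, handle, offset_in_chunk, length):    # steps 3-4: actual bytes come from here
        return self.chunks[handle][offset_in_chunk:offset_in_chunk + length]

def gfs_read(master, servers, filename, offset, length):
    chunk_index = offset // CHUNK_SIZE                   # client maps byte offset -> chunk index
    handle, replicas = master.lookup(filename, chunk_index)
    closest = replicas[0]                                # pretend the first replica is the closest
    return servers[closest].read(handle, offset % CHUNK_SIZE, length)

# Tiny demo with fake data
srv = {"cs1": ChunkServer("cs1", {"h42": b"hello, gfs!"})}
m = Master({("/logs/a.txt", 0): ("h42", ["cs1"])})
print(gfs_read(m, srv, "/logs/a.txt", 0, 5))             # b'hello'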

Every client has to interact with the master. If the master started forwarding data to clients, bandwidth
would be used up quickly, and the system couldn’t serve many requests. By returning metadata,
interactions with the master are quick, while interactions with chunkservers are longer (which is fine,
since clients are probably talking to different chunkservers).

Writing Data in GFS:


Each piece of data must be replicated a configurable number of times (by default, 3) for availability. The
diagram below shows what happens when a client writes data.

1. The client asks the master to write data.
2. The master responds with replica locations where the client can write.
3. The client finds the closest replica and starts forwarding data. (When a replica starts receiving
data, it immediately forwards it to the next closest replica to speed up the process. This
continues until all replicas have the data.)
4. The client asks the primary replica to commit the data.
5. The primary commits and asks the secondaries to do the same. If multiple changes happen concurrently, it also specifies the order in which to commit them.
6. The secondary replicas respond to the primary.
7. The primary responds to the client and reports any errors.
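A toy Python sketch of this write pipeline is given below (hypothetical names, not the real GFS protocol): data is pushed along a chain of replicas, and the primary decides the commit order that all secondaries then follow.

class Replica:
    def __init__(self, name):
        self.name = name
        self.buffer = []        # data pushed but not yet committed
        self.log = []           # committed mutations, in order

    def push(self, data, chain):
        """Step 3: buffer the data, then forward it to the next replica in the chain."""
        self.buffer.append(data)
        if chain:
            chain[0].push(data, chain[1:])

    def commit(self, order):
        """Steps 5-6: apply buffered mutations in the order chosen by the primary."""
        for i in order:
            self.log.append(self.buffer[i])
        self.buffer.clear()

def gfs_write(client_data, primary, secondaries):
    # Step 3: the client pushes data to the closest replica; it is forwarded along the chain.
    chain = [primary] + secondaries
    chain[0].push(client_data, chain[1:])
    # Steps 4-5: the client asks the primary to commit; the primary picks the order
    # and tells the secondaries to commit in the same order.
    order = list(range(len(primary.buffer)))
    primary.commit(order)
    for s in secondaries:
        s.commit(order)
    return "ok"                 # step 7: primary reports success (error handling omitted)

primary, s1, s2 = Replica("primary"), Replica("s1"), Replica("s2")
gfs_write(b"append this record", primary, [s1, s2])
print(primary.log == s1.log == s2.log)   # True: all replicas applied the same mutations in order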

Another valid question is: how are the replicas chosen?

Replica placement has two objectives: maximize reliability/availability and maximize network utilization. For reliability, data is replicated not only across machines but also across server racks. Other criteria, such as equal disk-space utilization, try to ensure that no machine or rack is overloaded with requests. When these metrics get too skewed (e.g. one server rack is much more heavily utilized than the others), the master performs rebalancing, which includes reshuffling chunks across machines and racks.
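As a rough illustration of rack-aware placement, the sketch below picks replica locations by spreading copies across distinct racks and preferring servers with more free disk space. The selection policy, server records and field names are all invented; this is not GFS’s actual algorithm.

def choose_replicas(servers, copies=3):
    """servers: list of dicts like {"name": "cs1", "rack": "r1", "free_gb": 420}.
    Spread copies across distinct racks first, preferring servers with the most free space."""
    ranked = sorted(servers, key=lambda s: -s["free_gb"])
    chosen, used_racks = [], set()
    # First pass: at most one replica per rack.
    for s in ranked:
        if len(chosen) == copies:
            break
        if s["rack"] not in used_racks:
            chosen.append(s)
            used_racks.add(s["rack"])
    # Second pass: if there are fewer racks than copies, fill up with remaining servers.
    for s in ranked:
        if len(chosen) == copies:
            break
        if s not in chosen:
            chosen.append(s)
    return [s["name"] for s in chosen]

print(choose_replicas([
    {"name": "cs1", "rack": "r1", "free_gb": 500},
    {"name": "cs2", "rack": "r1", "free_gb": 800},
    {"name": "cs3", "rack": "r2", "free_gb": 300},
    {"name": "cs4", "rack": "r3", "free_gb": 650},
]))  # ['cs2', 'cs4', 'cs3'] -- three racks, biggest free space first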

Features of GFS

 Namespace management and locking.
 Fault tolerance.
 Reduced client and master interaction because of the large chunk size.
 High availability.
 Critical data replication.
 Automatic and efficient data recovery.
 High aggregate throughput.

Advantages of GFS

1. High availability: Data is still accessible even if a few nodes fail, thanks to replication. As the saying goes, component failures are the norm rather than the exception.
2. High throughput: Many nodes operate concurrently.
3. Reliable storage: Corrupted data can be detected and re-replicated from healthy copies.

Metadata
The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas. All metadata is kept in the master’s memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master’s local disk and replicated on remote machines. Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.

Chunk Size
Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.
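A small worked example of how a file maps onto 64 MB chunks, and why lazy allocation helps; the 200 MB file size is made up for illustration.

CHUNK_SIZE = 64 * 1024 * 1024            # 64 MB

file_size = 200 * 1024 * 1024            # a hypothetical 200 MB file
num_chunks = -(-file_size // CHUNK_SIZE) # ceiling division -> 4 chunks
last_chunk_bytes = file_size - (num_chunks - 1) * CHUNK_SIZE

print(num_chunks)                        # 4 chunks: 64 + 64 + 64 + 8 MB
print(last_chunk_bytes / (1024 * 1024))  # 8.0 -> with lazy allocation only 8 MB of disk
                                         # is actually used for the last chunk, not 64 MB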

Big Data and characteristics of Big Data

Big Data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Characteristics(V’s):

1.Volume: It refers to the amount of data. The size of the data being handled has grown from bits all the way to yottabytes:
Bits -> Bytes -> KBs -> MBs -> GBs -> TBs -> PBs -> Exabytes -> Zettabytes -> Yottabytes
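As a small illustration of this unit ladder, the helper below converts a raw byte count into a human-readable unit, assuming 1 KB = 1024 bytes; the sample value is arbitrary.

UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes):
    """Walk up the unit ladder, dividing by 1024 at each step."""
    size, unit = float(num_bytes), 0
    while size >= 1024 and unit < len(UNITS) - 1:
        size /= 1024
        unit += 1
    return f"{size:.2f} {UNITS[unit]}"

print(human_readable(3_500_000_000_000))   # ~3.18 TB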

There are different sources of data, such as documents, PDFs, YouTube, a chat conversation on an internet messenger, a customer feedback form on an online retail website, CCTV footage and weather forecasts.

The sources of Big data:


1. Typical internal data sources: data present within an organization’s
firewall.
Data storage: File systems, SQL databases (RDBMSs such as Oracle, MS SQL Server, DB2, MySQL, PostgreSQL, etc.), NoSQL databases (MongoDB, Cassandra, etc.) and so on.
Archives: Archives of scanned documents, paper archives, customer correspondence records, patients’ health records, students’ admission records, students’ assessment records, and so on.
2. External data sources: data residing outside an organization’s Firewall.
Public web: Wikipedia, regulatory and compliance data, weather, census data, etc.
3. Both (internal + external sources)
Sensor data, machine log data, social media, business apps, media and docs.

2.Variety: Variety deals with the wide range of data types and sources of data.
Structured, semi-structured and Unstructured.
Structured data: From traditional transaction processing systems and RDBMS,
etc.
Semi-structured data: For example Hypertext Markup Language (HTML),
eXtensible Markup Language (XML).
Unstructured data: For example, unstructured text documents, audio, video, emails, photos, PDFs, social media, etc.

3.Velocity: It refers to the speed of data processing. We have moved from the days of batch processing to real-time processing.

4.Veracity: Veracity refers to biases, noise and abnormality in data. The key question is “Is all the data that is being stored, mined and analysed meaningful and pertinent to the problem under consideration?”
5.Value: This refers to the value that big data can provide, and it relates directly
to what organizations can do with that collected data. It is often quantified as the
potential social or economic value that the data might create.

Challenges of Big Data:


1. Data today is growing at an exponential rate. Most of the data that we have today has been generated in the last two years. The key question is: will all this data be useful for analysis, and how will we separate knowledge from noise?
2. How to host big data solutions outside the organization, for example in the cloud.
3. Challenges with respect to capture, storage, search, sharing, transfer, analysis, privacy violations and visualization.
4. Scale: The storage of data is becoming a challenge for everyone.
5. Security: The production of more and more data increases security and privacy concerns.

6. Continuous availability: How to provide 24X7 support


7. Data quality: Inconsistent data, duplicates, logic conflicts, and missing data all result
in data quality challenges.

Storage

With vast amounts of data generated daily, the greatest challenge is storage (especially when the data is
in different formats) within legacy systems. Unstructured data cannot be stored in traditional databases.

Processing

Processing big data refers to the reading, transforming, extraction, and formatting of useful information
from raw information. The input and output of information in unified formats continue to present
difficulties.

Security

Security is a big concern for organizations. Non-encrypted information is at risk of theft or damage by
cyber-criminals. Therefore, data security professionals must balance access to data against maintaining
strict security protocols.

Finding and Fixing Data Quality Issues

Many of you are probably dealing with challenges related to poor data quality, but solutions are available. Approaches to fixing data quality problems include:

 Correcting information in the original database.
 Repairing the original data source to resolve any data inaccuracies.
 Using highly accurate methods of determining who someone is.

Scaling Big Data Systems

Database sharding, memory caching, moving to the cloud and separating read-only and write-active
databases are all effective scaling methods. While each one of those approaches is fantastic on its own,
combining them will lead you to the next level.
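As an illustration of one of these scaling methods, here is a minimal sketch of hash-based database sharding; the shard names and routing function are invented, and a production system would add re-sharding and replication on top of this.

import hashlib

SHARDS = ["users_db_0", "users_db_1", "users_db_2", "users_db_3"]

def shard_for(customer_id: str) -> str:
    """Route a record to a shard by hashing its key, so reads and writes spread
    evenly across several smaller databases instead of hitting one large one."""
    digest = hashlib.md5(customer_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer-42"))    # always maps to the same shard
print(shard_for("customer-43"))    # likely a different shard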

Big Data Analytics


Big data analytics describes the process of uncovering trends, patterns, and correlations in large amounts
of raw data to help make data-informed decisions.

These processes use familiar statistical analysis techniques—like clustering and regression—and apply
them to more extensive datasets with the help of newer tools.

Big data has been a buzz word since the early 2000s, when software and hardware capabilities made it
possible for organizations to handle large amounts of unstructured data.

Since then, new technologies—from Amazon to smartphones—have contributed even more to the
substantial amounts of data available to organizations. With the explosion of data, early innovation
projects like Hadoop, Spark, and NoSQL databases were created for the storage and processing of big
data.

This field continues to evolve as data engineers look for ways to integrate the vast amounts of complex
information created by sensors, networks, transactions, smart devices, web usage, and more. Even now,
big data analytics methods are being used with emerging technologies, like machine learning, to
discover and scale more complex insights.

How Big Data Analytics Works


Big data analytics refers to collecting, processing, cleaning, and analyzing large datasets to help
organizations operationalize their big data.

1. Collect Data

Data collection looks different for every organization. With today’s technology, organizations can
gather both structured and unstructured data from a variety of sources — from cloud storage to mobile
applications to in-store IoT sensors and beyond. Some data will be stored in data warehouses where
business intelligence tools and solutions can access it easily. Raw or unstructured data that is too diverse
or complex for a warehouse may be assigned metadata and stored in a data lake.
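Below is a minimal sketch of how raw, schema-less events might land in a data lake, tagged with a little metadata so they can be found later. The directory layout, field names and sample event are assumptions for illustration, not any specific product’s API.

import json, datetime, pathlib

def land_raw_event(event: dict, source: str, lake_root="datalake/raw"):
    """Append a raw event to a date-partitioned file, tagging it with minimal
    metadata so it can be located later even though it has no fixed schema."""
    today = datetime.date.today().isoformat()
    path = pathlib.Path(lake_root) / source / f"{today}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "_source": source,
        "_ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "payload": event,
    }
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")

land_raw_event({"sensor": "store-7-door", "count": 12}, source="iot_sensors")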

2. Process Data

Once data is collected and stored, it must be organized properly to get accurate results on analytical
queries, especially when it’s large and unstructured. Available data is growing exponentially, making
data processing a challenge for organizations.

One processing option is batch processing, which looks at large data blocks over time. Batch
processing is useful when there is a longer turnaround time between collecting and analyzing data.

Stream processing looks at small batches of data at once, shortening the delay time between collection
and analysis for quicker decision-making. Stream processing is more complex and often more expensive.
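The toy comparison below contrasts the two modes in plain Python: the batch function answers only once all data has arrived, while the streaming function yields an updated result after every record. Real systems would use engines such as Spark or a dedicated stream processor; the readings here are invented.

import statistics

readings = [3.1, 2.8, 3.4, 3.0, 2.9, 3.3]   # made-up sensor values

# Batch processing: wait until the whole block of data is collected, then analyze once.
def batch_average(block):
    return statistics.mean(block)

# Stream processing: update the answer as each record arrives, so a result is
# available after every event instead of only at the end of the batch.
def stream_averages(stream):
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count      # running average after each new reading

print(batch_average(readings))           # one result, after all data is in
print(list(stream_averages(readings)))   # a result after every reading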

3. Clean Data

Data big or small requires scrubbing to improve data quality and get stronger results; all data must be
formatted correctly, and any duplicative or irrelevant data must be eliminated or accounted for. Dirty
data can obscure and mislead, creating flawed insights.
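A minimal cleaning sketch using pandas (one possible tool, not the only one): it normalizes formatting, removes a duplicate row, and drops a record missing a key field. The column names and values are made up.

import pandas as pd

# Made-up raw records: inconsistent formatting, one duplicate row, one missing value.
raw = pd.DataFrame({
    "customer": [" Asha ", "RAVI", "Asha", None],
    "amount":   ["100", "250.5", "100", "75"],
})

clean = (
    raw.assign(
        customer=raw["customer"].str.strip().str.title(),     # consistent casing/whitespace
        amount=pd.to_numeric(raw["amount"], errors="coerce"),  # one numeric format
    )
    .drop_duplicates()                                         # remove the duplicate row
    .dropna(subset=["customer"])                               # drop rows missing key fields
)
print(clean)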

4. Analyze Data

Getting big data into a usable state takes time. Once it’s ready, advanced analytics processes can turn big
data into big insights. Some of these big data analysis methods include:

 Data mining sorts through large datasets to identify patterns and relationships by identifying
anomalies and creating data clusters.
 Predictive analytics uses an organization’s historical data to make predictions about the future,
identifying upcoming risks and opportunities.
 Deep learning imitates human learning patterns by using artificial intelligence and machine
learning to layer algorithms and find patterns in the most complex and abstract data.
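The sketch below applies two of the familiar techniques mentioned above, k-means clustering and linear regression, to tiny synthetic data using scikit-learn; in practice these methods run over far larger datasets on distributed platforms, and all numbers here are invented.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# "Data mining": cluster synthetic customer records (spend, visits) into groups.
customers = np.vstack([
    rng.normal([20, 2], 2, size=(50, 2)),     # low spenders
    rng.normal([80, 10], 5, size=(50, 2)),    # high spenders
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
print(np.bincount(labels))                    # roughly 50 customers in each cluster

# "Predictive analytics": fit a simple trend on historical monthly sales
# and predict the next month.
months = np.arange(1, 13).reshape(-1, 1)
sales = 100 + 5 * months.ravel() + rng.normal(0, 3, 12)
model = LinearRegression().fit(months, sales)
print(model.predict([[13]]))                  # forecast for month 13 (around 165)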
Q) Give Real-time applications of Big Data Analytics.

1. Banking and Securities Industry:

 This industry relies heavily on Big Data for risk analytics, including anti-money laundering, enterprise risk management, “Know Your Customer” checks, and fraud mitigation.
 The Securities and Exchange Commission (SEC) is using Big Data to monitor financial market activity. It currently uses network analytics and natural language processing to catch illegal trading activity in the financial markets.

2. Communications, Media and Entertainment Industry: Organizations in this industry simultaneously analyze customer data along with behavioral data to create detailed customer profiles that can be used to:
 Create content for different target audiences
 Recommend content on demand
 Measure content performance
E.g.
 Spotify, an on-demand music service, uses Hadoop Big Data analytics to collect data from its millions of users worldwide and then uses the analyzed data to give informed music recommendations to individual users.
 Amazon Prime, which is driven to provide a great customer experience by offering video, music, and Kindle books in a one-stop shop, also heavily utilizes Big Data.

3. Healthcare Sector:
 Some hospitals, like Beth Israel, are using data collected from a cell phone app from millions of patients to allow doctors to use evidence-based medicine, as opposed to administering several medical/lab tests to all patients who go to the hospital.
 Free public health data and Google Maps have been used by the University of
Florida to create visual data that allows for faster identification and efficient
analysis of healthcare information, used in tracking the spread of chronic
disease.

4. Education:
 The University of Tasmania, an Australian university, has deployed a Learning Management System that tracks, among other things, when a student logs onto the system, how much time is spent on different pages in the system, as well as the overall progress of a student over time.
 On a governmental level, the Office of Educational Technology in the U.S. Department of Education is using Big Data to develop analytics that help course-correct students who are going astray while using online Big Data certification courses. Click patterns are also being used to detect boredom.

5. Government:
 In public services, Big Data has an extensive range of applications,
including energy exploration, financial market analysis, fraud detection,
health-related research, and environmental protection.
 The Food and Drug Administration (FDA) is using Big Data to detect and
study patterns of food-related illnesses and diseases.

6. Insurance Industry:
Big data has been used in the industry to provide customer insights for
transparent and simpler products, by analyzing and predicting customer behavior
through data derived from social media, GPS-enabled devices, and CCTV
footage. Big Data also allows insurance companies to achieve better customer retention.

7. Transportation Industry:
Some applications of Big Data by governments, private organizations, and
individuals include:
 Government use of Big Data: traffic control, route planning, intelligent transport systems, congestion management (by predicting traffic conditions)
 Private-sector use of Big Data in transport: revenue management,
technological enhancements, logistics and for competitive advantage (by
consolidating shipments and optimizing freight movement)

8. Energy and Utility Industry:


Smart meter readers allow data to be collected almost every 15 minutes as
opposed to once a day with the old meter readers. This granular data is
being used to analyze the consumption of utilities better, which allows for
improved customer feedback and better control of utilities use.
