
Module - 1

1. Structured, Semi-Structured and Unstructured Data

What is structured data?


 Structured data is generally tabular data that is represented by columns and rows in a database.
 Databases that hold tables in this form are called relational databases.
 The mathematical term “relation” refers to a set of data organised and held as a table.
 In structured data, every row in a table has the same set of columns.
 SQL (Structured Query Language) is the programming language used to query structured data (see the brief sketch below).
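As a minimal illustration, the sketch below uses Python's built-in sqlite3 module to create a small relational table and query it with SQL; the table and column names are made up for this example.

import sqlite3

# In-memory relational database: the schema (columns and types) is defined up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Every row has the same set of columns.
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Mumbai")],
)

# SQL query over the structured data.
for row in conn.execute("SELECT name FROM customers WHERE city = ?", ("Pune",)):
    print(row)  # ('Asha',)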

2) What is Semi-structured Data


 Semi-structured data is information that does not fit the relational (structured) model but still has some structure to it.
 Semi-structured data includes documents held in JavaScript Object Notation (JSON) format, as well as key-value stores and graph databases. A brief JSON example follows below.
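A minimal sketch of a semi-structured JSON record, parsed with Python's standard json module; the field names are illustrative only.

import json

# A hypothetical JSON document: it has structure (keys and nesting),
# but records are not forced to share one fixed set of columns.
record = json.loads("""
{
  "id": 101,
  "name": "Asha",
  "orders": [{"item": "laptop", "qty": 1}],
  "loyalty_tier": "gold"
}
""")

print(record["name"], len(record["orders"]))  # Asha 1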
3) What is Unstructured Data
 Unstructured data is information that either is not organized in a pre-defined manner or does not have a pre-defined data
model.
 Unstructured information is typically text-heavy, but it may also contain data such as numbers, dates, and facts.
 Videos, audio, and binary data files might not have a specific structure; they are classified as unstructured data.

Approaches to storing and managing data


Schema-on-write and schema-on-read are two different approaches to storing and managing data. Schema-on-write means that the schema, or structure, of the data is defined when the data is written to the database. Schema-on-read means that the schema is defined when the data is read from the database.

Structured data is typically stored using schema-on-write. This is because the schema is known in advance and can be used to
optimize the storage and performance of the data. Unstructured data is typically stored using schema-on-read. This is because the
schema is not known in advance and may need to be changed frequently.

Which approach is best for a particular application depends on the specific needs of the application.
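To contrast with the sqlite3 sketch above (schema-on-write, where the CREATE TABLE statement fixes the structure before any insert), here is a minimal schema-on-read sketch: raw records are stored as-is, and a structure is imposed only at read time. The fields are invented for illustration.

import json

# Raw records stored as-is: nothing was validated or structured at write time.
raw_lines = [
    '{"name": "Asha", "city": "Pune"}',
    '{"name": "Ravi", "city": "Mumbai", "age": 30}',
]

# The "schema" is chosen at read time: pick the fields this analysis needs,
# filling in a default when a record does not have them.
for line in raw_lines:
    record = json.loads(line)
    print(record.get("name"), record.get("age", "unknown"))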

Structured Data vs Unstructured Data vs Semi-Structured:

Structured data is stored in a predefined format and is highly specific; unstructured data is a collection of many varied data types stored in their native formats; and semi-structured data does not follow the tabular data structure models associated with relational databases or other data table forms.

Pros and Cons of Structured Data

Pros:
 Requires less processing in comparison to unstructured data and is easier to manage.
 Machine algorithms can easily crawl and use structured data, which simplifies querying.
 As an older format of data, there are several tools available for structured data that simplify usage, management, and analysis.

Cons:
 Limited usability because of its pre-defined structure/format.
 Structured data is stored in data warehouses, which are built for space saving but are difficult to change and not very scalable/flexible.
Pros and Cons of Unstructured Data

Pros:
 A variety of native formats facilitates a greater number of use cases and applications.
 As there is no need to predefine data, unstructured data is collected quickly and easily.
 Unstructured data is stored in on-premises or cloud data lakes, which are highly scalable.
 Although challenging, the greater volume of unstructured data provides better insights and more opportunities to turn your data into a competitive advantage.

Cons:
 The greater number of formats makes it equally challenging to analyze and leverage unstructured data.
 The large volume and undefined formats make data management a challenge and specialized tools a necessity.

2. Characteristics of Big Data

Big Data refers to amounts of data so large that they cannot be stored or processed by traditional data storage and processing systems.

Volume

Volume refers to the huge amounts of data that are collected and generated every second in large organizations. This data is
generated from different sources such as IoT devices, social media, videos, financial transactions, and customer logs.

Storing and processing this huge amount of data was a problem earlier. But now distributed systems such as Hadoop are used
for organizing data collected from all these sources. The size of the data is crucial for understanding its value. Also, the volume
is useful in determining whether a collection of data is Big Data or not.

Data volume can vary. For example, a text file is a few kilobytes whereas a video file is a few megabytes. In fact, Facebook (from Meta) alone can produce an enormous amount of data in a single day; billions of messages, likes, and posts each day contribute to generating such huge volumes of data.

Variety

Another one of the most important Big Data characteristics is its variety. It refers to the different sources of data and their
nature. The sources of data have changed over the years. Earlier, it was only available in spreadsheets and databases.
Nowadays, data is present in photos, audio files, videos, text files, and PDFs.

The variety of data is crucial for its storage and analysis.

Data variety can be classified into three distinct parts:

1. Structured data
2. Semi-Structured data
3. Unstructured data
Veracity

This feature of Big Data is connected to the previous one. It defines the degree of trustworthiness of the data. As most of the
data you encounter is unstructured, it is important to filter out the unnecessary information and use the rest for processing.

Veracity is one of the characteristics of big data analytics that denotes data inconsistency as well as data uncertainty.

As an example, a huge amount of data can create a lot of confusion, whereas too little data conveys inadequate information.

Value

Among the characteristics of Big Data, value is perhaps the most important. No matter how fast the data is produced or how much of it there is, it has to be reliable and useful. Otherwise, the data is not good enough for processing or analysis. Research says that
poor quality data can lead to almost a 20% loss in a company’s revenue.

Data scientists first convert raw data into information. Then this data set is cleaned to retrieve the most useful data. Analysis
and pattern identification is done on this data set. If the process is a success, the data can be considered to be valuable.

Velocity

This term refers to the speed at which the data is created or generated. The speed of data production also determines how fast the data must be processed, because only after analysis and processing can the data meet the demands of the clients/users.

Massive amounts of data are produced from sensors, social media sites, and application logs – and all of it is continuous. If the
data flow is not continuous, there is no point in investing time or effort on it.

As an example, people generate more than 3.5 billion searches on Google per day.

Big Data Analytics

Big Data Analytics is a powerful tool which helps to unlock the potential of large and complex datasets. Big Data Analytics is about
collecting, cleaning, processing, and analyzing data to uncover valuable insights. It’s a multi-step process that transforms raw
data into fruitful insights.

To get a better understanding, let's break it down into key steps (a small end-to-end sketch follows the list):

 Data Collection: Data is the heart of Big Data Analytics. It is the process of collecting data from various sources, which can include customer reviews, surveys, sensors, social media etc. The main goal of data collection is to gather as much relevant data as possible. The more data, the richer the insights.
 Data Cleaning (Data Preprocessing): Once we have the data, it often needs some cleaning. This process involves identifying
and dealing with missing values, correcting errors, and removing duplicates.
 Data Processing: Next, we need to process the data. This involves different steps like organizing, structuring, and formatting it
in a way that makes it appropriate for analysis.
 Data Analysis: Data analysis is performed using various statistical, mathematical, and machine learning techniques to extract
valuable insights from the processed data. For instance, it can reveal customer preferences, market trends, or patterns in
healthcare data.
 Data Visualization: Data analysis results are often presented in the form of visualizations – charts, graphs, and interactive
dashboards. These visual representations make complex data easy to understand and enable decision-makers to see trends
and patterns at a glance.
 Data Storage and Management: Storing and managing the analyzed data is crucial. It’s like archiving your findings. You may
want to revisit the insights in the future, and well-organized storage is essential for that. Additionally, it’s important to
ensure data security and compliance with regulations during this critical step.
 Continuous Learning and Improvement: Big Data Analytics isn’t a one-time affair. It’s an ongoing process. As you collect and
analyze more data, you learn more about your operations or customers. This insight can lead to refining your data collection
methods and analysis techniques for better results.
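The sketch below walks through a toy version of the collect → clean → analyze → visualize → store flow using pandas. It is only an illustration under assumed data; the column names, values, and output file are invented, and the chart step assumes matplotlib is installed.

import pandas as pd

# Data collection (illustrative): ratings gathered from hypothetical sources.
reviews = pd.DataFrame({
    "customer": ["A", "B", "B", "C", None],
    "rating":   [5, 4, 4, None, 3],
})

# Data cleaning: remove duplicates and rows with missing values.
clean = reviews.drop_duplicates().dropna()

# Data processing / analysis: average rating per customer.
summary = clean.groupby("customer")["rating"].mean()

# Data visualization: a simple bar chart (needs matplotlib installed).
summary.plot(kind="bar", title="Average rating per customer")

# Data storage: persist the analyzed result for later use.
summary.to_csv("rating_summary.csv")
print(summary)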

Types of Big Data Analytics

Big Data Analytics comes in many different types, each serving a different purpose:

Descriptive Analytics: This type helps us understand past events. In social media, it shows performance metrics, like the number
of likes on a post.
Diagnostic Analytics: Diagnostic analytics delves deeper to uncover the reasons behind past events. In healthcare, it identifies the causes of high patient re-admissions.
Predictive Analytics: Predictive analytics forecasts future events based on past data. Weather forecasting, for example, predicts
tomorrow’s weather by analyzing historical patterns.
Prescriptive Analytics: This type not only predicts outcomes but also suggests actions to optimize them. In e-commerce, it might
recommend the best price for a product to maximize profits.
Real-time Analytics: Real-time analytics processes data instantly. In stock trading, it helps traders make quick decisions based on
current market conditions.
Spatial Analytics: Spatial analytics focuses on location data. For city planning, it optimizes traffic flow using data from sensors and
cameras to reduce congestion.
Text Analytics: Text analytics extracts insights from unstructured text data. In the hotel industry, it can analyze guest reviews to
improve services and guest satisfaction.

Big Data Analytics Technologies and Tools

Big Data Analytics relies on various technologies and tools that might sound complex; let's simplify them (a short Spark sketch follows the list):
 Hadoop: Imagine Hadoop as an enormous digital warehouse. It’s used by companies like Amazon to store tons of data
efficiently. For instance, when Amazon suggests products you might like, it’s because Hadoop helps manage your shopping
history.
 Spark: Think of Spark as the super-fast data chef. Netflix uses it to quickly analyze what you watch and recommend your next
binge-worthy show.
 NoSQL Databases: NoSQL databases, like MongoDB, are like digital filing cabinets that Airbnb uses to store your booking details and user data. These databases are popular because they are quick and flexible, so the platform can provide you with the right information when you need it.
 Tableau: Tableau is like an artist that turns data into beautiful pictures. The World Bank uses it to create interactive charts
and graphs that help people understand complex economic data.
 Python and R: Python and R are like magic tools for data scientists. They use these languages to solve tricky problems. For
example, Kaggle uses them to predict things like house prices based on past data.
 Machine Learning Frameworks (e.g., TensorFlow): Machine learning frameworks are the tools that make predictions. Airbnb uses TensorFlow to predict which properties are most likely to be booked in certain areas. It helps hosts make smart decisions about pricing and availability.
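As a hedged illustration of the Spark item above, here is a minimal PySpark sketch. It assumes pyspark is installed and runs locally; the viewing events are made up, standing in for data that would normally be read from files or streams.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (assumes the pyspark package is installed).
spark = SparkSession.builder.appName("viewing-history").getOrCreate()

# Hypothetical viewing events; in practice these would come from files or streams.
events = spark.createDataFrame(
    [("u1", "drama"), ("u1", "drama"), ("u2", "comedy")],
    ["user", "genre"],
)

# Count watched genres per user - the kind of aggregation a recommender starts from.
counts = events.groupBy("user", "genre").agg(F.count("*").alias("views"))
counts.show()

spark.stop()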

Challenges in Big Data Analytics

 Data Overload: Consider Twitter, where approximately 6,000 tweets are posted every second. The challenge is sifting through
this avalanche of data to find valuable insights.
 Data Quality: If the input data is inaccurate or incomplete, the insights generated by Big Data Analytics can be flawed. For
example, incorrect sensor readings could lead to wrong conclusions in weather forecasting.
 Privacy Concerns: With the vast amount of personal data used, like in Facebook’s ad targeting, there’s a fine line between
providing personalized experiences and infringing on privacy.
 Security Risks: With cyber threats increasing, safeguarding sensitive data becomes crucial. For instance, banks use Big Data
Analytics to detect fraudulent activities, but they must also protect this information from breaches.
 Costs: Implementing and maintaining Big Data Analytics systems can be expensive. Airlines like Delta use analytics to optimize
flight schedules, but they need to ensure that the benefits outweigh the costs.
Hadoop architecture

Hadoop is an open-source software framework that is used for storing and processing large amounts of data in a distributed
computing environment. It is designed to handle big data and is based on the MapReduce programming model, which allows for
the parallel processing of large datasets.
Hadoop is a beneficial technology for data analysts. There are many essential features in Hadoop which make it so important and
user-friendly.
1. The system is able to store and process enormous amounts of data at an extremely fast rate, whether the data set is structured, semi-structured or unstructured.
2. It supports real-time analytics to enhance operational decision-making, as well as batch workloads for historical analysis.
3. Organisations can store the data and filter it for specific analytical uses as needed.
4. A large number of nodes can be added to Hadoop as it is scalable, so organisations will be able to pick up more data.
5. A protection mechanism prevents applications and data processing from being harmed by hardware failures. When a node goes down, work is automatically redirected to other nodes, allowing applications to run without interruption.
OR
Key features

Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for the storage and processing of extremely
large amounts of data.
Scalability: Hadoop can scale from a single server to thousands of machines, making it easy to add more capacity as needed.
Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to operate even in the presence of
hardware failures.
Data locality: Hadoop stores data on the same node where it will be processed, which helps to reduce network traffic and improve performance.
High Availability: Hadoop provides a high-availability feature, which helps to make sure that the data is always available and is not lost.
Flexible Data Processing: Hadoop's MapReduce programming model allows for the processing of data in a distributed fashion, making it easy to implement a wide variety of data processing tasks.
Data Integrity: Hadoop provides a built-in checksum feature, which helps to ensure that the data stored is consistent and correct.
Data Replication: Hadoop provides a data replication feature, which replicates data across the cluster for fault tolerance.
Data Compression: Hadoop provides built-in data compression, which helps to reduce storage space and improve performance.
YARN: A resource management platform that allows multiple data processing engines like real-time streaming, batch processing,
and interactive SQL, to run and process data stored in HDFS.

Components of Hadoop

Hadoop is a framework that uses distributed storage and parallel processing to store and manage Big Data. It is the most
commonly used software to handle Big Data. There are three components of Hadoop.

Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.

Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop.


Hadoop YARN - Hadoop YARN is a resource management unit of Hadoop.

Hadoop HDFS

Data is stored in a distributed manner in HDFS. There are two components of HDFS - name node and data node. While there is
only one name node, there can be multiple data nodes.

HDFS is specially designed for storing huge datasets on commodity hardware. An enterprise version of a server costs roughly $10,000 per terabyte (with a full processor). If you needed to buy 100 of these enterprise servers, the cost would go up to a million dollars.

Hadoop enables you to use commodity machines as your data nodes. This way, you don’t have to spend millions of dollars just on
your data nodes. However, the name node is always an enterprise server.

Features of HDFS

Provides distributed storage

Can be implemented on commodity hardware

Provides data security

Highly fault-tolerant - If one machine goes down, the data from that machine goes to the next machine

Master and Slave Nodes

Master and slave nodes form the HDFS cluster. The name node is called the master, and the data nodes are called the slaves.

The name node is responsible for the workings of the data nodes. It also stores the metadata.

The data nodes read, write, process, and replicate the data. They also send signals, known as heartbeats, to the name node. These
heartbeats show the status of the data node.
Consider that 30TB of data is loaded into the name node. The name node distributes it across the data nodes, and this data is replicated among the data nodes. For example, the blue, grey, and red blocks of a file would each be replicated across three data nodes.

Replication of the data is performed three times by default. It is done this way, so if a commodity machine fails, you can replace it
with a new machine that has the same data.
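The toy sketch below mimics this three-way block placement in plain Python; it is a conceptual illustration only, not the HDFS API, and the node and block names are made up.

from itertools import cycle

# Hypothetical data nodes and file blocks (conceptual only, not the HDFS API).
data_nodes = ["node1", "node2", "node3", "node4"]
blocks = ["block-A", "block-B", "block-C"]
REPLICATION = 3  # HDFS replicates each block three times by default

# Place each block on three nodes, round-robin style.
placement = {}
nodes = cycle(data_nodes)
for block in blocks:
    placement[block] = [next(nodes) for _ in range(REPLICATION)]

print(placement)
# If one commodity machine fails, every block still exists on two other nodes.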

Hadoop MapReduce

Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is done at the slave nodes, and
the final result is sent to the master node.

The input dataset is first split into chunks of data. In this example, the input has three lines of text with three separate entities -
“bus car train,” “ship ship train,” “bus ship car.” The dataset is then split into three chunks, based on these entities, and processed
parallelly.

In the map phase, each word is emitted as a key with a value of 1, so every occurrence of bus, car, ship, and train produces a (word, 1) pair.

These key-value pairs are then shuffled and sorted together based on their keys. At the reduce phase, the aggregation takes place,
and the final output is obtained.
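The sketch below reproduces the map, shuffle/sort, and reduce phases for this word-count example on a single machine in plain Python. It is only a conceptual sketch; a real Hadoop job would distribute these phases across slave nodes, typically via Java Mapper/Reducer classes or Hadoop Streaming.

from collections import defaultdict

# The three input chunks from the example above.
lines = ["bus car train", "ship ship train", "bus ship car"]

# Map phase: emit (word, 1) for every word in every chunk.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and sort phase: group the values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate the values for each key.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)  # {'bus': 2, 'car': 2, 'train': 2, 'ship': 3}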

Hadoop YARN

Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit of Hadoop

Hadoop YARN acts like an operating system for Hadoop. It sits on top of HDFS and manages the cluster's resources and the applications that run over the stored data.

It is responsible for managing cluster resources to make sure you don't overload one machine.

It performs job scheduling to make sure that jobs are scheduled in the right place.
Suppose a client machine wants to do a query or fetch some code for data analysis. This job request goes to the resource manager
(Hadoop Yarn), which is responsible for resource allocation and management.

In the node section, each node has its own node manager. These node managers manage the nodes and monitor the resource usage in the node. The containers hold a collection of physical resources, such as RAM, CPU, or hard drives. Whenever a job request comes in, the application master requests a container from the node manager. Once the node manager has allocated the resources, it reports back to the Resource Manager.

CAP theorem

The CAP theorem, often referred to as Brewer's theorem after its creator, Eric Brewer, is a fundamental concept in the world of
distributed systems, and its implications are especially pertinent in Big Data. It articulates the inherent trade-offs that distributed
databases and systems must navigate among three key properties, which are:

Consistency

Availability

Partition Tolerance

Consistency
Consistency, the first element of the CAP theorem, signifies that every read operation in a distributed system will return the most
recent write or an error. In other words, all nodes within the system exhibit the same data value at any given time. Achieving
strong consistency is crucial in applications where data accuracy is paramount, such as financial transactions or healthcare
records.

Availability
The second property, Availability, indicates that every request, whether it's a read or write operation, receives a response, and
that response is not an error. In essence, the system is always operational and responsive to client requests. High availability is
essential for systems that cannot tolerate downtime, like e-commerce platforms or real-time analytics.

Partition tolerance
Partition tolerance relates to the system's ability to function reliably despite network partitions or communication breakdowns.
Network partitions can occur due to factors like hardware failures, congestion, or geographical distribution, leading to nodes
being unable to communicate with one another. A partition-tolerant system will still continue to operate even under such challenging network conditions.

The CAP theorem posits that, in a distributed system, you can't simultaneously achieve all three properties. Instead, you must prioritise two out of the three, and the choice of which two significantly impacts the system's behaviour (a small sketch after the list below illustrates the trade-off):

a) CA or Consistency and Availability: Prioritising both Consistency and Availability means that the system maintains strong data
consistency and high responsiveness but sacrifices Partition Tolerance. It can work well in stable network conditions, but it may
become problematic during network partitions.
b) CP or Consistency and Partition Tolerance: Emphasising Consistency and Partition Tolerance ensures strong data consistency
and the ability to withstand network partitions, but it might result in periods of unavailability during partition events.
c) AP or Availability and Partition Tolerance: Focusing on Availability and Partition Tolerance aims for high system availability
and the ability to operate under network partitions. However, this might come at the cost of relaxing strong consistency, allowing
for temporary data inconsistencies.
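The following toy, single-process sketch illustrates the CP vs AP choice during a partition; it is not a real distributed system, and the class and its behaviour are invented purely to make the trade-off concrete.

class TinyStore:
    """Toy two-replica store illustrating CP vs AP behaviour during a partition."""

    def __init__(self, mode):
        self.mode = mode              # "CP" or "AP"
        self.replicas = [{}, {}]      # two replicas of the same data
        self.partitioned = False      # True when the replicas cannot talk

    def write(self, key, value):
        if self.partitioned and self.mode == "CP":
            # CP: preserve consistency by refusing the write (sacrifice availability).
            raise RuntimeError("unavailable during partition")
        self.replicas[0][key] = value
        if not self.partitioned:
            self.replicas[1][key] = value   # normal replication
        # AP: the write is accepted, but the replicas may temporarily disagree.

cp, ap = TinyStore("CP"), TinyStore("AP")
cp.partitioned = ap.partitioned = True
ap.write("x", 1)                       # accepted; the second replica is now stale
try:
    cp.write("x", 1)
except RuntimeError as err:
    print("CP store:", err)
print("AP replicas:", ap.replicas)     # [{'x': 1}, {}] -> temporary inconsistency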
