
BIG DATA AND ANALYTICS
BIG DATA SIZE

 The size of big data can vary depending on the context and industry. However, big data typically refers to datasets that are too
large, complex, and dynamic to be processed by traditional data processing tools. These datasets can range from terabytes to
petabytes or even exabytes in size. (1 Petabyte (PB)=1,000 Terabytes (TB))
 To give you a sense of scale, consider the following examples:

• Facebook generates around 4 petabytes of new data per day from user posts, comments, and likes.
• The Large Hadron Collider, a particle accelerator used for scientific research, produces around 30 petabytes of data per year.
• The United States Library of Congress has collected over 39 million books, periodicals, and other print materials, which
equates to around 10 terabytes of data.
• Global internet traffic was projected to reach 2.6 zettabytes per year by 2020. (1 Zettabyte (ZB) = 1,000 Exabytes (EB) = 1,000,000 Petabytes (PB))
 As data continues to grow at an exponential rate, the size of big data is likely to increase significantly in the coming years.
BIG DATA CHARACTERISTICS

 The term "big data" refers to datasets that are too large, complex, or rapidly changing to be processed and
analyzed by traditional data management tools. Some of the key characteristics of big data include:
1. Volume: This refers to the sheer amount of data that is generated and collected. With the proliferation of
data-generating devices and platforms, the volume of data has grown exponentially. The challenge is to
store, manage, and process this data efficiently.
2. Velocity: This refers to the speed at which data is generated and needs to be processed. Many types of data
are generated in real-time, such as social media feeds, sensor data, and financial transactions, requiring a
high-speed processing capability.
3. Variety: This refers to the different types and formats of data. Big data comes in structured, semi-structured, and unstructured forms, and can include text, images, video, audio, and other data types.
BIG DATA CHARACTERISTICS (CONTINUED)
4. Veracity: This refers to the quality, accuracy, and reliability of the data. The challenge is to ensure that the data is
trustworthy and accurate, especially when it comes from multiple sources. Big data is often messy and may
contain errors, inconsistencies, or inaccuracies.
5. Value: This refers to the insights and knowledge that can be gained from the data. The challenge is to extract
valuable insights from the massive amount of data and use them to improve decision-making, identify new
opportunities, and gain a competitive edge.
6. Complexity: Big data can be complex, and may require sophisticated tools and techniques to analyze and process.
Machine learning algorithms and artificial intelligence techniques are often used to manage the complexity of big
data.
 These characteristics make big data challenging to work with, but also create opportunities for new insights and
innovation in fields such as machine learning, data science, and artificial intelligence. To fully realize the potential
of big data, organizations need to develop new tools, techniques, and approaches for managing and analyzing
large and complex datasets.
EVOLUTION OF BIG DATA

 The concept of big data has evolved significantly over the years. In the early days of computing, data was
primarily generated and analyzed by businesses and government organizations. However, the rise of the internet
and the proliferation of digital devices has led to an explosion in the volume, variety, and velocity of data being
generated.
 In the 1990s and early 2000s, the term "big data" was used to describe datasets that were too large to be handled
by traditional database management systems. The first attempts to address this challenge involved developing
distributed systems for storing and processing data, such as Apache Hadoop and Google's MapReduce. These
systems allowed organizations to store and process massive amounts of data by distributing it across a cluster of
computers.
 As big data technology matured, new tools and techniques were developed to make it easier to work with and
analyze large datasets. For example, NoSQL databases were developed to handle unstructured and semi-structured
data, and data warehouses were built to enable fast querying and analysis of large volumes of data.
 In recent years, the focus of big data has shifted from simply managing and processing large datasets to using data
to drive insights and decision-making. This has led to the rise of data science and machine learning, which use
statistical and computational methods to extract insights from data and make predictions.
 Today, big data is everywhere, and it is being generated by a wide range of sources, including social media,
sensors, mobile devices, and the internet of things. As the volume and complexity of data continue to grow, it is
likely that new tools and techniques will be developed to help organizations make sense of it all.
EVOLUTION OF BIG DATA
 Big Data phase 1.0
 Data analysis, data analytics and Big Data originate from the longstanding domain of database management. They rely heavily on the storage, extraction, and optimization techniques that are common for data stored in Relational Database Management Systems (RDBMS).
 Database management and data warehousing are considered the core components of Big Data Phase 1. They provide the foundation of modern data analysis as we know it today, using well-known techniques such as database queries, online analytical processing (OLAP) and standard reporting tools.
 Big Data phase 2.0
 In the early 2000s, the Internet and the Web began to offer unique data collections and data analysis opportunities. With the expansion of web traffic and online stores, companies such as Yahoo, Amazon and eBay started to analyze customer behavior by looking at click rates, IP-specific location data and search logs. This opened up a whole new world of possibilities.
 From a data analysis, data analytics, and Big Data point of view, HTTP-based web traffic introduced a massive increase in semi-structured and unstructured data. Besides the standard structured data types, organizations now needed to find new approaches and storage solutions to deal with these new data types in order to analyze them effectively. The arrival and growth of social media data greatly amplified the need for tools, technologies and analytics techniques that could extract meaningful information out of this unstructured data.
 Big Data phase 3.0
 Although web-based unstructured content is still the main focus for many organizations working with data analysis, data analytics, and big data, new possibilities for retrieving valuable information are emerging from mobile devices.
 Mobile devices not only make it possible to analyze behavioral data (such as clicks and search queries), but also to store and analyze location-based data (GPS data). With the advancement of these devices, it is possible to track movement, analyze physical behavior and even health-related data (such as the number of steps you take per day). This data provides a whole new range of opportunities, from transportation to city design and health care.
 Simultaneously, the rise of sensor-based, internet-enabled devices is increasing data generation like never before. In what has famously been coined the ‘Internet of Things’ (IoT), millions of TVs, thermostats, wearables and even refrigerators are now generating vast amounts of data every day, and the race to extract meaningful and valuable information out of these new data sources has only just begun.
STRUCTURING BIG DATA

Structuring big data involves organizing and processing large and complex datasets in a way that makes
them easier to analyze and use for decision-making. There are several techniques and tools that can be
used to structure big data:
1. Data modeling: Data modeling involves creating a conceptual representation of data, which defines
the relationships between different data elements. This can help to organize data into meaningful
categories and make it easier to query and analyze.
2. Data cleaning: Before analyzing big data, it is important to ensure that the data is accurate, complete, and consistent. Data cleaning involves identifying and correcting errors, removing duplicates, and filling in missing values (a brief sketch follows this list).
3. Data integration: Big data often comes from multiple sources, and it may be necessary to integrate
these sources to create a single, unified dataset. This involves combining data from different sources,
resolving any conflicts or inconsistencies, and creating a common format for the data.
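To make steps 2 and 3 concrete, here is a minimal sketch in Python using pandas (this example is not part of the original material; the file names and column names are purely hypothetical):

import pandas as pd

# Hypothetical input files; columns are illustrative only.
orders = pd.read_csv("orders.csv")        # e.g. order_id, customer_id, amount
customers = pd.read_csv("customers.csv")  # e.g. customer_id, region

# Data cleaning: drop exact duplicates, fix types, fill missing values.
orders = orders.drop_duplicates()
orders["amount"] = pd.to_numeric(orders["amount"], errors="coerce")
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Data integration: combine the two sources into one unified dataset
# using the shared customer_id key.
unified = orders.merge(customers, on="customer_id", how="left")
print(unified.head())

In practice the same cleaning and integration steps are applied with distributed tools such as Spark once the data no longer fits on a single machine.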
STRUCTURING BIG DATA (CONTINUED)

4. Data aggregation: Aggregating data involves grouping data together based on a common characteristic, such as location or time. This can help to simplify the analysis of large datasets by reducing the amount of data that needs to be processed (a brief sketch follows this list).
5. Data indexing: Indexing involves creating a searchable database of the data, which can speed up
queries and make it easier to find specific information. This can be especially useful when working
with large and complex datasets.
6. Data visualization: Data visualization involves creating visual representations of data, such as charts,
graphs, and maps. This can help to make complex data easier to understand and can reveal patterns
and trends that might be difficult to detect otherwise.
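A minimal pandas sketch of step 4 (the data and column names are again invented for illustration): grouping event-level records by region and date collapses many raw rows into a few summary rows before any further analysis.

import pandas as pd

# Hypothetical event-level data; columns are illustrative only.
events = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "date":   pd.to_datetime(["2024-01-01", "2024-01-01",
                              "2024-01-01", "2024-01-02", "2024-01-02"]),
    "amount": [120.0, 80.0, 200.0, 50.0, 75.0],
})

# Data aggregation: group by common characteristics (region and date)
# so downstream analysis works on far fewer rows than the raw events.
daily_totals = (events
                .groupby(["region", "date"], as_index=False)
                .agg(total_amount=("amount", "sum"),
                     n_events=("amount", "count")))
print(daily_totals)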
By using these techniques, organizations can structure big data in a way that makes it easier to analyze,
use, and share. This can lead to more informed decision-making, better insights, and a competitive
advantage in the marketplace.
BIG DATA IN BUSINESS

Big data is transforming the way businesses operate, by enabling them to gain insights into their customers, operations, and
markets that were previously difficult or impossible to obtain. Here are some of the ways that big data is being used in
business:
1. Customer analytics: Big data enables businesses to collect and analyze vast amounts of data about their customers,
including their preferences, behaviors, and interactions with the company. This can help businesses to identify
opportunities for upselling and cross-selling, improve customer engagement and loyalty, and target their marketing efforts
more effectively.
2. Supply chain optimization: Big data can be used to track and analyze the performance of suppliers, transportation
providers, and other partners in the supply chain, enabling businesses to identify inefficiencies and optimize their
operations for improved efficiency and cost savings.
3. Predictive maintenance: By analyzing data from sensors and other sources, businesses can predict when equipment is
likely to fail, enabling them to perform maintenance proactively, reducing downtime and extending the life of the
equipment.
BIG DATA IN BUSINESS (CONTINUED)

4. Fraud detection: Big data can be used to identify patterns of fraudulent activity, enabling businesses to detect suspicious behavior and prevent fraud before losses occur.
5. Market research: Big data can be used to analyze trends and patterns in consumer behavior, allowing
businesses to identify opportunities for new products and services, and make more informed decisions about
marketing and advertising campaigns.
6. Personalization: Big data enables businesses to create more personalized experiences for their customers, by
analyzing their behavior and preferences and tailoring their products and services accordingly.

Big data is helping businesses to become more efficient, effective, and customer-centric. By leveraging the power
of big data, businesses can gain a competitive advantage, identify new opportunities for growth, and make more
informed decisions.
TECHNOLOGIES FOR HANDLING BIG DATA:
DISTRIBUTED AND PARALLEL COMPUTING

1. Definition:
 Distributed computing is a model in which a computation is divided into multiple independent sub-
tasks, which are executed on different computers or nodes in a network. These sub-tasks communicate
with each other to exchange data and coordinate their activities.
 Parallel computing, on the other hand, is a model in which a computation is divided into multiple
sub-tasks that are executed simultaneously on multiple processors or cores within a single computer or
node. Each sub-task is designed to operate independently and in parallel with the others, without the
need for communication or coordination.
2. Resource Utilization:
 Distributed computing allows for the use of multiple computers or nodes in a network to perform a
computation. This enables the distribution of the workload, reducing the load on each individual
computer and potentially improving overall performance. However, it requires communication and
coordination between the nodes, which can introduce overhead and latency.
 Parallel computing, on the other hand, allows for the use of multiple processors or cores within a
single computer to perform a computation. This allows for efficient use of resources within a single
system, without the need for communication or coordination between different nodes. However, it is
limited by the number of processors or cores available within the system.
3. Scalability:
 Distributed computing can be more scalable than parallel computing, as it allows for the addition of
more nodes to the network to increase processing power. This makes it well-suited for handling large-
scale computations that cannot be performed on a single system.
 Parallel computing, on the other hand, is limited by the number of processors or cores available
within a single system. While it can provide high performance for smaller-scale computations, it may
not be able to scale up to handle larger computations.
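The contrast can be illustrated with a short Python sketch (not from the original slides). It uses the standard multiprocessing module to run independent sub-tasks on several cores of a single machine, which is parallel computing; a distributed version of the same job would ship the chunks to different networked nodes via a framework such as Hadoop or Spark, and pay the extra cost of communication and coordination between them.

from multiprocessing import Pool

def partial_sum(chunk):
    """Independent sub-task: sum one slice of the data."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Split the work into four independent chunks.
    chunks = [data[i::4] for i in range(4)]

    # Parallel computing: the chunks run simultaneously on multiple
    # cores of a single machine, with no inter-task communication.
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, chunks)

    print(sum(partials))  # 499999500000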
HADOOP ECOSYSTEM

 Hadoop is an open-source software framework used for distributed storage and processing of large
datasets on clusters of commodity hardware. It was originally developed by Doug Cutting and Mike
Cafarella in 2006 based on the Google File System and MapReduce programming model.
 The Hadoop ecosystem is a platform, or suite of tools, that provides various services to solve big data problems. It includes Apache projects as well as various commercial tools and solutions. There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools or solutions supplement or support these major elements, and all of them work together to provide services such as ingestion, analysis, storage and maintenance of data.
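As an illustration of the MapReduce programming model mentioned above, here is a minimal word-count sketch in plain Python (not part of the original material). In a real Hadoop job the map and reduce functions would run on many nodes, and the framework would perform the shuffle-and-sort step across the cluster; here that step is simulated with an in-memory sort.

from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word (pairs arrive sorted by key)."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    lines = ["big data and analytics", "big data in business"]
    # The shuffle/sort between map and reduce is what Hadoop performs
    # across the cluster; here it is a simple in-memory sort.
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    for word, count in reducer(shuffled):
        print(word, count)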
COMPONENTS OF A HADOOP ECOSYSTEM

• YARN: Yet Another Resource Negotiator (resource management)
• MapReduce: Programming-based data processing
• Spark: In-memory data processing (a brief sketch follows this list)
• Pig, Hive: Query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: Machine learning algorithm libraries
• Solr, Lucene: Searching and indexing
• ZooKeeper: Cluster management
• Oozie: Job scheduling
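Since Spark appears in the list above, here is a minimal, hedged PySpark sketch (it assumes PySpark is installed and an HDFS setup is reachable; the application name, master setting, input path and column names are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a Spark session; "local[*]" runs on all local cores,
# while a real deployment would point at a YARN or standalone cluster.
spark = (SparkSession.builder
         .appName("ecosystem-demo")
         .master("local[*]")
         .getOrCreate())

# Hypothetical input path on HDFS.
df = spark.read.csv("hdfs:///data/orders.csv", header=True, inferSchema=True)

# Distributed, in-memory aggregation across the cluster.
totals = df.groupBy("region").agg(F.sum("amount").alias("total_amount"))
totals.show()

spark.stop()

The same job can be pointed at a YARN cluster simply by changing the master setting, which is one reason Spark fits naturally into the Hadoop ecosystem.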
HDFS

 The Hadoop Distributed File System (HDFS) is a distributed file system used to store and manage large datasets
in a Hadoop cluster. Here's an overview of the HDFS architecture, including the Name Node and Data Nodes:
 Name Node: The Name Node is the master node in the HDFS architecture and is responsible for managing the
file system namespace, keeping track of the location of data blocks, and coordinating read and write requests from
clients. The Name Node stores all of the metadata about the files and directories in the file system, including the
location of each block in the Data Nodes.
 Data Nodes: The Data Nodes are the slave nodes in the HDFS architecture and are responsible for storing and serving data blocks to clients. Each Data Node stores one or more blocks of data, carries out tasks such as block replication and block deletion on behalf of the Name Node, and sends regular heartbeats to report its status.
 When a client wants to read or write a file in HDFS, it communicates with the Name Node to determine the location of the data blocks. The client then communicates directly with the Data Nodes to read or write those blocks. The Name Node keeps track of the location of each block and makes sure that each file is stored with the appropriate level of redundancy for fault tolerance and high availability.
 To ensure the reliability of the file system, HDFS uses a replication model where each block is replicated across
multiple Data Nodes in the cluster. This ensures that even if one or more Data Nodes fail, the data is still available
from other nodes. The replication factor is configurable and can be set to the desired level of redundancy for the
given use case.
 Overall, the Name Node and Data Nodes work together to provide a reliable and scalable file system for storing
and managing large datasets in a Hadoop cluster.
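As an illustration only, the sketch below shows a client talking to HDFS from Python via the pyarrow library (it assumes a reachable NameNode and a locally configured Hadoop client with libhdfs; the host name, port and paths are hypothetical). The library hides the protocol described above: block locations come from the Name Node, while the data bytes are streamed to and from the Data Nodes.

from pyarrow import fs

# Hypothetical NameNode address; requires a configured Hadoop client
# (libhdfs) on the machine running this script.
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# Write a file: the client asks the Name Node where to place the blocks,
# then streams the bytes directly to the chosen Data Nodes.
with hdfs.open_output_stream("/data/demo/hello.txt") as out:
    out.write(b"hello from HDFS\n")

# Read it back: block locations come from the Name Node, data from Data Nodes.
with hdfs.open_input_stream("/data/demo/hello.txt") as src:
    print(src.read().decode())

# List the directory to confirm the file and its size.
for info in hdfs.get_file_info(fs.FileSelector("/data/demo")):
    print(info.path, info.size)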
[Diagram: an HDFS client and the Name Node at the top; below them, four Data Nodes, each storing blocks on its own local disk.]
TRADITIONAL DBMS

Data Structure:
 Organized in tables with a structured schema.
 Well-suited for structured data with predefined relationships.

Data Processing:
 Designed for transactional workloads, i.e. online transaction processing (OLTP).
 Suitable for complex queries and transactions (a minimal sketch follows at the end of this slide).

Scalability:
 Vertical scaling is common, where you add more resources (CPU, RAM) to a single server (computer).
 Can become expensive and challenging to scale horizontally.

Use Cases:
 Well-suited for applications with relational data, structured queries, and complex transactions.
 Examples include financial systems, customer relationship management (CRM), and enterprise resource planning (ERP) systems.
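To make the OLTP point concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a full RDBMS (the schema and values are invented for illustration). The two updates either commit together or roll back together, which is the kind of transactional guarantee traditional DBMSs are built for.

import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a production RDBMS

# Structured schema with a predefined shape for the data.
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL NOT NULL);
    INSERT INTO accounts (id, balance) VALUES (1, 500.0), (2, 100.0);
""")

# OLTP-style transaction: both updates commit together or not at all.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
except sqlite3.Error:
    print("transfer rolled back")

print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())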
HDFS

Data Structure:
 Stores data in a distributed and fault-tolerant manner, using a master/slave architecture.
 Suitable for handling large volumes of unstructured or semi-structured data.

Data Processing:
 Designed for batch processing and large-scale data analysis using frameworks like Apache Hadoop.
 Optimized for processing massive amounts of data in parallel.

Scalability:
 Horizontal scaling is the norm, where you add more machines to the cluster.
 Highly scalable and cost-effective for storing and processing big data.

Use Cases:
 Suited for big data analytics, machine learning, and scenarios where massive amounts of data need to be stored and processed.
 Examples include log processing, data warehousing, and large-scale data processing for analytics.
