BD UNIT 1
UNIT 1
INTRODUCTION TO BIG DATA
1. Data that is very large in size is called Big Data.
2. Normally we work on data of MB size (Word documents, Excel sheets) or at most GB size (movies, code), but Big Data is measured in Petabytes, i.e. 10^15 bytes.
3. Nowadays, data sets of Terabyte (TB) size and beyond are routinely described as Big Data.
4. It is stated that almost 90% of today's data has been generated in the past 5 years.
5. In simple language, Big Data is a collection of data that is larger and more complex than traditional data, and is still growing exponentially with time.
6. It is so huge that no traditional data management software or tool can manage, store, or process it efficiently. So, it needs to be processed step by step via different methodologies.
Sources of Big Data
Social networking sites: Facebook, Google, and LinkedIn all generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from which users' buying trends can be traced.
Weather stations: Weather stations and satellites produce very large amounts of data, which are stored and processed to forecast the weather.
Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly, and for this they store the data of millions of users.
Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
The Applications of Big Data are
Banking and Securities
Communications, Media and Entertainment
Healthcare Providers
Education
Manufacturing and Natural Resources
Government
Insurance
Retail and Wholesale trade
Transportation
Energy and Utilities
Types of Big Data
Unstructured Data:
Unstructured data does not have a fixed format; it is stored without any predefined structure. An example of unstructured data is a web page with text, images, videos, etc.
Semi-structured Data:
Semi-structured data is a combination of structured and unstructured forms of data.
It does not contain tables to show relations; instead, it contains tags or other markers to show hierarchy.
JSON files, XML files, and CSV (comma-separated values) files are examples of semi-structured data. The e-mails we send or receive are also an example of semi-structured data.
STRUCTURED DATA
Structured data is neatly organized into rows and columns and obeys a fixed schema, like records in a relational database table or a spreadsheet.
SEMI STRUCTURED DATA
Semi-structured data is not bound by any rigid schema for data storage and handling.
The data is not in the relational format and is not neatly organized into rows and columns like
that in a spreadsheet.
However, there are some features, like key-value pairs, that help in segregating the different
entities from each other.
Since semi-structured data doesn’t need a structured query language, it is commonly
called NoSQL data.
A data serialization language is used to exchange semi-structured data across systems that
may even have varied underlying infrastructure.
Semi-structured content is often used to store metadata about a business process but it can
also include files containing machine instructions for computer programs.
This type of information typically comes from external sources such as social media platforms
or other web-based data feeds.
Data is created in plain text so that different text-editing tools can be used to draw valuable
insights. Due to a simple format, data serialization readers can be implemented on hardware with
limited processing resources and bandwidth.
Data Serialization Languages
Software developers use serialization languages to write in-memory data to files for transit, storage, and parsing. The sender and the receiver don't need to know anything about each other's systems: as long as the same serialization language is used, the data can be understood comfortably by both.
There are three predominantly used Serialization languages.
XML
XML– XML stands for eXtensible Markup Language.
It is a text-based markup language designed to store and transport data. XML parsers can be found in almost all
popular development platforms.
It is human and machine-readable.
XML has definite standards for schema, transformation, and display. It is self-descriptive. Below is an example
of a programmer’s details in XML.
<ProgrammerDetails>
  <FirstName>Harshita</FirstName>
  <LastName>Deo</LastName>
  <CodingPlatforms>
    <CodingPlatform Type="Fav">Topic</CodingPlatform>
    <CodingPlatform Type="2ndFav">Big Data!</CodingPlatform>
    <CodingPlatform Type="3rdFav">Sec C</CodingPlatform>
  </CodingPlatforms>
</ProgrammerDetails>
XML expresses the data using tags (text within angular brackets) to shape the data (for ex: FirstName)
and attributes (For ex: Type) to feature the data.
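As the text notes, XML parsers exist on almost every platform. As a small hedged sketch (using the programmer record from the example above), here is how Python's built-in xml.etree module might read both the tag content and the Type attributes:

```python
import xml.etree.ElementTree as ET

# The programmer-details document from the example above
xml_doc = """
<ProgrammerDetails>
  <FirstName>Harshita</FirstName>
  <LastName>Deo</LastName>
  <CodingPlatforms>
    <CodingPlatform Type="Fav">Topic</CodingPlatform>
    <CodingPlatform Type="2ndFav">Big Data!</CodingPlatform>
    <CodingPlatform Type="3rdFav">Sec C</CodingPlatform>
  </CodingPlatforms>
</ProgrammerDetails>
"""

root = ET.fromstring(xml_doc)
first_name = root.find("FirstName").text      # tag content
platforms = {
    p.get("Type"): p.text                     # attribute -> tag content
    for p in root.iter("CodingPlatform")
}
print(first_name)         # Harshita
print(platforms["Fav"])   # Topic
```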
JSON
JSON (JavaScript Object Notation) is a lightweight open-standard file format for data interchange.
JSON is easy to use and uses human/machine-readable text to store and transmit data objects.
{
  "firstName": "Sachin",
  "lastName": "Singh",
  "codingPlatforms": [
    { "type": "Fav", "value": "Topic" },
    { "type": "2ndFav", "value": "Big Data!" },
    { "type": "3rdFav", "value": "Sec C" }
  ]
}
This format isn't as formal as XML; it's more like a key/value pair model than a formal data
depiction. JavaScript has inbuilt support for JSON. Although JSON is very popular amongst web
developers, non-technical personnel find it tedious to work with due to its heavy dependence
on JavaScript and structural characters (braces, commas, etc.).
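Because JSON maps directly onto dictionaries and lists, parsing it takes one call in most languages. A minimal sketch using Python's built-in json module and the record from the example above:

```python
import json

# The programmer record from the example above, as a JSON string
json_doc = """
{
  "firstName": "Sachin",
  "lastName": "Singh",
  "codingPlatforms": [
    { "type": "Fav",    "value": "Topic" },
    { "type": "2ndFav", "value": "Big Data!" },
    { "type": "3rdFav", "value": "Sec C" }
  ]
}
"""

record = json.loads(json_doc)                  # parse text -> Python dict
favourite = record["codingPlatforms"][0]["value"]
print(record["firstName"], favourite)          # Sachin Topic
```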
YAML
YAML is a user-friendly data serialization language. The name is a recursive acronym: YAML Ain't Markup
Language. It is adopted by technical and non-technical users all across the globe owing to its
simplicity. The data structure is defined by line separation and indentation, which reduces the
dependency on structural characters. YAML is easy to read, and its popularity is a
result of its human- and machine-readability.
firstName: Sachin
lastName: Singh
codingPlatforms:
  - type: Fav
    value: Topic
  - type: 2ndFav
    value: Big Data
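To make the "structure comes from line separation and indentation" point concrete, here is a deliberately simplified serializer that turns nested dicts and lists into YAML-style text. This is an illustration only, not a real YAML implementation; production code should use a library such as PyYAML, which handles quoting, anchors, and edge cases:

```python
def to_yaml(value, indent=0):
    """Serialize nested dicts/lists/scalars into YAML-style text.
    A simplified illustration only: real projects should use a
    YAML library, which handles quoting, anchors, and edge cases."""
    pad = "  " * indent
    lines = []
    if isinstance(value, dict):
        for key, val in value.items():
            if isinstance(val, (dict, list)):
                lines.append(f"{pad}{key}:")          # nested block follows
                lines.append(to_yaml(val, indent + 1))
            else:
                lines.append(f"{pad}{key}: {val}")    # scalar on one line
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, dict):
                # first key rides on the "- " line, the rest are indented
                first, *rest = item.items()
                lines.append(f"{pad}- {first[0]}: {first[1]}")
                for k, v in rest:
                    lines.append(f"{pad}  {k}: {v}")
            else:
                lines.append(f"{pad}- {item}")
    else:
        lines.append(f"{pad}{value}")
    return "\n".join(lines)

record = {
    "firstName": "Sachin",
    "lastName": "Singh",
    "codingPlatforms": [
        {"type": "Fav", "value": "Topic"},
        {"type": "2ndFav", "value": "Big Data"},
    ],
}
print(to_yaml(record))   # reproduces the YAML example above
```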
UNSTRUCTURED DATA
Unstructured data is the kind of data that doesn’t adhere to any definite schema or
set of rules. Its arrangement is unplanned and haphazard.
Photos, videos, text documents, and log files can be generally considered
unstructured data.
Additionally, Unstructured data is also known as “dark data” because it cannot be
analysed without the proper software tools.
SUMMARY
Structured data is neatly organized and obeys a fixed set of rules.
Semi-structured data does not obey a rigid schema, but it has certain discernible features
that provide organization. Data serialization languages such as XML, JSON, and YAML are used
to convert data objects into a byte stream for exchange.
Unstructured data has no predefined structure at all. All three kinds of data are
present in a typical application, and all three play equally important roles in developing
resourceful and attractive applications.
EVOLUTION OF BIG DATA
Looking back over the last few decades, we can see that Big Data technology has grown
enormously. Some of the major milestones in the evolution of Big Data are described below:
Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyse large volumes of
structured data.
Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source
framework that provides distributed storage and large-scale data processing.
NoSQL Databases:
In 2009, NoSQL databases were introduced, which provide a flexible way to store and
retrieve unstructured data.
Cloud Computing:
Cloud Computing technology helps companies to store their important data in data centres
that are remote, and it saves their infrastructure cost and maintenance costs.
Machine Learning:
Machine Learning algorithms work on large data sets, analysing huge amounts of data to extract
meaningful insights. This has led to the development of artificial intelligence (AI) applications.
Data Streaming:
Data Streaming technology has emerged as a solution to process large volumes of data in real time.
Edge Computing:
Edge Computing is a distributed computing paradigm that allows data processing to be done
at the edge of the network, closer to the source of the data.
Overall, big data technology has come a long way since the early days of data warehousing. The
introduction of Hadoop, NoSQL databases, cloud computing, machine learning, data streaming, and
edge computing has revolutionized how we store, process, and analyse large volumes of data. As
technology evolves, we can expect Big Data to play a very important role in various industries.
Big Data Characteristics
There are five V's of Big Data that describe its characteristics.
Volume
The name Big Data itself relates to enormous size. Big Data is a vast 'volume' of
data generated daily from many sources, such as business processes, machines, social
media platforms, networks, human interactions, and many more.
Facebook alone generates approximately a billion messages per day, records around 4.5 billion
clicks of the "Like" button, and receives more than 350 million new posts each day. Big Data
technologies are built to handle data at this scale.
Variety
Big Data can be structured, unstructured, or semi-structured, collected from
different sources. In the past, data was collected only from databases and spreadsheets, but
these days data arrives in an array of forms: PDFs, e-mails, audio, social media posts,
photos, videos, etc.
Veracity
Veracity refers to how reliable and trustworthy the data is. Because data arrives from many
sources in varying quality, it must be filtered, cleaned, and translated before it can be
trusted. Being able to handle and manage data accurately in this way is essential for using
Big Data in business development.
For example, Facebook posts with hashtags can be noisy and inconsistent, so they must be
validated before analysis.
Value
Value refers to the 'insights' gained from the data. It means whether the given data
set is producing any useful result. Data, in its raw form, gives no valuable result, but
once processed efficiently, it can give us important insights that could help us in
decision-making.
Velocity
Velocity refers to the 'speed' or rate at which data is generated and accumulated, often in
real time. It covers the speed of incoming data streams, their rate of change, and bursts of
activity. A primary requirement of Big Data systems is to handle data arriving this rapidly.
Big Data velocity deals with the speed at which data flows in from sources like application
logs, business processes, networks, social media sites, sensors, mobile devices, etc.
For example, in 2010 YouTube had 200 million monthly active users; by 2022 that figure had
grown to 2.6 billion.
BIG DATA ARCHITECTURE
Big Data architecture is a framework that defines the components,
processes, and technologies needed to capture, store, process, and
analyse Big Data.
Big Data architecture typically includes four Big Data architecture layers:
1. Data collection and ingestion.
2. Data processing and analysis.
3. Data storage.
4. Data visualization and reporting.
Each layer has its own set of technologies, tools, and processes.
The term "Big Data architecture" refers to the systems and software used
to manage Big Data. A Big Data architecture must be able to handle the
scale, complexity, and variety of Big Data.
It must also be able to support the needs of different users, who may
want to access and analyse the data differently.
Big Data Architecture Layers
There are four main Big Data architecture layers to an architecture of Big Data:
1. Data Ingestion
This layer is responsible for collecting and storing data from various sources. In
Big Data, data ingestion is the process of extracting data from various sources and
loading it into a data repository. Data ingestion is a key component of a Big Data
architecture because it determines how data will be ingested, transformed, and
stored.
2. Data Processing
Data processing is the second layer, responsible for collecting, cleaning, and
preparing the data for analysis. This layer is critical for ensuring that the data is
high quality and ready to be used in the future.
3. Data Storage
Data storage is the third layer, responsible for storing the data in a format that can be easily
accessed and analysed. This layer is essential for ensuring that the data is accessible and available
to the other layers.
4. Data Visualization
Data visualization is the fourth layer and is responsible for creating visualizations of the data that
humans can easily understand. This layer is important for making the data accessible.
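The four layers above can be sketched as one tiny end-to-end pipeline. All names and records below are illustrative assumptions, not part of any real platform; each function stands in for a whole layer:

```python
# A toy end-to-end sketch of the four architecture layers.

def ingest():
    """Layer 1 (ingestion): collect raw records from simulated sources."""
    return [
        {"user": "a", "amount": "120.5"},
        {"user": "b", "amount": "bad-value"},   # dirty record
        {"user": "a", "amount": "79.5"},
    ]

def process(raw):
    """Layer 2 (processing): clean and transform, dropping invalid records."""
    clean = []
    for rec in raw:
        try:
            clean.append({"user": rec["user"], "amount": float(rec["amount"])})
        except ValueError:
            pass                                # discard unparseable amounts
    return clean

def store(clean, repository):
    """Layer 3 (storage): persist to a repository (here, an in-memory dict)."""
    for rec in clean:
        repository.setdefault(rec["user"], []).append(rec["amount"])
    return repository

def visualize(repository):
    """Layer 4 (visualization/reporting): a human-readable per-user summary."""
    return {user: sum(amounts) for user, amounts in repository.items()}

repo = store(process(ingest()), {})
print(visualize(repo))   # {'a': 200.0}
```

The dirty record for user "b" is dropped in layer 2, so only user "a" survives to the report, which is exactly the quality-assurance role the processing layer plays.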
BIG DATA PLATFORMS
The constant stream of information from various sources is becoming more intense, especially
with the advance in technology. And this is where big data platforms come in to store and
analyse the ever-increasing mass of information.
A big data platform is an integrated computing solution that combines numerous software
systems, tools, and hardware for big data management. It is a one-stop architecture that solves
all the data needs of a business regardless of the volume and size of the data at hand. Due to
their efficiency in data management, enterprises are increasingly adopting big data platforms to
gather tons of data and convert them into structured, actionable business insights.
Currently, the marketplace is flooded with numerous Open source and commercially available
big data platforms. They boast different features and capabilities for use in a big data
environment.
Characteristics of a big data platform
Ability to accommodate new applications and tools depending on the evolving business
needs
Support several data formats
Ability to accommodate large volumes of streaming or at-rest data
Have a wide variety of conversion tools to transform data to different preferred formats
Capacity to accommodate data at any speed
Provide the tools for scouring the data through massive data sets
Support linear scaling
The ability for quick deployment
Have the tools for data analysis and reporting requirements
How Big Data Platform works
Big Data platform workflow can be divided into the following stages:
Data Collection
Big Data platforms collect data from various sources, such as sensors, weblogs, social media, and
other databases.
Data Storage
Once the data is collected, it is stored in a repository, such as Hadoop Distributed File System (HDFS),
Amazon S3, or Google Cloud Storage.
Data Processing
Data Processing involves tasks such as filtering, transforming, and aggregating the data. This can be
done using distributed processing frameworks such as Apache Spark, Apache Flink, or Apache
Storm.
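The filter/transform/aggregate pattern mentioned above is the same in plain Python as it is in a distributed framework; frameworks like Spark simply run it in parallel across a cluster. A small single-machine sketch (the log entries are made up for illustration):

```python
from functools import reduce

# Simulated web-server log entries (status code, response bytes)
logs = [
    {"status": 200, "bytes": 512},
    {"status": 404, "bytes": 0},
    {"status": 200, "bytes": 2048},
    {"status": 500, "bytes": 128},
]

# Filter: keep successful requests only
ok = filter(lambda e: e["status"] == 200, logs)
# Transform: project out just the payload size
sizes = map(lambda e: e["bytes"], ok)
# Aggregate: total bytes served successfully
total = reduce(lambda acc, n: acc + n, sizes, 0)

print(total)   # 2560
```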
Data Analytics
After data is processed, it is then analysed with analytics tools and techniques, such as machine learning
algorithms, predictive analytics, and data visualization.
Data Governance
Data Governance (data cataloguing, data quality management, and data lineage tracking) ensures the
accuracy, completeness, and security of the data.
Data Management
Big data platforms provide management capabilities that enable organizations to back up,
recover, and archive their data.
Data analytics is taking the analysed data and working on it in a meaningful and useful way to make
well-versed business decisions.
Data analytics is a traditional or generic type of analytics used in enterprises to make data-driven
decisions.
What is big data analytics?
a. Big data analytics is the often complex process of examining big data to uncover
information -- such as hidden patterns, correlations, market trends and customer
preferences -- that can help organizations make informed business decisions.
b. On a broad scale, data analytics technologies and techniques give organizations a way
to analyse data sets and gather new information.
c. Business intelligence (BI) queries answer basic questions about business operations and
performance.
d. Big data analytics is a form of advanced analytics, which involves complex applications
with elements such as predictive models, statistical algorithms and what-if analysis
powered by analytics systems.
e. An example of big data analytics can be found in the healthcare industry, where millions
of patient records, medical claims, clinical results, care management records and other
data must be collected, aggregated, processed and analysed. Big data analytics is used
for accounting, decision-making, predictive analytics and many other purposes. This
data varies greatly in type, quality and accessibility, presenting significant challenges
but also offering tremendous benefits.
Why is big data analytics important?
Organizations can use big data analytics systems and software to make data-driven
decisions that can improve their business-related outcomes.
The benefits can include more effective marketing, new revenue opportunities,
customer personalization and improved operational efficiency.
With an effective strategy, these benefits can provide competitive advantages over
competitors.
How does big data analytics work?
Data analysts, data scientists, predictive modellers, statisticians and other analytics
professionals collect, process, clean and analyse growing volumes of structured transaction
data, as well as other forms of data not used by conventional BI and analytics programs.
The following is an overview of the four steps of the big data analytics process:
1.Data professionals collect data from a variety of different sources. Often, it's a mix of semi-
structured and unstructured data. While each organization uses different data streams, some
common sources include the following:
Internet clickstream data.
Web server logs.
Cloud applications.
Mobile applications.
Social media content.
Text from customer emails and survey responses.
Mobile phone records.
Machine data captured by sensors connected to the internet of things.
2. Data is processed. After collection, data professionals organize, configure and partition
the data for analytical workloads.
3. Data is cleansed to improve its quality. Professionals scrub it for errors, duplication
and inconsistencies so that analysis results are reliable.
4. The collected, processed and cleaned data is analysed using analytics software.
This includes tools for the following:
Data mining, which sifts through data sets in search of patterns and
relationships.
Predictive analytics, which builds models to forecast customer
behaviour and other future actions, scenarios and trends.
Machine learning, which taps various algorithms to analyse large
data sets.
Deep learning, which is a more advanced offshoot of machine
learning.
Text mining and statistical analysis software.
Artificial intelligence.
Mainstream BI software.
Data visualization tools.
Types of big data analytics
There are several different types of big data analytics, each with its own application within the
enterprise.
Descriptive analytics. This is the simplest form of analytics, where data is analysed for general
assessment and summarization. For example, in sales reporting, an organization can analyse the
efficiency of marketing from such data.
Diagnostic analytics. This refers to analytics that determine why a problem occurred. For
example, this could include gathering and studying competitor pricing data to determine when a
product's sales fell off because the competitor undercut it with a price drop.
Predictive analytics. This refers to analysis that predicts what comes next. For example, this could
include monitoring the performance of machines in a factory and comparing that data to historical
data to determine when a machine is likely to break down or require maintenance or replacement.
Prescriptive analytics. This form of analysis follows diagnostics and predictions. After an issue has
been identified, it provides a recommendation of what can be done about it.
For example, this could include addressing inconsistencies in supply chain that are causing pricing
problems by identifying suppliers whose performance is unreliable, suggesting their replacement.
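The difference between descriptive and predictive analytics can be made concrete with a toy example. The sales figures below are invented for illustration, and the one-step "forecast" is a crude stand-in for a real predictive model:

```python
# Toy monthly sales figures (illustrative numbers only)
sales = [100, 110, 120, 130, 140, 150]

# Descriptive analytics: summarize what happened
average = sum(sales) / len(sales)
growth = sales[-1] - sales[0]

# Predictive analytics (very crude): extend the average month-on-month
# change one step into the future, a stand-in for a real forecast model
step = growth / (len(sales) - 1)
forecast = sales[-1] + step

print(average, forecast)   # 125.0 160.0
```

Descriptive analytics only reports the past (average of 125); the predictive step uses that history to say something about next month (160), which is the "what comes next" question described above.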
The benefits of using big data analytics include the following:
Real-time intelligence. Organizations can quickly analyse large amounts of real-time data from
different sources, in many different formats and types.
Better-informed decisions. Effective strategizing can benefit and improve the supply chain,
operations and other areas of strategic decision-making.
Cost savings. This can result from new business process efficiencies and optimizations.
Better customer engagement. A better understanding of customer needs, behaviour and sentiment
can lead to better marketing insights and provide information for product development.
Optimize risk management strategies. Big data analytics improves risk management strategies by
enabling organizations to address threats in real time.
#Challenges that come with using big data analytics:-
1. Data accessibility.
2. Data quality maintenance.
3. Data security.
4. Choosing the right tools.
5. Talent shortages.
Big data privacy and ethics
Privacy
Big data privacy is protecting individuals' personal and sensitive data when it comes to collecting,
storing, processing, and analysing large amounts of data. Following are some important aspects of big
data privacy:
1. Informed consent
When it comes to big data privacy, informed consent is the foundation.
Organizations need to ask individuals' permission before they collect their data.
With informed consent, people know exactly what their data is being used for, how it's being used,
and what the consequences could be.
By giving clear explanations and letting people choose how they want to use their data, organizations
can create trust and respect for people's privacy.
2. Protecting individual identity
Protecting individual identity is of paramount importance.
There are two techniques used to protect individual identity: anonymisation and de-identification.
Anonymisation means removing or encrypting personally identifiable information (PII) so that
individuals cannot be identified in the dataset.
De-identification goes beyond anonymisation by transforming data in ways that prevent
re-identification.
These techniques enable organisations to gain insights while protecting privacy.
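A minimal sketch of the anonymisation idea, replacing PII fields with salted one-way hashes. The field names, salt, and record are illustrative assumptions, and real de-identification must also guard against re-identification through quasi-identifiers (age, postcode, etc.):

```python
import hashlib

# Fields treated as PII in this illustration
PII_FIELDS = {"name", "email"}

def anonymise(record, salt="demo-salt"):
    """Replace PII values with salted one-way hashes so individuals
    cannot be read directly from the dataset. A sketch only: real
    de-identification must also handle quasi-identifiers that could
    allow re-identification."""
    out = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode()).hexdigest()
            out[key] = digest[:12]          # truncated token, not the value
        else:
            out[key] = value                # non-PII fields pass through
    return out

patient = {"name": "Asha Rao", "email": "asha@example.com", "age": 41}
print(anonymise(patient))
```

Because the hash is deterministic, the same person still maps to the same token, so analysts can count and join records without ever seeing the underlying identity.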
3. Data integrity and confidentiality
Data integrity and confidentiality are two of the most important aspects of data security.
Without them, unauthorised access to data, data breaches, and cyber threats are at an all-time high.
That’s why it’s essential for organisations to implement strong security measures, such as encryption,
security access controls, and periodic security audits. Data integrity and confidentiality help
organisations build trust with their users and promote responsible data management.
4. Purpose limitation and data minimization
Big data privacy and ethics call for the principle of purpose limitation.
Data should only be used for specified, authorized purposes and should not be reused
without permission from the user.
Additionally, data minimization involves collecting and retaining only the minimum
amount of data necessary for the intended purpose, reducing privacy risks and
potential harm.
5. Transparency and accountability
One of the most important ways to build trust with users is through transparency in
data practices.
Individuals' data collection, data usage, and data sharing should all be clearly defined by
organizations.
Accountability for data management and privacy compliance reinforces ethical data
management.
6. Control and autonomy
Privacy and ethics require organizations to respect individual rights.
Individuals are entitled to access, update, and erase their data.
Organizations should provide easy mechanisms for users to exercise these rights and maintain control
and autonomy over their data.
Ethics
Big data ethics refers to the ethical and responsible decisions that are made when collecting,
processing, analysing, and deploying large and complex data sets, covering principles such as
the consent, transparency, and accountability practices described above.
XPLENTY
Xplenty is a data analytics tool for building data pipelines using minimal code.
It offers a wide range of solutions for sales, marketing, and support.
With the help of its interactive graphical interface, it provides solutions for ETL, ELT, etc.
The best part of using Xplenty is its low investment in hardware and software, and it offers support via email,
chat, telephone, and virtual meetings.
Xplenty is a cloud platform for processing data for analytics, bringing all the data together.
#Features of Xplenty:
REST API: A user can do almost anything by using the REST API.
Flexibility: Data can be sent to, and pulled from, databases, warehouses, and Salesforce.
Data Security: It offers SSL/TLS encryption, and the platform verifies algorithms and certificates
regularly.
Deployment: It offers integration apps for both cloud and in-house use and supports deployment of
integration apps over the cloud.
Map Reduce Layer
The Map Reduce layer comes into play when a client application submits a Map Reduce
job to the Job Tracker.
In response, the Job Tracker sends the request to the appropriate Task Trackers.
Sometimes a Task Tracker fails or times out. In such a case, that part of the job is
rescheduled.
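The programming model behind that job flow is map, shuffle, reduce. A classic word-count sketch in plain Python (the input "splits" stand in for the HDFS blocks a real job would read):

```python
from collections import defaultdict

# Input "splits", standing in for the HDFS blocks a real job reads
splits = ["big data is big", "data needs big tools"]

# Map phase: each mapper emits (word, 1) pairs for its split
mapped = []
for split in splits:
    for word in split.split():
        mapped.append((word, 1))

# Shuffle phase: group intermediate pairs by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: each reducer sums the counts for one key
word_counts = {word: sum(counts) for word, counts in groups.items()}

print(word_counts["big"])   # 3
```

In a real cluster, the map and reduce steps run as the tasks the Job Tracker hands to Task Trackers, which is why a failed Task Tracker only forces that slice of the work to be rescheduled, not the whole job.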
Advantages of Hadoop:-
Fast: In HDFS, data is distributed over the cluster and mapped, which helps in faster
retrieval. Even the tools to process the data are often on the same servers, thus reducing
processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is
really cost-effective compared to a traditional relational database management system.
Resilient to failure: HDFS can replicate data over the network, so if one node goes down or
some other network failure happens, Hadoop uses another copy of the data. Normally, data is
replicated three times, but the replication factor is configurable.
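One practical consequence of replication is worth working through: raw cluster capacity must exceed the logical data size by the replication factor. A small arithmetic sketch (the helper name is ours, and it ignores real-world overheads like block padding and temporary space):

```python
def raw_storage_needed(logical_bytes, replication_factor=3):
    """Bytes of raw cluster capacity needed to hold `logical_bytes`
    of data when every block is stored `replication_factor` times.
    HDFS defaults to a factor of 3, but it is configurable."""
    return logical_bytes * replication_factor

one_tb = 10**12
print(raw_storage_needed(one_tb) / 10**12)   # 3.0 -> 1 TB of data occupies 3 TB raw
```

So a cluster meant to hold 1 TB of user data needs at least 3 TB of raw disk under the default factor; lowering the factor trades that overhead against resilience to node failures.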