BD UNIT 1
UNIT 1
INTRODUCTION TO BIG DATA
1. Data that is very large in size is called Big Data.
2. Normally we work on data of MB size (Word documents, Excel sheets) or at most GB size (movies, code), but Big Data is measured in Petabytes, i.e. 10^15 bytes.
3. Nowadays, data sets of Terabyte (TB) size and beyond are routinely described as Big Data.
4. It is stated that almost 90% of today's data has been generated in the past 5 years.
5. In simple language, Big Data is a collection of data that is larger and more complex than traditional data, and is still growing exponentially with time.
6. It is so huge that no traditional data management software or tool can manage, store, or process it efficiently. So, it needs to be processed step by step via different methodologies.
Sources of Big Data
Social networking sites: Facebook, Google, and LinkedIn all generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes of logs from which users' buying trends can be traced.
Weather stations: Weather stations and satellites produce very large amounts of data, which are stored and processed to forecast the weather.
Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly, and for this they store the data of millions of users.
Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
The Applications of Big Data are
Banking and Securities
Communications, Media and Entertainment
Healthcare Providers
Education
Manufacturing and Natural Resources
Government
Insurance
Retail and Wholesale trade
Transportation
Energy and Utilities
Types of Big Data
Unstructured Data:
Unstructured data does not have a fixed format; it is stored without any predefined structure. An example of unstructured data is a web page with text, images, videos, etc.
Semi-structured Data:
Semi-structured data is a combination of structured and unstructured forms of data.
It does not contain tables to show relations; instead, it contains tags or other markers to show hierarchy.
JSON files, XML files, and CSV (comma-separated values) files are examples of semi-structured data. The e-mails we send or receive are also an example of semi-structured data.
STRUCTURED DATA
Structured data is neatly organized into rows and columns and obeys a fixed schema, like records in a relational database table or a spreadsheet.
SEMI STRUCTURED DATA
Semi-structured data is not bound by any rigid schema for data storage and handling.
The data is not in the relational format and is not neatly organized into rows and columns like
that in a spreadsheet.
However, there are some features, like key-value pairs, that help in segregating the different
entities from each other.
Since semi-structured data doesn’t need a structured query language, it is commonly
called NoSQL data.
A data serialization language is used to exchange semi-structured data across systems that
may even have varied underlying infrastructure.
Semi-structured content is often used to store metadata about a business process but it can
also include files containing machine instructions for computer programs.
This type of information typically comes from external sources such as social media platforms
or other web-based data feeds.
Data is created in plain text so that different text-editing tools can be used to draw valuable
insights. Due to a simple format, data serialization readers can be implemented on hardware with
limited processing resources and bandwidth.
Data Serialization Languages
Software developers use serialization languages to write in-memory data to files for transit, storage, and parsing. The sender and the receiver don't need to know anything about each other's systems: as long as the same serialization language is used, the data can be understood comfortably by both.
There are three predominantly used Serialization languages.
XML
XML– XML stands for eXtensible Markup Language.
It is a text-based markup language designed to store and transport data. XML parsers can be found in almost all
popular development platforms.
It is human and machine-readable.
XML has definite standards for schema, transformation, and display. It is self-descriptive. Below is an example
of a programmer’s details in XML.
<ProgrammerDetails>
  <FirstName>Harshita</FirstName>
  <LastName>Deo</LastName>
  <CodingPlatforms>
    <CodingPlatform Type="Fav">Topic</CodingPlatform>
    <CodingPlatform Type="2ndFav">Big Data!</CodingPlatform>
    <CodingPlatform Type="3rdFav">Sec C</CodingPlatform>
  </CodingPlatforms>
</ProgrammerDetails>
XML expresses the data using tags (text within angular brackets) to shape the data (for ex: FirstName)
and attributes (For ex: Type) to feature the data.
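As the text notes, XML parsers exist on almost every platform. As a small hedged sketch (using the programmer record from the example above), here is how Python's built-in xml.etree module might read both the tag content and the Type attributes:

```python
import xml.etree.ElementTree as ET

# The programmer-details document from the example above
xml_doc = """
<ProgrammerDetails>
  <FirstName>Harshita</FirstName>
  <LastName>Deo</LastName>
  <CodingPlatforms>
    <CodingPlatform Type="Fav">Topic</CodingPlatform>
    <CodingPlatform Type="2ndFav">Big Data!</CodingPlatform>
    <CodingPlatform Type="3rdFav">Sec C</CodingPlatform>
  </CodingPlatforms>
</ProgrammerDetails>
"""

root = ET.fromstring(xml_doc)
first_name = root.find("FirstName").text      # tag content
platforms = {
    p.get("Type"): p.text                     # attribute -> tag content
    for p in root.iter("CodingPlatform")
}
print(first_name)         # Harshita
print(platforms["Fav"])   # Topic
```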
JSON
JSON (JavaScript Object Notation) is a lightweight open-standard file format for data interchange.
JSON is easy to use and uses human/machine-readable text to store and transmit data objects.
{
  "firstName": "Sachin",
  "lastName": "Singh",
  "codingPlatforms": [
    { "type": "Fav", "value": "Topic" },
    { "type": "2ndFav", "value": "Big Data!" },
    { "type": "3rdFav", "value": "Sec C" }
  ]
}
This format isn't as formal as XML; it's more like a key/value pair model than a formal data
depiction. JavaScript has inbuilt support for JSON. Although JSON is very popular amongst web
developers, non-technical personnel find it tedious to work with due to its heavy dependence
on JavaScript and structural characters (braces, commas, etc.).
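Because JSON maps directly onto dictionaries and lists, parsing it takes one call in most languages. A minimal sketch using Python's built-in json module and the record from the example above:

```python
import json

# The programmer record from the example above, as a JSON string
json_doc = """
{
  "firstName": "Sachin",
  "lastName": "Singh",
  "codingPlatforms": [
    { "type": "Fav",    "value": "Topic" },
    { "type": "2ndFav", "value": "Big Data!" },
    { "type": "3rdFav", "value": "Sec C" }
  ]
}
"""

record = json.loads(json_doc)                  # parse text -> Python dict
favourite = record["codingPlatforms"][0]["value"]
print(record["firstName"], favourite)          # Sachin Topic
```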
YAML
YAML is a user-friendly data serialization language. The name is a recursive acronym: YAML Ain't Markup
Language. It is adopted by technical and non-technical users all across the globe owing to its
simplicity. The data structure is defined by line separation and indentation, which reduces the
dependency on structural characters. YAML is easy to read, and its popularity is a
result of its human- and machine-readability.
firstName: Sachin
lastName: Singh
codingPlatforms:
  - type: Fav
    value: Topic
  - type: 2ndFav
    value: Big Data
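To make the "structure comes from line separation and indentation" point concrete, here is a deliberately simplified serializer that turns nested dicts and lists into YAML-style text. This is an illustration only, not a real YAML implementation; production code should use a library such as PyYAML, which handles quoting, anchors, and edge cases:

```python
def to_yaml(value, indent=0):
    """Serialize nested dicts/lists/scalars into YAML-style text.
    A simplified illustration only: real projects should use a
    YAML library, which handles quoting, anchors, and edge cases."""
    pad = "  " * indent
    lines = []
    if isinstance(value, dict):
        for key, val in value.items():
            if isinstance(val, (dict, list)):
                lines.append(f"{pad}{key}:")          # nested block follows
                lines.append(to_yaml(val, indent + 1))
            else:
                lines.append(f"{pad}{key}: {val}")    # scalar on one line
    elif isinstance(value, list):
        for item in value:
            if isinstance(item, dict):
                # first key rides on the "- " line, the rest are indented
                first, *rest = item.items()
                lines.append(f"{pad}- {first[0]}: {first[1]}")
                for k, v in rest:
                    lines.append(f"{pad}  {k}: {v}")
            else:
                lines.append(f"{pad}- {item}")
    else:
        lines.append(f"{pad}{value}")
    return "\n".join(lines)

record = {
    "firstName": "Sachin",
    "lastName": "Singh",
    "codingPlatforms": [
        {"type": "Fav", "value": "Topic"},
        {"type": "2ndFav", "value": "Big Data"},
    ],
}
print(to_yaml(record))   # reproduces the YAML example above
```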
UNSTRUCTURED DATA
Unstructured data is the kind of data that doesn’t adhere to any definite schema or
set of rules. Its arrangement is unplanned and haphazard.
Photos, videos, text documents, and log files can be generally considered
unstructured data.
Additionally, Unstructured data is also known as “dark data” because it cannot be
analysed without the proper software tools.
SUMMARY
Structured data is neatly organized and obeys a fixed set of rules.
Semi-structured data does not obey a rigid schema, but it has certain discernible features
that provide organization. Data serialization languages such as XML, JSON, and YAML are used
to convert data objects into a byte stream for exchange.
Unstructured data has no predefined structure at all. All three kinds of data are
present in a typical application, and all three play equally important roles in developing
resourceful and attractive applications.
EVOLUTION OF BIG DATA
Looking back over the last few decades, we can see that Big Data technology has grown
enormously. Some of the major milestones in the evolution of Big Data are described below:
Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyse large volumes of
structured data.
Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source
framework that provides distributed storage and large-scale data processing.
NoSQL Databases:
In 2009, NoSQL databases were introduced, which provide a flexible way to store and
retrieve unstructured data.
Cloud Computing:
Cloud Computing technology helps companies to store their important data in data centres
that are remote, and it saves their infrastructure cost and maintenance costs.
Machine Learning:
Machine Learning algorithms work on large data sets, analysing huge amounts of data to extract
meaningful insights. This has led to the development of artificial intelligence (AI) applications.
Data Streaming:
Data Streaming technology has emerged as a solution to process large volumes of data in real time.
Edge Computing:
Edge Computing is a distributed computing paradigm that allows data processing to be done
at the edge of the network, closer to the source of the data.
Overall, big data technology has come a long way since the early days of data warehousing. The
introduction of Hadoop, NoSQL databases, cloud computing, machine learning, data streaming, and
edge computing has revolutionized how we store, process, and analyse large volumes of data. As
technology evolves, we can expect Big Data to play a very important role in various industries.
Big Data Characteristics
There are five V's of Big Data that describe its characteristics.
Volume
The name Big Data itself relates to enormous size. Big Data is a vast 'volume' of
data generated daily from many sources, such as business processes, machines, social
media platforms, networks, human interactions, and many more.
Facebook alone generates approximately a billion messages per day, records around 4.5 billion
clicks of the "Like" button, and receives more than 350 million new posts each day. Big Data
technologies are built to handle data at this scale.
Variety
Big Data can be structured, unstructured, or semi-structured, collected from
different sources. In the past, data was collected only from databases and spreadsheets, but
these days data arrives in an array of forms: PDFs, e-mails, audio, social media posts,
photos, videos, etc.
Veracity
Veracity refers to how reliable and trustworthy the data is. Because data arrives from many
sources in varying quality, it must be filtered, cleaned, and translated before it can be
trusted. Being able to handle and manage data accurately in this way is essential for using
Big Data in business development.
For example, Facebook posts with hashtags can be noisy and inconsistent, so they must be
validated before analysis.
Value
Value refers to the 'insights' gained from the data. It means whether the given data
set is producing any useful result. Data, in its raw form, gives no valuable result, but
once processed efficiently, it can give us important insights that could help us in
decision-making.
Velocity
Velocity refers to the 'speed' or rate at which data is generated and accumulated, often in
real time. It covers the speed of incoming data streams, their rate of change, and bursts of
activity. A primary requirement of Big Data systems is to handle data arriving this rapidly.
Big Data velocity deals with the speed at which data flows in from sources like application
logs, business processes, networks, social media sites, sensors, mobile devices, etc.
For example, in 2010 YouTube had 200 million monthly active users; by 2022 that figure had
grown to 2.6 billion.
BIG DATA ARCHITECTURE
Big Data architecture is a framework that defines the components,
processes, and technologies needed to capture, store, process, and
analyse Big Data.
Big Data architecture typically includes four Big Data architecture layers:
1. Data collection and ingestion.
2. Data processing and analysis.
3. Data storage.
4. Data visualization and reporting.
Each layer has its own set of technologies, tools, and processes.
The term "Big Data architecture" refers to the systems and software used
to manage Big Data. A Big Data architecture must be able to handle the
scale, complexity, and variety of Big Data.
It must also be able to support the needs of different users, who may
want to access and analyse the data differently.
Big Data Architecture Layers
There are four main Big Data architecture layers to an architecture of Big Data:
1. Data Ingestion
This layer is responsible for collecting and storing data from various sources. In
Big Data, data ingestion is the process of extracting data from various sources and
loading it into a data repository. Data ingestion is a key component of a Big Data
architecture because it determines how data will be ingested, transformed, and
stored.
2. Data Processing
Data processing is the second layer, responsible for collecting, cleaning, and
preparing the data for analysis. This layer is critical for ensuring that the data is
high quality and ready to be used in the future.
3. Data Storage
Data storage is the third layer, responsible for storing the data in a format that can be easily
accessed and analysed. This layer is essential for ensuring that the data is accessible and available
to the other layers.
4. Data Visualization
Data visualization is the fourth layer and is responsible for creating visualizations of the data that
humans can easily understand. This layer is important for making the data accessible.
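The four layers above can be sketched as one tiny end-to-end pipeline. All names and records below are illustrative assumptions, not part of any real platform; each function stands in for a whole layer:

```python
# A toy end-to-end sketch of the four architecture layers.

def ingest():
    """Layer 1 (ingestion): collect raw records from simulated sources."""
    return [
        {"user": "a", "amount": "120.5"},
        {"user": "b", "amount": "bad-value"},   # dirty record
        {"user": "a", "amount": "79.5"},
    ]

def process(raw):
    """Layer 2 (processing): clean and transform, dropping invalid records."""
    clean = []
    for rec in raw:
        try:
            clean.append({"user": rec["user"], "amount": float(rec["amount"])})
        except ValueError:
            pass                                # discard unparseable amounts
    return clean

def store(clean, repository):
    """Layer 3 (storage): persist to a repository (here, an in-memory dict)."""
    for rec in clean:
        repository.setdefault(rec["user"], []).append(rec["amount"])
    return repository

def visualize(repository):
    """Layer 4 (visualization/reporting): a human-readable per-user summary."""
    return {user: sum(amounts) for user, amounts in repository.items()}

repo = store(process(ingest()), {})
print(visualize(repo))   # {'a': 200.0}
```

The dirty record for user "b" is dropped in layer 2, so only user "a" survives to the report, which is exactly the quality-assurance role the processing layer plays.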
BIG DATA PLATFORMS
The constant stream of information from various sources is becoming more intense, especially
with the advance in technology. And this is where big data platforms come in to store and
analyse the ever-increasing mass of information.
A big data platform is an integrated computing solution that combines numerous software
systems, tools, and hardware for big data management. It is a one-stop architecture that solves
all the data needs of a business regardless of the volume and size of the data at hand. Due to
their efficiency in data management, enterprises are increasingly adopting big data platforms to
gather tons of data and convert them into structured, actionable business insights.
Currently, the marketplace is flooded with numerous Open source and commercially available
big data platforms. They boast different features and capabilities for use in a big data
environment.
Characteristics of a big data platform
Ability to accommodate new applications and tools depending on the evolving business
needs
Support several data formats
Ability to accommodate large volumes of streaming or at-rest data
Have a wide variety of conversion tools to transform data to different preferred formats
Capacity to accommodate data at any speed
Provide the tools for scouring the data through massive data sets
Support linear scaling
The ability for quick deployment
Have the tools for data analysis and reporting requirements
How Big Data Platform works
Big Data platform workflow can be divided into the following stages:
Data Collection
Big Data platforms collect data from various sources, such as sensors, weblogs, social media, and
other databases.
Data Storage
Once the data is collected, it is stored in a repository, such as Hadoop Distributed File System (HDFS),
Amazon S3, or Google Cloud Storage.
Data Processing
Data Processing involves tasks such as filtering, transforming, and aggregating the data. This can be
done using distributed processing frameworks such as Apache Spark, Apache Flink, or Apache
Storm.
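The filter/transform/aggregate pattern mentioned above is the same in plain Python as it is in a distributed framework; frameworks like Spark simply run it in parallel across a cluster. A small single-machine sketch (the log entries are made up for illustration):

```python
from functools import reduce

# Simulated web-server log entries (status code, response bytes)
logs = [
    {"status": 200, "bytes": 512},
    {"status": 404, "bytes": 0},
    {"status": 200, "bytes": 2048},
    {"status": 500, "bytes": 128},
]

# Filter: keep successful requests only
ok = filter(lambda e: e["status"] == 200, logs)
# Transform: project out just the payload size
sizes = map(lambda e: e["bytes"], ok)
# Aggregate: total bytes served successfully
total = reduce(lambda acc, n: acc + n, sizes, 0)

print(total)   # 2560
```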
Data Analytics
After data is processed, it is then analysed with analytics tools and techniques, such as machine learning
algorithms, predictive analytics, and data visualization.
Data Governance
Data Governance (data cataloguing, data quality management, and data lineage tracking) ensures the
accuracy, completeness, and security of the data.
Data Management
Big data platforms provide management capabilities that enable organizations to back up,
recover, and archive their data.
Data analytics is taking the analysed data and working on it in a meaningful and useful way to make
well-versed business decisions.
Data analytics is a traditional or generic type of analytics used in enterprises to make data-driven
decisions.
What is big data analytics?
a. Big data analytics is the often complex process of examining big data to uncover
information -- such as hidden patterns, correlations, market trends and customer
preferences -- that can help organizations make informed business decisions.
b. On a broad scale, data analytics technologies and techniques give organizations a way
to analyse data sets and gather new information.
c. Business intelligence (BI) queries answer basic questions about business operations and
performance.
d. Big data analytics is a form of advanced analytics, which involves complex applications
with elements such as predictive models, statistical algorithms and what-if analysis
powered by analytics systems.
e. An example of big data analytics can be found in the healthcare industry, where millions
of patient records, medical claims, clinical results, care management records and other
data must be collected, aggregated, processed and analysed. Big data analytics is used
for accounting, decision-making, predictive analytics and many other purposes. This
data varies greatly in type, quality and accessibility, presenting significant challenges
but also offering tremendous benefits.
Why is big data analytics important?
Organizations can use big data analytics systems and software to make data-driven
decisions that can improve their business-related outcomes.
The benefits can include more effective marketing, new revenue opportunities,
customer personalization and improved operational efficiency.
With an effective strategy, these benefits can provide competitive advantages over
competitors.
How does big data analytics work?
Data analysts, data scientists, predictive modellers, statisticians and other analytics
professionals collect, process, clean and analyse growing volumes of structured transaction
data, as well as other forms of data not used by conventional BI and analytics programs.
The following is an overview of the four steps of the big data analytics process:
1.Data professionals collect data from a variety of different sources. Often, it's a mix of semi-
structured and unstructured data. While each organization uses different data streams, some
common sources include the following:
Internet clickstream data.
Web server logs.
Cloud applications.
Mobile applications.
Social media content.
Text from customer emails and survey responses.
Mobile phone records.
Machine data captured by sensors connected to the internet of things.
2. Data is processed. After collection, data professionals organize, configure and partition
the data for analytical workloads.
3. Data is cleansed to improve its quality. Professionals scrub it for errors, duplication
and inconsistencies so that analysis results are reliable.
4. The collected, processed and cleaned data is analysed using analytics software.
This includes tools for the following:
Data mining, which sifts through data sets in search of patterns and
relationships.
Predictive analytics, which builds models to forecast customer
behaviour and other future actions, scenarios and trends.
Machine learning, which taps various algorithms to analyse large
data sets.
Deep learning, which is a more advanced offshoot of machine
learning.
Text mining and statistical analysis software.
Artificial intelligence.
Mainstream BI software.
Data visualization tools.
Types of big data analytics
There are several different types of big data analytics, each with its own application within the
enterprise.
Descriptive analytics. This is the simplest form of analytics, where data is analysed for general
assessment and summarization. For example, in sales reporting, an organization can analyse the
efficiency of marketing from such data.
Diagnostic analytics. This refers to analytics that determine why a problem occurred. For
example, this could include gathering and studying competitor pricing data to determine when a
product's sales fell off because the competitor undercut it with a price drop.
Predictive analytics. This refers to analysis that predicts what comes next. For example, this could
include monitoring the performance of machines in a factory and comparing that data to historical
data to determine when a machine is likely to break down or require maintenance or replacement.
Prescriptive analytics. This form of analysis follows diagnostics and predictions. After an issue has
been identified, it provides a recommendation of what can be done about it.
For example, this could include addressing inconsistencies in supply chain that are causing pricing
problems by identifying suppliers whose performance is unreliable, suggesting their replacement.
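The difference between descriptive and predictive analytics can be made concrete with a toy example. The sales figures below are invented for illustration, and the one-step "forecast" is a crude stand-in for a real predictive model:

```python
# Toy monthly sales figures (illustrative numbers only)
sales = [100, 110, 120, 130, 140, 150]

# Descriptive analytics: summarize what happened
average = sum(sales) / len(sales)
growth = sales[-1] - sales[0]

# Predictive analytics (very crude): extend the average month-on-month
# change one step into the future, a stand-in for a real forecast model
step = growth / (len(sales) - 1)
forecast = sales[-1] + step

print(average, forecast)   # 125.0 160.0
```

Descriptive analytics only reports the past (average of 125); the predictive step uses that history to say something about next month (160), which is the "what comes next" question described above.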
The benefits of using big data analytics include the following:
Real-time intelligence. Organizations can quickly analyse large amounts of real-time data from
different sources, in many different formats and types.
Better-informed decisions. Effective strategizing can benefit and improve the supply chain,
operations and other areas of strategic decision-making.
Cost savings. This can result from new business process efficiencies and optimizations.
Better customer engagement. A better understanding of customer needs, behaviour and sentiment
can lead to better marketing insights and provide information for product development.
Optimize risk management strategies. Big data analytics improves risk management strategies by
enabling organizations to address threats in real time.
#Challenges that come with using big data analytics:-
1. Data accessibility.
2. Data quality maintenance.
3. Data security.
4. Choosing the right tools.
5. Talent shortages.
Big data privacy and ethics
Privacy
Big data privacy is protecting individuals' personal and sensitive data when it comes to collecting,
storing, processing, and analysing large amounts of data. Following are some important aspects of big
data privacy:
1. Informed consent
When it comes to big data privacy, informed consent is the foundation.
Organizations need to ask individuals' permission before they collect their data.
With informed consent, people know exactly what their data is being used for, how it's being used,
and what the consequences could be.
By giving clear explanations and letting people choose how they want to use their data, organizations
can create trust and respect for people's privacy.
2. Protecting individual identity
Protecting individual identity is of paramount importance.
There are two techniques used to protect individual identity: anonymisation and de-identification.
Anonymisation means removing or encrypting personally identifiable information (PII) so that
individuals cannot be identified in the dataset.
De-identification goes beyond anonymisation by transforming data in ways that prevent
re-identification.
These techniques enable organisations to gain insights while protecting privacy.
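A minimal sketch of the anonymisation idea, replacing PII fields with salted one-way hashes. The field names, salt, and record are illustrative assumptions, and real de-identification must also guard against re-identification through quasi-identifiers (age, postcode, etc.):

```python
import hashlib

# Fields treated as PII in this illustration
PII_FIELDS = {"name", "email"}

def anonymise(record, salt="demo-salt"):
    """Replace PII values with salted one-way hashes so individuals
    cannot be read directly from the dataset. A sketch only: real
    de-identification must also handle quasi-identifiers that could
    allow re-identification."""
    out = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode()).hexdigest()
            out[key] = digest[:12]          # truncated token, not the value
        else:
            out[key] = value                # non-PII fields pass through
    return out

patient = {"name": "Asha Rao", "email": "asha@example.com", "age": 41}
print(anonymise(patient))
```

Because the hash is deterministic, the same person still maps to the same token, so analysts can count and join records without ever seeing the underlying identity.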
3. Data integrity and confidentiality
Data integrity and confidentiality are two of the most important aspects of data security.
Without them, unauthorised access to data, data breaches, and cyber threats are at an all-time high.
That’s why it’s essential for organisations to implement strong security measures, such as encryption,
security access controls, and periodic security audits. Data integrity and confidentiality help
organisations build trust with their users and promote responsible data management.
4. Purpose limitation and data minimization
Big data privacy and ethics call for the principle of purpose limitation.
Data should only be used for specified, authorized purposes and should not be reused
without permission from the user.
Additionally, data minimization involves collecting and retaining only the minimum
amount of data necessary for the intended purpose, reducing privacy risks and
potential harm.
5. Transparency and accountability
One of the most important ways to build trust with users is through transparency in
data practices.
Individuals' data collection, data usage, and data sharing should all be clearly defined by
organizations.
Accountability for data management and privacy compliance reinforces ethical data
management.
6. Control and autonomy
Privacy and ethics require organizations to respect individual rights.
Individuals are entitled to access, update, and erase their data.
Organizations should provide easy mechanisms for users to exercise these rights and maintain control
and autonomy over their data.
Ethics
Big data ethics refers to the ethical and responsible decisions that are made when collecting,
processing, analysing, and deploying large and complex data sets, covering principles such as
the consent, transparency, and accountability practices described above.
XPLENTY
Xplenty is a data analytics tool for building data pipelines using minimal code.
It offers a wide range of solutions for sales, marketing, and support.
With the help of its interactive graphical interface, it provides solutions for ETL, ELT, etc.
The best part of using Xplenty is its low investment in hardware and software, and it offers support via email,
chat, telephone, and virtual meetings.
Xplenty is a cloud platform for processing data for analytics, bringing all the data together.
#Features of Xplenty:
REST API: A user can do almost anything by using the REST API.
Flexibility: Data can be sent to, and pulled from, databases, warehouses, and Salesforce.
Data Security: It offers SSL/TLS encryption, and the platform verifies algorithms and certificates
regularly.
Deployment: It offers integration apps for both cloud and in-house use and supports deployment of
integration apps over the cloud.
Map Reduce Layer
The Map Reduce layer comes into play when a client application submits a Map Reduce
job to the Job Tracker.
In response, the Job Tracker sends the request to the appropriate Task Trackers.
Sometimes a Task Tracker fails or times out. In such a case, that part of the job is
rescheduled.
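The programming model behind that job flow is map, shuffle, reduce. A classic word-count sketch in plain Python (the input "splits" stand in for the HDFS blocks a real job would read):

```python
from collections import defaultdict

# Input "splits", standing in for the HDFS blocks a real job reads
splits = ["big data is big", "data needs big tools"]

# Map phase: each mapper emits (word, 1) pairs for its split
mapped = []
for split in splits:
    for word in split.split():
        mapped.append((word, 1))

# Shuffle phase: group intermediate pairs by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: each reducer sums the counts for one key
word_counts = {word: sum(counts) for word, counts in groups.items()}

print(word_counts["big"])   # 3
```

In a real cluster, the map and reduce steps run as the tasks the Job Tracker hands to Task Trackers, which is why a failed Task Tracker only forces that slice of the work to be rescheduled, not the whole job.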
Advantages of Hadoop:-
Fast: In HDFS, data is distributed over the cluster and mapped, which helps in faster
retrieval. Even the tools to process the data are often on the same servers, thus reducing
processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is
really cost-effective compared to a traditional relational database management system.
Resilient to failure: HDFS can replicate data over the network, so if one node goes down or
some other network failure happens, Hadoop uses another copy of the data. Normally, data is
replicated three times, but the replication factor is configurable.
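One practical consequence of replication is worth working through: raw cluster capacity must exceed the logical data size by the replication factor. A small arithmetic sketch (the helper name is ours, and it ignores real-world overheads like block padding and temporary space):

```python
def raw_storage_needed(logical_bytes, replication_factor=3):
    """Bytes of raw cluster capacity needed to hold `logical_bytes`
    of data when every block is stored `replication_factor` times.
    HDFS defaults to a factor of 3, but it is configurable."""
    return logical_bytes * replication_factor

one_tb = 10**12
print(raw_storage_needed(one_tb) / 10**12)   # 3.0 -> 1 TB of data occupies 3 TB raw
```

So a cluster meant to hold 1 TB of user data needs at least 3 TB of raw disk under the default factor; lowering the factor trades that overhead against resilience to node failures.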