
INF4101-Big Data et NoSQL

1. BIG DATA INTRODUCTION

Dr Mouhim Sanaa
WHAT IS IN IT?

• What is Big Data
• Big Data challenges
• Big Data: the 5 Vs
• Big Data use cases
• Limitations of the traditional approach
WHAT IS BIG DATA

• Big data is a term used to describe large collections of data (also known as data sets).
Big data may be unstructured and may grow so large, so quickly, that it is difficult to manage with
regular database or statistics tools.
WHAT IS BIG DATA

Big Data challenges

• Transmit and broadcast data between distributed applications
   • Message brokers: Kafka, RabbitMQ, ActiveMQ

• Store and secure data in a distributed way
   • Hadoop HDFS; NoSQL: MongoDB, HBase, Elasticsearch…

• Process and analyze data in a distributed way to extract knowledge and support decision making
   • Big data processing: batch processing (MapReduce, Spark) and stream processing (Spark, Kafka, Flink, Storm…)
   • Knowledge extraction: data mining, machine learning (TensorFlow, deep learning…)

• Analyze and visualize decision-making indicators


5V’S OF BIG DATA
5V’S OF BIG DATA Volume

Unit names and sizes, in bytes: International System of Units (SI), binary usage (deprecated), and International Electrotechnical Commission (IEC, 1999)

SI unit    Symbol  SI value  Binary usage (deprecated)  IEC unit  Symbol  IEC value
kilobyte   KB      10^3      2^10                       kibibyte  KiB     2^10
megabyte   MB      10^6      2^20                       mebibyte  MiB     2^20
gigabyte   GB      10^9      2^30                       gibibyte  GiB     2^30
terabyte   TB      10^12     2^40                       tebibyte  TiB     2^40
petabyte   PB      10^15     2^50                       pebibyte  PiB     2^50
exabyte    EB      10^18     2^60                       exbibyte  EiB     2^60
zettabyte  ZB      10^21     2^70                       zebibyte  ZiB     2^70
yottabyte  YB      10^24     2^80                       yobibyte  YiB     2^80
Source: Wikipedia, http://en.wikipedia.org/wiki/Kibibytes
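
The gap between the decimal (SI) and binary (IEC) definitions widens as the prefix grows. A minimal Python sketch of the arithmetic behind the table; the helper names (`si_bytes`, `iec_bytes`, `gap_percent`) are illustrative and not part of the course material:

```python
# Decimal (SI) vs. binary (IEC) unit sizes, in bytes.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]

def si_bytes(n: int) -> int:
    """Size of the n-th SI unit (n=1 is the kilobyte): 10^(3n) bytes."""
    return 10 ** (3 * n)

def iec_bytes(n: int) -> int:
    """Size of the n-th IEC unit (n=1 is the kibibyte): 2^(10n) bytes."""
    return 2 ** (10 * n)

def gap_percent(n: int) -> float:
    """How much larger the IEC unit is than the matching SI unit, in percent."""
    return (iec_bytes(n) / si_bytes(n) - 1) * 100

for n, name in enumerate(PREFIXES, start=1):
    print(f"{name}byte: 10^{3 * n} vs 2^{10 * n} bytes, gap {gap_percent(n):.1f}%")
```

The gap is about 2.4% at the kilobyte level and about 20.9% at the yottabyte level, which is why the two conventions should not be mixed when sizing large storage systems.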
5V’S OF BIG DATA Volume

The growth was higher than previously expected, driven by increased demand during the
COVID-19 pandemic, as more people worked and learned from home and used home
entertainment options more often.
5V’S OF BIG DATA Volume

The three primary sources of Big Data

Social data comes from the Likes, Tweets & Retweets, Comments, Video Uploads, and general media that
are uploaded and shared via the world’s favorite social media platforms.

Machine data is defined as information which is generated by industrial equipment, sensors that are
installed in machinery, and even web logs which track user behavior.

Transactional data is generated from all the daily transactions that take place both online and offline.
Invoices, payment orders, storage records and delivery receipts are all characterized as transactional data.
Yet data alone is almost meaningless, and most organizations struggle to make sense of the data
they are generating and to work out how it can be put to good use.
5V’S OF BIG DATA Volume

The number of tweets sent per second, minute, day and year

• Tweets sent per second: about 6,000
• Tweets sent per minute: about 350,000
• Tweets sent per day: about 500 million
• Tweets sent per year: about 200 billion
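
The four figures are rounded independently, so they agree only approximately. A quick Python check that starts from the per-second figure and assumes a 365-day year:

```python
TWEETS_PER_SECOND = 6_000  # approximate figure quoted on the slide

per_minute = TWEETS_PER_SECOND * 60
per_day = per_minute * 60 * 24
per_year = per_day * 365   # assumption: 365-day year

print(f"per minute: {per_minute:,}")  # 360,000
print(f"per day:    {per_day:,}")     # 518,400,000
print(f"per year:   {per_year:,}")    # 189,216,000,000
```

Each derived value lands within roughly 6% of the rounded figure on the slide (350,000; 500 million; 200 billion).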
5V’S OF BIG DATA Variety

Different kinds of data are being generated from various sources.


5V’S OF BIG DATA Velocity

Data is being generated at an alarming rate

Every minute, more than 204 million emails are sent worldwide. Every minute, Google
records 2 million different queries on its search engine, and Facebook records as many
"likes". Every minute, 85,000 dollars' worth of orders are placed on Amazon.
5V’S OF BIG DATA Veracity

When scoping out your big data strategy, have your team and partners
work to keep your data clean, and put processes in place to keep ‘dirty data’ from
accumulating in your systems.
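
As one illustration of what keeping data clean can mean in practice, the Python sketch below normalizes records, rejects those with missing or malformed fields, and drops duplicates. The record layout (`name`, `email`) and the validation rule are assumptions made for the example, not part of the course material.

```python
import re

# Deliberately simple email check, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def clean(records):
    """Return (kept, rejected): normalized unique records and invalid ones."""
    seen, kept, rejected = set(), [], []
    for rec in records:
        name = (rec.get("name") or "").strip()
        email = (rec.get("email") or "").strip().lower()
        if not name or not EMAIL_RE.match(email):
            rejected.append(rec)      # missing or malformed field
            continue
        if email in seen:
            continue                  # duplicate of an earlier record
        seen.add(email)
        kept.append({"name": name, "email": email})
    return kept, rejected

raw = [
    {"name": " Alice ", "email": "Alice@Example.com"},
    {"name": "Alice", "email": "alice@example.com "},  # duplicate once normalized
    {"name": "", "email": "bob@example.com"},          # missing name
    {"name": "Carol", "email": "not-an-email"},        # malformed email
]
kept, rejected = clean(raw)
print(kept)           # one clean record for Alice
print(len(rejected))  # 2
```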
5V’S OF BIG DATA Value
USE CASES OF BIG DATA

• 360° View of the Customer


• Fraud Prevention
• Data Warehouse Offload
• Price Optimization
• Recommendation Engines
• Social Media Analysis and Response (e.g., rumors)
• Preventive Maintenance and Support
• Internet of Things
USE CASES OF BIG DATA
• 360° View of the Customer
Imagine a dashboard with a 360° view of the customer. It can include data from various sources:
• Every single interaction a customer has ever had with the company.
• Demographic data, such as customers’ names, addresses, household income and family members.
• Sales information about which types of policies the customers hold.
• Information from the company’s customer relationship management (CRM) solution:
   • past interactions with the firm,
   • links to transcripts of recent calls, email messages or chat sessions,
   • support history.
• Pages of the company website the customer has recently visited, which provide valuable clues about the reason
the customer might be calling.
• The customer’s recent social media posts.
USE CASES OF BIG DATA
• 360° View of the Customer

• It enables organizations to anticipate customers’ wants and needs, and thereby improve all marketing
efforts, all sales communications, and all customer service engagements.
• It helps with predicting potential demand, upselling and cross-selling, increasing loyalty,
retention and satisfaction, and with designing personalized customer journeys that enhance the
whole customer experience.
• It helps reduce costs: the more intelligence an organization has about its customers, the more
targeted its campaigns can be, and the less money is wasted developing and executing campaigns
that never perform.
• Some tools can even analyze customers’ language to detect their current emotions and suggest
appropriate responses to sales or customer service agents.
USE CASES OF BIG DATA
Security intelligence

Organizations are also using big data analytics to help them stop hackers and cyberattackers.

• A company’s servers generate log files; other log files are available for sale.
• Many organizations analyze all of this internal and external log information to prevent and detect
attacks in real time.
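
A minimal Python sketch of this kind of log analysis: count failed logins per source address and flag the addresses that reach a threshold. The log format (`<timestamp> <ip> <event>`), the `LOGIN_FAILED` event name and the threshold are all assumptions made for the illustration; real server logs differ.

```python
from collections import Counter

def suspicious_ips(log_lines, threshold=3):
    """Return the sorted IPs with at least `threshold` failed logins."""
    failures = Counter()
    for line in log_lines:
        parts = line.split()
        # Assumed line format: "<timestamp> <ip> <event>"
        if len(parts) == 3 and parts[2] == "LOGIN_FAILED":
            failures[parts[1]] += 1
    return sorted(ip for ip, count in failures.items() if count >= threshold)

logs = [
    "2024-01-01T10:00:00 10.0.0.5 LOGIN_FAILED",
    "2024-01-01T10:00:01 10.0.0.5 LOGIN_FAILED",
    "2024-01-01T10:00:02 10.0.0.7 LOGIN_OK",
    "2024-01-01T10:00:03 10.0.0.5 LOGIN_FAILED",
    "2024-01-01T10:00:04 10.0.0.9 LOGIN_FAILED",
]
print(suspicious_ips(logs))  # ['10.0.0.5']
```

At big data scale the same counting would run continuously over a stream of log events rather than over a list held in memory.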
USE CASES OF BIG DATA

Data Warehouse Offload


• Problem
Many companies have a data warehouse that supports their business intelligence (BI) efforts.
Data warehouse technology tends to be very costly to purchase and run.
Data warehouse solutions have not always been able to provide the desired performance.
• Solution
Many enterprises use an open-source big data solution like Hadoop to replace or complement their
data warehouses.
Hadoop-based solutions often provide much faster performance while reducing licensing fees and
other costs.
USE CASES OF BIG DATA

Recommendation Engines

• When you are watching a movie on Netflix or shopping for products on Amazon, you probably
now take it for granted that the website will suggest similar items that you might enjoy.
• Of course, the ability to offer those recommendations arises from the use of big data analytics to
analyze historical data.
• These recommendation engines have become so commonplace on the Web that many customers
now expect them when they are shopping online.
• And organizations that haven't taken advantage of their big data in this way may lose customers to
competitors or may lose out on upsell or cross-sell opportunities (association rules).
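
Association rules, mentioned above, are usually scored with two measures: support (how often a set of items appears together) and confidence (how often the consequent appears when the antecedent does). A minimal Python sketch on a made-up set of shopping baskets; the basket contents are hypothetical:

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the baskets."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

baskets = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "bag"},
    {"laptop", "bag"},
    {"mouse"},
]
print(support(baskets, {"laptop", "mouse"}))       # 0.5
print(confidence(baskets, {"laptop"}, {"mouse"}))  # about 0.667
```

Here the rule "laptop → mouse" holds in two of the three baskets that contain a laptop, so a site could suggest a mouse to customers who add a laptop to their cart.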
USE CASES OF BIG DATA

Internet of Things
The Internet of Things refers to the rapidly growing network of connected objects that are able to
collect and exchange data using embedded sensors.
• Smart Home: The smart home is likely the most popular IoT application at the moment. For example,
doors and windows can be controlled over the Internet (home automation).
• Wearables: The Apple Watch and other smartwatches on the market have turned our wrists into
smartphone holsters by enabling text messaging, phone calls, and more.
• Smart Cities: The Internet of Things can solve traffic congestion issues and reduce noise, crime, and
pollution.
• Connected Car: These vehicles are equipped with Internet access and can share that access with
others, just like connecting to a wireless network in a home or office.
TRADITIONAL APPROACH

An enterprise has a single computer to store and process big data.

• Data is stored in an RDBMS such as Oracle Database or MS SQL Server.
• Sophisticated software can be written to
interact with the database,
process the required data,
and present it to the users for analysis.
LIMITATIONS OF TRADITIONAL APPROACH

This approach works well when:

• the volume of data is small enough to be accommodated by standard database servers,
• or the workload stays within the limit of the processor that is processing the data.

But when it comes to huge amounts of data, processing them through a traditional
database server becomes really difficult.

The first solution is to get a bigger-capacity node (scale up).


LIMITATIONS OF TRADITIONAL APPROACH

The second solution is to add more machines (scale out).
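
When data is spread over several machines, something must decide which machine holds which record. One common scheme is hash partitioning, sketched below in Python. The function names and the `customer-<n>` keys are made up for the illustration, and the slides do not prescribe this particular scheme.

```python
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    """Map a record key to a node index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

def partition(keys, num_nodes):
    """Group keys by the node that would store them."""
    nodes = {i: [] for i in range(num_nodes)}
    for key in keys:
        nodes[node_for(key, num_nodes)].append(key)
    return nodes

keys = [f"customer-{i}" for i in range(1000)]  # hypothetical record keys
for node, assigned in partition(keys, 4).items():
    print(node, len(assigned))
```

Each key always maps to the same node, and the keys spread roughly evenly. With simple modulo partitioning, changing the number of nodes moves most keys, which is one reason real systems often use more elaborate schemes such as consistent hashing.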
GOOGLE SOLUTION

• The stored data is duplicated or, more technically, replicated, so that the failure of one
machine does not cause data loss.

• To distribute space usage evenly across machines, all files are split into
fixed-size blocks of about 100 megabytes.
• To serve file metadata (which blocks make up each file and where their replicas are stored)
with minimal latency, a master node keeps all metadata in memory.

• MapReduce divides a task into small parts, assigns those parts to
many computers connected over the network, and collects the results to
form the final result dataset.
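
The ideas on this slide can be sketched in a few lines of Python, run here on a single machine. This is a toy illustration under stated assumptions, not Google's implementation: the block size is shrunk to 16 bytes for readability, the replication factor of 3 and the round-robin placement are assumptions, and word counting is simply a conventional example task for MapReduce.

```python
from collections import defaultdict

BLOCK_SIZE = 16   # bytes, for readability; the slide quotes about 100 MB
REPLICATION = 3   # assumed replica count, not stated on the slide

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut a file into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, num_nodes: int, replication: int = REPLICATION):
    """Master-side metadata: block index -> node ids holding a replica.

    Round-robin placement; replicas are distinct when replication <= num_nodes.
    """
    return {b: [(b + r) % num_nodes for r in range(replication)]
            for b in range(num_blocks)}

def map_phase(chunk: str):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

def map_reduce(chunks):
    """Map each chunk independently, then reduce the combined output."""
    intermediate = []
    for chunk in chunks:  # on a cluster, each chunk would go to a different machine
        intermediate.extend(map_phase(chunk))
    return reduce_phase(intermediate)

blocks = split_into_blocks(b"x" * 40)
print([len(b) for b in blocks])                    # [16, 16, 8]
print(place_replicas(len(blocks), num_nodes=5))
print(map_reduce(["big data big", "data is big"]))  # {'big': 3, 'data': 2, 'is': 1}
```

The dictionary returned by `place_replicas` plays the role of the metadata that the master node keeps in memory: given a block, it answers which machines to read it from.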
WHICH SOLUTION FOR BIGDATA

• The solution is a framework that allows distributed storage and parallel processing of
big data: Hadoop.

• Hadoop is a technology used to store data in a cluster of nodes.
