Chap03 - Big Data and Data Retrieval
Big Data
Big data is a term that describes the large volume of data, both
structured and unstructured, that inundates a business on a day-to-day
basis. It can be analyzed for insights that lead to better decisions and
strategic business moves.
The THREE V’s of Big Data
Big data refers to datasets that cannot reasonably be handled by
traditional computers or tools because of their volume, velocity, and variety.
Volume: Organizations collect data from a variety of sources, including
business transactions, social media, and sensor or machine-to-machine
data.
Velocity: Data streams in at an unprecedented speed and must be
dealt with in a timely manner. For example, RFID tags, sensors, and smart
metering are driving the need to deal with torrents of data in near-real
time.
Variety: Data comes in all types of formats, from structured, numeric data
in traditional databases to unstructured text documents, email, video, audio,
stock ticker data, and financial transactions.
What can we do with Big Data?
Take the data from any source and analyze it to find answers that
enable:
Cost reductions
Time reductions
New product development and optimized offerings
Smart decision making
Many more…
When you combine big data with high-powered analytics,
you can accomplish business-related tasks, such as:
Big data analytics
Big data analytics is the process of collecting, organizing, and analyzing
big data to discover patterns and other useful information.
Advantages to an organization:
To better understand the information contained within the data
To identify the data that is most important to the business and
future business decisions
Big Data & Key Technologies
Key Technologies that enable Big Data Analytics for businesses
NoSQL databases: a mechanism for storage and retrieval of
data that is modeled in ways other than the tabular relations of
relational databases (e.g. key-value, document, and graph
databases).
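The three data models named above can be sketched with plain Python structures. This is only an illustration of how the data is shaped, not the API of any particular NoSQL product; the sample records and the helper names `find_documents` and `reachable` are assumptions made up for this example.

```python
# Key-value model: an opaque value looked up by a unique key.
kv_store = {"session:42": "alice"}

# Document model: self-describing records whose fields may differ per document.
documents = [
    {"_id": 1, "name": "alice", "tags": ["admin", "ops"]},
    {"_id": 2, "name": "bob", "email": "bob@example.com"},
]

# Graph model: nodes plus edges, stored here as an adjacency list.
follows = {"alice": ["bob"], "bob": ["carol"], "carol": []}


def find_documents(docs, field, value):
    """Return documents whose `field` equals `value` (hypothetical helper).

    Documents that lack the field are simply skipped, which mirrors the
    schema-free nature of a document store.
    """
    return [d for d in docs if d.get(field) == value]


def reachable(graph, start):
    """Return every node reachable from `start` by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen
```

The point of the contrast: a key-value lookup needs only the key, a document query filters on fields that may or may not exist, and a graph query follows relationships between records.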
Stream analytics: an event data processing service providing
real-time analytics and insights from apps, devices, sensors,
and more
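As a rough illustration of real-time analytics over event data, the sketch below computes a rolling average as each reading arrives instead of waiting for the full dataset. It is a single-process toy, not a stream analytics service; the function names, the window size, and the threshold are assumptions chosen for the example.

```python
from collections import deque


def rolling_average(readings, window=3):
    """Yield the mean of the most recent `window` readings as each one arrives."""
    recent = deque(maxlen=window)  # the oldest reading is dropped automatically
    for value in readings:
        recent.append(value)
        yield sum(recent) / len(recent)


def alerts(readings, threshold, window=3):
    """Yield (index, average) whenever the rolling average exceeds `threshold`."""
    for i, avg in enumerate(rolling_average(readings, window)):
        if avg > threshold:
            yield i, avg
```

Because both functions are generators, `readings` can be any iterable, including one that never ends, such as a feed of sensor values.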
Distributed file stores: a computer network where data is stored on
more than one node (often in a replicated fashion) for redundancy and
performance.
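Replication for redundancy can be sketched in a few lines. The class below is a minimal in-memory model, assuming a fixed set of nodes and a simple hash-based placement rule; the class name `ReplicatedStore`, its methods, and the `down` set used to simulate failures are all hypothetical and do not correspond to any real file system's API.

```python
import hashlib


class ReplicatedStore:
    """Toy store that writes each key to `replicas` distinct nodes."""

    def __init__(self, node_names, replicas=2):
        if replicas > len(node_names):
            raise ValueError("cannot place more replicas than there are nodes")
        self.nodes = {name: {} for name in node_names}  # node name -> its local data
        self.order = sorted(node_names)
        self.replicas = replicas
        self.down = set()  # names of nodes we pretend have failed

    def _placement(self, key):
        """Pick `replicas` consecutive nodes, starting at a hash-derived position."""
        digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        start = digest % len(self.order)
        return [self.order[(start + i) % len(self.order)]
                for i in range(self.replicas)]

    def put(self, key, data):
        """Write the data to every replica (node failures are ignored on write)."""
        for name in self._placement(key):
            self.nodes[name][key] = data

    def get(self, key):
        """Read from the first replica that is still up."""
        for name in self._placement(key):
            if name not in self.down and key in self.nodes[name]:
                return self.nodes[name][key]
        raise KeyError(key)
```

With two replicas, a read still succeeds after one of the holding nodes is marked as down; it fails only when every replica is unavailable.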
Data virtualization: a technology that delivers information from
various data sources through a single access layer, without first
copying the data into one central store.
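One way to picture data virtualization is a single function that answers a question by reading two different sources in place. The sketch below assumes a relational table of customers (an in-memory SQLite database) and a CSV export of orders; the table, the CSV content, and the function name `customer_totals` are invented for illustration.

```python
import csv
import io
import sqlite3

# Source 1 (hypothetical): a relational database holding customers.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

# Source 2 (hypothetical): a CSV export of orders from another system.
ORDERS_CSV = "customer_id,amount\n1,30.0\n1,12.5\n2,8.0\n"


def customer_totals():
    """Return one combined view; the caller never sees where each field lives."""
    totals = {}
    for row in csv.DictReader(io.StringIO(ORDERS_CSV)):
        cid = int(row["customer_id"])
        totals[cid] = totals.get(cid, 0.0) + float(row["amount"])
    rows = db.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
    return [{"name": name, "total": totals.get(cid, 0.0)} for cid, name in rows]
```

Neither source is copied into the other; the join happens at query time inside the access layer.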
Data integration: tools for data orchestration across solutions such
as Amazon Elastic MapReduce (EMR), Apache Hive, Apache Pig,
Apache Spark, MapReduce, Couchbase, Hadoop, and MongoDB.
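MapReduce, one of the solutions listed above, is built around a simple programming model: map each record to key-value pairs, group the pairs by key, then reduce each group. The sketch below runs that model in a single Python process to count words. It does not use Hadoop or any of the listed tools, and the function names are assumptions for this example.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over the input records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map and reduce calls run on different nodes in parallel; only the shape of the computation is the same here.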
Data preparation: software that eases the burden of sourcing,
shaping, cleansing, and sharing diverse and messy data sets to
accelerate data’s usefulness for analytics.
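Shaping and cleansing can be illustrated with a small routine that normalizes messy records before analysis. This is a minimal sketch, assuming each record is a dictionary with `name` and `amount` fields; the field names, the cleaning rules, and the function name `clean_records` are hypothetical.

```python
def clean_records(rows):
    """Trim and normalize names, coerce amounts to float, drop bad rows and duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        name = (row.get("name") or "").strip().title()
        # Remove thousands separators so "1,200.50" parses as a number.
        raw_amount = str(row.get("amount", "")).replace(",", "").strip()
        try:
            amount = float(raw_amount)
        except ValueError:
            continue  # unusable amount such as "n/a" or an empty string
        if not name:
            continue  # a record without a name cannot be attributed
        key = (name, amount)
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        cleaned.append({"name": name, "amount": amount})
    return cleaned
```

Note that duplicates are detected only after normalization, so `"  alice "` and `"ALICE"` with the same amount count as one record.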
Information retrieval
Key issues involved in data retrieval
Discussion