
BITS Pilani

Pilani Campus

Sourcing and Collecting Your Data

Dr. Nirankush Dutta
BITS Pilani, Pilani Campus

Sources of Big Data: Considerations
• Structure of data
– Structured
– Unstructured
– Semi-structured
• Sources of data
– Internal
– External
– Private
– Public
• Value of the data
– Generic
– Unique
– Specialized
• Quality of the data
– Verified
– Static
– Streaming
• Storage of the data
– Remotely accessed
– Shared
– Dedicated platforms
– Portability
• Relationship of the data
– Superset
– Subset
– Correlated

BITS Pilani, Pilani Campus


Stages in the Analytics Process
• Locating
• Importing
– Scrubbing
– Indexing
• Designing templates and scripts
• Mining data for value
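
As a rough illustration of the importing stage, the sketch below scrubs a few raw records and then builds a simple index over them. The record fields and the helper names `scrub` and `build_index` are hypothetical, chosen only for this example; real scrubbing rules depend on the data source.

```python
from collections import defaultdict

# Hypothetical raw records, as they might arrive from an import step.
RAW = [
    {"id": "1", "city": "  Pilani ", "note": "Late delivery"},
    {"id": "2", "city": "DELHI", "note": ""},
    {"id": "2", "city": "DELHI", "note": ""},       # repeated id
    {"id": "", "city": "Mumbai", "note": "no id"},   # missing key
]


def scrub(records):
    """Trim whitespace, normalize case, drop rows with no id or a repeated id."""
    seen, clean = set(), []
    for rec in records:
        row = {k: v.strip() for k, v in rec.items()}
        row["city"] = row["city"].title()
        if not row["id"] or row["id"] in seen:
            continue
        seen.add(row["id"])
        clean.append(row)
    return clean


def build_index(records, field):
    """Map each value of `field` to the ids of the records that contain it."""
    index = defaultdict(list)
    for rec in records:
        index[rec[field]].append(rec["id"])
    return dict(index)


clean = scrub(RAW)
city_index = build_index(clean, "city")
```

Scrubbing first keeps the index free of duplicate and unusable rows, which is why the two sub-steps are usually run in this order.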



Hunting for Data
• Finding data for big data analytics
– Science, Investigation, Assumption
• Concentrated effort to find the appropriate data.
• Determine what Big Data analytics is going to be used for



Setting the Goal
• Which data sources can you think of for your organization?
• Define the goals and objectives before hunting for data sources
• Start with the internal, structured data first
• Next comes the unstructured data
• Finally, take external data into account



Types of Data
• Structured data (e.g., financial data, customer data)
• Unstructured and semi-structured data (e.g., photos, videos)
• Internal data (e.g., sales data, CCTV video data)
• External data (e.g., weather data, social media profile data)
– Private
– Public



Datafication: The new forms of Data
• The world is being ‘datafied’ and there are now many forms of
useful data.
• Data are being mined from:
– Our activities (Activity data)
– Our conversations (Conversation data)
– Photo and video image data
– Sensor data
– The internet of things



The anatomy of Big Data
• Four V’s of Big Data
– Volume
– Velocity
– Variety
– Veracity



Retail Organization



Growing Sources of Big Data
• Data growth rates over the past few years have been nearly infinite in many cases.
• Industries falling under the umbrella of new data creation and
digitization of existing data:
– Transportation, logistics, retail, utilities, and
telecommunications
– Health care
– Government
– Entertainment media
– Life sciences
– Video surveillance



Growing Sources of Big Data (Contd.)
• The legal profession is adding to the multitude of data sources, thanks to the discovery process.
• Leading e-discovery companies handle terabytes or even petabytes of information that must be reanalyzed over the full course of a legal proceeding.
• Additional information and large data sets can be found on social media sites such as Facebook, Foursquare, and Twitter.



Some More Big Data Sources



Diving Deeper into Big Data Sources
• A change in resolution is further driving the expansion of Big
Data.
• Some examples of increased resolution can be found in the
following areas:
– Financial transactions
– Smart instrumentation
– Mobile telephony

A Wealth of Public Information
• Many tools for harvesting public data are readily available on the market
• For point-and-click simplicity, Extractiv and Mozenda offer the
ability to acquire data from multiple sources and to search the
Web for information
• For processing data on the web: Google Refine
• 80Legs specializes in gathering data from social networking
sites as well as retail and business directories.



A Wealth of Public Information (Contd.)
• Analysis tools: Grep, Turk and BigSheets
• Visualization tools: Tableau Public, OpenHeatMap and Gephi
• Big data services: Crunchbase, InfoChimps, Kaggle, Freebase, Timetric



Accessing External data
• Why do we need external data?
– National census data for demographics and trends
– Social media platforms as sources of customer insights
– Google Trends for monitoring industry trends
– Weather data for planning and stocking decisions
• Where can we get external data?
– Specialized industry-focused data providers (e.g., CoreLogic)
– Free external data sources (e.g., WHO, IMF, government initiatives)



Building a Platform
Factors behind the storage dilemma as data volumes grow:
– Capacity, Security, Latency, Access, Flexibility, Persistence & Cost

Factors to consider while building a platform:
• Support for batch and real-time analytics
• Alternative approaches
• Available Big Data mapping tools
• Big Data abstraction tools



Building a Platform (Contd.)
• Business logic
• Moving away from SQL
• In-memory processing
• Built-in support for event-driven data distribution
• Support for public, private, and hybrid clouds
• Consistent management



Bringing structure to unstructured data
• Metadata creation
• Search technologies
• Automated data categorization
• Taxonomies, semantics, and natural language recognition
• Data visualization and personalization
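
To make metadata creation and automated categorization concrete, here is a minimal sketch that derives structured fields from a free-text document using a small keyword taxonomy. The taxonomy, the field names and the function `create_metadata` are assumptions made for illustration; production systems typically rely on far richer natural language processing.

```python
import re

# Hypothetical taxonomy: category -> keywords that signal it.
TAXONOMY = {
    "billing": {"invoice", "refund", "payment"},
    "delivery": {"shipping", "courier", "late"},
}


def create_metadata(doc_id, text):
    """Derive structured fields (metadata) from a free-text document."""
    words = re.findall(r"[a-z]+", text.lower())
    categories = sorted(
        name for name, keywords in TAXONOMY.items() if keywords & set(words)
    )
    return {
        "doc_id": doc_id,
        "word_count": len(words),
        "categories": categories or ["uncategorized"],
    }


meta = create_metadata("t-1", "Refund requested: the courier was late.")
```

Once every document carries fields like these, it can be searched, filtered and joined like any other structured record.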



Architecture and Process in a DW

09 Jan 2021 BA ZC415/PDBA ZC413 21


Selection of Columns to be Loaded
• Translating coded values
• Mapping of values
• Calculating a new derived value
• Joining data from different sources
• Summing up several rows of data
• Transposing
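
Several of the transformations listed above can be sketched in a few lines. The source rows, the status code table and the function names below are all hypothetical; the point is only to show what each transformation does to a row.

```python
from collections import defaultdict

# Hypothetical source extracts.
sales = [
    {"cust": "C1", "status": "A", "qty": 2, "unit_price": 10.0},
    {"cust": "C1", "status": "A", "qty": 1, "unit_price": 5.0},
    {"cust": "C2", "status": "I", "qty": 4, "unit_price": 2.5},
]
regions = {"C1": "North", "C2": "South"}        # second source, keyed by customer
STATUS_CODES = {"A": "active", "I": "inactive"}  # assumed code table


def transform(rows):
    out = []
    for row in rows:
        out.append({
            "cust": row["cust"],
            "status": STATUS_CODES[row["status"]],      # translate coded value
            "region": regions[row["cust"]],             # join from another source
            "revenue": row["qty"] * row["unit_price"],  # new derived value
        })
    return out


def revenue_by_customer(rows):
    """Sum several rows into one total per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["cust"]] += row["revenue"]
    return dict(totals)


loaded = transform(sales)
totals = revenue_by_customer(loaded)
# Transposing: turn the list of rows into one list per column.
columns = {key: [row[key] for row in loaded] for key in loaded[0]}
```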

Staging Area and Operational Data Stores
• Data arranged as flat files
• Generally new data extracts or rows are added to tables in the
staging area
• Subsequent complex ETL processes may be performed
• Real-time data -> Operational data store

Causes and Effects of Poor Data Quality
• Poor data quality
– Substandard customer service
– Impaired decision making at management and operational levels
– Delay in budgeting process
• Data quality firewall
• Data profiling
• Data validation
– Hard
– Soft
• Data cleansing
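
The slide does not define hard and soft validation. The sketch below assumes a common reading: a hard rule blocks a record from being loaded, while a soft rule lets the record through but logs a warning. The rules, thresholds and field names are hypothetical.

```python
def validate(record):
    """Return (accepted, notes) for one incoming record."""
    warnings = []
    # Hard rules: a violation blocks the record from being loaded.
    if not record.get("customer_id"):
        return False, ["missing customer_id"]
    if record.get("amount", 0) < 0:
        return False, ["negative amount"]
    # Soft rules: the record is loaded, but the issue is logged for follow-up.
    if not record.get("email"):
        warnings.append("missing email")
    if record.get("amount", 0) > 10_000:
        warnings.append("unusually large amount")
    return True, warnings


def firewall(records):
    """Split a batch into loadable rows and rejected rows."""
    accepted, rejected = [], []
    for rec in records:
        ok, notes = validate(rec)
        (accepted if ok else rejected).append((rec, notes))
    return accepted, rejected


batch = [
    {"customer_id": "C1", "amount": 50, "email": "a@example.com"},
    {"customer_id": "C2", "amount": 20000},
    {"customer_id": "", "amount": 5},
]
accepted, rejected = firewall(batch)
```

Placing these checks at the point of entry is one way to picture a data quality firewall: bad rows are stopped before they reach the warehouse.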

Data Warehouse: Functions and Components
• Data is collected, joined and transformed in the actual DW
• Data is enriched with dimensions, such as organizational relationships and placement in the product hierarchy
• Metadata repository
• Why is metadata important?
• Data mart vs data warehouse
• Organization of data in DM
– Relational
– OLAP cubes

Alternative Ways of Storing Data
• Hadoop
– Stores large amounts of data on multiple servers
– Can replicate data
– Data can be stored quickly
– “Store once, read many times”
• Disadvantages?
– Raw data
– Complexity
– Time

Techniques in data warehousing
• Master Data Management
– MDM provides a unified view of data, when data is integrated
from different data sources

• Service-Oriented Architecture
– SOA is a way of thinking about how to use the organization’s
resources based on a service approach and with the objective
of providing a more efficient achievement of overall business
targets



Getting Started with Big Data Acquisition
• Barrier is mostly cultural, not technological

• Training to understand the paradigm shift

• Integration of development and operations teams (DevOps)



Getting started with Big Data Acquisition
• As these data sets grow in size—typically ranging from several
terabytes to multiple petabytes—businesses face the
challenge of capturing, managing, and analyzing the data in
an acceptable time frame.

How is this problem handled?

[Diagram: Train Data; Integrate the DevOps Team; Move to Business Executives & Decision Makers]



Getting started with Big Data Acquisition (Contd.)
• Identify a problem that business leaders can understand
• Do not focus exclusively on the technical data management
challenge
• Define the questions that must be answered to meet the
business objective
• Understand the tools available to merge the data
• Build a scalable infrastructure
• Identify technologies that you can trust
• Choose a technology that fits the problem.
• Be aware of changing data formats and changing data needs



Collecting Data

• Sophisticated tools for capturing data, thanks to the IoT.


• Sensors
• Apps
• CCTV video
• Beacons
• Website cookies
• Social media

Storing Data

• Company server
• Computer hard disk
• Distributed or cloud-based storage systems
• Data warehouses
• Data lakes
• Off-the-shelf hardware and open-source software
• ‘Enterprise’ versions

Cloud-based / distributed storage systems

• Distributed/cloud storage
• ‘Distributed storage’ uses cheap, off-the-shelf components to create high-capacity data storage, controlled by software that keeps track of where everything is and finds it for you when you need it.
• ‘Cloud storage’ simply means that your data is stored remotely on systems connected to the Internet, so that it is accessible from anywhere with an internet connection.
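
The "software that keeps track of where everything is" can be pictured as a catalog that maps each stored block to the machines holding a copy. The sketch below is a toy, in-memory model under stated assumptions: the node names, the replica count and the `Catalog` class are hypothetical and do not mirror any particular product.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical commodity servers
REPLICAS = 2                             # assumed number of copies per block


class Catalog:
    """Tracks which nodes hold each block, so callers never need to know."""

    def __init__(self):
        self.locations = {}                        # block name -> node names
        self.disks = {node: {} for node in NODES}  # simulated node storage

    def put(self, name, data):
        # Pick a starting node from a hash of the name, then its neighbours.
        start = int(hashlib.sha256(name.encode()).hexdigest(), 16) % len(NODES)
        chosen = [NODES[(start + i) % len(NODES)] for i in range(REPLICAS)]
        for node in chosen:
            self.disks[node][name] = data
        self.locations[name] = chosen

    def get(self, name, down=()):
        """Read from the first replica that is not in the `down` set."""
        for node in self.locations[name]:
            if node not in down:
                return self.disks[node][name]
        raise OSError(f"no live replica for {name}")


catalog = Catalog()
catalog.put("sales-2021.csv", b"id,amount\n1,50\n")
first_replica = catalog.locations["sales-2021.csv"][0]
data = catalog.get("sales-2021.csv", down={first_replica})
```

Because each block lives on more than one node, the read still succeeds when the first replica is marked as down.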

Introducing Hadoop

• Most widely used system for providing data storage and processing across ‘commodity’ hardware
• Backbone of data infrastructure
• Highly flexible
• Modules: Distributed File System and MapReduce
• Off-the-shelf components being linked together, as opposed
to expensive, bespoke systems custom made for an
organization.
• Alternative: Spark
• Data warehouse vs data lake?
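
The MapReduce module named above follows a map, shuffle and reduce pattern. The sketch below imitates that pattern in a single Python process with a word count; it is not Hadoop code, and the function names are chosen only for this illustration. In a real cluster the map and reduce calls would run in parallel on different machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values for one key into a single result."""
    return key, sum(values)


documents = ["big data big value", "data lakes store raw data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```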

Analyzing and processing data

• The process of extracting insights from data boils down to three steps:
1) Preparing the data (identifying, cleaning and formatting the data so you can analyze it more easily)
2) Building the analytic model
3) Drawing a conclusion from the insights gained
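
The three steps can be walked through on a tiny, made-up data set. The numbers below are invented for illustration, and the model is a plain least-squares line chosen as an assumption; any other analytic model could take its place in step 2.

```python
# Hypothetical observations: (advertising spend, units sold); None marks a gap.
raw = [(1.0, 12.0), (2.0, 14.0), (None, 9.0), (3.0, 16.0), (4.0, 18.0)]

# Step 1: prepare the data by dropping incomplete rows.
data = [(x, y) for x, y in raw if x is not None and y is not None]

# Step 2: build the analytic model, here an ordinary least-squares line.
n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in data) / sum(
    (x - mean_x) ** 2 for x, _ in data
)
intercept = mean_y - slope * mean_x

# Step 3: draw a conclusion, e.g. a forecast for a spend level not yet tried.
forecast = intercept + slope * 5.0
```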

• Google’s BigQuery
• Microsoft’s HDInsight
• Amazon Web Services

Analytic Services
• Amazon Web Services
• Cloudera CDH
• Hortonworks Data Platform
• Infobright
• IBM Big Data Platform
• InfoSphere BigInsights
• IBM Watson
• MapR
• Microsoft HDInsight
• Pivotal Big Data Suite
• Splunk Enterprise
Providing access to data

• The final layer of any data infrastructure


• Visualizing and communicating data
• Access to data
• Data stewardship
• External users and customers

Considering Data Stewardship

• Company-wide data strategies to engage all staff with data-driven decision making and operations
• Meaningless and valueless data
• Missing and mismatched metadata
• Data Stewardship

Communicating Data

• Visualization platforms to make data attractive and easy to understand
• Self-service BI reporting and management dashboards
• Automated machine-to-machine (M2M) communication


Case: Apixio
Apixio
• Enabling healthcare providers to learn from practice-based
evidence to individually tailor care
• Need to mine unstructured data for insights
• Extracting data from various sources
– OCR technology
– ML based algorithms
– NLP capabilities
• Product: HCC Profiler
– Customers: Insurance plans & Healthcare delivery networks
• Outcomes:
– Increased accuracy and efficiency
– Finding gaps in patient documentation



Apixio
• Data used:
– Both structured and unstructured
– Information on diseases and procedures reported to the
government
• Technical details:
– Non-relational database Cassandra
– Hadoop and Spark
– Own bespoke orchestration and management layer
– AWS
– Processed and analyzed in-house
– Own knowledge graph



Apixio
• Challenges overcome:
– Convincing healthcare providers and health insurance plans to
share data
– Data security

