
Principles and Learning Objectives

• The database approach to data management has become broadly accepted.
• Data modeling is a key aspect of organizing data and information.
• A well-designed and well-managed database is an extremely valuable tool in supporting decision making.
• We have entered an era where organizations are grappling with a tremendous growth in the amount of data available and struggling to understand how to manage and make use of it.
• A number of available tools and technologies allow organizations to take advantage of the opportunities offered by big data.
Big data is the term used to describe data collections that are so enormous (terabytes or more) and complex (from sensor data to social media data) that traditional data management software, hardware, and analysis processes are incapable of dealing with them.

Characteristics of Big Data

Computer technology analyst Doug Laney associated the three characteristics of volume, velocity, and variety with big data:

● Volume. In 2014, it was estimated that the volume of data in the digital universe was 4.4 zettabytes (one zettabyte equals one trillion gigabytes). The digital universe is expected to grow to an amazing 44 zettabytes by 2020, with perhaps one-third of that data being of value to organizations.

● Velocity. The velocity at which data is currently coming at us exceeds 5 trillion bits per second. This rate is accelerating rapidly, and the volume of digital data is expected to double every two years between now and 2020.

● Variety. Data today comes in a variety of formats. Some of the data is what computer scientists call structured data—its format is known in advance, and it fits nicely into traditional databases. Much of it, however, is unstructured or semi-structured data, such as text, audio, video, and social media posts, that does not fit neatly into relational tables.
Sources of Big Data

Organizations collect and use data from a variety of sources, including business applications, social media, sensors and controllers that are part of the manufacturing process, systems that manage the physical environment in factories and offices, media sources (including audio and video broadcasts), machine logs that record events and customer call data, public sources (such as government Web sites), and archives of historical records of transactions and communications.
Big Data Uses

Here are just a few examples of how organizations are employing big data to improve their day-to-day operations,
planning, and decision making:

● Retail organizations monitor social networks such as Facebook, Google, LinkedIn, Twitter, and Yahoo to engage brand
advocates, identify brand adversaries (and attempt to reverse their negative opinions), and even enable passionate
customers to sell their products.

● Advertising and marketing agencies track comments on social media to understand consumers’ responsiveness to ads,
campaigns, and promotions.

● Hospitals analyze medical data and patient records to try to identify patients likely to need readmission within a few
months of discharge, with the goal of engaging with those patients in the hope of preventing
another expensive hospital stay.

● Consumer product companies monitor social networks to gain insight into customer behavior, likes and dislikes, and
product perception to identify necessary changes to their products, services, and advertising.

● Financial services organizations use data from customer interactions to identify customers who are likely to be
attracted to increasingly targeted and sophisticated offers.

● Manufacturers analyze minute vibration data from their equipment, which changes slightly as it wears down, to predict the optimal time to perform maintenance or replace the equipment to avoid expensive repairs or potentially catastrophic failure (a simple sketch of this idea follows the list).
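
To make the last example concrete, here is a minimal Python sketch of the idea: watch a stream of vibration readings and flag maintenance when a rolling average drifts past a wear threshold. The readings, window size, and threshold are invented for illustration; a real system would derive them from equipment failure history.

```python
# Rolling-average wear detection over a stream of vibration amplitudes.
# WINDOW and THRESHOLD are assumed values, not taken from the text.
from collections import deque

WINDOW, THRESHOLD = 5, 0.80

def maintenance_alerts(readings):
    window = deque(maxlen=WINDOW)
    for t, amplitude in enumerate(readings):
        window.append(amplitude)
        # Once the window is full, flag any sustained drift above threshold.
        if len(window) == WINDOW and sum(window) / WINDOW > THRESHOLD:
            yield t  # time step at which to schedule maintenance

vibration = [0.60, 0.62, 0.65, 0.70, 0.78, 0.85, 0.90, 0.95]
print(list(maintenance_alerts(vibration)))  # -> [7]
```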
Challenges of Big Data

Volume, Variety, Velocity: This refers to the sheer amount of data, the diversity of its formats (structured,
unstructured, semi-structured), and the speed at which it's generated. Traditional data storage and processing
systems often struggle to keep up.

Data Quality: Big data is more prone to errors, inconsistencies, and missing information. This can lead to skewed
results and flawed decision-making.

Security: Protecting sensitive data from breaches and unauthorized access is a major concern. Big data often
contains a mix of personal and confidential information.

Skilled Professionals: There's a shortage of data scientists, data analysts, and other professionals with the expertise
to manage and analyze big data effectively.

Cost: Building and maintaining the infrastructure required for big data can be expensive. This includes storage,
processing power, and software.

Organizational Resistance: Change management is a big part of implementing big data. Some organizations may be
resistant to adopting new data-driven approaches.
Data Management

Data management is an integrated set of functions that defines the processes by which data is obtained, certified fit for use, stored, secured, and processed in such a way as to ensure that the accessibility, reliability, and timeliness of the data meet the needs of the data users within an organization.

The Data Management Association (DAMA) International is a nonprofit, vendor-independent, international association whose members promote the understanding, development, and practice of managing data as an essential enterprise asset.

Data governance is the core component of data management; it defines the roles, responsibilities, and processes for ensuring that data can be trusted and used by the entire organization, with people identified and in place who are responsible for fixing and preventing issues with data.
data steward: An individual responsible for the management of critical data elements, including identifying and acquiring new data sources; creating and maintaining consistent reference data and master data definitions; and analyzing data for quality and reconciling data issues.

Data lifecycle management (DLM) is a policy-based approach to managing the flow of an enterprise’s data, from its initial acquisition or creation and storage to the time when it becomes outdated and is deleted.

The data governance team defines the owners of the data assets in the enterprise. The team also develops a policy that specifies who is accountable for various portions or aspects of the data, including its accuracy, accessibility, consistency, completeness, updating, and archiving.
Data Warehouses
A data warehouse is a database that holds business information from many sources in the enterprise, covering all aspects of the company’s processes, products, and customers. Data warehouses allow managers to “drill down” to get greater detail or “roll up” to generate aggregate or summary reports. The primary purpose is to relate information in innovative ways and help managers and executives make better decisions. A data warehouse stores historical data that has been extracted from operational systems and external data sources.
Data warehouses are continuously refreshed with huge amounts of data from a variety of sources, so the probability that some of the sources contain “dirty data” is high. The ETL (extract, transform, load) process takes data from a variety of sources, edits and transforms it into the format used in the data warehouse, and then loads this data into the warehouse. This process, sketched in code after the list below, is essential in ensuring the quality of the data in the data warehouse.

● Extract. Source data for the data warehouse comes from many sources and may be represented in a variety of formats. The goal of this process is to extract the source data from all the various sources and convert it into a single format suitable for processing. During the extract step, data that fails to meet expected patterns or values may be rejected from further processing.
● Transform. During this stage of the ETL process, a series of rules or algorithms are applied to the
extracted data to derive the data that will be stored in the data warehouse.
● Load. During this stage of the ETL process, the extracted and transformed data is loaded into the
data warehouse. As the data is being loaded into the data warehouse, new indices are created and
the data is checked against the constraints defined in the database schema to ensure its quality. As a
result, the data load stage for a large data warehouse can take days.
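
To tie the three stages together, here is a minimal ETL sketch in Python. The source file sales_2024.csv, its column names, and the local SQLite file standing in for the warehouse are all hypothetical; real warehouses use dedicated ETL tools and far richer transformation rules.

```python
# A minimal ETL pipeline: CSV source -> cleaned records -> SQLite "warehouse".
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows, rejecting any that fail basic validation."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["amount"] and row["customer_id"]:  # reject incomplete rows
                yield row

def transform(rows):
    """Transform: apply rules to derive warehouse-ready records."""
    for row in rows:
        yield (row["customer_id"].strip().upper(),
               round(float(row["amount"]), 2))

def load(records, db_path="warehouse.db"):
    """Load: insert into the warehouse; schema constraints enforce quality."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS sales (
                     customer_id TEXT NOT NULL,
                     amount REAL NOT NULL CHECK (amount >= 0))""")
    con.executemany("INSERT INTO sales VALUES (?, ?)", records)
    con.commit()
    con.close()

# Hypothetical source file; the three stages run as a single pipeline.
load(transform(extract("sales_2024.csv")))
```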

data mart: A subset of a data warehouse that is used by small- and medium-sized businesses and departments
within large companies to support decision making.

data lake (enterprise data hub): A “store everything” approach to big data that saves all the data in its raw
and unaltered form.
NoSQL database: A way to store and retrieve data that is modeled using some means other than the simple two-dimensional tabular relations used in relational databases.

NoSQL stands for "not only SQL" or "non-relational" and refers to a type of
database that stores data differently than traditional relational databases.

Here's a quick rundown of NoSQL databases:

• Structure: Unlike relational databases that use fixed tables with rows and columns, NoSQL databases store data in flexible formats like documents, key-value pairs, or graphs. This makes them more scalable for large and unstructured datasets.

• Schema: Relational databases require a predefined schema (data structure) upfront. NoSQL databases are often schema-less or have a flexible schema, allowing you to add new data fields as needed.

• Scalability: NoSQL databases excel at horizontal scaling, meaning you can easily add more servers to handle growing data volumes.
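
A minimal sketch of the document model described above, using plain Python dicts to stand in for a schema-less document store (no particular NoSQL engine is assumed):

```python
# Documents in one "collection" need not share the same fields.
customers = []  # the collection

customers.append({"_id": 1, "name": "Ada", "email": "ada@example.com"})
customers.append({"_id": 2, "name": "Grace",
                  "phones": ["555-0100", "555-0101"],
                  "loyalty_tier": "gold"})  # new fields added as needed

# A simple query: find customers that have a gold loyalty tier.
gold = [c for c in customers if c.get("loyalty_tier") == "gold"]
print(gold)
```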
Hadoop: An open-source software framework including several software
modules that provide a means for storing and processing extremely large
data sets.

Hadoop Distributed File System (HDFS): A system used for data storage that
divides the data into subsets and distributes the subsets onto different servers
for processing.
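
A toy illustration of the HDFS idea: split a file into fixed-size blocks and assign each block to a server round-robin. The block size and server names below are made up for illustration, and real HDFS also replicates each block across servers for fault tolerance.

```python
# Split data into blocks and distribute them across servers.
BLOCK_SIZE = 4  # bytes, unrealistically small so the split is visible
servers = ["node-a", "node-b", "node-c"]

data = b"terabytes of sensor and social media data"
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
placement = {i: servers[i % len(servers)] for i in range(len(blocks))}
print(placement)  # block index -> server holding that block
```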
MapReduce program: A composite program that consists of a Map procedure
that performs filtering and sorting and a Reduce method that performs a
summary operation.
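
The classic word-count example shows the two phases. This plain-Python sketch runs both phases locally; a real Hadoop job would distribute the map tasks across many servers and sort and shuffle the pairs by key before reducing.

```python
# Word count expressed as a Map phase and a Reduce phase.
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs, filtering out empty tokens."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each key (a summary operation)."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

docs = ["big data needs big tools", "data tools evolve"]
print(reduce_phase(map_phase(docs)))  # {'big': 2, 'data': 2, ...}
```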
in-memory database (IMDB): A database management system that stores the
entire database in random access memory (RAM). This approach provides access
to data at rates much faster than storing data on some form of secondary storage
(e.g., a hard drive or flash drive) as is done with traditional database management
systems.
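
A quick illustration using SQLite's built-in in-memory mode, in which the entire database lives in RAM rather than on secondary storage:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # RAM-resident database, nothing on disk
con.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
con.execute("INSERT INTO readings VALUES ('temp-01', 21.5)")
print(con.execute("SELECT * FROM readings").fetchall())
```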
