
Topic 12

Manipulating
data

Prepared by: Mohammad Nabeel Arshad
– Our world runs on data, but that data must be relevant, accurate, and valid. It
must be stored and manipulated in ways that allow it to be of value to
individuals and organizations. Collecting, storing and manipulating data is the
focus of many IT systems.

– Data manipulation is the process of changing data to make it easier to read or
be more organized. For example, a log of data could be organized in
alphabetical order, making individual entries easier to locate.
12.1 Data integrity
– Data integrity is the overall accuracy, completeness, and consistency of data. Data
integrity also refers to the safety of data in regards to regulatory compliance.

– The term data integrity refers to the accuracy and consistency of data. Maintaining data
integrity means making sure the data remains intact and unchanged throughout its entire
life cycle. This includes the capture of the data, storage, updates, transfers, backups, etc.

– When the integrity of data is secure, the information stored in a database will remain
complete, accurate, and reliable no matter how long it’s stored or how often it’s accessed.
Data integrity also ensures that your data is safe from any outside forces.
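– A minimal sketch of one common way to check that stored data has not changed (the
record contents here are made up): compute a hash when the data is captured, store it,
and compare it again later in the life cycle.

import hashlib

def fingerprint(data: bytes) -> str:
    """Return a SHA-256 digest that can be stored alongside the data."""
    return hashlib.sha256(data).hexdigest()

record = b"customer_id=42;name=Aisha;balance=120.50"
stored_digest = fingerprint(record)            # saved when the record is first captured

# Later (after storage, transfer or backup), recompute and compare:
print(fingerprint(record) == stored_digest)    # True while the data remains unchanged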
Why is it Important to Maintain Data
Integrity?
– Imagine making an extremely important business decision that hinges on data which
is entirely, or even partially, inaccurate. Organizations routinely make
data-driven business decisions, and if that data lacks integrity, those decisions can
have a dramatic effect on the company's bottom-line goals.

– A new international report reveals that a large majority of senior executives
don't have a high level of trust in the way their organization uses data,
analytics, or AI.
Methods to maintain data integrity
Threats to data integrity highlight aspects of data security that can help preserve it.
Use the following checklist to preserve data integrity and minimize risk for your
organization:
1.Validate Input: When your data set is supplied by a known or unknown source (an end-user, another
application, a malicious user, or any number of other sources) you should require input validation. That
data should be verified and validated to ensure that the input is accurate.
2.Validate Data: It’s critical to certify that your data processes haven’t been corrupted. Identify
specifications and key attributes that are important to your organization before you validate the data.
3.Remove Duplicate Data: Sensitive data from a secure database can easily find a home on a
document, spreadsheet, email, or in shared folders where employees without proper access can see it.
It’s prudent to clean up stray data and remove duplicates.
4.Back up Data: In addition to removing duplicates to ensure data security, data backups are a critical
part of the process. Backing up is necessary and goes a long way to prevent permanent data loss. How
often should you be backing up? As often as possible. Keep in mind that backups are critical when
organizations get hit with ransomware attacks. Just make sure that your backups aren’t also encrypted!
5.Access Controls: We've made the case above for input validation, data validation, removing
duplicates, and backups – all necessary to preserve data integrity. Access controls matter just as
much: following the principle of least privilege, only the users who need particular data to do
their jobs should be able to view or change it.
6.Always Keep an Audit Trail: Whenever there is a breach, it’s critical to data integrity to be able to
track down the source. Often referred to as an audit trail, this provides an organization the breadcrumbs
to accurately pinpoint the source of the problem.
12.1.2 Data Dictionary
– A data dictionary contains metadata, i.e. data about the database. The data dictionary
is very important as it contains information such as what is in the database, who is
allowed to access it, where is the database physically stored etc. The users of the
database normally don't interact with the data dictionary, it is only handled by the
database administrators.

The data dictionary in general contains information about the following:


– Names of all the database tables and their schemas.
– Details about all the tables in the database, such as their owners, their security
constraints, when they were created etc.
– Physical information about the tables such as where they are stored and how.
– Table constraints such as primary key attributes, foreign key information etc.
– Information about the database views that are visible.
– The data dictionary acts as an automated or manual, active or passive file which stores
the definitions of the data elements and their characteristics. The data dictionary
is actually the repository of information about the data: it defines each of the data
elements and gives each one a name for easy access.

– The data dictionary acts as the core, or hub, of the database management system and is
the third component of the database management system. The data dictionary provides
the following information:
1. The name of the data item.
2. The description of the data item.
3. The sources of the data.
4. The impact analysis.
5. Keywords that are used for the categorization and the search of the data item
descriptions.
– Functions of the Data Dictionary
1. Defines the data elements.
2. Helps in scheduling.
3. Helps in control.
4. Lets users know which data is available and how it can be obtained.
5. Helps in identifying organizational data irregularities.
6. Acts as a very essential data management tool.
7. Provides a good standardization mechanism.
8. Acts as the corporate glossary of the ever-growing information resource.
9. Provides reporting, control, and excerpt facilities.
12.1.3 Construct a Data Dictionary

– Every research database, large or small, simple or complicated, should be
accompanied by a data dictionary that describes the variables contained in the
database. It will be invaluable if the person who created the database is no
longer around. A data dictionary is, itself, a data file, containing one record for
every variable in the database.

– For each variable, the dictionary should contain most of the following
information (sometimes referred to as metadata, which means “data about
data”):
– A short variable name (usually no more than eight or ten characters) that’s used when telling the
software what variables you want it to use in an analysis

– A longer verbal description of the variable (up to 50 or 100 characters)

– The type of data (text, categorical, numerical, date/time, and so on)

– If numeric: Information about how that number is displayed (how many digits are before and
after the decimal point)

– If date/time: How it’s formatted (for example, 12/25/13 10:50pm or 25Dec2013 22:50)

– If categorical: What the permissible categories are

– How missing values are represented in the database (99, 999, “NA,” and so on)

– Many statistical packages allow (or require) you to specify this information when you’re creating
the file anyway, so they can generate the data dictionary for you automatically.
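– A minimal sketch of such a data dictionary held as a data file with one record per
variable (the variable names and values are illustrative):

# Each entry is one data dictionary record: metadata describing a variable.
data_dictionary = [
    {"name": "pat_id",  "description": "Unique patient identifier",
     "type": "numeric", "format": "8 digits, 0 decimal places", "missing": "999"},
    {"name": "dob",     "description": "Date of birth",
     "type": "date/time", "format": "25Dec2013", "missing": "NA"},
    {"name": "smoker",  "description": "Smoking status",
     "type": "categorical", "format": "never / former / current", "missing": "NA"},
]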
– Communication is also improved through the understanding created by the
definitions in the data dictionary. Here are a few guidelines to help in creating a
data dictionary:

– Gather Information

– Decide the Format

– Make it Flexible
12.1.4 Concept and need for data
validation
– Data validation primarily helps in ensuring that the data sent to connected
applications is complete, accurate, secure and consistent.
– This is achieved through data validation's checks and rules that routinely check
for the validity of data. These rules are generally defined in a data dictionary or
are implemented through data validation software.

– Different types of validation can be performed depending on destination
constraints or objectives. Data validation is a form of data cleansing.
– Why perform data validation?
When moving and merging data it’s important to make sure data from different
sources and repositories will conform to business rules and not become corrupted due
to inconsistencies in type or context. The goal is to create data that is consistent,
accurate and complete so as to prevent data loss and errors during a move.

– When is data validation performed?
In data warehousing, data validation is often performed prior to the ETL (Extract,
Transform, Load) process. A data validation test is performed so that an analyst can get
insight into the scope or nature of data conflicts. Data validation is a general term,
however, and can be performed on any type of data, including data within a single
application (such as Microsoft Excel) or when merging simple data within a single data
store.
12.1.5 Interpret and design validation
rules
– Validation rules verify that the data a user enters in a record meets the
standards you specify before the user can save the record.
– A validation rule can contain a formula or expression that evaluates the data in
one or more fields and returns a value of “True” or “False”.
– Validation rules also include an error message to display to the user when the
rule returns a value of “True” due to an invalid value.
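– A minimal sketch of such a rule (the field name, limits and message are illustrative):
an expression that returns True when the entered value is invalid, paired with the error
message to display before the record is saved.

# The record fails validation when the rule's expression evaluates to True.
validation_rule = {
    "field": "discount_percent",
    "expression": lambda record: not (0 <= record["discount_percent"] <= 50),
    "error_message": "Discount must be between 0 and 50 percent.",
}

record = {"discount_percent": 75}
if validation_rule["expression"](record):       # True -> invalid value
    print(validation_rule["error_message"])     # shown instead of saving the record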
Types of Validation rules

– a. presence
– b. range
– c. lookup
– d. list
– e. length
– f. format
– g. check digit.
a. Presence check

– There might be an important piece of data that you want to make sure is always
stored.
– For example, a school will always want to know an emergency contact number,
a video rental store might always want to know a customer's address, and a
wedding dress shop might always want a record of the bride's wedding date.
– A presence check makes sure that a critical field cannot be left blank, it must be
filled in. If someone tries to leave the field blank then an error message will
appear and you won't be able to progress to another record or save any other
data which you have entered.
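– A minimal sketch of a presence check, assuming the field arrives as a string:

def presence_check(value) -> bool:
    """Return True only if the critical field has actually been filled in."""
    return value is not None and str(value).strip() != ""

print(presence_check("0121 496 0000"))   # True  - field filled in, record can be saved
print(presence_check("   "))             # False - blank, so an error message is shown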
b. Range check
– A range check is commonly used when you are working with data which
consists of numbers, currency or dates/times.

– A range check allows you to set suitable boundaries:
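– A minimal sketch of a range check with illustrative boundaries (an exam mark that
must lie between 0 and 100):

def range_check(value, low, high) -> bool:
    """Return True if the value lies within the allowed boundaries."""
    return low <= value <= high

print(range_check(67, 0, 100))    # True  - within the boundaries, accepted
print(range_check(120, 0, 100))   # False - outside the boundaries, rejected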


c. lookup
– When a field contains a limited list of items then a lookup list can help reduce errors.
For example:
– - a shop might put the dress sizes into a lookup list
– - a car showroom might put the car models into a lookup list
– - a vet might list the most popular types of animals that they deal with.
– For example a database storing film information wants to record the type of film it is, so for
convenience a 'lookup' drop down list is provided on the data entry form.

The benefits of a lookup list are that they:

– - speed up data entry because it is usually much faster to pick from a list than to
type each individual entry
– - improve accuracy because they reduce the risk of spelling mistakes
– - limit the options to choose from by only presenting the required options
– However, using a lookup validation technique does not prevent someone from
entering the wrong data into the field and so mistakes can still be made.
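– A minimal sketch of a lookup check against an illustrative list of film genres (as in
the film-database example above):

FILM_GENRES = ["Action", "Comedy", "Drama", "Horror", "Sci-Fi"]   # the lookup list

def lookup_check(value) -> bool:
    """Return True if the entered value is one of the allowed list items."""
    return value in FILM_GENRES

print(lookup_check("Comedy"))   # True  - picked from the list
print(lookup_check("Comdey"))   # False - a typed spelling mistake would be rejected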
d. list

– A list is created for the limited options to be entered.

– Users can easily select the required option from the list, so there is no need to type it.

– This saves time and reduces the chance of errors. There will also be no spelling
mistakes, as the data is selected rather than typed.
e. length
– Sometimes you may have data which always has the same number of characters.

– For example a UK landline telephone number has 11 characters.

– fixed length input field

– A length check could be set up to ensure that exactly 11 numbers are entered into the field. This type of validation cannot
check that the 11 numbers are correct, but it can ensure that 10 or 12 numbers aren't entered.

A length check can also be set up to allow characters to be entered within a certain range.
For example, postcodes can be in the form of:

– CV45 2RE (7 without a space or 8 with a space) or

– B9 3TF (5 without a space or 6 with a space).

– An input field expecting a post code entry could have a rule that it must be between 5 and 8 characters.
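– A minimal sketch of both kinds of length check described above: an exact length (a UK
landline number) and a length range (a postcode of 5 to 8 characters); the example
values are illustrative.

def exact_length_check(value: str, required: int) -> bool:
    return len(value) == required

def length_range_check(value: str, minimum: int, maximum: int) -> bool:
    return minimum <= len(value) <= maximum

print(exact_length_check("01214960000", 11))   # True  - exactly 11 characters
print(length_range_check("CV45 2RE", 5, 8))    # True  - 8 characters with the space
print(length_range_check("B9", 5, 8))          # False - too short, rejected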
f. format
– You may see this validation technique referred to as either a picture or a format check, they are the
same thing so no need to worry that you need to learn two different definitions.
Some types of data will always consist of the same pattern.
– Example 1
– Think about a postcode. The majority of postcodes look something like this:
– CV36 7TP
– WR14 5WB
– Replace either of those examples with L for any letter which appears and N for any number that
appears and you will end up with:
– LLNN NLL
– This means that you can set up a picture/format check for something like a postcode field to ensure
that a letter isn't entered where a number should be or a number in place of a letter.
– n.b. a few postcodes break this rule e.g. B9 7NT. You can still set up a picture/format check to include
this variation.
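– A minimal sketch of a picture/format check using a regular expression that covers the
LLNN NLL pattern above and the shorter LN NLL variation; this is a simplified pattern
for illustration, not a complete UK postcode validator.

import re

# L = letter, N = number: accepts e.g. "CV36 7TP" (LLNN NLL) and "B9 7NT" (LN NLL)
POSTCODE_PATTERN = re.compile(r"^[A-Z]{1,2}[0-9]{1,2} [0-9][A-Z]{2}$")

def format_check(value: str) -> bool:
    return POSTCODE_PATTERN.match(value) is not None

print(format_check("CV36 7TP"))   # True
print(format_check("B9 7NT"))     # True  - the shorter variation
print(format_check("CV3X 7TP"))   # False - a letter where a number should be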
g. check digit

– This is used when you want to be sure that a sequence of numbers has been entered correctly, for
example a barcode or an ISBN:

– ISBN 1 84146 201 2

– The check digit is the final number in the sequence, so in this example it is the final ‘2’.

– [Barcode image] The digit on the far right of a barcode is its check digit.

– The computer will perform a complex calculation on all of the numbers and then compare the
answer to the check digit. If both match, it means the data was entered correctly.
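– A minimal sketch of the calculation for an ISBN-10 check digit (weights 10 down to 2,
with the check digit chosen so the weighted sum is divisible by 11), checked against the
ISBN shown above:

def isbn10_check_digit(first_nine: str) -> str:
    """Compute the ISBN-10 check digit from the first nine digits."""
    total = sum(weight * int(digit)
                for weight, digit in zip(range(10, 1, -1), first_nine))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

print(isbn10_check_digit("184146201"))   # '2' - matches ISBN 1 84146 201 2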
12.2 Data
normalisation

Prepared By: Mohammad Nabeel Arshad
12.2.1 Understand the concept of data redundancy
and the problems associated with it.
Data redundancy:
– Data redundancy – if data in the database can be found in two different locations (direct
redundancy) or if data can be calculated from other data items (indirect redundancy)
then the data is said to contain redundancy.

Data redundancy in a database means that some data fields are repeated in the database.

– This data repetition may occur either if a field is repeated in two or more tables or if the
field is repeated within the table.
– Data can appear multiple times in a database for a variety of reasons. For example, a shop
may have the same customer’s name appearing several times if that customer has bought
several different products at different dates.
Problems associated with data
redundancy
There can be a number of problems caused by redundant data:

– Increases the size of the database unnecessarily.
– Causes data inconsistency.
– Decreases the efficiency of the database.
– May cause data corruption.

Such data redundancy in a DBMS can be prevented by database normalization.


12.2.2 Understand the concept of and
need for normalisation.
– Normalisation:
In database design, the process of organizing data to minimize redundancy.
Normalization usually involves dividing a database into two or more tables and defining
relationships between the tables.
The objective is to isolate data so that additions, deletions, and modifications of a field can
be made in just one table and then propagated through the rest of the database via the
defined relationships.
-----------------------------------------------------------------------------------------------------
Normalization is the process of reorganizing data in a database so that it meets two basic
requirements: (1) There is no redundancy of data (all data is stored in only one place), and
(2) data dependencies are logical (all related data items are stored together).
Aims of Normalisation

– Normalisation ensures that the database is structured in the best possible way.
– To achieve control over data redundancy. There should be no unnecessary
duplication of data in different tables.
– To ensure data consistency. Where duplication is necessary the data is the
same.
– To ensure tables have a flexible structure. E.g. number of classes taken or books
borrowed should not be limited.
– To allow data in different tables to be combined in complex queries.
Normalised database has many
advantages

– 1. Prevent the Same Data from Being Stored in Many Places

– 2. Prevent Updates Made to Some Data and Not Others

– 3. Prevent Deleting Unrelated Data

– 4. Ensure Queries are More Efficient


12.2.3 Be able to normalise a collection of
data into first, second, and third normal
forms.
– The data in the database can be considered to be in one of a number of 'normal
forms'. Basically the normal form of the data indicates how much redundancy is
in that data. The normal forms have a strict ordering.
– The three main types of normalization are listed below. Note: "NF" refers to
"normal form."
– 1NF
– 2NF
– 3NF
1NF

– First normal form (1NF) deals with the 'shape' of the record type.
– A relation is in 1NF if, and only if, it contains no repeating attributes or groups
of attributes.
– Example:
– The Student table with the repeating group is not in 1NF
– It has repeating groups, and it is called an 'unnormalised table'.
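– A minimal sketch of removing a repeating group to reach 1NF, using an illustrative
Student table:

# Unnormalised: 'subjects' is a repeating group inside a single record.
unnormalised = {"student_id": 1, "name": "Aisha",
                "subjects": ["Maths", "IT", "Physics"]}

# 1NF: one row per student/subject combination - no repeating attributes.
first_normal_form = [
    {"student_id": 1, "name": "Aisha", "subject": "Maths"},
    {"student_id": 1, "name": "Aisha", "subject": "IT"},
    {"student_id": 1, "name": "Aisha", "subject": "Physics"},
]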
2NF

– Second normal form (or 2NF) is a more stringent normal form defined as:
– A relation is in 2NF if, and only if, it is in 1NF and every non-key attribute is fully
functionally dependent on the whole key.

Thus the relation is in 1NF with no repeating groups, and all non-key attributes
must depend on the whole key, not just some part of it. Another way of saying
this is that there must be no partial key dependencies
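– Continuing the illustrative student example, a minimal sketch of reaching 2NF: with
the key (student_id, subject), the student's name depends on only part of the key, so it
is moved to its own table.

# 2NF: 'name' depended only on student_id (a partial key dependency), so it is
# split out; the enrolment table now contains only the full key.
students   = [{"student_id": 1, "name": "Aisha"}]
enrolments = [
    {"student_id": 1, "subject": "Maths"},
    {"student_id": 1, "subject": "IT"},
    {"student_id": 1, "subject": "Physics"},
]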
3NF

– 3NF is an even stricter normal form and removes virtually all the redundant data:
– A relation is in 3NF if, and only if, it is in 2NF and there are no transitive functional
dependencies
– Transitive functional dependencies arise:
– when one non-key attribute is functionally dependent on another non-key attribute:
– FD: non-key attribute -> non-key attribute
– and when there is redundancy in the database
– By definition transitive functional dependency can only occur if there is more than one
non-key field, so we can say that a relation in 2NF with zero or one non-key field must
automatically be in 3NF.
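– A minimal sketch of removing a transitive dependency to reach 3NF, again with
illustrative fields: tutor_room depends on tutor (a non-key attribute) rather than on the
student key, so it is moved to its own table.

# In 2NF but not 3NF: tutor_room depends on tutor, itself a non-key attribute.
students_2nf = [{"student_id": 1, "name": "Aisha",
                 "tutor": "Mr Khan", "tutor_room": "B12"}]

# 3NF: the transitive dependency is removed by splitting out a Tutor table.
students_3nf = [{"student_id": 1, "name": "Aisha", "tutor": "Mr Khan"}]
tutors       = [{"tutor": "Mr Khan", "tutor_room": "B12"}]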
2NF
3NF
Exercise
12.2.4 Be able to design a logical data
model (normalised data and relations).
12.3 Big Data

Prepared By: Mohammad Nabeel Arshad
Big data

– Big data is an evolving term that describes a large volume of structured,
semi-structured and unstructured data that has the potential to be mined for
information and used in machine learning projects and other advanced analytics
applications.

– Big data is often characterized by the 3Vs: the extreme volume of data, the wide
variety of data types and the velocity at which the data must be processed.
12.3.1 Understand the concept of Big Data and the
issues associated with its collection:

– Big data refers to the large, diverse sets of information that grow at
ever-increasing rates. It encompasses the volume of information, the velocity or
speed at which it is created and collected, and the variety or scope of the data
points being covered. Big data often comes from multiple sources and arrives in
multiple formats.
– Big Data has the potential to help companies improve operations and make
faster, more intelligent decisions. The data is collected from a number of
sources including emails, mobile devices, applications, databases, servers and
other means.
– This data, when captured, formatted, manipulated, stored and then analyzed,
can help a company to gain useful insight to increase revenues, get or retain
customers and improve operations.
– An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024
petabytes) of data consisting of billions to trillions of records of millions of
people—all from different sources (e.g. Web, sales, customer contact center,
social media, mobile data and so on). The data is typically loosely structured
data that is often incomplete and inaccessible.
– Data volumes are continuing to grow and so are the possibilities of what can be
done with so much raw data available. However, organizations need to be able
to know just what they can do with that data and how much they can leverage
to build insights for their consumers, products, and services. Of the 85% of
companies using Big Data, only 37% have been successful in data-driven
insights. A 10% increase in the accessibility of the data can lead to an increase
of $65Mn in the net income of a company.
– While Big Data offers a ton of benefits, it comes with its own set of issues. It is a
new set of complex technologies that is still in the nascent stages of development
and evolution.
Issues associated with collection of
Big Data
Big data challenges are numerous: Big data projects have become a normal part of
doing business — but that doesn't mean that big data is easy.
There are certain challenges in collection of huge volume of data, here are some
critical issues:
– a. volume
– b. velocity
– c. variety
– d. veracity
– e. value.
a. Volume

– Volume refers to the vast amounts of data generated every second. Just think of all
the emails, twitter messages, photos, video clips, sensor data etc. we produce and
share every second. We are not talking Terabytes but Zettabytes or Brontobytes.
– On Facebook alone we send 10 billion messages per day, click the "like" button 4.5
billion times and upload 350 million new pictures each and every day. If we take all
the data generated in the world between the beginning of time and 2008, the same
amount of data will soon be generated every minute! This increasingly makes data
sets too large to store and analyse using traditional database technology.
– With big data technology we can now store and use these data sets with the help of
distributed systems, where parts of the data are stored in different locations and
brought together by software.
b. Velocity

– Velocity refers to the speed at which new data is generated and the speed at
which data moves around. Just think of social media messages going viral in
seconds, the speed at which credit card transactions are checked for fraudulent
activities, or the milliseconds it takes trading systems to analyse social media
networks to pick up signals that trigger decisions to buy or sell shares.
– Big data technology allows us now to analyse the data while it is being
generated, without ever putting it into databases.
c. Variety

– Variety refers to the different types of data we can now use. In the past we
focused on structured data that neatly fits into tables or relational databases,
such as financial data (e.g. sales by product or region). In fact, 80% of the
world’s data is now unstructured, and therefore can’t easily be put into tables
(think of photos, video sequences or social media updates).
– With big data technology we can now harness different types of data (structured
and unstructured) including messages, social media conversations, photos,
sensor data, video or voice recordings and bring them together with more
traditional, structured data.
d. Veracity

– Veracity refers to the messiness or trustworthiness of the data. With many
forms of big data, quality and accuracy are less controllable (just think of
Twitter posts with hashtags, abbreviations, typos and colloquial speech, as well
as the reliability and accuracy of content), but big data and analytics technology
now allows us to work with these types of data.
– The volumes often make up for the lack of quality or accuracy.
e. Value

– Value: Then there is another V to take into account when looking at Big Data:
Value! It is all well and good having access to big data but unless we can turn it
into value it is useless. So you can safely argue that 'value' is the most important
V of Big Data.
– It is important that businesses make a business case for any attempt to collect
and leverage big data.
– It is so easy to fall into the buzz trap and embark on big data initiatives without
a clear understanding of costs and benefits.
12.3.2 Understand the underlying infrastructure and services
which allow Big Data to happen:

– Broadly speaking, infrastructure is a combination of hardware, software, and
services. The foundation is processing and storage hardware because the
software has to live somewhere. The combination ends with services, as the
infrastructure must stay robust, supportive, and scalable.

– a. collection
– b. storage
– c. transmission.
a. Collection

Six general ways in which data is collected:


1. Loyalty Cards
2. Gameplay
3. Satellite Imagery
4. Employer Databases
5. Your Inbox
6. Social Media Activity
1. Loyalty Cards

– Loyalty programs or cards carry great benefits for companies. The program
focuses on rewarding the repeat clients and incentivizes extra shopping. This
means that every time you use credit cards or loyalty cards, your purchase
data is not only being tracked but also stored.
– Typically, it does make sense for retailers to know which products/services are
being sold to which group of customers but this insight is also used by them to
create a detailed customer profile.
2. Gameplay

– Many companies in the gaming industry need powerful tools to access and use
player data in order to reach their full potential, so they have turned to big data
collection and analysis. This means that even online gamers are now not exempt
from the collection of big data.
– The constant and strong web connection of different devices enables game
developers to easily access a huge amount of data almost instantaneously,
regardless of whether the game is single-player. Each time a player faces difficulty
on a particular level, whether he or she is tempted to make an in-app purchase,
deletes or re-installs the game, gives up within a few minutes or plays for a long
time, all of this information is tracked as well as stored.
3. Satellite Imagery

– Almost 250 miles above our heads, plenty of low-cost small satellites are offering
us a far better understanding of the actual human economy. In other words,
another interesting source of big data collected by companies is something visible
from the sky.
– With the advent of the latest technology, like Google Maps and Google Earth, this
satellite data is now publicly available. This means that savvy analytics experts
can develop a surprisingly complete picture of particular areas.
4. Employer Databases

HR departments can use big data to profile their employees and quantify
workplace performance. Employee history with the company might be an obvious
item of interest, but big data also includes less intuitive figures, including:
– The amount of time workers spend with certain programs on their computers
– The times of day when employees appear most active
– The moment employees first power on their devices
5. Your Inbox

– Modern email services are depositories of large amounts of user data.


– While the following information is not true of all services, it is the case in some
of the most popular email providers, including Google and Yahoo. Both of these
companies use algorithms to scan the content of your email for certain
keywords with the goal of providing advertising targeted toward your interests.
6. Social Media Activity

– Social media sites are another large provider of big data. Social media users
often willingly provide information about their personal lives to such services,
and Terms of Service agreements typically allow sites the right to store and use
this information as they see fit.
– However, big data analytics can also be used to document which features users
agree to disable, which posts they delete and how often they log into the site at
different parts of the day. This information can be used to create thorough
profiles of users’ habits and detail what information is important to them.
There are many data collection methods. Here is a clarification of the general
steps for collecting big data.

– Gather data: This is the first step of gathering data from different data sources. Different methods are also
provided in this step according to different data collection purposes, for example a census, buying data from
Data-as-a-Service companies or using web scraping tools.
– Store data: After gathering the big data, you need to put the data into appropriate databases or storage
services for further processing. Usually this step requires investment in physical infrastructure as well as
cloud services.
– Clean up data: Since the raw data contains a lot of noisy information you don't need, you need to pick out
the data that meets your needs. This step is to sort the data, including cleaning up, concatenating and merging it.
– Reorganize data: You need to reorganize the data after cleaning it up for further use. Usually you need to
turn unstructured or semi-structured formats into structured formats (for example, for storage and processing
in Hadoop and HDFS).
– Verify data: To make sure the data you get is right and makes sense, you need to verify it. Choose some
samples to see whether they work. Make sure that you are in the right direction so you can apply these
techniques to your sourcing.
b. Storage

– A big data storage system clusters a large number of commodity servers
attached to high-capacity disks to support analytic software written to crunch
vast quantities of data.
– The system relies on massively parallel processing databases to analyze data
ingested from a variety of sources.
– Big data often lacks structure and comes from various sources, making it a poor
fit for processing with a relational database. Apache Hadoop, with its Hadoop
Distributed File System (HDFS), is the most prevalent analytics platform for big
data, and is typically combined with some flavor of NoSQL database.
– At root, the key requirements of big data storage are that it can handle very
large amounts of data and keep scaling to keep up with growth, and that it can
provide the input/output operations per second (IOPS) necessary to deliver
data to analytics tools.
– The largest big data practitioners – Google, Facebook, Apple, etc – run what are
known as hyperscale computing environments.
– Hyperscale computing environments have been the preserve of the largest
web-based operations to date, but it is highly probable that such
compute/storage architectures will bleed down into more mainstream
enterprises in the coming years.
c. Transmission

– Transferring massive data sets across shared communication links is a
non-trivial task that requires significant resources and coordination.
– There are currently two main approaches for servicing these types of big data
transmissions: end-to-end and store-and-forward.
– In end-to-end, one or more data streams are opened between the sender and
receiver, and data are transmitted directly over these links.
– In store-and-forward, data are transmitted from the sender to one or more
intermediate nodes, before being forwarded to the receiver.
– Bulk transmissions have received attention in recent years since file sizes have been increasing
across the board. In addition to cloud players like Amazon, Microsoft, Google, Yahoo!, Akamai, and
Facebook, research labs, universities, and even ordinary users are generating larger files. For
organizations and labs that transmit large files on a regular basis, it is cost effective to purchase or lease
network links. However, ordinary users - in homes, offices, and schools - may also occasionally want to
transmit a larger than normal file.
– In these cases, one has to rely on the internet or postal mail for transmission. Since the need for large file
transfers by ordinary users has been growing, there has been substantial interest in delay tolerant bulk
transmissions over the internet.
– The “delay tolerance” refers to the transmissions being given lower priority than regular internet traffic.
To ensure that other internet users are not affected, bulk transmissions only grab high bandwidth during
periods of low internet usage.
12.3.3 Understand the impact of
storing Big Data:

– Big data can present an abundance of new growth opportunities, from internal
insights to front-facing customer interactions. Three major business opportunities
include: automation, in-depth insights, and data-driven decision making.

We will consider the following factors for impact of storing big data:
– a. access
– b. processing time
– c. transmission time
– d. security.
a. access

– Your data won’t be much good to you if it’s hard to access; after all, data
storage is just a temporary measure so you can later analyze the data and put it
to good use.
– Accordingly, you’ll need some kind of system with an intuitive, accessible user
interface (UI), and clean accessibility for whatever functionality you want.
– There are a number of different approaches available for facilitating rapid data
access, the major choices being flat files, traditional databases, and the
emergent NoSQL paradigm. Each of these designs offers different strengths and
weaknesses based on the structure of the data stored and the skills of the
analysts involved.
– As compared to data processing, data access has very different characteristics,
including:
– The data structure highly depends on how applications or users need to retrieve
the data
– Data retrieval patterns need to be well understood because some data can be
repetitively retrieved by a large number of users or applications.
– The amount of data retrieved each time should be targeted, and therefore
should only contain a fraction of the available data.
– Flat file systems record data on disk and are accessed directly by analysts, usually using
simple parsing tools. Most log systems create flat file data by default: after producing some
fixed number of records, they close a file and open up a new file. Flat files are simple to
read and analyze, but lack any particular tools for providing optimized access.
– Database systems such as Oracle and Postgres are the bedrock of enterprise computing.
They use well-defined interface languages, you can find system administrators and
maintainers with ease, and they can be configured to provide extremely stable and scalable
solutions.
– Efficient data access is a critical engineering effort; the time to access data directly impacts
the number of queries an analyst can make, and that concretely impacts the type of
analyses they will do.
– Choosing the right data system is a function of the volume of data stored, the type of data
stored, and the population that’s going to analyze it. There is no single right choice, and
depending on the combination of queries expected and data stored, each of these
strategies can be the best.
b. processing time

– Data Processing for big data emphasizes “scaling” from the beginning, meaning that
whenever data volume increases, the processing time should still be within the
expectation given the available hardware. The overall data processing time can range
from minutes to hours to days, depending on the amount of data and the complexity
of the logic in the processing.
– TeraSort is a popular benchmark that measures the amount of time to sort one
terabyte of randomly distributed data on a given computer system
– Many enterprises, and even quite small businesses, can take advantage of big data
analytics. They will need the ability to handle relatively large data sets and handle
them quickly, but may not need quite the same response times as those organisations
that use it to push adverts out to users within response times of a few seconds.
c. transmission time
Big data transmission will occur on the premises network and may affect the enterprise
WAN. If the cloud is used, Internet access will be taxed for its capacity. The network's
capability to absorb and transfer big data traffic is made up of six elements:
– Bandwidth -- You will always need more. As big data is analyzed, users may want to
collect even more data as they learn how to better analyze it. Don't forget that data
about the networks adds to the traffic load, and therefore more bandwidth may be
required. Bandwidth should be scalable in response to traffic that can increase
rapidly. You need to relate bandwidth utilization to the application used.
– Network Delay/Latency -- Real-time delivery with real-time responses based on
analysis means that network delay can cause the data and responses to be created
and delivered too late. Predictable consistent latency needs to be delivered.
– Security -- This is important for both access to and transmission of the data. It is very
likely that the data is sensitive for both the organizations and its customers.
– Delivery Accuracy -- Data can sometimes be lost or delivered with errors. No
network is perfect, but knowing that there has been data corruption can help
minimize the impacts.
– Availability -- The loss of networks can be highly disruptive. An availability of
99.99+% is a good goal. Make sure you know what events or conditions are not
included in the availability calculation, as you may actually be experiencing only 99%
availability.
– Resiliency -- Failures will occur; they always do. How fast those failures can be
resolved leads to either confidence in the network and its management or
skepticism of the value of data collection and analysis.
d. security

– Big Data security is the process of guarding data and analytics processes,
both in the cloud and on-premise, from any number of factors that could
compromise their confidentiality. There are several challenges to securing big
data that can compromise its security.
– Keep in mind that these challenges are by no means limited to on-premise big
data platforms. They also pertain to the cloud. When you host your big data
platform in the cloud, take nothing for granted. Work closely with your provider
to overcome these same challenges with strong security service level
agreements.
– The sheer size of a big data installation, terabytes to petabytes large, is too big
for routine security audits. And because most big data platforms are
cluster-based, this introduces multiple vulnerabilities across multiple nodes and
servers.
– If the big data owner does not regularly update security for the environment,
they are at risk of data loss and exposure.

– Secure your big data platform from high threats and low, and it will serve your
business well for many years.
12.3.4 Understand
the concepts of data
mining, data
warehousing and
data analytics

Prepared By: Mohammad Nabeel Arshad
Data Mining

– In simple words, data mining is defined as a process used to extract usable data
from a larger set of any raw data. It implies analysing data patterns in large
batches of data using one or more software tools.

– It is the process of sorting through large data sets to identify patterns and
establish relationships to solve problems through data analysis. Data mining
tools allow enterprises to predict future trends.
Data Warehouse

– Data warehousing is defined as a technique for collecting and managing data
from varied sources to provide meaningful business insights. It is a blend of
technologies and components which aids the strategic use of data.
– It is electronic storage of a large amount of information by a business which is
designed for query and analysis instead of transaction processing. It is a process
of transforming data into information and making it available to users in a
timely manner to make a difference.
Data Analytics

– Data analytics (DA) is the process of examining data sets in order to draw
conclusions about the information they contain, increasingly with the aid of
specialized systems and software.
– Data analytics technologies and techniques are widely used in commercial
industries to enable organizations to make more-informed business decisions
and by scientists and researchers to verify or disprove scientific models,
theories and hypotheses.
a. descriptive

– Descriptive Analytics, which use data aggregation and data mining to provide
insight into the past and answer: “What has happened?”
– Descriptive Analytics: Insight into the past
– Descriptive analysis or statistics does exactly what the name implies: it
"describes", or summarizes, raw data and makes it something that is interpretable
by humans. These are analytics that describe the past. The past refers to any
point of time that an event has occurred, whether it is one minute ago or one
year ago. Descriptive analytics are useful because they allow us to learn from
past behaviors and understand how they might influence future outcomes.
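– A minimal sketch of descriptive analytics (the sales figures are made up): aggregating
raw data into summary figures that describe what has happened.

from statistics import mean

daily_sales = [120, 95, 143, 110, 98, 160, 132]    # one week of illustrative raw data

summary = {
    "total":   sum(daily_sales),
    "average": round(mean(daily_sales), 1),
    "best":    max(daily_sales),
    "worst":   min(daily_sales),
}
print(summary)   # e.g. {'total': 858, 'average': 122.6, 'best': 160, 'worst': 95}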
b. predictive

– Predictive Analytics, which use statistical models and forecasting techniques to
understand the future and answer: "What could happen?"
– Predictive Analytics: Understanding the future
– Predictive analytics provides companies with actionable insights based on data.
Predictive analytics provide estimates about the likelihood of a future outcome.
It is important to remember that no statistical algorithm can “predict” the
future with 100% certainty. Companies use these statistics to forecast what
might happen in the future. This is because the foundation of predictive
analytics is based on probabilities.
c. prescriptive.

– Prescriptive Analytics, which use optimization and simulation algorithms to advise on
possible outcomes and answer: "What should we do?"
– Prescriptive Analytics: Advise on possible outcomes
– The relatively new field of prescriptive analytics allows users to "prescribe" a number
of different possible actions and guides them towards a solution. In a nutshell, these
analytics are all about providing advice. Prescriptive analytics attempt to quantify the
effect of future decisions in order to advise on possible outcomes before the decisions
are actually made. At their best, prescriptive analytics predict not only what will
happen, but also why it will happen, providing recommendations regarding actions
that will take advantage of the predictions.
12.3.5 Understand how Big Data
is used by individuals,
organisations and society
– Industry influencers, academicians, and other prominent stakeholders certainly agree that big
data has become a big game changer in most, if not all, types of modern industries over the last
few years. As big data continues to permeate our day-to-day lives, there has been a significant
shift of focus from the hype surrounding it to finding real value in its use.

– While understanding the value of big data continues to remain a challenge, other practical
challenges, including funding, return on investment and skills, continue to remain at the
forefront for a number of different industries that are adopting big data.
a. healthcare

– Healthcare big data refers to collecting, analyzing and
leveraging consumer, patient, physical, and clinical data
that is too vast or complex to be understood by
traditional means of data processing. Instead, big data
is often processed by machine learning algorithms and
data scientists.
– The rise of healthcare big data comes in response to the
digitization of healthcare information and the rise of
value-based care, which has encouraged the industry to
use data analytics to make strategic business decisions.
– Faced with the challenges of healthcare data – such as
volume, velocity, variety, and veracity – health systems
need to adopt technology capable of collecting, storing,
and analyzing this information to produce actionable
insights.
Importance of Big Data in Healthcare

Big data has become more influential in healthcare due to three major shifts in the
healthcare industry:
– the vast amount of data available
– growing healthcare costs
– a focus on consumerism
Benefits of Big Data in Healthcare

– Create holistic, 360-degree view of consumers, patients, and physicians.
– Improve care personalization and efficiency with comprehensive patient profiles.
– Identify geographic markets with a high potential for growth.
– Inform physician relationship management efforts by tracking physician preferences,
referrals, and clinical appointment data.
– Boost healthcare marketing efforts with information about consumer, patient, and
physician needs and preferences.
– Provide straightforward identification of patterns in health outcomes, patient
satisfaction, and hospital growth.
– Optimize hospital growth by improving care efficiency, effectiveness, and
personalization.
b. infrastructure planning

– Infrastructure is the cornerstone of any Big Data architecture. However, before
Data Scientists can analyze the data, it needs to be stored and processed.
– Build an infrastructure designed to enable new levels of insights derived from
exploiting all relevant data. The platform should be fluent in all forms of data
and analytics: transactional data, Hadoop data, and so forth. Build a highly
flexible, scalable IT infrastructure, tuned for today’s big data environment and
designed to capitalize on the integration of social, mobile, and cloud
technologies.
– The Big Data infrastructure needs to churn and deliver a large amount of data in
real time at high speeds. That means latency (speed of response) needs to be
controlled.
– To deliver on this promise, infrastructure should have the massive processing
power and high-speed connectivity. This means there is a need for high IOPS
(input /output operations) which can be delivered by server virtualization and
use of flash memory.
c. transportation

– The lay definition of transportation is "an act, process, or instance of transporting or
being transported", referring to the movement of people or goods.
– However, transportation can be viewed in different manners as the result of the
interplay of different technologies and behaviors, with dependencies and impacts on
economic, environmental and geographical aspects.
– For instance, Figure 1 (Transport Research & Innovation Portal) shows the various
transport themes, which are grouped within five transport dimensions: mode, sector,
technology, policy, and evaluation. While each theme focuses on specific aspects of
transport research, many themes overlap.
– On the other hand, Figure 2 (Teodorovic and Janic, 2017) attempts to categorize
transportation with a focus on two dimensions: mode and sector.
Figure 1
Figure 2
d. fraud detection

– By using big data analytics to look for patterns of fraudulent behavior in
enormous amounts of unstructured and structured claims-related data,
companies are detecting fraud in real time.
– By developing predictive models based on both historical and real-time data on
wages, medical claims, attorney costs, demographics, weather data, call center
notes, and voice recordings, companies are in a better position to identify
suspected fraudulent claims in the early stages.
– The return on investment for these companies can be huge. They are able to
analyze complex information and accident scenarios in minutes as compared to
days or months before implementing a big data platform.
– Given the vast amount of data that our investigators need to sift through to find
fraudulent patterns, an integrated big data and search architecture emerged as
the most feasible approach.
– Insurance fraud is such a huge cost for companies that executives are moving
quickly to incorporate big data analytics and other advanced technology to
address the problem of insurance fraud. Insurance companies not only feel the
impact of these high costs, but the costs also have a negative impact on
customers who are charged higher rates to account for the losses.
End of Topic 12
