Topic 12 Manipulating Data: Prepared By: Mohammad Nabeel Arshad
Manipulating data
– The term data integrity refers to the accuracy and consistency of data. Maintaining data
integrity means making sure the data remains intact and unchanged throughout its entire
life cycle, including the capture of the data, storage, updates, transfers, and backups.
– When the integrity of data is secure, the information stored in a database will remain
complete, accurate, and reliable no matter how long it’s stored or how often it’s accessed.
Data integrity also ensures that your data is safe from any outside forces.
Why is it Important to Maintain Data Integrity?
– Imagine making an extremely important business decision that hinges on data which
is entirely, or even partially, inaccurate. Organizations routinely make
data-driven business decisions, and when the data lacks integrity, those decisions can
have a dramatic effect on the company’s bottom line.
– The Data Dictionary, the third component of the database management system, acts as
the core or hub of the DBMS. The Data Dictionary provides the following information:
1. The name of the data item.
2. The description of the data item.
3. The sources of the data.
4. The impact analysis.
5. Keywords that are used for the categorization and the search of the data item
descriptions.
– Functions of the Data Dictionary
1. Defines the data elements.
2. Helps in scheduling.
3. Helps in control.
4. Lets users know which data is available and how it can be obtained.
5. Helps in identifying organizational data irregularities.
6. Acts as an essential data management tool.
7. Provides a good standardization mechanism.
8. Acts as the corporate glossary of an ever-growing information resource.
9. Provides a report facility, a control facility and an excerpt facility.
12.1.3 Construct a Data Dictionary
– For each variable, the dictionary should contain most of the following
information (sometimes referred to as metadata, which means “data about
data”):
– A short variable name (usually no more than eight or ten characters) that’s used when telling the
software what variables you want it to use in an analysis
– If numeric: Information about how that number is displayed (how many digits are before and
after the decimal point)
– If date/time: How it’s formatted (for example, 12/25/13 10:50pm or 25Dec2013 22:50)
– How missing values are represented in the database (99, 999, “NA,” and so on)
– Many statistical packages allow (or require) you to specify this information when you’re creating
the file anyway, so they can generate the data dictionary for you automatically.
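As a sketch, the metadata above can also be kept in a small machine-readable structure. The variable names, display formats and missing-value codes below are hypothetical, not taken from any particular statistical package:

```python
# A minimal data-dictionary sketch for a hypothetical patient-visit dataset.
data_dictionary = [
    {
        "name": "visit_dt",              # short variable name (<= 10 characters)
        "description": "Date and time of the clinic visit",
        "type": "datetime",
        "format": "25Dec2013 22:50",     # how date/time values are displayed
        "missing": "NA",                 # how missing values are represented
    },
    {
        "name": "sys_bp",
        "description": "Systolic blood pressure",
        "type": "numeric",
        "format": "3.0",                 # three digits before the decimal point, none after
        "missing": "999",
    },
]

def describe(var_name):
    """Look up a variable's metadata by its short name."""
    for entry in data_dictionary:
        if entry["name"] == var_name:
            return entry
    return None

print(describe("sys_bp")["description"])  # -> Systolic blood pressure
```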
– Communication is also improved through the understanding created by the
definitions in the data dictionary. Here are a few guidelines to help in creating a
data dictionary:
– Gather Information
– Make it Flexible
12.1.4 Concept and need for data validation
– Data validation primarily helps in ensuring that the data sent to connected
applications is complete, accurate, secure and consistent.
– This is achieved through data validation's checks and rules that routinely check
for the validity of data. These rules are generally defined in a data dictionary or
are implemented through data validation software.
– a. presence
– b. range
– c. lookup
– d. list
– e. length
– f. format
– g. check digit.
a. Presence check
– There might be an important piece of data that you want to make sure is always
stored.
– For example, a school will always want to know an emergency contact number,
a video rental store might always want to know a customer's address, and a
wedding dress shop might always want a record of the bride's wedding date.
– A presence check makes sure that a critical field cannot be left blank; it must be
filled in. If someone tries to leave the field blank, an error message will
appear and they won't be able to move to another record or save any other
data they have entered.
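A presence check can be sketched in a few lines. The record fields here (a hypothetical wedding-dress shop) are illustrative:

```python
def presence_check(record, required_fields):
    """Return the list of critical fields that were left blank."""
    missing = []
    for field in required_fields:
        value = record.get(field, "")
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing

# A bride's record with the wedding date left blank:
record = {"name": "A. Smith", "wedding_date": ""}
errors = presence_check(record, ["name", "wedding_date"])
if errors:
    print("Cannot save record: " + ", ".join(errors) + " must be filled in")
```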
b. Range check
– A range check is commonly used when you are working with data which
consists of numbers, currency or dates/times. It checks that a value falls
between a set minimum and maximum; anything outside those limits is rejected.
c. Lookup check
– A lookup check compares the entered data against a list of acceptable values.
Lookup lists:
– speed up data entry, because it is usually much faster to pick from a list than to
type each individual entry
– improve accuracy, because they reduce the risk of spelling mistakes
– limit the options to choose from by only presenting the required options
– However, using a lookup validation technique does not prevent someone from
entering the wrong data into the field, and so mistakes can still be made.
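Both checks reduce to a one-line rule. The limits and option list below are made-up examples:

```python
def range_check(value, minimum, maximum):
    """Accept a value only if it falls between the set limits (inclusive)."""
    return minimum <= value <= maximum

def lookup_check(value, allowed):
    """Accept a value only if it appears in the list of permitted options."""
    return value in allowed

# Hypothetical rules: exam marks must be 0-100, grades must come from a fixed list.
assert range_check(85, 0, 100)
assert not range_check(110, 0, 100)
assert lookup_check("B", ["A", "B", "C", "D", "U"])
assert not lookup_check("Z", ["A", "B", "C", "D", "U"])
```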
d. List check
– Users can easily select the required option from a list, with no need to type it in.
– This saves time and reduces the chance of errors; there will be no spelling
mistakes, as the data is selected rather than written.
e. Length check
– Sometimes you may have data which always has the same number of characters.
– A length check could be set up to ensure that exactly 11 numbers are entered into the field. This type of validation cannot
check that the 11 numbers are correct, but it can ensure that 10 or 12 numbers aren't entered.
– A length check can also be set up to allow entries within a certain range of lengths.
For example, an input field expecting a post code could have a rule that the entry must be between 5 and 8 characters.
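A length check is a simple character count. The example values below are illustrative:

```python
def length_check(value, minimum, maximum=None):
    """Check a value's character count; pass maximum=None to require an exact length."""
    if maximum is None:
        return len(value) == minimum
    return minimum <= len(value) <= maximum

assert length_check("07700900123", 11)       # exactly 11 characters
assert not length_check("0770090012", 11)    # 10 characters fails
assert length_check("CV36 7TP", 5, 8)        # post codes: between 5 and 8 characters
```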
f. Format check
– You may see this validation technique referred to as either a picture check or a format check;
they are the same thing, so there is no need to learn two different definitions.
– Some types of data will always consist of the same pattern.
– Example 1
– Think about a postcode. The majority of postcodes look something like this:
– CV36 7TP
– WR14 5WB
– Replace either of those examples with L for any letter which appears and N for any number that
appears and you will end up with:
– LLNN NLL
– This means that you can set up a picture/format check for something like a postcode field to ensure
that a letter isn't entered where a number should be or a number in place of a letter.
– n.b. a few postcodes break this rule e.g. B9 7NT. You can still set up a picture/format check to include
this variation.
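A picture/format check is usually implemented with a pattern such as a regular expression. This sketch covers only the LLNN NLL shape and the B9 7NT-style variation mentioned above, not every real UK postcode format:

```python
import re

# L = any letter, N = any number. Two shapes: LLNN NLL and the LN NLL variation.
POSTCODE = re.compile(r"^[A-Z]{2}\d{2} \d[A-Z]{2}$|^[A-Z]\d \d[A-Z]{2}$")

def format_check(value):
    """Return True if the value matches the postcode picture."""
    return bool(POSTCODE.match(value))

assert format_check("CV36 7TP")      # matches LLNN NLL
assert format_check("B9 7NT")        # the LN NLL variation is also allowed
assert not format_check("CV3A 7TP")  # a letter where a number should be
```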
g. Check digit
– This is used when you want to be sure that a long number has been entered correctly, for
example a barcode or an ISBN number.
– The check digit is the final number in the sequence.
– The computer performs a calculation on all of the other numbers and then compares the
answer to the check digit. If both match, it means the data was entered correctly.
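As a concrete illustration, the ISBN-10 scheme works this way: each of the ten digits is multiplied by a weight from 10 down to 1, and the total must be divisible by 11 (the final digit, the check digit, is chosen to make this true):

```python
def isbn10_check(isbn):
    """Validate an ISBN-10: the weighted sum of all ten digits
    (weights 10 down to 1, with 'X' standing for 10) must be divisible by 11."""
    digits = [10 if ch in "Xx" else int(ch) for ch in isbn if ch.isalnum()]
    if len(digits) != 10:
        return False
    total = sum(weight * digit for weight, digit in zip(range(10, 0, -1), digits))
    return total % 11 == 0

assert isbn10_check("0-306-40615-2")      # correct check digit
assert not isbn10_check("0-306-40615-3")  # a single mistyped digit is caught
```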
12.2 Data normalisation
Data redundancy in database means that some data fields are repeated in the database.
– This data repetition may occur either if a field is repeated in two or more tables or if the
field is repeated within the table.
– Data can appear multiple times in a database for a variety of reasons. For example, a shop
may have the same customer’s name appearing several times if that customer has bought
several different products at different dates.
Problems associated with data redundancy
There can be a number of problems caused by redundant data. Normalisation addresses
them by ensuring that the database is structured in the best possible way:
– To achieve control over data redundancy: there should be no unnecessary
duplication of data in different tables.
– To ensure data consistency: where duplication is necessary, the data is the
same.
– To ensure tables have a flexible structure, e.g. the number of classes taken or books
borrowed should not be limited.
– To allow data in different tables to be combined in complex queries.
Normalised database has many advantages
– First normal form (1NF) deals with the 'shape' of the record type.
– A relation is in 1NF if, and only if, it contains no repeating attributes or groups
of attributes.
– Example: a Student table with a repeating group (for instance, a list of courses held
inside one student record) is not in 1NF. A table with repeating groups is called an
'unnormalised table'.
2NF
– Second normal form (or 2NF) is a more stringent normal form, defined as:
– A relation is in 2NF if, and only if, it is in 1NF and every non-key attribute is fully
functionally dependent on the whole key.
– Thus the relation is in 1NF with no repeating groups, and all non-key attributes
must depend on the whole key, not just some part of it. Another way of saying
this is that there must be no partial key dependencies.
3NF
– 3NF is an even stricter normal form and removes virtually all the redundant data:
– A relation is in 3NF if, and only if, it is in 2NF and there are no transitive functional
dependencies.
– Transitive functional dependencies arise:
– when one non-key attribute is functionally dependent on another non-key attribute:
– FD: non-key attribute -> non-key attribute
– and when there is redundancy in the database
– By definition transitive functional dependency can only occur if there is more than one
non-key field, so we can say that a relation in 2NF with zero or one non-key field must
automatically be in 3NF.
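The move from an unnormalised table to 1NF can be sketched in code. The student and course data below are invented for illustration, not taken from the exercise slides:

```python
# An unnormalised Student table: the "courses" list is a repeating group.
unnormalised = [
    {"student_id": 1, "name": "Asha", "courses": ["Maths", "Physics"]},
    {"student_id": 2, "name": "Bilal", "courses": ["Maths"]},
]

# 1NF: remove the repeating group. Student details are stored once,
# and each (student_id, course) pair becomes its own atomic row.
students = [{"student_id": r["student_id"], "name": r["name"]} for r in unnormalised]
enrolments = [
    {"student_id": r["student_id"], "course": c}
    for r in unnormalised
    for c in r["courses"]
]

print(enrolments)
```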
12.2.4 Be able to design a logical data model (normalised data and relations).
12.3 Big Data
– Big data refers to the large, diverse sets of information that grow at
ever-increasing rates. It encompasses the volume of information, the velocity or
speed at which it is created and collected, and the variety or scope of the data
points being covered. Big data often comes from multiple sources and arrives in
multiple formats.
– Big Data has the potential to help companies improve operations and make
faster, more intelligent decisions. The data is collected from a number of
sources including emails, mobile devices, applications, databases, servers and
other means.
– This data, when captured, formatted, manipulated, stored and then analyzed,
can help a company to gain useful insight to increase revenues, get or retain
customers and improve operations.
– An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024
petabytes) of data consisting of billions to trillions of records of millions of
people—all from different sources (e.g. Web, sales, customer contact center,
social media, mobile data and so on). The data is typically loosely structured
data that is often incomplete and inaccessible.
– Data volumes are continuing to grow and so are the possibilities of what can be
done with so much raw data available. However, organizations need to be able
to know just what they can do with that data and how much they can leverage
to build insights for their consumers, products, and services. Of the 85% of
companies using Big Data, only 37% have been successful in data-driven
insights. A 10% increase in the accessibility of the data can lead to an increase
of $65Mn in the net income of a company.
– While Big Data offers a ton of benefits, it comes with its own set of issues. This
is a new set of complex technologies, while still in the nascent stages of
development and evolution.
Issues associated with collection of Big Data
Big data challenges are numerous: Big data projects have become a normal part of
doing business — but that doesn't mean that big data is easy.
There are certain challenges in collection of huge volume of data, here are some
critical issues:
– a. volume
– b. velocity
– c. variety
– d. veracity
– e. value.
a. Volume
– Volume refers to the vast amounts of data generated every second. Just think of all
the emails, twitter messages, photos, video clips, sensor data etc. we produce and
share every second. We are not talking Terabytes but Zettabytes or Brontobytes.
– On Facebook alone we send 10 billion messages per day, click the “like” button 4.5
billion times and upload 350 million new pictures each and every day. If we take all
the data generated in the world between the beginning of time and 2008, the same
amount of data will soon be generated every minute! This increasingly makes data
sets too large to store and analyse using traditional database technology.
– With big data technology we can now store and use these data sets with the help of
distributed systems, where parts of the data are stored in different locations and
brought together by software.
b. Velocity
– Velocity refers to the speed at which new data is generated and the speed at
which data moves around. Just think of social media messages going viral in
seconds, the speed at which credit card transactions are checked for fraudulent
activities, or the milliseconds it takes trading systems to analyse social media
networks to pick up signals that trigger decisions to buy or sell shares.
– Big data technology allows us now to analyse the data while it is being
generated, without ever putting it into databases.
c. Variety
– Variety refers to the different types of data we can now use. In the past we
focused on structured data that neatly fits into tables or relational databases,
such as financial data (e.g. sales by product or region). In fact, 80% of the
world’s data is now unstructured, and therefore can’t easily be put into tables
(think of photos, video sequences or social media updates).
– With big data technology we can now harness different types of data (structured
and unstructured), including messages, social media conversations, photos,
sensor data, video or voice recordings, and bring them together with more
traditional, structured data.
d. Veracity
– Veracity refers to the trustworthiness of the data: how accurate, complete and
reliable it is. Data of doubtful veracity (missing values, errors, bias) undermines any
analysis built on it.
e. Value
– Then there is another V to take into account when looking at Big Data:
Value! It is all well and good having access to big data, but unless we can turn it
into value it is useless. So you can safely argue that 'value' is the most important
V of Big Data.
– It is important that businesses make a business case for any attempt to collect
and leverage big data.
– It is so easy to fall into the buzz trap and embark on big data initiatives without
a clear understanding of costs and benefits.
12.3.2 Understand the underlying infrastructure and services
which allow Big Data to happen:
– a. collection
– b. storage
– c. transmission.
a. Collection
1. Loyalty Programs
– Loyalty programs or cards carry great benefits for companies. They focus on
rewarding repeat clients and incentivize extra shopping. This means that every
time you use a credit card or loyalty card, your purchase data is not only being
tracked but also stored.
– Typically, it does make sense for retailers to know which products/services are
being sold to which group of customers but this insight is also used by them to
create a detailed customer profile.
2. Gameplay
– In the gaming industry, many established companies need powerful tools to access
player data and use it to reach their full potential, so they have turned to big data
collection and analysis. This means that even online gamers are now not exempt
from the collection of big data.
– The constant, strong web connection of different devices enables game
developers to access a huge amount of data almost instantaneously,
even if the game is single-player. Each time a player faces difficulty on
some particular level, whether he or she makes an in-app purchase, deletes or
re-installs the game, gives up within a few minutes or plays for a long time, all of
this information is tracked as well as stored.
3. Satellite Imagery
– Almost 250 miles above our heads, plenty of low-cost small satellites are
offering us a far better understanding of the real human economy. In other words,
another interesting source of big data collected by companies is what is visible
from the sky.
– With the advent of technology like Google Maps and Google Earth, this
satellite data is now publicly available. This means that savvy analytics experts
can develop a surprisingly complete picture of particular areas.
4. Employer Databases
HR departments can use big data to profile their employees and quantify
workplace performance. An employee's history with the company might be a common
starting point, but big data also includes less intuitive figures, including:
– The amount of time workers spend with certain programs on their computers
– The times of day where employees appear most active
– The moment employees first power on their devices
5. Your Inbox
– Social media sites are another large provider of big data. Social media users
often willingly provide information about their personal lives to such services,
and Terms of Service agreements typically allow sites the right to store and use
this information as they see fit.
– However, big data analytics can also be used to document which features users
agree to disable, which posts they delete and how often they log into the site at
different parts of the day. This information can be used to create thorough
profiles of users’ habits and detail what information is important to them.
There may be a lot of data collection methods. Here I will make a
clarification of the general steps to collect big data.
– Gather data: This is the first step, gathering data from different data sources. Different methods are available
depending on the purpose of collection, for example a census, buying data from Data-as-a-Service companies,
or using web scraping tools.
– Store data: After gathering the big data, you need to put it into appropriate databases or storage
services for further processing. This step usually requires investment in physical infrastructure or in
cloud services.
– Clean up data: Since there is a lot of noisy information you don't need, you must pick out what meets
your needs. This step sorts the data, including cleaning up, concatenating and merging it.
– Reorganize data: You need to reorganize the data after cleaning it for further use. Usually you
need to turn unstructured or semi-structured formats into structured formats (for example using tools
such as Hadoop and HDFS).
– Verify data: To make sure the data you get is right and makes sense, verify it. Choose
some samples to see whether they look correct. Make sure you are in the right direction before applying these
techniques to your full source.
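The five steps above can be sketched as a toy pipeline. The file name, record fields and sources here are all stand-ins for real storage services and data feeds:

```python
import json
import random

def gather():
    # In practice: a census, purchased data sets, or web-scraping tools.
    return ['{"user": "a", "spend": "12.5"}', '{"user": "b"}', "not json at all"]

def store(raw, path="raw_data.jsonl"):
    # In practice: a database or cloud storage service.
    with open(path, "w") as f:
        f.write("\n".join(raw))
    return path

def clean(path):
    # Drop noisy records that don't parse; keep the ones that meet our needs.
    records = []
    with open(path) as f:
        for line in f:
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                pass
    return records

def reorganize(records):
    # Turn semi-structured records into one fixed, structured shape.
    return [{"user": r.get("user"), "spend": float(r.get("spend", 0))} for r in records]

def verify(rows):
    # Spot-check a random sample to confirm the pipeline makes sense.
    sample = random.choice(rows)
    assert set(sample) == {"user", "spend"}
    return rows

rows = verify(reorganize(clean(store(gather()))))
print(rows)
```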
b. Storage
– Big data can present an abundance of new growth opportunities, from internal
insights to front-facing customer interactions. Three major business opportunities
include: automation, in-depth insights, and data-driven decision making.
We will consider the following factors for impact of storing big data:
– a. access
– b. processing time
– c. transmission time
– d. security.
a. access
– Your data won’t be much good to you if it’s hard to access; after all, data
storage is just a temporary measure so you can later analyze the data and put it
to good use.
– Accordingly, you’ll need some kind of system with an intuitive, accessible user
interface (UI), and clean accessibility for whatever functionality you want.
– There are a number of different approaches available for facilitating rapid data
access, the major choices being flat files, traditional databases, and the
emergent NoSQL paradigm. Each of these designs offers different strengths and
weaknesses based on the structure of the data stored and the skills of the
analysts involved.
– As compared to data processing, data access has very different characteristics,
including:
– The data structure highly depends on how applications or users need to retrieve
the data
– Data retrieval patterns need to be well understood, because some data can be
repetitively retrieved by a large number of users or applications.
– The amount of data retrieved each time should be targeted, and therefore
should only contain a fraction of the available data.
– Flat file systems record data on disk and are accessed directly by analysts, usually using
simple parsing tools. Most log systems create flat file data by default: after producing some
fixed number of records, they close a file and open up a new file. Flat files are simple to
read and analyze, but lack any particular tools for providing optimized access.
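A minimal illustration of flat-file access: log records are written as plain lines and read back with a simple parser, with no index or query engine in between (the log format here is invented):

```python
# A log system writes fixed-format lines to a flat file on disk.
records = [
    "2024-01-05 GET /index.html 200",
    "2024-01-05 GET /missing 404",
]
with open("access.log", "w") as f:
    f.write("\n".join(records))

# An analyst reads the flat file back with a simple parsing tool;
# there is no optimized access, so every query is a full scan.
errors = []
with open("access.log") as f:
    for line in f:
        date, method, path, status = line.split()
        if status.startswith("4"):
            errors.append(path)

print(errors)  # -> ['/missing']
```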
– Database systems such as Oracle and Postgres are the bedrock of enterprise computing.
They use well-defined interface languages, you can find system administrators and
maintainers with ease, and they can be configured to provide extremely stable and scalable
solutions.
– Efficient data access is a critical engineering effort; the time to access data directly impacts
the number of queries an analyst can make, and that concretely impacts the type of
analyses they will do.
– Choosing the right data system is a function of the volume of data stored, the type of data
stored, and the population that’s going to analyze it. There is no single right choice, and
depending on the combination of queries expected and data stored, each of these
strategies can be the best.
b. processing time
– Data Processing for big data emphasizes “scaling” from the beginning, meaning that
whenever data volume increases, the processing time should still be within the
expectation given the available hardware. The overall data processing time can range
from minutes to hours to days, depending on the amount of data and the complexity
of the logic in the processing.
– TeraSort is a popular benchmark that measures the amount of time taken to sort one
terabyte of randomly distributed data on a given computer system.
– Many enterprises, and even quite small businesses, can take advantage of big data
analytics. They will need the ability to handle relatively large data sets and handle
them quickly, but may not need the same response times as those organisations
that use big data to push adverts out to users within a few seconds.
c. transmission time
Big data transmission will occur on the premises network and may affect the enterprise
WAN. If the cloud is used, Internet access will be taxed for its capacity. The network's
capability to absorb and transfer big data traffic is made up of six elements:
– Bandwidth -- You will always need more. As big data is analyzed, users may want to
collect even more data as they learn how to better analyze it. Don't forget that data
about the networks adds to the traffic load, and therefore more bandwidth may be
required. Bandwidth should be scalable in response to traffic that can increase
rapidly. You need to relate bandwidth utilization to the application used.
– Network Delay/Latency -- Real-time delivery with real-time responses based on
analysis means that network delay can cause the data and responses to be created
and delivered too late. Predictable, consistent latency is required.
– Security -- This is important for both access to and transmission of the data. It is very
likely that the data is sensitive for both the organizations and its customers.
– Delivery Accuracy -- Data can sometimes be lost or delivered with errors. No
network is perfect, but knowing that there has been data corruption can help
minimize the impacts.
– Availability -- The loss of networks can be highly disruptive. An availability of
99.99+% is a good goal. Make sure you know what events or conditions are not
included in the availability calculation, as you may actually be experiencing only 99%
availability.
– Resiliency -- Failures will occur; they always do. How fast those failures can be
resolved leads to either confidence in the network and its management or
skepticism of the value of data collection and analysis.
d. security
– Big Data security is the process of guarding data and analytics processes,
both in the cloud and on-premise, from any number of factors that could
compromise their confidentiality. There are several challenges to securing big
data.
– Keep in mind that these challenges are by no means limited to on-premise big
data platforms. They also pertain to the cloud. When you host your big data
platform in the cloud, take nothing for granted. Work closely with your provider
to overcome these same challenges with strong security service level
agreements.
– The sheer size of a big data installation, terabytes to petabytes large, is too big
for routine security audits. And because most big data platforms are
cluster-based, this introduces multiple vulnerabilities across multiple nodes and
servers.
– If the big data owner does not regularly update security for the environment,
they are at risk of data loss and exposure.
– Secure your big data platform from high threats and low, and it will serve your
business well for many years.
12.3.4 Understand the concepts of data mining, data warehousing and data analytics
– In simple words, data mining is defined as a process used to extract usable data
from a larger set of any raw data. It implies analysing data patterns in large
batches of data using one or more software.
– It is the process of sorting through large data sets to identify patterns and
establish relationships to solve problems through data analysis. Data mining
tools allow enterprises to predict future trends.
Data Warehouse and Data Analytics
– A data warehouse is a central repository of integrated data drawn from one or more
sources, designed for reporting and analysis rather than day-to-day transactions.
– Data analytics (DA) is the process of examining data sets in order to draw
conclusions about the information they contain, increasingly with the aid of
specialized systems and software.
– Data analytics technologies and techniques are widely used in commercial
industries to enable organizations to make more-informed business decisions
and by scientists and researchers to verify or disprove scientific models,
theories and hypotheses.
a. descriptive
– Descriptive Analytics, which use data aggregation and data mining to provide
insight into the past and answer: “What has happened?”
– Descriptive Analytics: Insight into the past
– Descriptive analysis, or descriptive statistics, does exactly what the name implies:
it “describes”, or summarizes, raw data and turns it into something interpretable
by humans. These are analytics that describe the past. The past refers to any
point in time at which an event has occurred, whether it is one minute ago or one
year ago. Descriptive analytics are useful because they allow us to learn from
past behaviors and understand how they might influence future outcomes.
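A tiny sketch of descriptive analytics: aggregating hypothetical past sales records to summarize “what has happened”:

```python
from statistics import mean

# Invented historical sales records standing in for real raw data.
sales = [
    {"month": "Jan", "region": "North", "amount": 120},
    {"month": "Jan", "region": "South", "amount": 80},
    {"month": "Feb", "region": "North", "amount": 150},
]

# Aggregation turns raw rows into human-readable summaries of the past.
total = sum(r["amount"] for r in sales)
average = mean(r["amount"] for r in sales)
by_region = {}
for r in sales:
    by_region[r["region"]] = by_region.get(r["region"], 0) + r["amount"]

print(total, round(average, 1), by_region)
```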
b. predictive
– Predictive Analytics, which use statistical models and forecasting techniques to
understand the future and answer: “What could happen?”
– While understanding the value of big data continues to remain a challenge, other practical
challenges, including funding, return on investment and skills, remain at the
forefront for a number of different industries that are adopting big data.
a. Healthcare
– Healthcare big data refers to collecting, analyzing and leveraging consumer,
patient, physical, and clinical data that is too vast or complex to be understood by
traditional means of data processing. Instead, big data is often processed by
machine learning algorithms and data scientists.
– The rise of healthcare big data comes in response to the
digitization of healthcare information and the rise of
value-based care, which has encouraged the industry to
use data analytics to make strategic business decisions.
– Faced with the challenges of healthcare data – such as
volume, velocity, variety, and veracity – health systems
need to adopt technology capable of collecting, storing,
and analyzing this information to produce actionable
insights.
Importance of Big Data in Healthcare
Big data has become more influential in healthcare due to three major shifts in the
healthcare industry:
– the vast amount of data available
– growing healthcare costs
– a focus on consumerism
Benefits of Big Data in Healthcare
– Create a holistic, 360-degree view of consumers, patients, and physicians.
– Improve care personalization and efficiency with comprehensive patient profiles.