
Khiem (a dataset):

"Hi everyone! I'm Minh Khiem, a dataset about students in Group 2 of the
class BA 04. An inexperienced student just collected me, and, well, let's just
say I have a few... issues. I have to find someone to help me!"

Thu Hoang (a high-quality dataset):


"I'm Thu Hoang, a high-quality dataset processed and polished to perfection.
Khiem, it sounds like you need a glow-up! Let's work on those data quality
dimensions together."

Khiem:
"Absolutely, Thu Hoang! Can you help me understand these dimensions
better? I want to be as reliable and accurate as you are."

Question 1: Let’s identify the error in this data set:


Thu Hoang’s explanation: This is a violation of the dimension "Accuracy"
of Data Quality. The “Accuracy” dimension is the degree to which the data
correctly represents the entity or attribute being described.

Question 2: Let’s identify the error in this data set:


Thu Hoang’s explanation: This is a violation of the dimension
"Completeness" of Data Quality. The “Completeness” dimension measures how
much of the required data is present, often reported as the percentage of
values that are not missing from a given data set.

Question 3: Let’s identify the error in this data set:


Thu Hoang’s explanation: This is a violation of the dimension
"Consistency" of Data Quality. The “Consistency” dimension refers to the
uniformity of data across multiple datasets or systems. Data is consistent
when it maintains the same standards and formats in all its occurrences.

Question 4: Let’s identify the error in this data set:


Thu Hoang’s explanation: This is a violation of the dimension
"Timeliness" of Data Quality. The “Timeliness” dimension refers to the
availability of data when it is needed, ensuring that the data is up-to-date
and relevant at the time of use.

Question 5: Let’s identify the error in this data set:


Thu Hoang’s explanation: This is a violation of the dimension "Validity"
of Data Quality. The “Validity” dimension refers to whether the data is
formatted and structured according to the applicable rules, standards, and
constraints of the system or context in which it is used.

Question 6: Let’s identify the error in this data set:


Thu Hoang’s explanation: This is a violation of the dimension
"Uniqueness" of Data Quality. The “Uniqueness” dimension ensures
duplicate or overlapping data is identified and marked.

Question 7: Let’s identify the error in this data set:

Thu Hoang’s explanation: This is a violation of the dimension "Integrity"
of Data Quality. The “Integrity” dimension refers to the validity of
relationships across various data entities.

Dataset Minh Khiêm argues: But Master, isn’t that just inconsistency? My
format was only a little different in the Sales Department.
Thư Hoàng’s explanation: Your format is not only inconsistent, it is also
wrong. In a consistency violation, the names are formatted differently but
are still correct. Here, however, the email format is completely wrong, so
you can’t contact the students by email.
Dataset Minh Khiêm: Oh, I understand now.

After listening to Thu Hoang's advice, Minh Khiem worked hard to improve
himself. He glowed up and became a high-quality dataset.

Khiem:
Now that I understand what it takes to be a quality dataset, I’m going to
share it with everyone! For a bottle of wine, good quality means it meets the
requirements of flavor, color, and scent. For a dataset like me, it means
meeting seven key dimensions: Accuracy, Completeness, Consistency,
Validity, Timeliness, Uniqueness, and Integrity.

Let's start with accuracy. Accuracy refers to how close the data values are to
their real-world values. For example, when you go to a restaurant, you
expect the food to look exactly like the pictures on the menu, right? If not,
you'd feel tricked and disappointed. The same goes for data: an accurate
dataset lets people make well-informed decisions.

Next up, completeness! This refers to how much of the required data is
present. A complete dataset provides a comprehensive view and helps analysts
see the full picture of the data gathered, which leads to good information.

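
To make completeness a little more concrete, here is a minimal Python sketch that scores it as the share of required cells that are filled in. The student records, the field names, and the rule that None or an empty string counts as missing are all assumptions made up for this illustration.

```python
# Hypothetical student records; None or "" marks a missing value (an assumption).
records = [
    {"name": "An", "email": "an@example.com", "age": 20},
    {"name": "Binh", "email": None, "age": 21},
    {"name": "Chi", "email": "chi@example.com", "age": None},
    {"name": "", "email": "dung@example.com", "age": 19},
]
REQUIRED_FIELDS = ["name", "email", "age"]


def completeness(rows, fields):
    """Return the share of required cells that are filled in (0.0 to 1.0)."""
    total = len(rows) * len(fields)
    if total == 0:
        return 1.0  # nothing is required, so nothing is missing
    filled = sum(
        1 for row in rows for field in fields if row.get(field) not in (None, "")
    )
    return filled / total


score = completeness(records, REQUIRED_FIELDS)
print(f"Completeness: {score:.0%}")  # 9 of 12 required cells are filled: 75%
```

A score below 100% tells you that something is missing, but not whether the gaps matter; that still depends on which fields the analysis really needs.
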
How about consistency? This dimension is all about uniformity across
multiple data sources. Consistent data has the same formatting and values
represented the same way everywhere. Imagine if your university showed
your mid-term scores as numbers, but your final overall score as a long
paragraph with no numbers. You'd be so confused! Consistent data avoids
something like that.

Timeliness is another key dimension. This is about having data available
exactly when you need it. It's like having an up-to-date weather app on your
phone. If it said sunshine but then you got caught in the rain, that app may
not be up-to-date, and that untimely data could get you in trouble! Data
must be available at the right time!
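
One simple way to picture a timeliness check is to compare a record's last update with the moment it is used. The sketch below assumes a hypothetical six-hour freshness limit for a weather forecast; the timestamps and the limit are invented for the example.

```python
from datetime import datetime, timedelta


def is_stale(last_updated, now, max_age):
    """Return True when the data is older than the allowed age."""
    return now - last_updated > max_age


# Hypothetical values: a forecast refreshed at 03:00 and read at 12:00.
forecast_updated = datetime(2024, 5, 1, 3, 0)
time_of_use = datetime(2024, 5, 1, 12, 0)
freshness_limit = timedelta(hours=6)  # assumed limit, not a standard

stale = is_stale(forecast_updated, time_of_use, freshness_limit)
print("Forecast is stale:", stale)  # 9 hours old against a 6-hour limit: True
```
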

Let's continue with validity. Validity refers to how well the format and
structure of the data follow the standards of the system and the context in
which it is used. For example, a valid email address must contain an "@"
symbol and a period (".") to be formatted correctly. This element is vital, as
valid data will fit business rules and expected formats. This means you
don't have to spend additional time configuring the data to be able to use it,
saving valuable resources.
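
The email rule above can be sketched in a few lines of Python. The pattern below is a deliberately simple assumption (some text, an "@", some text, a period, some text); real-world email validation is more involved, and the sample addresses are invented.

```python
import re

# Simplified rule from the text: an "@" followed later by a period.
# This pattern is an illustrative assumption, not a full email standard.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_valid_email(value):
    """Return True when the value matches the simplified email rule."""
    return isinstance(value, str) and EMAIL_PATTERN.fullmatch(value) is not None


# Hypothetical addresses collected for the group.
emails = ["khiem@example.com", "thuhoang.example.com", "minh@khiem", "a@b.c"]
invalid = [email for email in emails if not is_valid_email(email)]
print(invalid)  # ['thuhoang.example.com', 'minh@khiem']
```

Running a check like this at collection time is cheaper than repairing the values later, which is the resource saving the paragraph above describes.
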

Next, we have integrity. Integrity is all about whether the relationships
between data entities remain valid across several datasets. If integrity is maintained throughout all data
entities, analysts can confidently make decisions based on credible,
trustworthy data. Imagine trying to analyze data from multiple sources,
only to find inconsistencies and contradictions – it would be a nightmare!
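
As a rough illustration of an integrity check, the sketch below looks for score rows that point to a student who does not exist in the student list. Both tables and the `student_id` key are hypothetical names chosen for this example.

```python
# Hypothetical "parent" table of students and "child" table of scores.
students = [
    {"student_id": "S01", "name": "An"},
    {"student_id": "S02", "name": "Binh"},
]
scores = [
    {"student_id": "S01", "score": 8.5},
    {"student_id": "S03", "score": 7.0},  # refers to a student that is not listed
]


def orphaned_rows(child_rows, parent_rows, key):
    """Return child rows whose key value has no match in the parent rows."""
    parent_keys = {row[key] for row in parent_rows}
    return [row for row in child_rows if row[key] not in parent_keys]


orphans = orphaned_rows(scores, students, "student_id")
print(orphans)  # [{'student_id': 'S03', 'score': 7.0}]
```

An empty result means every score can be traced back to a real student, so the relationship between the two datasets holds.
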

Finally, let's discuss uniqueness. Uniqueness refers to whether the dataset
has duplicate or overlapping data. A unique dataset shouldn't have any
identical data entries. The importance of uniqueness cannot be
overstated, as duplicate or overlapping data can push results too high or
too low, leading to poor decisions and detrimental effects.
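
Here is a small sketch of how duplicates might be spotted and how they distort a simple count. It assumes each student is identified by a hypothetical `student_id` field; the rows themselves are invented.

```python
from collections import Counter

# Hypothetical rows; the student S01 was entered twice.
rows = [
    {"student_id": "S01", "name": "An"},
    {"student_id": "S02", "name": "Binh"},
    {"student_id": "S01", "name": "An"},
    {"student_id": "S03", "name": "Chi"},
]


def duplicate_keys(records, key_fields):
    """Return each key that appears more than once, with its count."""
    counts = Counter(tuple(record[f] for f in key_fields) for record in records)
    return {key: n for key, n in counts.items() if n > 1}


duplicates = duplicate_keys(rows, ["student_id"])
unique_students = len({row["student_id"] for row in rows})

print(duplicates)  # {('S01',): 2}
print(len(rows), "rows but only", unique_students, "students")
```

Counting rows would report four students when there are only three, which is exactly the kind of overestimate that duplicate entries cause.
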
