
Requirements:

1- Data Quality:

- Can you spot data quality issues in the dataset? If yes summarize them

• Yes. The customer gender values should be unified into two types, e.g. M = Male and F = Female. I used
the one-letter abbreviations M and F for easier and faster data entry.
• (Incompleteness) = missing phone numbers, national IDs, and DaysUntilDelivery values.
• (Inaccuracy) = there are junk values in the national IDs, e.g. 1.08815E+12, and in the phone numbers,
e.g. ###############.
• (Typos) = there are spelling errors in the region names, e.g. Tabook, Riyad, Medina, Mecca, Baha.
• (Non-standard) = phone numbers (12 digits) and national IDs (10 digits) should each have a
consistent length.
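
The checks above could be sketched as a small profiling pass. This is only an illustration: the sample rows, the field names, and the `classify` helper are hypothetical, and only the expected lengths (12 and 10) come from the notes above.

```python
from collections import Counter

# Hypothetical rows that only mirror the kinds of problems described above.
rows = [
    {"gender": "M", "phone": "966501234567", "national_id": "1088151234"},
    {"gender": "Male", "phone": "###############", "national_id": "1.08815E+12"},
    {"gender": "f", "phone": "", "national_id": ""},
]

# Expected lengths as stated in the notes above.
EXPECTED_LENGTH = {"phone": 12, "national_id": 10}


def classify(field, value):
    """Label one identifier value as ok, missing, junk, or non-standard length."""
    value = value.strip()
    if not value:
        return "missing"
    if not value.isdigit():  # catches '1.08815E+12' and '###############'
        return "junk"
    if len(value) != EXPECTED_LENGTH[field]:
        return "non-standard length"
    return "ok"


report = Counter()
for row in rows:
    for field in EXPECTED_LENGTH:
        report[(field, classify(field, row[field]))] += 1
    if row["gender"] not in {"M", "F"}:
        report[("gender", "non-standard")] += 1

for (field, label), count in sorted(report.items()):
    print(f"{field:12} {label:20} {count}")
```

Counting issues per field and per type gives a quick summary of how much of the sheet is affected before any fixing starts.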

- What can be done to fix these issues? Please apply the fixes on the sheet; otherwise, document
your notes in the medium of your choice.

• The gender values were standardized to either Male or Female.

• The missing phone numbers can't be fixed by the data analyst because the entries were never
captured, but we can avoid the problem in future by restricting phone number input in the sheet to
either 12 digits starting with 966 (966#########) or 10 digits starting with 05 (05########).
National IDs should be restricted to 10 digits.
• The region names were corrected to the standard spelling.
• The website should enforce the same restrictions: the phone number box should accept only 12 or
10 digits, and the national ID box only 10 digits.
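
A minimal sketch of how the standardization and input restrictions above might look in code. The phone and ID patterns follow the rules stated in the notes; the target region spellings in `REGION_FIXES` are assumptions, since the notes list the misspellings but not the corrected values, and all function names are hypothetical.

```python
import re

# ASSUMPTION: these target spellings are illustrative; the notes do not state them.
REGION_FIXES = {
    "Tabook": "Tabuk",
    "Riyad": "Riyadh",
    "Medina": "Madinah",
    "Mecca": "Makkah",
    "Baha": "Al Baha",
}
GENDER_FIXES = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

# 12 digits starting with 966, or 10 digits starting with 05.
PHONE_RE = re.compile(r"966[0-9]{9}|05[0-9]{8}")
# Exactly 10 digits.
NATIONAL_ID_RE = re.compile(r"[0-9]{10}")


def standardize_gender(value):
    """Map known gender variants to Male/Female; leave unknown values untouched."""
    return GENDER_FIXES.get(value.strip().lower(), value)


def standardize_region(value):
    """Replace a known misspelling with its assumed standard spelling."""
    cleaned = value.strip()
    return REGION_FIXES.get(cleaned, cleaned)


def is_valid_phone(value):
    return PHONE_RE.fullmatch(value.strip()) is not None


def is_valid_national_id(value):
    return NATIONAL_ID_RE.fullmatch(value.strip()) is not None


print(standardize_gender("f"), standardize_region("Riyad"))
print(is_valid_phone("966501234567"), is_valid_phone("###############"))
```

Running the validators at entry time, in the sheet or on the website, rejects junk values before they are stored, which is cheaper than cleaning them afterwards.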

- What do you think are possible ways, methods, or techniques to ensure data quality across an
organization?

1. Data Governance: Establish policies and roles for data quality management.

2. Data Profiling: Analyze data to understand structure and identify anomalies.

3. Data Validation: Implement rules to ensure data adherence to standards.

4. Data Cleansing: Correct errors and standardize data using automated tools.

5. Data Documentation: Maintain up-to-date documentation, including metadata.

6. Data Quality Metrics: Define and monitor key performance indicators.

7. Data Quality Audits: Regularly audit data quality, using both automated and manual checks.

8. Data Training and Awareness: Train staff on data quality importance and best practices.

9. Master Data Management (MDM): Maintain a single, accurate version of master data.

10. Automated Data Quality Tools: Use specialized tools for automated checks and monitoring.

11. Data Integration Best Practices: Follow best practices for consistent data integration.

12. User Input and Feedback: Encourage users to report data quality issues for continuous
improvement.

To ensure data quality across an organization, key methods include implementing a robust data
governance framework, conducting data profiling, enforcing validation rules, and employing
automated tools for data cleansing. Maintaining comprehensive documentation, conducting
regular audits, and monitoring key performance indicators are crucial. Fostering a data-centric
culture through training, implementing master data management, and utilizing advanced tools
also play essential roles in ensuring data quality.

- How would you deal with missing data?

Missing identity data, such as phone numbers and national IDs, can't be replaced with statistical
estimates such as the mean, median, or mode, because each value is unique to one customer.

For other missing data, we can impute values using the mean, median, or mode.
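
As a rough illustration of this split, the sketch below fills a missing numeric field with the median and leaves missing identifiers empty. The records, the field names, and the `impute_days` helper are hypothetical, and the median is just one of the three options mentioned above.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
records = [
    {"phone": "966501234567", "days_until_delivery": 2},
    {"phone": None, "days_until_delivery": None},
    {"phone": "0501234567", "days_until_delivery": 5},
    {"phone": None, "days_until_delivery": 3},
]


def impute_days(rows):
    """Fill missing days_until_delivery with the median of the known values.

    Identity fields such as phone are deliberately left as None.
    """
    known = [r["days_until_delivery"] for r in rows if r["days_until_delivery"] is not None]
    fill_value = median(known)
    result = []
    for r in rows:
        filled = dict(r)  # copy so the original rows stay unchanged
        was_missing = filled["days_until_delivery"] is None
        if was_missing:
            filled["days_until_delivery"] = fill_value
        filled["days_imputed"] = was_missing  # keep a flag for later analysis
        result.append(filled)
    return result


cleaned = impute_days(records)
for row in cleaned:
    print(row)
```

Keeping a `days_imputed` flag makes it possible to run the later statistics both with and without the filled-in values.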

2- Descriptive Statistics:

- Calculate the Mean, Median and Mode of monthly last mile shipments, and explain what each
metric means.

Mean (Average):

Definition: The mean, or average, is a measure of central tendency that represents the sum of a set of
values divided by the total number of values in the dataset.

Median:

Definition: The median is the middle value of a dataset when it is ordered in ascending or descending
numerical order. If there is an even number of values, the median is the average of the two middle
values.

Mode:

Definition: The mode is the value that appears most frequently in a dataset. A dataset may have no
mode, one mode (unimodal), or multiple modes (multimodal) if two or more values have the same
highest frequency.

Note: A dataset with one mode is called unimodal, while a dataset with two or more modes is
multimodal.
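
The three definitions can be tried out with Python's standard `statistics` module. The monthly counts below are made up for illustration and are not taken from the assessment sheet.

```python
from statistics import mean, median, multimode

# Hypothetical monthly shipment counts, Jan through Dec.
monthly_shipments = [120, 95, 130, 130, 90, 125, 140, 110, 135, 92, 98, 99]

avg = mean(monthly_shipments)         # sum of values divided by their count
mid = median(monthly_shipments)       # average of the two middle values (even count)
modes = multimode(monthly_shipments)  # list of all most frequent values

print(f"mean={avg:.2f} median={mid} modes={modes}")
```

`multimode` returns a list, so it also covers the multimodal case described above; when every value occurs once, it returns all of them.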

- What months are below the mean and what months are above the mean and tell us what that
could indicate.

Category                                   Months
Below the mean (without missing data)      Feb, May, Oct, Nov, Dec
Above the mean (without missing data)      Jan, Mar, Apr, Jun, Jul, Sep
Below the mean (with missing data)         Feb, Mar, May, Aug, Oct, Nov, Dec
Above the mean (with missing data)         Jan, Mar, Apr, Jun, Jul, Sep
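
One way such a split could be computed is sketched below. The shipment counts are hypothetical, and the treatment of missing months (dropping them before taking the mean) is an assumption about what "without missing data" means here.

```python
from statistics import mean

# Hypothetical monthly shipment counts; None marks a month with missing data.
shipments = {
    "Jan": 140, "Feb": 90, "Mar": 125, "Apr": 130, "May": 95, "Jun": 128,
    "Jul": 138, "Aug": None, "Sep": 135, "Oct": 92, "Nov": 100, "Dec": 105,
}


def split_by_mean(data):
    """Return (mean, months below it, months above it), ignoring missing months."""
    known = {month: value for month, value in data.items() if value is not None}
    avg = mean(list(known.values()))
    below = [month for month, value in known.items() if value < avg]
    above = [month for month, value in known.items() if value > avg]
    return avg, below, above


avg, below, above = split_by_mean(shipments)
print(f"mean = {avg:.1f}")
print("below:", ", ".join(below))
print("above:", ", ".join(above))
```

A month can only land in one of the two lists, which is a useful sanity check on a manually built table.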

- What are the highest deliveries months and what are the lowest?

The highest delivery months are Jan, Jul, and Sep, and the lowest are Feb, May, and Oct (without
missing data).

The highest delivery months are Apr, Jun, and Jul, and the lowest are Feb, May, and Oct (with
missing data).

- What are the proportions between Male and Female in terms of gender in the customer
population.

Female = 200

Male = 300

Male customers outnumber female customers by a ratio of 1.5 to 1 (300 vs. 200), which is 60% male
and 40% female.
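
The arithmetic behind the proportion can be checked in a few lines, using the two counts given above. The variable names are only illustrative.

```python
# Counts taken from the answer above.
counts = {"Female": 200, "Male": 300}

total = sum(counts.values())
shares = {gender: n / total for gender, n in counts.items()}
male_to_female = counts["Male"] / counts["Female"]

for gender, share in shares.items():
    print(f"{gender}: {share:.0%}")
print(f"Male-to-female ratio: {male_to_female}")
```

The ratio (1.5) and the percentage shares (60% and 40%) describe the same data in two different ways, so it helps to label which one is being reported.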

- Can you define the interquartile range and calculate it for days until delivery?

- Can you specify outlier deliveries in days until delivery and what method you will use?
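
For the two questions above, the interquartile range (IQR) is the difference between the third and first quartiles, Q3 - Q1. One common way to flag outliers with it is Tukey's rule: values below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR. The sketch below is only an illustration on made-up delivery times, not the result for the assessment sheet, and quartile conventions differ slightly between tools.

```python
from statistics import quantiles

# Hypothetical days-until-delivery values.
days = [1, 2, 2, 3, 3, 3, 4, 4, 5, 15]

# Quartiles using the "inclusive" method (data treated as the full population).
q1, q2, q3 = quantiles(days, n=4, method="inclusive")
iqr = q3 - q1

# Tukey's fences: anything outside them is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [d for d in days if d < lower_fence or d > upper_fence]

print(f"Q1={q1} Q3={q3} IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}) outliers={outliers}")
```

Missing DaysUntilDelivery values would need to be dropped or imputed before this step, since `quantiles` expects numeric input only.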

- Provide insights on status of shipments across the year.

- Provide insights on services of shipments across the year.

- In terms of regions, status and services kindly provide an analysis into that relation and
mention the trends you can notice.
