
ASSIGNMENT - I

Data Science

7-B
Submitted on: March 24, 2022

Haider Nawaz Janjua


SP19-BCS-083
COMSATS UNIVERSITY ISLAMABAD

Department of Computer Science


Class Assignment – I
CLO-(C1)

Part A
Q1: What is Data Science?

Answer: Data Science is the field in which data is first prepared for
analysis and then analyzed to reveal interesting patterns and insights,
which are presented to stakeholders so that they can make informed
decisions.

In short, Data Science is about discovering useful information and insights
in data.

Q2: Discuss the life cycle of Data Science

Answer: The Team Data Science Process (TDSP, by Microsoft)[1], which is
based on CRISP-DM, outlines the following major stages in the life cycle:
1. Business understanding
▪ Objectives are defined (what question(s) to answer).
▪ Data sources are identified (finding relevant data that will
help in answering the question(s)).
2. Data acquisition and understanding
▪ Data is ingested (moved from the source to the target location).
▪ Data is explored.
3. Modeling
▪ Feature engineering (features are created and selected)
▪ Model training (splitting the dataset, building a model, and
evaluating it)
4. Deployment
▪ The model is put into production.

5. Customer acceptance
▪ Confirm that the deployment satisfies the customer’s objectives.

[1] https://docs.microsoft.com/en-us/azure/architecture/data-science-process/lifecycle-data
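
The model-training step of the Modeling stage (splitting the dataset, building a model, and evaluating it) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not part of TDSP itself: the helper names (train_test_split, fit_line, mean_squared_error), the 25% test fraction, and the toy data are all hypothetical, and a straight-line fit stands in for whatever model a real project would use.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(rows, slope, intercept):
    """Average squared prediction error on the given rows."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in rows) / len(rows)


# Hypothetical toy data that follows y = 2x + 1 exactly.
data = [(x, 2 * x + 1) for x in range(20)]

train, test = train_test_split(data)                   # 1. split the dataset
slope, intercept = fit_line(train)                     # 2. build a model
test_mse = mean_squared_error(test, slope, intercept)  # 3. evaluate on held-out rows

print(f"slope={slope:.2f} intercept={intercept:.2f} test MSE={test_mse:.4f}")
```

Evaluating on rows the model never saw during fitting is the reason for the split: it gives an estimate of how the model would behave on new data.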

Q3: List the tools / languages used in this domain along with the
algorithms / techniques applied to get results.

Answer:
1. Tools:
▪ SAS
▪ Apache Spark
▪ Microsoft Excel
▪ Hadoop
2. Languages:
▪ R
▪ Python
▪ Scala
3. Algorithms:
▪ Linear Regression
▪ Logistic Regression
▪ Naïve Bayes
▪ KNN
▪ Support Vector Machine
▪ K-Means

4. Techniques[2]:
▪ Pattern Recognition
▪ Bootstrapping
▪ Resampling
▪ Cross Validation

[2] https://seleritysas.com/blog/2021/01/22/key-data-science-modeling-techniques-used-in-data-evaluation-and-analysis/

Q4: Also highlight any 3 use cases of Data Science.

Answer:
1. Price Optimization:
Multiple price points are analyzed to determine which price
maximizes sales as well as profits.
2. Customer Segmentation and Clustering:
Customers are segmented based on survey responses, geographic area,
or shopping habits. This enables personalized advertising as well as
a tailored sales strategy for each segment.
3. Sentiment Analysis:
Text such as feedback, social media posts, and survey responses is
analyzed to better understand users’ motivations.
4. Fraud Detection:
Algorithms are used to detect fake profiles, unusual transactions,
and fake insurance claims.

Part B

Question 1

A fridge contains a pharmaceutical company's new vaccine. We measure its
temperature every hour, with the following data points (in Fahrenheit):

31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26

• Calculate the mean, median, and standard deviation.
Answer:
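
The requested statistics can be computed directly from the fifteen readings. Below is a minimal sketch using Python's standard statistics module. The variable names are illustrative, and both the population and the sample standard deviation are shown because the question does not say which one is intended.

```python
import statistics

# Hourly fridge temperatures in Fahrenheit, copied from the question.
temps_f = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

mean = statistics.mean(temps_f)
median = statistics.median(temps_f)
mode = statistics.mode(temps_f)
pop_sd = statistics.pstdev(temps_f)    # divides by n
sample_sd = statistics.stdev(temps_f)  # divides by n - 1

print(f"mean={mean:.2f} median={median} mode={mode}")
# mean=30.73 median=31 mode=31
print(f"population sd={pop_sd:.2f} sample sd={sample_sd:.2f}")
# population sd=2.52 sample sd=2.60
```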

• Plot the data and see if it is skewed or not.


Answer:

No, the graph is not skewed and is approximately normal. This is because 1) the mean, median, and
mode are approximately the same (i.e., 31); 2) the number of data points before and after the peak
is the same, i.e., 7 (symmetric); and 3) we can see the bell shape.
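
As a rough text-only stand-in for the plot, the sketch below counts how often each temperature occurs and prints one row of '#' marks per value. The layout is an assumption made for illustration; it is not the chart used in the original answer.

```python
from collections import Counter

# Hourly fridge temperatures in Fahrenheit, copied from the question.
temps_f = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

counts = Counter(temps_f)

# One row per whole degree between the lowest and highest reading;
# Counter returns 0 for temperatures that never occur.
for value in range(min(temps_f), max(temps_f) + 1):
    print(f"{value:>3} | {'#' * counts[value]}")
```

The longest row is the one for 31, which matches the mode and median reported for this data set.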

• Which measure is better, mean, mode or median?
Answer:
Since the data is approximately normal, the choice doesn’t really matter: the mean, median, and
mode are approximately the same.

Question 2

The scores of students in their maths test:


86, 86, 93, 86, 67, 85, 75, 64, 78, 72, 21

• Calculate the mean, median, and standard deviation.
Answer:
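
For the test scores, the same quantities can be worked out step by step from their definitions. The sketch below is illustrative: the variable names are assumptions, and both the population (divide by n) and sample (divide by n - 1) standard deviation are computed because the question does not specify which is wanted.

```python
import math

# Maths test scores, copied from the question.
scores = [86, 86, 93, 86, 67, 85, 75, 64, 78, 72, 21]
n = len(scores)

# Mean: sum of the values divided by their count.
mean = sum(scores) / n

# Median: middle value of the sorted list (average of the two middle
# values when the count is even).
ordered = sorted(scores)
mid = n // 2
median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

# Standard deviation: square root of the average squared deviation.
squared_deviations = sum((x - mean) ** 2 for x in scores)
pop_sd = math.sqrt(squared_deviations / n)
sample_sd = math.sqrt(squared_deviations / (n - 1))

print(f"mean={mean:.1f} median={median}")
# mean=73.9 median=78
print(f"population sd={pop_sd:.2f} sample sd={sample_sd:.2f}")
# population sd=18.81 sample sd=19.73
```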

• Plot the data and see if it is skewed or not.


Answer:

Yes, it is skewed (to the left, since the mean of 73.9 is below the median of 78, which is below
the mode of 86). This is because 1) the mean, median, and mode are NOT the same; 2) the graph is
not symmetric around the peak; and 3) we cannot see the bell shape.
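
One way to put a number on this is Pearson's second (median) skewness coefficient, 3 × (mean − median) / standard deviation. This measure is not used in the original answer; it is brought in here as an assumption, it is only one of several skewness measures, and the sample standard deviation is an arbitrary choice for the denominator.

```python
import statistics

# Maths test scores, copied from the question.
scores = [86, 86, 93, 86, 67, 85, 75, 64, 78, 72, 21]

mean = statistics.mean(scores)
median = statistics.median(scores)
sample_sd = statistics.stdev(scores)

# Pearson's second skewness coefficient: negative values indicate a
# longer tail on the low side (left skew).
skew = 3 * (mean - median) / sample_sd

print(f"skewness={skew:.2f}")
# skewness=-0.62
```

The negative sign agrees with the observation that the mean lies below the median.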

• Which measure is better, mean, mode or median?


Answer:
Mean:
The mean (73.9) is pulled down by an outlier (21 marks), so it is not preferred.

Between Median and Mode:
Most of the data lies between the late 70s and mid 80s. The median (78) represents this well,
whereas the mode (86) does not represent the data from the late 70s.

That is why the median is preferred in this case.
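
The effect of the outlier can be checked by recomputing both measures with the score of 21 left out. This is a sketch for illustration only; dropping the value is a thought experiment, not a recommendation to discard the student's result.

```python
import statistics

# Maths test scores, copied from the question.
scores = [86, 86, 93, 86, 67, 85, 75, 64, 78, 72, 21]
without_outlier = [s for s in scores if s != 21]

mean_shift = statistics.mean(without_outlier) - statistics.mean(scores)
median_shift = statistics.median(without_outlier) - statistics.median(scores)

print(f"mean moves by {mean_shift:.2f}, median moves by {median_shift:.2f}")
# mean moves by 5.29, median moves by 3.50
```

The mean moves further than the median when the single low score is removed, which is consistent with treating the median as the more robust summary here.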
