DA Exam Paper
Q1. b) A mutual fund achieved the following rates of growth over an 11-month period: {3%
2% 7% 8% 2% 4% 3% 7.5% 7.2% 2.7% 2.09%}.
Answer
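The question text appears truncated and no answer is given. Assuming it asks for the average monthly rate of growth, the geometric mean of the growth factors is the appropriate average; a minimal sketch in Python, using the rates from the question:

```python
import math

# Monthly growth rates from the question, as decimals
rates = [0.03, 0.02, 0.07, 0.08, 0.02, 0.04, 0.03, 0.075, 0.072, 0.027, 0.0209]

# Geometric mean of the growth factors (1 + r), minus 1
product = math.prod(1 + r for r in rates)
avg_growth = product ** (1 / len(rates)) - 1
print(f"Average monthly growth: {avg_growth:.4%}")
```

This yields roughly 4.4% per month; the geometric mean is used rather than the arithmetic mean because growth rates compound multiplicatively.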
Q1. c) The following observations are arranged in ascending order: 17, x, 24, x + 7, 35, 36, 46, 53. The median of the data is 25; find the value of x.
Answer: With 8 observations, the median is the mean of the 4th and 5th values: ((x + 7) + 35) / 2 = 25, so x + 42 = 50 and x = 8.
Q1. d) The mean of the following distribution is 2.5 and fi > 0. Find the value of p and the corresponding frequencies.
xi: 0  1  2  3  4  5
fi: 5  3  p  7  p-1  4
Answer:
Mean = Σ fi·xi / Σ fi = (0·5 + 1·3 + 2p + 3·7 + 4(p − 1) + 5·4) / (5 + 3 + p + 7 + (p − 1) + 4) = 2.5
(40 + 6p) / (18 + 2p) = 2.5
2(40 + 6p) = 5(18 + 2p)
80 + 12p = 90 + 10p
p = 5, so the frequencies are p = 5 and p − 1 = 4.
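The arithmetic can be double-checked by substituting p = 5 into the distribution and recomputing the weighted mean:

```python
xi = [0, 1, 2, 3, 4, 5]
p = 5
fi = [5, 3, p, 7, p - 1, 4]

# Mean of a frequency distribution: sum(fi * xi) / sum(fi)
mean = sum(f * x for f, x in zip(fi, xi)) / sum(fi)
print(mean)  # 2.5
```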
Q1. e) What is the primary purpose of the Shuffle phase in Hadoop MapReduce, and how does it
contribute to the overall efficiency of data processing?
Answer:
The primary purpose of the Shuffle phase in Hadoop MapReduce is to organize and redistribute the
intermediate key-value pairs produced by the Map phase before they are sent to the Reduce phase.
During Shuffle, data is sorted, grouped, and transferred across the network to the appropriate nodes
where the Reduce tasks will execute. This phase plays a crucial role in optimizing the data
distribution and ensuring that each Reduce task has the necessary input data for further processing.
The efficiency of the Shuffle phase directly impacts the overall performance of the MapReduce job
by minimizing data transfer and facilitating parallel processing in the subsequent Reduce phase.
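The grouping step of Shuffle can be illustrated with a toy word-count example. This is a simplified, single-machine sketch of what Hadoop does across the network, not the actual framework API:

```python
from collections import defaultdict

# Intermediate key-value pairs emitted by Map tasks: (word, 1)
map_output = [("data", 1), ("big", 1), ("data", 1), ("map", 1), ("big", 1)]

# Shuffle: sort by key and group all values for a key together,
# so each Reduce task receives a complete (key, [values]) input
groups = defaultdict(list)
for key, value in sorted(map_output):
    groups[key].append(value)

# Reduce: aggregate the grouped values per key
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'map': 1}
```

In a real cluster the grouped pairs for each key are transferred over the network to the node running the corresponding Reduce task, which is why an efficient Shuffle minimizes data movement.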
2.
(i) A finance company wants to evaluate their users, on the basis of loans they have taken. They have
hired you to find the number of cases per location and categorize the count with respect to the
reason for taking a loan. Next, they have also tasked you to display their average risk score. Discuss
and then model your views concerning descriptive and predictive analytics.
Answer:
1. Descriptive Analytics:
• Use descriptive statistics to summarize the number of cases for each location.
• Categorize the counts with respect to the reasons for taking a loan.
• Present the information in tabular form or create visualizations like bar charts to
highlight the distribution.
• Identify any patterns or trends in risk scores across different user segments or
locations.
2. Predictive Analytics (Data Exploration and Modelling):
• Identify any outliers or anomalies in the data that might affect the analysis.
• Split the data into training and testing sets to evaluate the model's performance.
• Utilize relevant features such as credit history, income, loan amount, etc.
• Evaluate the model's accuracy, precision, recall, and other metrics to assess its
effectiveness.
3. Segmentation Analysis:
• Analyze the risk profile within each segment to tailor loan approval processes.
4. Optimization Strategies:
• Identify optimization strategies for the loan approval process based on predictive
insights.
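The descriptive part of the task (case counts per location and reason, plus the average risk score) can be sketched with the standard library. The field names (location, reason, risk_score) and the sample records are hypothetical, since no schema is given in the question:

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical loan records; field names are assumed, not given in the question
loans = [
    {"location": "Delhi",  "reason": "home",      "risk_score": 0.4},
    {"location": "Delhi",  "reason": "car",       "risk_score": 0.7},
    {"location": "Mumbai", "reason": "home",      "risk_score": 0.3},
    {"location": "Mumbai", "reason": "education", "risk_score": 0.5},
    {"location": "Pune",   "reason": "car",       "risk_score": 0.6},
]

# Number of cases per (location, loan reason) category
cases = Counter((loan["location"], loan["reason"]) for loan in loans)

# Average risk score per location
scores = defaultdict(list)
for loan in loans:
    scores[loan["location"]].append(loan["risk_score"])
avg_risk = {loc: mean(vals) for loc, vals in scores.items()}

print(cases)
print(avg_risk)
```

In practice a library such as pandas (groupby with size and mean) would be the idiomatic tool for this kind of descriptive summary.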
(ii) How do distributed and parallel computing play a role in a big data environment?
Answer: Consider the example of analyzing social media data to see how distributed and
parallel computing come into play in a big data environment.
Imagine a social media platform like Twitter, which generates a massive amount of data every
second with millions of users posting tweets, liking, and sharing content. To make sense of this data
and extract valuable insights, you need powerful computing resources.
Distributed computing allows the workload to spread across multiple machines or servers. In this
scenario, we could have a distributed system where different servers handle different tasks such as
collecting tweets, analyzing user engagement, detecting trends, and performing sentiment analysis.
Each server specializes in a specific aspect of the data processing pipeline.
Parallel computing further enhances the speed and efficiency of data processing by breaking down
tasks into smaller sub-tasks that can be executed simultaneously across multiple processors or cores
within each server.
For instance, when analyzing sentiment in tweets, parallel computing enables us to divide the
sentiment analysis task into smaller chunks, with each chunk being processed independently by
different processor cores. This allows for faster sentiment analysis of a large volume of tweets.
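The chunked, parallel sentiment analysis described above can be sketched with Python's process pool. The word-list scorer here is a stand-in for a real sentiment model:

```python
from concurrent.futures import ProcessPoolExecutor

# Stand-in sentiment scorer: +1 per positive word, -1 per negative word
POSITIVE, NEGATIVE = {"great", "love", "good"}, {"bad", "hate", "awful"}

def score_chunk(tweets):
    """Score one chunk of tweets independently of the others."""
    return [sum((w in POSITIVE) - (w in NEGATIVE) for w in t.lower().split())
            for t in tweets]

def parallel_sentiment(tweets, n_chunks=4):
    # Split the workload into chunks and score them on separate cores;
    # striding interleaves the results, so order differs from the input
    chunks = [tweets[i::n_chunks] for i in range(n_chunks)]
    with ProcessPoolExecutor() as pool:
        results = pool.map(score_chunk, chunks)
    return [s for chunk_scores in results for s in chunk_scores]

if __name__ == "__main__":
    tweets = ["I love this", "awful service", "good and bad", "great day"] * 5
    print(parallel_sentiment(tweets))
```

Each chunk is scored independently, so the wall-clock time scales down with the number of cores, which is exactly the benefit parallel computing brings to large tweet volumes.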
Overall, distributed and parallel computing work together in a big data environment to handle the
immense volume of data generated by platforms like social media, allowing for faster processing,
scalability, and efficient utilization of resources.
Q3.(i) Calculate the coefficient of covariance for the following data sample with 2 features.
Subsequently, comment on the type of covariance.
X 2 8 18 20 28 20
Y 5 12 18 23 45 50
Solution:
x̄ = (2 + 8 + 18 + 20 + 28 + 20) / 6 = 16 and ȳ = (5 + 12 + 18 + 23 + 45 + 50) / 6 = 25.5
Cov(X, Y) = Σ (xi − x̄)(yi − ȳ) / (n − 1) = 702 / 5 = 140.4 (with the population formula, 702 / 6 = 117)
Since Cov(X, Y) is positive, the two variables X and Y tend to move in the same direction, demonstrating positive covariance.
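A quick numerical check of the covariance, using the sample formula (dividing by n − 1):

```python
# Data from the question
X = [2, 8, 18, 20, 28, 20]
Y = [5, 12, 18, 23, 45, 50]

n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n  # 16.0 and 25.5

# Sample covariance: sum of deviation products divided by n - 1
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / (n - 1)
print(cov)  # 140.4
```

The sign of the result, not its magnitude, determines the type of covariance.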
Q3(ii). Consider your travel time in minutes from the hostel to library: 19, 20, 21, 22, 23, 24, 24, 20,
25, 26, 26, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 49, and 43. Draw the Box plot depicting lower
whisker, upper whisker, Q1, Q2, Q3 and subsequently determine the outliers. Illustrate the solution
in a step-by-step process.
Solution:
Step 1: Sort the data in ascending order: 19, 20, 20, 21, 22, 23, 24, 24, 25, 26, 26, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 43, 49.
Step 2: With 25 observations, the median (Q2) is the 13th value: Q2 = 29.
Step 3: The lower half is 19, 20, 20, 21, 22, 23, 24, 24, 25, 26, 26, 28, so Q1 = (23 + 24) / 2 = 23.5.
Step 4: The upper half is 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 43, 49, so Q3 = (35 + 36) / 2 = 35.5.
Step 5: IQR = Q3 − Q1 = 12, so the fences are Q1 − 1.5·IQR = 5.5 and Q3 + 1.5·IQR = 53.5. The lower whisker is 19 (the smallest value within the fences) and the upper whisker is 49 (the largest value within the fences).
Step 6: Outliers are data points below 5.5 or above 53.5. Since no data point falls outside this range, the sample has no outliers.
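The quartile and fence computation can be checked in Python. Note this follows the median-exclusive quartile method used in the solution; other conventions (e.g. NumPy's default interpolation) give slightly different quartiles:

```python
import statistics

times = [19, 20, 21, 22, 23, 24, 24, 20, 25, 26, 26, 28, 29, 30, 31,
         32, 33, 34, 35, 36, 37, 38, 39, 49, 43]

data = sorted(times)
n = len(data)                 # 25 observations
q2 = statistics.median(data)  # 29

# Median-exclusive halves: the middle value is left out when n is odd
lower, upper = data[: n // 2], data[(n + 1) // 2 :]
q1, q3 = statistics.median(lower), statistics.median(upper)  # 23.5, 35.5

iqr = q3 - q1                               # 12.0
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # (5.5, 53.5)
outliers = [x for x in data if x < fences[0] or x > fences[1]]
print(q1, q2, q3, fences, outliers)  # no outliers -> []
```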
Q4(ii) Imagine that in an article you read the statement that all adults on average sleep 7 hours a day.
You spend 7 days in collecting the following data (Monday = 7, Tuesday = 6, Wednesday = 9, Thursday
= 7, Friday = 8, Saturday = 9, and Sunday = 12). How can this assertion be validated with a
hypothesis test? The critical value is 2.58 at the 0.99 confidence level. Demonstrate the
solution step by step.
Solution:
Step 1: Set up the null and alternate hypothesis
H0: adults sleep an average of 7 hours per day (µ = 7).
Ha: adults do not sleep an average of 7 hours per day (µ ≠ 7).
Step 2: Determine the type of test to use
Since the problem supplies a z critical value (2.58), a one-sample z-test is used.
Step 3: Calculate the test statistic z using the formula z = (x̄ − µ0) / (s / √n),
where x̄ is the sample mean, µ0 is the hypothesized mean under the null hypothesis, s is the
standard deviation, and n is the sample size.
x̄ = (7 + 6 + 9 + 7 + 8 + 9 + 12) / 7 = 8.28
s = √(((8.28 − 7)² + (8.28 − 6)² + (8.28 − 9)² + (8.28 − 7)² + (8.28 − 8)² + (8.28 − 9)² + (8.28 − 12)²) / 7) = 1.82
μ0 = 7
Plugging the values into the formula: z = ((8.28 − 7) / 1.82) × √7 = 1.861
Step 4: Look up the critical value of z: at the 0.99 confidence level it is 2.58 (given in the
problem statement).
Step 5: Draw a conclusion
The calculated test statistic is less than the critical value (1.861 < 2.58), so we fail to reject
the null hypothesis.
Conclusion: the data are consistent with the claim that adults sleep an average of 7 hours a day.
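The full calculation can be reproduced in Python, using the population standard deviation as in the solution above:

```python
import math
import statistics

hours = [7, 6, 9, 7, 8, 9, 12]  # Monday through Sunday
mu0 = 7                         # hypothesized mean sleep (hours/day)
critical = 2.58                 # z critical value at 99% confidence

x_bar = statistics.mean(hours)   # about 8.29
s = statistics.pstdev(hours)     # about 1.83 (population formula)
z = (x_bar - mu0) / (s / math.sqrt(len(hours)))
print(round(z, 2), "reject H0" if abs(z) > critical else "fail to reject H0")
```

This gives z ≈ 1.86 (the solution's 1.861 differs only by rounding of x̄ and s), which is below 2.58, so the null hypothesis is not rejected.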
Q5. Find the linear regression equation for the following set of data.