DA Exam Paper
Q1. b) A mutual fund achieved the following rates of growth over an 11-month period: {3%
2% 7% 8% 2% 4% 3% 7.5% 7.2% 2.7% 2.09%}.
Answer
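The question text appears truncated and no answer is given. Assuming it asks for the average monthly rate of growth, the geometric mean of the growth factors is the appropriate average; a minimal sketch in Python, using the rates from the question:

```python
import math

# Monthly growth rates from the question, as decimals
rates = [0.03, 0.02, 0.07, 0.08, 0.02, 0.04, 0.03, 0.075, 0.072, 0.027, 0.0209]

# Geometric mean of the growth factors (1 + r), minus 1
product = math.prod(1 + r for r in rates)
avg_growth = product ** (1 / len(rates)) - 1
print(f"Average monthly growth: {avg_growth:.4%}")
```

This yields roughly 4.4% per month; the geometric mean is used rather than the arithmetic mean because growth rates compound multiplicatively.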
Q1. c) The following observations are arranged in ascending order: 17, x, 24, x + 7, 35, 36, 46, 53. The median of the data is 25; find the value of x.
Answer: With 8 observations, the median is the mean of the 4th and 5th values: ((x + 7) + 35) / 2 = 25, so x + 42 = 50 and x = 8.
Q1. d) The mean of the following distribution is 2.5 and fi > 0. Find the value of p and the corresponding frequencies.
xi: 0  1  2  3  4  5
fi: 5  3  p  7  p-1  4
Answer:
Mean = Σ fi·xi / Σ fi = (0·5 + 1·3 + 2p + 3·7 + 4(p − 1) + 5·4) / (5 + 3 + p + 7 + (p − 1) + 4) = 2.5
(40 + 6p) / (18 + 2p) = 2.5
2(40 + 6p) = 5(18 + 2p)
80 + 12p = 90 + 10p
p = 5, so the frequencies are p = 5 and p − 1 = 4.
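The arithmetic can be double-checked by substituting p = 5 into the distribution and recomputing the weighted mean:

```python
xi = [0, 1, 2, 3, 4, 5]
p = 5
fi = [5, 3, p, 7, p - 1, 4]

# Mean of a frequency distribution: sum(fi * xi) / sum(fi)
mean = sum(f * x for f, x in zip(fi, xi)) / sum(fi)
print(mean)  # 2.5
```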
Q1. e) What is the primary purpose of the Shuffle phase in Hadoop MapReduce, and how does it
contribute to the overall efficiency of data processing?
Answer:
The primary purpose of the Shuffle phase in Hadoop MapReduce is to organize and redistribute the
intermediate key-value pairs produced by the Map phase before they are sent to the Reduce phase.
During Shuffle, data is sorted, grouped, and transferred across the network to the appropriate nodes
where the Reduce tasks will execute. This phase plays a crucial role in optimizing the data
distribution and ensuring that each Reduce task has the necessary input data for further processing.
The efficiency of the Shuffle phase directly impacts the overall performance of the MapReduce job
by minimizing data transfer and facilitating parallel processing in the subsequent Reduce phase.
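The grouping step of Shuffle can be illustrated with a toy word-count example. This is a simplified, single-machine sketch of what Hadoop does across the network, not the actual framework API:

```python
from collections import defaultdict

# Intermediate key-value pairs emitted by Map tasks: (word, 1)
map_output = [("data", 1), ("big", 1), ("data", 1), ("map", 1), ("big", 1)]

# Shuffle: sort by key and group all values for a key together,
# so each Reduce task receives a complete (key, [values]) input
groups = defaultdict(list)
for key, value in sorted(map_output):
    groups[key].append(value)

# Reduce: aggregate the grouped values per key
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'map': 1}
```

In a real cluster the grouped pairs for each key are transferred over the network to the node running the corresponding Reduce task, which is why an efficient Shuffle minimizes data movement.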
2.
(i) A finance company wants to evaluate their users, on the basis of loans they have taken. They have
hired you to find the number of cases per location and categorize the count with respect to the
reason for taking a loan. Next, they have also tasked you to display their average risk score. Discuss
and then model your views concerning descriptive and predictive analytics.
Answer:
1. Descriptive Analytics:
• Use descriptive statistics to summarize the number of cases for each location.
• Categorize the counts with respect to the reasons for taking a loan.
• Present the information in tabular form or create visualizations like bar charts to
highlight the distribution.
• Identify any patterns or trends in risk scores across different user segments or
locations.
2. Predictive Analytics (Data Exploration and Modelling):
• Identify any outliers or anomalies in the data that might affect the analysis.
• Split the data into training and testing sets to evaluate the model's performance.
• Utilize relevant features such as credit history, income, loan amount, etc.
• Evaluate the model's accuracy, precision, recall, and other metrics to assess its
effectiveness.
3. Segmentation Analysis:
• Analyze the risk profile within each segment to tailor loan approval processes.
4. Optimization Strategies:
• Identify optimization strategies for the loan approval process based on predictive
insights.
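The descriptive part of the task (case counts per location and reason, plus the average risk score) can be sketched with the standard library. The field names (location, reason, risk_score) and the sample records are hypothetical, since no schema is given in the question:

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical loan records; field names are assumed, not given in the question
loans = [
    {"location": "Delhi",  "reason": "home",      "risk_score": 0.4},
    {"location": "Delhi",  "reason": "car",       "risk_score": 0.7},
    {"location": "Mumbai", "reason": "home",      "risk_score": 0.3},
    {"location": "Mumbai", "reason": "education", "risk_score": 0.5},
    {"location": "Pune",   "reason": "car",       "risk_score": 0.6},
]

# Number of cases per (location, loan reason) category
cases = Counter((loan["location"], loan["reason"]) for loan in loans)

# Average risk score per location
scores = defaultdict(list)
for loan in loans:
    scores[loan["location"]].append(loan["risk_score"])
avg_risk = {loc: mean(vals) for loc, vals in scores.items()}

print(cases)
print(avg_risk)
```

In practice a library such as pandas (groupby with size and mean) would be the idiomatic tool for this kind of descriptive summary.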
(ii) How do distributed and parallel computing play a role in a big data environment?
Answer: Consider the example of analyzing social media data to see how distributed and
parallel computing come into play in a big data environment.
Imagine a social media platform like Twitter, which generates a massive amount of data every
second with millions of users posting tweets, liking, and sharing content. To make sense of this data
and extract valuable insights, you need powerful computing resources.
Distributed computing allows the workload to spread across multiple machines or servers. In this
scenario, we could have a distributed system where different servers handle different tasks such as
collecting tweets, analyzing user engagement, detecting trends, and performing sentiment analysis.
Each server specializes in a specific aspect of the data processing pipeline.
Parallel computing further enhances the speed and efficiency of data processing by breaking down
tasks into smaller sub-tasks that can be executed simultaneously across multiple processors or cores
within each server.
For instance, when analyzing sentiment in tweets, parallel computing enables us to divide the
sentiment analysis task into smaller chunks, with each chunk being processed independently by
different processor cores. This allows for faster sentiment analysis of a large volume of tweets.
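The chunked, parallel sentiment analysis described above can be sketched with Python's process pool. The word-list scorer here is a stand-in for a real sentiment model:

```python
from concurrent.futures import ProcessPoolExecutor

# Stand-in sentiment scorer: +1 per positive word, -1 per negative word
POSITIVE, NEGATIVE = {"great", "love", "good"}, {"bad", "hate", "awful"}

def score_chunk(tweets):
    """Score one chunk of tweets independently of the others."""
    return [sum((w in POSITIVE) - (w in NEGATIVE) for w in t.lower().split())
            for t in tweets]

def parallel_sentiment(tweets, n_chunks=4):
    # Split the workload into chunks and score them on separate cores;
    # striding interleaves the results, so order differs from the input
    chunks = [tweets[i::n_chunks] for i in range(n_chunks)]
    with ProcessPoolExecutor() as pool:
        results = pool.map(score_chunk, chunks)
    return [s for chunk_scores in results for s in chunk_scores]

if __name__ == "__main__":
    tweets = ["I love this", "awful service", "good and bad", "great day"] * 5
    print(parallel_sentiment(tweets))
```

Each chunk is scored independently, so the wall-clock time scales down with the number of cores, which is exactly the benefit parallel computing brings to large tweet volumes.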
Overall, distributed and parallel computing work together in a big data environment to handle the
immense volume of data generated by platforms like social media, allowing for faster processing,
scalability, and efficient utilization of resources.
Q3.(i) Calculate the coefficient of covariance for the following data sample with 2 features.
Subsequently, comment on the type of covariance.
X 2 8 18 20 28 20
Y 5 12 18 23 45 50
Solution:
x̄ = (2 + 8 + 18 + 20 + 28 + 20) / 6 = 16 and ȳ = (5 + 12 + 18 + 23 + 45 + 50) / 6 = 25.5
Cov(X, Y) = Σ (xi − x̄)(yi − ȳ) / (n − 1) = 702 / 5 = 140.4 (with the population formula, 702 / 6 = 117)
Since Cov(X, Y) is positive, the two variables X and Y tend to move in the same direction, demonstrating positive covariance.
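A quick numerical check of the covariance, using the sample formula (dividing by n − 1):

```python
# Data from the question
X = [2, 8, 18, 20, 28, 20]
Y = [5, 12, 18, 23, 45, 50]

n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n  # 16.0 and 25.5

# Sample covariance: sum of deviation products divided by n - 1
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / (n - 1)
print(cov)  # 140.4
```

The sign of the result, not its magnitude, determines the type of covariance.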
Q3(ii). Consider your travel time in minutes from the hostel to library: 19, 20, 21, 22, 23, 24, 24, 20,
25, 26, 26, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 49, and 43. Draw the Box plot depicting lower
whisker, upper whisker, Q1, Q2, Q3 and subsequently determine the outliers. Illustrate the solution
in a step-by-step process.
Solution:
Step 1: Sort the data in ascending order: 19, 20, 20, 21, 22, 23, 24, 24, 25, 26, 26, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 43, 49.
Step 2: With 25 observations, the median (Q2) is the 13th value: Q2 = 29.
Step 3: The lower half is 19, 20, 20, 21, 22, 23, 24, 24, 25, 26, 26, 28, so Q1 = (23 + 24) / 2 = 23.5.
Step 4: The upper half is 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 43, 49, so Q3 = (35 + 36) / 2 = 35.5.
Step 5: IQR = Q3 − Q1 = 12, so the fences are Q1 − 1.5·IQR = 5.5 and Q3 + 1.5·IQR = 53.5. The lower whisker is 19 (the smallest value within the fences) and the upper whisker is 49 (the largest value within the fences).
Step 6: Outliers are data points below 5.5 or above 53.5. Since no data point falls outside this range, the sample has no outliers.
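The quartile and fence computation can be checked in Python. Note this follows the median-exclusive quartile method used in the solution; other conventions (e.g. NumPy's default interpolation) give slightly different quartiles:

```python
import statistics

times = [19, 20, 21, 22, 23, 24, 24, 20, 25, 26, 26, 28, 29, 30, 31,
         32, 33, 34, 35, 36, 37, 38, 39, 49, 43]

data = sorted(times)
n = len(data)                 # 25 observations
q2 = statistics.median(data)  # 29

# Median-exclusive halves: the middle value is left out when n is odd
lower, upper = data[: n // 2], data[(n + 1) // 2 :]
q1, q3 = statistics.median(lower), statistics.median(upper)  # 23.5, 35.5

iqr = q3 - q1                               # 12.0
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # (5.5, 53.5)
outliers = [x for x in data if x < fences[0] or x > fences[1]]
print(q1, q2, q3, fences, outliers)  # no outliers -> []
```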
Q4(ii) Imagine that in an article you read the statement that all adults on average sleep 7 hours a day.
You spend 7 days in collecting the following data (Monday = 7, Tuesday = 6, Wednesday = 9, Thursday
= 7, Friday = 8, Saturday = 9, and Sunday = 12). How can this assertion be validated with a
hypothesis test? The critical value is 2.58 at the 0.99 confidence level. Demonstrate the
solution step by step.
Solution:
Step 1: Set up the null and alternate hypothesis
H0: adults sleep an average of 7 hours per day (µ = 7).
Ha: adults do not sleep an average of 7 hours per day (µ ≠ 7).
Step 2: Determine the type of test to use
Since the problem supplies a z critical value (2.58), a one-sample z-test is used.
Step 3: Calculate the test statistic z using the formula z = (x̄ − µ0) / (s / √n),
where x̄ is the sample mean, µ0 is the hypothesized mean under the null hypothesis, s is the
standard deviation, and n is the sample size.
x̄ = (7 + 6 + 9 + 7 + 8 + 9 + 12) / 7 = 8.28
s = √(((8.28 − 7)² + (8.28 − 6)² + (8.28 − 9)² + (8.28 − 7)² + (8.28 − 8)² + (8.28 − 9)² + (8.28 − 12)²) / 7) = 1.82
μ0 = 7
Plugging the values into the formula: z = ((8.28 − 7) / 1.82) × √7 = 1.861
Step 4: Look up the critical value of z: at the 0.99 confidence level it is 2.58 (given in the
problem statement).
Step 5: Draw a conclusion
The calculated test statistic is less than the critical value (1.861 < 2.58), so we fail to reject
the null hypothesis.
Conclusion: the data are consistent with the claim that adults sleep an average of 7 hours a day.
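The full calculation can be reproduced in Python, using the population standard deviation as in the solution above:

```python
import math
import statistics

hours = [7, 6, 9, 7, 8, 9, 12]  # Monday through Sunday
mu0 = 7                         # hypothesized mean sleep (hours/day)
critical = 2.58                 # z critical value at 99% confidence

x_bar = statistics.mean(hours)   # about 8.29
s = statistics.pstdev(hours)     # about 1.83 (population formula)
z = (x_bar - mu0) / (s / math.sqrt(len(hours)))
print(round(z, 2), "reject H0" if abs(z) > critical else "fail to reject H0")
```

This gives z ≈ 1.86 (the solution's 1.861 differs only by rounding of x̄ and s), which is below 2.58, so the null hypothesis is not rejected.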
Q5. Find the linear regression equation for the following set of data.