File 482621234 482621234 - Assignment 2 - 7378831553794248
Module-2
Assignment-2
General Instructions:
1. The problem statement and the datasets for the assignment can be downloaded from the
reference documents section.
2. The Python programming assignment consists of 4 sections.
3. Learners must submit solutions for all 4 sections.
4. Read the problem statement carefully before answering.
5. Provide appropriate comments in your code.
6. Perform all the mentioned tasks programmatically using Python libraries.
Submission Instructions:
1. Create a separate Jupyter Notebook for each of the 4 sections.
2. Name the Jupyter Notebook in the given format:
<Assignment2-Section>_<your email Id>.ipynb
(E.g.: Assignment2-A_john@abc.com.ipynb, Assignment2-B_john@abc.com.ipynb, Assignment2-C_john@abc.com.ipynb)
3. Create a folder with the name in the given format: <your email Id>
(Eg: john@abc.com)
4. Place all 4 Jupyter Notebooks in the folder.
5. Zip the folder and upload it in the ‘Upload Submission’ section on the link page.
6. After uploading the assignment, click on the ‘Finish’ button for the final submission of your
assignment.
Note: Multiple submissions are allowed, but only the latest submission (.zip file) will be considered
for evaluation.
Assignment – 2A
A health insurance company is developing new schemes. This requires forecasting the medical expenses
for a specific group of the insured population. Even though forecasting medical expenses is difficult,
there are a few predominant features that can help. These features, recorded for past instances, are
available in “insurance_claim.csv”. This dataset has 1338 rows and 6 columns. The details of the
features are as follows:
Based on this data, the company would like to build a predictive model that predicts the average medical
care expenses for a given group of people. Build the best model for the given scenario by importing
the dataset “insurance_claim.csv”.
Problem statement:
2. Select ‘charges’ as the target variable to be predicted and the remaining features as predictors.
3. Split the data into training and testing data sets in the ratio 75:25.
4. As part of model building, perform the following activities:
a. Based on the training data, build a Linear Regression model.
b. Find the train and the test score for the built model.
c. Calculate the adjusted R-Squared values on both the train and the test data.
5. Calculate the VIF values for all the features considered while building the model using the train data.
6. Based on the model built, predict the ‘charges’ of a new test sample which is given below:
age gender bmi dependent smoker
30 male 29 3 yes
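
The model-building steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the data is synthetic (made-up values standing in for “insurance_claim.csv”), the 0/1 encoding of ‘gender’ and ‘smoker’ is one possible choice, and the helper names adjusted_r2 and vif are hypothetical. VIF is computed directly from its definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for insurance_claim.csv (values are made up for illustration).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "gender": rng.choice(["male", "female"], n),
    "bmi": rng.normal(30, 6, n).round(1),
    "dependent": rng.integers(0, 5, n),
    "smoker": rng.choice(["yes", "no"], n, p=[0.2, 0.8]),
})
df["charges"] = (250 * df["age"] + 300 * df["bmi"] + 500 * df["dependent"]
                 + 20000 * (df["smoker"] == "yes") + rng.normal(0, 3000, n))

# Assumed encoding: map the two-level categorical columns to 0/1.
X = df.drop(columns="charges").assign(
    gender=lambda d: (d["gender"] == "male").astype(int),
    smoker=lambda d: (d["smoker"] == "yes").astype(int),
)
y = df["charges"]

# 75:25 split, as the problem statement asks.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LinearRegression().fit(X_train, y_train)
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

adj_train = adjusted_r2(r2_train, *X_train.shape)
adj_test = adjusted_r2(r2_test, *X_test.shape)

def vif(features):
    """VIF_j = 1 / (1 - R_j^2), regressing feature j on the other features."""
    out = {}
    for col in features.columns:
        others = features.drop(columns=col)
        r2_j = LinearRegression().fit(others, features[col]).score(others, features[col])
        out[col] = 1.0 / (1.0 - r2_j)
    return pd.Series(out, name="VIF")

vif_train = vif(X_train)

# New sample from the problem statement: age 30, male, bmi 29, 3 dependents, smoker.
new_sample = pd.DataFrame(
    [{"age": 30, "gender": 1, "bmi": 29, "dependent": 3, "smoker": 1}])
predicted_charge = model.predict(new_sample)[0]

print(f"R2 train={r2_train:.3f} test={r2_test:.3f}")
print(f"Adjusted R2 train={adj_train:.3f} test={adj_test:.3f}")
print(vif_train)
print(f"Predicted charges: {predicted_charge:.2f}")
```

With the real dataset, the synthetic DataFrame would be replaced by pd.read_csv("insurance_claim.csv"), and the encoding step would need to match the actual category labels in the file.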
Assignment – 2B
One of the applications requires gender classification based on collected voice signals. The voice
signals have already been processed to extract their statistical properties, which are available in the data set
'voice_record_data.csv'. The description of each column is available in “metadata.txt”.
Based on the given data, build a model to classify whether a voice is a woman’s or a man’s.
Problem statement:
c. From the train and test data, select only the significant features (identified from the graph)
to create new train and test data. [Note: The new train and test data will have all the rows but only
the selected features.]
d. Use the new train and test data to train a decision tree classifier with parameters
criterion='gini' and number of nodes = the optimum number of nodes identified in problem
5c. Measure the classification performance.
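
Steps c and d can be sketched as below. Several things here are assumptions, since the earlier steps are not shown: the data is synthetic (generated with make_classification instead of read from 'voice_record_data.csv'), the 75:25 split is arbitrary, "significant" features are taken to be those with a feature importance above a hypothetical threshold of 0.05, "number of nodes" is mapped to scikit-learn's max_leaf_nodes parameter, and optimum_nodes = 8 is a placeholder for the value found in the earlier step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for voice_record_data.csv: 10 features, 4 informative.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Assumption: "significant" features are those with importance above 0.05
# in a tree fitted on all features (the assignment reads them off a graph).
full_tree = DecisionTreeClassifier(criterion="gini", random_state=0)
full_tree.fit(X_train, y_train)
selected = np.flatnonzero(full_tree.feature_importances_ > 0.05)

# New train and test sets: all rows, selected columns only.
X_train_sel, X_test_sel = X_train[:, selected], X_test[:, selected]

# Hypothetical optimum; in the assignment this comes from an earlier step.
optimum_nodes = 8
clf = DecisionTreeClassifier(criterion="gini", max_leaf_nodes=optimum_nodes,
                             random_state=0)
clf.fit(X_train_sel, y_train)

# Classification performance on the held-out data.
y_pred = clf.predict(X_test_sel)
test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print("Selected feature indices:", selected)
print("Test accuracy:", round(test_accuracy, 3))
print(cm)
```

If the assignment intends a different node limit (for example tree depth), the max_leaf_nodes argument would be swapped for the matching parameter.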
Assignment – 2C
5000 employees in an IT company have undergone certifications in 4 different areas of data science. The
company rolled out a survey to identify each employee's interest in, or need for, undertaking the
certification. 217 questions were asked in the survey. The survey report is available in
‘course_survey.csv’. The dataset has 5000 observations and 219 columns. The details of the columns are
as follows:
Course: certification course taken by the employee. It is categorical with values - ‘NLP’,
‘AI’, ‘ML’, ‘DL’ referring to ‘Natural Language Processing’, ‘Artificial Intelligence’,
‘Machine Learning’ and ‘Deep Learning’ respectively.
questions_responded: number of questions in the survey responded by the employee.
Q1 to Q217: Refer to the survey questions. If a question is answered by the employee, it is
marked as ‘Y’; otherwise it is left blank.
Note: The actual questions are not shown in the data, but only the responses are
recorded.
Based on this data, the company would like to build a predictive model that predicts the certification
course an individual employee would be interested in taking up. Build the best model for the given
scenario.
Problem statement:
3. Split the data into training and testing data set in the ratio 80:20.
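
A minimal sketch of preparing this data and making the 80:20 split is shown below. It assumes that blank answers are read as missing values and encodes ‘Y’ as 1 and blank as 0; the data is a tiny synthetic stand-in (5 questions instead of 217), and stratifying the split on ‘Course’ is an optional choice that the problem statement does not require.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for course_survey.csv (5 questions instead of 217).
rng = np.random.default_rng(1)
n, n_questions = 200, 5
q_cols = [f"Q{i}" for i in range(1, n_questions + 1)]
answers = rng.random((n, n_questions)) < 0.4

# 'Y' where a question was answered, missing (NaN) where it was left blank.
raw = np.full((n, n_questions), np.nan, dtype=object)
raw[answers] = "Y"
df = pd.DataFrame(raw, columns=q_cols)
df.insert(0, "questions_responded", answers.sum(axis=1))
df.insert(0, "Course", rng.choice(["NLP", "AI", "ML", "DL"], n))

# Encode responses: 'Y' -> 1, blank -> 0.
encoded = df.copy()
encoded[q_cols] = encoded[q_cols].eq("Y").astype(int)

X = encoded.drop(columns="Course")
y = encoded["Course"]

# 80:20 split; stratify keeps the course proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)
print(y_train.value_counts(normalize=True).round(2))
```

Whether ‘questions_responded’ should stay among the predictors is left open here, since it is just the row sum of the encoded question columns.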
Assignment – 2D
An e-commerce site has collected data to analyze its customers' online purchase details. It is available in
‘online_data_subset.csv’. There are 363087 purchase records and 5 columns. The column details are as
follows:
InvoiceNo – Invoice number for the items being purchased (multiple items can be purchased
under the same invoice)
Description – Description of the item purchased
Quantity – Number of units for each item being purchased
UnitPrice – Price of a single unit (piece) of an item being purchased
CustomerID – customer who purchased the item
Problem statement:
As a machine learning engineer, help the company group its customers using appropriate machine
learning techniques. Perform the tasks mentioned below to achieve this:
3. Group the filtered customers into 5 groups, based on ‘ItemsCount’ and ‘TotalExpense’. Find the
number of customers under each group.
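
The grouping step can be sketched as follows. The definitions of ‘ItemsCount’ and ‘TotalExpense’ and the customer filter belong to steps not shown here, so this sketch assumes ItemsCount is the total Quantity per customer and TotalExpense is the sum of Quantity x UnitPrice per customer, and it skips filtering. K-means is used as one possible clustering technique (the problem statement only asks for 5 groups), and the data is synthetic rather than read from ‘online_data_subset.csv’.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for online_data_subset.csv.
rng = np.random.default_rng(2)
n = 2000
purchases = pd.DataFrame({
    "InvoiceNo": rng.integers(1000, 1400, n),
    "Description": rng.choice(["mug", "lamp", "clock", "bag"], n),
    "Quantity": rng.integers(1, 20, n),
    "UnitPrice": rng.uniform(0.5, 50, n).round(2),
    "CustomerID": rng.integers(1, 151, n),
})

# Assumed definitions (the steps that define these columns are not shown):
#   ItemsCount   = total units bought by the customer
#   TotalExpense = sum of Quantity * UnitPrice over the customer's records
purchases["LineTotal"] = purchases["Quantity"] * purchases["UnitPrice"]
customers = purchases.groupby("CustomerID").agg(
    ItemsCount=("Quantity", "sum"),
    TotalExpense=("LineTotal", "sum"),
)

# Scale both features so neither dominates the distance computation.
scaled = StandardScaler().fit_transform(customers[["ItemsCount", "TotalExpense"]])

# KMeans is one possible choice; the source only asks for 5 groups.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
customers["Group"] = kmeans.fit_predict(scaled)

# Number of customers under each group.
group_sizes = customers["Group"].value_counts().sort_index()
print(group_sizes)
```

Scaling before clustering is a design choice: without it, ‘TotalExpense’ would typically dominate the Euclidean distance because its values are much larger than ‘ItemsCount’.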