
Global Foundries - Batch3

Module-2
Assignment-2

General Instructions:

1. The problem statement and the datasets for the assignment can be downloaded from the
reference documents section.
2. The Python Programming assignment consists of 4 sections.
3. Learners will have to submit solutions for all the 4 sections.
4. Read the problem statement carefully before answering.
5. Provide appropriate comments in your code.
6. Perform all the mentioned tasks programmatically using Python libraries. 

Submission Instructions:
1. Create a separate Jupyter Notebook for each of the 4 sections.
2. Name the Jupyter Notebook in the given format: 
<Assignment2-Section>_<your email Id>.ipynb 
(Eg: Assignment2-A_john@abc.com, Assignment2-B_john@abc.com, Assignment2-C_john@abc.com)
3. Create a folder with the name in the given format: <your email Id> 
(Eg: john@abc.com)
4. Place all 4 Jupyter Notebooks in the folder.
5. Zip the folder and upload it in the ‘Upload Submission’ section on the link page.
6. After uploading the assignment, click on the ‘Finish’ button for the final submission of your
assignment.
Note: Multiple submissions are allowed, but only the latest submission (.zip file) will be considered
for evaluation.

Assignment – 2A
A health insurance company is developing new schemes. This requires forecasting the medical expenses
for a specific group of the insured population. Even though forecasting medical expenses is difficult,
there are a few predominant features that can help. These features, recorded for past instances, are
available in “insurance_claim.csv”. This dataset has 1338 rows and 6 columns. The details of the
features are as follows:

• Age: age of the client
• Sex: gender, female or male
• bmi: body mass index, which indicates whether body weight is relatively high or low
for a given height
• dependent: number of children or other dependents covered by the insurance
• Smoker: whether the client is a smoker or not
• Charges: individual medical costs billed by the health insurance company

Based on this data, the company would like to build a predictive model that predicts the average medical
care expenses for a given group of people. Build the best model for the given scenario by importing
the dataset “insurance_claim.csv”.

Problem statement:

Perform the following activities to build the model:

1. As part of data preprocessing, perform the following activities:


a. Count the number of beneficiaries at each ‘age’ and visualize the counts.
b. Use pie charts to show the composition of the columns 'gender' and 'smoker'.
c. Use a graphical approach to check whether the ‘age’, 'bmi', 'dependent', and 'charges' columns
are normally distributed.
d. Check if there are any outliers in the 'age', 'bmi', and 'dependent' columns.
e. Analyze the correlation between all the variables in the given data set. Drop the redundant
columns if any.
f. Encode all the categorical columns appropriately.

[Note: The preprocessed dataset should be used further.]
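
For illustration only, sub-tasks 1d and 1f could be sketched as below on a tiny made-up frame (not the real “insurance_claim.csv”). The IQR rule with k = 1.5 is one common outlier check and is an assumption here, since the assignment does not prescribe a method. `iqr_outliers` is a hypothetical helper name.

```python
import pandas as pd

# Tiny made-up frame standing in for insurance_claim.csv (values are illustrative only).
df = pd.DataFrame({
    "age": [19, 23, 31, 45, 52, 60, 64],
    "bmi": [22.5, 27.9, 30.1, 24.3, 33.0, 28.4, 58.0],
    "smoker": ["yes", "no", "no", "no", "yes", "no", "no"],
})


def iqr_outliers(series, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]


bmi_outliers = iqr_outliers(df["bmi"])
age_outliers = iqr_outliers(df["age"])

# Encode a two-level categorical column as 0/1.
df["smoker"] = df["smoker"].map({"no": 0, "yes": 1})
```

A box plot of the same columns gives the graphical counterpart of this check.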

2. Select ‘charges’ as the target variable to be predicted and the remaining features as predictors.
3. Split the data into training and testing sets in the ratio 75:25.
4. As part of model building, perform the following activities:
a. Based on the training data, build a Linear Regression model.
b. Find the train and the test score for the built model.
c. Calculate the adjusted R-Squared values on both the train and the test data.
5. Calculate the VIF values for all the features considered while building the model using the train data.
6. Based on the model built, predict the ‘charges’ of the new test sample given below:
age   gender   bmi   dependent   smoker
30    male     29    3           yes
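
As a reference for steps 4c and 5, here is a minimal sketch of the two quantities, under stated assumptions. Adjusted R-squared is 1 - (1 - R^2)(n - 1)/(n - p - 1) for n samples and p predictors. The VIF of predictor j is 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors; the sketch obtains the same values from the diagonal of the inverse correlation matrix. The function names and all numbers are made up for illustration.

```python
import numpy as np


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


def vif(X):
    """VIF of each column, read off the diagonal of the inverse correlation matrix.

    This equals 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest.
    """
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Made-up predictors, for illustration only.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)  # nearly a copy of a, so strongly collinear with it

factors = vif(np.column_stack([a, b, c]))
adj = adjusted_r2(0.75, n_samples=101, n_features=4)
```

In the assignment itself, `r2` would be the score of the fitted model on the train or test split, and `X` would be the training predictors.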

Assignment – 2B
One application requires gender classification based on collected voice signals. The voice
signals have already been processed to extract their statistical properties, which are available in the data set
'voice_record_data.csv'. The description of each column is available in “metadata.txt”.

Based on the given data, build a model to classify whether a voice is a woman’s or a man’s.

Problem statement:

Perform the following activities to build the model:


1. Import the data set ‘voice_record_data.csv’.
2. Encode the column ‘label’ and normalize all the columns in the data set.
3. Split the data into training and testing sets in the ratio 70:30.
4. Consider the ‘label’ column as the target variable to be predicted and all the remaining features as
predictors.
5. Build Machine learning model- 1:
a. Build a decision tree classifier, varying the number of nodes from 2 to 10. In each
iteration, compute the prediction error of the built model on the test data.
b. Visualize the prediction error computed against the number of nodes.
c. From the graph plotted, find the optimum number of nodes for which the error rate is
minimum. With the identified number of nodes, run the decision tree classifier and measure
the classification performance.
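
The loop described in step 5 could look roughly like the sketch below. It runs on synthetic data from `make_classification`, not on 'voice_record_data.csv'. Mapping "number of nodes" to scikit-learn's `max_leaf_nodes` parameter is an assumption; the assignment does not name the parameter.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the voice data (not the real file).
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

errors = {}
for n_nodes in range(2, 11):
    # Assumption: "number of nodes" is interpreted as max_leaf_nodes.
    tree = DecisionTreeClassifier(max_leaf_nodes=n_nodes, random_state=0)
    tree.fit(X_train, y_train)
    errors[n_nodes] = 1 - tree.score(X_test, y_test)  # error rate = 1 - accuracy

# Number of nodes with the lowest test error (the first one wins on ties).
best_n = min(errors, key=errors.get)
```

Plotting `errors` against its keys gives the graph asked for in 5b.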

6. Build Machine learning model- 2:


a. Draw KDE plots of every other feature in the given data set, grouped by the 'Label' column.
[Hint: For better visualization, use subplots to display all in one figure.]
b. From the graph, identify the significant features that exhibit class separability based on
the 'Label' column (i.e., between Male and Female groups).
[Hint: The probability density of a continuous variable against a discrete variable shows an almost
separate curve for each class.]
For reference: a sample density plot between label and 3 columns

c. From the train and test data, select only the significant features (identified from the graph)
to create new train and test data. [Note: The new train and test data will have all the rows but only
the selected features.]
d. Using the new train and test data, train a decision tree classifier with the parameters
criterion='gini' and number of nodes = the optimum number of nodes identified in step
5c. Measure the classification performance.
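
One possible way to draw the per-class density curves of step 6 in a single figure is sketched below. It uses `scipy.stats.gaussian_kde` with matplotlib subplots; a plotting library with a built-in KDE plot would work as well. The data is synthetic and the column names `f1` and `f2` are hypothetical: `f1` is constructed to separate the two classes and `f2` is not.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# Synthetic stand-in for the voice data (hypothetical column names).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "f1": np.concatenate([rng.normal(0, 1, n), rng.normal(4, 1, n)]),
    "f2": rng.normal(0, 1, 2 * n),
    "label": np.repeat([0, 1], n),
})

features = [c for c in df.columns if c != "label"]
fig, axes = plt.subplots(1, len(features), figsize=(4 * len(features), 3))
for ax, col in zip(np.atleast_1d(axes), features):
    grid = np.linspace(df[col].min(), df[col].max(), 200)
    for cls, group in df.groupby("label"):
        # One density curve per class; well-separated curves suggest a useful feature.
        ax.plot(grid, gaussian_kde(group[col].to_numpy())(grid), label=f"label={cls}")
    ax.set_title(col)
    ax.legend()
fig.tight_layout()

# Keep only the features judged separable from the plot (chosen by eye here).
selected = ["f1"]
X_selected = df[selected]
```

The same `selected` list would then be applied to both the train and the test predictors.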

Assignment – 2C
5000 employees in an IT company have undergone certifications in 4 different areas of Data science. The
company rolled out a survey to identify the interest/need of the employee for undertaking the
certification. 217 questions were asked in the survey. The survey report is available in
‘course_survey.csv’. The dataset has 5000 observations and 219 columns. The details of the columns are
as follows:

• Course: certification course taken by the employee. It is categorical with the values ‘NLP’,
‘AI’, ‘ML’, and ‘DL’, referring to ‘Natural Language Processing’, ‘Artificial Intelligence’,
‘Machine Learning’ and ‘Deep Learning’ respectively.
• questions_responded: number of survey questions answered by the employee.
• Q1 to Q217: the survey questions. If a question is answered by the employee, it is
marked as ‘Y’; otherwise it is left blank.
Note: The actual questions are not shown in the data; only the responses are
recorded.

Based on this data, the company would like to build a predictive model that predicts the certification
course an individual employee would be interested in taking up. Build the best model for the given
scenario.

Problem statement:

Perform the following activities to build the model:


1. Import the data set ‘course_survey.csv’.
2. As part of data preprocessing, perform the following tasks:
a. Remove the survey responses in which fewer than 10 questions are answered.
b. Eliminate the features/columns that have fewer than 50 responses.
c. Replace all the missing values in the data with ‘N’ and perform label encoding of the data.
d. ‘Course’ will be the target column to be predicted and the remaining columns will be the
predictors.
[Note: This preprocessed dataset should be used further.]
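
The preprocessing steps 2a to 2d could be sketched as follows on a tiny made-up survey. The thresholds are scaled down to 2 and 2 so that the toy data survives; the assignment's thresholds are 10 answered questions and 50 responses. Leaving the numeric `questions_responded` column unencoded is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Scaled-down thresholds for the toy data (the assignment uses 10 and 50).
MIN_ANSWERED, MIN_RESPONSES = 2, 2

# Tiny made-up survey standing in for course_survey.csv.
df = pd.DataFrame({
    "Course": ["NLP", "AI", "ML", "DL"],
    "questions_responded": [3, 2, 1, 2],
    "Q1": ["Y", "Y", np.nan, "Y"],
    "Q2": ["Y", np.nan, "Y", np.nan],
    "Q3": ["Y", "Y", np.nan, np.nan],
    "Q4": [np.nan, np.nan, np.nan, "Y"],
})

# a. Keep only respondents who answered enough questions.
df = df[df["questions_responded"] >= MIN_ANSWERED].copy()

# b. Drop question columns with too few responses (count() skips missing values).
question_cols = [c for c in df.columns if c not in ("Course", "questions_responded")]
counts = df[question_cols].count()
df = df.drop(columns=counts[counts < MIN_RESPONSES].index)

# c. Fill blanks with 'N', then label-encode the text columns.
#    questions_responded is already numeric, so it is left as is (an assumption).
df = df.fillna("N")
for col in df.columns.drop("questions_responded"):
    df[col] = LabelEncoder().fit_transform(df[col])

# d. Target and predictors.
y = df["Course"]
X = df.drop(columns="Course")
```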

3. Split the data into training and testing sets in the ratio 80:20.

4. Build Machine learning model- 1:


a. Train a Support Vector Machine.
b. Find the train and the test score for model- 1.
c. Build confusion matrices for the built model, based on the train data and the test data.
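
A minimal sketch of model 1, assuming scikit-learn's `SVC` with its default settings (the assignment does not fix the kernel or other parameters). The data is a synthetic four-class problem, not the survey file.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic four-class stand-in for the encoded survey data.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

model = SVC()  # default RBF kernel; an assumption, not prescribed by the assignment
model.fit(X_train, y_train)

# score() returns the mean accuracy on the given data.
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

# Rows are true classes, columns are predicted classes.
labels = [0, 1, 2, 3]
cm_train = confusion_matrix(y_train, model.predict(X_train), labels=labels)
cm_test = confusion_matrix(y_test, model.predict(X_test), labels=labels)
```

The diagonal of each matrix counts the correct predictions, so its sum divided by the matrix total equals the corresponding score.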

5. Build Machine learning model- 2:


a. Train an AdaBoostClassifier. Use grid search with 5-fold cross-validation to find
the best value (between 35 and 45) for the parameter ‘n_estimators’.
b. Based on the grid search results, choose the best value for ‘n_estimators’.

6. Build Machine learning model- 3:


a. Based on the best value for ‘n_estimators’, build the AdaBoostClassifier model.
b. Find the train and the test score for the built model.
c. Based on this model, find the importance of each predictor.
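
Models 2 and 3 could be sketched together as below, again on synthetic data. Treating the range 35 to 45 as inclusive on both ends is an assumption, and all other AdaBoostClassifier parameters are left at their defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic four-class stand-in for the encoded survey data.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Model 2: grid search over n_estimators = 35..45 (assumed inclusive), 5-fold CV.
param_grid = {"n_estimators": list(range(35, 46))}
search = GridSearchCV(AdaBoostClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)
best_n = search.best_params_["n_estimators"]

# Model 3: refit with the chosen value, then score and inspect importances.
model = AdaBoostClassifier(n_estimators=best_n, random_state=0).fit(X_train, y_train)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
importances = model.feature_importances_  # one value per predictor
```

`search.cv_results_` holds the mean cross-validation score for every candidate value, which is useful when justifying the choice in 5b.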

Assignment – 2D

An e-commerce site has collected data to analyze its customers’ online purchase details. It is available in
‘online_data_subset.csv’. There are 363087 purchase records and 5 columns. The column details are as
follows:

• InvoiceNo – invoice number for the items being purchased (multiple items can be purchased
under the same invoice)
• Description – description of the item purchased
• Quantity – number of units of each item being purchased
• UnitPrice – price of a single unit (piece) of the item being purchased
• CustomerID – customer who purchased the item

Problem statement:

As a machine learning engineer, help the company group its customers using appropriate machine
learning techniques. Perform the tasks below to achieve this:

1. Import the dataset ‘online_data_subset.csv’.

2. Perform data preprocessing according to the following rules:


a. Add a new attribute ‘TotalPrice’ to the dataset, whose value is computed using the
‘Quantity’ and ‘UnitPrice’ attributes.
b. Create a new dataframe that contains the following columns:
o CustomerID – populated with the unique customer IDs from the original dataframe
o ItemsCount – the count of DISTINCT items purchased by the customer
o TotalExpense – the total amount spent by the customer across all
purchases
c. From the new dataframe, keep only the customers whose ‘ItemsCount’ is less than 300
and ‘TotalExpense’ is less than 10000.
d. Display the number of customers who satisfy the conditions stated above.
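
Step 2 could be sketched as below on a handful of made-up purchase records. Computing ‘TotalPrice’ as Quantity times UnitPrice is an assumption, since the assignment says only that it is computed from the two attributes. The filter thresholds are scaled down from 300 and 10000 to fit the toy data.

```python
import pandas as pd

# Made-up purchase records standing in for online_data_subset.csv.
orders = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2", "3", "3", "4"],
    "Description": ["mug", "pen", "mug", "pen", "lamp", "mug"],
    "Quantity": [2, 10, 1, 5, 1, 3],
    "UnitPrice": [4.0, 0.5, 4.0, 0.5, 20.0, 4.0],
    "CustomerID": [10, 10, 10, 11, 11, 12],
})

# a. Assumption: TotalPrice = Quantity * UnitPrice.
orders["TotalPrice"] = orders["Quantity"] * orders["UnitPrice"]

# b. One row per customer: distinct items bought and total amount spent.
customers = (
    orders.groupby("CustomerID")
    .agg(ItemsCount=("Description", "nunique"), TotalExpense=("TotalPrice", "sum"))
    .reset_index()
)

# c. Scaled-down thresholds for the toy data (the assignment uses 300 and 10000).
MAX_ITEMS, MAX_EXPENSE = 3, 20.0
filtered = customers[
    (customers["ItemsCount"] < MAX_ITEMS) & (customers["TotalExpense"] < MAX_EXPENSE)
]

# d. Number of customers satisfying both conditions.
n_customers = len(filtered)
```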

3. Group the filtered customers into 5 groups, based on ‘ItemsCount’ and ‘TotalExpense’. Find the
number of customers under each group.

4. Use a scatter plot to visualize the 5 groups.
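
Steps 3 and 4 do not name a grouping technique. The sketch below assumes k-means clustering with 5 clusters, which is one common choice. Scaling the two columns first is a design choice so that ‘TotalExpense’ does not dominate the distance; it is not required by the assignment. The customer data is randomly generated for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in for the filtered customer dataframe.
rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "ItemsCount": rng.integers(1, 300, size=200),
    "TotalExpense": rng.uniform(0, 10000, size=200),
})

# Scale both columns so neither dominates the Euclidean distance.
scaled = StandardScaler().fit_transform(customers[["ItemsCount", "TotalExpense"]])

# Assumption: k-means is used to form the 5 groups.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
customers["Group"] = kmeans.fit_predict(scaled)

# Number of customers in each group.
group_sizes = customers["Group"].value_counts().sort_index()

# Scatter plot of the groups in the original (unscaled) units.
fig, ax = plt.subplots()
ax.scatter(customers["ItemsCount"], customers["TotalExpense"],
           c=customers["Group"], cmap="viridis", s=15)
ax.set_xlabel("ItemsCount")
ax.set_ylabel("TotalExpense")
```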

-------------------------------------------------Good Luck -------------------------------------------------------
