Practice Questions


LINEAR REGRESSION PRACTICE QUESTIONS

Question 1.
Using linear regression on the data below, predict the tons of solid waste generated in 1995, 1996, and 2000.
The table is:
Year    Tons of Solid Waste Generated (in thousands)
1990 19,358
1991 19,484
1992 20,293
1993 21,499
1994 23,561

ANSWER: -
1995: 23965.3
1996: 25007.4
2000: 29175.8
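
A minimal sketch of how these answers can be reproduced with scikit-learn (the same approach
works for the other linear regression questions below):

# Fit a simple linear regression to the table above and predict 1995, 1996 and 2000
import numpy as np
from sklearn.linear_model import LinearRegression

years = np.array([[1990], [1991], [1992], [1993], [1994]])
waste = np.array([19358, 19484, 20293, 21499, 23561])   # tons (in thousands)

model = LinearRegression()
model.fit(years, waste)

print(model.predict([[1995], [1996], [2000]]))   # approx. [23965.3, 25007.4, 29175.8]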

Question 2.
The numbers of insured commercial banks y (in thousands) in the United States for the years
1987 to 1996 are shown in the table. Use linear regression to predict the values for 2005 and 2010.

The table is:


Year 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996
y 13.70 13.12 12.71 12.34 11.92 11.46 10.96 10.45 9.94 9.53

ANSWER: -
2005: 5.42018182
2006: 4.96145455
2007: 4.50272727
2008: 4.044
2009: 3.58527273
2010: 3.12654545

Question 3.
Using linear regression on the data below, predict the average acreage per farm for the years 2000 and 2002.
The table is:
Year    Average Acreage Per Farm
1910 139
1920 149
1930 157
1940 175
1950 216
1959 303
1969 390
1978 449
1987 462
1997 487

ANSWER: -
2000: 509.72564103
2002: 519.16153846

Question 4.
Using linear regression on the data below, predict the height when the time is 1.000 sec.
The table is:
Time (sec) Height (m)
0.0000 1.03754
0.1080 1.40205
0.2150 1.63806
0.3225 1.77412
0.4300 1.80392
0.5375 1.71522
0.6450 1.50942
0.7525 1.21410
0.8600 0.83173

ANSWER: -
1.000: 1.28564909

Question 5.
Using linear regression on the data below, predict the stopping distance for speeds of 60 and 100 mph.
The table is:
Speed (mph)    Stopping Distance (ft)
10 15.1
20 39.9
30 75.2
40 120.5
50 175.9
ANSWER: -
60: 205.98
100: 366.86

MULTIVARIABLE LINEAR REGRESSION PRACTICE QUESTIONS

33,21 = 0.5708976

Q2. Thunder Basin Antelope


3.7,6.4,11 = 9.33028516
Q3. Systolic BP

141,51 = 184.739

Q4. Hollywood Movies

96.08,9.7,3.101 = 17.84577

Q5. Crimes
7,23,5,2.19 = 81.22555333
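
The underlying datasets for these multivariable questions are not reproduced here, but the pattern
for solving each of them is the same. A minimal sketch, assuming a hypothetical CSV file and
column names (substitute each question's actual data):

import pandas as pd
from sklearn.linear_model import LinearRegression

# 'systolic_bp.csv' and the column names are placeholders for illustration only
df = pd.read_csv('systolic_bp.csv')
X = df[['weight', 'age']].values      # predictor columns (hypothetical)
y = df['systolic_bp'].values          # target column (hypothetical)

model = LinearRegression().fit(X, y)
print(model.predict([[141, 51]]))     # e.g. Q3 above reports ~184.739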

LOGISTIC REGRESSION PRACTICE QUESTIONS

Q-1) Single variable
Hours_Studied Pass
12 1
10 1
2 0
11 1
9 1
6 0
7 0
Take test_size=0.1
Ans:- output

model.predict(x_test) : array([0], dtype=int64)


model.predict([[12]]): array([1], dtype=int64)
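
A minimal sketch for Q-1. The random_state is an assumption (the question only fixes
test_size=0.1), so the row that ends up in the test set may differ:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Data from the table above
x = np.array([[12], [10], [2], [11], [9], [6], [7]])
y = np.array([1, 1, 0, 1, 1, 0, 0])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=0)

model = LogisticRegression()
model.fit(x_train, y_train)

print(model.predict(x_test))     # prediction for the single held-out row
print(model.predict([[12]]))     # expected, as above: array([1])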

Q-2) CGPA AND Admission.


CGPA Admission
8.0 T
7.5 T
5.0 N
6.5 T
6.0 T
4.9 N
5.8 N
Take test_size=0.1

Ans:- output

model.predict([[3.7]]): array(['N'], dtype=object)


model.predict([[6.2]]): array(['T'], dtype=object)

Q-3 Multiple variable


Area Age price
2600 20 0
3000 15 1
3200 18 1
3600 30
4000 8 0

Independent variables:-area,age

Dependent:-price

Ans:- output
model.predict(x_test): array([0], dtype=int64)

model.predict([[4000,8]]): array([1], dtype=int64)

model.predict([[4000,30]]): array([0], dtype=int64)

Q-4 Multiple variable


Age Number of cigarettes Lung cancer
12 2 N
60 2 Y
45 4 Y
25 5 N
35 3 N
40 5 Y

Ans:- output
model.score(x_test,y_test): 1.0

model.predict(x_test): array(['Y'], dtype=object)

model.predict([[60,2]]): array(['Y'], dtype=object)

Q-5 Multiple variables


Maths and English (out of 50)
Maths English PASS
30 20 F
40 50 P
20 19 F
25 34 F
33 33 P

Ans:- output
model.predict(x_test): array(['F'], dtype=object)

model.predict([[31,30]]): array(['P'], dtype=object)
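
A minimal sketch for Q-5; the same pattern (two feature columns, string labels) applies to the
other multiple-variable questions. The test_size and random_state are assumptions, so the held-out
row may differ from the answer shown:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Data from the Maths/English table above
x = np.array([[30, 20], [40, 50], [20, 19], [25, 34], [33, 33]])
y = np.array(['F', 'P', 'F', 'F', 'P'])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

model = LogisticRegression()
model.fit(x_train, y_train)

print(model.predict(x_test))         # prediction for the held-out row
print(model.predict([[31, 30]]))     # the answer above reports 'P'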

DECISION TREE PRACTICE QUESTIONS


Q1. Reptile Checking

model.predict([[1,0,1,0]]) : array(['R'], dtype=object)

Q2. Breast Cancer


model.predict([[156,303,208,179,329,390,365,316,314,370]]) : array([1], dtype=int64)

Q3. Football playing conditions

Q4. Iris Detection

Import Dataset using: from sklearn.datasets import load_iris


#Output:
Out: Accuracy Score on train data: 0.9553571428571429
Accuracy Score on test data: 0.9736842105263158
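
A minimal sketch of how scores like these can be produced; the split ratio, random_state and
maximum tree depth are assumptions, so the exact numbers may differ:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print('Accuracy Score on train data:', tree.score(X_train, y_train))
print('Accuracy Score on test data:', tree.score(X_test, y_test))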

Q5: Breast Cancer 2

RANDOM FOREST PRACTICE QUESTIONS

Q1. Temperature

RANDOM FOREST

The problem we will tackle is predicting the max temperature for tomorrow in our city using one
year of past weather data. I am using Seattle, WA but feel free to find data for your own city using
the NOAA Climate Data Online tool. We are going to act as if we don’t have access to any
weather forecasts (and besides, it’s more fun to make our own predictions rather than rely on
others). What we do have access to is one year of historical max temperatures, the temperatures
for the previous two days, and an estimate from a friend who is always claiming to know
everything about the weather. This is a supervised, regression machine learning problem. It’s
supervised because we have both the features (data for the city) and the targets (temperature) that
we want to predict. During training, we give the random forest both the features and targets and it
must learn how to map the data to a prediction. Moreover, this is a regression task because the
target value is continuous (as opposed to discrete classes in classification). That’s pretty much all
the background we need, so let’s start!

Roadmap

Before we jump right into programming, we should lay out a brief guide to keep us on track. The
following steps form the basis for any machine learning workflow once we have a problem and
model in mind:

1. State the question and determine required data


2. Acquire the data in an accessible format

3. Identify and correct missing data points/anomalies as required

4. Prepare the data for the machine learning model

5. Establish a baseline model that you aim to exceed

6. Train the model on the training data

7. Make predictions on the test data

8. Compare predictions to the known test set targets and calculate performance metrics

9. If performance is not satisfactory, adjust the model, acquire more data, or try a different
modeling technique

10. Interpret model and report results visually and numerically

Step 1 is already checked off! We have our question: “can we predict the max temperature
tomorrow for our city?” and we know we have access to historical max temperatures for the past
year in Seattle, WA.

Data Acquisition

First, we need some data. To use a realistic example, I retrieved weather data for Seattle, WA
from 2016 using the NOAA Climate Data Online tool. Generally, about 80% of the time spent in
data analysis is cleaning and retrieving data, but this workload can be reduced by finding high-
quality data sources. The NOAA tool is surprisingly easy to use and temperature data can be
downloaded as clean csv files which can be parsed in languages such as Python or R. The
complete data file is available for download for those wanting to follow along.

The information is in the tidy data format with each row forming one observation, with the
variable values in the columns.

Following are explanations of the columns:

year: 2016 for all data points

month: number for month of the year


day: number for day of the year

week: day of the week as a character string

temp_2: max temperature 2 days prior

temp_1: max temperature 1 day prior

average: historical average max temperature

actual: max temperature measurement

friend: your friend’s prediction, a random number between 20 below the average and 20 above
the average

Identify Anomalies/ Missing Data

If we look at the dimensions of the data, we notice there are only 348 rows, which doesn't
quite agree with the 366 days we know there were in 2016. Looking through the data from the
NOAA, I noticed several missing days, which is a great reminder that data collected in the real-
world will never be perfect. Missing data can impact an analysis as can incorrect data or outliers.
In this case, the missing data will not have a large effect, and the data quality is good because of
the source. We also can see there are nine columns which represent eight features and the one
target (‘actual’).

Data Summary
There are not any data points that immediately appear as anomalous and no zeros in any of the
measurement columns. Another method to verify the quality of the data is to make basic plots. Often
it is easier to spot anomalies in a graph than in numbers. I have left out the actual code here,
because plotting in Python is non-intuitive, but feel free to refer to the notebook for the complete
implementation (like any good data scientist, I pretty much copy and pasted the plotting code
from Stack Overflow).
Examining the quantitative statistics and the graphs, we can feel confident in the high quality of
our data. There are no clear outliers, and although there are a few missing points, they will not
detract from the analysis.

Data Preparation

Unfortunately, we aren't quite at the point where we can just feed raw data into a model and have
it return an answer (although people are working on this)! We will need to do some minor
modification to put our data into machine-understandable terms. We will use the Python
library Pandas for our data manipulation, relying on the structure known as a dataframe, which is
basically an Excel spreadsheet with rows and columns.

The exact steps for preparation of the data will depend on the model used and the data gathered,
but some amount of data manipulation will be required for any machine learning application.

One-Hot Encoding

The first step for us is known as one-hot encoding of the data. This process takes categorical
variables, such as days of the week, and converts them to a numerical representation without an
arbitrary ordering. Days of the week are intuitive to us because we use them all the time. You will
(hopefully) never find anyone who doesn’t know that ‘Mon’ refers to the first day of the
workweek, but machines do not have any intuitive knowledge. What computers know is numbers
and for machine learning we must accommodate them. We could simply map days of the week to
numbers 1–7, but this might lead to the algorithm placing more importance on Sunday because it
has a higher numerical value. Instead, we change the single column of weekdays into seven
columns of binary data. This is best illustrated pictorially: one-hot encoding takes the single
'week' column and turns it into seven binary weekday columns (Figure: Data after One-Hot Encoding).

So, if a data point is a Wednesday, it will have a 1 in the Wednesday column and a 0 in all other
columns. This process can be done in pandas in a single line!
The shape of our data is now 349 x 15 and all of the columns are numbers, just how the algorithm
likes it!

Features and Targets and Convert Data to Arrays

Now, we need to separate the data into the features and targets. The target, also known as the
label, is the value we want to predict, in this case the actual max temperature and the features are
all the columns the model uses to make a prediction. We will also convert the Pandas dataframes
to Numpy arrays because that is the way the algorithm works. (I save the column headers, which
are the names of the features, to a list to use for later visualization).
Training and Testing Sets

There is one final step of data preparation: splitting data into training and testing sets. During
training, we let the model ‘see’ the answers, in this case the actual temperature, so it can learn
how to predict the temperature from the features. We expect there to be some relationship
between all the features and the target value, and the model’s job is to learn this relationship
during training. Then, when it comes time to evaluate the model, we ask it to make predictions on
a testing set where it only has access to the features (not the answers)! Because we do have the
actual answers for the test set, we can compare these predictions to the true value to judge how
accurate the model is. Generally, when training a model, we randomly split the data into training
and testing sets to get a representation of all data points (if we trained on the first nine months of
the year and then used the final three months for prediction, our algorithm would not perform well
because it has not seen any data from those last three months.) I am setting the random state to 42
which means the results will be the same each time I run the split for reproducible results.

It looks as if everything is in order! Just to recap, to get the data into a form acceptable for
machine learning we:

1. One-hot encoded categorical variables

2. Split data into features and labels

3. Converted to arrays

4. Split data into training and testing sets

Depending on the initial data set, there may be extra work involved such as removing
outliers, imputing missing values, or converting temporal variables into cyclical representations.
These steps may seem arbitrary at first, but once you get the basic workflow, it will be generally
the same for any machine learning problem. It’s all about taking human-readable data and putting
it into a form that can be understood by a machine learning model.

Establish Baseline

Before we can make and evaluate predictions, we need to establish a baseline, a sensible measure
that we hope to beat with our model. If our model cannot improve upon the baseline, then it will
be a failure and we should try a different model or admit that machine learning is not right for our
problem. The baseline prediction for our case can be the historical max temperature averages. In
other words, our baseline is the error we would get if we simply predicted the average max
temperature for all days.

We now have our goal! If we can’t beat an average error of 5 degrees, then we need to rethink our
approach.
Train Model

After all the work of data preparation, creating and training the model is pretty simple using
Scikit-learn. We import the random forest regression model from scikit-learn, instantiate the
model, and fit (scikit-learn's name for training) the model on the training data. (Again setting the
random state for reproducible results.) This entire process is only 3 lines in scikit-learn!

Make Predictions on the Test Set

Our model has now been trained to learn the relationships between the features and the targets.
The next step is figuring out how good the model is! To do this we make predictions on the test
features (the model is never allowed to see the test answers). We then compare the predictions to
the known answers. When performing regression, we need to make sure to use the absolute error
because we expect some of our answers to be low and some to be high. We are interested in how
far away our average prediction is from the actual value so we take the absolute value (as we also
did when establishing the baseline).

Determine Performance Metrics

To put our predictions in perspective, we can calculate an accuracy as 100% minus the mean
absolute percentage error (MAPE).

Code:-
# Pandas is used for data manipulation
import pandas as pd

# Read in data and display first 5 rows
features = pd.read_csv('temps.csv')
features.head(5)
Output:-

# Descriptive statistics for each column


features.describe()
Output:-
# One-hot encode the data using pandas get_dummies
features = pd.get_dummies(features)

# Display the first 5 rows of the last 12 columns
features.iloc[:,5:].head(5)
Output:-

# Use numpy to convert to arrays


import numpy as np

# Labels are the values we want to predict
labels = np.array(features['actual'])

# Remove the labels from the features
# axis 1 refers to the columns
features = features.drop('actual', axis=1)

# Saving feature names for later use
feature_list = list(features.columns)

# Convert to numpy array
features = np.array(features)

# Using Scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, random_state=42)
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)
Output:-
# The baseline predictions are the historical averages
baseline_preds = test_features[:, feature_list.index('average')]

# Baseline errors, and display average baseline error
baseline_errors = abs(baseline_preds - test_labels)
print('Average baseline error: ', round(np.mean(baseline_errors), 2))
Output:-

# Import the model we are using
from sklearn.ensemble import RandomForestRegressor

# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators=1000, random_state=42)

# Train the model on training data
rf.fit(train_features, train_labels)

# Use the forest's predict method on the test data
predictions = rf.predict(test_features)

# Calculate the absolute errors
errors = abs(predictions - test_labels)

# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')
Output:-

Our average estimate is off by 3.83 degrees. That is more than a 1 degree average improvement
over the baseline. Although this might not seem significant, it is nearly 25% better than the
baseline, which, depending on the field and the problem, could represent millions of dollars to a
company.
# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / test_labels)

# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
Output:-
That looks pretty good! Our model has learned how to predict the maximum temperature for the
next day in Seattle with 94% accuracy.

Q2. Petroleum
In this section we will study how random forests can be used to solve regression problems using
Scikit-Learn. In the next section we will solve a classification problem via random forests.

Problem Definition
The problem here is to predict the gas consumption (in millions of gallons) in 48 of the US states
based on petrol tax (in cents), per capita income (dollars), paved highways (in miles), and the
proportion of the population with a driving license.

Solution
To solve this regression problem we will use the random forest algorithm via the Scikit-Learn
Python library. We will follow the traditional machine learning pipeline to solve this problem.
Follow these steps:

1. Import Libraries

2. Importing Dataset
The dataset for this problem is available at:

Execute the following command to import the dataset:

To get a high-level view of what the dataset looks like, execute the following command:

   Petrol_tax  Average_income  Paved_Highways  Population_Driver_license(%)  Petrol_Consumption
0         9.0            3571            1976                         0.525                 541
1         9.0            4092            1250                         0.572                 524
2         9.0            3865            1586                         0.580                 561
3         7.5            4870            2351                         0.529                 414
4         8.0            4399             431                         0.544                 410

We can see that the values in our dataset are not very well scaled. We will scale them down
before training the algorithm.

3. Preparing Data For Training


Two tasks will be performed in this section. The first task is to divide data into 'attributes' and
'label' sets. The resultant data is then divided into training and test sets.

Finally, let's divide the data into training and testing sets:

4. Feature Scaling
We know our dataset is not yet scaled; for instance, the Average_income field has values
in the range of thousands while Petrol_tax has values in the range of tens. Therefore, it would be
beneficial to scale our data (although, as mentioned earlier, this step isn't as important for the
random forests algorithm).

5. Training the Algorithm


Now that we have scaled our dataset, it is time to train our random forest algorithm to solve this
regression problem. Execute the following code:

The RandomForestRegressor class of the sklearn.ensemble library is used to solve regression


problems via random forest. The most important parameter of the RandomForestRegressor class
is the n_estimators parameter. This parameter defines the number of trees in the random forest.
We will start with n_estimators=20 to see how our algorithm performs. You can find details for
all of the parameters of RandomForestRegressor in the scikit-learn documentation.
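
Since the code for steps 1-5 is not reproduced above, here is a minimal end-to-end sketch. The
file name 'petrol_consumption.csv', the test split and the random_state are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

# 2. Import the dataset (file name assumed)
dataset = pd.read_csv('petrol_consumption.csv')

# 3. Divide the data into attributes and labels, then into training and test sets
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 4. Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# 5. Train a random forest with 20 trees and evaluate it
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))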

Q3. Bill Authentication

Attribute Information:

1. variance of Wavelet Transformed image (continuous)


2. skewness of Wavelet Transformed image (continuous)
3. curtosis of Wavelet Transformed image (continuous)
4. entropy of image (continuous)
5. class (integer)

Problem Definition
The task here is to predict whether a bank currency note is authentic or not based on four
attributes: the variance, skewness, and curtosis of the wavelet-transformed image, and the entropy
of the image.

Solution
This is a binary classification problem and we will use a random forest classifier to solve this
problem. Steps followed to solve this problem will be similar to the steps performed for
regression.

1. Import Libraries

2. Importing Dataset
dataset.head()

   Variance  Skewness  Curtosis   Entropy  Class
0   3.62160    8.6661   -2.8073  -0.44699      0
1   4.54590    8.1674   -2.4586  -1.46210      0
2   3.86600   -2.6383    1.9242   0.10645      0
3   3.45660    9.5228   -4.0112  -3.59440      0
4   0.32924   -4.4552    4.5718  -0.98880      0

As was the case with the regression dataset, the values in this dataset are not very well scaled. The
dataset will be scaled before training the algorithm.

3. Preparing Data For Training


The following code divides data into attributes and labels:

4. Feature Scaling
As with before, feature scaling works the same way:

5. Training the Algorithm


And again, now that we have scaled our dataset, we can train our random forests to solve this
classification problem. To do so, execute the following code:

In the case of regression we used the RandomForestRegressor class of the sklearn.ensemble library.
For classification, we will use the RandomForestClassifier class of the sklearn.ensemble
library. The RandomForestClassifier class also takes n_estimators as a parameter. Like before, this
parameter defines the number of trees in our random forest. We will start with 20 trees again.
You can find details for all of the parameters of RandomForestClassifier in the scikit-learn documentation.
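
A minimal sketch for this classification problem, mirroring the regression pipeline above. The
file name 'bill_authentication.csv', the test split and the random_state are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

dataset = pd.read_csv('bill_authentication.csv')

# Divide the data into attributes (Variance, Skewness, Curtosis, Entropy) and labels (Class)
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Train a random forest classifier with 20 trees and evaluate it
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))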

K-MEANS PRACTICE QUESTION


Q1. Absenteeism at Work
1. Individual identification (ID)
2. Reason for absence (ICD).
Absences attested by the International Code of Diseases (ICD) stratified into 21 categories (I to
XXI) as follows:

I Certain infectious and parasitic diseases


II Neoplasms
III Diseases of the blood and blood-forming organs and certain disorders involving the immune
mechanism
IV Endocrine, nutritional and metabolic diseases
V Mental and behavioural disorders
VI Diseases of the nervous system
VII Diseases of the eye and adnexa
VIII Diseases of the ear and mastoid process
IX Diseases of the circulatory system
X Diseases of the respiratory system
XI Diseases of the digestive system
XII Diseases of the skin and subcutaneous tissue
XIII Diseases of the musculoskeletal system and connective tissue
XIV Diseases of the genitourinary system
XV Pregnancy, childbirth and the puerperium
XVI Certain conditions originating in the perinatal period
XVII Congenital malformations, deformations and chromosomal abnormalities
XVIII Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified
XIX Injury, poisoning and certain other consequences of external causes
XX External causes of morbidity and mortality
XXI Factors influencing health status and contact with health services.

And 7 categories without (CID) patient follow-up (22), medical consultation (23), blood donation
(24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental
consultation (28).
3. Month of absence
4. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
5. Seasons (summer (1), autumn (2), winter (3), spring (4))
6. Transportation expense
7. Distance from Residence to Work (kilometers)
8. Service time
9. Age
10. Hit target
11. Disciplinary failure (yes=1; no=0)
12. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
13. Son (number of children)
14. Social drinker (yes=1; no=0)
15. Social smoker (yes=1; no=0)
16. Pet (number of pet)
17. Weight
18. Height
19. Body mass index
20. Absenteeism time in hours (target)
Original (Reason vs Hours of leave)

Without Optimization (Reason vs Hours of leave)

With Optimization
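
A minimal sketch of the clustering exercise. The file name, separator and column names are
assumptions based on the UCI 'Absenteeism at work' dataset described above; the un-optimized run
uses an arbitrary number of clusters, while the optimized run chooses k with the elbow method:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# File name and separator assumed; adjust to the actual dataset
df = pd.read_csv('Absenteeism_at_work.csv', sep=';')
X = df[['Reason for absence', 'Absenteeism time in hours']].values

# Without optimization: pick the number of clusters arbitrarily
km = KMeans(n_clusters=3, random_state=0, n_init=10)
df['cluster'] = km.fit_predict(X)

# With optimization: use the elbow method to choose k
inertias = [KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_
            for k in range(1, 11)]
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.show()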
SVM PRACTICE QUESTION
Q1. Handwritten Digits

Support Vector Machines

This algorithm is normally a second stepping stone for those who have learned linear and logistic
regression. It is quite a popular algorithm used mostly for classification problems. It produces
high-accuracy models with little effort and minimal resources. Though it can be used for
regression, it is mostly applied in classification scenarios.

Each data point is plotted in an n-dimensional space, where n is the number of features in our
dataset. Classification is then performed by finding the hyperplane that best separates the two
classes.

There are many possible separating hyperplanes. Our objective is to find the one with the
maximum margin, i.e. the maximum distance to the data points of both classes. The dimension of
the hyperplane depends on the number of features: with 2 input features the hyperplane is just a
line, with 3 features it becomes a two-dimensional plane, and beyond 3 features it becomes harder
to visualize.

The support vectors are the data points closest to the hyperplane; they determine its position and
orientation. Using these vectors, we maximize the margin of the classifier. Changing or deleting a
support vector changes the position of the hyperplane.
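
A minimal sketch for Q1 using scikit-learn's built-in digits dataset; the kernel, gamma and split
are assumptions:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

# RBF-kernel SVM; gamma and C chosen as reasonable defaults, not tuned
model = SVC(kernel='rbf', gamma=0.001, C=1.0)
model.fit(X_train, y_train)

print('Test accuracy:', model.score(X_test, y_test))
print('Prediction for the first test image:', model.predict(X_test[:1]))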
