SK Learn

SKLEARN (SciPy Toolkit):
● Scikit-learn (sklearn) is one of the most widely used and robust libraries for
machine learning in Python.
● It provides a selection of efficient tools for machine learning and statistical
modeling, including classification, regression, clustering, and dimensionality
reduction, via a consistent interface in Python.
● It is easy to use, efficient, and well-documented.
● Scikit-learn is also compatible with a variety of other Python libraries, such as
NumPy and SciPy. This makes it easy to integrate scikit-learn into larger data
science workflows.
● It features a range of algorithms for each task:
● Classification: SVM, nearest neighbors, random forest (applications: spam
detection, image recognition).
● Regression: SVR, nearest neighbors, random forest (applications: drug response
prediction, stock price prediction).
● Clustering: k-Means, spectral clustering, mean-shift (applications: customer
segmentation, grouping experiment outcomes).
● Dimensionality reduction: PCA, feature selection, non-negative matrix
factorization (applications: visualization, increased efficiency).
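The consistent interface mentioned above can be illustrated with a minimal sketch. It assumes the built-in iris dataset purely as placeholder data, and the hyperparameter values are arbitrary choices for illustration. Three different classifiers are trained and queried through the same fit() and predict() calls:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Small built-in dataset, used here only to have something to fit on.
X, y = load_iris(return_X_y=True)

# Three different algorithms, all driven through the same fit/predict calls.
estimators = [
    SVC(),
    KNeighborsClassifier(n_neighbors=5),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

predictions = {}
for estimator in estimators:
    estimator.fit(X, y)
    name = type(estimator).__name__
    predictions[name] = estimator.predict(X[:5])
    print(name, predictions[name])
```

Because every estimator exposes the same methods, swapping one algorithm for another usually means changing a single line.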
Supervised and Unsupervised learning:

Supervised learning:
Supervised learning, as the name indicates, involves a supervisor acting as a
teacher. We train the machine using data that is well labelled, meaning each
example is already tagged with the correct answer. The supervised learning
algorithm analyses this training data (the set of training examples). The machine
is then given a new set of examples, and it uses what it learned from the labelled
data to produce the correct outcome for them.
For instance, suppose there is a basket filled with different kinds of fruit. The first
step is to train the machine with the different fruits one by one, like this:
● If the object is rounded, has a depression at the top, and is red in
color, then it is labeled as Apple.
● If the object is a long curving cylinder with a green-yellow color,
then it is labeled as Banana.
After training, the machine is given a new, separate fruit from the basket, say a banana,
and asked to identify it.
Since the machine has already learned from the previous data, it now has
to apply that knowledge.
It first classifies the fruit by its shape and color, then confirms the fruit name as
BANANA and puts it in the Banana category.
Thus, the machine learns from training data (the basket containing fruits) and
then applies the knowledge to test data (the new fruit).
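The fruit example can be sketched in code. The numeric encoding below (is_round, has_top_depression, color_code) is a hypothetical one invented for this illustration, and the decision tree is just one possible choice of classifier:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoding of each fruit: [is_round, has_top_depression, color_code]
# color_code: 0 = red, 1 = green-yellow
training_fruits = [
    [1, 1, 0],  # apple
    [1, 1, 0],  # apple
    [0, 0, 1],  # banana
    [0, 0, 1],  # banana
]
training_labels = ["apple", "apple", "banana", "banana"]

# Train on the labelled basket.
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(training_fruits, training_labels)

# A new fruit: long and curved (not round), no depression, green-yellow.
new_fruit = [[0, 0, 1]]
prediction = classifier.predict(new_fruit)
print(prediction[0])
```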
Supervised learning is classified into two categories of algorithms:

● Classification: A classification problem is when the output variable is a
category, such as “red” or “blue”, “disease” or “no disease”.
● Regression: A regression problem is when the output variable is a continuous
numeric value, such as an amount in dollars or a weight.
Supervised learning deals with or learns with “labeled” data. This implies that
some data is already tagged with the correct answer.
Types: -
● Regression
● Logistic Regression
● Classification
● Naive Bayes Classifiers
● K-NN (k nearest neighbors)
● Decision Trees
● Support Vector Machine
Advantages: -
● Supervised learning lets a model produce outputs for new data based on
previous experience (labelled examples).
● Helps to optimize performance criteria with the help of experience.
● Supervised machine learning helps to solve various types of real-world
computation problems.
Disadvantages: -
● Classifying big data can be challenging.
● Training for supervised learning can require a lot of computation time.
Unsupervised learning:
Unsupervised learning is the training of a machine using information that is neither
classified nor labeled, allowing the algorithm to act on that information without
guidance. Here the task of the machine is to group unsorted information according to
similarities, patterns, and differences without any prior training on labeled data.
Unlike supervised learning, no teacher is provided, which means no labeled training
examples are given to the machine.
Therefore, the machine is left to find the hidden structure in unlabeled data by
itself.
For instance, suppose it is given a set of images containing dogs and cats, which it
has never seen.
The machine has no idea about the features of dogs and cats, so it cannot
label the images as ‘dogs’ and ‘cats’. But it can group them according to their
similarities, patterns, and differences, i.e., it can split the set
into two parts. The first may contain all pictures with dogs in them and the second
may contain all pictures with cats in them. The machine learned nothing beforehand,
which means there was no training data or examples.
Unsupervised learning allows the model to work on its own to discover patterns and
information that were previously undetected. It mainly deals with unlabelled data.
Unsupervised learning is classified into two categories of algorithms:
● Clustering: A clustering problem is where we want to discover the inherent
groupings in the data, such as grouping customers by purchasing behavior.
● Association: An association rule learning problem is where we want to
discover rules that describe large portions of our data, such as people that
buy X also tend to buy Y.
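Clustering can be sketched with KMeans. The six points below are made-up values chosen to form two obvious groups, and the choice of two clusters is an assumption of this example. No labels are passed to the model:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up, unlabeled 2-D points forming two well-separated groups.
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
    [8.0, 8.1], [7.9, 8.3], [8.2, 7.8],
])

# Ask for two clusters; only the points are given, no labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(points)

# Points in the same group receive the same cluster id (0 or 1).
print(cluster_ids)
```

The cluster ids are arbitrary: which group is called 0 and which is called 1 carries no meaning, only the grouping does.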
Types of Unsupervised Learning: -
Clustering: -
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
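Clustering and Related Techniques: -
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
Items 3 to 5 are dimensionality reduction or decomposition techniques rather than
clustering algorithms; they are listed here because they also work on unlabeled data.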
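As an illustration of Principal Component Analysis from the list above, the following sketch reduces the 30 features of scikit-learn's built-in breast cancer dataset to 2 components. The choice of dataset and of 2 components is an assumption made for this example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

# Only the features are used; PCA does not look at the labels.
features = load_breast_cancer()["data"]

# Project the 30-dimensional data onto its first 2 principal components.
pca = PCA(n_components=2)
reduced = pca.fit_transform(features)

print(features.shape)   # original shape
print(reduced.shape)    # shape after reduction
print(pca.explained_variance_ratio_)
```

Two components are convenient for plotting; explained_variance_ratio_ indicates how much of the original variance each component retains.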
Comparison Between Supervised and Unsupervised Learning:

Parameters               | Supervised machine learning       | Unsupervised machine learning
-------------------------|-----------------------------------|----------------------------------
Input Data               | Algorithms are trained using      | Algorithms are used against data
                         | labeled data.                     | that is not labeled.
Computational Complexity | Simpler method                    | Computationally complex
Accuracy                 | Highly accurate                   | Less accurate
No. of classes           | No. of classes is known           | No. of classes is not known
Data Analysis            | Uses offline analysis             | Uses real-time analysis of data
Algorithms used          | Linear and Logistic regression,   | K-Means clustering,
                         | Random Forest,                    | Hierarchical clustering,
                         | Support Vector Machine,           | Apriori algorithm, etc.
                         | Neural Network, etc.              |

Building a Machine Learning Classifier in Python with Scikit-learn
Step 1 — Importing Scikit-learn
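A minimal sketch of this step, assuming scikit-learn is already installed in the environment: the package is installed under the name scikit-learn but imported as sklearn, and printing the version is a quick way to confirm the import works.

```python
import sklearn

# The package is installed as "scikit-learn" but imported as "sklearn".
print(sklearn.__version__)
```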
Step 2 — Importing Scikit-learn’s Dataset or user defined dataset
Example: Breast Cancer Wisconsin Diagnostic Database. The dataset includes
various information about breast cancer tumors, as well as classification labels
of malignant or benign. The dataset has 569 instances (one per tumor) and
includes information on 30 attributes, or features, such as the radius of the tumor,
texture, smoothness, and area.
Using this dataset, we will build a machine learning model that uses tumor information to
predict whether a tumor is malignant or benign.
Scikit-learn ships with various datasets which we can load into Python, and
the dataset we want is included. Import and load the dataset.
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']
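Before moving on, it can help to look at what was loaded. The sketch below repeats the loading code so that it runs on its own, then prints the label names, the size of the feature matrix, and the first record:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

# Class names; labels are stored as integer indices into this array.
print(label_names)

# Number of tumors (rows) and number of features (columns).
print(features.shape)

# First record: its label index, the first feature name, and its value.
print(labels[0], feature_names[0], features[0][0])
```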
Step 3 — Organizing Data into Sets
To evaluate how well a classifier is performing, we should always test the model on
unseen data. Therefore, before building a model, split our data into two parts: a training
set and a test set.
We use the training set to train and evaluate the model during the development stage.
We then use the trained model to make predictions on the unseen test set. This
approach gives us a sense of the model’s performance and robustness.
Fortunately, sklearn has a function called train_test_split(), which divides data into
these sets. Import the function and then use it to split the data:
from sklearn.model_selection import train_test_split
# Split our data
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.33, random_state=42)
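To confirm the split behaves as intended, the sizes of the resulting sets can be checked. This sketch repeats the loading and splitting code so that it is self-contained; with test_size=0.33, roughly a third of the 569 records should end up in the test set:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
features = data['data']
labels = data['target']

train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.33, random_state=42)

# Every record lands in exactly one of the two sets.
print(len(train), len(test))

# Features and labels stay aligned within each set.
print(len(train) == len(train_labels), len(test) == len(test_labels))
```

Fixing random_state makes the split reproducible, so repeated runs evaluate the model on the same test records.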
Step 4 — Building and Evaluating the Model
There are many models for machine learning, and each model has its own strengths
and weaknesses. Here we use Naive Bayes (NB), a simple algorithm that usually
performs well in binary classification tasks.
First, import the GaussianNB class. Then initialize the model with
GaussianNB(), and train it by fitting it to the data using gnb.fit():
from sklearn.naive_bayes import GaussianNB
# Initialize our classifier
gnb = GaussianNB()
# Train our classifier
model = gnb.fit(train, train_labels)
After we train the model, we can then use the trained model to make predictions on our
test set, which we do using the predict() function. The predict() function returns an
array of predictions for each data instance in the test set. We can then print our
predictions to get a sense of what the model determined.
Use the predict() function with the test set and print the results:
# Make predictions
preds = gnb.predict(test)
print(preds)
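The printed predictions are integer class indices. As a small self-contained sketch, they can be turned back into readable names by indexing label_names with the prediction array; the variable predicted_names is introduced here only for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
label_names = data['target_names']

train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.33, random_state=42)

gnb = GaussianNB()
gnb.fit(train, train_labels)
preds = gnb.predict(test)

# Map each predicted index (0 or 1) to its class name.
predicted_names = label_names[preds]
print(predicted_names[:10])
```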
Step 5 — Evaluating the Model’s Accuracy
Using the array of true class labels, we can evaluate the accuracy of our model’s
predicted values by comparing the two arrays (test_labels vs. preds). We will use
the sklearn function accuracy_score() to determine the accuracy of our machine
learning classifier.
from sklearn.metrics import accuracy_score
# Evaluate accuracy
print(accuracy_score(test_labels, preds))
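Accuracy is a single number, so it can hide which kind of mistake the classifier makes. As an optional extra not covered in the steps above, this self-contained sketch uses confusion_matrix() and classification_report() from sklearn.metrics to break the results down per class:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.33, random_state=42)

gnb = GaussianNB()
gnb.fit(train, train_labels)
preds = gnb.predict(test)

# Overall fraction of correct predictions.
accuracy = accuracy_score(test_labels, preds)
print(accuracy)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(test_labels, preds)
print(matrix)

# Precision, recall, and F1 score for each class.
print(classification_report(test_labels, preds,
                            target_names=data['target_names']))
```

In a medical setting such as this one, the off-diagonal cells of the confusion matrix matter: a malignant tumor predicted as benign is a different kind of error from the reverse.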
