unit1

MACHINE LEARNING WITH PYTHON
SEMESTER 5
UNIT - 1

HI COLLEGE

INTRODUCTION TO MACHINE LEARNING
Machine learning is a rapidly growing field in artificial intelligence (AI) that
focuses on enabling computers to learn and improve from experience without
being explicitly programmed. In other words, machine learning algorithms can
automatically learn and make predictions or decisions based on data.

Machine learning algorithms are trained on large datasets, which they use to
learn patterns and relationships in the data. The algorithms then use this
knowledge to make predictions or decisions about new, unseen data.

WHY MACHINE LEARNING


1. Large amounts of data: With the rise of big data, there is an overwhelming
amount of data being generated every day. Machine learning algorithms can
process and analyze this data to extract insights and make predictions.

2. Accurate predictions: Machine learning algorithms can make accurate
predictions based on historical data, which can help businesses make informed
decisions. For example, in finance, machine learning algorithms can predict
stock prices or credit risk.

3. Faster decision-making: Machine learning algorithms can make decisions
faster than humans, as they can process large amounts of data quickly and
accurately. This can help businesses respond quickly to changing market
conditions or customer needs.

4. Personalization: Machine learning algorithms can personalize products and
services based on a user's preferences and past behavior, which can improve
the user experience and increase customer satisfaction.

5. Cost savings: By automating tasks that would typically require human
intervention, machine learning algorithms can save businesses time and
money. For example, in healthcare, machine learning algorithms can help
clinicians diagnose some diseases faster and more accurately, which can save
lives and reduce healthcare costs.

6. Improved efficiency: Machine learning algorithms can improve the efficiency
of various processes by automating repetitive tasks or identifying areas for
improvement. For example, in manufacturing, machine learning algorithms can
optimize production processes by predicting equipment failures before they
occur.
HiCollege Click Here For More Notes 01
TYPES OF MACHINE LEARNING PROBLEMS
Machine learning algorithms can be applied to various types of problems,
depending on the nature of the data and the desired output. Here are some
common types of machine learning problems:

1. Classification: In classification problems, the algorithm is trained to predict
the category or class of a new, unseen data point based on its features. For
example, in image recognition, the algorithm is trained to classify images into
different categories, such as cats or dogs.

2. Regression: In regression problems, the algorithm is trained to predict a
continuous numerical value based on input features. For example, in finance,
the algorithm can predict stock prices based on historical data.

3. Clustering: In clustering problems, the algorithm is trained to group similar
data points together based on their features. For example, in customer
segmentation, the algorithm can group customers with similar buying
behaviors together.

4. Anomaly detection: In anomaly detection problems, the algorithm is trained
to identify unusual or abnormal data points that do not fit the normal pattern
or distribution of the data. For example, in fraud detection, the algorithm can
identify unusual credit card transactions that may be fraudulent.

5. Dimensionality reduction: In dimensionality reduction problems, the
algorithm is trained to reduce the number of input features while preserving
most of the important information in the data. This can help simplify complex
problems and make them easier to understand and analyze.

6. Reinforcement learning: In reinforcement learning problems, the algorithm
learns by interacting with its environment and receiving feedback in the form
of rewards or penalties. The goal is to find a policy that maximizes the
cumulative reward over time. For example, in robotics, the algorithm can learn
to navigate a maze by interacting with it and receiving feedback from sensors.
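
Two of the problem types above, clustering and dimensionality reduction, can be sketched in a few lines. This is only an illustration: the data is randomly generated for this example, and the choice of k-means and PCA from scikit-learn is an assumption, not something the notes prescribe.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical data: two well-separated groups of points with 3 features each.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 3))
group_b = rng.normal(loc=5.0, scale=0.5, size=(20, 3))
X = np.vstack([group_a, group_b])

# Clustering: group similar points together without using any labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: project the 3 features onto 2 components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("cluster of each point:", cluster_ids)
print("shape after PCA:", X_2d.shape)
```

The cluster numbers themselves (0 or 1) are arbitrary; what matters is that points from the same group receive the same number.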

APPLICATIONS OF MACHINE LEARNING.
Machine learning has a wide range of applications across various industries and
domains. Here are some examples:

1. Finance: Machine learning algorithms are used in finance for tasks such as
fraud detection, credit scoring, stock price prediction, and portfolio
optimization.

2. Healthcare: Machine learning algorithms are used in healthcare for tasks such
as medical image analysis, disease diagnosis, drug discovery, and personalized
medicine.

3. Retail: Machine learning algorithms are used in retail for tasks such as
demand forecasting, inventory optimization, personalized recommendations,
and pricing optimization.

4. Manufacturing: Machine learning algorithms are used in manufacturing for
tasks such as predictive maintenance, quality control, and supply chain
optimization.

5. Transportation: Machine learning algorithms are used in transportation for
tasks such as traffic prediction, route optimization, and autonomous driving.

6. Education: Machine learning algorithms are used in education for tasks such
as student performance prediction, personalized learning, and intelligent
tutoring systems.

7. Energy: Machine learning algorithms are used in energy for tasks such as
wind turbine performance prediction, energy demand forecasting, and smart
grid optimization.

8. Cybersecurity: Machine learning algorithms are used in cybersecurity for tasks
such as network intrusion detection, malware detection, and anomaly
detection.

SUPERVISED MACHINE LEARNING- REGRESSION AND
CLASSIFICATION.
Supervised machine learning is a type of machine learning in which an algorithm is
trained on labeled data to make predictions or decisions for new, unseen data
points. In supervised learning, the algorithm learns a function that maps input
features to output labels or values. There are two main types of supervised
learning problems: regression and classification.

1. Regression: In regression problems, the output label is a continuous
numerical value. The goal is to predict this value based on the input features.
For example, in finance, regression can be used to predict stock prices based on
historical data. The output label is the stock price, and the input features might
include factors such as company earnings, economic indicators, and market
trends.

2. Classification: In classification problems, the output label is a category or
class. The goal is to predict which category a new, unseen data point belongs to
based on its input features. For example, in image recognition, the algorithm is
trained to classify images into different categories, such as cats or dogs. The
input features might include pixel values or features extracted from the image
using techniques such as convolutional neural networks (CNNs).
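
The difference between the two problem types shows up in the kind of label passed to the model. The sketch below uses a made-up dataset (hours studied, exam score, pass/fail) and assumes scikit-learn's LinearRegression and LogisticRegression as the two models; any other regressor or classifier would illustrate the same point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical input feature: hours studied (one column, six students).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Regression: the label is a continuous value (an exam score).
scores = np.array([52.0, 55.0, 61.0, 68.0, 71.0, 77.0])
regressor = LinearRegression().fit(X, scores)
predicted_score = regressor.predict([[7.0]])[0]

# Classification: the label is a category (0 = fail, 1 = pass).
passed = np.array([0, 0, 0, 1, 1, 1])
classifier = LogisticRegression().fit(X, passed)
predicted_class = classifier.predict([[7.0]])[0]

print("predicted score for 7 hours:", predicted_score)
print("predicted class for 7 hours:", predicted_class)
```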

BINARY CLASSIFIER, MULTICLASS CLASSIFICATION,
MULTILABEL CLASSIFICATION
In binary classification, the output label is a binary value, indicating which of
two classes a data point belongs to. For example, in spam filtering,
the algorithm is trained to classify emails as either spam or not spam.

In multiclass classification, the output label is a categorical value, indicating
which of multiple classes a data point belongs to. For example, in image
recognition, the algorithm is trained to classify images into one of several
categories, such as cats, dogs, birds, and cars.

In multilabel classification, the output label is a set of binary values, indicating
which of multiple labels apply to a data point. For example, in document
classification, the algorithm is trained to classify documents into multiple
categories simultaneously.

The techniques used for training and evaluating binary classifiers can be
adapted for multiclass and multilabel classification problems as well. However,
multiclass and multilabel classification problems can be more challenging than
binary classification due to the increased number of classes and labels that
need to be considered. Techniques such as one-vs-rest (OVR) and one-vs-one
(OVO) are commonly used for multiclass classification, while techniques such as
binary relevance (BR) and label powerset (LP) are commonly used for multilabel
classification.
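
As a rough illustration of these label formats, the sketch below builds a multilabel target matrix and trains a one-vs-rest multiclass model. The tags and feature values are invented for the example, and scikit-learn's MultiLabelBinarizer and OneVsRestClassifier are assumed as one possible way to do this.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Multilabel targets: each document may carry several tags at once.
tags = [{"sports"}, {"politics", "economy"}, {"sports", "economy"}, {"politics"}]
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tags)  # one 0/1 column per label
print(binarizer.classes_)
print(Y)

# Multiclass targets: exactly one of three classes per data point.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0],
              [5.1, 4.9], [0.0, 5.0], [0.1, 5.2]])
y = np.array([0, 0, 1, 1, 2, 2])

# One-vs-rest trains one binary classifier per class.
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)
n_binary_models = len(ovr.estimators_)
print("binary models trained:", n_binary_models)
```

Binary relevance applies the same idea to multilabel data: one independent binary classifier is trained for each column of Y.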

PERFORMANCE MEASURES- CONFUSION MATRIX, ACCURACY,
PRECISION & RECALL, ROC CURVE.
In supervised machine learning, the performance of a binary classifier can be
evaluated using various metrics, depending on the specific problem and the
desired outcome. Some commonly used performance measures are:

1. Confusion Matrix: A confusion matrix is a tabular representation of the
performance of a binary classifier on a test set. It shows the number of true
positives (TP), true negatives (TN), false positives (FP), and false negatives (FN)
for the classifier's predictions.

2. Accuracy: Accuracy is the fraction of correctly classified data points in the test
set. It is calculated as (TP + TN) / (TP + TN + FP + FN). Accuracy is a simple and
intuitive metric, but it may not be meaningful for imbalanced datasets or when
the cost of misclassification is unequal for different classes.

3. Precision & Recall: Precision and recall both describe how well the classifier
handles the positive class. Precision is the fraction of true positives
among all positive predictions, while recall is the fraction of true positives
among all actual positives. They are calculated as precision = TP / (TP + FP) and
recall = TP / (TP + FN), respectively. Precision shows how far a positive
prediction can be trusted, and recall shows how many of the actual positives the
classifier finds.

4. ROC Curve: The receiver operating characteristic (ROC) curve is a graphical
representation of the trade-off between true positive rate (TPR = recall) and
false positive rate (FPR = 1 - TNR = 1 - specificity) for different threshold values
used to classify data points as positive or negative. The area under the ROC
curve (AUC) is a single metric that summarizes the overall performance of the
classifier, with higher values indicating better performance: 0.5 corresponds to
random guessing and 1.0 to perfect separation. The ROC curve is useful when the
cost of misclassification is unequal for different classes, as it shows the
classifier's ability to distinguish between positive and negative examples at
every threshold. On heavily imbalanced datasets, a precision-recall curve is
often more informative, because FPR can stay low even when most positive
predictions are wrong.
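
The formulas above can be checked on a small example. The labels and scores below are invented for illustration, the 0.5 threshold is an assumption, and scikit-learn's confusion_matrix and roc_auc_score are used only to obtain the counts and the AUC.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical test set: true labels and the classifier's predicted scores.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.05])

# Turn scores into class predictions with an assumed threshold of 0.5.
y_pred = (y_score >= 0.5).astype(int)

# For labels 0 and 1, the matrix is laid out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# AUC uses the raw scores, so it does not depend on the chosen threshold.
auc = roc_auc_score(y_true, y_score)

print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
print("AUC:", auc)
```

Here TP = 3, TN = 5, FP = 1, and FN = 1, so accuracy is 8/10 = 0.8 while precision and recall are both 3/4 = 0.75.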

ADVANCED PYTHON- NUMPY, PANDAS


NumPy and Pandas are two popular libraries in Python that are widely used for
data manipulation, analysis, and visualization. Here's a brief overview of their
features:

1. NumPy: NumPy (Numerical Python) is a library for the Python programming
language, adding support for large, multi-dimensional arrays and matrices,
along with a large collection of high-level mathematical functions to operate
on these arrays. Some key features of NumPy include:

- N-dimensional array objects (ndarrays)
- Fast arithmetic operations on arrays
- Broadcasting (automatic extension of arrays to match the shape required by
  an operation)
- Ufuncs (universal functions) for element-wise mathematical operations
- Linear algebra functions (matrix multiplication, determinant,
  eigenvalues/vectors, etc.)
- Fourier transforms (FFT)
- Random number generation

2. Pandas: Pandas is a library for data manipulation and analysis in Python. It
provides data structures designed to make working with structured and
labeled data easier and more efficient. Some key features of Pandas include:

- DataFrame: A 2D labeled data structure with columns of potentially
  different data types. It is a flexible and generalized extension of the concept
  of a spreadsheet or SQL table.
- Series: A 1D labeled data structure with an index and values of any data type.
  It is similar to a column in a DataFrame or a vector in NumPy.
- Index: A label or key for rows or columns in a DataFrame or Series. It can be
  integer, float, or string based, and can be used to efficiently slice and index
  DataFrames and Series.
- Merging and joining: Pandas provides efficient methods for merging and
  joining multiple DataFrames based on common columns or keys.
- Time series: Pandas has built-in support for time series data, including
  resampling, rolling window calculations, and date/time handling functions.
- Grouping: Pandas provides powerful grouping functions that allow us to
  group rows based on one or more columns and perform aggregate
  calculations on each group.
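
A few of the NumPy and Pandas features listed above can be tried out as follows. All array values, column names, and table contents are made up for this sketch.

```python
import numpy as np
import pandas as pd

# --- NumPy ---
a = np.array([[1.0, 2.0], [3.0, 4.0]])  # a 2-D ndarray
row = np.array([10.0, 20.0])

broadcast_sum = a + row         # row is broadcast across both rows of a
roots = np.sqrt(a)              # ufunc applied element-wise
product = a @ a                 # matrix multiplication
determinant = np.linalg.det(a)  # linear algebra routine

rng = np.random.default_rng(seed=0)
samples = rng.normal(size=3)    # random number generation

# --- Pandas ---
sales = pd.DataFrame({"store": ["A", "A", "B", "B"],
                      "units": [3, 5, 2, 8]})
regions = pd.DataFrame({"store": ["A", "B"],
                        "region": ["north", "south"]})

units_column = sales["units"]                          # a Series
units_per_store = sales.groupby("store")["units"].sum()  # grouping
merged = sales.merge(regions, on="store")              # joining on a key

print(broadcast_sum)
print(units_per_store)
print(merged)
```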

PYTHON MACHINE LEARNING LIBRARY SCIKIT-LEARN


Scikit-Learn is a popular open-source machine learning library for Python. It
provides a wide range of tools for data preprocessing, model selection, and
model fitting, as well as support for grid search and cross-validation. Some key
features of Scikit-Learn include:

1. Supervised Learning: Scikit-Learn provides a variety of supervised learning
algorithms for regression and classification tasks, including linear regression,
logistic regression, decision trees, random forests, support vector machines
(SVMs), and neural networks.

2. Unsupervised Learning: Scikit-Learn also includes a range of unsupervised
learning algorithms for clustering and dimensionality reduction tasks, such as
k-means clustering, hierarchical clustering, principal component analysis (PCA),
and t-distributed stochastic neighbor embedding (t-SNE).

3. Model Selection: Scikit-Learn provides tools for model selection, including
grid search and randomized search to find the best hyperparameters for a
given algorithm.

4. Preprocessing: Scikit-Learn includes functions for data preprocessing, such
as scaling, normalization, feature selection, and missing value imputation.

5. Evaluation: Scikit-Learn provides functions for evaluating the performance of
machine learning models, such as confusion matrix, classification report, and
ROC curve.

6. Cross-validation: Scikit-Learn supports cross-validation techniques such as
k-fold cross-validation, leave-one-out cross-validation (LOOCV), and stratified
k-fold cross-validation for model evaluation and selection.

7. Ensemble Methods: Scikit-Learn includes ensemble methods such as
bagging (bootstrap aggregating) and boosting to improve the performance of
machine learning models by combining multiple weak learners into a strong
learner.

8. Pipelines: Scikit-Learn provides a pipeline interface to chain together
multiple transformers and estimators in a single object for easy model training
and evaluation.
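
Several of these features are typically used together. The sketch below chains preprocessing and a classifier in a pipeline and tunes one hyperparameter with grid search and stratified k-fold cross-validation. The Iris dataset that ships with scikit-learn, the logistic regression model, and the candidate values of C are all choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Pipeline: scale the features, then fit a classifier.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Grid search over the regularization strength of the "model" step.
param_grid = {"model__C": [0.1, 1.0, 10.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv)
search.fit(X, y)

best_C = search.best_params_["model__C"]
best_accuracy = search.best_score_  # mean cross-validated accuracy

print("best C:", best_C)
print("cross-validated accuracy:", best_accuracy)
```

Because the scaler sits inside the pipeline, it is refitted on the training part of every fold, which avoids leaking information from the validation part.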

LINEAR REGRESSION WITH ONE VARIABLE


Linear regression is a statistical technique used to model the relationship
between a dependent variable (y) and an independent variable (x) by fitting a
linear equation to the data. In this case, we'll be discussing linear regression
with one variable.

The formula for linear regression with one variable is:

y = mx + b

where m is the slope of the line, b is the y-intercept, and x is the independent
variable. The coefficient of determination (R²) is used to measure the goodness
of fit of the model.

In Python, we can use the NumPy and SciPy libraries to perform linear regression
with one variable. Here's an example:
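
The original listing is not reproduced in these notes, so the following is a minimal sketch of such an example rather than the original code. The sample data and the helper name plot_fit are invented; scipy.stats.linregress is the function the notes refer to.

```python
import numpy as np
from scipy import stats

# Hypothetical sample data that roughly follows y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Fit y = m*x + b by least squares.
result = stats.linregress(x, y)
slope = result.slope          # m
intercept = result.intercept  # b
r_squared = result.rvalue ** 2  # linregress returns r, so square it for R²

print("slope:", slope)
print("intercept:", intercept)
print("R squared:", r_squared)


def plot_fit(x, y, slope, intercept):
    """Plot the data points and the fitted line (requires matplotlib)."""
    import matplotlib.pyplot as plt

    plt.scatter(x, y, label="data")
    plt.plot(x, slope * x + intercept, color="red", label="regression line")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.show()


# plot_fit(x, y, slope, intercept)  # uncomment to draw the figure
```

For this data the fitted slope is about 1.97 and the intercept about 0.09, with R² above 0.99.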

In this example, we first import the necessary libraries, such as NumPy and
matplotlib. We then create a sample dataset with x and y values. Next, we
calculate the coefficients using scipy's linregress function, which returns the
slope, the intercept, and the correlation coefficient r (whose square is R²),
along with some other statistics. Finally, we plot the data and regression
line using the matplotlib library.

LINEAR REGRESSION WITH MULTIPLE VARIABLES
Linear regression with multiple variables is used to model the relationship
between a dependent variable (y) and multiple independent variables (x1, x2, ...,
xn). The formula for linear regression with multiple variables is:

y = b0 + b1*x1 + b2*x2 + ... + bn*xn

where b0 is the y-intercept, and bi (i = 1, 2, ..., n) are the coefficients for each
independent variable.

In Python, we can use the NumPy, SciPy, and Pandas libraries to perform linear
regression with multiple variables. Here's an example:

In this example, we first import the necessary libraries (NumPy, Pandas, and
matplotlib) and create a sample dataset in a pandas DataFrame with x1, x2,
and y values. Next, we run scipy's linregress function separately for each
independent variable and print the slope, intercept, and R² value of each fit.
Note that linregress handles only one predictor at a time, so these are two
separate simple regressions rather than the joint coefficients b0, b1, and b2
of the formula above. A constant np.ones() column cannot serve as a linregress
predictor either, because it has no variance and no slope can be estimated from
it. Estimating all coefficients together requires a least-squares fit on the
full feature matrix, which can be built by concatenating the columns into a
single array with NumPy's hstack or column_stack function.
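
A joint fit of all coefficients can be sketched with NumPy alone. This is not the original example: the data is invented and deliberately noise-free (generated from y = 1 + 2*x1 + 3*x2) so that the recovered coefficients are easy to check, and numpy.linalg.lstsq is assumed as the solver.

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 without noise.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2

# Design matrix: a column of ones for the intercept b0, then x1 and x2.
A = np.column_stack([np.ones_like(x1), x1, x2])

# Solve for [b0, b1, b2] in the least-squares sense.
coeffs, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = coeffs

# R² of the joint model: 1 - (residual sum of squares / total sum of squares).
y_hat = A @ coeffs
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print("b0, b1, b2:", b0, b1, b2)
print("R squared:", r_squared)
```

With real, noisy data the coefficients would only approximate the underlying relationship and R² would fall below 1.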

LOGISTIC REGRESSION
Logistic regression is a statistical analysis technique used to predict the
probability of a binary outcome (dependent variable) based on one or more
independent variables. Let's consider a simple example to understand logistic
regression theoretically.

Let's say we want to predict whether a person will buy a product or not based
on their age and income. We have a dataset with these variables and the binary
outcome (whether the person bought the product or not).

First, we calculate the odds of buying the product for each age and income
level using the formula:

Odds = Probability of buying / Probability of not buying

For example, if the probability of buying for people aged 25 with an income of
$25,000 is 0.6, then the odds would be:

Odds = 0.6 / (1 - 0.6) = 1.5

Next, we calculate the logarithm of these odds using the natural logarithm (ln)
to get the logit values. The logit is the logarithm of the odds and is used as a
linear predictor in logistic regression:

Logit = ln(Odds) = ln(Probability of buying / Probability of not buying)

For example, if the odds for people aged 25 with an income of $25,000 are 1.5,
then the logit would be:

Logit = ln(1.5) ≈ 0.405

Now, logistic regression models these logit values as a linear function of age
and income levels. The coefficients are usually estimated by maximum likelihood
rather than by ordinary least squares:

Logit = b0 + b1*Age + b2*Income

Here, b0 is the intercept (the logit value when age and income are both zero),
b1 is the coefficient for age, and b2 is the coefficient for income. By
exponentiating both sides of this equation, we can convert it back to odds:

Odds = e^(b0 + b1*Age + b2*Income)

Dividing the odds by (1 + odds) then gives the probability of buying:

Probability = e^(b0 + b1*Age + b2*Income) / (1 + e^(b0 + b1*Age + b2*Income))

By fitting this model to our data using logistic regression, we can predict
the probability of buying based on age and income levels for new individuals.
This helps us understand which factors are most important in predicting
whether someone will buy our product and how we can target our marketing
efforts accordingly.
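
The chain from probability to odds to logit and back can be verified numerically. In the sketch below, the coefficient values b0, b1, and b2 are made up purely for illustration; they are not fitted to any data.

```python
import math

# Probability -> odds -> logit, using the example value from the text.
p = 0.6
odds = p / (1 - p)      # about 1.5
logit = math.log(odds)  # about 0.405


def sigmoid(z):
    """Convert a logit value back to a probability: e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))


recovered_p = sigmoid(logit)  # gives back 0.6 (up to rounding)

# Hypothetical coefficients for Logit = b0 + b1*Age + b2*Income.
b0, b1, b2 = -4.0, 0.05, 0.0001
age, income = 25, 25000

z = b0 + b1 * age + b2 * income          # the logit for this person
predicted_odds = math.exp(z)             # odds of buying
probability = predicted_odds / (1 + predicted_odds)  # same as sigmoid(z)

print("odds:", odds, "logit:", logit, "recovered p:", recovered_p)
print("logit for the person:", z, "probability of buying:", probability)
```

With these assumed coefficients the logit is -0.25, so the predicted probability of buying is a little below 0.5.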
