
Module – 3: Machine Learning Algorithm and Usage in Applications

Machine Learning Algorithm and Usage in Applications: Motivating application: Filtering Spam

Spam filters are designed to identify emails that attackers or marketers use to send unwanted or
dangerous content.

Let's delve deeper into how machine learning is applied specifically in the context of spam
filtering:

1. Data Collection:

 Gather a dataset of emails, labeling each email as spam or non-spam (ham).

 The dataset should include various features, such as the email content, sender
information, metadata, and potentially other contextual information.

2. Feature Extraction:

Extract relevant features from the emails. Features could include:

 Word frequencies: Frequency of specific words or phrases associated with spam.

 Header information: Sender email address, IP address, etc.

 Structural features: HTML content, links, images, etc.

 Metadata: Timestamps, number of recipients, etc.

3. Preprocessing:

Clean and preprocess the data.

 Remove stop words and irrelevant characters.

4. Choosing a Machine Learning Algorithm:

Select a suitable algorithm for classification. Common choices include:

 Naive Bayes: Based on probability and conditional independence assumptions.

 Decision Trees or Random Forest: Suitable for feature-rich datasets.

5. Training the Model:

 Split the dataset into training and testing sets.


 Train the machine learning model using the training set, allowing it to learn patterns and
associations between features and spam/non-spam labels.

6. Model Evaluation:

 Evaluate the trained model on the held-out testing set using metrics such as accuracy, precision, recall, and F1-score.

7. Integration into Spam Filter:

 Implement the trained model into the spam filter system.

 Incoming emails are processed through the model, and a prediction is made whether each
email is spam or not.

8. Continuous Learning:

 Spam filters often employ continuous learning mechanisms to adapt to evolving spam
techniques.

 Regularly update the model with new data to ensure it remains effective against new
spam patterns.

9. User Feedback:

 Incorporate user feedback to improve the system. Users can mark false positives
(legitimate emails classified as spam) and false negatives (spam emails that bypassed the
filter), helping the system improve over time.

And here are a few suggestions regarding code you could write to identify spam:

• Try a probabilistic model. In other words, rather than relying on a few simple rules, could you combine many rules of thumb that together yield the probability of a given email being spam? This is a great idea.

• What about k-nearest neighbors or linear regression? You learned about these techniques in the
previous chapter, but do they apply to this kind of problem? (Hint: the answer is “No.”)

In this chapter, we’ll use Naive Bayes to solve this problem.
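
As an illustrative sketch (not part of the original notes), a minimal Naive Bayes spam classifier in scikit-learn might look like the following; the tiny in-line dataset is invented for demonstration:

```python
# Minimal Naive Bayes spam-filter sketch (illustrative only; the dataset is invented).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now", "cheap meds limited offer",   # spam
    "meeting agenda for monday", "lunch with the team",    # ham
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# Bag-of-words features + multinomial Naive Bayes, as described in the steps above.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["free prize offer"]))        # likely [1] (spam)
print(model.predict(["agenda for the meeting"]))  # likely [0] (ham)
```

A real filter would be trained on a large labeled corpus and would add the preprocessing, evaluation, and continuous-learning steps described above.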


Why Won’t Linear Regression Work for Filtering Spam?

Linear regression may not be the most suitable model for filtering spam due to the nature of the problem and the characteristics of the data. Here are several reasons why linear regression is not commonly used for spam filtering:

1. Output Range:

 Linear regression is designed for predicting continuous values, not classification problems. Its output is a continuous range of values, making it challenging to interpret results for a binary classification task like spam or non-spam.

2. Assumption of Linearity:

 Linear regression assumes a linear relationship between the independent variables and the dependent variable. In spam filtering, the relationships between features (e.g., words, patterns) and the outcome (spam or not spam) are often more complex and nonlinear.

3. Violation of Assumptions:

 Linear regression assumes that errors are normally distributed and have constant
variance. In spam filtering, the distribution of features and the presence of outliers
might violate these assumptions.

4. Binary Classification vs. Regression:

 Spam filtering is a binary classification problem where the goal is to categorize messages into two classes (spam or non-spam). Linear regression, being a regression technique, is not directly suited for this type of task.

5. Sensitivity to Outliers:

 Linear regression can be sensitive to outliers, which may be present in spam filtering datasets. Outliers, if not handled properly, can disproportionately influence the regression line and impact the model's performance.

How About k-nearest Neighbors?

While K-Nearest Neighbors (KNN) is a versatile algorithm, it may not be the most suitable
choice for filtering spam in some cases. Here are some reasons why KNN might not be the
optimal algorithm for spam filtering:

1. High Dimensionality:

 In spam filtering, the feature space (the set of characteristics used to represent
messages) can be high-dimensional, especially when considering the frequency of
words or patterns. KNN can suffer from the "curse of dimensionality," where the
distance between data points becomes less meaningful in high-dimensional
spaces.

2. Computational Complexity:

 KNN requires computing distances between the test instance and all training instances. As the dataset grows, this computation becomes increasingly expensive and may not be efficient for large-scale spam filtering applications.

3. Imbalanced Data:

 Spam filtering datasets are often imbalanced, with a small proportion of messages
being spam. KNN can be sensitive to imbalanced datasets, and the majority class
can dominate the predictions.

 KNN can be sensitive to noise and outliers in the data. In spam filtering, where
the presence of outliers or noisy data points is possible, this sensitivity might lead
to suboptimal performance.

4. Difficulty Handling Categorical Features:

 KNN is primarily designed for numerical data and might not handle categorical
features well without appropriate encoding. In spam filtering, textual data often
involves categorical features (e.g., words) that require careful preprocessing.

5. Storage Requirements:

 KNN requires storing the entire training dataset for predictions, making it
memory-intensive, especially for large datasets.
Naive Bayes and why it works for Filtering Spam

Bayes Classification

Bayesian classification, specifically Naive Bayes classification, is a probabilistic machine learning algorithm based on Bayes' theorem. It is widely used for classification tasks, particularly in text classification and spam filtering. The "Naive" part in its name comes from the assumption that features are conditionally independent given the class label, which simplifies the calculation.

The algorithm is based on Bayes' theorem, which calculates the probability of a hypothesis H given the evidence E:

P(H | E) = (P(E | H) * P(H)) / P(E)

Naive Bayes assumes that features are conditionally independent given the class label, which simplifies the calculation of the likelihood.

Example: A collection of 1,200 fruits contains 650 mangoes, 400 bananas, and 150 other fruits, with attribute counts as follows:

Fruit   Yellow  Sweet  Long  Total
Mango   350     450    0     650
Banana  400     300    350   400
Others  50      100    50    150
Total   800     850    400   1200

Classify a fruit X = {Yellow, Sweet, Long}.

Solution:

Bayes theorem: P(A | B) = (P(B | A) * P(A)) / P(B)

Mango:

P(X | Mango) = P(Yellow | Mango) * P(Sweet | Mango) * P(Long | Mango) ... (equation 1)

a) P(Yellow | Mango) = (P(Mango | Yellow) * P(Yellow)) / P(Mango)

= ((350/800) * (800/1200)) / (650/1200)

P(Yellow | Mango) = 0.53

b) P(Sweet | Mango) = (P(Mango | Sweet) * P(Sweet)) / P(Mango)

= ((450/850) * (850/1200)) / (650/1200)

P(Sweet | Mango) = 0.69

c) P(Long | Mango) = (P(Mango | Long) * P(Long)) / P(Mango)

= ((0/400) * (400/1200)) / (650/1200)

P(Long | Mango) = 0

By substituting in equation 1 we get, P(X | Mango) = 0.53 * 0.69 * 0

P(X | Mango) = 0

Banana:

P(X | Banana) = P(Yellow | Banana) * P(Sweet | Banana) * P(Long | Banana) equation 2

a) P(Yellow | Banana) = (P( Banana | Yellow ) * P(Yellow) )/ P (Banana)

= ((400/800) * (800/1200)) / (400/1200)

P(Yellow | Banana) = 1

b) P(Sweet | Banana) = (P( Banana | Sweet) * P(Sweet) )/ P (Banana)

= ((300/850) * (850/1200)) / (400/1200)

P(Sweet | Banana) = .75

c) P(Long | Banana) = (P(Banana | Long) * P(Long)) / P(Banana)

= ((350/400) * (400/1200)) / (400/1200)

P(Long | Banana) = 0.875

By substituting in equation 2 we get, P(X | Banana) = 1 * .75 * 0.875


P(X | Banana) = 0.6562

Others:

P(X | Others) = P(Yellow | Others) * P(Sweet | Others) * P(Long | Others) equation 3

a) P(Yellow | Others) = (P(Others | Yellow) * P(Yellow)) / P(Others)

= ((50/800) * (800/1200)) / (150/1200)

P(Yellow | Others) = 0.33

b) P(Sweet | Others) = (P( Others| Sweet ) * P(Sweet) )/ P (Others)

= ((100/850) * (850/1200)) / (150/1200)

P(Sweet | Others) = 0.67

c) P(Long | Others) = (P(Others | Long) * P(Long)) / P(Others)

= ((50/400) * (400/1200)) / (150/1200)

P(Long | Others) = 0.33

By substituting in equation 3 we get, P(X | Others) = 0.33 * 0.67 * 0.33

P(X | Others) = 0.073

Finally, comparing P(X | Mango) = 0, P(X | Banana) = 0.656, and P(X | Others) = 0.073, we conclude that the fruit X = {Yellow, Sweet, Long} is a Banana.
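
The same arithmetic can be reproduced with a short sketch in Python, working directly from the count table above (the small difference for the Others class comes from rounding intermediate values in the hand calculation):

```python
# Minimal sketch of the fruit example: likelihoods P(X | class) from the count table.
counts = {
    "Mango":  {"total": 650, "Yellow": 350, "Sweet": 450, "Long": 0},
    "Banana": {"total": 400, "Yellow": 400, "Sweet": 300, "Long": 350},
    "Others": {"total": 150, "Yellow": 50,  "Sweet": 100, "Long": 50},
}

X = ["Yellow", "Sweet", "Long"]  # the fruit to classify

for fruit, c in counts.items():
    likelihood = 1.0
    for feature in X:
        # P(feature | fruit) = count(fruit with feature) / count(fruit)
        likelihood *= c[feature] / c["total"]
    print(fruit, round(likelihood, 4))

# Prints roughly: Mango 0.0, Banana 0.6562, Others 0.0741
```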

Why it works for Filtering Spam

1. Simplicity and Efficiency:

 Naive Bayes is a simple and computationally efficient algorithm. Its simplicity


makes it easy to implement and train on large datasets, which is crucial for real-
time spam filtering in email systems.

2. Probabilistic Framework:

 Naive Bayes is based on Bayesian probability, providing a solid probabilistic


framework for classification tasks. It calculates the probability of an email being
spam or not, given the observed features.

3. Independence Assumption:
 The algorithm assumes that features (words or attributes) are conditionally
independent given the class label (spam or non-spam). Although this assumption
is "naive" and may not hold in reality, it simplifies the modeling process and often
works surprisingly well for text classification tasks.

4. Text Classification Suitability:

 Naive Bayes is particularly well-suited for text classification tasks, where the
input data can be represented as a bag of words. In the context of spam filtering,
emails can be treated as a collection of words, and the algorithm can effectively
model the likelihood of certain words occurring in spam or non-spam emails.

Data Wrangling: APIs and other tools for scraping the Web

Data wrangling is the act of converting and mapping raw data from one format into another format suitable for a different purpose.

As a data scientist, you’re not always just handed some data and asked to go figure something
out based on it. Often, you have to actually figure out how to go get some data you need to ask a
question, solve a problem, do some research, etc. One way you can do this is with an API. For
the sake of this discussion, an API (application programming interface) is something websites
provide to developers so they can download data from the website easily and in standard format.
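
For instance, a minimal Python sketch of pulling data from an API might look like this (the endpoint URL and parameters are hypothetical; a real API documents its own):

```python
import requests  # third-party HTTP library

# Hypothetical endpoint; a real API documents its own URL, parameters, and authentication.
response = requests.get("https://api.example.com/v1/records", params={"page": 1})
response.raise_for_status()   # stop on HTTP errors
data = response.json()        # the API returns data in a standard format (JSON)
print(len(data), "records downloaded")
```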

When a website doesn't provide an API, you might want to use something like the Firebug extension for Firefox. You can "inspect the element" on any web page, and Firebug allows you to grab the field inside the HTML. In fact, it gives you access to the full HTML document so you can interact with and edit it.

After locating the stuff you want inside the HTML, you can use curl, wget, grep, awk, perl, etc., to write a shell script to grab what you want, especially for a one-off grab. If you want to be more systematic, you can also do this using Python or R.
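
A minimal sketch of the more systematic Python route (the URL and the tags extracted are placeholders), using Beautiful Soup from the tool list below:

```python
import requests
from bs4 import BeautifulSoup  # Beautiful Soup, listed among the tools below

# Placeholder URL; substitute the page whose HTML you inspected.
html = requests.get("https://www.example.com/some-page").text
soup = BeautifulSoup(html, "html.parser")

# Grab, say, every link target and every second-level heading on the page.
links = [a.get("href") for a in soup.find_all("a")]
headings = [h.get_text(strip=True) for h in soup.find_all("h2")]
print(links[:5], headings[:5])
```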

Other parsing tools you might want to look into include:

lynx and lynx --dump

Beautiful Soup: Robust but kind of slow.

Mechanize: Super cool as well, but it doesn’t parse JavaScript.

PostScript: Image classification.


Extracting Meaning from Data: The Kaggle Model

Kaggle is a company whose tagline is, “We’re making data science a sport.” Kaggle forms
relationships with companies and with data scientists.

Creating a successful Kaggle model involves a combination of data exploration, feature engineering, model selection, and optimization. Below is a more detailed breakdown of the steps involved in building a Kaggle model (a short end-to-end sketch follows the list):

1. Understanding the Problem:

 Carefully read the competition description or project requirements.

 Identify the type of problem (classification, regression, etc.) and the evaluation
metric.

2. Exploratory Data Analysis (EDA):

 Load and explore the provided datasets.

 Visualize data distributions, identify outliers, and understand the relationships between variables.

 Analyze missing values and decide on strategies for handling them.

3. Data Preprocessing:

 Clean the data by addressing missing values, outliers, and inconsistencies.

 Convert categorical variables into a suitable format (one-hot encoding, label encoding, etc.).

4. Feature Engineering:

 Create new features that might capture relevant information.

 Consider interactions between variables or transformations to improve model performance.

 Use domain knowledge to engineer features that make sense for the problem.

5. Model Selection:

 Choose a set of candidate models based on the problem type (e.g., Random
Forest, Gradient Boosting, Neural Networks).
6. Model Training:

 Split the data into training and validation sets.

 Train the chosen models on the training set.

7. Validation and Hyperparameter Tuning:

 Validate models on a separate validation set to assess performance.

 Fine-tune hyperparameters to improve model accuracy.

8. Ensemble Methods:

 Create model ensembles to combine predictions from multiple models.

 Experiment with techniques such as bagging.

9. Testing and Submission:

 Evaluate the models on the competition's test set.

 Ensure that the model generalizes well to new, unseen data.

 Generate predictions on the test set and submit them to Kaggle.

10. Community Engagement:

 Participate in Kaggle forums, discussions, and kernels.

 Learn from other participants, share insights, and incorporate best practices.
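
As a hedged end-to-end sketch of steps 5 through 9 above (the file names train.csv and test.csv, the id and target column names, and the choice of Random Forest are assumptions for illustration, not from the notes):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical competition files and column names ("id", "target").
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X = train.drop(columns=["id", "target"])
y = train["target"]

# Step 6: split off a validation set.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps 5-6: train a candidate model.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_tr, y_tr)

# Step 7: validate.
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))

# Step 9: predict on the test set and write a submission file.
submission = pd.DataFrame({"id": test["id"],
                           "target": model.predict(test.drop(columns=["id"]))})
submission.to_csv("submission.csv", index=False)
```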

Feature Generation (brainstorming, role of domain expertise, and place for imagination)

Feature Generation involves creating new features (variables) from the existing data to improve the model's performance.

Brainstorming:
Divergent Thinking: Encourage open-minded, creative thinking to generate a variety of
potential features. This can involve team discussions, brainstorming sessions, or individual
idea generation.
Feature Ideas: Think about different aspects of the data that might be useful for predicting the
target variable. Consider transformations, combinations, or aggregations of existing features.

Role of Domain Expertise:


Understanding the Data: Domain experts have a deep understanding of the subject matter.
Their knowledge helps in identifying features that may have a significant impact on the target
variable.
Identifying Relevant Information: Domain experts can guide the feature generation process
by suggesting which variables are likely to be important, what interactions between variables
should be considered, and whether certain transformations are meaningful in the context of
the domain.

Place for Imagination:


Creative Transformations: Imagination comes into play when considering innovative ways to
transform or combine existing features. This could involve creating ratios, applying
mathematical functions, or introducing novel ways to represent information.

Feature Engineering: Imagine how the data could be represented differently to capture
patterns that might not be obvious initially. This might include encoding cyclical patterns
(like time of day) or representing geographic data in a way that reflects real-world
relationships.
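
As a small illustration of the point about encoding cyclical patterns (a sketch with invented toy data), hour-of-day can be mapped onto a circle with sine and cosine so that 23:00 and 00:00 end up close together:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})  # toy hour-of-day values

# Encode the 24-hour cycle as a point on a circle.
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
print(df)
```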

Feature Selection algorithms: Filters; Wrappers

Filter methods evaluate the intrinsic characteristics of features, often without involving a specific machine learning model. These methods are generally faster and less computationally intensive compared to wrappers. Common filter methods include the following (a short scikit-learn sketch follows this list):

 Variance Threshold:

 Idea: Remove features with low variance, assuming that features with little
variation are less informative.

 Implementation: Set a threshold for variance and discard features with variance
below that threshold.

 Correlation-based Feature Selection:

 Idea: Evaluate the correlation between each feature and the target variable or
between features themselves.

 Implementation: Calculate the correlation coefficient (e.g., Pearson correlation) and select features that correlate strongly with the target variable.

 Chi-squared (χ²) Test:


 Idea: Assess the dependence between each feature and the target variable for
categorical data.

 Implementation: Use the chi-squared statistic to measure the association between categorical variables. Features with high chi-squared values are considered more important.
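
A brief scikit-learn sketch of two of the filter methods above (the Iris dataset is used only because it is small and has non-negative features, as the chi-squared test requires):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

X, y = load_iris(return_X_y=True)  # small dataset with non-negative features

# Variance Threshold: drop features whose variance falls below the threshold.
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)

# Chi-squared test: keep the 2 features most associated with the class label.
X_chi2 = SelectKBest(chi2, k=2).fit_transform(X, y)

print(X.shape, X_var.shape, X_chi2.shape)
```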

Wrapper methods

Wrapper methods use a specific machine learning model to evaluate the performance of different subsets of features (a short sketch using forward selection follows the procedures below).

Forward Selection: Start with an empty set of features and iteratively add the most important
feature at each step.

1. Train the model with each individual feature and select the one that provides the
best performance.

2. Add the selected feature to the set.

3. Repeat steps 1 and 2 until a predefined stopping criterion (e.g., a specific number
of features) is met.

Cons:

 Prone to overfitting, especially with a large number of features.

Backward Elimination: Start with all features and iteratively remove the least important feature
at each step.

1. Train the model with all features and evaluate performance.

2. Remove the least important feature (e.g., based on feature importance scores).

3. Repeat steps 1 and 2 until a predefined stopping criterion is met.

Cons:

 May discard important interactions between features.

Combination of Both: Combine forward selection and backward elimination to find an optimal
subset of features.

1. Start with an empty set of features.

2. Perform forward selection until adding new features does not improve
performance.
3. Perform backward elimination until removing features does not degrade
performance.

4. Repeat steps 2 and 3 until convergence or a predefined stopping criterion.
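
A brief sketch of wrapper-style forward selection using scikit-learn's SequentialFeatureSelector (the dataset and the choice of logistic regression as the wrapped model are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Forward selection: start from no features and greedily add the feature that
# most improves cross-validated performance, until 5 features are selected.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,
    direction="forward",   # use direction="backward" for backward elimination
)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```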

Decision Trees
A decision tree is a type of supervised learning algorithm that is commonly used in machine learning to model and predict outcomes based on input data. It is a tree-like structure where each internal node tests an attribute, each branch corresponds to an attribute value, and each leaf node represents the final decision or prediction. Decision trees can be used to solve both regression and classification problems.

Root/Parent Node: This is where the splitting of the tree starts.

Decision/Child Node: This is where decisions or rules are made.

Leaf/Terminal Node: This node cannot be split further; it is the final node.

Note:

Entropy: It is the measure of impurity in a dataset


The decision to split at each node is made according to a purity metric. A node is 100% impure when its data is split evenly 50/50 between two classes, and 100% pure when all of its data belongs to a single class.

The value of entropy ranges from 0 to 1 for a 2-class dataset.

For an n-class dataset, entropy ranges from 0 to log2(n), where n is the number of classes.

Entropy(S) = - Σ (i = 1 to c) pi * log2(pi)

where

S  the dataset (or subset) whose entropy is measured

c  the number of class labels in S

pi  the proportion of examples in S belonging to the ith class

Example: Draw a decision tree for the following dataset.

Height Weight class


170 75 Female
170 90 Male
185 85 Male
155 50 Female
175 78 Female

Decision tree construction:

Root node (all 5 examples):

Height Weight class
170 75 Female
170 90 Male
185 85 Male
155 50 Female
175 78 Female

Split 1: on Height.

Height > 180:
Height Weight class
185 85 Male

Height < 180:
Height Weight class
170 75 Female
170 90 Male
155 50 Female
175 78 Female

Split 2: the Height < 180 branch is split on Weight.

Weight > 80 kg:
Height Weight class
170 90 Male

Weight < 80 kg:
Height Weight class
170 75 Female
155 50 Female
175 78 Female

Consider the following subset (S) from the decision tree (the Height < 180 branch); its entropy is computed in the sketch below:

Height Weight class
170 75 Female
170 90 Male
155 50 Female
175 78 Female

Female = 3, Male = 1, Total = 4
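
A quick sketch applying the entropy formula above to this subset (3 Female, 1 Male out of 4):

```python
from math import log2

# Entropy of the subset S: proportions of Female and Male examples.
p_female, p_male = 3 / 4, 1 / 4
entropy = -(p_female * log2(p_female) + p_male * log2(p_male))
print(round(entropy, 3))  # prints 0.811
```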

Random forest

Random forest (or random decision forest) is an ensemble method for classification and regression (a short sketch follows the steps below).

The following steps explain the working Random Forest Algorithm:

Step 1: Select random samples (Bootstrap sample) from a given data or training set.

Step 2: The algorithm constructs a decision tree for every bootstrap sample.

Step 3: Each tree makes a prediction; the predictions are combined by majority voting (for classification) or by averaging (for regression). This combination of bootstrap sampling and aggregation is known as bagging.

Step 4: Finally, select the most voted prediction result as the final prediction result.
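
A minimal scikit-learn sketch of these steps (the Iris dataset and parameter values are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 trees, each grown on a bootstrap sample (steps 1-2); predictions are
# combined by majority vote (steps 3-4).
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))
```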
