Module 3: IDS
Spam filters are designed to identify emails that attackers or marketers use to send unwanted or
dangerous content.
Let's delve deeper into how machine learning is applied specifically in the context of spam
filtering:
1. Data Collection:
The dataset should include various features, such as the email content, sender
information, metadata, and potentially other contextual information.
2. Feature Extraction:
Convert each email into numerical features, for example the frequency of words or
phrases in the subject line and body.
3. Preprocessing:
Clean and normalize the data, e.g., lowercase the text, strip HTML, and remove stop
words.
4. Model Selection:
Choose a classifier suited to text data (e.g., Naive Bayes, discussed below).
5. Model Training:
Fit the chosen model on the labeled training emails.
6. Model Evaluation:
Measure performance on held-out data using metrics such as accuracy, precision, and
recall.
7. Prediction:
Incoming emails are processed through the model, and a prediction is made as to whether
each email is spam or not.
8. Continuous Learning:
Spam filters often employ continuous learning mechanisms to adapt to evolving spam
techniques.
Regularly update the model with new data to ensure it remains effective against new
spam patterns.
9. User Feedback:
Incorporate user feedback to improve the system. Users can mark false positives
(legitimate emails classified as spam) and false negatives (spam emails that bypassed the
filter), helping the system improve over time.
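To make the pipeline above concrete, here is a minimal sketch in Python using scikit-learn. The six hard-coded emails and their labels are made-up placeholders; a real filter would be trained on a large labeled corpus.

# A minimal end-to-end sketch of the spam-filtering pipeline described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# 1. Data collection: labeled emails (1 = spam, 0 = legitimate).
emails = [
    "win a free prize now", "limited offer click here",
    "meeting agenda for monday", "please review the attached report",
    "cheap pills free shipping", "lunch tomorrow with the team",
]
labels = [1, 1, 0, 0, 1, 0]

# 2-3. Feature extraction and preprocessing: bag-of-words counts.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(emails)

# 4-5. Model selection and training.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=42)
model = MultinomialNB()
model.fit(X_train, y_train)

# 6. Model evaluation on held-out data.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 7. Prediction on an incoming email.
incoming = vectorizer.transform(["click here for a free prize"])
print("spam" if model.predict(incoming)[0] == 1 else "not spam")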
And here are a few suggestions regarding code you could write to identify spam:
• Try a probabilistic model. In other words, rather than a few simple rules, should you have
many rules of thumb that aggregate together to provide the probability that a given email is
spam? This is a great idea.
• What about k-nearest neighbors or linear regression? You learned about these techniques in the
previous chapter, but do they apply to this kind of problem? (Hint: the answer is “No.”)
Linear regression may not be the most suitable model for filtering spam due to the nature of the
problem and the characteristics of the data. Here are several reasons why linear regression is not
commonly used for spam filtering:
1. Output Range:
Linear regression produces continuous, unbounded outputs, while spam filtering
requires a binary decision (spam or not spam) or a probability between 0 and 1.
2. Assumption of Linearity:
Linear regression assumes a linear relationship between the features and the target,
which rarely holds for high-dimensional text data.
3. Violation of Assumptions:
Linear regression assumes that errors are normally distributed and have constant
variance. In spam filtering, the distribution of features and the presence of outliers
might violate these assumptions.
4. Sensitivity to Outliers:
Because linear regression minimizes squared error, a few extreme or mislabeled
emails can disproportionately distort the fitted model.
While K-Nearest Neighbors (KNN) is a versatile algorithm, it may not be the most suitable
choice for filtering spam in some cases. Here are some reasons why KNN might not be the
optimal algorithm for spam filtering:
1. High Dimensionality:
In spam filtering, the feature space (the set of characteristics used to represent
messages) can be high-dimensional, especially when considering the frequency of
words or patterns. KNN can suffer from the "curse of dimensionality," where the
distance between data points becomes less meaningful in high-dimensional
spaces.
2. Computational Complexity:
KNN requires computing distances between the test instance and all training
instances. As the dataset grows, this computation becomes increasingly expensive
and may not be efficient for large-scale spam filtering applications.
3. Imbalanced Data:
Spam filtering datasets are often imbalanced, with a small proportion of messages
being spam. KNN can be sensitive to imbalanced datasets, and the majority class
can dominate the predictions.
4. Sensitivity to Noise and Outliers:
KNN can be sensitive to noise and outliers in the data. In spam filtering, where
the presence of outliers or noisy data points is possible, this sensitivity might lead
to suboptimal performance.
5. Handling of Categorical Features:
KNN is primarily designed for numerical data and might not handle categorical
features well without appropriate encoding. In spam filtering, textual data often
involves categorical features (e.g., words) that require careful preprocessing.
6. Storage Requirements:
KNN requires storing the entire training dataset for predictions, making it
memory-intensive, especially for large datasets.
Naive Bayes and why it works for Filtering Spam
Bayes Classification
The algorithm is based on Bayes' theorem, which calculates the probability of a hypothesis H
given the evidence E:

P(H | E) = P(E | H) * P(H) / P(E)
Naive Bayes assumes that features are conditionally independent given the class label, which
simplifies the calculation of the likelihood.
Worked example: classify a fruit X described by three features (it is Yellow, Sweet, and Long),
given training counts for three classes: Mango, Banana, and Others.
Mango:
P(X | Mango) = P(Yellow | Mango) * P(Sweet | Mango) * P(Long | Mango)    (equation 1)
No mango in the training data is long, so P(Long | Mango) = 0, and therefore
P(X | Mango) = 0
Banana:
Every banana in the training data is yellow, so P(Yellow | Banana) = 1; multiplying the three
per-feature likelihoods gives P(X | Banana) = 0.65.
Others:
Computing the same product for the remaining fruits gives P(X | Others) = 0.07742.
So finally, from P(X | Mango) = 0, P(X | Banana) = 0.65, and P(X | Others) = 0.07742, the
classifier assigns X to the class with the highest likelihood: Banana.
1. Probabilistic Framework:
Naive Bayes outputs the probability that a message is spam, which makes it easy to
rank messages or tune the classification threshold.
2. Independence Assumption:
The algorithm assumes that features (words or attributes) are conditionally
independent given the class label (spam or non-spam). Although this assumption
is "naive" and may not hold in reality, it simplifies the modeling process and often
works surprisingly well for text classification tasks.
3. Suitability for Text Classification:
Naive Bayes is particularly well-suited for text classification tasks, where the
input data can be represented as a bag of words. In the context of spam filtering,
emails can be treated as a collection of words, and the algorithm can effectively
model the likelihood of certain words occurring in spam or non-spam emails.
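As a rough illustration of these points, here is a toy, from-scratch version of the computation. The word counts are invented for illustration, and Laplace (add-one) smoothing is included to avoid the zero-probability problem seen in the mango example.

# Toy Naive Bayes spam scorer built from scratch; counts are illustrative.
from collections import Counter

spam_docs = [["free", "prize", "click"], ["free", "offer", "click"]]
ham_docs = [["meeting", "agenda", "monday"], ["report", "review", "meeting"]]

def train(docs):
    counts = Counter(w for doc in docs for w in doc)
    return counts, sum(counts.values())

spam_counts, spam_total = train(spam_docs)
ham_counts, ham_total = train(ham_docs)
vocab = set(spam_counts) | set(ham_counts)

def likelihood(word, counts, total):
    # P(word | class) with Laplace smoothing, so an unseen word does not
    # zero out the whole product (the P(Long | Mango) = 0 problem above).
    return (counts[word] + 1) / (total + len(vocab))

def score(words, counts, total, prior):
    # Naive independence assumption: multiply the per-word likelihoods.
    p = prior
    for w in words:
        p *= likelihood(w, counts, total)
    return p

email = ["free", "meeting", "click"]
p_spam = score(email, spam_counts, spam_total, prior=0.5)
p_ham = score(email, ham_counts, ham_total, prior=0.5)
print("spam" if p_spam > p_ham else "ham", p_spam, p_ham)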
Data Wrangling: APIs and Other Tools for Scraping the Web
Data wrangling is the act of converting and mapping raw data into another format suitable for
another purpose.
As a data scientist, you’re not always just handed some data and asked to go figure something
out based on it. Often, you have to actually figure out how to go get some data you need to ask a
question, solve a problem, do some research, etc. One way you can do this is with an API. For
the sake of this discussion, an API (application programming interface) is something websites
provide to developers so they can download data from the website easily and in a standard format.
If the website you want data from doesn't provide an API, you may have to scrape its pages
directly. In this case you might want to use something like the Firebug extension for Firefox. You can
“inspect the element” on any web page, and Firebug allows you to grab the field inside the
HTML. In fact, it gives you access to the full HTML document so you can interact and edit.
After locating the stuff you want inside the HTML, you can use curl, wget, grep, awk, perl, etc.,
to write a shell script to grab what you want, especially for a one-off grab. If you want to be
more systematic, you can also do this using Python or R.
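For the systematic route, a minimal Python sketch using the widely used requests and BeautifulSoup libraries might look like this. The URL and the CSS selector are hypothetical placeholders; you would replace them with whatever you find by inspecting the real page's HTML, as described above.

# A small scraping sketch; URL and selector are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"   # placeholder URL
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
# Grab the text of every element matching the selector you found by
# inspecting the page's HTML.
for heading in soup.select("h2.article-title"):   # hypothetical selector
    print(heading.get_text(strip=True))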
Mechanize: super cool as well, but it doesn't parse JavaScript.
Kaggle is a company whose tagline is, "We're making data science a sport." Kaggle forms
relationships with companies and with data scientists, hosting competitions in which a typical
workflow looks like this:
1. Understand the Problem:
Identify the type of problem (classification, regression, etc.) and the evaluation
metric.
2. Exploratory Data Analysis:
Explore the data to understand its structure, distributions, and anomalies.
3. Data Preprocessing:
Handle missing values, encode categorical variables, and clean the data.
4. Feature Engineering:
Use domain knowledge to engineer features that make sense for the problem.
5. Model Selection:
Choose a set of candidate models based on the problem type (e.g., Random
Forest, Gradient Boosting, Neural Networks).
6. Model Training:
Train the candidate models, typically using cross-validation to estimate performance.
7. Ensemble Methods:
Combine the predictions of several models to improve the final score (see the
sketch after this list).
8. Community:
Learn from other participants, share insights, and incorporate best practices.
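As a sketch of step 7 (ensembling), assuming scikit-learn and a synthetic placeholder dataset:

# Soft-voting ensemble over three different model families.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average predicted probabilities across the models
)
print(cross_val_score(ensemble, X, y, cv=5).mean())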
Feature Generation (brainstorming, role of domain expertise, and place for imagination)
Feature Generation involves creating new features (variables) from the existing data to improve
the model's performance.
Brainstorming:
Divergent Thinking: Encourage open-minded, creative thinking to generate a variety of
potential features. This can involve team discussions, brainstorming sessions, or individual
idea generation.
Feature Ideas: Think about different aspects of the data that might be useful for predicting the
target variable. Consider transformations, combinations, or aggregations of existing features.
Feature Engineering: Imagine how the data could be represented differently to capture
patterns that might not be obvious initially. This might include encoding cyclical patterns
(like time of day) or representing geographic data in a way that reflects real-world
relationships.
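As one concrete example of the cyclical-encoding idea just mentioned, here is a short Python sketch; the hour column is an invented placeholder.

# Encode a cyclical pattern (hour of day) so that 23:00 and 00:00 end up
# close together in feature space.
import numpy as np
import pandas as pd

df = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})

# Map the hour onto a circle: two new features replace the raw hour.
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
print(df)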
Filter methods
Filter methods evaluate the intrinsic characteristics of features, often without involving a
specific machine learning model. These methods are generally faster and less computationally
intensive compared to wrappers. Common filter methods include:
Variance Threshold:
Idea: Remove features with low variance, assuming that features with little
variation are less informative.
Implementation: Set a threshold for variance and discard features with variance
below that threshold.
Correlation:
Idea: Evaluate the correlation between each feature and the target variable or
between features themselves, keeping features strongly related to the target and
dropping redundant, highly inter-correlated ones (see the sketch after this list).
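A minimal sketch of both filter methods in Python, assuming scikit-learn and pandas; the toy data frame and threshold values are illustrative.

# Variance-threshold and correlation filters on a toy data frame.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "f1": [1, 1, 1, 1, 1],        # zero variance: should be dropped
    "f2": [1, 2, 3, 4, 5],
    "f3": [2, 4, 6, 8, 10],       # perfectly correlated with f2
    "target": [0, 0, 1, 1, 1],
})

# Variance threshold: discard near-constant features.
selector = VarianceThreshold(threshold=0.1)
selector.fit(df.drop(columns="target"))
kept = df.drop(columns="target").columns[selector.get_support()]
print("kept after variance filter:", list(kept))

# Correlation: rank features by correlation with the target.
print(df.corr()["target"].drop("target").abs().sort_values(ascending=False))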
Wrapper methods
Wrapper methods use a specific machine learning model to evaluate the performance of different
subsets of features.
Forward Selection: Start with an empty set of features and iteratively add the most important
feature at each step.
1. Train the model with each individual feature and select the one that provides the
best performance.
2. Add, one at a time, the remaining feature that most improves performance in
combination with the features already selected.
3. Repeat steps 1 and 2 until a predefined stopping criterion (e.g., a specific number
of features) is met.
Cons: computationally expensive for large feature sets, and a feature added early is
never reconsidered, even if it later becomes redundant.
Backward Elimination: Start with all features and iteratively remove the least important feature
at each step.
1. Train the model using all features.
2. Remove the least important feature (e.g., based on feature importance scores).
3. Repeat until a predefined stopping criterion is met.
Cons: expensive when there are many features, since the first models must be trained
on the full feature set.
Combination of Both: Combine forward selection and backward elimination to find an optimal
subset of features.
1. Start with an empty set of features.
2. Perform forward selection until adding new features does not improve
performance.
3. Perform backward elimination until removing features does not degrade
performance.
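Here is a compact sketch of these wrapper strategies using scikit-learn's SequentialFeatureSelector; the breast-cancer dataset is just a convenient stand-in, and switching direction to "backward" gives backward elimination.

# Greedy forward selection wrapped around a logistic regression model.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Greedily add features until 5 are chosen, scoring each subset by CV.
sfs = SequentialFeatureSelector(
    model, n_features_to_select=5, direction="forward", cv=3)
sfs.fit(X, y)
print("selected feature indices:", sfs.get_support(indices=True))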
Decision Trees
A decision tree is a type of supervised learning algorithm that is commonly used in machine
learning to model and predict outcomes based on input data. It is a tree-like structure where each
internal node tests an attribute, each branch corresponds to an attribute value, and each leaf node
represents the final decision or prediction. The decision tree algorithm falls under the category
of supervised learning and can be used to solve both regression and classification problems.
Leaf/Terminal Node: This node cannot be split further as it is the final node.
Note:
For an n-class data set, entropy ranges from 0 to log2(n), where n is the number of classes.
Entropy is computed as

Entropy(S) = - sum over classes i of ( pi * log2(pi) )

where
S = the subset of training examples being evaluated
pi = the proportion of examples in S belonging to class i
Example: consider a decision node that splits on the condition Weight < 80 kg, applied to the
following data:

Height   Weight   Class
170      75       Female
170      90       Male
155      50       Female
175      78       Female

Consider these rows as the subset (S) reaching a node of the decision tree:
Male = 1, Female = 3, Total = 4
Entropy(S) = -(1/4) * log2(1/4) - (3/4) * log2(3/4) ≈ 0.811
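A small Python helper reproduces this calculation; the counts (1 male, 3 females) come from the subset S above.

# Entropy of a node from its per-class counts.
import math

def entropy(class_counts):
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

print(entropy([1, 3]))  # ~0.811 for the subset S above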
Random forest
Random forest (or random decision forests) is an ensemble method for classification and
regression. It works as follows:
Step 1: Select random samples (bootstrap samples) from the given training set.
Step 2: Construct a decision tree for each bootstrap sample.
Step 3: Aggregate the trees' outputs (bagging): each tree votes for a class, or the
predictions are averaged for regression.
Step 4: Finally, select the most voted prediction result as the final prediction result.
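The four steps map directly onto scikit-learn's RandomForestClassifier, which performs the bootstrap sampling, tree construction, and voting internally; the synthetic dataset here is a placeholder.

# Random forest sketch: steps 1-4 happen inside fit() and predict().
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators trees, each trained on a bootstrap sample (steps 1-2);
# predict() aggregates the per-tree votes (steps 3-4).
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))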