Natural Language Processing

Contents

Unigram, Bigram and Trigram
Model Accuracy
Confusion Matrix
AUC ROC Curve
Gini Coefficient
Lift Chart
Classification: Thresholding
Text classification
Data cleaning
   a. Data Quality
      1. Validity
      2. Accuracy
      3. Completeness
      4. Consistency
      5. Uniformity
   b. The workflow
      1. Inspection
      2. Cleaning
      3. Verifying
      4. Reporting

Unigram, Bigram and Trigram


 TF-IDF in NLP stands for Term Frequency – Inverse Document Frequency.
 It is a popular technique in Natural Language Processing, the field that deals with human languages.
 During any text processing, cleaning the text (pre-processing) is vital.
 Further, the cleaned data needs to be converted into a numerical format in which each word is represented by
a vector (word vectors). This is also known as word embedding.
 Term Frequency (TF) = (frequency of a term in the document) / (total number of terms in the document)
 Inverse Document Frequency (IDF) = log((total number of documents) / (number of documents containing term t)); a short computation sketch follows this list.
 Bigrams: a bigram is a sequence of 2 consecutive words in a sentence.
E.g. “The boy is playing football”. The bigrams here are:
o the boy
o boy is
o is playing
o playing football
 Trigrams: a trigram is a sequence of 3 consecutive words in a sentence. For the above example the trigrams are:
o the boy is
o boy is playing
o is playing football
 Of the above bigrams and trigrams, some are relevant, while others that do not contribute
value for further processing are discarded.
 Let us say we want to find out from a document the skills required to be a “Data Scientist”. If we
consider only unigrams, then single words cannot convey the details properly. If we have a phrase like
‘Machine learning developer’, then the term extracted should be ‘Machine learning’ or ‘Machine learning
developer’. The words ‘Machine’, ‘learning’ or ‘developer’ on their own will not give the expected result.
 Stop Words: Commonly used words (such as “the”, “a”, “an”, “in”) that a search engine has been
programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a
search query.
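
A minimal sketch of the ideas above: extracting bigrams/trigrams and computing TF-IDF by hand for a toy two-document corpus, using only the Python standard library. The sentences and helper names (ngrams, tf_idf) are illustrative assumptions, not part of the notes.

import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = [
    "the boy is playing football",
    "the girl is reading a book",
]
tokenized = [d.lower().split() for d in docs]

print(ngrams(tokenized[0], 2))   # bigrams: ('the', 'boy'), ('boy', 'is'), ...
print(ngrams(tokenized[0], 3))   # trigrams: ('the', 'boy', 'is'), ...

def tf_idf(term, doc_tokens, all_docs_tokens):
    # TF = (frequency of the term in the document) / (total terms in the document)
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # IDF = log(total documents / documents containing the term)
    n_containing = sum(1 for d in all_docs_tokens if term in d)
    idf = math.log(len(all_docs_tokens) / n_containing) if n_containing else 0.0
    return tf * idf

print(tf_idf("football", tokenized[0], tokenized))   # non-zero: appears in one document only
print(tf_idf("the", tokenized[0], tokenized))        # 0.0: appears in every document

In practice, scikit-learn's TfidfVectorizer handles tokenization, n-gram extraction, and a smoothed IDF variant in one step.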

Model Accuracy
There are many ways to determine the accuracy of your model. Some of these include:

1. Divide your dataset into a training set and a test set. Build the model on the training set and then use the test set as
a holdout sample to evaluate the trained model. Compare the predicted values with the actual values by calculating the
error using a measure such as the "Mean Absolute Percent Error" (MAPE). If your MAPE is less than 10%, you have a
reasonable/good model (a minimal sketch of this workflow appears after this list).

2. Another approach is to compute a "Confusion Matrix" (misclassification matrix) to determine the false positive rate,
the false negative rate, the overall accuracy of the model, the sensitivity, the specificity, etc. These measures will help
you to determine whether to accept the model or not. Taking into account the cost of the errors is a very important part
of your decision whether to accept or reject the model.

3. Computing the Receiver Operating Characteristic (ROC) curve, the Lift Chart, the Gains Chart, or the Area Under the
Curve (AUC) are other ways that help you to determine whether you should accept or reject your model.
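
A minimal sketch of point 1 above, assuming a synthetic regression dataset and scikit-learn; the model choice and the 10% cutoff are illustrative only.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(0, 0.5, size=200)    # synthetic target (illustrative)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)       # build on the training set
y_pred = model.predict(X_test)                         # score the holdout sample

# MAPE = mean(|actual - predicted| / |actual|) * 100
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
print(f"MAPE: {mape:.2f}%  ->  {'acceptable' if mape < 10 else 'needs work'}")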

Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier")
on a set of test data for which the true values are known.

An example confusion matrix for a binary classifier (n = 165):

                  Predicted: NO    Predicted: YES
  Actual: NO        TN = 50          FP = 10
  Actual: YES       FN = 5           TP = 100

What can we learn from this matrix?

 There are two possible predicted classes: "yes" and "no". If we were predicting the presence of a disease, for
example, "yes" would mean they have the disease, and "no" would mean they don't have the disease.
 The classifier made a total of 165 predictions (e.g., 165 patients were tested for the presence of that
disease).
 Out of those 165 cases, the classifier predicted "yes" 110 times and "no" 55 times.
 In reality, 105 patients in the sample have the disease, and 60 patients do not.

Let's now define the most basic terms, which are whole numbers (not rates):

 True positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the
disease.
 True negatives (TN): We predicted no, and they don't have the disease.
 False positives (FP): We predicted yes, but they don't actually have the disease. (Also known as a "Type I
error.")
 False negatives (FN): We predicted no, but they actually do have the disease. (Also known as a "Type II
error.")

This is a list of rates that are often computed from a confusion matrix for a binary classifier:

 Accuracy: Overall, how often is the classifier correct?
o (TP+TN)/total = (100+50)/165 = 0.91
 Misclassification Rate: Overall, how often is it wrong?
o (FP+FN)/total = (10+5)/165 = 0.09, equivalent to 1 minus Accuracy, also known as "Error Rate"
 True Positive Rate: When it's actually yes, how often does it predict yes?
o TP/actual yes = 100/105 = 0.95, also known as "Sensitivity" or "Recall"
 False Positive Rate: When it's actually no, how often does it predict yes?
o FP/actual no = 10/60 = 0.17
 True Negative Rate: When it's actually no, how often does it predict no?
o TN/actual no = 50/60 = 0.83, equivalent to 1 minus False Positive Rate, also known as "Specificity"
 Precision: When it predicts yes, how often is it correct?
o TP/predicted yes = 100/110 = 0.91
 Prevalence: How often does the yes condition actually occur in our sample?
o actual yes/total = 105/165 = 0.64
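
A minimal sketch, in plain Python, computing the rates above directly from the four counts in the example (TP = 100, TN = 50, FP = 10, FN = 5):

TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN                     # 165 predictions

accuracy          = (TP + TN) / total         # 0.91
misclassification = (FP + FN) / total         # 0.09 = 1 - accuracy
tpr_recall        = TP / (TP + FN)            # 0.95 (sensitivity)
fpr               = FP / (FP + TN)            # 0.17
tnr_specificity   = TN / (TN + FP)            # 0.83 = 1 - FPR
precision         = TP / (TP + FP)            # 0.91
prevalence        = (TP + FN) / total         # 0.64

print(f"accuracy={accuracy:.2f}  recall={tpr_recall:.2f}  "
      f"specificity={tnr_specificity:.2f}  precision={precision:.2f}")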

A couple other terms are also worth mentioning:

 Null Error Rate: This is how often you would be wrong if you always predicted the majority class. (In our
example, the null error rate would be 60/165=0.36 because if you always predicted yes, you would only be
wrong for the 60 "no" cases.) This can be a useful baseline metric to compare your classifier against.
However, the best classifier for a particular application will sometimes have a higher error rate than the null
error rate, as demonstrated by the Accuracy Paradox.
 Cohen's Kappa: This is essentially a measure of how well the classifier performed as compared to how well it
would have performed simply by chance. In other words, a model will have a high Kappa score if there is a
big difference between the accuracy and the null error rate.
 F Score: This is the weighted harmonic mean of the true positive rate (recall) and precision; the common F1 score weights them equally.
 ROC Curve: This is a commonly used graph that summarizes the performance of a classifier over all possible
thresholds. It is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as
you vary the threshold for assigning observations to a given class.

AUC ROC Curve:


 In Machine Learning, performance measurement is an essential task. So, when it comes to a classification
problem, we can count on the AUC - ROC curve.
 When we need to check or visualize the performance of a classification problem (including the multi-class
case), we use the AUC (Area Under the Curve) ROC (Receiver Operating Characteristics) curve.
 It is one of the most important evaluation metrics for checking any classification model’s performance.
 It is also written as AUROC (Area Under the Receiver Operating Characteristics).
What is AUC - ROC Curve?

 The AUC - ROC curve is a performance measurement for classification problems at various threshold settings.
 ROC is a probability curve.
 AUC represents the degree or measure of separability. It tells how well the model is capable of distinguishing
between classes.
 The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the
better the model is at distinguishing between patients with the disease and patients without the disease.
 The ROC curve is plotted with TPR against FPR, where TPR is on the y-axis and FPR is on the x-axis.

 Defining the terms used in the AUC and ROC curve:

 TPR (True Positive Rate) / Recall / Sensitivity = TP / (TP + FN)

 Specificity = TN / (TN + FP)

 FPR = 1 - Specificity = FP / (FP + TN)

 An excellent model has an AUC near 1, which means it has a good measure of separability. A poor model
has an AUC near 0, which means it has the worst measure of separability; in fact, it is reciprocating the
result, predicting 0s as 1s and 1s as 0s. When AUC is 0.5, the model has no class separation capacity
whatsoever.
 Let’s interpret the above statements.

As we know, ROC is a curve of probability. So, let’s plot the distributions of those probabilities. Note: the red
distribution curve is the positive class (patients with the disease) and the green distribution curve is the negative class
(patients without the disease).
 The ideal situation: when the two curves don’t overlap at all, the model has an ideal measure of
separability. It is perfectly able to distinguish between the positive class and the negative class.

 When the two distributions overlap, we introduce type 1 and type 2 errors. Depending upon the threshold, we
can minimize or maximize them. When AUC is 0.7, it means there is a 70% chance that the model will be able to
distinguish between the positive class and the negative class.

 The worst situation: when AUC is approximately 0.5, the model has no discrimination capacity to
distinguish between the positive class and the negative class.
 When AUC is approximately 0, the model is actually reciprocating the classes: it is predicting the
negative class as the positive class and vice versa.

Relation between Sensitivity, Specificity, FPR and Threshold.

 Sensitivity and Specificity are inversely proportional to each other. So, when we increase Sensitivity,
Specificity decreases, and vice versa.

Sensitivity ⬆️, Specificity ⬇️ and Sensitivity ⬇️, Specificity ⬆️

 When we decrease the threshold, we get more positive predictions, which increases the sensitivity and
decreases the specificity.
 Similarly, when we increase the threshold, we get more negative predictions, so we get higher specificity and
lower sensitivity.
 As we know, FPR is 1 - specificity. So, when we increase TPR, FPR also increases, and vice versa.

TPR ⬆️, FPR ⬆️ and TPR ⬇️, FPR ⬇️
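
A minimal sketch of computing and plotting an ROC curve and its AUC with scikit-learn; the synthetic dataset and the logistic-regression model are illustrative assumptions.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]           # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)   # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_test, scores)

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], "--", label="no-skill (AUC = 0.5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.legend()
plt.show()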


Gini Coefficient
The Gini coefficient is a metric that indicates the model’s discriminatory power, namely, the effectiveness of the
model in differentiating between “bad” borrowers, who will default in the future, and “good” borrowers, who won’t
default in the future. This metric is often used to compare the quality of different models and evaluate their
prediction power.
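
In credit scoring, the Gini coefficient is commonly derived from the ROC AUC as Gini = 2 * AUC - 1, so a random model (AUC 0.5) has Gini 0 and a perfect one has Gini 1. A minimal sketch, assuming the y_test and scores arrays from the ROC sketch above:

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_test, scores)   # continuing from the ROC sketch above
gini = 2 * auc - 1                    # common credit-risk convention
print(f"AUC = {auc:.3f}  ->  Gini = {gini:.3f}")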

Lift Chart
 The lift chart, or specifically the cumulative lift chart, shows how many more buyers the company is likely to
reach by using the model than by targeting customers at random (see the sketch after this list).
 Each model has its own lift chart.
 A higher lift indicates a better model. The baseline value of lift, corresponding to random targeting, is 1.0.
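
A minimal sketch of a cumulative lift calculation with pandas, again assuming the y_test and scores arrays from the ROC sketch above and treating the positive class as "buyers"; the decile split is the usual convention.

import numpy as np
import pandas as pd

# Rank the test cases by model score, split into deciles, and compare the
# cumulative response rate in the top deciles with the overall response rate.
df = pd.DataFrame({"actual": y_test, "score": scores}).sort_values("score", ascending=False)
n_per_decile = len(df) // 10
df["decile"] = np.arange(len(df)) // n_per_decile          # 0 = top-scored 10%

cum_positives = df.groupby("decile")["actual"].sum().cumsum()
cum_count = df.groupby("decile").size().cumsum()
lift = (cum_positives / cum_count) / df["actual"].mean()   # 1.0 means "no better than random"
print(lift.round(2))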

Classification: Thresholding
 Logistic regression returns a probability. You can use the returned probability "as is" (for example, the
probability that the user will click on this ad is 0.00023) or convert the returned probability to a binary value
(for example, this email is spam).
 A logistic regression model that returns 0.9995 for a particular email message is predicting that it is very
likely to be spam. Conversely, another email message with a prediction score of 0.0003 on that same logistic
regression model is very likely not spam. However, what about an email message with a prediction score of
0.6?
 In order to map a logistic regression value to a binary category, you must define a classification threshold
(also called the decision threshold). A value above that threshold indicates "spam"; a value below indicates
"not spam." It is tempting to assume that the classification threshold should always be 0.5, but thresholds
are problem-dependent, and are therefore values that you must tune.
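
A minimal sketch of applying a classification threshold to predicted probabilities; the probabilities and the two candidate thresholds are illustrative assumptions.

import numpy as np

spam_prob = np.array([0.9995, 0.0003, 0.6, 0.45])   # model outputs for four emails (illustrative)

for threshold in (0.5, 0.8):
    labels = np.where(spam_prob >= threshold, "spam", "not spam").tolist()
    print(f"threshold={threshold}: {labels}")
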
Text classification
 Text classification, also known as text tagging or text categorization, is the process of categorizing text into
organized groups. By using Natural Language Processing (NLP), text classifiers can automatically analyse text
and then assign a set of pre-defined tags or categories based on its content.
 Unstructured text is everywhere, such as emails, chat conversations, websites, and social media, but it’s hard
to extract value from this data unless it’s organized in a certain way.
 Document/Text classification is one of the important and typical tasks in supervised machine learning (ML).
o Assigning categories to documents (web page, library book, media articles, gallery etc.)
o Many applications like spam filtering, email routing, sentiment analysis etc.
 Some of the most common examples and use cases for automatic text classification include the following:
o Sentiment Analysis: the process of understanding if a given text is talking positively or negatively
about a given subject (e.g. for brand monitoring purposes).
o Topic Detection: the task of identifying the theme or topic of a piece of text (e.g. know if a product
review is about Ease of Use, Customer Support, or Pricing when analysing customer feedback).
o Language Detection: the procedure of detecting the language of a given text (e.g. know if an
incoming support ticket is written in English or Spanish for automatically routing tickets to the
appropriate team).
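
A minimal sketch of a supervised text classifier for the spam-filtering use case above, using TF-IDF features and a linear model from scikit-learn; the tiny hand-written training set is an illustrative assumption.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now", "limited offer click here",
    "meeting moved to 3pm", "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# TF-IDF on unigrams + bigrams, followed by a logistic-regression classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["free prize offer", "see you at the meeting"]))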

Data cleaning
a. Data Quality
Properties of quality data:

1. Validity
The degree to which the data conform to defined business rules or constraints.

 Data-Type Constraints: values in a particular column must be of a particular datatype, e.g., boolean,
numeric, date, etc.
 Range Constraints: typically, numbers or dates should fall within a certain range.
 Mandatory Constraints: certain columns cannot be empty.
 Unique Constraints: a field, or a combination of fields, must be unique across a dataset.
 Set-Membership constraints: values of a column come from a set of discrete values, e.g. enum values. For
example, a person’s gender may be male or female.
 Foreign-key constraints: as in relational databases, a foreign key column can’t have a value that does not
exist in the referenced primary key.
 Regular expression patterns: text fields that have to be in a certain pattern. For example, phone numbers
may be required to have the pattern (999) 999–9999.
 Cross-field validation: certain conditions that span across multiple fields must hold. For example, a patient’s
date of discharge from the hospital cannot be earlier than the date of admission.
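
A minimal sketch of checking a few of these constraint types with pandas; the toy patients table, column names, and rules are illustrative assumptions.

import pandas as pd

patients = pd.DataFrame({
    "patient_id": [1, 2, 2, 4],
    "age": [34, -5, 61, 47],                                   # range constraint: 0-120
    "phone": ["(555) 123-4567", "5551234567", None, "(555) 987-6543"],
    "admitted": pd.to_datetime(["2024-01-03", "2024-01-10", "2024-01-11", "2024-01-20"]),
    "discharged": pd.to_datetime(["2024-01-05", "2024-01-08", "2024-01-15", "2024-01-25"]),
})

# Missing phones are handled by the mandatory check, so treat them as pattern-OK here.
phone_ok = patients["phone"].str.match(r"^\(\d{3}\) \d{3}-\d{4}$").fillna(True).astype(bool)

violations = {
    "unique_id": patients["patient_id"].duplicated().sum(),          # unique constraint
    "age_range": (~patients["age"].between(0, 120)).sum(),           # range constraint
    "mandatory_phone": patients["phone"].isna().sum(),               # mandatory constraint
    "phone_pattern": (~phone_ok).sum(),                              # regular-expression pattern
    "cross_field": (patients["discharged"] < patients["admitted"]).sum(),  # cross-field validation
}
print(violations)   # count of rows violating each rule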

2. Accuracy
 The degree to which the data is close to the true values.
 While defining all possible valid values allows invalid values to be easily spotted, it does not mean that they
are accurate.
 A valid street address might not actually exist. A person’s eye colour, say blue, might be a valid value, but not
true (it doesn’t represent reality).
 Another thing to note is the difference between accuracy and precision. Saying that you live on the Earth is
actually true, but not precise. Where on the Earth? Saying that you live at a particular street address is more
precise.

3. Completeness
 The degree to which all required data is known.
 Missing data is going to happen for various reasons. One can mitigate this problem by questioning the
original source if possible, say re-interviewing the subject.
 Chances are, the subject is either going to give a different answer or will be hard to reach again.

4. Consistency
 The degree to which the data is consistent, within the same data set or across multiple data sets.
 Inconsistency occurs when two values in the data set contradict each other.

5. Uniformity
 The degree to which the data is specified using the same unit of measure.
 The weight may be recorded either in pounds or kilos. The date might follow the USA format or European
format. The currency is sometimes in USD and sometimes in YEN. So, data must be converted to a single
measure unit.

b. The workflow
 The workflow is a sequence of four steps aiming at producing high-quality data and taking into account all
the criteria we’ve talked about.
o Inspection: Detect unexpected, incorrect, and inconsistent data.
o Cleaning: Fix or remove the anomalies discovered.
o Verifying: After cleaning, the results are inspected to verify correctness.
o Reporting: A report about the changes made and the quality of the currently stored data is
recorded.
 What you see as a sequential process is, in fact, an iterative, endless process. One can go from verifying to
inspection when new flaws are detected.

1. Inspection
Inspecting the data is time-consuming and requires using many methods for exploring the underlying data for error
detection:

Data profiling

 Summary statistics about the data, called a data profile, are really helpful for giving a general idea about the
quality of the data.
o For example, check whether a particular column conforms to particular standards or patterns. Is the
data column recorded as a string or a number?
o How many values are missing? How many unique values are in a column, and what is their distribution? Is this
data set linked to, or does it have a relationship with, another?
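
A minimal sketch of quick data profiling with pandas; the toy DataFrame stands in for whatever dataset is being inspected.

import pandas as pd

# Toy stand-in for the dataset being profiled (illustrative values).
df = pd.DataFrame({
    "country": ["DE", "FR", "FR", None, "JP"],
    "income": [42000, 38000, 39500, 41000, 1_000_000],
    "joined": ["2021-05-01", "2021-06-11", "not available", "2022-01-09", "2022-03-14"],
})

print(df.dtypes)                     # is each column stored as a string or a number?
print(df.isna().sum())               # how many values are missing per column?
print(df.nunique())                  # how many unique values per column?
print(df.describe(include="all"))    # ranges, means, top categories
print(df["country"].value_counts())  # distribution of one (hypothetical) categorical column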

Visualizations

 By analysing and visualizing the data using statistical methods such as mean, standard deviation, range, or
quantiles, one can find values that are unexpected and thus erroneous.
o For example, by visualizing the average income across countries, one might see that there are some
outliers: some countries have people who earn much more than anyone else.
Those outliers are worth investigating and are not necessarily incorrect data.

Software packages

 Several software packages or libraries available for your language will let you specify constraints and check
the data for violations of these constraints.
 Moreover, they can not only generate a report of which rules were violated and how many times but also
create a graph of which columns are associated with which rules.
o The age, for example, can’t be negative, and neither can the height. Other rules may involve multiple columns
in the same row, or across datasets.

2. Cleaning
Incorrect data is either removed, corrected, or imputed.

Irrelevant data
 Irrelevant data are those that are not actually needed, and don’t fit under the context of the problem we’re
trying to solve.
o For example, if we were analysing data about the general health of the population, the phone
number wouldn’t be necessary — column-wise.
 You may drop a piece of data only if you are sure it is unimportant. Otherwise, explore the correlation
matrix between feature variables.
 Even if you notice no correlation, you should ask someone who is a domain expert. You never
know: a feature that seems irrelevant could be very relevant from a domain perspective, such as a clinical
perspective.

Duplicates

 Duplicates are data points that are repeated in your dataset.

 It often happens when, for example:
o Data are combined from different sources.
o The user hits the submit button twice, thinking the form wasn’t actually submitted.
o A request to an online booking was submitted twice, correcting wrong information entered
accidentally the first time.
 A common symptom is two users having the same identity number, or the same article being scraped
twice.
 Therefore, duplicates should simply be removed.
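
A minimal sketch of removing duplicates with pandas; the toy table and the choice to keep the first occurrence are illustrative assumptions.

import pandas as pd

users = pd.DataFrame({
    "identity_number": [101, 102, 102, 103],   # 102 appears twice: the symptom above
    "name": ["Ana", "Ben", "Ben", "Cleo"],
})

deduped = users.drop_duplicates(subset="identity_number", keep="first")
print(deduped)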

Type conversion

 Make sure numbers are stored as numerical data types. A date should be stored as a date object, or a Unix
timestamp (number of seconds), and so on.
 Categorical values can be converted into and from numbers if needed.
 A word of caution: values that can’t be converted to the specified type should be converted to an NA
value (or similar), with a warning being displayed. This indicates the value is incorrect and must be fixed.
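
A minimal sketch of type conversion with pandas, where values that can't be converted become NA/NaT so they can be flagged and fixed; the toy values are illustrative.

import pandas as pd

raw = pd.DataFrame({
    "age": ["34", "61", "unknown"],
    "signup": ["2024-01-03", "2024-02-03", "n/a"],
})

raw["age"] = pd.to_numeric(raw["age"], errors="coerce")         # "unknown" -> NaN
raw["signup"] = pd.to_datetime(raw["signup"], errors="coerce")  # "n/a" -> NaT

print(raw.dtypes)
print(raw[raw.isna().any(axis=1)])   # rows with values that need fixing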

Syntax errors

 Remove white spaces: Extra white spaces at the beginning or the end of a string should be removed.
 Pad strings: Strings can be padded with spaces or other characters to a certain width. For example, some
numerical codes are often represented with prepending zeros to ensure they always have the same number
of digits.
 Fix typos: Strings can be entered in many different ways and, no wonder, can contain mistakes.
o Gender
 m
 Male
 fem.
 FemalE
 Femle
o This categorical variable is considered to have 5 different classes, and not the 2 expected (male and
female), since each value is different.
o A bar plot is useful to visualize all the unique values. One can notice some values are different but do
mean the same thing i.e. “information_technology” and “IT”. Or, perhaps, the difference is just in
the capitalization i.e. “other” and “Other”.
o Therefore, our duty is to recognize from the above data whether each value is male or female. How
can we do that?
 The first solution is to manually map each value to either “male” or “female”.
dataframe['gender'].map({'m': 'male', 'Male': 'male', 'fem.': 'female', 'FemalE': 'female', 'Femle': 'female'})
 The second solution is to use pattern matching. For example, we can look for the occurrence of
m or M at the beginning of the gender string and map the whole value to 'male'.
import re
re.sub(r"^m.*$", 'male', value, flags=re.IGNORECASE)  # value holds the raw gender string
 The third solution is to use fuzzy matching: an algorithm that identifies the distance
between the expected string(s) and each of the given values. Its basic implementation counts
how many edit operations are needed to turn one string into another (see the sketch below).
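
A minimal sketch of the fuzzy-matching idea using only the standard library's difflib; the 0.3 similarity cutoff is an illustrative choice, and dedicated edit-distance libraries are common in practice.

import difflib

expected = ["male", "female"]
messy = ["m", "Male", "fem.", "FemalE", "Femle"]

for value in messy:
    # Pick the expected label most similar to the (lower-cased) messy value.
    match = difflib.get_close_matches(value.lower(), expected, n=1, cutoff=0.3)
    print(f"{value!r:10} -> {match[0] if match else 'needs manual review'}")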

Standardize

 Our duty is to not only recognize the typos but also put each value in the same standardized format.
 For strings, make sure all values are either in lower or upper case.
 For numerical values, make sure all values have a certain measurement unit.

Scaling / Transformation

 Scaling means to transform your data so that it fits within a specific scale, such as 0–100 or 0–1.
 For example, exam scores of a student can be re-scaled to be percentages (0–100) instead of GPA (0–5).

Normalization

 While normalization also rescales the values (often into a range of 0–1), the intention here is to transform the data
so that it is approximately normally distributed. Why?
 In most cases, we normalize the data if we’re going to be using statistical methods that rely on normally
distributed data.
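
A minimal sketch contrasting min-max scaling (fit values into 0–1) with z-score standardization (mean 0, unit variance) using scikit-learn; the GPA-like toy values are illustrative.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

gpa = np.array([[1.8], [2.5], [3.2], [4.9]])         # scores on a 0-5 scale (illustrative)

print(MinMaxScaler().fit_transform(gpa).ravel())     # scaled into the 0-1 range
print(StandardScaler().fit_transform(gpa).ravel())   # z-scores: mean 0, std 1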

Missing values

 The fact that missing values are unavoidable leaves us with the question of what to do when we
encounter them.
 There are three, or perhaps more, ways to deal with them (a small sketch follows this list).
o Drop. If missing values in a column rarely happen and occur at random, then the easiest and most
straightforward solution is to drop the observations (rows) that have missing values.
o Impute. This means calculating the missing value based on other observations. There are quite a lot of
methods to do that:
 using statistical values like mean, median. However, none of these guarantees unbiased data,
especially if there are many missing values.
 Using a linear regression. Based on the existing data, one can calculate the best fit line between two
variables, say, house price vs. size m².
 Hot-deck: Copying values from other similar records. This is only useful if you have enough available
data. And, it can be applied to numerical and categorical data.
o Flag. Some argue that filling in the missing values leads to a loss of information, no matter what
imputation method we use. That’s because saying that the data is missing is informative in itself, and
the algorithm should know about it. Otherwise, we’re just reinforcing the pattern that already exists in the
other features.
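
A minimal sketch of the three strategies above with pandas: drop, impute with the median, and flag; the toy table is an illustrative assumption.

import numpy as np
import pandas as pd

df = pd.DataFrame({"size_m2": [54, 70, np.nan, 120], "price": [200, 260, 240, np.nan]})

dropped = df.dropna()                                   # 1) drop rows with missing values

imputed = df.fillna(df.median(numeric_only=True))       # 2) impute with each column's median

flagged = df.assign(price_missing=df["price"].isna())   # 3) keep a "was missing" indicator
flagged["price"] = flagged["price"].fillna(df["price"].median())

print(dropped, imputed, flagged, sep="\n\n")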

Outliers

 They are values that are significantly different from all other observations.
 Outliers are innocent until proven guilty. With that being said, they should not be removed unless there is a
good reason for that.
o For example, one can notice some weird, suspicious values that are unlikely to happen, and so
decide to remove them. Though, they are worth investigating before being removed.
 It is also worth mentioning that some models, like linear regression, are very sensitive to outliers. In other
words, outliers might throw the model off from where most of the data lie.
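
A minimal sketch of flagging outliers with the common 1.5 × IQR rule (one convention among several); the toy income values are illustrative.

import pandas as pd

income = pd.Series([31, 35, 28, 40, 33, 36, 400])   # one suspiciously large value

q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]
print(outliers)   # flagged for investigation, not automatically removed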

In-record & cross-datasets errors

 These errors result from having two or more values in the same row or across datasets that contradict with
each other.
o For example, in a dataset about the cost of living in cities, the total column must be
equal to the sum of rent, transport, and food.
o Similarly, a child can’t be married. An employee’s salary can’t be less than the calculated taxes.

3. Verifying
 When done, one should verify correctness by re-inspecting the data and making sure the rules and constraints
do hold.
 For example, after filling in the missing data, the new values might still violate some of the rules and constraints.
 This might involve some manual correction if it is not possible otherwise.

4. Reporting
 Reporting how healthy the data is, is just as important as the cleaning itself.
 As mentioned before, software packages or libraries can generate reports of the changes made, which rules
were violated, and how many times.
 In addition to logging the violations, the causes of these errors should be considered.

LIME

XGBoost

https://towardsdatascience.com/how-to-determine-the-best-model-6b9c584d0db4
