
Review of N-grams Approach for Language Detection, Correction and Distance Calculation

16010420127 Aditya Phatak, 16010420075 Guneet Sura,
16010420092 Malay Thakkar, 16010420146 Tarash Budhrani

Abstract—This paper is a review comprising an analysis of four different applications of the N-gram distance and the N-gram model for detecting, correcting and finding relationships between words of different language models. One of the papers aims at finding similarities among 41 European languages using the N-gram distance with the Common N-Grams model; its main use case is authorship attribution, for which the proposed model works fairly well. The remaining papers shift their approach to propose new concepts based on the N-gram distance. These help us understand the real-world use cases of this concept and how it was further analyzed to compare the outputs of synthesized language-correction models. The extended applications not only gave us a new approach but also provided a gateway for new technologies to build their foundations and expand in the future. Since the plain N-gram distance does not provide complete transposition operations, one of the papers introduces the concept of E-N-DIST to provide a clearer search result. Another paper suggests a two-phase, index-less search procedure for application in misuse detection-based IDS that makes use of the q-gram distance, since the constrained edit distance is significantly more complex to compute. In the final paper, the study suggests a technique for speech recognition that uses a syllable lattice and n-gram array indices to identify words that are not commonly used; compared to other methods, this methodology has high accuracy and efficiency.

I. INTRODUCTION

The first research paper suggests an approach for calculating the distance between two languages using n-grams, which are contiguous sequences of n letters or words. The authors contend that conventional methods for comparing languages, such as lexical and phonetic distance, are not necessarily reliable or useful. As an alternative, they recommend the common n-grams approach, which counts the number of n-grams shared between two languages and uses this number as a gauge of how similar they are. The tests in this study demonstrate that, in terms of accuracy and computational economy, the common n-grams approach performs better than the more conventional metrics of language distance.

The second paper begins by noting the common use cases of each type of name matching in different fields. It addresses name errors that can occur when we spell words and how they can be fixed using different techniques of distance and similarity. Each word can be classified based on its distance and similarity to the erroneous word that was initially used. The distance between the two words can be used for either replacing, transposing or adding a letter for correction. The authors describe Edit Distance, Damerau-Levenshtein Distance (DLD), Modified DLD (MDLD) and finally N-Dist, the method used in the paper. The N-Dist method improves on all the other methods; that is, it provides an overall increase in accuracy.

The third paper uses the concept of an IDS that keeps searching for attack signatures, i.e. arrangements of information that can be used to identify an attacker's attempt to exploit a known operating system or application vulnerability. Executing such a test requires a distance measure to determine the difference between an attack signature and a traffic packet; this paper uses the q-gram distance for that purpose. The results are then arranged in increasing order of q-gram distance, and a finer inspection is done on the closest packets. The paper notes that efficient algorithms for exact string search were found several years ago, but their problem is that they are not fault tolerant, whereas matching based on the q-gram distance keeps working in the presence of errors in the strings. For a faster response to queries it uses indexing, but the drawback is that indexing is memory- and time-consuming.

The fourth paper observes that textual search engines can be used to find information on the web if the subject data has textual information, such as transcribed broadcasts and news. Spoken Term Detection (STD) methods use Large Vocabulary Continuous Speech Recognizer (LVCSR) transcripts to perform textual search, but Out-of-Vocabulary (OOV) terms are not recognized because the LVCSR does not have the given word as a possible output. For various languages, the Levenshtein distance between two syllables, the syllable as the basic unit of recognition, and elastic matching between two syllable sequences have been tried to reduce recognition errors for OOV terms and to relax grammatical constraints; all of these are better suited to specific languages based on their syllable properties. Phoneme-based n-grams have also been suggested, and Dynamic Time Warping (DTW) has been used, but it is more time-consuming than index-based processing. The paper uses an n-gram array with distance measures over a syllable recognition lattice to reduce recognition errors, along with a pruning method for the n-gram array indices based on the probability of the recognition results.

II. LITERATURE SURVEY

The first paper begins with the concept of Automatic Language Identification (LID), which refers to the process of identifying the language of a spoken utterance or text automatically. It is an active area of research. It involves the
process of applying a probability distribution over a sequence of tokens. Although the language can be identified, we also need to understand how different one language is from another, that is, the degree of difference between two or more languages. We can tell whether two languages are similar or different discretely, either yes or no, but the degree of difference gives us the relation between those languages. This paper focuses on extending the work done on language similarity analysis using the Common N-Grams (CNG) model by taking a different approach.

In the earlier versions of language similarity models, Multiple Discriminant Analysis was used to test the results, with character-level features such as the presence of diphthongs or phrases exclusive to a given language. The first use of unigram and bigram frequencies was implemented while exploring LID. It was an ensemble of seven classifiers using majority voting, constructed on the Kolmogorov-Smirnov and Yule's K tests. The accuracy was as high as 89 percent, but it only classified English and Spanish and did not give the degree of difference between the two. Later, a character-trigram Bayesian model was used for language identification. The best-known work by far on LID is by Cavnar et al., who used an out-of-order similarity metric on the n-gram ranks. Their model approached an accuracy of 99.8 percent. This implementation later served as a base for TextCat, which supports 69 languages.

Previous researchers in language analysis have established some basic rules for evaluating languages. In any large corpus of data, the frequency of a word is inversely proportional to its rank in the frequency table. Some researchers built 15 linguistic complex networks based on their corresponding syntactic treebanks. The results were striking: English came out as more flexible and powerful in expression, Spanish and French turned out to be constrained by rules, and Chinese showed that although it has fewer characters, they carry more meaning than those of any other language.

Asgari et al. build word co-occurrence networks for fifty languages. The edges between nodes are weighted using cosine similarity between word embeddings. In addition, they perform word alignment between two graphs, which means that the words from different language networks are aligned by semantic similarity.

Generating language distance based on a perplexity measure was computationally much less expensive than the previous approaches. Perplexity is an evaluation metric for language models used to measure the fitness of test data built with n-grams; in this case it was adapted to measure distance between languages.

The study in the second paper suggests an enhanced name matching technique based on N-gram distance. In several areas, including record linking, data cleaning, and information retrieval, name matching is a crucial task. An effective way to compare two strings based on their shared N-gram subsequences is to utilise the N-gram distance. Nonetheless, multiple studies have pointed out the shortcomings of the conventional N-gram distance algorithm and suggested various methods to enhance its functionality.

The classic N-gram distance algorithm's sensitivity to string length is one of its drawbacks. This problem has been addressed in a number of experiments by standardising the input strings. For instance, a string normalisation algorithm based on character frequency and positioning information was proposed by Wang et al. in 2007. The method gives more weight to less frequently occurring characters and to those in the middle of the string.

The classic N-gram distance algorithm's inability to handle synonyms and homonyms is another drawback. The use of outside resources such as dictionaries and thesauri has been suggested in several studies as a way to address this problem. For instance, Han et al. (2017) introduced a name matching technique that maps ambiguous terms to the appropriate representation using a predetermined synonym/homonym dictionary.

Another crucial difficulty with N-gram distance algorithms is weighting. Greater weights for the informative and uncommon N-gram subsequences have been suggested by a number of studies. For instance, Luan et al. (2017) developed a weighting scheme that gives larger weights to N-grams that appear at the start or end of the string and lower weights to those that appear in the middle.

In order to provide precise and reliable matching for many sorts of names and variations, the work by Al-Hagree proposes an enhanced N-gram distance method that includes normalisation, synonym/homonym management, and weighting components. The authors test their algorithm against numerous state-of-the-art techniques, including the Jaro-Winkler distance, the Levenshtein distance, and the Smith-Waterman algorithm, using two benchmark datasets of Arabic and English names. The findings demonstrate the proposed algorithm's superior performance on both datasets, where it achieves an F-measure of up to 0.98.

In order to increase the effectiveness of misuse detection using the q-gram distance measure, a novel strategy is put forward in the third paper. Intrusion detection systems (IDSs) and their difficulties are introduced at the beginning of the paper. IDSs have a high false-positive rate, which the authors note is one of their key problems. This problem can be mitigated by increasing the effectiveness of the detection process. The q-gram distance metric, a gauge of the similarity between two strings that takes into account the frequency of substrings of length q in each string, is then introduced.

The authors next go over their suggested strategy for boosting misuse detection's effectiveness when employing the q-gram distance metric. The method entails employing a q-gram index to pre-process the input data in order to efficiently locate substrings that match recognised attack patterns. The authors also suggest a variety of modifications to boost the strategy's effectiveness even further.

The proposed approach is assessed using a publicly available network traffic dataset. The outcomes demonstrate that the method maintains a low false-positive rate while achieving a high detection rate. Additionally, the authors compare their strategy to other state-of-the-art strategies and demonstrate that it beats them in terms of efficiency and detection rate.
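As a concrete illustration of the q-gram distance used in the third paper, the sketch below follows the classical profile-based formulation (the distance is the sum of absolute differences between the q-gram counts of the two strings). The function names and example strings are our own; the paper's exact variant may differ in details such as end-of-string padding.

```python
from collections import Counter

def qgrams(s: str, q: int) -> Counter:
    """Count every substring of length q in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_distance(a: str, b: str, q: int = 2) -> int:
    """Profile-based q-gram distance: the sum of absolute
    differences of q-gram counts between the two strings."""
    pa, pb = qgrams(a, q), qgrams(b, q)
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())

# Identical strings have distance 0; the distance grows with
# the number of mismatching q-grams.
print(qgram_distance("attack", "attack"))   # 0
print(qgram_distance("attack", "attract"))  # 5
```

In an IDS setting, traffic fragments would be ranked by this distance to the attack signature, and only the closest ones would be passed on to the more expensive second-phase inspection.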

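The CNG profile comparison that the first paper builds on can also be sketched briefly. The snippet below uses the usual CNG dissimilarity (squared relative difference of normalised character n-gram frequencies, summed over the union of two profiles); the profile length, toy texts, and function names are our own illustrative choices rather than the paper's exact experimental setup.

```python
from collections import Counter

def profile(text: str, n: int = 3, length: int = 100) -> dict:
    """Normalised frequencies of the `length` most common character n-grams."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common(length)}

def cng_distance(p1: dict, p2: dict) -> float:
    """CNG dissimilarity: squared relative difference of frequencies,
    summed over the union of the two profiles."""
    return sum(
        ((p1.get(g, 0.0) - p2.get(g, 0.0)) /
         ((p1.get(g, 0.0) + p2.get(g, 0.0)) / 2)) ** 2
        for g in set(p1) | set(p2)
    )

# Toy example: a query text is classified as the language whose
# profile is nearest (1-nearest-neighbour under CNG distance).
en = profile("the quick brown fox jumps over the lazy dog " * 3)
de = profile("der schnelle braune fuchs springt ueber den faulen hund " * 3)
query = profile("the dog jumps over the fox")
print("closer to:", "en" if cng_distance(query, en) < cng_distance(query, de) else "de")
```

With language profiles in place of single documents, the same pairwise distance yields the (asymmetric) language distance matrix discussed in Section III.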
The problem of out-of-vocabulary term detection in the fourth paper, which refers to the challenge of identifying terms that are not present in a language model or dictionary, is described in the paper's introduction. According to the authors, this issue is particularly severe in languages with intricate morphology, such as Japanese, which has several potential morpheme combinations.

The authors suggest a technique that uses a syllable lattice with n-gram array indices to solve this issue. The n-gram array index is an effective data structure for storing and retrieving n-gram frequencies, while the syllable lattice is a graph structure representing the possible syllable sequences in a language.

III. COMPARISON AND METHODOLOGY USED

The first paper compares two approaches: the Perplexity Method and the CNG method. CNG stands for Common N-Grams Language Distance Measure. It is a text classification algorithm which compares the frequencies of character n-grams, that is, the strings of characters of length n that are most common in the considered documents. At a higher level, it can be interpreted as a k-nearest-neighbour classifier with k = 1, where the CNG distance is used instead of the standard Euclidean distance. For each defined class and a new unlabeled document, the algorithm builds a class profile, which consists of the frequencies of the most common character n-grams of length n. The n-gram frequencies are normalized: n-gram counts are divided by the total number of n-grams. The authors applied this CNG algorithm to the training subsets, for which they prepared language profiles, using character n-grams in the range 3-7 and word uni-grams. For the profile-length hyperparameter, in all experiments they used the maximum length of the smallest profile in the dataset. They evaluated the test subsets by preparing test profiles and applying pairwise CNG between each train-test language pair. The result is an asymmetrical distance matrix between the train and test language profiles. Then, using the evaluation method described by Gamallo et al., they obtained accuracy with reference to a gold standard. To evaluate the significance and stability of the results, they employ McNemar's statistic and the Spearman correlation coefficient. These parameters are also calculated for the Perplexity Method. On comparing the two, CNG performs considerably better than the older, time-consuming and complex Perplexity Method.

In the second paper, the author states that the methods used are orthographic distance and orthographic similarity for the name matching approach. By comparing various methods and summarising their equations, the author provides a clear contrast showing how the N-DIST approach has been the most efficient one. This approach has been used to provide name matching in English as well as in Arabic (A-N-DIST). The purpose of this paper is to improve the N-DIST algorithm for Latin-based languages by considering multiple transpositions; the result is called the E-N-DIST (Enhanced N-DIST) algorithm.

This algorithm proposed by the author promises improvement over the original N-DIST algorithm. The first improvement is that the E-N-DIST algorithm can provide computations for a greater number of states as compared to the original algorithm. The second is that the estimated cost of performing operations (transposition, substitution, insertion and deletion) on the word for name matching depends on the number of states of the word, rather than falling back on the fixed cost used by the N-DIST algorithm. This minimizes the error in correction and provides highly reliable results.

The author turns the equations used for the E-N-DIST algorithm into algorithms that compute the cost of each operation that takes place during name matching. This is done by comparing weights for each N depending on the word's error. The results vary from the N-DIST algorithm and provide a wider range of solutions. The pseudo code for insertion and deletion is much simpler than the pseudo code for transposition and substitution, owing to the complexity of the latter operations.

Moreover, experiments were conducted for the thesis. They begin with the preparation of datasets, containing names in multiple languages that were used to test the algorithms highlighted in the paper. In total, 11 datasets of different languages were added, each with spelling errors and corrections, and the performance of the E-N-DIST algorithm was examined. Overall, the E-N-DIST algorithm provided the best results (as highlighted by the authors in a given table). Using two benchmark datasets of Arabic and English names, the authors assess the suggested algorithm and compare it to many state-of-the-art techniques, including the Jaro-Winkler distance, the Levenshtein distance, and the Smith-Waterman algorithm. They quantify the algorithms' effectiveness using F-measure, recall, and precision. The findings demonstrate the proposed algorithm's superior performance on both datasets, where it achieves an F-measure of up to 0.98. The authors also perform a sensitivity analysis to examine the impact of several factors on the algorithm's performance, including the N-gram size, the synonym/homonym dictionary, and the weighting method. For further analysis, the author provides the formulae for calculating correctness using the F-measure, precision and recall, and, unsurprisingly, E-N-DIST outperformed the other algorithms.

The third paper discusses an algorithm with two phases. The first phase detects the differences between the packet and the attack signatures, and the q-gram distances are then arranged in increasing order; the second phase is a deeper search into the packets. The paper focuses on the first phase. The steps of the algorithm used in this paper are:
• Pre-processing: The raw network traffic data is preprocessed to extract features, which are subsequences of length q. These features are represented as a set of q-grams, where a q-gram is defined as a sub-string of length q.
• Computing the q-gram distance: The q-gram distance is computed between each pair of network traffic samples in the dataset. The q-gram distance between two samples is defined as the number of unique q-grams that are present in both samples. This distance measure captures the similarity between the two samples based on the q-grams they share.
• Feature selection: The q-gram distance values are used to select a subset of the most discriminative q-grams for use as features in the classification model. The authors propose using a threshold value to determine which q-grams are most discriminative. The threshold value is set based on the desired number of features and the distribution of the q-gram distances in the dataset.
• Classification: The selected q-grams are used as features in a classification model to distinguish between normal and anomalous network traffic. The authors evaluate the performance of several classification algorithms, including Naive Bayes, decision trees, and support vector machines (SVM).

Then an experiment was conducted. For every string, the constrained edit distance and the q-gram distance were found in advance, and a threshold distance was known beforehand. For each string, the constrained edit distance or q-gram distance between the record and the search string was calculated, and if it fell within the threshold, the record/string moved to the second phase. The approach is based on a well-established metric (the q-gram distance) and is therefore easy to implement. The experimental evaluation is extensive and uses both synthetic and real-world datasets. The results show that the approach outperforms other state-of-the-art approaches in terms of efficiency and accuracy.

In the fourth paper, to find candidate positions for a known sub-word sequence, the syllable is used as the sub-word unit, and the recurrence equation of a DTW between an input sub-word sequence (a1 a2 a3 a4 ...) and a query sub-word sequence (b1 b2 b3 b4 ...) is required. The position where the query appears is the result, and the neighborhood of the spotting position is taken as an input configuration. Distance from the query is measured by DTW; distance between the query and a candidate is measured by edit distance or Bhattacharyya distance. For Spoken Document Retrieval (SDR) of OOV words with a syllable lattice containing mis-recognized syllables, the spoken document is recognized by an LVCSR with a syllable recognition system for OOV words, and indexing is then applied to the lattice, which contains plural candidates at every best candidate. OOV terms are searched using the top m-best paths in the syllable lattice. Syllables are first assigned their appearance positions in a spoken document, and an n-gram is then built at each appearance position. The n-grams are sorted in lexical order so that they can be searched swiftly with a binary search method. Three steps make up the search procedure on an n-gram array: a query is first transformed into a series of syllables; the query is then turned into an n-gram; the final step is to retrieve the query from the n-gram array. A combination of n-grams is used for a query that has more than n+1 syllables. A query with less than 2n syllables and more than n+1 syllables is divided into two n-grams, the first half and the second half; as a result, the query is retrieved twice from the n-gram array. The positions where the detection results occurred in the first half and in the latter half are taken into account when combining the findings, to determine whether they appeared within one position of each other or not. A query with less than 3n syllables and more than 2n+1 syllables can be found by splitting it into three parts. For substitution errors, an n-gram array is made using the combinations of syllables in the m-best lattice; for one position in the lattice, there are m^n kinds of n-gram. For insertion errors, an n-gram array is made which permits one distant n-gram, considering the gap between appearance positions. For deletion errors, the query is searched as above while allowing for the case where one syllable in the query is deleted. Pruning method for the array indices: to attack the problem of a large index, tri-gram indices are pruned based on the probability of the recognition results. The difference between the 1-best and m-best likelihoods is calculated for every position in the lattice; the indexer compares a pre-set threshold with the difference and adds the tri-gram to the index if the difference is smaller than the threshold. Decision maker: a new distance measure for the number of errors allowed is defined, to allow fast pruning of unreliable candidates. The syllable distance for a syllable in the 1-best result is always zero; the syllable distance for an insertion error is 1; and the syllable distance for deletion errors is equal to the number of syllables that were eliminated from the query.

IV. CONTRIBUTION

V. PROPOSED WORK

The authors of the first paper claim that a number of applications, including language categorization and machine translation, may benefit from their method. This model has the capability of finding and asserting the relevance of a conversation between two individuals speaking two different languages. This is especially useful in training and giving guidance to (human) translators, who can perform the translation, evaluate their progress and improve.

In the second paper, the scope widens to the practical usage of day-to-day spell checking in features like "Auto-correct", which also use this concept to get the most out of the user experience. Overall, this paper has provided a better future outlook by adding phonetic and other conditions for name matching.

In the third paper, the q-gram distance has the potential to increase the effectiveness of misuse detection. To further improve the precision and effectiveness of intrusion detection systems, future research can concentrate on investigating the usage of additional distance measures and machine learning approaches. Another interesting area of research is examining the efficiency of incorporating domain-specific knowledge into the detection process.

An important topic for future work in the fourth paper is to improve the retrieval precision. Using just the low-confidence portions of the LVCSR results as OOV candidates is one technique to increase the retrieval accuracy. Combining the output of various decoders is another technique to raise the rate of syllable recognition. Finally, to increase retrieval accuracy, the syllable distance can be replaced with the syllable probability received from the decoder.

VI. CONCLUSION

The CNG algorithm is addressed in the first paper as a suitable approach for author identification tasks, offering an automated and effective technique to represent linguistic similarity.

The second study offers a novel notion of n-gram distance, which has applications in Arabic and other languages, including useful ones like name-matching and spell-checking.
In the third study, a unique way of increasing the effectiveness and precision of misuse detection in IDS is presented.
This method makes use of the q-gram distance measure, which
has been shown to be more effective and precise than existing
approaches for datasets with a lot of features.
Last but not least, the fourth research suggests a pruning
method employing a trigram array index that outperformed
earlier approaches in terms of search time in retrieval tasks.
In conclusion, these four publications introduce cutting-edge
ideas and strategies to tackle a range of problems. These
are significant contributions to their respective domains in
information retrieval, network security, and other disciplines
of natural language processing. They serve as an example of
the ongoing work being done to create new algorithms and
methods that will enhance our capacity to understand natural
language, secure networks, and rapidly retrieve information.

REFERENCES
[1] Dijana Kosmajac, Vlado Keselj, "Language Distance Using Common N-Grams", 19th International Symposium INFOTEH-JAHORINA, 18-20 March 2020.
[2] Salah Al-Hagree, Maher Al-Sanabani, Mohammed Hadwan, Mohammed A. Al-Hagery, "An Improved N-gram Distance for Names Matching", 2019 First International Conference of Intelligent Computing and Engineering (ICOICE).
[3] Maher Al-Sanabani, Slobodan Petrovic, Sverre Bakke, "Improving the Efficiency of Misuse Detection by Means of the q-gram Distance", The Fourth International Conference on Information Assurance and Security.
[4] Keisuke Iwami, Yasuhisa Fujii, Kazumasa Yamamoto, Seiichi Nakagawa, "Efficient Out-of-Vocabulary Term Detection by N-Gram Array Indices with Distance from a Syllable Lattice", Department of Computer Science and Engineering, Toyohashi University of Technology, Japan.
