
ISR Assignment 1

Objectives for my part:

 Study the statistical property of the text
• Calculate the frequency of words
• Rank words according to their frequency
• Plot the graph of frequency vs. rank
• Calculate the product of rank and frequency
• Explain the distribution: does it follow a Zipfian
distribution, does it follow Zipf's law, etc.
 Based on Luhn’s idea:
• What are the words that will be removed/will not be considered as
index terms, what is your upper and lower cut-off point, how did
you decide on the cut-off points
• Which words are used for indexing
Study the statistical property of the text
• Of the given alternatives, our group chose Afaan Oromo as the
base language for this project. Tokenization and markup removal
are language-dependent tasks; Afaan Oromo uses almost the same
markup and characters as English, except for “ ’ ”, or known as
• On the technical side, we used Python as our implementation
language. Why Python? It gives us access to the matplotlib
library, which plots the rank-frequency graphs, along with the
other functionality it offers. Python is also well suited to
data analysis and data manipulation.
• To determine the word distribution (Zipf's law) and word
significance (Luhn's idea), and to show how vocabulary size
grows as the corpus grows (Heaps' law), we extracted data
from about 12 documents containing about 23,094 words in
total.
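The word counting and ranking steps can be sketched as follows. This is a minimal illustration, not the group's actual script: the token pattern (letters with an optional internal apostrophe), the helper names `tokenize` and `rank_words`, and the toy input are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a token is a run of letters that may contain an internal
# apostrophe (straight ' or curly ’), so a hypothetical word like "ab'cd"
# stays in one piece instead of being split at the apostrophe.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN_RE.findall(text.lower())

def rank_words(text):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    counts = Counter(tokenize(text))
    # Sort by descending frequency, then alphabetically so ties are stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]

# Hypothetical toy input, not the real corpus.
sample = "kan akka kan hin kan hin"
ranked = rank_words(sample)
for rank, word, freq in ranked:
    print(rank, word, freq)
```

On a real corpus, the same `rank_words` output supplies the rank and frequency columns used in the rest of the analysis.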
• Zipf’s Law
• Zipf's law is an empirical law that describes the distribution of
frequencies of words in a language. Named after linguist George
Zipf, this law states that in a given corpus of natural language text,
the frequency of any word is inversely proportional to its rank in the
frequency table. In other words, the second most frequent word will
occur approximately half as often as the most frequent word, the
third most frequent word will occur one-third as often, and so on.
• Key Points of Zipf's Law:
• 1. Rank-Frequency Relationship:
• 2. Frequency Distribution:
• Zipf's law implies that a small number of words are used
very frequently, while the vast majority are used rarely.
• 3. Log-Log Plot:
• When the rank and frequency of words are plotted on a
log-log scale, Zipf's law predicts that the plot will be a
straight line with a slope of approximately -1.
• Since Zipf's law indicates that a few words are very
common, data compression algorithms can use this
property to encode text more efficiently by using shorter
codes for frequent words.
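One way to check the log-log prediction numerically is to fit a straight line to log(frequency) against log(rank) and look at its slope. The sketch below uses an ordinary least-squares fit written with the standard library only; the function name `loglog_slope` and the ideal test counts are assumptions for illustration.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    `frequencies` must be sorted in descending order; the first item
    is treated as rank 1.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Hypothetical ideal Zipfian counts f = 1200 / r, used only to check
# that the function returns a slope of about -1 in the ideal case.
ideal = [1200 / rank for rank in range(1, 11)]
print(round(loglog_slope(ideal), 3))
```

A slope near -1 on real counts would support the Zipfian reading; a clearly different slope would suggest the corpus deviates from it.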
Rank  Word   Word frequency  r*f/tf
1     hin    1932            0.083658093
2     kan    1568            0.135792847
3     akka   1525            0.198103403
4     fi     1483            0.256863255
5     ta     1371            0.296830346
6     a       926            0.240581969
7     isaa    832            0.252186715
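The last column can be recomputed from the rank and frequency columns. The sketch below assumes that tf in the column header denotes the total number of words in the corpus (23,094), an interpretation that reproduces the tabulated values; the names `rows`, `TOTAL_WORDS`, and `normalized_products` are illustrative.

```python
# (rank, word, frequency) rows copied from the table above.
rows = [
    (1, "hin", 1932),
    (2, "kan", 1568),
    (3, "akka", 1525),
    (4, "fi", 1483),
    (5, "ta", 1371),
    (6, "a", 926),
    (7, "isaa", 832),
]

# Assumption: tf is the total word count of the corpus.
TOTAL_WORDS = 23094

def normalized_products(rows, total):
    """Return rank * frequency / total for each row."""
    return [rank * freq / total for rank, _word, freq in rows]

products = normalized_products(rows, TOTAL_WORDS)
for (rank, word, freq), value in zip(rows, products):
    print(f"{rank:>2} {word:<5} {freq:>5} {value:.9f}")
```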

Does it follow a Zipfian distribution, does it follow Zipf's law?

Yes, it approximately follows Zipf's law.

Zipf's law states that when the distinct words in a text are
arranged in decreasing order of their frequency of occurrence
(most frequent words first), the occurrence characteristics of the
vocabulary can be characterized by the constant rank-frequency law
of Zipf, that is, r * f = c.
In the table above, the normalized products r*f/tf are
0.083658093, 0.135792847, 0.198103403, ...; they rise over the
first few ranks and then stay roughly constant, between about 0.24
and 0.30, from rank 4 onward.
• Luhn’s Law
• Luhn's method, proposed by Hans Peter Luhn in 1958, is a technique used in
information retrieval and text summarization to decide a cut-off point for selecting
significant words (terms) from a document. The idea is to identify the most
informative words that can be used for indexing, summarizing, or further analysis.
The method is based on the observation that words with extremely high or low
frequencies tend to be less informative, while words with medium frequencies tend to
carry more significant content.

• Steps to Decide a Cut-off Point Using Luhn’s Method:

• 1. Frequency Distribution Analysis:

• Calculate the frequency of each word in the document or corpus.
• Rank the words in descending order of their frequency.

• 2. Determine the Upper and Lower Frequency Thresholds:

• Lower Cut-off: Words that occur very infrequently are often not informative because they
might be rare terms or misspellings.
• Upper Cut-off: Words that occur very frequently are usually stop words which are
common in all texts and do not provide specific information about the document's content.

• 3. Set the Lower and Upper Cut-off Points:

• Luhn suggested that the most informative words are those whose frequency lies between
the lower and upper cut-off points.
• Lower Cut-off (L): Often determined by ignoring words that occur fewer than a certain
number of times (e.g., fewer than 3 times in the document).
• Upper Cut-off (U): Often determined by ignoring words that are among the topmost
frequent words, typically stop words.
• In this example, the function `luhns_method` takes a
document, a lower cut-off, and an upper cut-off ratio. It
returns the list of informative words based on Luhn's
method. Adjust the cut-off parameters as needed for
different corpora or use cases.
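A minimal sketch of such a function, consistent with the description above. The parameter semantics are assumptions, not taken from the original code: `lower_cutoff` is treated as a minimum count, and `upper_cutoff_ratio` as the fraction of distinct words, taken from the top of the frequency ranking, that is discarded as too common. The toy document uses placeholder tokens, not real corpus words.

```python
import re
from collections import Counter

def luhns_method(document, lower_cutoff=3, upper_cutoff_ratio=0.1):
    """Return words whose frequency lies between the two cut-offs.

    Assumptions made for this sketch:
    - lower_cutoff is a minimum count; rarer words are dropped.
    - upper_cutoff_ratio is the share of distinct words, counted from
      the top of the frequency ranking, dropped as too common.
    """
    # Letters with an optional internal apostrophe form one token.
    words = re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", document.lower())
    counts = Counter(words)
    # Most frequent first; alphabetical order breaks ties.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    n_top = int(len(ranked) * upper_cutoff_ratio)
    too_common = set(ranked[:n_top])
    return [w for w in ranked
            if w not in too_common and counts[w] >= lower_cutoff]

# Hypothetical toy document built from placeholder tokens.
doc = "aa " * 10 + "bb " * 5 + "cc " * 4 + "dd " * 1
print(luhns_method(doc, lower_cutoff=2, upper_cutoff_ratio=0.25))
```

Here the most frequent quarter of the distinct words ("aa") is dropped as too common and "dd" is dropped as too rare, leaving "bb" and "cc" as candidate index terms.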

• Example: the lower cut-off point begins at frequency 0.5

• The upper cut-off point begins at frequency 2.0

• Examples: hubachiisa, eegamedha, eyyama,
