Search Engine - HW-module2 - CPEG657


Assignment #2

Question #1

A.-

tr '[A-Z]' '[a-z]' < assign2-article1.txt | awk '{for (i=1;i<=NF;i++) print $i;}' | sort | uniq -c | sort -rn -k1 > assign2-article1countsorted.txt

tr '[A-Z]' '[a-z]' < assign2-article2.txt | awk '{for (i=1;i<=NF;i++) print $i;}' | sort | uniq -c | sort -rn -k1 > assign2-article2countsorted.txt

The results of the word count give me a brief idea of each article: after the most common words, the list shows key words related to the topic of each article.
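As a cross-check, here is a minimal C++ sketch of the same counting step. It assumes plain whitespace tokenization and ASCII lowercasing, mirroring the tr/awk pipeline above; punctuation stays attached to words, just as in the pipeline.

#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    std::ifstream in("assign2-article1.txt");
    std::map<std::string, int> counts;
    std::string word;
    // Tokenize on whitespace and lowercase, mirroring tr '[A-Z]' '[a-z]' + awk.
    while (in >> word) {
        for (char &c : word) c = std::tolower(static_cast<unsigned char>(c));
        ++counts[word];
    }
    // Sort by descending count, mirroring sort -rn -k1.
    std::vector<std::pair<std::string, int>> sorted(counts.begin(), counts.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const auto &a, const auto &b) { return a.second > b.second; });
    // Print "count word", the same layout produced by uniq -c.
    for (const auto &p : sorted) std::cout << p.second << " " << p.first << "\n";
    return 0;
}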
B.-

The articles are similar: they have almost the same counts for the most common words. The difference lies in the key words that carry the argument of each article.

Question #2

Part I.

1)

While BM25/Okapi and pivoted normalization are among the most effective ways to implement TF-IDF weighting, what the optimal way of implementing TF-IDF is remains one of the most challenging research questions in information retrieval.

Both formulas use the frequency of each query term in the query and in the document, as well as the length of the document.

2)

Okapi

// BM25/Okapi parameters
double k1 = 1.2;
double k3 = 1000;
double b = 0.75;

for (int i = 0; i < numTerms; i++) {  // loop over the query terms matched in the document
    score += log((docN - DF[i] + 0.5) / (DF[i] + 0.5))                                        // IDF component
           * (((k1 + 1) * tf[i]) / (k1 * ((1 - b) + b * (docLength / docLengthAvg)) + tf[i]))  // length-normalized document TF
           * (((k3 + 1) * qf[i]) / (k3 + qf[i]));                                              // query TF
}
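In formula form, the snippet above is the standard Okapi BM25 weighting (with N = docN, df_i = DF[i], |D| = docLength, and avdl = docLengthAvg):

$$\text{score}(Q,D)=\sum_{i\in Q\cap D}\ln\frac{N-df_i+0.5}{df_i+0.5}\cdot\frac{(k_1+1)\,tf_i}{k_1\left((1-b)+b\,\frac{|D|}{avdl}\right)+tf_i}\cdot\frac{(k_3+1)\,qf_i}{k_3+qf_i}$$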

Pivoted normalization

double s = 0.75;

for (int i = 0; i < numTerms; i++) {  // loop over the query terms matched in the document
    score += ((1 + log(1 + log(tf[i]))) / ((1 - s) + s * (docLength / docLengthAvg)))  // length-normalized TF
           * qf[i]                                                                      // query term frequency
           * log((docN + 1) / DF[i]);                                                   // IDF
}
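This is the usual pivoted length normalization weighting; in formula form:

$$\text{score}(Q,D)=\sum_{i\in Q\cap D}\frac{1+\ln\left(1+\ln tf_i\right)}{(1-s)+s\,\frac{|D|}{avdl}}\cdot qf_i\cdot\ln\frac{N+1}{df_i}$$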

3)

Okapi

        doe               robust04
b       MAP      P30      MAP      P30
0       0.1771   0.2152   0.2525   0.249
0.25    0.1816   0.22     0.2311   0.2819
0.5     0.1836   0.2267   0.237    0.2896
0.75    0.1811   0.2276   0.2346   0.2898
1       0.1699   0.2124   0.2122   0.2614

Pivoted normalization

        doe               robust04
b       MAP      P30      MAP      P30
0       0.0895   0.119    0.1422   0.21
0.25    0.0867   0.1257   0.1656   0.242
0.5     0.0876   0.12     0.1555   0.2281
0.75    0.0862   0.1162   0.1419   0.2107
1       0.0805   0.1076   0.1271   0.1942
From the results, the Okapi function does not change much when the parameter changes, while the pivoted normalization scores go up and down. From this I can deduce that Okapi has the best performance.

Okapi also reaches its maximum at the same parameter value on both collections, whereas pivoted normalization reaches its maximum at a different parameter value on each collection.

Part II. Search Engine Competition

Word Document Density and Relevance Scoring (SIGIR 2000)

double k1 = 1.2;
double k2 = 1000;
double c1 = 0.75;
double c2 = 0.75;
double dd1 = 0, idf, ddc, dd;

for (int i = 0; i < numTerms; i++) {              // loop over the query terms matched in the document
    dd1 += tf[i] / docLength;                     // accumulated within-document term density
    dd = dd1 / DF[i];                             // density relative to the document frequency
    idf = log((docN - DF[i] + 0.5) / (DF[i] + 0.5));
    ddc = c1 + log(dd) + c2;                      // density correction term
    score += (tf[i] / ((k1 + k2) * (docLength / docLengthAvg) + tf[i])) * idf + ddc;
}

Generalized Inverse Document Frequency (CIKM 2008)

double k1 = 1.2;     // not used in this variant
double k3 = 1000;    // not used in this variant
double b = 0.75;     // not used in this variant
double gamma = 0.5;
double IDFr, IDFrhat;

for (int i = 0; i < numTerms; i++) {                            // loop over the query terms matched in the document
    IDFrhat = log((docN - DF[i] + gamma) / (DF[i] + gamma));    // smoothed IDF estimate
    IDFr = gamma / (1 - gamma);                                 // prior odds term
    score += IDFr + IDFrhat;
}
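In math form, the score computed by this snippet is:

$$\text{score}(Q,D)=\sum_{i\in Q\cap D}\left[\frac{\gamma}{1-\gamma}+\ln\frac{N-df_i+\gamma}{df_i+\gamma}\right],\qquad\gamma=0.5$$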

A Study of Poisson Query Generation Model for Information Retrieval (SIGIR 2007)

double dirMu = 2000;   // values tried: 1500, 2000, 2500

for (int i = 0; i < numTerms; i++) {   // loop over the query terms matched in the document
    score += qf[i] * log((1 + tf[i]) / dirMu * (TFC[i] / MaxDF))
           + log(dirMu / (docN + dirMu));
}
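For reference, the standard Dirichlet-smoothed query-likelihood score from the language modeling literature, which uses the same smoothing parameter mu, is commonly written as:

$$\text{score}(Q,D)=\sum_{w\in Q}c(w,Q)\,\ln\!\left(1+\frac{c(w,D)}{\mu\,p(w\mid C)}\right)+|Q|\,\ln\frac{\mu}{|D|+\mu}$$

In the snippet above, TFC[i]/MaxDF appears to play the role of the collection model p(w|C); that reading is an observation about the code here, not a statement taken from the paper.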

These formulas use the frequency of the query terms in the query and in the document, as well as the length of the document, except for the Generalized Inverse Document Frequency, which uses the relevance of the document instead.
