Search Engine - HW-module2 - CPEG657
Question #1
A.-
tr '[A-Z]' '[a-z]' < assign2-article1.txt | awk '{for (i = 1; i <= NF; i++) print $i;}' | sort | uniq -c | sort -rn -k1 > assign2-article1countsorted.txt
tr '[A-Z]' '[a-z]' < assign2-article2.txt | awk '{for (i = 1; i <= NF; i++) print $i;}' | sort | uniq -c | sort -rn -k1 > assign2-article2countsorted.txt
The word counts give me a brief idea of each article: after the most common stop words, the list shows key words related to each article's topic.
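The pipeline above (lowercase, split into words, count, sort by frequency) can also be sketched in C++; this is a minimal sketch, and `wordCounts` is a hypothetical helper name, not part of the assignment:

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Lowercase the text, split on whitespace, count each word,
// and return the (word, count) pairs sorted most-frequent first.
std::vector<std::pair<std::string, int>> wordCounts(const std::string& text) {
    std::string lower;
    for (char c : text)
        lower += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    std::map<std::string, int> counts;
    std::istringstream in(lower);
    std::string word;
    while (in >> word) ++counts[word];  // like `uniq -c` after sorting
    std::vector<std::pair<std::string, int>> sorted(counts.begin(), counts.end());
    std::stable_sort(sorted.begin(), sorted.end(),
                     [](const auto& a, const auto& b) { return a.second > b.second; });
    return sorted;                      // like `sort -rn -k1`
}
```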
B.-
The articles are similar, with almost the same counts for their most common words; the difference lies in the key words that carry each article's argument.
Question #2
Part I.
1)
While BM25/Okapi and pivoted normalization are among the most effective ways to implement TF-IDF, finding the optimal way to implement TF-IDF remains one of the most challenging open research questions in information retrieval.
Both formulas use the frequency of each term in the query and in the document, as well as the length of the document.
2)
Okapi

double k1 = 1.2;
double k3 = 1000;
double b = 0.75;
for (/* each query term i matching the document */)
{
    score += log((docN - DF[i] + 0.5) / (DF[i] + 0.5))                          // IDF
           * (((k1 + 1) * tf[i])
              / (k1 * ((1 - b) + b * (docLength / docLengthAvg)) + tf[i]))      // length-normalized TF
           * (((k3 + 1) * qf[i]) / (k3 + qf[i]));                               // query TF
}
Pivoted normalization

double s = 0.75;
for (/* each query term i matching the document */)
{
    score += ((1 + log(1 + log(tf[i])))
              / ((1 - s) + s * (docLength / docLengthAvg)))     // pivoted length-normalized TF
           * qf[i] * log((docN + 1) / DF[i]);                   // query TF times IDF (multiplied, not added)
}
3)
Okapi

  b      doe MAP   doe P30   robust04 MAP   robust04 P30
  0      0.1771    0.2152    0.2525         0.249
  0.25   0.1816    0.22      0.2311         0.2819
  0.5    0.1836    0.2267    0.237          0.2896
  0.75   0.1811    0.2276    0.2346         0.2898
  1      0.1699    0.2124    0.2122         0.2614
Pivoted normalization

  s      doe MAP   doe P30   robust04 MAP   robust04 P30
  0      0.0895    0.119     0.1422         0.21
  0.25   0.0867    0.1257    0.1656         0.242
  0.5    0.0876    0.12      0.1555         0.2281
  0.75   0.0862    0.1162    0.1419         0.2107
  1      0.0805    0.1076    0.1271         0.1942
From these results, the Okapi function changes little as its parameter varies, whereas pivoted normalization moves up and down, so I can deduce that Okapi has the best performance. Okapi also reaches its maximum at the same parameter value on both collections, while pivoted normalization reaches its maximum at different parameter values on each collection.
double k1 = 1.2;
double k2 = 1000;
double c1 = 0.75;
double c2 = 0.75;
double dd1 = 0, idf, ddc, dd;
for (/* each query term i matching the document */)
{
    dd1 += tf[i] / docLength;
    dd = dd1 / DF[i];
    idf = log((docN - DF[i] + 0.5) / (DF[i] + 0.5));
    ddc = c1 + log(dd) + c2;
    score += (tf[i] / ((k1 + k2) * (docLength / docLengthAvg) + tf[i])) * idf + ddc;
}
double k1 = 1.2;
double k3 = 1000;
double b = 0.75;
double gamma = 0.5;
double IDFr, IDFrhat;
for (/* each query term i matching the document */)
{
    IDFrhat = log((docN - DF[i] + gamma) / (DF[i] + gamma));
    IDFr = gamma / (1 - gamma);
    score += IDFr + IDFrhat;
}
A Study of Poisson Query Generation Model for Information Retrieval (SIGIR 2007)
for (/* each query term i matching the document */)
{
    score += qf[i] * log((1 + tf[i]) / dirMu * (TFC[i] / MaxDF))
           + log(dirMu / (docN + dirMu));
}
These formulas use the frequency of words in the query and in the document, as well as the length of the document, except for the Generalized Inverse Document Frequency, which uses the relevance of the document.
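As a concrete instance of this point, here is a minimal sketch of the single-term pivoted-normalization contribution from Part I, written as a standalone function (`pivotedTerm` is a hypothetical helper name; the toy arguments in the usage comment are made up):

```cpp
#include <cmath>

// Single-term pivoted-normalization contribution: a length-normalized
// TF factor, times the query frequency, times an IDF factor.
// Assumes tf >= 1 so the inner log is defined.
double pivotedTerm(double tf, double qf, double DF, double docN,
                   double docLength, double docLengthAvg, double s = 0.75) {
    double tfPart = (1.0 + std::log(1.0 + std::log(tf)))
                  / ((1.0 - s) + s * (docLength / docLengthAvg));
    double idf = std::log((docN + 1.0) / DF);
    return tfPart * qf * idf;
}
```

With toy values, a rarer term (smaller DF) or a more frequent term in the document (larger tf) both raise the score, which is exactly the role the term and document statistics play in these formulas.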