Project Report
CHAPTER 01
INTRODUCTION
Approaches
CHAPTER 02
BACKGROUND
History
CHAPTER 03
METHODOLOGY
Jaccard Similarity
Euclidean Distance
Methodology
CHAPTER 04
Conclusion
CHAPTER 01
INTRODUCTION
Electronic business is growing rapidly, and the volume of information available on the
internet has grown with it, making it difficult to find the required information in time.
Recommender systems address this problem. They are becoming increasingly popular in the
business world, and almost every consumer website now has its own recommender system.
Recommender systems can be used in many areas, such as research papers, movies, videos,
newspapers, books, songs, products and so on.
A recommender system can be defined as an information filtering system that predicts a
user's preferences from his own choices and from other users' behavior; put differently, it is
a discovery assistant that helps its users identify items they may like. These systems suggest
items to a user on the basis of his past purchases and searches, and on other users'
behavior. It creates ease for the customer to discover
The recommendation system allows customers to give input about their likes or dislikes.
Netflix is an object lesson: customers can provide feedback on their choices by rating items
with a single click. Users rate items with numerical values; the five-star rating system is a
salient example.
In addition to the rating system, feedback can also be obtained by recording the buying and
browsing behavior of customers. Online traders such as Amazon.com, Facebook, YouTube,
Daraz.pk and OLX.com use these types of feedback. The entity to which a suggestion is given
is referred to as the customer, and the goods being recommended are referred to as items.
Because users' past interests are considered the best indicators of their future preferences,
the recommender system uses users' past behavior to analyze their tastes. Suppose a customer
visits a website and orders shampoo, conditioner and body wash. His buying pattern is recorded
in the database, and when another customer comes and orders, for example, shampoo, the
recommender system will start suggesting that he order conditioner and body wash as well.
Let us consider the example of YouTube, where the system suggests to its customers the clips
A prominent exception is the knowledge-based recommender system, which does not consider
users' past preferences; instead it considers the requirements specified by the users.
Knowledge-based recommender systems are intelligent enough to decide which product should be
ranked high or low. For instance, a product that is not commonly purchased lies among the
lowest ratings, but if people suddenly start purchasing that item, the knowledge-based
recommender system will automatically raise its rating and will start suggesting that item to
Recommender systems ideally work in one of two ways. A system can rely on the properties of
the items a customer likes, which are analyzed to figure out what else the customer may like;
or it can rely on the preferences of other customers, which the system uses to compute a
similarity index among customers and recommend items to them accordingly. It is also possible
to combine both these
what a customer might possibly like among a list of given items. The basic rule of
recommendation is that there must be a significant relationship between the user and the item
he selects. For example, a user who is interested in an educational program is more likely to
be interested in another program on the same subject than in an action film. In many cases,
where products are classified into groups of similar items, these groups may show significant
correlations that can be used to make accurate recommendations. On the other hand, the
dependency between user and item may exist at the finer granularity of individual items
rather than categories of items. The dependencies can be represented as a ratings matrix, and
future predictions for target users are made from this ratings matrix. If the number of rated
items available for a user is larger, then it is simpler to
A wide range of models can be used to predict user preferences. For example, the aggregate
purchasing or rating behavior of different customers can be used to form clusters of similar
customers who share similar choices of items. The interests and activities of these clusters
can be used to make suggestions to the individual members of the cluster. If a new user from
outside a cluster orders an item from that cluster, the system will start suggesting him the
other items from that cluster.
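The cluster-based suggestion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the clusters are taken as already formed, and the cluster names, baskets and the function suggest_from_cluster are hypothetical, not part of the system described in this report.

```python
from collections import Counter

# Hypothetical purchase histories, already grouped into clusters of similar customers.
clusters = {
    "personal_care": [
        {"shampoo", "conditioner", "body wash"},
        {"shampoo", "conditioner"},
    ],
    "stationery": [
        {"notebook", "pen"},
        {"pen", "marker", "notebook"},
    ],
}


def suggest_from_cluster(ordered_item, clusters, k=3):
    """Suggest up to k other items from the cluster where ordered_item is most popular."""
    best_name, best_count = None, 0
    for name, baskets in clusters.items():
        count = sum(ordered_item in basket for basket in baskets)
        if count > best_count:
            best_name, best_count = name, count
    if best_name is None:
        return []  # the item does not appear in any cluster

    # Count how often each item was bought inside the chosen cluster.
    counts = Counter(item for basket in clusters[best_name] for item in basket)
    del counts[ordered_item]
    # Sort by frequency (descending), then by name, so the result is deterministic.
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return ranked[:k]


print(suggest_from_cluster("shampoo", clusters))
```

A new customer who orders shampoo is matched to the personal-care cluster and is offered the items its members bought most often.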
There are some approaches which can be used in the recommender system to make
Most recommender systems adopt one of two essential strategies: collaborative filtering or
exist.
Collaborative filtering
Collaborative filtering algorithms are based on user behavior. A collaborative filtering
model can be built from an individual user's behavior, but to obtain better results it can
also be trained on the behavior of other users who share similar choices or tastes.
Collaborative filtering is used to predict missing ratings. The ratings that multiple users
provide for items are used to make recommendations for new users as well as existing ones.
Consider a digital library, where users' ratings of books express their likes or dislikes for
specific books. Most users will have seen only a small fraction of the immense number of
available records. Consequently, a large part of the ratings are unobserved (or unspecified).
The ratings that users have specified are termed observed ratings, while the unspecified
ratings are known as unobserved or missing ratings.
The collaborative filtering technique is founded on a single fundamental rule: unspecified
ratings can be imputed because the observed ratings are often highly correlated across
different users and items. To illustrate, consider two users who have very similar
preferences. If the ratings they have both specified are very similar, the underlying
algorithm can recognize their similarity. It is then very likely that, where only one of them
has rated an item, the other would rate it similarly. This similarity can be used to make
inferences about incompletely specified values.
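A minimal sketch of this idea follows. It assumes a Pearson correlation over the items two users have both rated and a similarity-weighted average over positively correlated users only; the users, books and ratings are hypothetical, and a production system would typically also mean-center the ratings and limit the neighborhood size.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale; a missing key is an unobserved rating.
ratings = {
    "alice": {"book_a": 5, "book_b": 4, "book_c": 1},
    "bob": {"book_a": 5, "book_b": 4, "book_c": 1, "book_d": 5},
    "carol": {"book_a": 1, "book_b": 2, "book_c": 5, "book_d": 1},
}


def pearson(u, v):
    """Pearson correlation of two users over the items both have rated."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    mean_u = sum(ratings[u][i] for i in common) / len(common)
    mean_v = sum(ratings[v][i] for i in common) / len(common)
    dev_u = [ratings[u][i] - mean_u for i in common]
    dev_v = [ratings[v][i] - mean_v for i in common]
    denom = sqrt(sum(x * x for x in dev_u)) * sqrt(sum(y * y for y in dev_v))
    if denom == 0:
        return 0.0  # one user gave identical ratings to all common items
    return sum(x * y for x, y in zip(dev_u, dev_v)) / denom


def predict(user, item):
    """Estimate a missing rating from positively correlated users who rated the item."""
    num = den = 0.0
    for other in ratings:
        if other == user or item not in ratings[other]:
            continue
        weight = pearson(user, other)
        if weight <= 0:
            continue  # ignore users with unrelated or opposite tastes
        num += weight * ratings[other][item]
        den += weight
    return num / den if den else None


print(predict("alice", "book_d"))
```

Here alice and bob agree on every book they have both rated, so bob's rating of book_d dominates the estimate of alice's missing rating, while carol, whose tastes are opposite, is ignored.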
As another example of collaborative filtering, suppose we are building a website that
suggests articles to readers. With collaborative filtering, the system records which users
have subscribed to and read each article. It then forms groups of users who prefer similar
articles. From this information it is easy to identify the most popular articles within a
group, and those articles can then be recommended to group members who have not yet read or
subscribed to them.
filtering. This method predicts ratings that are not yet specified on the basis of their
neighborhoods. User-based collaborative filtering and item-based collaborative filtering are
the two ways of defining the neighborhood. In user-based collaborative filtering, the system
identifies users who had similar preferences in the past and predicts that, if two users
rated the same items in a similar way, the unobserved ratings of one can be estimated from
the observed ratings of the other. In item-based collaborative filtering, the system
identifies the items most similar to the target item and predicts that, if a particular user
likes the target product, then he is more likely to
In a content-based recommender system, the descriptions of the target item play an essential
role in making predictions. These descriptions are termed content. The system uses the users'
past purchasing patterns and earlier ratings together with the content of the items to arrive
at predictions. The basic idea is that user interests can be determined from the features or
properties of the items they have rated or used previously. Content-based systems suggest
items by closely comparing the descriptions of the items with a user's profile. The features
of items are mapped to the features of users in order to get user-item
recommendations of documents to the users such as articles, papers, web logs, web pages,
publications and so on, content-based recommender systems are regarded as the most
successful filtering technique. Content-based filtering (CBF) uses many different models,
such as the vector space model, neural networks and decision trees, to measure document
similarity.
This approach may use historical browsing data, for example which blogs the user reads and
the attributes of those blogs. Imagine a user who normally reads history articles and
comments on blogs about software; then content-based filtering will
Another example of CBF is News Dude, a personal news system that uses synthesized speech to
read news stories to users. The TF-IDF model is applied to describe news stories in order to
determine the short-term recommendations, which are then compared with the
Hybrid Filtering
manage the large number of inputs, as the recommender system has the flexibility to handle
different types of tasks simultaneously. Where a large number of inputs are available, there
is a golden opportunity for hybridization, where the different aspects from various sorts
Problem Statement
After analysing the JBR of CUST, we found that it lacks a recommender system. A recommender
system guides users' choices. It increases the probability that users receive relevant
information, and it benefits them by saving time and increasing the efficiency of their
work. The recommender system will create an
experience. The recommender system uses the user's preferences to make accurate
recommendations related to their interests. The purpose of our study is to introduce a
recommender system for JBR in which the similarity between articles can be measured, so that
articles are easier for users to find. In order to evaluate the similarity index of text files
In this study, our point of interest is content-based filtering techniques, because we are
working on a recommender system for the Jinnah Business Review (JBR). The aim is to present a
system for JBR through which the similarity between articles can be measured, so that users
can find relevant information quickly. Generally, two text files are considered similar if
they describe similar ideas and are semantically close. Alternatively, "similarity" can be
used in the context of duplicate detection. We consider similarity in terms of the ideas that
articles share. We will quantify article similarity using the vector space model with cosine
similarity. In this model, articles are treated as vectors rather than as text. The angle
between two vectors characterizes their similarity: if the angle is small, the articles are
more similar, and if the angle is large, the articles are less similar. Cosine similarity can
be used to compare user profiles, item profiles or text documents. Using cosine similarity,
we will determine whether two articles share a similar idea.
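For two article vectors A and B, the cosine of the angle between them is cos(theta) = (A . B) / (||A|| ||B||). The sketch below illustrates this under simplifying assumptions: each article is represented by raw term counts after whitespace tokenization, with no stop-word removal or TF-IDF weighting, and the sample sentences are hypothetical rather than taken from JBR.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-count vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    norm_a = sqrt(sum(count * count for count in vec_a.values()))
    norm_b = sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction
    return dot / (norm_a * norm_b)


article_a = "brand loyalty and customer satisfaction"
article_b = "customer satisfaction and brand trust"
article_c = "monetary policy and inflation"

print(cosine_similarity(article_a, article_b))  # close to 1: small angle, similar ideas
print(cosine_similarity(article_a, article_c))  # close to 0: large angle, different ideas
```

Because term counts are never negative, the result lies between 0 (no shared terms) and 1 (identical term distribution).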
CHAPTER 02
BACKGROUND
A recommender system is used to suggest relevant items or products to the user. The
merchants by increasing sales. An increase in sales is possible only when the recommender
system shows items that match, or are relevant to, the customer's interests. The more
relevant items the recommender system shows, the more items the user will purchase, which
leads to higher profit for the trader. A recommender system gains the user's attention by
showing more and more relevant items, so a strong recommender system with clear and relevant
suggestions is more profitable to the merchant. Because the recommender system attracts the
user through its suggestions, it is very important to choose the relevant suggesting
The first and fundamental requirement of any recommended item is that it must be relevant to
the user. It must be attractive, because users generally buy products that interest them.
Relevance is a basic goal of a recommender system: if a product is not relevant, it will
never attract the customer. A recommender system can gain the attention of the user by
when it has never been seen by the user before. Items that are common do not attract the
user. If the recommender system shows the same items again, it also bores the customer. Sales
recommender system.
Another way to attract the user is to show something unexpected, so that the user feels lucky
to have discovered the item. Showing new things may not attract the user as much as showing
an unexpected item. This works when the recommender system truly surprises the user, rather
than merely showing something they have not seen before, because in some cases an item may be
new but still in line with the user's known interests. For instance a new English language
languages learning schools, it may not attract the user: although it is new to the user, it is
not surprising. On the other hand, if a Chinese or Japanese language learning institute is
suggested, it may surprise the user and gain the user's attention. Showing surprising items
is beneficial for increasing the sales diversity of items which are not much popular
system commonly suggests a list of top-k items. When all these suggested items are very
similar, the risk increases that the user will not like any of them. On the other hand, when
the suggested list contains items of different types, there is a greater chance that the
user will like at least one of them. Diversity has the advantage of ensuring that the user
does not get bored by repeated
recommendation system, which are useful for both the user and the merchant. From the user's
point of view, recommendations can help improve overall user satisfaction with the website.
For instance, a user who repeatedly receives relevant suggestions from Amazon.com will be
more satisfied with the experience and is more likely to use the site again. This can improve
user loyalty and further increase sales at the site. At the merchant end, the recommendation
process can provide insights into the needs of the user and help customize the user
experience further. Finally, giving the user an explanation of why a
There is wide diversity in the types of items recommended by such systems. Some recommender
systems, such as Facebook, do not directly recommend products. Rather, they may recommend
social connections, which have an indirect benefit to the site by increasing its usability
and advertising profits. In order to understand the nature of these goals, we will discuss
some well-known examples of historical and current recommender systems. These examples will
also showcase the broad diversity of recommender systems that were built either as research
prototypes or are available today as commercial systems for solving different business
problems.
GroupLens was one of the very first recommender systems, built as a research prototype for
recommending Usenet news. The system collected ratings from Usenet readers and used them to
predict whether or not other readers would like an article before they read it. Some of the
collaborative filtering algorithms were developed in the GroupLens setting. The general ideas
developed by this group were also extended to other product settings, such as books and
movies. The corresponding recommender systems were known as BookLens and MovieLens,
respectively. Besides its pioneering contributions to collaborative filtering research, the
GroupLens research group was notable for releasing several data sets during the early years
of this field,
Amazon.com was also one of the pioneers of recommender systems, particularly in the
commercial setting. During the early years, it was one of the few retailers that had the
foresight to realize the value of this technology. Originally founded as a book e-retailer,
the business expanded to virtually all forms of products. Consequently, Amazon.com now sells
virtually all categories of products, such as books, CDs, software, hardware, electronics and
so on. Recommendations in Amazon.com are provided on the basis of explicitly provided
ratings, buying behavior and browsing behavior. Ratings in Amazon.com are specified on a
5-point scale, with the lowest rating being 1 star and the highest rating being 5 stars.
User-specific buying and browsing data can easily be collected when users are logged in with
an account authentication mechanism supported by Amazon. Recommendations are also provided to
users on the main Web page of the site whenever they log into their accounts. In many cases,
explanations for recommendations are provided. For instance, the relationship of a
recommended item to previously purchased items may be included in the recommender system
interface. The purchase or browsing behavior of a user can be viewed as a type of implicit
rating, as opposed to an explicit rating, which is specified by the user. Many commercial
systems allow the flexibility of providing recommendations on the basis of both explicit and
implicit feedback. In fact, several models have been designed to jointly account for explicit
and implicit feedback in the recommendation process. Some of the algorithms used by early
versions
Social networks often recommend potential friends to users in order to increase the number of
social connections at the site. Facebook is one such example of a social networking website.
This kind of recommendation has somewhat different
increases the profit of the merchant by facilitating product sales, an increase in the number
of social connections improves the experience of a user at a social network. This, in turn,
encourages the growth of the social network. Social networks are heavily dependent on the
growth of the network to increase their advertising revenues. Therefore, the recommendation
of potential friends (or links) enables better growth and connectivity of the network. This
problem is also known as link prediction in the field of social
History
Recommender systems are a very popular and powerful tool of the digital world in the present
era. A recommender system helps people find what they are looking for among millions of
options, because a single user is offered a great many choices and alternatives. Today we can
easily find a solution on Amazon.com or Netflix, but in the past it was not that easy. Human
beings tend to follow each other: if one person goes in a specific direction, a line of
followers may soon form behind, and similarly, if a user chooses a product, a gradual
increase in demand for that product can be observed. Recommender systems work on a similar
concept. Information retrieval evolved in response to the need to ask questions about a large
collection of documents. Much of the early computing here was done because of large lawsuits
being handled in the computer industry, at companies like IBM, but the same technology
applies to libraries and their card catalogues, or even to companies that build indexes of
the World Wide Web. The principles are the same. You have a static, or mostly static, content
base: we do not publish new books as often as we read them, and we do not publish new
webpages as often as people navigate to them. But we have a dynamic information need. That
information need is what we sometimes call a query: an ephemeral interest that we want an
answer to. Because of this balance, we spend our time
The idea behind GroupLens, introduced by Paul Resnick, John Riedl and their students, was
that users reading news articles through GroupLens would rate the articles as they read them.
They simply entered a quick number from one to five, and users would be matched with each
other to find other people with similar tastes. When you went to a newsgroup to choose which
articles to read, you would get a personalized prediction of which articles you would like or
dislike, and how much, using a nearest-neighbour approach that combined the ratings of other
people like you. In the mid and late 90s, companies were springing up left and right.
GroupLens was not alone: months later, there was a system called Ringo and Homer from the MIT
Media Lab that turned into Firefly Networks. The GroupLens system became the company Net
Perceptions, and Firefly became Agents Inc. Work was being done left and right, and people
went out and got these things into business practice. Amazon was only one example of the
literally dozens and
recommender technology starting in the mid-90s and continuing to today. When the user
requests recommendations, the system uses its correlations to find a neighbourhood of users
who agree with this user on which films are good. It then uses those neighbours' opinions to
make recommendations to this user. The idea is that if you agree on many of the things you
have both already seen, then the things seen by a person who agrees with you may be good
recommendations among the things you have not seen. To see this as a table instead of a big
blob of users, let us look at a set of ratings and say I am trying to decide whether to watch
Blimp or Rocky XV tonight. We look for users who agree with me in our past ratings. We see
that Ben and I agree fairly well on a couple of films that we have both seen, and Nathan and
I also agree, swapping out one of the films. We also find that Joe and I disagree strongly,
but we disagree fairly consistently, so perhaps I may like something that he does not like.
Pat and I agree, but Pat has seen neither Rocky XV nor Blimp, so even though we agree, his
opinions are not that useful for figuring out what I should watch next. So now, we
Recommendation systems play a critical role in today's web. Here are some of the
The recommender system is based on actual user behavior, i.e. objective reality. This is the
preferences.
The recommendation system helps users identify the data related to their interests or
preferences. It benefits them by saving time and increasing the efficiency of study.
customers can be used to form clusters of similar customers who share similar choices of
items. The interests and activities of these clusters can be used to make suggestions to the
individual members of the cluster. If a new user from outside a cluster orders an item from
that cluster, the system will start suggesting him
User preferences never remain the same; they may change from time to time according to need,
but a recommender system can only recommend items on the basis of past preferences.
Another problem for recommender systems is change in data. Because trends are not constant
and always change, an algorithm may not get clear data for recommendation.
A lot of variables are required to make a single recommendation, because it need
recommendation.
Sometimes a user's interests are dynamic and past preferences are totally opposite from
CHAPTER 03
METHODOLOGY
Recommender system
E-commerce is at its peak in the modern era. Use of Recommendation system is
systems have become a very important aspect of e-commerce systems. The main purpose of
different items available. Although recommendation systems are used at a very large scale,
they face some shortcomings and problems.
The first important approach in recommendation is the prediction of the rating value for a
user-item combination. Here the assumption is that training data is available that records
users' preferences for specific items. An "M × N" matrix is created for M users and N items,
and the recorded values are used to train the model. A difficulty is that accuracy can suffer
because the approach rests on this assumption. This problem is often known as the matrix
completion problem, because the matrix of values is incompletely specified and the remaining
values are predicted by the learning algorithm. The other
necessary that predicted items must be based on the user's past preferences, because a user
may not want an item again that he liked in the past; he may be looking for a new item.
Recommendations therefore cannot always be made on the basis of ratings of user-specific
preferences, and the merchant should present something new to the user that may attract him
more than past items. The merchant may also want to promote a newly introduced product. The
merchant should recommend the top-k items to a selected user, or select the top-k users for
an item being promoted. The difficulty lies in selecting which items should be included in
the top-k. In this case the exact values of the predicted ratings are not necessary. If the
first problem (rating prediction) is solved, the second (top-k ranking) is resolved as well,
because its solutions can be derived from the solutions of the first; sometimes the second
problem can also be solved without the help of the first.
To make effective recommendations, a recommender system requires a lot of data. Collecting
and managing this huge volume of data may be the biggest problem for a recommender system,
because all the big companies and e-commerce systems that use recommender systems have
millions of users and correspondingly large data. The recommender systems with outstanding
recommendations, such as those of Amazon.com, Facebook, Netflix and Google, hold huge amounts
of user data. For the algorithm to work, the recommender system has to build an "M × N"
matrix, which can only be populated with a large amount of data. When many items and users
are included in the matrix, strong recommendations can be formed. So to get good
recommendations you have to manage huge data.
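As an illustration of the "M × N" matrix, the following sketch builds a small ratings matrix from observed (user, item, rating) triples and measures how much of it is still unobserved. The identifiers and ratings are hypothetical, and a real system would use a sparse data structure rather than nested lists.

```python
# Hypothetical observed ratings as (user, item, rating) triples.
observations = [
    ("u1", "i1", 5),
    ("u1", "i3", 3),
    ("u2", "i2", 4),
    ("u3", "i1", 2),
    ("u3", "i3", 1),
]

users = sorted({user for user, _, _ in observations})  # M users (rows)
items = sorted({item for _, item, _ in observations})  # N items (columns)


def build_matrix(observations, users, items):
    """Return an M x N list of lists; None marks an unobserved rating."""
    row = {user: r for r, user in enumerate(users)}
    col = {item: c for c, item in enumerate(items)}
    matrix = [[None] * len(items) for _ in users]
    for user, item, rating in observations:
        matrix[row[user]][col[item]] = rating
    return matrix


matrix = build_matrix(observations, users, items)
observed = sum(value is not None for r in matrix for value in r)
sparsity = 1 - observed / (len(users) * len(items))

for user, r in zip(users, matrix):
    print(user, r)
print(f"unobserved fraction: {sparsity:.2f}")
```

The None entries are the values that a matrix completion method would have to predict; the more of the matrix that is observed, the more evidence the prediction can draw on.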
Another problem is previous data. The recommender system may show the same items again and
again and ignore new ones, which creates a bias toward previous preferences. It is
tomorrow. He may look for some other product he needs, but the recommender system keeps
recommending the previous items. For instance, a user searched for T-shirts on Amazon.com
today, but the next day he may look for a good literature book, and the recommender system is
not able to recommend books automatically. That is why there is very few e-
Recommender systems are discovery assistants that help their users identify the items they
may like. Generally, recommender systems can be separated into two
types,
In this study, our focus is on content-based filtering models. Before we go on, it’s
The items whose attributes can be used as a component of the recommender
Attributes, or traits, of an item are the description of that particular item. For
Now let us explain what content-based recommender systems actually are.
Suppose someone asks you for a book recommendation; it is natural to ask what kinds of books
they like. From there, you could think of a few titles that are similar to the ones they have
enjoyed before. This process, of recommending content on the basis of its attributes, is the
fundamental element of content-based filtering, the technology behind Netflix
A content-based recommender system requires descriptive data about the available items as
content, along with a profile of the user that captures the user's preferences. In a
content-based recommender system, the users' past purchasing patterns and earlier ratings are
used together with the content of the items to arrive at predictions. This technique
recommends items to users by comparing the user preferences with the content
The basic idea behind the content-based recommender system is that user interests can be
determined from the features or properties of the items they have rated or used previously.
The descriptions of the target item play an essential role in making predictions. These
descriptions are termed content. Content-based systems suggest items by closely comparing the
descriptions of the items with a user's profile. The features of items are mapped to the
features of users in order to get user-item similarity. The best
documents to the users such as articles, papers, web logs, web pages, publications and so on,
content-based recommender systems are regarded as the most successful filtering
technique.
A content-based recommender system works with the information that the user provides,
either explicitly, i.e. in the form of ratings, or implicitly, i.e. simply by clicking on a link.
In light of that information, the system creates a user profile, which is then used to prepare
recommendations for the user. As the user provides more input or acts on the suggestions,
the system becomes increasingly
accurate.
The fundamental idea behind the content-based recommender system is to provide users
with recommendations related to their choices. In CUST, a user who needs articles to
study can visit the JBRC. Let us assume that a user reads an article from JBR and then
wants to review more articles related to the one he has just read. There is a need for a
system through which the user can easily access similar articles, and to achieve that we can
use a content-based recommender system. Such a system uses the past reading patterns of
the users to suggest similar articles to them. The accuracy of the recommendations can be
achieved by using a mathematical approach. The
Jaccard Similarity
Euclidean Distance
Cosine Similarity
Jaccard Similarity
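The ratio of the size of the intersection of two sets to the size of their union is known as
Jaccard similarity:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The union of the two sets contains every object from both sets, while the intersection of the
two sets contains only the objects that are common to the two sets.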
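As a minimal sketch of this measure, the function below treats two documents as sets of keywords and divides the size of the intersection by the size of the union. The function name `jaccard_similarity`, the keyword sets, and the convention of returning 0.0 for two empty sets are assumptions made for illustration only.

```python
def jaccard_similarity(a, b):
    """Return |A intersect B| / |A union B| for two iterables treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(set_a & set_b) / len(union)


# Hypothetical abstracts reduced to keyword sets
doc1 = {"recommender", "system", "content", "filtering"}
doc2 = {"recommender", "system", "collaborative", "filtering"}

score = jaccard_similarity(doc1, doc2)
print(score)  # 3 shared keywords out of 5 distinct keywords
```

Because only set membership is used, repeated words in a document do not change the score.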
Euclidean Distance
Euclidean distance is the straight-line distance between two points. In mathematics, it is
the square root of the sum of the squared differences between the corresponding coordinates
of the two points:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
Figure 7 Euclidean Distance
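The distance can be sketched in a few lines: take the coordinate-wise differences, square them, add them up and take the square root. The function name `euclidean_distance` and the sample points are illustrative assumptions.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


# The classic 3-4-5 right triangle
dist = euclidean_distance((0, 0), (3, 4))
print(dist)
```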
Cosine Similarity
The closeness between two non-zero vectors that represent products or items is known as
cosine similarity, which measures the cosine of the angle between the vectors. The value is 1
when the angle is 0°, while for every other angle the value is less than 1. Cosine similarity
is the normalized dot product of the two vectors. If the two vectors are parallel, the cosine
similarity between them is 1, while vectors that make an angle of 90° have a cosine
similarity of 0.
$$\mathrm{sim}(d_1, d_2) = \frac{V(d_1) \cdot V(d_2)}{\|V(d_1)\| \, \|V(d_2)\|}$$
where V denotes the vectors, while d_1 and d_2 represent the documents.
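The formula above can be sketched as a small function: the dot product of the two vectors divided by the product of their lengths. The name `cosine_similarity` and the sample vectors are assumptions for illustration, and raising an error for a zero vector is a design choice of this sketch.

```python
import math


def cosine_similarity(u, v):
    """Normalized dot product of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


parallel = cosine_similarity([1, 2], [2, 4])    # same direction
orthogonal = cosine_similarity([1, 0], [0, 3])  # 90 degrees apart
print(parallel, orthogonal)
```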
All of the above mathematical approaches can be used to measure the similarity between
documents in order to make recommendations for the users. As we have very little time to
implement our idea of introducing a recommender system for the JBRC of CUST, we will
focus only on cosine similarity at this stage. Cosine similarity is an
The vector space model (VSM) is an algebraic model that is also termed the term vector
model. This model is used to represent documents in the form of vectors. These vectors can
be added to one another and can also be multiplied by scalars. In this model each document
is treated as a bag of words, and the main goal is to discover the most similar documents.
The documents are considered as vectors instead of texts. In a vector space model each term
of the document has its own axis. We imagine all the documents as vectors in order to get
an angle between them, and we measure the difference by calculating the cosine of that
angle, which serves as the measure of distance between the documents. This distance
indicates the similarity between the documents. The distance between two documents is
either zero or positive; it can never be negative. If two documents make an angle of zero
degrees between them, they are perfectly similar to each other.
Here vector A is equal to {"H", "E", "L", "L", "O"} and vector B is also equal to
{"H", "E", "L", "L", "O"}. These two vectors are parallel, so the angle between them is zero
and cos 0° = 1. In contrast, the angle between vector A {"H", "E", "L", "L", "O"} and vector B
{"X", "Y"} is 90°, because they share no elements, and cos 90° = 0, which means that the
documents are totally different.
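This example can be checked with a small sketch that treats each string as a bag of characters, so that every distinct letter becomes one axis and the letter count becomes the coordinate on that axis. The helper name `char_cosine` is hypothetical, and the sketch assumes both strings are non-empty.

```python
import math
from collections import Counter


def char_cosine(a, b):
    """Cosine similarity of two non-empty strings viewed as bags of characters."""
    ca, cb = Counter(a), Counter(b)
    # A Counter returns 0 for characters it has not seen
    dot = sum(ca[ch] * cb[ch] for ch in ca)
    norm_a = math.sqrt(sum(n * n for n in ca.values()))
    norm_b = math.sqrt(sum(n * n for n in cb.values()))
    return dot / (norm_a * norm_b)


same = char_cosine("HELLO", "HELLO")    # identical bags, angle 0 degrees
different = char_cosine("HELLO", "XY")  # no shared letters, angle 90 degrees
print(same, different)
```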
What is a vector? A vector is a quantity that has both a direction and a magnitude.
Weight, force, velocity and momentum are some common examples of vectors, as they have
both properties of a vector, i.e. magnitude and direction. Vectors can be represented in both
two and three dimensions. A vector is drawn as a single line: the length of that line is
known as its magnitude, while the orientation of that line is known as its direction. The
vector line has an arrowhead at one end of the line
Figure 12 Vector
Figure 11 Vector representation in space
After building a clear understanding of vectors, it is now time to place the set of documents
in a space, treating each document as a vector. Such vectors can have more than 100
dimensions, as each term of the documents has a separate dimension.
Figure 13 Documents representation as vector
In the diagram, V(d_1) and V(d_2) are the vectors obtained from the documents d1 and d2,
while V(Q) is the query vector. The space has one axis for every term in the documents, but
to keep it simple only two terms are considered, one on each axis. The angle θ between
document 1 and the query vector determines the similarity of the document to the query.
The cosine of θ is used to measure how close the document is to
the query.
In information retrieval systems, term frequency and inverse document frequency are
considered important concepts. The term frequency, denoted tf, is the number of times a
particular term appears in a document. The inverse document frequency, denoted idf, gives
more weight to the terms that appear in few documents. Suppose a user searches Google for
“the rise of technology”. The term “the” will surely have a higher frequency than the term
“technology”, but from the query's point of view the term “technology” is the more
important one. In such situations, tf-idf discounts the impact of highly repeated words
when deciding the significance of a document.
$$\mathrm{idf} = \log \frac{N}{df}$$
where
N = number of documents
Here df is the number of documents that contain the term. So, if a specific term appears in
many documents, the idf reduces its weight automatically, and similarly if a term appears in
only a few documents, it receives a higher weight. The idf uses the logarithm in order to
dampen the effect of very large ratios.
In the process of measuring document similarity, the most popular weight is the product of
tf and idf:
$$\text{tf-idf} = tf \times idf$$
So if the term t appears many times in a small number of documents, it will have a high
weight, and similarly if the term t appears only a few times and occurs in a large number of
documents, it will have a low weight. These tf-idf weights are used for the similarity
measure.
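To make the weighting concrete, here is a minimal sketch of idf and tf-idf on a toy corpus. The three documents are hypothetical, the logarithm is taken to base 10, and returning 0.0 for a term that occurs in no document is an assumption of this sketch rather than part of the definition.

```python
import math

# Hypothetical toy corpus; each document is a list of lowercase tokens
docs = [
    "the rise of technology".split(),
    "the fall of the empire".split(),
    "the history of the printing press".split(),
]


def idf(term, corpus):
    """idf = log10(N / df); 0.0 if no document contains the term (assumed)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / df) if df else 0.0


def tf_idf(term, doc, corpus):
    """Raw term count in the document multiplied by the idf of the term."""
    return doc.count(term) * idf(term, corpus)


idf_the = idf("the", docs)          # appears in every document
idf_tech = idf("technology", docs)  # appears in one document only
print(idf_the, idf_tech)
print(tf_idf("the", docs[1], docs), tf_idf("technology", docs[0], docs))
```

The word "the" occurs twice in the second document, yet its tf-idf weight is zero because it occurs in every document, whereas "technology" keeps a positive weight.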
To measure document similarity in the vector space model, we will use cosine similarity
with tf-idf weights. As described earlier, cosine similarity is the normalized dot product of
two non-zero vectors, that is, the cosine of the angle between them. Its value is 1 when the
angle is 0° (parallel vectors) and less than 1 for every other angle, and vectors that make an
angle of 90° have a cosine similarity of 0.
$$\mathrm{sim}(d_1, d_2) = \frac{V(d_1) \cdot V(d_2)}{\|V(d_1)\| \, \|V(d_2)\|}$$
Here the numerator is the dot product of the two document vectors, while the denominator
is the product of the Euclidean lengths of the two document vectors.
Suppose the document vector d_1 has the weights W_1, W_2, W_3 and the document vector
d_2 has the weights X_1, X_2. Then
$$d_1 \cdot d_2 = W_1 \times X_1 + W_2 \times X_2$$
is the dot product of the two document vectors. There is no third weight in the document
vector d_2, therefore W_3 \times 0 = 0 and
$$d_1 \cdot d_2 = W_1 \times X_1 + W_2 \times X_2 + 0$$
The denominator is the product of the Euclidean lengths of the two document vectors, i.e.
$$\text{Euclidean length of } d_1 = \sqrt{W_1^2 + W_2^2 + W_3^2}$$
and
$$\text{Euclidean length of } d_2 = \sqrt{X_1^2 + X_2^2}$$
Let us work with a query that contains the word gossip and the word jealous, so the
vocabulary has two terms, gossip and jealous. The query has both of these terms. Document
d1 has gossip in it but does not have the word jealous. If we plot the vector for this
document, it lies close to the y axis, the gossip axis, because it contains only the word
gossip. Document d3 has the word jealous in it but not gossip. So d1 has gossip but not
jealous, d3 has jealous but not gossip, and d2 has both gossip and jealous, each appearing
multiple times. If we do not convert these vectors into unit vectors, the vector d2 extends
far out into the space.
The Euclidean distance between the query and the second document is larger than the
distance between the query and d3 and between the query and d1, even though d1 and d3
each have just one of the two query terms, whereas d2 has both query terms. The distance
between q and d2 is longer because we have not normalized these vectors. Normalization is
important: once we normalize, d2 becomes (nearly) equal to the query vector. Now when it gets close
There are actually two ways to think about this. One is to do the normalization first and
then compute the dot product; the other is to skip the normalization and use the formula
for cos θ. The cosine of the angle θ between the query vector and the document vector is the
dot product of the document vector and the query vector over the product of the
magnitudes of the two vectors:
$$\cos\theta = \frac{\vec{Q} \cdot \vec{D_2}}{|\vec{Q}| \, |\vec{D_2}|}$$
In other words, the cosine of the angle θ between the vector Q and the vector D is the dot
product of Q and D divided by the magnitude of Q and the magnitude of D. We can pair each
magnitude with its own vector: Q divided by the magnitude of Q is the unit vector in the
direction of Q, and likewise D divided by the magnitude of D is the unit vector in the
direction of D, i.e.
$$\hat{Q} = \frac{\vec{Q}}{|\vec{Q}|}$$
and
$$\hat{D} = \frac{\vec{D_2}}{|\vec{D_2}|}$$
We can either do the normalization first, or we can first take the dot product and then
divide it by the product of the magnitudes. It does not matter which way we do it; in either
case we are measuring the angle between the two vectors, or more specifically the cosine of
the angle. If we do that, we can see that D2 appears somewhere near the query vector. Even
if we measure the angle directly without normalization, the angle between Q and D2 is very
small, while the angle between Q and D1 and the angle between Q and D3 are large. That
means Q is closer to D2 than it is to D1 or D3.
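The effect described above can be reproduced with a short sketch. The term counts below are hypothetical, chosen only so that one document mentions gossip, one mentions jealous, and one mentions both terms many times; the axes are ordered as (jealous, gossip).

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    len_u = math.sqrt(sum(a * a for a in u))
    len_v = math.sqrt(sum(b * b for b in v))
    return dot / (len_u * len_v)


# Hypothetical counts on the axes (jealous, gossip)
q = (1, 1)    # query with both terms
d1 = (0, 1)   # gossip only
d2 = (6, 6)   # both terms, each repeated several times
d3 = (1, 0)   # jealous only

for name, d in (("d1", d1), ("d2", d2), ("d3", d3)):
    print(name, "distance:", euclidean(q, d), "cosine:", cosine(q, d))
```

By raw Euclidean distance d2 is the farthest from the query, yet by cosine it is the closest, because cosine compares directions and ignores vector length.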
If the two vectors are already expressed as unit vectors, the cosine of the angle between
them is just their dot product. So if Q and D have already been length-normalized, the
cosine of the angle between them is simply the dot product, and the dot product is nothing
but the products of their components added together: take the product component by
component and add the results. For Q and D length-normalized:
$$\cos(\vec{Q}, \vec{D}) = \vec{Q} \cdot \vec{D} = \sum_{i=1}^{|V|} Q_i D_i$$
And if these are not normalized, then we take the actual dot product and divide it by the
product of the magnitudes:
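$$\cos(\vec{Q}, \vec{D}) = \frac{\vec{Q} \cdot \vec{D}}{|\vec{Q}| \, |\vec{D}|} = \frac{\vec{Q}}{|\vec{Q}|} \cdot \frac{\vec{D}}{|\vec{D}|} = \frac{\sum_{i=1}^{|V|} Q_i D_i}{\sqrt{\sum_{i=1}^{|V|} Q_i^2} \, \sqrt{\sum_{i=1}^{|V|} D_i^2}}$$
where $\frac{\vec{Q}}{|\vec{Q}|} \cdot \frac{\vec{D}}{|\vec{D}|}$ is the form that uses the unit vectors and
$\frac{\vec{Q} \cdot \vec{D}}{|\vec{Q}| \, |\vec{D}|}$ is the form that uses the dot product.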
The components Q_i and D_i are tf-idf weights. The ith component of the vector Q, viz.
Q_i, is the tf-idf weight of the ith term of the vocabulary in the query, and similarly the ith
component of the vector D, viz. D_i, is the tf-idf weight of the ith term of the vocabulary in
the document; the two are multiplied with one another. In the above equation,
cos(Q, D) is the cosine similarity of vector Q and vector D, or equivalently the cosine of the
angle between vector Q and vector D.
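The two routes described above, normalizing first or dividing by the magnitudes afterwards, give the same number. The sketch below checks this on hypothetical tf-idf weight vectors; the helper names `normalize`, `dot` and `cosine` are illustrative only.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def cosine(u, v):
    """Dot product divided by the product of the magnitudes."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hypothetical tf-idf weights over a three-term vocabulary
q = [0.2, 0.5, 0.0]
d = [0.4, 0.1, 0.7]

via_unit_vectors = dot(normalize(q), normalize(d))
via_formula = cosine(q, d)
print(via_unit_vectors, via_formula)
```

Normalizing once and storing the unit vectors is convenient when the same documents are compared against many queries, since each comparison then needs only a dot product.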
If we convert the vectors into unit vectors, we get a unit circle, and all the vectors align
along this unit circle or, in general, along a v-dimensional surface. All these vectors have
their end points on the unit circle.
Figure 15 Vectors represented in a v dimensional surface
Now let us take an example. Suppose we have three classic novels: Sense and Sensibility
(SaS), Pride and Prejudice (PaP) and Wuthering Heights (WH). Our attention will be on
four terms in the novels: affection, jealous, gossip and wuthering. In this
example we will not do the idf weighting. The term frequency count of the above mentioned
Terms       SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38
So these are three documents, and because there are four terms, we represent these three
documents in a four-dimensional space and get three points, one for each document.
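Now let us compute the log-weighted term frequency of each term, 1 + log10(tf), taking the
weight as 0 when tf = 0 (idf weighting is not applied in this example). For instance, for
affection in SaS, 1 + log10(115) = 3.06.
Terms       Log weighting of SaS   Log weighting of PaP   Log weighting of WH
affection   3.06                   2.76                   2.30
jealous     2.00                   1.85                   2.04
gossip      1.30                   0                      1.78
wuthering   0                      0                      2.58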
Now if we want to compute how close these points are to one another in the v-dimensional
space, we also have to do length normalization. Of course, we could directly compute the
angle between the three vectors, but here we first do the length normalization before taking
the cosine score.
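Terms       SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588
For example, the normalized weight of wuthering in WH is
2.58 / √(2.30² + 2.04² + 1.78² + 2.58²) = 0.588.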
We can verify the unit length of the vectors: the squares of the components of each
normalized vector sum to 1.
So we have three vectors (SaS, PaP, WH) in a four-dimensional space. First, we compute
the cosine score between the two novels Sense and Sensibility and Pride and Prejudice. In
this case, the cosine score is given simply by the dot product of the two vectors, because
they have already been normalized. So it is just a simple dot product of SaS and PaP.
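The whole example can be reproduced with a short sketch that starts from the raw term counts, applies the log weighting 1 + log10(tf), normalizes each vector to unit length and then takes dot products. The variable and function names are illustrative; the counts are the ones listed for the three novels.

```python
import math

terms = ["affection", "jealous", "gossip", "wuthering"]
counts = {
    "SaS": [115, 10, 2, 0],
    "PaP": [58, 7, 0, 0],
    "WH": [20, 11, 6, 38],
}


def log_weight(tf):
    """1 + log10(tf) for a positive count, 0 for a missing term."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def unit_vector(term_counts):
    weights = [log_weight(tf) for tf in term_counts]
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights]


vectors = {name: unit_vector(c) for name, c in counts.items()}


def cosine(a, b):
    """Dot product of two already length-normalized vectors."""
    return sum(x * y for x, y in zip(vectors[a], vectors[b]))


for pair in (("SaS", "PaP"), ("SaS", "WH"), ("PaP", "WH")):
    print(pair, round(cosine(*pair), 2))
```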
If we compute the cosine score between Sense and Sensibility and Wuthering Heights, the
cosine score is 0.79, and if we compute the score between Pride and Prejudice and
Wuthering Heights, we get a score of 0.69. The cosine score between Sense and
Sensibility and Pride and Prejudice is quite high, around 0.94, which is higher than the
other two cosine scores. This shows mathematically that SaS and PaP are closer to each
other, because the cosine of the angle between them is greater than the others, i.e.
cos(SaS, PaP) > cos(SaS, WH). In other words, the reason behind their similarity might be that the
The example discussed above uses the concept of normalization of vectors. Now we will
consider another example in which the similarity between documents is measured and in
which we also have a query vector. We have to remove the stop words from the query vector
According to the above-mentioned details, the total number of documents is n = 3. Now
$$\mathrm{idf} = \log_{10} \frac{n}{df}$$
$$\text{tf-idf} = tf \times idf$$
         Term frequency                            Weights = tf × idf
Terms    Q    D1   D2   D3   df   n/df   IDF      Q    D1   D2       D3
a        0    0    1    0    1    3      0.4771   0    0    0.4771   0
After calculating the tf-idf weights, we have to calculate the length of each vector, which is
its Euclidean norm:
$$|D_i| = \sqrt{\sum_j w_{i,j}^2}$$
where w_{i,j} is the tf-idf weight of term j in document i.
After calculating the vector lengths, we compute the dot product of each document vector
with the query vector:
$$Q \cdot D_i = \sum_j (w_{Q,j} \times w_{i,j})$$
Document 1 has two terms in common with the query, i.e. people and accident, so we add
the products of the query weight and the document weight for these terms.
Document 2 also has two terms in common with the query, i.e. people and bus, so we add
the products of the query weight and the document weight for these terms.
$$Q \cdot D_2 = (0.1760 \times 0.1760) + (0.1760 \times 0.1760) = 0.0620$$
Document 3 also has two terms in common with the query, i.e. accident and bus, so we add
the products of the query weight and the document weight for these terms.
After computing the dot products, the final step in measuring the similarity between the
documents is to find the cosine of the angle between each document vector and the query
vector.
Document 1
$$\cos\theta(D_1) = \frac{Q \cdot D_1}{|Q| \times |D_1|} = \frac{0.0620}{0.3048 \times 0.5930} \approx 0.34$$
Document 2
$$\cos\theta(D_2) = \frac{Q \cdot D_2}{|Q| \times |D_2|} = \frac{0.0620}{0.3048 \times 0.8809} \approx 0.23$$
Document 3
$$\cos\theta(D_3) = \frac{Q \cdot D_3}{|Q| \times |D_3|} = \frac{0.0620}{0.3048 \times 0.7404} \approx 0.27$$
So we see that the document D_1 has the greatest value of cos θ, so it is the most similar to
the query.
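The steps of this example (idf, tf-idf weights, vector lengths, dot products and cosine scores) can be sketched end to end as follows. The three short documents are hypothetical stand-ins, since only their overlap with the query terms people, bus and accident is known here, so the resulting scores illustrate the procedure and are not the figures of the worked example.

```python
import math
from collections import Counter

# Hypothetical documents; only the query terms come from the example
docs = {
    "D1": "people injured in accident".split(),
    "D2": "people waiting for a bus".split(),
    "D3": "bus accident on the main highway".split(),
}
query = "people bus accident".split()

n = len(docs)
vocab = sorted({term for tokens in docs.values() for term in tokens})
df = {t: sum(1 for tokens in docs.values() if t in tokens) for t in vocab}
idf = {t: math.log10(n / df[t]) for t in vocab}


def weights(tokens):
    """tf-idf weight of every vocabulary term for one token list."""
    tf = Counter(tokens)
    return [tf[t] * idf[t] for t in vocab]


def length(v):
    return math.sqrt(sum(w * w for w in v))


def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (length(u) * length(v))


q_vec = weights(query)
scores = {name: cosine(q_vec, weights(tokens)) for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this sketch every document shares two terms with the query, so the dot products are equal and the ranking is decided by the document lengths: the shortest vector makes the smallest angle with the query.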
CHAPTER 04
A content-based recommender system is efficient because it uses only the content of
each item for making recommendations, so the problem of huge data does not occur
The biggest advantage of the content based recommender system is that its
The Recommendations of the content based system are made through interest of a
Content based recommender system can give the logic of their recommendations. This
The biggest drawback of the content-based recommender system is the huge size of the
data set for items. As content-based systems use the sets of items that relate more to the user
consider each and every term of the set in the content based system.
The results of any recommender system are not 100% accurate, so it is not easy to
The content-based system is complex because there is no data that exactly defines
the user’s interest. The choices of different users are unclear and vary from time to time,
expensive, as it requires highly skilled experts to manage the system which i
This system increases labour expenses, which are not affordable for many businesses.
Conclusion
In this project we have studied recommender systems: their background, the different
types, the pioneers of recommender systems, world-famous examples of recommender
systems and how they work. Our main focus was JBR; we analyzed the problems and flaws
that exist in the JBRC of CUST. At present, papers are simply deposited in the JBRC, as it
has no recommender system. In order to address these shortcomings of JBR in practice, we
have made a prototype example using the vector space model with cosine similarity. Due to
a shortage of time and some other limitations, we could not bring it to a running form. We
identified that JBR lacks a semantic system. The user may not find the relevant item,
because the system is not able to recommend exactly what the user is looking for, or
anything that is not on the user's mind at that moment, but a recommender system can
reduce all these flaws of JBR. It will work on the basis of past information. The user will
type a query and the system will allow the user to quickly access the articles that are
closest to the query. It will be helpful for the users and will also enhance efficiency at work.
These recommender systems can also be utilized in a broader sense. A database can be
created for the searched articles, and the system will rank the articles in the database on the
basis of user priority; in such a situation, knowledge-based recommender systems will be
used. These systems are considered intelligent, as they will be able to change the priority
Introducing such a system requires a great deal of time, and in this project we were short
of time, so initially we have focused on a very small portion. At this stage we will
consider only the abstracts of papers and compare their similarity using the vector space