
Below is some Python code that connects to a graph database and generates recommendations.

Getting Started with Neo4J

To create and explore your own graph with Neo4J you would normally need Java or Groovy. I found Bulbflow, an open-source Python ORM for graph databases that supports pluggable backends using the Blueprints standard. In this post I used it to connect to a Neo4j server. The snippet below is a simple example of Bulbflow in action, creating two vertices and an edge between them.

>>> from people import Person, Knows
>>> from bulbs.neo4jserver import Graph
>>> g = Graph()
>>> g.add_proxy("people", Person)
>>> g.add_proxy("knows", Knows)
>>> james = g.people.create(name="James")
>>> julie = g.people.create(name="Julie")
>>> g.knows.create(james, julie)

Generating our tutorials Graph

I decided to define my own graph schema to map the raw data into a property graph, so that the traversals needed to recommend which tutorials to check would be as natural as possible.

SnapGuide Graph Schema

The data will be inserted into the Neo4J graph database. The code below creates a new Neo4J graph from the whole data set.

# -*- coding: utf-8 -*-
from bulbs.neo4jserver import Graph
from nltk.tag.hunpos import HunposTagger
from nltk.tokenize import word_tokenize

ht = HunposTagger('en_wsj.model')
likes = open('likes.csv')
tutorials = open('tutorials.csv')
users = open('users.csv')

g = Graph()

def filter_nouns(words):
    # Keep only the nouns (NN, NNP, NNPS) as lower-cased keywords.
    return [word.lower() for word, cat in words if cat in ['NN', 'NNP', 'NNPS']]

# Loading tutorials and categories.
for tutorial in tutorials:
    tutorial = tutorial.strip()
    try:
        # The like count gets its own name so it does not shadow the likes file.
        ID, title, n_likes, category = tutorial.split(';')
    except ValueError:
        try:
            ID, title, category = tutorial.split(';')
        except ValueError:
            # Extra separators inside the row: keep first, second and last field.
            t = tutorial.split(';')
            ID, title, category = t[0], t[1].replace('&Yuml', ''), t[-1]
    tut = g.vertices.create(type='Tutorial', tutorialId=int(ID), title=title)
    keywords = filter_nouns(ht.tag(word_tokenize(tutorial.split(';')[1])))
    keywords.append(category)
    for keyword in keywords:
        # Reuse the category vertex if it already exists.
        resp = g.vertices.index.lookup(category=keyword)
        if resp is None:
            ct = g.vertices.create(type='Category', category=keyword)
        else:
            ct = resp.next()
        g.edges.create(tut, 'hasCategory', ct)

# Loading the user dataset.
for user in users:
    user = user.strip()
    username = user.split(';')[0]
    user = g.vertices.create(type='User', userId=username)

# Loading the likes dataset.
for like in likes:
    like = like.strip()
    item_id, user_id = like.split(';')
    p = g.vertices.index.lookup(tutorialId=int(item_id))
    q = g.vertices.index.lookup(userId=user_id)
    g.edges.create(q.next(), 'liked', p.next())

There are three input files: tutorials.csv, users.csv and likes.csv. The file tutorials.csv contains the list of tutorials. Each row has three columns: tutorialId, title and category (some rows also carry a like count, which the loader ignores). The file users.csv contains the list of users. Each row contains the columns userId and user name. Finally, likes.csv lists the tutorials that users marked as liked. Each row pairs a tutorialId with the userId of the user who liked it. Given that there are more than 1 million likes, it will take some time to process all the data.

An important note before going on: don't forget to create the vertex indexes. If you forget them, your queries will take ages to process.

//These indexes are a must, otherwise querying the graph database will take so looong
g.createKeyIndex('userId',Vertex.class)
g.createKeyIndex('tutorialId',Vertex.class)
g.createKeyIndex('category',Vertex.class)
g.createKeyIndex('title',Vertex.class)
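To see why the key indexes matter, here is a minimal plain-Python sketch of the difference between scanning every vertex and looking a value up in an index. It is only an analogy: the toy `vertices` list and both helper functions are hypothetical, and Neo4J's real index implementation is not shown here.

```python
from collections import defaultdict

# Toy vertices standing in for graph nodes (hypothetical data).
vertices = [
    {"type": "User", "userId": "amanda"},
    {"type": "User", "userId": "james"},
    {"type": "Tutorial", "tutorialId": 1, "title": "Roast a Chicken"},
]

def scan_lookup(key, value):
    """Without an index: inspect every vertex on each query."""
    return [v for v in vertices if v.get(key) == value]

def build_index(key):
    """With an index: one pass up front, then direct dictionary lookups."""
    index = defaultdict(list)
    for v in vertices:
        if key in v:
            index[v[key]].append(v)
    return index

user_index = build_index("userId")

# Both approaches find the same vertex; only the cost per query differs.
same_result = scan_lookup("userId", "james") == user_index["james"]
```

With more than a million `liked` edges to create, each one needing two lookups, the per-query cost is what dominates the loading time.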

Before moving on to the recommender algorithms, let's make sure the graph is OK. For instance, what is the distribution of keywords across the tutorial repository?

//Distribution frequency of Categories
def dist_categories(){
    m = [:]
    g.V.filter{it.getProperty('type')=='Tutorial'}.out('hasCategory').category.groupCount(m).iterate()
    return m.sort{-it.value}
}

>>> script = g.scripts.get('dist_categories')
>>> categories = g.gremlin.execute(script, params=None)
>>> sorted(categories.content.items(), key=lambda keyword: keyword[1], reverse=True)[:10]
[(u'food', 4537), (u'make', 3840), (u'arts-crafts', 1609), (u'cook', 1362), (u'desserts', 1247), (u'beauty', 1108), (u'technology', 943), (u'drinks', 587), (u'home', 508), (u'chicken', 452)]

What about the average number of likes per tutorial?

//Get the average number of likes per tutorial
def avg_likes(){
    return g.V.filter{it.getProperty('type')=='Tutorial'}.transform{it.in('liked').count()}.mean()
}
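For readers less familiar with Gremlin, the two sanity checks (a grouped count of categories and a mean of likes per tutorial) can be mirrored in plain Python. This is only a sketch on hypothetical toy edges, not the code that ran against the real graph.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy edges: (tutorial, category) and (user, tutorial).
has_category = [
    ("Roast a Chicken", "food"),
    ("Roast a Chicken", "chicken"),
    ("Make Mint Juleps", "drinks"),
    ("Make Potato Latkes", "food"),
]
liked = [
    ("amanda", "Roast a Chicken"),
    ("james", "Roast a Chicken"),
    ("amanda", "Make Mint Juleps"),
]
tutorials = ["Roast a Chicken", "Make Mint Juleps", "Make Potato Latkes"]

# Counterpart of groupCount(m) followed by a descending sort.
category_counts = Counter(category for _, category in has_category)
ranked = category_counts.most_common()

# Counterpart of transform{it.in('liked').count()}.mean():
# tutorials without any like still count, with zero likes.
likes_per_tutorial = Counter(title for _, title in liked)
avg_likes = mean(likes_per_tutorial[t] for t in tutorials)
```

Note that tutorials nobody liked are included in the average, just as the Gremlin version counts incoming `liked` edges for every Tutorial vertex.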

>>> script = g.scripts.get('avg_likes')
>>> likes = g.gremlin.command(script, params=None)
>>> likes
111.089116326

Traversing the Tutorials Graph

Now that the data is represented as a graph, let's run some queries. Behind the scenes, each query is a traversal. In recommender systems there are two general types of recommendation approaches: collaborative filtering and content-based filtering. In collaborative filtering, the liking behavior of users is correlated in order to recommend the favorites of one user to another; in this case we look for a similar user.

I like the tutorials Amanda preferred, what other tutorials does Amanda like that I haven't seen?
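The Amanda question can be sketched in a few lines of plain Python. Everything here is a hypothetical illustration: the `likes` dictionary is toy data, and "similar" is reduced to the simplest possible measure, the number of tutorials two users have in common.

```python
# Hypothetical likes: user -> set of tutorial titles.
likes = {
    "me": {"Roast a Chicken", "Make Potato Latkes"},
    "amanda": {"Roast a Chicken", "Make Potato Latkes", "Make Mint Juleps"},
    "james": {"Solve a 3x3 Rubik's Cube"},
}

def most_similar_user(user):
    """Pick the other user who shares the most liked tutorials."""
    others = (u for u in likes if u != user)
    return max(others, key=lambda u: len(likes[user] & likes[u]))

def unseen_favorites(user):
    """Tutorials the most similar user likes that `user` has not liked yet."""
    neighbor = most_similar_user(user)
    return neighbor, likes[neighbor] - likes[user]

neighbor, suggestions = unseen_favorites("me")
```

Here `neighbor` is Amanda, because she shares two likes with me, and `suggestions` holds the one tutorial she likes that I have not seen.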

The content-based strategy, in contrast, relies on the features of a recommendable item. The item's attributes are analyzed in order to find other items with similar features.

I like food tutorials, what other food tutorials are there?
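A content-based lookup needs no user data at all, only item attributes. The sketch below assumes a hypothetical `categories` mapping (the category assignments are invented for illustration) and ranks other tutorials by how many categories they share with a given one.

```python
# Hypothetical tutorial -> categories mapping.
categories = {
    "Roast a Chicken": {"food", "chicken"},
    "Make Chicken Enchiladas": {"food", "chicken"},
    "Make Potato Latkes": {"food"},
    "Solve a 3x3 Rubik's Cube": {"games"},
}

def similar_by_content(title):
    """Rank other tutorials by the number of categories shared with `title`."""
    target = categories[title]
    scored = [
        (other, len(target & cats))
        for other, cats in categories.items()
        if other != title and target & cats
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

matches = similar_by_content("Roast a Chicken")
```

Tutorials without any shared category, such as the Rubik's Cube one, never enter the ranking.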

Making Recommendations

Let's begin with collaborative filtering. I will run some more complex traversal queries on our graph. Let's start with the tutorial "How to Make Sous Vide Chicken at Home". Yes, I love chicken! :)

Great dish by the way!

Which users liked Make Sous Vide Chicken at Home?

//Get the users who liked a tutorial
def users_liked(tutorial){
    v = g.V.filter{it.getProperty('title') == tutorial}
    return v.inE('liked').outV.userId[0..4]
}

The script called below, n_users_liked, is not listed here; judging by its output, it counts the users who liked the tutorial instead of listing the first five.

>>> tuts = g.vertices.index.lookup(title='Make Sous Vide Chicken at Home')
>>> tut = tuts.next()
>>> tut.title
Make Sous Vide Chicken at Home
>>> tut.tutorialId
11890
>>> tut.type
Tutorial
>>> script = g.scripts.get('n_users_liked')
>>> users_liked = g.gremlin.command(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> users_liked
1000

This traversal doesn't give us much useful information on its own, but we can now put collaborative filtering into action with an extended query: which users liked Make Sous Vide Chicken at Home, and which other tutorials did they like too?

//Get the users who liked the tutorial and what other tutorials did they like too ?
def similar_tutorials(tutorial){
    v = g.V.filter{it.getProperty('title') == tutorial}
    return v.inE('liked').outV.outE('liked').inV.title[0..4]
}

>>> script = g.scripts.get('similar_tutorials')
>>> similar_tutorials = g.gremlin.execute(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> similar_tutorials.content
[u'Make Potato Latkes', u'Make Beeswax and Honey Lip Balm', u'Make Sous Vide Chicken at Home', u'Cook the Perfect & Simple Chicken Ramen Soup', u'Make a Simple (But Authentic) Paella on Your BBQ']

What does the query above express? It finds all users who liked the tutorial (inE('liked').outV), finds out what they liked (outE('liked').inV), and fetches the titles of those tutorials (title). It returns the first five items ([0..4]).

In recommendations we have to find the most commonly purchased or liked items. Using Gremlin, we can build a simple collaborative filtering algorithm by joining several steps together.
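The step-by-step reading of the traversal can be imitated with a Python generator, which walks one path at a time much like a lazy Gremlin pipeline. The two dictionaries are hypothetical stand-ins for the `liked` edges, stored in both directions.

```python
from itertools import islice

# Hypothetical 'liked' edges, stored in both directions.
liked_by = {"Roast a Chicken": ["amanda", "james"]}
likes = {
    "amanda": ["Roast a Chicken", "Make Potato Latkes"],
    "james": ["Roast a Chicken", "Make Mint Juleps", "Make Potato Latkes"],
}

def co_liked_titles(title):
    """Lazily walk tutorial <- user -> tutorial, yielding one title per path."""
    for user in liked_by.get(title, []):     # inE('liked').outV
        for other in likes.get(user, []):    # outE('liked').inV.title
            yield other

# Counterpart of the [0..4] range: stop after the first five paths.
first_five = list(islice(co_liked_titles("Roast a Chicken"), 5))
```

The result contains the starting tutorial itself and repeated titles, because every path is emitted separately.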

//Get similar tutorials
def topMatches(tutorial){
    m = [:]
    v = g.V.filter{it.getProperty('title') == tutorial}
    v.inE('liked').outV.outE('liked').inV.title.groupCount(m).iterate()
    return m.sort{-it.value}[0..9]
}

>>> script = g.scripts.get('topMatches')
>>> topMatches = g.gremlin.execute(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> topMatches.content
{u'Make Cake Pops!!': 75, u'Make Sous Vide Chicken at Home': 1000, u'Make Potato Latkes': 124, u'Make Incredible Beef Jerky at Home Easily!': 131, u'Cook the Perfect & Simple Chicken Ramen Soup': 96, u'Make Mint Juleps': 74, u"Solve a 3x3 Rubik's Cube": 89, u'Cook Lamb Shanks Moroccan Style': 74, u'Make Beeswax and Honey Lip Balm': 75, u'Make an Aerium': 74}

This traversal returns a list of tutorials. But if you fetch all the matches, you will notice that there are many duplicates. This happens because the people who like How to Make Sous Vide Chicken at Home also like many of the same other tutorials. This overlap between users is exactly the similarity that collaborative filtering algorithms rely on.

How many of the tutorials highly correlated with How to Make Sous Vide Chicken at Home are unique?

//Get the number of similar tutorials
def n_similar_tutorials(tutorial){
    v = g.V.filter{it.getProperty('title') == tutorial}
    return v.inE('liked').outV.outE('liked').inV.count()
}

//Get the number of unique similar tutorials
def n_similar_unique_tutorials(tutorial){
    v = g.V.filter{it.title == tutorial}
    return v.inE('liked').outV.outE('liked').inV.dedup.count()
}

>>> script = g.scripts.get('n_similar_tutorials')
>>> similar_tutorials = g.gremlin.command(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> similar_tutorials
37323
>>> script = g.scripts.get('n_similar_unique_tutorials')
>>> similar_tutorials = g.gremlin.command(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> similar_tutorials
8766

There are 37323 paths from Make Sous Vide Chicken at Home to other tutorials, and only 8766 of those tutorials are unique. We can use these duplicates as a ranking mechanism for recommendations: the more paths lead to a tutorial, the higher it ranks.
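The relation between paths, unique tutorials and ranking is easy to reproduce on a handful of made-up values. In this sketch the `paths` list is hypothetical and simply holds one title per traversal path.

```python
from collections import Counter

# Hypothetical co-liked titles, one entry per traversal path.
paths = [
    "Make Potato Latkes", "Make Mint Juleps", "Make Potato Latkes",
    "Make Cake Pops!!", "Make Potato Latkes", "Make Mint Juleps",
]

total_paths = len(paths)           # counterpart of inV.count()
unique_titles = len(set(paths))    # counterpart of inV.dedup.count()

# The number of paths reaching a title becomes its score.
ranking = Counter(paths).most_common(2)
```

The gap between `total_paths` and `unique_titles` is what carries the ranking signal: without duplicates every tutorial would score 1.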

Which tutorials are most highly co-rated with How to Make Sous Vide Chicken?

>>> script = g.scripts.get('topMatches')
>>> topMatches = g.gremlin.execute(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> sorted(topMatches.content.items(), key=lambda keyword: keyword[1], reverse=True)[:10]
[(u'Make Sous Vide Chicken at Home', 1000), (u'Make Incredible Beef Jerky at Home Easily!', 131), (u'Make Potato Latkes', 124), (u'Cook the Perfect & Simple Chicken Ramen Soup', 96), (u"Solve a 3x3 Rubik's Cube", 89), (u'Make Cake Pops!!', 75), (u'Make Beeswax and Honey Lip Balm', 75), (u'Make Mint Juleps', 74), (u'Cook Lamb Shanks Moroccan Style', 74), (u'Make an Aerium', 74)]

So we have the top similar tutorials. It means that people who like Make Sous Vide Chicken at Home also like Make Sous Vide Chicken at Home, oops! Let's remove these reflexive paths by filtering out the Sous Vide Chicken tutorial itself.

//Get similar tutorials
def topUniqueMatches(tutorial){
    m = [:]
    v = g.V.filter{it.getProperty('title') == tutorial}
    possible_tutorials = v.inE('liked').outV.outE('liked').inV
    possible_tutorials.hasNot('title',tutorial).title.groupCount(m).iterate()
    return m.sort{-it.value}[0..9]
}

>>> script = g.scripts.get('topUniqueMatches')
>>> topMatches = g.gremlin.execute(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> topMatches.content
[(u'Make Incredible Beef Jerky at Home Easily!', 131), (u'Make Potato Latkes', 124), (u'Cook the Perfect & Simple Chicken Ramen Soup', 96), (u"Solve a 3x3 Rubik's Cube", 89), (u'Make Cake Pops!!', 75), (u'Make Beeswax and Honey Lip Balm', 75), (u'Make Mint Juleps', 74), (u'Cook Lamb Shanks Moroccan Style', 74), (u'Make an Aerium', 74), (u'Make a Leather iPhone Flip Wallet', 73)]

The recommendation above starts from a particular tutorial (i.e. Make Sous Vide Chicken at Home), not from a particular user. This collaborative filtering method is called item-based filtering: given a tutorial that a user likes, who else likes this tutorial and, among those users, what other tutorials do they like that the initial user has not already liked?
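The item-based definition has one more condition than the traversal shown so far: tutorials the initial user already likes should be skipped too. A minimal sketch of that full definition, on hypothetical toy data and with a hypothetical `item_based` helper, could look like this.

```python
from collections import Counter

# Hypothetical likes: user -> set of tutorial titles.
likes = {
    "me": {"Roast a Chicken", "Make Potato Latkes"},
    "amanda": {"Roast a Chicken", "Make Potato Latkes", "Make Mint Juleps"},
    "james": {"Roast a Chicken", "Make Mint Juleps", "Make Cake Pops!!"},
}

def item_based(seed, user, top_n=10):
    """Count tutorials co-liked with `seed`, skipping what `user` already likes."""
    already_liked = likes[user]
    counts = Counter()
    for other_user, titles in likes.items():
        if other_user == user or seed not in titles:
            continue
        # Drop the seed itself (reflexive path) and the user's own likes.
        counts.update(titles - already_liked - {seed})
    return counts.most_common(top_n)

recs = item_based("Roast a Chicken", "me")
```

Make Potato Latkes is co-liked by Amanda but does not show up in `recs`, because the initial user already likes it.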

And what about recommendations for a particular user? This is where user-based filtering comes in. Given a specific user, which tutorials liked by similar users should be recommended?

def userRecommendations(user){
    m = [:]
    x = [] as Set
    v = g.V.filter{it.getProperty('userId') == user}
    v.out('liked').aggregate(x).in('liked').dedup.out('liked').except(x).title.groupCount(m).iterate()
    return m.sort{-it.value}[0..9]
}

>>> script = g.scripts.get('userRecommendations')
>>> recommendations = g.gremlin.execute(script, params={'user': 'emmarushin'})
>>> recommendations.content
[(u'Create a Real Fisheye Picture With Your iPhone', 1156), (u'Make a DIY Galaxy Print Tshirt', 933), (u'Make a Macro Lens for Free!', 932), (u'Make Glass Marble Magnets With Any Image', 932), (u'Make DIY Nail Decals', 932), (u'Make a Five Strand Braid', 929), (u'Create a Pendant Lamp From Coffee Filters', 928), (u'Make Avocado Toast', 926), (u'Make Instagram Magnets for Less Than $10', 923), (u'Make a Recycled Magazine Tree (Christmas Tree)', 923)]

Emma Rushin will really like art and crafts suggestions! :D
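The user-based traversal can be read as three steps: remember what the user liked, collect everyone who liked any of those tutorials, then count what those people liked beyond the user's own list. The sketch below mirrors these steps in plain Python; the user names and their likes are invented for illustration.

```python
from collections import Counter

# Hypothetical likes: user -> set of tutorial titles.
likes = {
    "emma": {"Make a Five Strand Braid", "Make DIY Nail Decals"},
    "ana": {"Make a Five Strand Braid", "Make Avocado Toast"},
    "bea": {"Make DIY Nail Decals", "Make a Five Strand Braid",
            "Make Avocado Toast", "Make a Macro Lens for Free!"},
    "carl": {"Solve a 3x3 Rubik's Cube"},
}

def user_recommendations(user, top_n=10):
    seen = likes[user]                              # out('liked').aggregate(x)
    neighbors = {                                   # in('liked').dedup
        other for other, titles in likes.items() if titles & seen
    }
    counts = Counter()
    for other in neighbors:
        counts.update(likes[other] - seen)          # out('liked').except(x)
    return counts.most_common(top_n)

recs = user_recommendations("emma")
```

As in the Gremlin version, the user herself ends up among the neighbors, which is harmless because all of her likes are excluded in the last step. Carl shares nothing with Emma, so his Rubik's Cube tutorial is never suggested.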

OK, we have interesting recommendations, but if I want to cook another style of chicken for dinner, such as Chicken Ramen Soup, I probably do not want a tutorial on How to Solve a 3x3 Rubik's Cube. To handle this situation, we can mix collaborative filtering and content-based recommendation in a single traversal, so that it recommends similar chicken and food tutorials based on shared keywords.

Now let's play with content-based recommendation! Which tutorials are most highly correlated with Sous Vide Chicken and share the same food categories?

//Top recommendations mixing content + collaborative sharing all categories.
def topRecommendations(tutorial){
    m = [:]
    x = [] as Set
    v = g.V.filter{it.getProperty('title') == tutorial}
    tuts = v.out('hasCategory').aggregate(x).back(2).inE('liked').outV.outE('liked').inV
    tuts.hasNot('title',tutorial).out('hasCategory').retain(x).back(2).title.groupCount(m).iterate()
    return m.sort{-it.value}[0..9]
}

>>> script = g.scripts.get('topRecommendations')
>>> recommendations = g.gremlin.execute(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> recommendations.content
[(u'Make Incredible Beef Jerky at Home Easily!', 131), (u'Make Potato Latkes', 124), (u'Cook the Perfect & Simple Chicken Ramen Soup', 96), (u'Make Cake Pops!!', 75), (u'Make Beeswax and Honey Lip Balm', 75), (u'Make Mint Juleps', 74), (u'Cook Lamb Shanks Moroccan Style', 74), (u'Cook an Egg in a Basket', 72), (u'Make Banana Fritters', 72), (u'Prepare Chicken With Peppers and Gorgonzola Cheese', 71)]

This ranking makes sense, but it still has a flaw. A tutorial like Make Mint Juleps may not be interesting to me. How about considering only those tutorials that share the keyword 'chicken' with Sous Vide Chicken?

Which tutorials are most highly co-rated with Sous Vide Chicken and share the keyword 'chicken' with it?

//Top recommendations mixing content + collaborative sharing the chicken category.
def topRecommendations(tutorial){
    m = [:]
    v = g.V.filter{it.getProperty('title') == tutorial}
    v.inE('liked').outV.outE('liked').inV.hasNot('title',tutorial).out('hasCategory').has('category','chicken').back(2).title.groupCount(m).iterate()
    return m.sort{-it.value}[0..9]
}

>>> script = g.scripts.get('topRecommendations')
>>> recommendations = g.gremlin.execute(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> recommendations.content
{u'Make a Whole Chicken With Veggies in the Crockpot': 28, u'Bake Crispy Chicken With Doritos': 30, u'Cook Chicken Rollatini With Zucchini & Mozzarella': 28, u'Make Beer Can Chicken': 23, u'Roast a Chicken': 54, u'Cook the Perfect & Simple Chicken Ramen Soup': 96, u'Pesto Chicken RollUps Recipe': 31, u'Cook Chicken in Roasting Bag': 23, u'Make Chicken Enchiladas': 29, u'Prepare Chicken With Peppers and Gorgonzola Cheese': 71}
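Both hybrid variants, sharing any category with the starting tutorial or sharing one specific keyword, follow the same pattern: count co-likes as before, but keep only candidates that pass a content test. Here is a minimal sketch of that pattern. The likes, the category assignments and the `hybrid` helper are all hypothetical, and each co-like is counted once even when several categories match, which is a simplification of the Gremlin traversals.

```python
from collections import Counter

SEED = "Make Sous Vide Chicken at Home"

# Hypothetical likes and category assignments.
likes = {
    "amanda": {SEED, "Roast a Chicken", "Make Mint Juleps"},
    "james": {SEED, "Roast a Chicken", "Solve a 3x3 Rubik's Cube"},
    "julie": {SEED, "Roast a Chicken"},
}
categories = {
    SEED: {"food", "chicken"},
    "Roast a Chicken": {"food", "chicken"},
    "Make Mint Juleps": {"food", "drinks"},
    "Solve a 3x3 Rubik's Cube": {"games"},
}

def hybrid(seed, keyword=None, top_n=10):
    """Co-like counts, restricted by content.

    Without `keyword`, a candidate must share any category with the seed;
    with `keyword`, it must carry that specific category.
    """
    wanted = {keyword} if keyword else categories[seed]
    counts = Counter()
    for titles in likes.values():
        if seed not in titles:
            continue
        for other in titles - {seed}:
            if categories.get(other, set()) & wanted:
                counts[other] += 1
    return counts.most_common(top_n)

by_any_category = hybrid(SEED)
by_chicken = hybrid(SEED, keyword="chicken")
```

With the broader filter, Make Mint Juleps still gets through because it shares the food category; restricting to 'chicken' removes it, which matches the effect seen in the two result lists above.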

Link: http://aimotion.blogspot.in/2013/03/graph-based-recommendations-using-how.html
