GSoC 2017 Proposal - Rajat Arora

Name and Contact Information :

Name : Rajat Arora
Email : rajatarora216@gmail.com
Phone No: +91-9582483969
Github Handle : rajat1994
Twitter Handle : @shrodinger1994
Blog : https://medium.com/@rajatrarora
Timezone : IST (UTC +5:30)
Title: Spam filtering and content classification
Synopsis:
Zenodo believes in an open science culture, which matches my own values. I really like
how it enables researchers to share and preserve research outputs of any size, in any
format and from any field of science. Zenodo has many interesting projects, but the one I
like the most is spam filtering and content classification. This project has machine
learning and data analysis at its core, both of which are in high demand these days. It
checks whether a record is spam or non-spam and which scientific domain it belongs to.
Above all, it can serve as a foundation for a paper recommendation system.
Benefits to Community:
Problem Description
Since Zenodo allows researchers from around the world to publish their papers, data
and software, there is a high chance that it will face the problem of spam content. Also,
there is currently no module that classifies content. If this problem is not solved,
users will start to question the authenticity of the platform.
How does it make Zenodo and open science better?
Detecting spam content helps to establish the authenticity of research data, and it can
also increase the number of active users. Because my project aims to implement different
classifiers, finding the right document becomes easier and takes less time. This will
make Zenodo a better platform for accessing quality research content and will make
open science better.
Deliverables :
Basic Idea
Zenodo's metadata contains features in the form of text, so NLP (Natural Language
Processing) is a good way to tackle this problem. I will feed these features to
different classifiers to train them. The result of such a classifier would be a small
piece of metadata (JSON), which can then be plugged into the search engine.
Plan of work and Proposed solution
The Project consists of various stages:
Selection of useful features and Data cleaning

The metadata contains many features, so to train the classifier efficiently we need to
extract the useful ones. Features such as keywords, description, title and subjects are
some of the key features for training a classifier.
Many features are null or contain duplicate values, HTML markup, punctuation,
stopwords, etc., so it is essential to clean the data. The quality of a classifier depends
on the quality of the data provided. I have started the data cleaning process and have
successfully removed null and duplicate values.
To remove punctuation and numbers, I will use Python's package for regular
expressions, called re. To remove HTML markup, the Beautiful Soup library will be
used. It is quite a powerful Python library for data cleaning purposes.
Finally, to remove stopwords (words that don't carry much meaning), I will use Python's
Natural Language Toolkit (NLTK). NLTK will be used heavily throughout the project.
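A minimal sketch of the cleaning steps described above. To keep it self-contained it uses
only the standard library: html.parser stands in for Beautiful Soup, and a tiny
hand-written stopword set stands in for NLTK's stopword corpus. The function names and the
sample string are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword set (assumption); the proposal plans to use
# NLTK's English stopword list instead.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "to", "for"}


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    """Remove HTML markup (standard-library stand-in for Beautiful Soup)."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)


def clean_text(raw):
    """Return lowercase tokens without markup, punctuation, digits or stopwords."""
    text = strip_html(raw)
    # Replace every non-letter with a space (drops punctuation and numbers).
    text = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in STOPWORDS]


sample = "<p>The <b>taxonomy</b> of Diptera, edition 2!</p>"
print(clean_text(sample))
```

In the real pipeline the same function would be applied to each text field (title,
description, keywords) of a record after null and duplicate values have been dropped.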
Divide the dataset into training and testing data
This is a simple but essential step. After data cleaning, we have to
divide the dataset into training data and testing data. The training data will be used to
train a classifier. The testing data will be used to determine the accuracy of the
classifier. The division can be performed with the train_test_split helper function in the
scikit-learn library, or manually.
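The manual route can be sketched as follows, using only the standard library. The function
name, the 25% test fraction, the seed and the toy records are assumptions for
illustration; scikit-learn's train_test_split would do the same job in one call.

```python
import random


def split_dataset(records, test_fraction=0.25, seed=42):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labelled records: (text, label).
records = [("record %d" % i, i % 2) for i in range(20)]
train, test = split_dataset(records)
print(len(train), len(test))
```

Shuffling before splitting matters here because records harvested in upload order may be
grouped by community or date, which would otherwise bias the test set.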
Formation of a Natural Language Processing model

This is the most crucial step in the project. I will be implementing two models, i.e.
bag-of-words for the spam/non-spam classifier and word2vec for the domain classifier.
Bag-of-words model
The Bag of Words model learns a vocabulary from all of the documents, then models
each document by counting the number of times each word appears. For example,
consider the following two sentences:
Sentence 1: "The cat sat on the hat"

Sentence 2: "The dog ate the cat and the hat"
From these two sentences, our vocabulary is as follows:
{ the, cat, sat, on, hat, dog, ate, and }
To get our bags of words, we count the number of times each word occurs in each
sentence. In Sentence 1, "the" appears twice, and "cat", "sat", "on", and "hat" each
appear once, so the feature vector for Sentence 1 is:
{ the, cat, sat, on, hat, dog, ate, and }
Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Similarly, the features for Sentence 2 are: { 3, 1, 0, 0, 1, 1, 1, 1}
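The worked example above can be reproduced with a few lines of plain Python. This is only
a sketch of the counting idea; the function names are assumptions, and in the project
itself a library vectorizer such as scikit-learn's CountVectorizer would likely be used
(note that it orders the vocabulary alphabetically rather than by first appearance).

```python
from collections import Counter


def build_vocabulary(documents):
    """Return the distinct words of all documents in order of first appearance."""
    vocabulary = []
    seen = set()
    for doc in documents:
        for word in doc.lower().split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


def vectorize(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


sentences = ["The cat sat on the hat", "The dog ate the cat and the hat"]
vocabulary = build_vocabulary(sentences)
vectors = [vectorize(s, vocabulary) for s in sentences]
print(vocabulary)
print(vectors)
```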
Now, to classify whether a Zenodo record is spam or not, let's take keywords as the
attribute. I will build a vocabulary from the keywords of all the records and then apply
a classifier to it. A small sample of the keyword vocabulary would look like this:
{taxonomy, arhropoda, Diptera, neoephydra, Insecta, neotetraonchidae, platyhelminthes,
universe, tanaoa, galilia, lachesillidae, graphocaecilius}
Word2Vec model
Word2Vec is a neural network implementation that learns distributed representations
for words. Word2Vec does not need labels in order to create meaningful
representations. This is useful, since most data in the real world is unlabeled. Words
with similar meanings appear in clusters, and clusters are spaced such that some word
relationships, such as analogies, can be reproduced using vector math.
Since there is no labelled data in Zenodo's metadata for domain classification, I will
apply a word2vec model to the title and description attributes to find the
scientific domain.
Word2Vec expects single sentences, each one as a list of words. In other words, the
input format is a list of lists.
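A small sketch of how a description could be turned into that list-of-lists format. The
sentence splitting here is a naive regular expression and the sample text is invented; a
proper sentence tokenizer (for example from NLTK) would be more robust.

```python
import re


def to_sentences(text):
    """Split raw text into sentences, each one a list of lowercase words."""
    # Naive sentence boundary: '.', '!' or '?' followed by whitespace.
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    for raw in raw_sentences:
        words = re.findall(r"[a-z]+", raw.lower())
        if words:
            sentences.append(words)
    return sentences


description = "A new species of Diptera is described. Its taxonomy is revised!"
corpus = to_sentences(description)
print(corpus)
```

Stopwords are deliberately kept in this sketch, since Word2Vec learns from the context
around each word; whether to remove them is a choice to evaluate during the project.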
Training of models
After forming the models, the next step is to train them. This is the time-consuming step
because it requires trying out various classifiers and combinations of them. A general
classifier looks as shown in the figure:
Figure 1
To train the bag-of-words model, I will use various pre-implemented classifiers such as
random forests, decision trees and naive Bayes from the scikit-learn library. The model
that gives the highest accuracy will be selected.
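The selection step can be sketched as a loop that trains each candidate and keeps the one
with the highest test accuracy. To stay self-contained, this sketch does not use
scikit-learn: it compares a hand-written multinomial naive Bayes against a majority-class
baseline. The class names and the toy data are assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


class MajorityClassifier:
    """Baseline that always predicts the most frequent training label."""

    def fit(self, docs, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, doc):
        return self.label


class NaiveBayesClassifier:
    """Multinomial naive Bayes over token counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for token in doc:
                count = self.word_counts[label][token] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the true label."""
    hits = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return hits / len(labels)


# Hypothetical, hand-made toy data: token lists with spam/ham labels.
train_docs = [["cheap", "pills", "buy"], ["buy", "cheap", "watches"],
              ["taxonomy", "diptera"], ["insecta", "taxonomy", "species"],
              ["galaxy", "universe", "survey"]]
train_labels = ["spam", "spam", "ham", "ham", "ham"]
test_docs = [["buy", "pills"], ["diptera", "species"], ["cheap", "watches"]]
test_labels = ["spam", "ham", "spam"]

candidates = {"majority": MajorityClassifier(), "naive_bayes": NaiveBayesClassifier()}
scores = {name: accuracy(model.fit(train_docs, train_labels), test_docs, test_labels)
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

With scikit-learn the loop would look the same, only with library estimators as the
candidates and their fit and predict methods applied to the bag-of-words matrix.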
On top of the Word2Vec model, K-Means clustering will be used. Word2Vec creates clusters
of semantically related words, so a possible approach is to exploit the similarity of
words within a cluster, which we can do by using a clustering algorithm such as K-Means.
In K-Means, the one parameter we need to set is "K", the number of clusters. I will use
scikit-learn to implement K-Means clustering.
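To illustrate the clustering idea, here is a plain-Python sketch of K-Means (Lloyd's
algorithm) applied to a handful of made-up two-dimensional "word vectors". Real Word2Vec
vectors have hundreds of dimensions, and the project would use scikit-learn's KMeans
rather than this hand-written loop. The vectors, the function names and the choice of the
first K points as initial centroids are assumptions that keep the sketch small and
deterministic.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def k_means(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns (centroids, cluster index per point)."""
    # Assumption: the first k points serve as initial centroids (deterministic).
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean(members)
    return centroids, assignment


# Hypothetical 2-D vectors; real Word2Vec vectors are learned, not hand-written.
word_vectors = {
    "diptera": [0.9, 0.1], "insecta": [1.0, 0.2], "taxonomy": [0.8, 0.0],
    "galaxy": [0.1, 0.9], "universe": [0.0, 1.0], "survey": [0.2, 0.8],
}
words = list(word_vectors)
centroids, assignment = k_means([word_vectors[w] for w in words], k=2)

clusters = {}
for word, cluster_id in zip(words, assignment):
    clusters.setdefault(cluster_id, []).append(word)
print(clusters)
```

Each cluster can then be treated as one candidate domain, and a record can be described by
how many of its words fall into each cluster.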
My main aim in this task is to use TensorFlow and compare its results with those of
scikit-learn. TensorFlow is an open source software library for numerical computation
using data flow graphs, developed by Google. TensorFlow contains the tf.learn API, which
has various pre-implemented models such as DNN (deep neural network) classifiers and
K-Means.
Implementation of a generic invenio record classifier module

After training the model, I will integrate it into an Invenio module. I have already
started exploring the cookiecutter-invenio-module template for creating Invenio modules.
This module will be generic, which means it is extensible with different classifiers
through entry points.
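One way the entry point mechanism could work is sketched below, using the standard
library's importlib.metadata to discover classifiers that other installed packages
register. The group name "invenio_classifier.classifiers", the function names and the toy
classifier are hypothetical; the real module would define its own names.

```python
from importlib.metadata import entry_points

# Hypothetical entry point group name (assumption, not an existing Invenio group).
ENTRY_POINT_GROUP = "invenio_classifier.classifiers"


def keyword_spam_classifier(record):
    """Toy built-in classifier: flags records whose keywords mention 'casino'."""
    keywords = [k.lower() for k in record.get("keywords", [])]
    return {"spam": "casino" in keywords}


def load_classifiers(group=ENTRY_POINT_GROUP):
    """Return {name: callable} for the built-in plus all installed classifiers."""
    classifiers = {"keyword_spam": keyword_spam_classifier}
    eps = entry_points()
    # Python 3.10+ exposes .select(); older versions return a plain dict.
    found = eps.select(group=group) if hasattr(eps, "select") else eps.get(group, [])
    for ep in found:
        classifiers[ep.name] = ep.load()
    return classifiers


classifiers = load_classifiers()
result = classifiers["keyword_spam"]({"keywords": ["Casino", "bonus"]})
print(sorted(classifiers), result)
```

A third-party package would then add a classifier without touching the module itself, by
declaring an entry point in the same group in its own packaging metadata.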
Once the model has been trained on the given metadata, its state will be preserved: it
will be stored in Zenodo's database and can be loaded later to make predictions. It will
be trained further as new metadata is provided, which should improve the accuracy of the
model. The input to the Invenio module is JSON data, passed through a simple
programmatic API.
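The store, reload and keep-training cycle, together with the JSON-in/JSON-out API, can be
sketched as follows. The toy model, its method names and the output field names
("classification", "label") are hypothetical; a real classifier would have a richer state,
and how that state is serialised into the database is still to be decided.

```python
import json
from collections import Counter


class KeywordCountModel:
    """Hypothetical toy model: per-label keyword counts that can keep growing."""

    def __init__(self, counts=None):
        self.counts = {label: Counter(c) for label, c in (counts or {}).items()}

    def update(self, keywords, label):
        """Train further on one newly labelled record."""
        self.counts.setdefault(label, Counter()).update(k.lower() for k in keywords)

    def predict(self, keywords):
        """Return the label whose stored keywords overlap most with the input."""
        scores = {label: sum(c[k.lower()] for k in keywords)
                  for label, c in self.counts.items()}
        return max(scores, key=scores.get)

    def dumps(self):
        """Serialise the model state to JSON text."""
        return json.dumps(self.counts, sort_keys=True)

    @classmethod
    def loads(cls, blob):
        """Rebuild a model from previously stored JSON text."""
        return cls(json.loads(blob))


def classify_record(model, record_json):
    """JSON in, JSON out: return the prediction as a small metadata document."""
    record = json.loads(record_json)
    label = model.predict(record.get("keywords", []))
    return json.dumps({"classification": {"label": label}}, sort_keys=True)


model = KeywordCountModel()
model.update(["casino", "lottery"], "spam")
model.update(["taxonomy", "Diptera"], "ham")

stored = model.dumps()                        # text that could be saved in a database
restored = KeywordCountModel.loads(stored)    # later: reload the saved state
restored.update(["bonus", "casino"], "spam")  # keep training on new metadata

output = classify_record(restored, '{"title": "Win big", "keywords": ["Bonus"]}')
print(output)
```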
Classifiers can either be run as part of the main Flask application (with Celery tasks to
off-load the heavy computation from the main application), or as a stand-alone service
with a REST API. Finally, the tool will be properly documented so that others can run
it.
Timeline:
I have only two final exams, and they will be over in the first week of May, so I will
have no problem contributing full time to GSoC.
Community Bonding Period

May 4 - May 29
Familiarize myself more deeply with the Zenodo and Invenio codebases.
Select useful features and model the JSON data accordingly.
Start working on the data cleaning process.
Complete the data cleaning process.
Phase 1
May 30 - June 14
Start implementing the bag-of-words model for spam/non-spam classification.
Train various classifiers on it using scikit-learn.
June 15 - June 25
Get started with the word2vec model for domain classification.
Train various classifiers on it using scikit-learn.
June 26 - June 30
Start learning more thoroughly about TensorFlow features such as the tf.learn API and
TensorBoard visualisations.
Submit the work done in phase 1.
Phase 2
July 1 - July 14
Start training the bag-of-words model using pre-implemented classifiers in the tf.learn API.
Continue training it on different sets of features.
July 15 - July 23
The Word2Vec model will be trained using pre-implemented classifiers in the tf.learn API.
TensorBoard visualisations will be used to visualise the model.
Start working with the cookiecutter-invenio-module template to create the Invenio module.
July 24 - July 28
Continue working on the Invenio module.
Submit the work done in phase 2.
Phase 3 (Final phase)

July 29 - Aug 14
Make the module extensible with different classifiers through entry points.
Classifiers will be run as part of the main Flask application, with Celery tasks to
off-load the heavy computation from the main application.
Aug 15 - Aug 21
Test the module thoroughly.
Write proper documentation for running the module.
Submit the final work.
Related Work:
I have been contributing to the following PRs/issues and am also active on Zenodo's Gitter
channel. Links to the PRs/issues are:
#1011 - https://github.com/zenodo/zenodo/pull/1011 (work in progress)

#996 - https://github.com/zenodo/zenodo/issues/996
#1001 - https://github.com/zenodo/zenodo/issues/1001
TensorFlow and Scikit-learn projects
1) Breast cancer prediction : https://github.com/rajat1994/BreastCancer_TensorFlow

2) Diseases Prediction : https://github.com/rajat1994/Diseases_Prediction
3) TensorFlow Basics : https://github.com/rajat1994/Machine-Learning
4) Scikit-learn Basics : https://github.com/rajat1994/Machine_Learning_Problems
Biographical Information:
University : Guru Gobind Singh Indraprastha University, India

Education : B.Tech ECE (8th semester)
Coding skills : Python, JavaScript, AngularJS, PostgreSQL
Technical Skills : TensorFlow, Flask, scikit-learn, NumPy, Matplotlib, Elasticsearch
Scholarship : Android Nanodegree scholarship from Google and Udacity
Previous Open source contribution :
Last year I contributed to FOSSASIA. Link to the PR:
1. https://github.com/fossasia/fossasia-communities/pull/103
Any Plans/ Commitment (During GSoC)
I have no other commitments for this summer and I am positive that I will be able
to contribute about 60-65 hours a week to the project.
I will maintain a weekly blog to keep track of my progress and to get
feedback from the community.
I am not sending proposals to any other organisation.
Future Plans (After GSoC)

I plan to actively maintain my code and to do bug fixing and reviewing in Zenodo even
after my GSoC period is over.
I will be available to further enhance the functionality of this project.
Zenodo is pretty close to my interests, and I am looking for a long-term association
with the community.
