GSoC 2017 Proposal - Rajat Arora
Synopsis:
Zenodo believes in the open science culture, which matches my own ideals. I really like
how it enables researchers to share and preserve research outputs of any size, in any
format and from any science. Zenodo has many cool projects, but the one I like the
most is spam filtering and content classification. This project has machine learning
and data analysis at its core, skills that are in very high demand these days. It helps to
check whether content is spam or not and which domain it belongs to. Above all, it can
serve as a foundation for a paper recommendation system.
Benefits to Community:
Problem Description
Since Zenodo allows researchers from around the world to publish their papers, data
and software, there is a high chance that it will face the problem of spam content. Also,
there is no module that can classify content. If this problem is not solved, it will
lead users to question the authenticity of the platform.
How does it make Zenodo and open science better?
Detection of spam content helps in the authentication of research data. It should also help
increase the number of active users. As my project aims to implement different classifiers,
finding the right document becomes easier and takes less time. This will
help make Zenodo a great platform to access quality research content and make
open science better.
Deliverables:
Basic Idea
Zenodo's metadata contains features in the form of text, so NLP (Natural Language
Processing) is a good way to tackle this problem. I will feed these features to train
different classifiers. The result of such a classifier would be a small piece of metadata
information (JSON), which can then be plugged into the search engine.
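As a rough illustration of that output, the sketch below shows what such a JSON fragment might look like. The field names (`record_id`, `is_spam`, `spam_score`, `domain`), the 0.5 threshold and the sample values are all assumptions made for this example, not an existing Zenodo schema.

```python
import json

def build_classification_metadata(record_id, spam_score, domain, threshold=0.5):
    """Package classifier results as a small JSON-serialisable dict.

    The field names and the default threshold are hypothetical.
    """
    return {
        "record_id": record_id,
        "is_spam": spam_score >= threshold,
        "spam_score": round(spam_score, 3),
        "domain": domain,
    }

# Hypothetical record id, score and domain, for illustration only.
metadata = build_classification_metadata(12345, 0.0312, "physics")
print(json.dumps(metadata, indent=2, sort_keys=True))
```

Keeping the result a flat dictionary of simple types means it can be serialised without any custom encoder.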
Many features are null or contain duplicate values, HTML markup, punctuation,
stopwords, etc., so it is essential to clean the data. The quality of a classifier depends on
the quality of the data provided. I have started the data cleaning process and
successfully removed null and duplicate values.
To remove punctuation and numbers, I will use Python's package for dealing with regular
expressions, called re. To remove HTML markup, the Beautiful Soup library will be
used. It is quite a powerful Python library for parsing HTML and extracting its text.
Finally, to remove stopwords (common words that don't carry much meaning), I will use Python's
Natural Language Toolkit (NLTK). NLTK will be used heavily throughout the project.
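The cleaning steps described above can be sketched roughly as follows. So that it runs without third-party packages, this sketch swaps in Python's standard `html.parser` for Beautiful Soup and a tiny hand-written stopword set for NLTK's English list; both are assumptions for illustration, as are the function names and the sample input.

```python
import re
from html.parser import HTMLParser

# Tiny stand-in stopword set (assumption); NLTK's English list is much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "on", "to", "for"}

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(raw):
    """Return the text content of an HTML fragment, without the tags."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)

def clean_text(raw):
    """Strip markup, punctuation and digits, lowercase, and drop stopwords."""
    text = strip_html(raw)
    text = re.sub(r"[^a-zA-Z]", " ", text)  # keep letters only
    words = text.lower().split()
    return [w for w in words if w not in STOPWORDS]

print(clean_text("<p>The <b>analysis</b> of 42 open datasets, in detail!</p>"))
# prints ['analysis', 'open', 'datasets', 'detail']
```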
Divide the dataset into training and testing data
This is a simple but essential step. After data cleaning, we have to
divide the dataset into training data and testing data. The training data will be used to
train a classifier. The testing data will be used to determine the accuracy of the
classifier. The division can be performed with the train_test_split helper function in the
scikit-learn library, or manually.
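A minimal sketch of the manual variant, using only the standard library, might look like this. The function name, the 25% test fraction and the toy records are assumptions for illustration; scikit-learn's `train_test_split` serves the same purpose.

```python
import random

def split_dataset(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy (text, label) pairs standing in for cleaned metadata records.
records = [("record %d" % i, i % 2) for i in range(20)]
train, test = split_dataset(records)
print(len(train), len(test))  # prints 15 5
```

Shuffling before splitting matters when records are stored in upload order, since otherwise the test set would contain only the most recent records.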
Bag-of-words model
The Bag of Words model learns a vocabulary from all of the documents, then models
each document by counting the number of times each word appears. For example,
consider the following two sentences:
To get our bags of words, we count the number of times each word occurs in each
sentence. In Sentence 1, "the" appears twice, and "cat", "sat", "on", and "hat" each
appear once, so the feature vector for Sentence 1 is:
Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
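The counting can be sketched in a few lines of Python. The two example sentences below are assumptions, chosen so that the first one reproduces the feature vector given above; the vocabulary is ordered by first appearance.

```python
from collections import Counter

def bag_of_words(sentences):
    """Return (vocabulary, vectors) with one count vector per sentence."""
    tokenized = [s.lower().split() for s in sentences]
    vocabulary = []
    for words in tokenized:
        for word in words:
            if word not in vocabulary:
                vocabulary.append(word)  # keep first-appearance order
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Assumed example sentences (not taken from the proposal text).
sentences = ["The cat sat on the hat", "The dog ate the cat and the hat"]
vocabulary, vectors = bag_of_words(sentences)
print(vocabulary)  # prints ['the', 'cat', 'sat', 'on', 'hat', 'dog', 'ate', 'and']
print(vectors[0])  # prints [2, 1, 1, 1, 1, 0, 0, 0]
print(vectors[1])  # prints [3, 1, 0, 0, 1, 1, 1, 1]
```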
Now, to classify whether Zenodo's metadata is spam or not, let's take keywords as the
attribute: I'll build a vocabulary from the keywords of all the records and then apply a
classifier to it. A small sample of the keywords vocabulary would look like this:
Since there is no labelled data in Zenodo's metadata for domain classification, I will
implement a word2vec model on the title and description attributes to find the
scientific domain.
Word2Vec expects single sentences, each one as a list of words. In other words, the
input format is a list of lists.
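To make the list-of-lists format concrete, here is a small sketch that turns a description string into that shape. The regex-based sentence splitting and the sample description are simplifying assumptions; a real pipeline would more likely use NLTK's tokenizers.

```python
import re

def to_sentences(text):
    """Split text into sentences, each one a list of lowercase words."""
    sentences = []
    for raw in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z]+", raw.lower())
        if words:  # skip empty fragments left over by the split
            sentences.append(words)
    return sentences

# Hypothetical description text, for illustration only.
description = "Open data for climate research. Includes station measurements!"
corpus = to_sentences(description)
print(corpus)
# prints [['open', 'data', 'for', 'climate', 'research'], ['includes', 'station', 'measurements']]
```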
Training of models
After forming the models, the next step is to train them. This is the time-consuming step
because it requires trying out various combinations of classifiers. A general
classifier looks like the one shown in the figure:
Figure 1
To train the bag-of-words model, I will use various pre-implemented classifiers such as
random forests, decision trees, naive Bayes, etc. from the scikit-learn library. The model
that gives the highest accuracy will be selected.
On top of the Word2Vec model, K-Means clustering will be used. Word2Vec places semantically
related words close together, so a possible approach is to exploit the similarity of words
within a cluster, which we can do by using a clustering algorithm such as K-Means. In
K-Means, the one parameter we need to set is "K", or the number of clusters. I will use
scikit-learn to implement K-Means clustering.
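To illustrate the idea, the sketch below runs a plain-Python K-Means (Lloyd's algorithm) on toy 2-D vectors. It is a stand-in for scikit-learn's `KMeans`, written without dependencies; the vectors are made up for the example and are not real Word2Vec embeddings, and seeding the centroids with the first K points is a simplification.

```python
def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Made-up 2-D "word vectors" (assumption); real embeddings have far more dimensions.
word_vectors = {
    "protein": (0.9, 0.1), "galaxy": (0.1, 0.9), "genome": (0.8, 0.2),
    "telescope": (0.2, 0.8), "cell": (0.85, 0.15), "orbit": (0.15, 0.85),
}
words = list(word_vectors)
labels, _ = kmeans([word_vectors[w] for w in words], k=2)

clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(label, []).append(word)
print(clusters)
# prints {0: ['protein', 'genome', 'cell'], 1: ['galaxy', 'telescope', 'orbit']}
```

With K = 2 the words fall into one cluster per toy topic; on real data, K would have to be tuned.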
My main aim in this task is to use TensorFlow and compare its results with those of
scikit-learn. TensorFlow is an open source software library for numerical computation
using data flow graphs, developed by Google. TensorFlow contains the tf.learn API, which
has various pre-implemented estimators such as DNN (deep neural network) classifiers, K-Means, etc.
Once the model has been trained on the given metadata, its state will
be preserved: it will be stored in Zenodo's database and can be loaded later to make
predictions. It will be trained further as new metadata is provided, which should improve
the accuracy of the model. The input to the Invenio module is JSON data, passed through a simple
programmatic API.
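The store, load and continue-training cycle could be sketched as follows. The "model" here is a toy keyword counter invented for this example, and `pickle` is one possible serialisation choice (an assumption; the storage format is not fixed in this proposal). The resulting bytes could be kept in a binary database column.

```python
import pickle

def train(state, labelled_keywords):
    """Update per-label keyword counts in place; a toy stand-in for a classifier."""
    for keyword, label in labelled_keywords:
        counts = state.setdefault(label, {})
        counts[keyword] = counts.get(keyword, 0) + 1
    return state

def predict(state, keyword):
    """Return the label that has seen this keyword most often, or None."""
    best = max(state, key=lambda label: state[label].get(keyword, 0), default=None)
    if best is None or state[best].get(keyword, 0) == 0:
        return None
    return best

# Initial training on hypothetical labelled keywords.
state = train({}, [("casino", "spam"), ("genomics", "ham")])
blob = pickle.dumps(state)  # bytes that could be stored in the database

# Later: load the stored state and continue training on new metadata.
restored = pickle.loads(blob)
train(restored, [("lottery", "spam")])
print(predict(restored, "lottery"), predict(restored, "genomics"))  # prints spam ham
```

Note that unpickling should only ever be done on data the service wrote itself, since loading untrusted pickles can execute arbitrary code.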
Classifiers can either be run as part of the main Flask application (with Celery tasks to
off-load the heavy computation from the main application), or as a stand-alone service with
a REST API. Finally, documentation for running the tool will be written and properly
structured.
Timeline:
I have only two final exams, and they will be over in the first week of May, so I will have no
problem contributing full time to GSoC.
Phase 1
May 30 - June 14
June 15 - June 25
June 26 - June 30
Start learning more thoroughly about newer TensorFlow features such as the tf.learn API
and TensorBoard visualisations.
Submit the work done up to the end of Phase 1.
Phase 2
July 1 - July 14
Start training the bag-of-words model using the pre-implemented classifiers in the tf.learn API.
Continue training it on different sets of features.
July 15 - July 23
July 24 - July 28
Aug 15 - Aug 21
Related Work:
I have been contributing through the following PRs/issues and am also active on Zenodo's Gitter
channel. Links to the PRs/issues are:
1. https://github.com/fossasia/fossasia-communities/pull/103
2. https://github.com/fossasia/fossasia-communities/pull/104
3. https://github.com/fossasia/fossasia-communities/pull/105
4. https://github.com/fossasia/fossasia-communities/pull/106
Biographical Information:
I have no other commitments for this summer and I am positive that I will be able
to contribute about 60-65 hours a week to the project.
I will maintain a weekly blog to keep track of my progress and to get
feedback from the community.
I am not sending proposals to any other organisation.