
R package Recommendation

Yingying Xu
Abstract: Newcomers to the R programming language often face the problem of choosing packages that are relevant and of high quality. An automated R package recommendation system would be useful for showing what packages typical users have installed, together with measurable indicators of package quality. When building such a recommendation system, we are interested in the probability that a user installs a given package, so in this paper various features that might influence the user's decision are explored and several statistical learning methods are evaluated.

1. Introduction

Every programming language has a large number of libraries/packages that extend the functionality of the core language. Fluent use of a number of libraries is as important for a programmer as mastery of the basic syntax. Newcomers to a programming language always face the problem of choosing libraries that are relevant and of high quality. Inspecting each library manually by reading its functionality description is a daunting task, and such descriptions provide little information about the quality of the library. Thus, an automated package recommendation system would be useful for telling the programmer what packages other people have installed, and the measurable properties that lead to those installations. R is a language and environment for statistical computing and graphics. In this paper, we try several methods for building an R package recommendation engine. CRAN, the main package repository for R, is a network of ftp and web servers around the world that stores code and documentation for R. On CRAN, each package summary contains a short description of what the package does, the authors and maintainer of the package, the packages it imports and depends on, and other packages it suggests. The rich meta-data attached to each R package lets us explore many potential relationships and make fairly accurate predictions about whether a user will install a package [1].

1.1 Data Description: Network Graph

The R package network contains several kinds of entities: packages, topics/task views, authors, maintainers, and users. Topic/task views let users browse packages by topic; there are currently 29 task views on CRAN. The links between packages are imports, depends, and suggests. A package is written by one or more authors, maintained by one maintainer (usually one of the authors), can appear in one or several topics/task views, and is installed by users. A maintainer can maintain one or several packages at the same time. We also know whether a package is a recommended package on CRAN and whether it is a core package.

1.2 Training Data

The training data are taken from the Dataists R recommendation system contest (https://github.com/johnmyleswhite/r_recommendation_system). The graph data were obtained by crawling the entire CRAN website, and the predictors were derived from the rich meta-data. The training data contain 99,640 rows describing installation information for 1865 packages and 52 users of R. Each row of the matrix provides the following information:

Package: The name of the current R package.
User: The numeric ID of the current user, who may or may not have installed the current package.
Installed: A dummy variable indicating whether the current package is installed by the current user.
DependencyCount: The number of other R packages that depend upon the current package.
SuggestionCount: The number of other R packages that suggest the current package.
ImportCount: The number of other R packages that import the current package.
ViewsIncluding: The number of task views on CRAN that include the current package.
CorePackage: A dummy variable indicating whether the current package is part of core R.
RecommendedPackage: A dummy variable indicating whether the current package is a recommended R package.
Maintainer: The name and e-mail address of the package's maintainer.
PackagesMaintaining: The number of other R packages maintained by the current package's maintainer.

Besides the given features, we can also use the open-source web crawler to crawl the entire CRAN website and derive more informative links from the network graph, such as the name of each package that depends on, imports, or suggests the current package, and the names of the task views in which the current package appears. This is discussed in the following section.
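Before deriving new features, the contest data set described above can be loaded and inspected. The sketch below is illustrative only: the file name training_data.csv is an assumption about how the contest data are exported locally, not something specified in the paper.

    # Load the (package, user) installation table described in section 1.2.
    # "training_data.csv" is an assumed local export of the contest data set.
    installations <- read.csv("training_data.csv", stringsAsFactors = FALSE)

    nrow(installations)                    # about 99,640 rows
    length(unique(installations$Package))  # 1865 packages
    length(unique(installations$User))     # 52 users
    table(installations$Installed)         # how many pairs are installed vs. not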

2. Feature Selection

Several intuitions and assumptions underlie the prediction of whether a user will install an R package. A package p is more likely to be installed by user u if:

- it has more in-links;
- the maintainer of the package maintains higher-quality packages on average;
- it is a recommended package;
- u installs more packages on average;
- u has installed packages that depend on p;
- p is a core package.

One type of in-link is created when another package depends on p; another type is created when another package suggests p. The given features SuggestionCount, ImportCount, and DependencyCount record the total number of each type of in-link. Intuitively, a package that is suggested by popular packages should have a higher probability of being installed; in other words, in-links coming from different packages should carry different weights according to the popularity of the linking package. The features SuggestionCount, ImportCount, and DependencyCount are therefore replaced by SuggestionScore, ImportScore, and DependencyScore, weighted versions of the original counts. For example,

SuggestionScore(p) = \sum_q w_q \cdot suggests(q, p), \qquad w_q = c \cdot popularity(q),

where suggests(q, p) indicates whether package q suggests package p, and c is a normalizing constant ensuring \sum_q w_q = 1. ImportScore and DependencyScore are defined analogously over the import and dependency links.

We also notice that no user-level feature is given in the training data, so a per-user column is added to the training set, proportional to the number of packages the current user has installed:

UserScore(u) = c' \sum_q installed(u, q),

with c' again a normalizing constant.
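As an illustration of how such weighted in-link scores could be computed, the sketch below assumes a crawled edge table suggest_links with columns from and to (the package in from suggests the package in to) and a named popularity vector; these object and column names are hypothetical, not part of the contest data, and the loading sketch from section 1.2 is assumed.

    # Popularity-weighted SuggestionScore for every package (sketch; assumed inputs).
    # suggest_links: data frame with columns from, to  -- "from" suggests "to"
    # popularity:    named numeric vector, e.g. number of users installing each package
    suggestion_score <- function(suggest_links, popularity) {
      w <- popularity / sum(popularity)        # weights normalized to sum to 1
      # for each package, sum the weights of the packages that suggest it
      tapply(w[suggest_links$from], suggest_links$to, sum)
    }

    # Per-user feature: (normalized) share of rows the current user has installed.
    user_score <- with(installations, tapply(Installed == 1, User, mean))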

3. Models

In this section, I'll explain in detail the models that are used in the experiments.

3.1 The Baseline Model

The contest mentioned above provides a baseline model: an unregularized logistic regression on the probability of a user installing a package. The predictors are those provided in the original data set, with a log transformation applied to all the non-binary predictors, namely LogDependencyCount, LogSuggestionCount, LogImportCount, LogViewsIncluding, and LogPackagesMaintaining. This model achieves an AUC of 0.90. A variant that omits User, the dummy variable indicating which user the observation belongs to, achieves an AUC of only 0.81. This suggests that the probabilities with which different users install packages are quite different: some users might install nearly every package the site has to offer, while others install only a small portion.
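The baseline can be reproduced with R's built-in glm(). The sketch below assumes the installations data frame loaded earlier and uses log1p() for the log transformation to avoid log(0); the exact offset used by the contest baseline is not stated here, so treat the details as an assumption.

    # Unregularized logistic-regression baseline (sketch).
    installations <- transform(installations,
      LogDependencyCount     = log1p(DependencyCount),
      LogSuggestionCount     = log1p(SuggestionCount),
      LogImportCount         = log1p(ImportCount),
      LogViewsIncluding      = log1p(ViewsIncluding),
      LogPackagesMaintaining = log1p(PackagesMaintaining))

    baseline <- glm(Installed ~ LogDependencyCount + LogSuggestionCount + LogImportCount +
                      LogViewsIncluding + LogPackagesMaintaining +
                      CorePackage + RecommendedPackage + factor(User),
                    data = installations, family = binomial())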

3.2 Modified KNN

Ordinary KNN computes the similarity between packages, finds the K most similar neighbors of a package, and uses the majority vote of those K neighbors to decide the package's label. However, the link information between packages can also be exploited: related packages are more likely to be assigned to the same class. The modified KNN therefore penalizes a package when the packages related to it are not installed.
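A minimal sketch of the ordinary KNN step is given below, using cosine similarity between the numeric package features as described in section 4.2. The penalty applied to related-but-uninstalled packages in the modified version is not spelled out precisely in the paper, so only the plain vote is shown; the inputs X (one row per package) and labels are assumed.

    # Cosine-similarity KNN vote over packages (sketch; assumed inputs X, labels).
    # X:      numeric matrix of the numerical predictors, one row per package
    # labels: vector of +1 / -1 class labels for the same packages
    cosine_knn <- function(X, labels, k = nrow(X) - 1) {
      Xn  <- X / sqrt(rowSums(X^2) + 1e-12)    # l2-normalize each row
      sim <- Xn %*% t(Xn)                      # cosine similarity between all packages
      diag(sim) <- -Inf                        # a package does not vote for itself
      apply(sim, 1, function(s) {
        nb <- order(s, decreasing = TRUE)[1:k] # indices of the k nearest neighbors
        if (sum(labels[nb]) >= 0) 1 else -1    # majority vote
      })
    }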

3.3 L1 Regularized Logistic Regression

Logistic regression models the conditional probability of installing a package:

P(y = 1 \mid x) = \frac{1}{1 + \exp\!\left(-(\beta_0 + \beta^T x)\right)},

where x is the feature vector, y is the label, and \beta_0 and \beta are the coefficients. L1 regularized logistic regression maximizes the log-likelihood with an L1 penalty:

\max_{\beta_0, \beta} \left\{ \sum_{i=1}^{n} \left[ y_i (\beta_0 + \beta^T x_i) - \log\!\left(1 + \exp(\beta_0 + \beta^T x_i)\right) \right] - \lambda \sum_{j=1}^{p} |\beta_j| \right\}.
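This penalized fit is what the glmnet package cited in [7] computes. A minimal sketch follows, assuming the predictors of section 4.2 are available in the installations data frame; the design-matrix construction is illustrative rather than the paper's exact setup.

    library(glmnet)   # L1-penalized logistic regression, reference [7]

    # Build a numeric design matrix from the predictors (factor-like columns such as
    # User, Package and Maintainer are expanded into dummy columns).
    x <- model.matrix(Installed ~ ., data = installations)[, -1]
    y <- installations$Installed

    # alpha = 1 selects the pure L1 (lasso) penalty; lambda is chosen by
    # cross-validation, scored with AUC to match the paper's criterion.
    cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1, type.measure = "auc")
    prob  <- predict(cvfit, newx = x, s = "lambda.min", type = "response")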

An AUC-adaptive boosting procedure is used to boost the power of the learners. Given the feature values X, the vector of true class labels y, and the number of iterations T, the algorithm is as follows:

1. Initialize the observation weights w_i^{(1)} = 1/N, i = 1, 2, ..., N.
2. For t = 1 to T:
   a. Fit a classifier h_t(x) to the training data using the weights w^{(t)}.
   b. Obtain predictions from the classifier h_t(x).
   c. Compute err_t = 1 - AUC_t.
   d. Compute \alpha_t = \log[(1 - err_t)/err_t].
   e. For k = 1, 2, ..., N let \tilde{w}_k = w_k^{(t)} \exp(\alpha_t I(y_k \neq h_t(x_k))).
   f. Normalize: w_k^{(t+1)} = \tilde{w}_k / \sum_{j=1}^{N} |\tilde{w}_j|.
3. The predicted value for an input x is sign(\sum_{t=1}^{T} \alpha_t h_t(x)).
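The loop can be sketched in a few lines of R. The base learner is not pinned down above, so the sketch uses a weighted logistic regression as an assumed stand-in and computes the AUC with the rank (Mann-Whitney) formula; everything here is illustrative rather than the paper's exact implementation.

    # AUC of a score vector against 0/1 labels (Mann-Whitney rank formula).
    auc <- function(score, y01) {
      r  <- rank(score)
      n1 <- sum(y01 == 1); n0 <- sum(y01 == 0)
      (sum(r[y01 == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    }

    # AUC-adaptive boosting sketch; 'formula' must use a response present in 'data'.
    auc_boost <- function(formula, data, y01, rounds = 10) {
      n <- nrow(data)
      data$.w <- rep(1 / n, n)                                              # step 1
      alpha <- numeric(rounds); score <- rep(0, n)
      for (t in 1:rounds) {
        fit  <- glm(formula, data = data, family = binomial(), weights = .w) # a
        p    <- predict(fit, type = "response")                              # b
        err  <- 1 - auc(p, y01)                                              # c
        alpha[t] <- log((1 - err) / err)                                     # d
        miss <- as.numeric((p > 0.5) != (y01 == 1))
        data$.w <- data$.w * exp(alpha[t] * miss)                            # e
        data$.w <- data$.w / sum(data$.w)                                    # f
        score <- score + alpha[t] * ifelse(p > 0.5, 1, -1)
      }
      sign(score)                                                            # step 3
    }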


3.4 Naïve Bayes

Naïve Bayes assumes that, conditional on the class label, the features are independent of one another (the independence assumption), i.e.,

P(x \mid y) = \prod_{j=1}^{p} P(x_j \mid y),

so that by Bayes' rule

P(y \mid x) = \frac{1}{Z} \, P(y) \prod_{j=1}^{p} P(x_j \mid y),

where y is the class label, x is the feature vector, x_j is each individual feature, p is the total number of features, and Z is a normalizing factor ensuring P(y = 1 \mid x) + P(y = -1 \mid x) = 1. This posterior is calculated for each sample point, and y is assigned to the class with the higher probability. Despite the naive assumption of this model, it achieves a good AUC score in the experiments.

3.5 SVM

SVMs with linear, quadratic, and Gaussian kernels were tried, and the quadratic kernel returned the best AUC.

Linear kernel: K(x, x') = \langle x, x' \rangle.

Gaussian radial basis function kernel: K(x, x') = \exp(-\sigma \|x - x'\|^2).

Rational quadratic kernel: K(x, x') = 1 - \frac{\|x - x'\|^2}{\|x - x'\|^2 + c}.

The rational quadratic kernel is less computationally intensive than the Gaussian kernel.
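Both of these models are available in the e1071 package cited in [5]. A minimal sketch follows, again assuming the installations data frame; the polynomial kernel of degree 2 is used here as a stand-in for the "quadratic" kernel, since the exact kernel parameterization is not spelled out, and the predictor subset is illustrative.

    library(e1071)                             # reference [5]: naiveBayes() and svm()

    dat <- installations
    dat$Installed <- factor(dat$Installed)     # classification needs a factor response

    nb  <- naiveBayes(Installed ~ DependencyCount + SuggestionCount + ImportCount +
                        ViewsIncluding + CorePackage + RecommendedPackage,
                      data = dat)
    nb_prob <- predict(nb, dat, type = "raw")  # class posteriors

    # Fitting an SVM on the full data can be slow; subsample for a quick check.
    svq <- svm(Installed ~ DependencyCount + SuggestionCount + ImportCount +
                 ViewsIncluding + CorePackage + RecommendedPackage,
               data = dat, kernel = "polynomial", degree = 2, probability = TRUE)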

4. Experiment

4.1 Data Preprocessing

The training data contain a large number of rows with missing class labels (we do not know whether the user installed the package). Since there are roughly ten times more training rows in class 1 (not installed) than in class -1 (installed), the missing class labels are simply replaced by 1.

4.2 Predictors

In the logistic regression, the final predictors used in the experiments are: Package, User, DependencyScore, SuggestionScore, ImportScore, ViewsIncluding, CorePackage, RecommendedPackage, Maintainer, and PackagesMaintaining. The meaning of each predictor is explained in sections 1.2 and 2.

In KNN, the total number of sample points is used as K, and the cosine similarity between two packages is computed from the numerical predictors. For packages that are core packages, the class label is predicted to be -1 (indicating that the package will be installed) for all users.
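A short sketch of this preprocessing step is given below, assuming the class labels have already been recoded to +1 (not installed) / -1 (installed) in a column named y of the installations data frame; the column name and the fold assignment are illustrative, not part of the contest data.

    # 4.1: missing labels are treated as the majority class (+1 = not installed).
    installations$y[is.na(installations$y)] <- 1

    # Fold assignment for the five-fold cross-validation used in section 4.3.
    set.seed(1)
    installations$fold <- sample(rep(1:5, length.out = nrow(installations)))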

4.3 Results

Five-fold cross-validation AUC is used as the evaluation criterion. The results are summarized in Table 4.3.1.

Model                                            Average AUC in 5-fold cross-validation
Baseline model                                   0.9028
KNN with K = number of observations              0.9472
L1 regularized logistic regression + AdaBoost    0.9635
Naïve Bayes                                      0.9019
SVM with quadratic kernel                        0.9455

Table 4.3.1

5. Discussion

Among the models considered above, the L1 regularized logistic regression (with AdaBoost) gives the highest AUC. In this paper, only in-links, such as being suggested by other packages, are considered. In future studies, out-links, such as importing or suggesting other packages, could also be utilized. Classification might also be improved by classifying the packages within each topic separately.

REFERENCES
[1] Alexandros Karatzoglou and David Meyer. Support Vector Machines in R. Journal of Statistical Software, Volume 15, Issue 9, April 2006.
[2] Greg Ridgeway. Generalized Boosted Models: A Guide to the gbm Package. August 3, 2007.
[3] Jerome H. Friedman. Regularized Discriminant Analysis. Journal of the American Statistical Association, Vol. 84, No. 405 (Mar., 1989), pp. 165-175.
[4] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning.
[5] http://cran.r-project.org/web/packages/e1071/index.html
[6] http://www.cs.princeton.edu/~schapire/boost.html
[7] http://cran.r-project.org/web/packages/glmnet/index.html
[8] http://cran.r-project.org/


[9] The training data set and the open-source web crawler code were downloaded from: https://github.com/johnmyleswhite/r_recommendation_system
