Paper Okmf Carla2015

Accelerating kernel matrix factorization through
Theano GPGPU symbolic computing


No Author Given

No Institute Given

Abstract. This paper presents an efficient implementation of an online
kernel matrix factorization algorithm. Matrix factorization is an important
technique with many applications in data analysis but, like many machine
learning algorithms, it is highly computationally demanding. Furthermore,
the algorithm presented in this paper is a kernel-induced feature space
factorization method that is able to extract non-linear patterns. This
allows it to perform better than linear methods, but it also further
increases the computational requirements. To achieve an efficient
implementation of this algorithm, the paper presents a solution that
exploits the computing power of GPUs while taking advantage of the
high-level frameworks Theano and PyCUDA to keep the code simple and easy to
scale and maintain. The paper describes the set of strategies employed to
optimize each of the critical points in the algorithm, and presents an
experimental evaluation that shows the advantages of the proposed
architecture, with speedup factors of around 120X on the largest data set.

1 Introduction
Machine learning algorithms induce complex descriptive and predictive models
from data. They have been very successful in addressing challenging problems
in computer vision, speech recognition, data analysis and natural language
processing, among others. One of the main factors in this success is the
ability to train complex models using millions of data samples. In many
cases, this has been made possible by implementations that exploit general
purpose graphics processing units (GPGPU) to efficiently perform linear
algebra and other number processing tasks.
General Purpose (GP) computing on Graphics Processing Units (GPUs)
deals with the usage of GPUs to perform numerical computations other than
for graphical purposes. A GPU is a massively parallel device which exploits cer-
tain data structures to execute pipelines of instructions over individual data
components in parallel, such as performing the same operation over all pixels
of an image (for graphics computing) or all elements of a matrix (for general
purpose computing).
GPGPUs are programmed through low-level C/Fortran based APIs (OpenCL [11]
or CUDA [13]), in a development process that is costly in terms of effort
and required skills and that, on many occasions, becomes worthwhile only
because of the large accelerations obtained by these devices for certain
problems. In many cases, this cost constitutes a practical barrier to
exploiting the inherent power of GPGPUs, and many frameworks and libraries
have emerged to fill this gap, making GPGPU programming accessible to
different communities and knowledge areas. These are mostly bindings of the
low-level APIs to other languages (such as jCuda for Java [25], PyCUDA for
Python [16] and similar tools for R [5] and MATLAB [23], among others).
Recently, access to GPGPU power is being enabled at a higher level,
abstracting the programmer even from API bindings by automating GPGPU code
generation from domain-specific expressions. Among these, in the Python
arena one can find Theano [3] or NumbaPro [7], which provide different
levels of abstraction. In this paper we focus on Theano, which is a Python
library to define, evaluate and optimize mathematical expressions involving
multi-dimensional arrays efficiently. This is of particular utility in
machine learning, where one usually seeks to optimize mathematical
expressions over large data sets. Theano transparently generates
GPGPU-enabled code from symbolic mathematical expressions, allowing for a
certain degree of automatic symbolic manipulation (differentiation).
Naturally, this usually comes at a cost, and speedups obtained in Theano do
not always match those obtained using GPGPU language bindings (PyCUDA in
this case) or the low-level C/Fortran APIs.
The trade-off between acceleration, development effort and code
maintainability needs to be measured and understood for different problem
domains, and the goal of this paper is to shed light on the usage of
Theano for kernel matrix factorization.
Matrix factorization is an important machine learning tool, which is applied
to different kinds of machine learning problems including latent topic
analysis, recommender systems, blind source separation and clustering, among
others. The general idea behind matrix factorization is to find two matrices
whose product approximates an original matrix:

$$X \approx WH \qquad (1)$$

where $X \in \mathbb{R}^{n \times m}$, $W \in \mathbb{R}^{n \times r}$ and
$H \in \mathbb{R}^{r \times m}$. Usually $r \ll n$ or $r \ll m$, so the
factorization reveals the low rank of $X$, meaning that the data in $X$ may
be explained by a small number, $r$, of factors. The model found by
matrix factorization is clearly linear, and this could be an important
restriction in different real-world problems where the dependencies between
variables are non-linear. A strategy to address this restriction is to
perform kernel matrix factorization. The main idea is to still perform a
linear decomposition of the matrix, but in a higher-dimensional space induced
by a kernel. This is a well-known strategy in machine learning, known as the
kernel trick. The downside of this strategy is the high computational cost
introduced by the kernel formulation, since it requires, in principle, the
calculation of a kernel matrix whose dimension is n × n, where n is the
number of training samples.
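To give a sense of scale for the n × n kernel matrix, the following sketch
computes its storage cost. The helper name `kernel_matrix_bytes`, the
single-precision (4 bytes per value) storage and the sample counts are
illustrative assumptions, not values taken from the experiments.

```python
def kernel_matrix_bytes(n_samples, bytes_per_value=4):
    """Bytes needed to store a dense n x n kernel matrix (float32 by default)."""
    return n_samples * n_samples * bytes_per_value


# Storage grows quadratically with the number of training samples.
for n in (1_000, 10_000, 100_000):
    gigabytes = kernel_matrix_bytes(n) / 1e9
    print(f"n = {n:>7,d}: {gigabytes:8.3f} GB")
```

Under these assumptions, a data set of 100,000 samples would already need
about 40 GB for the dense kernel matrix alone.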
In this paper, we present an ecient implementation of an online kernel ma-
trix factorization algorithm. The implementation takes advantage of Theano and
PyCUDA to exploit the computing power of the GPU, while keeping the code
simple and easy to maintain. Dierent implementation strategies were system-
atically evaluated to determine the gains in time when compared to a baseline
implementation exclusively based on CPU.
The paper is organized as follows: Section 2 describes the online kernel matrix
factorization algorithm, Section 3 gives the details of the implementation of
the algorithm using Theano and PyCuda, Section 4 presents the experimental
evaluation, nally, conclusion and future work are discussed in Section 5.

2 Online kernel matrix factorization


Online kernel matrix factorization (OKMF) is a kernel-induced feature space
factorization method. Given that the memory and time required to compute and
store kernel matrices is $O(n^2)$, it is not possible to apply a standard
kernel-induced feature space factorization [26,8] to large-scale data sets.
OKMF overcomes the time and memory limitations with two strategies. The first
is imposing a budget restriction, i.e., restricting the number of samples
needed to represent the feature space basis. The second is using stochastic
gradient descent (SGD) [4] to compute the factorization, allowing OKMF to
scale linearly in time to large-scale data sets.
The factorization OKMF computes is $\Phi(X) = \Phi(B)WH$, where
$\Phi(X) \in \mathbb{R}^{n \times l}$ is the mapping of the problem space
into the feature space, $\Phi(B) \in \mathbb{R}^{n \times p}$ is the mapping
of the budget $B$ into the feature space, and $B$ satisfies the budget
restriction, i.e., $|B| \ll |X|$. $W \in \mathbb{R}^{p \times r}$ is a weight
matrix. Finally, $H \in \mathbb{R}^{r \times l}$ is the latent space
representation of every element of $\Phi(X)$. Computing this factorization
can be expressed as the following minimization problem:

$$\min_{W, h_i} J(W, h_i) = \min_{W, h_i} \|\Phi(x_i) - \Phi(B) W h_i\|_F^2 + \frac{\lambda}{2}\|W\|_F^2 + \frac{\alpha}{2}\|h_i\|_F^2 \qquad (2)$$
To solve the previous optimization problem with SGD, the following update
rules are stated:

$$h_i^t = \left(W_{t-1}^T k(B, B) W_{t-1} - \alpha I\right)^{-1} W_{t-1}^T k(B, x_i) \qquad (3)$$

$$W_t = W_{t-1} - \gamma \left( k(B, x_i) (h_i^t)^T - k(B, B) W_{t-1} h_i^t (h_i^t)^T + \lambda W_{t-1} \right) \qquad (4)$$
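The update rules can be sketched in NumPy as follows. This is a minimal CPU
illustration, not the implementation evaluated in this paper. It assumes a
regularized least-squares objective with a 1/2 factor on the reconstruction
term and uses the signs obtained by differentiating that objective (that is,
$+\alpha I$ in the linear system and a step along the negative gradient),
which differ from the rules as printed. The helper names, the RBF kernel
with $\sigma = 1$, the random initialization of W, the hyperparameter values
and the choice of the first rows as budget are all assumptions of the sketch.

```python
import numpy as np


def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def latent_code(W, K_B, k_x, alpha):
    """Closed-form h for a fixed W: solve (W^T K_B W + alpha I) h = W^T k_x."""
    r = W.shape[1]
    return np.linalg.solve(W.T @ K_B @ W + alpha * np.eye(r), W.T @ k_x)


def sgd_step(W, K_B, k_x, gamma, lam, alpha):
    """One stochastic update of W for a single sample; returns (W_new, h)."""
    h = latent_code(W, K_B, k_x, alpha)
    # Gradient w.r.t. W of 0.5*||phi(x) - phi(B) W h||^2 + lam/2 ||W||^2,
    # written only in terms of kernel evaluations.
    grad_W = -np.outer(k_x, h) + K_B @ W @ np.outer(h, h) + lam * W
    return W - gamma * grad_W, h


def okmf_fit(X, budget, r, gamma=0.01, lam=0.01, alpha=0.1, epochs=5, seed=0):
    """Apply the step to every sample for several epochs, then compute H."""
    rng = np.random.default_rng(seed)
    K_B = rbf_kernel(budget, budget)            # p x p, computed once
    W = 0.1 * rng.standard_normal((budget.shape[0], r))
    for _ in range(epochs):
        for x in X:
            k_x = rbf_kernel(budget, x[None, :])[:, 0]
            W, _ = sgd_step(W, K_B, k_x, gamma, lam, alpha)
    K_BX = rbf_kernel(budget, X)                # p x n
    H = np.linalg.solve(W.T @ K_B @ W + alpha * np.eye(r), W.T @ K_BX)
    return W, H


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
budget = X[:20]                                 # assumed budget selection
W, H = okmf_fit(X, budget, r=3)
print(W.shape, H.shape)
```

Only the p × p budget kernel and one p-dimensional kernel vector per sample
are needed inside the loop, which is what keeps the memory cost independent
of the number of training samples.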

With the optimization rules found in Equations 3 and 4, Algorithm 1 is
proposed to compute the factorization.

Algorithm 1 Online kernel matrix factorization
procedure OKMF(X, budget, W, γ, λ, α, epochs)
    K_B ← k(budget, budget)
    for e ← 1, epochs do
        for all x_i ∈ X do
            k_{x_i} ← k(budget, x_i)
            h_i ← (W^T K_B W − αI)^{-1} W^T k_{x_i}
            W ← W − γ(k_{x_i} h_i^T − K_B W h_i h_i^T + λW)
        end for
    end for
    for all x_i ∈ X do
        k_{x_i} ← k(budget, x_i)
        H_i ← (W^T K_B W − αI)^{-1} W^T k_{x_i}
    end for
    return W and H
end procedure

3 Accelerating OKMF with GPU


3.1 GPU Development for machine learning
GPUs provide massive computational resources due to their extremely parallel
architecture, composed of hundreds or even thousands of cores (streaming
processors) that can collectively run thousands of computing threads. This
prompted the General Purpose GPU (GPGPU) movement, which led researchers
around the world and across several scientific and engineering disciplines
to use GPUs to speed up their codes. Unfortunately, GPGPU was challenging,
due to the complexity of translating the algorithms into the graphics
primitives provided. In order to make the computational power of GPUs more
accessible and less complex to use, Nvidia developed CUDA (Compute Unified
Device Architecture) [20], which is a parallel computing platform and
programming model that allows users to ignore the underlying graphical
concepts and write the algorithms in natural C/C++ or Fortran code. In
addition, with the aim of making the implementation and testing of new
mathematical models and algorithms faster and easier, the machine learning
community has developed several computing frameworks: PyCUDA [15], Theano
[2,3], Pylearn2 [10] and Torch7 [6], among others.

All these frameworks provide a high level of abstraction by managing
resources automatically and by using high-level languages that enable
various programming paradigms (such as functional programming and
object-oriented programming, among others). Both PyCUDA and Theano are based
on the Python programming language, from which CUDA code is generated
automatically. One of the outstanding features of Theano is its capability
to manipulate and optimize graphs representing symbolic mathematical
expressions, including symbolic differentiation, which allows users to
quickly implement machine learning models based on gradient descent without
manually deriving the gradient. Pylearn2 is another machine learning
framework, based on Theano, that consists of several components that can be
combined to implement a complete learning algorithm.

Pylearn2 is the quickest way to get started with standard algorithms based
on neural networks and deep learning in Python. This framework presents a
higher level of abstraction that makes it easier to design an entire
experimental setup, but it also implies a loss of flexibility and control
when implementing new algorithms. Like the frameworks presented above,
Torch7 is based on a scripting programming language (Lua) and uses heavily
optimized scientific computation libraries. Torch7 was intended as a
Matlab-like environment for machine learning that allows quick and easy
development of algorithms that can be easily extended. One of the advantages
of Torch7 is the use of Lua, which presents lower interpreter overhead than
Python. At the same time, Lua has the drawback that it is not as popular and
mature as Python, so it is supported by a smaller scientific community.

3.2 Theano OKMF


One of the important characteristics of Theano is its compatibility with the
well-known Python library NumPy, from which it even inherits its syntax;
NumPy is one of the mainstream libraries used in scientific computing to
make complex numerical computation tasks easier to handle. This feature
gives us great flexibility to construct GPU modules with Theano that
interoperate smoothly with Python modules that use NumPy. With the aim of
constructing fast prototype code for novel ML algorithms that can exploit
the potential of GPUs, we chose Theano for the following reasons: the
ability to code in Python, which allows us to produce simpler and reusable
code; the abstraction of the GPU, which avoids the complexity of handling
the memory (one of the main concerns when coding in CUDA); and the inclusion
of parallel programming patterns, which are automatically dealt with by the
library.
However, Theano's abstraction has a big drawback: the internal CUDA kernel
code is not visible, so it acts like a black box that does not allow
performance to be enhanced through small tweaks for the specific application
under construction. Given this, our goal was to analyse not only whether
Theano can improve the performance over the Python code that uses NumPy to
solve OKMF, but also whether the tools provided by Theano make it possible
to accelerate the initial implementation even further. One of these tools is
the scan pattern, also known as Parallel Prefix Sum [12], which allows
algorithms that seem inherently sequential to be computed in parallel.

Theano as a symbolic calculation tool. One of the remarkable features of
Theano is its ability to perform symbolic calculations. This is done by
building a graph of the operations that need to be performed to achieve a
calculation; the graph is then optimized to avoid unnecessary operations,
and finally it is compiled to generate code for either the CPU or the GPU,
depending on the target chosen by the user. In addition, symbolic
manipulations of numeric functions, such as gradient calculation, are
performed automatically. This is particularly useful for implementing
machine learning algorithms based on gradient descent optimization, as is
the case for OKMF.
We show a simple code that computes the derivative $dy/dx$ of the function
$f(x) = ax^2$, together with the graph of operations that the GPU performs.
As we can see in Figure 1, the GPU allocates the variables a and X from the
host and computes an element-wise multiplication of a, X and the constant 2
(the constant that appears in the derivative of $ax^2$), then sends the
answer back to the host.
    import theano.tensor as T
    from theano import function

    a = T.fscalar(name='a')      # symbolic float32 scalars
    X = T.fscalar(name='X')
    Y = a * X ** 2               # symbolic expression f(X) = a * X^2
    gy = T.grad(Y, X)            # symbolic derivative dY/dX = 2 * a * X
    derivate = function(inputs=[a, X], outputs=gy)  # compile the graph
    print(derivate(5, 2))        # evaluates 2 * a * X at a = 5, X = 2

Fig. 1. Theano's graph of the gradient

All Theano functions behave in a similar way: we define a Theano function
with a series of symbolic variables, then define the respective operation
and launch the Theano function like any other function in Python. The
returned values are NumPy arrays, which makes it easy to link with code
based on NumPy.

Parallel Prefix Sum over Theano (Scan Pattern). Theano also offers an
interface that implements parallel patterns like map-reduce and scan. The
latter is of special interest because it allows repetitive sequential
operations, like the typical loop, to be performed in parallel, dramatically
improving the efficiency of the computations.
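As an illustration of the parallel prefix sum idea [12] (not of how Theano
implements scan internally), the sketch below computes a cumulative sum in
about log2(n) data-parallel passes instead of n sequential additions. The
function name `inclusive_scan` and the use of NumPy array operations in
place of GPU threads are assumptions of this example.

```python
import numpy as np


def inclusive_scan(values):
    """Inclusive prefix sum using doubling shifts (Hillis-Steele style)."""
    out = np.array(values, dtype=float)      # work on a copy
    shift = 1
    while shift < out.size:
        # Every element adds the partial sum located `shift` positions to
        # its left; on a GPU all these additions would run in parallel.
        shifted = np.zeros_like(out)
        shifted[shift:] = out[:-shift]
        out = out + shifted
        shift *= 2
    return out


data = np.arange(1, 9)
print(inclusive_scan(data))
print(np.cumsum(data))                       # sequential reference result
```

Each pass of the loop is a single element-wise operation over the whole
array, which is the property that makes the pattern suitable for a
massively parallel device.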

3.3 Theano + PyCuda OKMF


Theano implements several linear algebra operations; however, it does not
provide functions to solve linear systems on the GPU. Currently, solving a
linear system in Theano requires moving the data to the host main memory and
using the NumPy library directly. This implies many expensive host-to-device
and device-to-host transfers.
Version 7 of the CUDA Toolkit from NVIDIA introduces a new module called
cuSOLVER, based on the well-known cuBLAS library [21], that allows us to
solve linear systems for dense and sparse matrices; however, this is a CUDA
library. With the aim of taking advantage of this new tool, we built a
wrapper based on PyCUDA [14] that allows these functions to be called
directly from the Theano code.

4 Experimental Evaluation
4.1 Experimental design
We divide the algorithm into several computational steps with the aim of
evaluating its behavior under different circumstances, as follows:

- Kernel matrices computation
- Solving for $h_i^t$ given a minibatch, to analyze the performance of the
  PyCUDA wrapper in conjunction with Theano
- Fitting time for n epochs (the SGD steps, discounting the time of
  computing h)

For every experiment we use as baseline a Python implementation of OKMF
using NumPy. This implementation is run using an optimized distribution of
Python provided by Continuum that uses Intel's MKL [17]. MKL transparently
uses multiple cores to run linear algebra operations.

Kernel matrices computation. For the kernel calculation we use a radial
basis function (RBF) kernel:

$$K(x, x') = e^{-\left(\frac{\|x - x'\|^2}{2\sigma^2}\right)} \qquad (5)$$

We take as the baseline the naive implementation of the RBF kernel, which
performs a row-to-row subtraction of the matrices x and x', and run three
experiments varying the number of rows or columns depending on the case:

- Asymmetric rows matrices (matrices with $m \gg n$, where m is the number
  of rows)
- Asymmetric columns matrices (matrices with $n \gg m$)
- Square matrices

We compare the baseline with the Scikit-Learn implementation [22], a
NumbaPro implementation, and several versions of the implementation in
Theano. These include versions using Python loops to perform the iterations,
with and without keeping the results on the GPU; a mem version that computes
the whole kernel in a single tensor operation that is costly in terms of
memory; and finally a version using the scan pattern provided by Theano.
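The difference between the naive row-to-row baseline and the mem variant
can be sketched on the CPU with NumPy. This is only an illustration of the
two access patterns: the function names `rbf_kernel_naive` and
`rbf_kernel_mem`, the matrix sizes and the value of sigma are assumptions,
and the code is not the Theano implementation benchmarked here.

```python
import numpy as np


def rbf_kernel_naive(X, Y, sigma):
    """Row-to-row version: one pass over Y for every row of X."""
    K = np.empty((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        diff = Y - x                                  # (m, d) differences
        K[i] = np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * sigma ** 2))
    return K


def rbf_kernel_mem(X, Y, sigma):
    """Single tensor operation: builds an (n, m, d) array of differences."""
    diff = X[:, None, :] - Y[None, :, :]              # memory-hungry tensor
    return np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * sigma ** 2))


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
Y = rng.standard_normal((150, 10))
K_naive = rbf_kernel_naive(X, Y, sigma=1.5)
K_mem = rbf_kernel_mem(X, Y, sigma=1.5)
print(K_naive.shape, np.allclose(K_naive, K_mem))
```

The intermediate tensor of the second function has n × m × d entries, which
is why this style of computation becomes costly in memory as the number of
features grows.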
For this experimental stage, we used an Nvidia Tesla K40 GPU versus a Xeon
X5570 CPU, and a conditional stop was implemented for when the GPU memory is
exhausted or the time limit of 60 seconds for one iteration is surpassed.
The final result is the average of three computations under the same
conditions. We chose this GPU for its memory capacity, in order to explore
the mem scheme in depth; using another GPU shows that the impact on overall
performance is preserved, but the lack of memory does not allow us to
explore the characteristics of mem.

Solving the linear system. We solve systems of linear equations of the form
$AX = B$. As baseline we use the linear algebra package from NumPy, which
internally performs the computation using the LAPACK routine _gesv, based on
the LU factorization. This is compared with the same routine using the
wrapper for cuSOLVER, and additionally with a solver based on the Cholesky
factorization, as discussed in [19,1]. To use the Cholesky factorization the
matrix must be symmetric positive definite, which means that all its
eigenvalues must be positive; the kernel matrix is positive semidefinite, as
shown next, and hence positive definite whenever it is non-singular.
A matrix $A$ is positive semidefinite if $y^T A y \geq 0$ for every vector
$y$. Let $K$ be the kernel matrix defined as $K = X^T X$ and let $A = K$;
then $y^T X^T X y = \|Xy\|^2 \geq 0$.
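The two solution routes can be compared on the CPU with a short NumPy
sketch. The function names, the matrix sizes and the small ridge added to
the Gram matrix (to make it strictly positive definite) are assumptions of
this example; it does not use cuSOLVER or the PyCUDA wrapper.

```python
import numpy as np


def solve_lu(A, B):
    """General dense solve; NumPy relies on LAPACK's LU-based gesv routine."""
    return np.linalg.solve(A, B)


def solve_cholesky(A, B):
    """Solve A X = B for a symmetric positive definite A via A = L L^T."""
    L = np.linalg.cholesky(A)            # lower-triangular factor
    # Two solves with the triangular factors. np.linalg.solve does not
    # exploit the triangular structure; a dedicated triangular solver
    # would normally be used for these two steps.
    Y = np.linalg.solve(L, B)            # L Y = B
    return np.linalg.solve(L.T, Y)       # L^T X = Y


rng = np.random.default_rng(0)
M = rng.standard_normal((50, 8))
A = M.T @ M + 1e-3 * np.eye(8)           # Gram matrix plus a small ridge
B = rng.standard_normal((8, 100))        # many right-hand sides, one per column
X_lu = solve_lu(A, B)
X_chol = solve_cholesky(A, B)
print(np.allclose(X_lu, X_chol))
```

Note that A is factorized once and the factors are reused for every column
of B, which mirrors the setting in which a single matrix is solved for many
examples.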

We then perform the computations for square and asymmetric matrices. In the
experiment with asymmetric matrices, the matrix A is kept at a fixed size
(1260×1260), which is the size of the square matrices (A, B) at which the
Tesla C2050 GPU performs equivalently to the Xeon E5645 CPU, and the number
of columns of the matrix B is then incremented linearly (which is equivalent
to factorizing a single matrix A and solving it for many examples $b_i$). We
chose the Tesla C2050 for this and the rest of the experiments because it is
a GPU that is widespread across many hardware infrastructures, unlike the
Tesla K40.

Fitting the model. The SGD is performed in this step, and the baseline is
again the implementation in NumPy. The time spent solving the equation
system is discounted in order to distinguish between the improvements in one
task and the other. We run the benchmark with 4 different combinations of
the GPU implementation of OKMF, as follows:

- GPU+SOLV_LU: Solves the system of linear equations on the GPU, and
  performs the kernel calculation and fitting on the GPU
- GPU+SOLV_CHOL: Same as the above, but solves with the Cholesky
  factorization instead
- GPU+SOLV_CPU: Solves the system of linear equations on the CPU, but
  performs the kernel calculation and fitting on the GPU
- GPU+CHOL+KERN_CPU: Calculates the Gaussian kernel on the CPU and performs
  the rest of the operations on the GPU

We then train OKMF using the sensit-vehicle-seismic data set [24]. The data
set has 98528 examples and 51 attributes; we create incremental slices of
the data set with a step of 500 examples and take as the measure the average
time of 5 executions per slice. We also use a budget of 500 and a minibatch
size of 1000 examples.
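The measurement protocol described above can be sketched as a small timing
harness. The helper names, the reading of the incremental slices as prefixes
of growing size (dropping the final partial step) and the use of
`time.perf_counter` are assumptions of this example, not details taken from
the benchmark code.

```python
import time


def incremental_slice_sizes(n_examples, step):
    """Sizes step, 2*step, ... that fit inside a data set of n_examples rows."""
    return list(range(step, n_examples + 1, step))


def mean_runtime(fn, repeats=5):
    """Average wall-clock time of fn() over a number of repeated calls."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)


sizes = incremental_slice_sizes(98528, 500)
print(len(sizes), sizes[0], sizes[-1])

# Stand-in workload; a real run would train OKMF on the first `size` rows.
seconds = mean_runtime(lambda: sum(range(10_000)), repeats=5)
print(f"{seconds:.6f} s per call")
```

Averaging several executions per slice reduces the influence of one-off
effects such as compilation or cache warm-up on the reported times.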

4.2 Results and discussion

Fig. 2. Speedups for RBF Kernel Computation

Kernel matrices computation. As we can see in Figure 2, the speedups do not
hold across the different scenarios, and some interesting behaviors can be
observed:

- If the matrices are small, the best performance is achieved by the mem
  approach, but if the number of features of the data set (columns) is
  large, it rapidly exhausts the GPU's memory (12 GB for the Tesla K40)
- Theano using Python for loops gives very poor results, to the point that
  it tends to reach the timeout rapidly
- If the number of features is large enough, the Scikit-Learn implementation
  outperforms all GPU implementations
- The implementation that uses the scan pattern behaves well in most of the
  scenarios
- In ML tasks it is usual to have a much larger number of examples (rows)
  than features (columns); therefore, in the asymmetric rows case it is
  worthwhile to perform the calculation on the GPU

Solving the equation system. Figure 3 shows the speedup of our wrapper over
the NumPy with MKL implementation. As we can see, it is worthwhile only if
the size of the matrix A is greater than 1260. The benchmark for asymmetric
matrices shows only a small benefit beyond this size, showing that the real
improvement comes from the factorization and not so much from solving the
system with the triangular matrices, for either the LU or the Cholesky
factorization; as expected, the Cholesky solver performs slightly better.

Fig. 3. Speedups solving equation systems

Fitting the model. Finally, we put everything together and measure the times
of each of the tasks presented above; the sums of those times are
represented in Figure 4. Note that the schemes that use the CPU are the
worst of the experimental set and, more importantly, computing the solution
of the linear equation systems on the CPU is the bottleneck in the
experiments. On the other hand, the implementation that uses the wrapper
outperforms the baseline by a factor of more than 100X.
Table 2 presents the speedup factors achieved for different well-known data
sets (Abalone [9], MNIST [18] and SensIT [24]) when the GPU implementation
is compared with the reference CPU code implemented with NumPy. Table 1
lists the respective properties of each data set.

Fig. 4. Total speedups for OKMF implemented in GPU

Table 1. Properties of the different data sets

Dataset    Number of examples    Number of features
Abalone    4177                  8
MNIST      60000                 784
SensIT     95528                 50

Table 2. Speedups for several data sets

Dataset    SOLV_LU    SOLV_CHOL    SOLV_CPU    CHOL+KERN_CPU
Abalone    28.71X     28.96X       8.52X       8.16X
MNIST      60.69X     60.64X       13.44X      28.48X
SensIT     117.11X    119.93X      12.05X      33.46X

Note that the reward of using the GPU implementation grows as the amount of
data gets bigger. As we can see, there is a clear pattern in the speedups
shown in Table 2, where the biggest improvements are precisely in the case
of the SensIT data set, the largest of the data sets used. This is important
to point out because our main focus is to build fast, scalable and
maintainable prototyping code for several ML algorithms, and scalability is
one of the main concerns nowadays, since the amount of available data keeps
growing over time.
The difference in improvement between the methods of solving the equation
system (LU factorization and Cholesky factorization) is not clear: for these
data set sizes it appears not to be meaningful, except in the case of SensIT,
where the difference is barely perceptible. This is also an important
observation, because if we need to implement a novel ML algorithm and we
cannot ensure that the matrix to factorize will be positive semidefinite,
the penalty for not being able to use the Cholesky factorization method is
almost zero, at least for these data sizes.
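The size of the gap between the two solvers can be made concrete with a
short calculation over the SOLV_LU and SOLV_CHOL columns of Table 2. The
helper name `relative_gain` and the choice of expressing the gap as a
percentage of the LU speedup are assumptions of this example.

```python
# Speedup factors copied from Table 2 (SOLV_LU and SOLV_CHOL columns).
speedups = {
    "Abalone": (28.71, 28.96),
    "MNIST": (60.69, 60.64),
    "SensIT": (117.11, 119.93),
}


def relative_gain(lu, chol):
    """Relative gain of the Cholesky variant over the LU variant, in percent."""
    return 100.0 * (chol - lu) / lu


gains = {name: relative_gain(lu, chol) for name, (lu, chol) in speedups.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:+.2f}%")
```

Measured this way, the gap stays below 3% on all three data sets.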

5 Conclusions

This paper presented an efficient implementation of the online kernel matrix
factorization algorithm that takes advantage of Theano and PyCUDA to exploit
the computing power of GPUs, while keeping the code simple and easy to
maintain. The paper described each of the strategies employed to optimize
the critical points in the algorithm. The experiments showed that it is
possible to obtain a speedup of around 120X for a data set of about 100,000
samples. An important conclusion is that Theano is an effective alternative
for exploiting the computing power of the GPU without dealing with the
complexities of low-level GPU programming. However, as the experiments show,
just using Theano does not guarantee immediate good performance; in fact, in
some cases a Theano implementation using the GPU may be slower than a
straightforward CPU implementation.
Because of this, different strategies were proposed in order to maintain a
trade-off between memory usage and speed. The best performance was obtained
when Theano was integrated with PyCUDA, which allowed us to solve the system
of linear equations without having to transfer the matrices from the GPU
memory to the main memory, saving a significant amount of time.
Finally, the reported results show that at each stage of the algorithm
remarkable speedup factors are achieved when the GPU implementation is
compared with the reference CPU code.
As future work we propose to study whether the behaviors shown above are
preserved when the data sets are sparse, what techniques are necessary to
handle data with these characteristics, and how to maintain or even exceed
the performance gain achieved in this paper.

References
1. Sergio Barrachina, Maribel Castillo, Francisco D Igual, Rafael Mayo, and
   Enrique S Quintana-Ortí. Solving dense linear systems on graphics
   processors. In Euro-Par 2008 - Parallel Processing, pages 739-748.
   Springer, 2008.
2. Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J.
   Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio.
   Theano: new features and speed improvements. 2012.
3. James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin,
   Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley,
   and Yoshua Bengio. Theano: a CPU and GPU math expression compiler.
   In Proceedings of the Python for Scientific Computing Conference (SciPy),
   June 2010. Oral Presentation.
4. Léon Bottou. Large-scale machine learning with stochastic gradient
   descent. In Proceedings of COMPSTAT'2010, pages 177-186. Springer, 2010.
5. Joshua Buckner, Justin Wilson, Mark Seligman, Brian Athey, Stanley
   Watson, and Fan Meng. The gputools package enables GPU computing
   in R. Bioinformatics, 26(1):134-135, 2010.
6. Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A
   Matlab-like environment for machine learning. In BigLearn, NIPS
   Workshop, number EPFL-CONF-192376, 2011.
7. Continuum Analytics. NumbaPro: Continuum's CUDA-based API for
   writing CUDA code in Python. http://docs.continuum.io/numbapro/,
   2015. Accessed: 2015-04-2.
8. Chris Ding, Tao Li, and Michael I Jordan. Convex and semi-nonnegative
   matrix factorizations. Pattern Analysis and Machine Intelligence, IEEE
   Transactions on, 32(1):45-55, 2010.
9. Center for Machine Learning and Intelligent Systems. Abalone Data Set.
   https://archive.ics.uci.edu/ml/datasets/Abalone. Accessed: 2015-04-16.
10. Ian J. Goodfellow, David Warde-Farley, Pascal Lamblin, Vincent
    Dumoulin, Mehdi Mirza, Razvan Pascanu, James Bergstra, Frédéric Bastien,
    and Yoshua Bengio. Pylearn2: a machine learning research library. arXiv
    preprint arXiv:1308.4214, 2013.
11. Khronos OpenCL Working Group et al. OpenCL - the open standard for
    parallel programming of heterogeneous systems. [Online]
    http://www.khronos.org/opencl, 2011.
12. Mark Harris, Shubhabrata Sengupta, and John D Owens. Parallel prefix
    sum (scan) with CUDA. GPU Gems, 3(39):851-876, 2007.
13. David Kirk et al. Nvidia CUDA software and GPU parallel computing
    architecture. In ISMM, volume 7, pages 103-104, 2007.
14. Andreas Klöckner. PyCUDA. Courant Institute of Mathematical Sciences,
    New York University, [Cited 2012-01-08]. Available at
    http://mathema.tician.de/software/pycuda, 2011.
15. Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul
    Ivanov, and Ahmed Fasih. PyCUDA and PyOpenCL: A scripting-based
    approach to GPU run-time code generation. Parallel Computing,
    38(3):157-174, 2012.
16. Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul
    Ivanov, Ahmed Fasih, AD Sarma, D Nanongkai, G Pandurangan, P Tetali,
    et al. PyCUDA: GPU run-time code generation for high-performance
    computing. Arxiv preprint arXiv, 911, 2009.
17. Vipin Kumar. Numpy/Scipy with Intel MKL.
    https://software.intel.com/en-us/articles/numpyscipy-with-intel-mkl,
    2012. Accessed: 2015-06-9.
18. Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.
    Gradient-based learning applied to document recognition. Proceedings of
    the IEEE, 86(11):2278-2324, 1998.
19. Maxime Legendre, Albrecht Schmidt, Saïd Moussaoui, and Uwe Lammers.
    Solving systems of linear equations by GPU-based matrix factorization in
    a science ground segment. Astronomy and Computing, 3:58-64, 2013.
20. D. Luebke. CUDA: Scalable parallel programming for high-performance
    scientific computing. In Biomedical Imaging: From Nano to Macro, 2008.
    ISBI 2008. 5th IEEE International Symposium on, pages 836-838, May
    2008.
21. CUDA Nvidia. cuBLAS library. NVIDIA Corporation, Santa Clara,
    California, 15, 2008.
22. Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel,
    Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer,
    Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in
    Python. The Journal of Machine Learning Research, 12:2825-2830, 2011.
23. Jill Reese and Sarah Zaranek. GPU programming in MATLAB. MathWorks
    News&Notes. Natick, MA: The MathWorks Inc, pages 225, 2012.
24. Machine Learning Data Set Repository. SensIT Vehicle (seismic) dataset.
    http://mldata.org/repository/data/viewslug/sensit-vehicle-seismic/.
    Accessed: 2015-04-16.
25. Yonghong Yan, Max Grossman, and Vivek Sarkar. JCUDA: A
    programmer-friendly interface for accelerating Java programs with CUDA.
    In Euro-Par 2009 Parallel Processing, pages 887-899. Springer, 2009.
26. Daoqiang Zhang, Zhi-Hua Zhou, and Songcan Chen. Non-negative matrix
    factorization on kernels. In PRICAI 2006: Trends in Artificial
    Intelligence, pages 404-412. Springer, 2006.
