1 Introduction
Machine learning algorithms induce complex descriptive and predictive models
from data. They have been very successful at addressing challenging problems in
computer vision, speech recognition, data analysis and natural language processing,
among others. One of the main factors in this success is the ability to train complex
models using millions of data samples. In many cases, this has been possible
through the implementation of algorithms that exploit general purpose graphics
processing units (GPGPU) to efficiently perform linear algebra and other number
processing tasks.
General Purpose (GP) computing on Graphics Processing Units (GPUs)
refers to the use of GPUs to perform numerical computations other than
graphics. A GPU is a massively parallel device which exploits certain data
structures to execute pipelines of instructions over individual data components
in parallel, such as performing the same operation over all pixels of an image
(for graphics computing) or all elements of a matrix (for general purpose
computing).
GPGPUs are programmed through low-level C/Fortran-based APIs (OpenCL
[11] or CUDA [13]) through a development process that is costly in terms of effort
and required skills and which, on many occasions, becomes worthwhile only because
of the large accelerations these devices deliver for certain problems. In many
cases, however, this cost constitutes a practical barrier to exploiting the inherent
power of GPGPUs, and many frameworks and libraries have emerged to fill
this gap, making GPGPU programming accessible to different communities and
knowledge areas. These are mostly bindings of the low-level APIs to other
languages (such as jCuda for Java [25], PyCUDA for Python [16], and similar tools
for R [5] and MATLAB [23], among others).
Recently, access to GPGPU power is being enabled at a higher level, abstracting
the programmer even from API bindings by automating GPGPU code generation
from domain-specific expressions. Among these, one can find in the Python
arena Theano [3] and NumbaPro [7], which provide different levels of abstraction.
In this paper we focus on Theano, a Python library to define, evaluate and
optimize mathematical expressions involving multi-dimensional arrays efficiently.
This is of particular utility in machine learning, where one usually seeks to
optimize mathematical expressions over large data sets. Theano transparently
generates GPGPU-enabled code from symbolic mathematical expressions, allowing
for a certain degree of automatic symbolic manipulation (differentiation).
Naturally, this usually comes at a cost, and the speedups obtained with Theano do
not always match those obtained using GPGPU language bindings (PyCUDA in
this case) or the low-level C/Fortran APIs.
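As a minimal sketch of this programming model (our own toy example, not code from this paper's benchmarks), the following defines a symbolic quadratic loss, asks Theano for its gradient, and compiles both into a callable function:

import theano
import theano.tensor as T

# Symbolic input: a vector of unspecified length.
x = T.vector("x")

# Symbolic expression and its automatically derived gradient.
loss = T.sum(x ** 2)
grad = T.grad(loss, x)

# Compilation step: Theano optimizes the expression graph and, if a GPU
# is configured, transparently emits GPU-enabled code.
f = theano.function([x], [loss, grad])

print(f([1.0, 2.0, 3.0]))  # -> [14.0, [2.0, 4.0, 6.0]]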
The trade-off between acceleration, development effort and code maintainability
is something that needs to be measured and understood for different problem
domains, and the goal of this paper is to shed light on the usage of Theano for
kernel matrix factorization.
Matrix factorization is an important machine learning tool, applied to different
kinds of machine learning problems including latent topic analysis, recommender
systems, blind source separation and clustering, among others. The general idea
behind matrix factorization is to find two matrices whose product approximates
an original matrix:

X ≈ WH    (1)

In kernel matrix factorization the samples are implicitly mapped to a feature
space through a function Φ, and the factorization is computed there; with B a
set of basis samples, the following regularized objective is minimized for each
sample x_i:

$\min_{W,h_i} J(W, h_i) = \min_{W,h_i} \|\Phi(x_i) - \Phi(B)Wh_i\|_F^2 + \frac{\lambda}{2}\|W\|_F^2 + \frac{\alpha}{2}\|h_i\|_F^2$    (2)
To solve the previous optimization problem with SGD, the update rule for the
code vector $h_i$ at step $t$ is stated as

$h_i^t = \left(W_{t-1}^T\, k(B, B)\, W_{t-1} + \alpha I\right)^{-1} W_{t-1}^T\, k(B, x_i)$    (3)
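Since Eq. (3) is a regularized least-squares solve, it maps directly onto a few dense linear algebra calls. The following NumPy sketch (function and variable names are ours, for illustration only) computes the update, writing K_BB for k(B, B) and k_Bx for k(B, x_i):

import numpy as np

def update_h(W, K_BB, k_Bx, alpha):
    """One closed-form update of the code vector h_i (Eq. 3).

    W     : (n_basis, r) current factor matrix
    K_BB  : (n_basis, n_basis) kernel matrix k(B, B)
    k_Bx  : (n_basis,) kernel vector k(B, x_i)
    alpha : regularization weight
    """
    r = W.shape[1]
    A = W.T @ K_BB @ W + alpha * np.eye(r)
    # Solving the small r x r system is cheaper and more stable than
    # forming the explicit inverse written in Eq. (3).
    return np.linalg.solve(A, W.T @ k_Bx)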
The steadily increasing computational power of graphics hardware gave rise to
the General Purpose GPU (GPGPU) movement, which led researchers around
the world and across several scientific and engineering disciplines to use GPUs
to speed up their codes. Unfortunately, GPGPU programming was challenging due
to the complexity of translating algorithms into the graphics primitives provided.
To make the computational power of GPUs more accessible and less complex
to use, Nvidia developed CUDA (Compute Unified Device Architecture) [20],
a parallel computing platform and programming model that allows users to
ignore the underlying graphical concepts and write their algorithms in natural
C/C++ or Fortran code. In addition, with the aim of making the implementation
and testing of new mathematical models and algorithms faster and easier, the
machine learning community has developed several computing frameworks:
PyCUDA [15], Theano [2,3], Pylearn2 [10] and Torch 7 [6], among others.
Parallel Prefix Sum over Theano (Scan Pattern) Theano also offers an
interface that implements parallel patterns such as map-reduce and scan. The
latter is of special interest because it allows repetitive sequential operations,
like the typical loop, to be expressed inside the compiled computation graph and
executed in parallel, dramatically improving the efficiency of the computations.
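As a small sketch of the pattern (our own example, with a toy step function), the following code uses theano.scan to compute a running prefix sum over a vector, carrying the accumulator from one step to the next instead of looping in Python:

import numpy as np
import theano
import theano.tensor as T

v = T.vector("v")

# Step function: receives the current sequence element and the
# accumulator carried over from the previous iteration.
def step(x, acc):
    return acc + x

outputs, updates = theano.scan(
    fn=step,
    sequences=v,
    outputs_info=np.asarray(0.0, dtype=theano.config.floatX),
)

prefix_sum = theano.function([v], outputs, updates=updates)
print(prefix_sum(np.arange(5, dtype=theano.config.floatX)))
# -> [ 0.  1.  3.  6. 10.]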
4 Experimental Evaluation
4.1 Experimental design
We divide the algorithm into several computational steps with the aim of
evaluating its behavior under different circumstances, as follows:
Kernel matrices computation For the kernel calculation we use a radial basis
function (RBF)

$K(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}}$    (5)

We take as the baseline the naive implementation of the RBF kernel, which
computes a row-to-row subtraction between the matrices x and x', and run three
experiments varying the number of rows or columns, depending on the case.
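The kind of naive baseline we refer to can be sketched as follows (our own illustrative code, not the exact benchmark implementation):

import numpy as np

def rbf_kernel_naive(X, Y, sigma):
    """Baseline RBF kernel (Eq. 5): explicit row-to-row subtractions."""
    K = np.empty((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            d = x - y  # row-to-row difference
            K[i, j] = np.exp(-d.dot(d) / (2.0 * sigma ** 2))
    return K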
Solving the linear system We solve systems of linear equations of the form
AX = B. As the baseline we use the linear algebra package from NumPy, which
internally performs the computation with the LAPACK routine _gesv, based on
the LU factorization. We compare it with the same routine executed through the
wrapper for cuSOLVER, and additionally with a solver based on the Cholesky
factorization, as discussed in [19,1]. To use the Cholesky factorization the
matrices must be positive semi-definite, meaning that all eigenvalues must be
non-negative, which holds for the kernel matrix as follows: a matrix $A$ is
positive semi-definite if $y^T A y \geq 0$ for all $y$; let $K$ be the kernel
matrix defined as $K = X^T X$ and take $A = K$, then
$y^T X^T X y = \|Xy\|^2 \geq 0$.
We then perform the computations for square and asymmetric matrices. The
experiment with asymmetric matrices means that the matrix A stays at a fixed
size (1260×1260), which is the size of the square matrices (A, B) at which the
Tesla C2050 GPU performs equivalently to the Xeon E5645 CPU, while the
number of columns of the matrix B is incremented linearly (this is equivalent to
factorizing a single matrix A and solving it for many examples b_i). We chose
the Tesla C2050 for this and the rest of the experiments because it is currently
the GPU most widely spread across hardware infrastructures, unlike the Tesla K40.
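On the CPU side, the two solver routes can be sketched as follows (illustrative code under our own variable names; the small jitter term is our addition, to keep the Cholesky factorization numerically safe for a merely semi-definite kernel matrix):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

n, m = 1260, 64
X = np.random.randn(n, n)
A = X.T @ X + 1e-6 * np.eye(n)   # kernel-style PSD matrix, plus jitter
B = np.random.randn(n, m)        # many right-hand sides b_i

# Baseline: NumPy's solve, backed by the LAPACK _gesv (LU) routine.
sol_lu = np.linalg.solve(A, B)

# Cholesky route: factorize A once, then reuse the factor for all b_i.
factor = cho_factor(A)
sol_chol = cho_solve(factor, B)

assert np.allclose(sol_lu, sol_chol, atol=1e-5)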
Fitting the model The SGD is performed in this step, and the baseline again
is the NumPy implementation; the time spent solving the equation system is
discounted in order to distinguish the improvement of one task from the other.
We run the benchmark with 4 different combinations of the GPU implementation
of OKMF. The main observations are the following:
- If the matrices are small, the best performance is achieved by the mem
  approach; but if the number of features (columns) of the dataset is large, it
  rapidly floods the GPU's memory (12 GB for the Tesla K40).
- Theano driven by Python's for loops gives remarkably poor results, to the
  point that it rapidly tends to reach the timeout.
- If the number of features is large enough, the Scikit-Learn implementation
  outperforms all GPU implementations.
- The implementation that uses the scan pattern behaves well in most of the
  scenarios.
- In ML tasks it is usual to have a much larger number of examples (rows)
  than features (columns); therefore, in the asymmetric-rows case it is
  worthwhile to perform the calculation on the GPU.
Solving the equation system Figure 3 shows the speedup of our wrapper
over the NumPy-with-MKL implementation. As we can see, it is worthwhile only
if the size of the matrix A is greater than 1260; the benchmark for asymmetric
matrices shows a small benefit beyond this size, indicating that the real
improvement comes from the factorization and not so much from solving the
system with the triangular matrices, whether they come from the LU or the
Cholesky factorization. As expected, the Cholesky solver performs slightly better.
Fitting the model Finally, we put everything together and measure the times
of each of the tasks presented above; the sum of those times is represented in
Figure 4. Note that the schemes that use the CPU are the worst of the
experimental set and, more importantly, computing the solution of the linear
equation systems on the CPU is the bottleneck in the experiments; on the other
hand, the implementation that uses the wrapper outperforms the baseline by a
factor of more than 100×.
Table 2 presents the speedup factors achieved for different well-known datasets
(Abalone [9], MNIST [18] and SensIT [24]) when the GPU implementation is
compared with the reference CPU code implemented with NumPy. Table 1 lists
the respective properties of each dataset.
Note that the reward of using the GPU implementation grows as the amount of
data gets bigger: there is a clear pattern in the speedups shown in Table 2,
where the biggest improvements occur precisely for the SensIT dataset, the
largest of the datasets used. This is important to point out because our main
focus is to build fast, scalable and maintainable prototyping code for several
ML algorithms, and scalability is one of the main concerns nowadays, given that
the amount of available data keeps growing over time.
The difference in improvement between the methods for solving the equation
system (LU factorization and Cholesky factorization) is not clear-cut; for these
dataset sizes it appears not to be meaningful, except in the case of SensIT,
where the difference is barely perceptible. This is also an important observation
because, if we need to implement a novel ML algorithm and we cannot ensure
that the involved matrices are positive semi-definite, the LU-based solver can
be adopted without a significant performance penalty.
Table 2. Speedups for several data sets
5 Conclusions
References
1. Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, and
   Enrique S. Quintana-Ortí. Solving dense linear systems on graphics
   processors. In Euro-Par 2008 – Parallel Processing, pages 739–748. Springer,
   2008.
2. Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J.
   Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio.
   Theano: new features and speed improvements. 2012.
3. James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan
   Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and
   Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In
   Proceedings of the Python for Scientific Computing Conference (SciPy),
   June 2010. Oral presentation.
4. Léon Bottou. Large-scale machine learning with stochastic gradient
   descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.
5. Joshua Buckner, Justin Wilson, Mark Seligman, Brian Athey, Stanley
   Watson, and Fan Meng. The gputools package enables GPU computing
   in R. Bioinformatics, 26(1):134–135, 2010.
6. Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A
   Matlab-like environment for machine learning. In BigLearn, NIPS
   Workshop, number EPFL-CONF-192376, 2011.
7. Continuum Analytics. NumbaPro: Continuum's CUDA-based API for
   writing CUDA code in Python. http://docs.continuum.io/numbapro/,
   2015. Accessed: 2015-04-02.
8. Chris Ding, Tao Li, and Michael I. Jordan. Convex and semi-nonnegative
   matrix factorizations. IEEE Transactions on Pattern Analysis and Machine
   Intelligence, 32(1):45–55, 2010.
9. Center for Machine Learning and Intelligent Systems. Abalone Data Set.
   https://archive.ics.uci.edu/ml/datasets/Abalone. Accessed: 2015-04-16.
10. Ian J. Goodfellow, David Warde-Farley, Pascal Lamblin, Vincent Dumoulin,
    Mehdi Mirza, Razvan Pascanu, James Bergstra, Frédéric Bastien, and
    Yoshua Bengio. Pylearn2: a machine learning research library. arXiv
    preprint arXiv:1308.4214, 2013.
11. Khronos OpenCL Working Group et al. OpenCL – the open standard for
    parallel programming of heterogeneous systems. http://www.khronos.org/opencl,
    2011.
12. Mark Harris, Shubhabrata Sengupta, and John D. Owens. Parallel prefix
    sum (scan) with CUDA. GPU Gems, 3(39):851–876, 2007.
13. David Kirk et al. NVIDIA CUDA software and GPU parallel computing
    architecture. In ISMM, volume 7, pages 103–104, 2007.
14. Andreas Klöckner. PyCUDA. Courant Institute of Mathematical Sciences,
    New York University. http://mathema.tician.de/software/pycuda, 2011.
    [Cited 2012-01-08].
15. Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul
    Ivanov, and Ahmed Fasih. PyCUDA and PyOpenCL: A scripting-based
    approach to GPU run-time code generation. Parallel Computing,
    38(3):157–174, 2012.
16. Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul
    Ivanov, Ahmed Fasih, A. D. Sarma, D. Nanongkai, G. Pandurangan,
    P. Tetali, et al. PyCUDA: GPU run-time code generation for
    high-performance computing. arXiv preprint arXiv:0911.3456, 2009.
17. Vipin Kumar. NumPy/SciPy with Intel MKL. https://software.intel.com/en-us/articles/numpyscipy-with-intel-mkl,
    2012. Accessed: 2015-06-09.
18. Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based
    learning applied to document recognition. Proceedings of the IEEE,
    86(11):2278–2324, 1998.
19. Maxime Legendre, Albrecht Schmidt, Saïd Moussaoui, and Uwe Lammers.
    Solving systems of linear equations by GPU-based matrix factorization in
    a science ground segment. Astronomy and Computing, 3:58–64, 2013.
20. D. Luebke. CUDA: Scalable parallel programming for high-performance
    scientific computing. In Biomedical Imaging: From Nano to Macro, 2008
    (ISBI 2008), 5th IEEE International Symposium on, pages 836–838, May
    2008.
21. CUDA Nvidia. CUBLAS library. NVIDIA Corporation, Santa Clara,
    California, 15, 2008.
22. Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel,
    Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer,
    Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in
    Python. The Journal of Machine Learning Research, 12:2825–2830, 2011.
23. Jill Reese and Sarah Zaranek. GPU programming in MATLAB. MathWorks
    News&Notes, pages 22–25. Natick, MA: The MathWorks Inc., 2012.
24. Machine Learning Data Set Repository. SensIT Vehicle (seismic) dataset.
    http://mldata.org/repository/data/viewslug/sensit-vehicle-seismic/.
    Accessed: 2015-04-16.
25. Yonghong Yan, Max Grossman, and Vivek Sarkar. JCuda: A programmer-friendly
    interface for accelerating Java programs with CUDA. In Euro-Par 2009
    Parallel Processing, pages 887–899. Springer, 2009.
26. Daoqiang Zhang, Zhi-Hua Zhou, and Songcan Chen. Non-negative matrix
    factorization on kernels. In PRICAI 2006: Trends in Artificial Intelligence,
    pages 404–412. Springer, 2006.