SPRINGER BRIEFS IN APPLIED SCIENCES AND TECHNOLOGY: COMPUTATIONAL INTELLIGENCE

Parallel Genetic Algorithms for Financial Pattern Discovery Using GPUs
SpringerBriefs in Applied Sciences
and Technology
Computational Intelligence
Series editor
Janusz Kacprzyk, Polish Academy of Sciences, Systems Research Institute,
Warsaw, Poland
The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and
with a high quality. The intent is to cover the theory, applications, and design
methods of computational intelligence, as embedded in the fields of engineering,
computer science, physics and life sciences, as well as the methodologies behind
them. The series contains monographs, lecture notes and edited volumes in
computational intelligence spanning the areas of neural networks, connectionist
systems, genetic algorithms, evolutionary computation, artificial intelligence,
cellular automata, self-organizing systems, soft computing, fuzzy systems, and
hybrid intelligent systems. Of particular value to both the contributors and the
readership are the short publication timeframe and the world-wide distribution,
which enable both wide and rapid dissemination of research output.
João Baúto, Instituto Superior Técnico, Instituto de Telecomunicações, Lisbon, Portugal
Nuno Horta, Instituto Superior Técnico, Instituto de Telecomunicações, Lisbon, Portugal
Rui Neves, Instituto Superior Técnico, Instituto de Telecomunicações, Lisbon, Portugal
© The Author(s), under exclusive licence to Springer International Publishing AG, part of Springer
Nature 2018
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer International Publishing AG
part of Springer Nature
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To Maria, Manuel and Miguel
João Baúto
The financial markets move vast amounts of capital around the world. This fact, together with the increasingly easy access to manual or automated trading, has attracted the interest of all types of investors, from the “man on the street” to academic researchers. These new investors and the automated trading systems in turn influence market behavior. To adapt to this new reality, the domain of computational finance has received increasing attention from people in both the finance and computational intelligence communities.
The main driving force in the field of computational finance, with application to financial markets, is to define highly profitable, low-risk trading strategies. To accomplish this objective, the defined strategies must process large amounts of data, including financial market time series, fundamental analysis data, and technical analysis data, and produce appropriate buy and sell signals for the selected securities. What may appear, at first glance, to be an easy problem is, in fact, a huge and highly complex optimization problem that cannot be solved analytically. This makes soft computing, and computational intelligence in general, especially appropriate for addressing it.
The use of chart patterns is widespread among traders as an additional tool for decision making. Chartists, as these analysts are known, try to identify known pattern formations and, based on previous appearances, predict future market trends. Visual pattern identification is hard and error-prone, and patterns in real financial time series are rarely as clean as the examples in textbooks, so any solution that helps with this task is welcome. At the same time, the general availability of GPU boards today presents an excellent alternative execution system to traditional CPU architectures, able to cope with high-speed processing requirements at relatively low cost.
This work explores the benefits of combining a low-cost high-performance computing solution, a GPU-based architecture, with a state-of-the-art computational finance approach, SAX/GA, which combines the Symbolic Aggregate approXimation (SAX) technique with an optimization kernel based on genetic algorithms (GA). The SAX representation is used to describe the financial time series so that relevant patterns can be efficiently identified. The evolutionary optimization kernel is used to identify the most relevant patterns and generate investment rules. The SAX technique uses an alphabetic symbolic representation of data defined by adjustable parameters, and a search for the optimal combination of SAX parameters is presented in order to capture and preserve the essence of the financial time series under study. The proposed approach tailors the implementation of the SAX/GA technique to a GPU-based architecture in order to improve its computational efficiency. The approach was tested using real data from the S&P 500. The results show that it outperforms the CPU alternative, with speedups of up to 200×.
The book is organized in seven chapters as follows:
• Chapter 1 presents a brief description of the problem addressed by this book, namely investment optimization based on pattern discovery techniques and high-performance computing on GPU architectures. The main goals of the work and the document’s structure are also highlighted in this chapter.
• Chapter 2 discusses fundamental concepts key to understanding the proposed work, such as pattern recognition and matching, GAs and GPUs.
• Chapter 3 presents a review of the state-of-the-art pattern recognition techniques
with practical application examples.
• Chapter 4 addresses the CPU implementation of the SAX/GA algorithm along with a detailed explanation of the genetic operators involved. A benchmark analysis discusses the performance of SAX/GA and identifies possible locations to accelerate the algorithm.
• Chapter 5 presents the developed solutions along with previous attempts to
accelerate the SAX/GA algorithm. Each solution started as a prototype that
evolved based on the advantages and disadvantages identified.
• Chapter 6 discusses the experimental results obtained for each solution and
compares them to the original implementation. Solutions are evaluated based on
two metrics, the speedup and the ROI indicator.
• Chapter 7 summarizes the book and presents the respective conclusions and future work.
Contents

1 Introduction
  1.1 Motivation
  1.2 Goals
  1.3 Book Outline
  References
2 Background
  2.1 Time Series Analysis
    2.1.1 Euclidean Distance
    2.1.2 Dynamic Time Warping
    2.1.3 Piecewise Linear Approximation
    2.1.4 Piecewise Aggregate Approximation
    2.1.5 Symbolic Aggregate approXimation
  2.2 Genetic Algorithm
    2.2.1 Selection Operator
    2.2.2 Crossover Operator
    2.2.3 Mutation Operator
  2.3 Graphics Processing Units
    2.3.1 NVIDIA’s GPU Architecture Overview
    2.3.2 NVIDIA’s GPU Architectures
    2.3.3 CUDA Architecture
  2.4 Conclusions
  References
3 State-of-the-Art in Pattern Recognition Techniques
  3.1 Middle Curve Piecewise Linear Approximation
  3.2 Perceptually Important Points
  3.3 Turning Points
Acronyms
SP Streaming Processor
SVM Support Vector Machine
TPC Texture/Processor Cluster
TSA Tabu Search Algorithm
Investment Related
B&H Buy & Hold
C/F Ratio Crossover/Fitness Ratio
HSI Hong Kong Hang Seng Index
IL Enter Long
IS Enter Short
NYSE New York Stock Exchange
OL Exit Long
OS Exit Short
ROI Return on Investment
RSI Relative Strength Index
S&P500 Standard & Poor 500
Others
ECG Electrocardiogram
Chapter 1
Introduction
Abstract This chapter presents a brief description of the scope of the problem addressed in this book: the performance and optimization of algorithms based on pattern discovery. Additionally, the main goals of this work are discussed, along with a breakdown of the document’s structure.
The financial system as it is known today does not stem from an idea invented a century ago but from the age-old human practice of trading goods, an idea that evolved into the current stock markets, where goods are traded for a monetary value.
Markets such as the New York Stock Exchange (NYSE) and indices such as Hong Kong’s Hang Seng Index (HSI) are responsible for the movement of tremendously large amounts of capital. They connect investors from different corners of the world around one common objective: trading. Trading must occur in real time, with stock prices displayed without delay and simultaneously to all parties involved. Once presented with the stock prices, investors have two main types of analysis, fundamental and technical, on which to base their decisions. Some investors are interested in a company’s position in relation to social or political ideologies, while others focus on raw numbers.
The author of [1] discusses a question with its fair share of interest: “to what extent can the past history of a common stock’s price be used to make meaningful predictions concerning the future price of the stock?”. The premise of technical analysis depends heavily on this question, and there is evidence supporting the approach. If the past history of a stock can reveal future movements, one can try to identify points in history that reflect those movements and use them for future decisions. These points, or patterns, are one of the most interesting topics of technical analysis, and identifying them has posed a true challenge.
1.1 Motivation
1.2 Goals
The objective of this work is to study whether the Symbolic Aggregate approXimation (SAX)/Genetic Algorithm (GA) algorithm can take advantage of many-core systems such as NVIDIA’s GPUs to reduce the execution time of the sequential CPU implementation. SAX/GA is an algorithm designed to optimize trading strategies for the stock market, implemented so that it can explore a vast search space using small populations of individuals. The authors of SAX/GA [2, 3] found it necessary to use aggressive genetic operators capable of preventing the algorithm from settling into static behaviour and circling around identical solutions.
The first step is to analyse the performance of SAX/GA and understand the causes of its prolonged execution time. Once the bottlenecks are identified, different GPU optimization strategies will be presented and compared to the original CPU algorithm, based on the accuracy of the solution and the speedup (Eq. 1.1).
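Eq. 1.1 itself is not reproduced in this excerpt; the speedup metric referred to is presumably the standard ratio of sequential to parallel execution time:

```latex
\mathrm{speedup} = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{GPU}}}
```

so that a speedup of 200 means the GPU version runs in 1/200 of the CPU's execution time.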
References
1. E.F. Fama, The behavior of stock-market prices. J. Bus. 38(1), 34–105 (1965)
2. A. Canelas, R. Neves, N. Horta, A SAX-GA approach to evolve investment strategies on
financial markets based on pattern discovery techniques. Expert Syst. Appl. 40(5), 1579–1590
(2013), http://www.sciencedirect.com/science/article/pii/S0957417412010561. https://doi.org/
10.1016/j.eswa.2012.09.002
3. A. Canelas, R. Neves, N. Horta, Multi-dimensional pattern discovery in financial time series
using sax-ga with extended robustness, in GECCO (2013). https://doi.org/10.1145/2464576.
2464664
Chapter 2
Background
Abstract This chapter presents fundamental concepts required to fully understand the topics discussed: first, a brief introduction to concepts related to pattern matching and time series dimensionality reduction, followed by a historical and architectural review of GPUs.
Time series analysis is one of the pillars of technical analysis in financial markets. Analysts use variations in a stock’s price and trading volume, in combination with several well-known technical indicators and chart patterns, to forecast the future price of a stock, or at least to speculate whether the price will increase or decrease. However, the widespread use of these indicators and patterns may indirectly influence the direction of the market, causing it to converge toward chart patterns that investors recognize.
Searching for chart patterns may seem to be a simple process in which two patterns or time series from different periods are compared and analysed for similarities, but it is not that trivial, as will be demonstrated later. In the following sections, P denotes a reference time series while Q denotes the time series whose similarity with P is tested.
This procedure is the basis for some of the pattern matching techniques presented later. Starting with two time series, P = (p1, p2, ..., pi, ..., pn) and Q = (q1, q2, ..., qi, ..., qn), the Euclidean Distance (ED) method iterates through both series and accumulates the distance between each pi and qi (Eq. 2.1).
© The Author(s), under exclusive licence to Springer International Publishing AG, part of Springer Nature 2018
J. Baúto et al., Parallel Genetic Algorithms for Financial Pattern Discovery Using GPUs, Computational Intelligence, https://doi.org/10.1007/978-3-319-73329-6_2
$$ED(P, Q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \qquad (2.1)$$
At first sight it is possible to observe some important issues: what if the two time series have different magnitudes, or different alignments? With different magnitudes, applying the ED method would be pointless, as its main feature lies in direct spatial comparison. The same happens with different alignments: the two series may be equal, or at the very least partially similar, but since they are shifted or unaligned, a direct match will not be found.
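As a concrete illustration (a sketch, not code from the book), Eq. 2.1 amounts to a point-wise comparison of the two series:

```python
import math

def euclidean_distance(p, q):
    """Point-wise Euclidean distance between two equal-length series (Eq. 2.1)."""
    assert len(p) == len(q), "ED requires series of equal length"
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

# Identical series give distance 0; a vertical shift (different magnitude)
# inflates the distance even though the shapes match exactly.
p = [1.0, 2.0, 3.0, 2.0]
q = [2.0, 3.0, 4.0, 3.0]   # same shape, shifted up by 1
print(euclidean_distance(p, p))  # 0.0
print(euclidean_distance(p, q))  # 2.0, i.e. sqrt(4 * 1^2)
```

The second call demonstrates the magnitude problem described above: the shapes are identical, yet the distance is far from zero.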
An alignment technique, Dynamic Time Warping (DTW) [1], can be used to solve the previous problem. This approach aligns two time series, P = (p1, p2, ..., pi, ..., pn) and Q = (q1, q2, ..., qj, ..., qm), using an n × m matrix D. First, for each pair (i, j) in D, the distance (pi − qj)² is calculated. The warping or alignment path (W) is then obtained by minimizing the cumulative distance accumulated along the path.
DTW solves the alignment problem, but at the cost of an increase in computational time. Some optimizations can be applied to the DTW algorithm, but the main issue remains untouched: the dataset itself. Financial time series tend to present small variance in value during a time period and, taking this into consideration, some points of the dataset can be eliminated.
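The cumulative-distance minimization can be sketched with the classic dynamic-programming recurrence, D(i, j) = (p_i − q_j)² + min(D(i−1, j), D(i, j−1), D(i−1, j−1)) (illustrative Python, not the book's implementation):

```python
def dtw(p, q):
    """Classic O(n*m) DTW: fill the cumulative-distance matrix with squared
    point costs; each cell extends the cheapest of its three predecessors."""
    n, m = len(p), len(q)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (p[i - 1] - q[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

# A time-shifted copy of the same shape: plain ED would report a large
# error, but DTW realigns the peaks and finds zero cumulative distance.
print(dtw([0, 0, 1, 2, 1, 0], [0, 1, 2, 1, 0, 0]))  # 0.0
```

Note the quadratic cost in the series lengths, which is exactly the computational-time penalty mentioned above.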
With a sliding-window Piecewise Linear Approximation (PLA) approach, the time series is condensed into a representation using N breakpoints, where each breakpoint is the last point that satisfied a threshold condition. The series can then be approximated by two methods: linear interpolation, connecting each breakpoint into linear segments, or linear regression over each segment.
Once the dataset is normalized, PAA reduces a time series of dimension N into N/W time windows of size W, where N/W must be an integer value; otherwise Eq. 2.4 is not valid. An implementation of PAA with a non-integer number of windows is presented in [3], where border elements shared by two windows contribute partially to both. For each window, the mean value is calculated (Eq. 2.4) and assigned to represent that time window, as illustrated in Fig. 2.1.
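The windowed averaging of Eq. 2.4 can be sketched as follows (illustrative Python, restricted to the integer N/W case discussed in the text):

```python
def paa(series, window):
    """PAA: mean of each non-overlapping window of size W (Eq. 2.4).
    len(series) must be divisible by W, as required in the text."""
    n = len(series)
    assert n % window == 0, "N/W must be an integer"
    return [sum(series[i:i + window]) / window
            for i in range(0, n, window)]

# An 8-point series reduced to N/W = 2 windows of size W = 4.
print(paa([1, 3, 2, 2, 5, 7, 6, 6], 4))  # [2.0, 6.0]
```

Each output value stands in for a whole window, which is where the dimensionality reduction comes from.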
SAX [4] can be viewed as an improvement to PAA: it still uses this method to obtain a dimensionally reduced time series, but adds a new type of data transformation, from numeric to symbolic representation.
This transformation relies on a Normal distribution, N(0, 1), divided into α_n intervals, where the probability between the z-score of α_{i+1} (β_{i+1}) and the z-score of α_i (β_i) must be equal to 1/α_n; each interval is considered a symbol. For example, with α_n = 3 there are 3 intervals, each with an equal probability of 33.3% and its own symbolic representation.
In Fig. 2.2, frame 3 (c3) has an average value of 1.5; considering an alphabet with 3 letters (α = 3), from Table 2.1 and Eq. 2.6 it is possible to assess that c3 lies between β2 and ∞ and, therefore, the corresponding letter is ’C’. This method ensures that each symbol of the SAX alphabet has equal probability, allowing direct comparison. The z-score values (Table 2.1) were obtained from [4, 5].
Now that the PAA series is normalized and the z-scores of α are known, the SAX representation can easily be obtained: to each PAA segment (ci), a corresponding α interval is assigned so that conditions similar to those in Eq. 2.6 are satisfied. The transformation in Fig. 2.2 compresses a window of size 500 into a SAX sequence of size 10 over an alphabet of 3 letters.
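The symbolization step can be sketched as below; the breakpoints −0.43 and 0.43 are the tabulated z-scores that split N(0, 1) into three equiprobable intervals for a 3-symbol alphabet (illustrative Python, not the book's code):

```python
from bisect import bisect

# z-score breakpoints for a 3-symbol alphabet (rounded to two decimals).
BETA_3 = [-0.43, 0.43]
ALPHABET = "ABC"

def sax_symbols(paa_means, betas=BETA_3, alphabet=ALPHABET):
    """Map each (normalized) PAA mean to the symbol of its interval:
    bisect returns how many breakpoints lie below the value."""
    return "".join(alphabet[bisect(betas, c)] for c in paa_means)

# A frame mean of 1.5 lies above beta_2 = 0.43, so it maps to 'C',
# matching the c3 example in the text.
print(sax_symbols([-1.0, 0.0, 1.5]))  # "ABC"
```

Because the intervals are equiprobable, each letter appears with roughly equal frequency over a normalized series.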
Up to this point there is not much of an improvement, since there is no way to compare two time series, the input and the search pattern. The authors of SAX faced a problem: how to compare two series if they are represented in string format?
Fig. 2.2 Transformation of a PAA series into a SAX series with 3 symbols
It is possible to know whether both series are equal, but not whether they are similar. Lin et al. [4] needed to redefine the distance measure so that two symbolic series could be compared. Similar to the PAA distance, this new distance measure is defined by,
$$MINDIST(\hat{P}, \hat{Q}) = \sqrt{\frac{n}{w}} \cdot \sqrt{\sum_{i=1}^{w} dist(\hat{p}_i, \hat{q}_i)^2} \qquad (2.7)$$
At first sight, Eq. 2.7 is essentially equal to the one used in PAA; however, a new element was added, the dist(·) function. This function (Eq. 2.8) calculates the distance between two symbols based on the z-score values used in the numeric-to-symbolic transformation. For instance, with an alphabet of 4 symbols, the distance between ’A’ and ’C’ is given by the z-score below ’C’ minus the z-score above ’A’, i.e. β2 − β1. For neighbouring symbols, such as ’A’–’B’ or ’C’–’D’, the distance evaluates to zero.
$$dist(\hat{p}_i, \hat{q}_i) = \begin{cases} 0, & |i - j| \le 1 \\ \beta_{j-1} - \beta_i, & i < j - 1 \\ \beta_{i-1} - \beta_j, & i > j + 1 \end{cases} \qquad (2.8)$$
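Eqs. 2.7 and 2.8 can be sketched together as follows; the 4-symbol breakpoints −0.67, 0, 0.67 are the tabulated z-scores, and symbols are handled as 1-based ranks in the alphabet (illustrative Python):

```python
import math

BETA_4 = [-0.67, 0.0, 0.67]  # z-score breakpoints for a 4-symbol alphabet

def sym_dist(i, j, betas=BETA_4):
    """Eq. 2.8: distance between symbols of rank i and j (1-based).
    Adjacent or equal symbols are zero; otherwise the gap between the
    breakpoint just below the higher symbol and the one just above the lower."""
    if abs(i - j) <= 1:
        return 0.0
    lo, hi = min(i, j), max(i, j)
    return betas[hi - 2] - betas[lo - 1]

def mindist(s1, s2, n, alphabet="ABCD", betas=BETA_4):
    """Eq. 2.7: lower-bound distance between two SAX words of length w
    drawn from an original series of length n."""
    w = len(s1)
    ranks = {c: k + 1 for k, c in enumerate(alphabet)}
    total = sum(sym_dist(ranks[a], ranks[b], betas) ** 2
                for a, b in zip(s1, s2))
    return math.sqrt(n / w) * math.sqrt(total)

print(sym_dist(1, 2))            # 0.0  ('A' vs 'B' are neighbours)
print(round(sym_dist(1, 3), 2))  # 0.67 ('A' vs 'C': beta_2 - beta_1)
```

A zero MINDIST therefore does not mean the words are identical, only that every symbol pair is within one interval of its counterpart.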
The SAX symbolic representation can produce a very compact and efficient time series representation; however, it is subject to a particular problem, mainly caused by PAA. Since the symbolic representation of each window is calculated using the average value of the series in that window, it cannot accurately represent a trend, as important points are ignored. An alternative, Extended SAX (eSAX) [6], can be used to fix this issue: instead of considering only the average value of the frame, two additional points are added, the maximum and minimum values of the frame. These values compose a string of ordered triplets, < vmin, vavg, vmax >, that helps capture the behaviour inside each frame.
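The triplet construction can be sketched as follows (illustrative Python; the symbolization of each triplet component is omitted):

```python
def esax_frames(series, window):
    """eSAX: per non-overlapping window, an ordered triplet
    <v_min, v_avg, v_max> instead of the mean alone."""
    assert len(series) % window == 0
    out = []
    for i in range(0, len(series), window):
        frame = series[i:i + window]
        out.append((min(frame), sum(frame) / window, max(frame)))
    return out

# The mean alone (3.0) hides the spike to 9; the triplet preserves it.
print(esax_frames([1, 9, 1, 1], 4))  # [(1, 3.0, 9)]
```

The example shows exactly the failure mode described above: PAA would report only 3.0 for this frame, discarding the spike.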
Algorithms are methods that transform input data, through a set of operations, into an output that solves a specific problem. Sometimes, however, finding a solution is not so straightforward. A particular class of problems, optimization problems, accept an approximate but less time-consuming solution in place of a more accurate but more costly one. To tackle these problems, researchers turned to a different family of algorithms, Evolutionary Algorithms (EAs), taking advantage of innovative data representations. EAs include, but are not limited to, Neural Networks (NN), Particle Swarm (PS) and, the one relevant to this work, the Genetic Algorithm (GA). These algorithms follow the same idea: evolving a population of individuals until a near-optimal solution is achieved, inspired by Darwin’s natural selection and survival of the fittest.
A GA works on a pool of individuals, or chromosomes. Each individual, randomly generated, represents a possible solution to the optimization problem and is initially assigned a score according to an evaluation function, the fitness function. To follow the biological process of evolution, individuals are subject to reproduction, where two individuals are selected from the population and their genetic information is mixed to form two offspring, hopefully with better characteristics. As chromosomes reproduce, there is a risk of mutation, where one or more genes of a chromosome are inadvertently changed, again hoping for more favourable features. At the end of each reproduction cycle, all individuals in the population are evaluated with the fitness function and the worst percentage of the population is discarded (Fig. 2.3).
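The reproduction cycle described above can be sketched as a generic loop. The toy fitness, crossover and mutation operators below (one-max over 8 binary genes) are illustrative stand-ins, not the SAX/GA operators:

```python
import random

def run_ga(fitness, new_individual, crossover, mutate,
           pop_size=20, generations=50, p_mut=0.1, rng=None):
    """Skeleton GA cycle: random initial pool, random parent pairs,
    crossover into two offspring, occasional mutation, then keeping
    only the best individuals (the worst are discarded)."""
    rng = rng or random.Random(0)
    pop = [new_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(pop, 2)           # select two parents
            for c in crossover(a, b, rng):      # mix genes -> two offspring
                children.append(mutate(c, rng) if rng.random() < p_mut else c)
        # evaluate parents + children; discard the worst half
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
    return max(pop, key=fitness)

# Toy problem: maximize the number of 1-bits in an 8-gene chromosome.
def new_ind(rng): return [rng.randint(0, 1) for _ in range(8)]
def xover(a, b, rng):
    cut = rng.randint(1, 7)                     # single-point crossover
    return [a[:cut] + b[cut:], b[:cut] + a[cut:]]
def mut(c, rng):
    i = rng.randint(0, 7)                       # flip one random gene
    return c[:i] + [1 - c[i]] + c[i + 1:]

best = run_ga(sum, new_ind, xover, mut)
print(sum(best))  # typically converges to the optimum of 8
```

The survivor-selection step here is deliberately elitist; the next section describes the parent-selection schemes (tournament, roulette wheel, rank-based) that a full GA would use instead of purely random pairing.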
The three main selection techniques (Fig. 2.4) are tournament selection, roulette wheel selection and rank-based roulette wheel selection [7]. Tournament selection draws n random individuals, two or more of which compete against each other, and the winner proceeds to the next stage. Roulette wheel selection is based on a probabilistic model in which the best-scoring individuals have the highest probability of being selected to reproduce, while low-scoring individuals have limited, but not null, chances. Rank-based selection tries to prevent highly fit individuals from dominating by mapping fitness scores into ranks.
When searching for a solution, GAs are prone to getting stuck in local optima: points that are optimal within a limited closed region of the search space but not over the whole space. To prevent the algorithm from settling in a local optimum, mutation operators perform small changes to individuals, introducing new possible solutions and increasing population diversity (Fig. 2.6).
Fig. 2.6 Mutation example. Genes at the beginning and end were mutated, causing a change in the genetic information
GPUs, as they are commonly known, were first introduced by NVIDIA in 1999 [8]. The new generation of graphics processors, GeForce 2, shifted vertex transformation and lighting (T&L) from the CPU to the GPU by including dedicated hardware. By 2001 NVIDIA had replaced fixed-function shaders with programmable vertex shaders, units capable of performing custom instructions over the pixels and vertices of a scene [9].
Although shader programming was limited to graphics APIs such as OpenGL and DirectX, researchers tried, with some success, to solve non-graphics problems on GPUs by recasting them as traditional rendering problems. Thompson [10] proposed a GPU implementation of matrix multiplication and 3-SAT using a GeForce Ti4600 and the OpenGL API, obtaining speedups of up to 3.2× over the CPU. Other applications include ray tracing [11] and level-set methods [12]. This was the first step into general-purpose GPU (GPGPU) programming.
The performance of rendering a 3D scene was heavily linked to the type of shader used: a GPU normally processes more pixels than vertices, in roughly a three-to-one ratio [8], so with a predefined number of processors the workload was normally unbalanced across processors. With the release of the Tesla-based GeForce 8, NVIDIA accomplished an important milestone towards what is now known as the GPU architecture. Unifying vertex shaders and Tesla’s new feature, programmable pixel-fragment shaders, into a single shader pipeline opened a new world to programmers and developers, enabling them to balance workload between vertex and pixel shaders [9]. This pipeline behaves similarly to a basic CPU architecture, with its own instruction memory, instruction cache and sequential control logic. Additionally, the Compute Unified Device Architecture (CUDA) framework was released. CUDA provided access to a parallel architecture that can be programmed with high-level languages such as C and C++, removing the need for a graphics API and completing the transition into the GPGPU era.
In the Single-Instruction Multiple-Thread (SIMT) model, every thread executes the same instruction but on different data, which is why it is so well suited to 2D/3D scene rendering: few operations are required, yet thousands of pixels need to be processed.
To implement the SIMT architecture, the GPU must be designed to execute hundreds of threads concurrently [13]. At the top level, a GPU is a combination of multiple Streaming Multiprocessors (SM): independent multi-threaded units responsible for the creation, management, scheduling and launch of threads, which are grouped in sets of 32 called warps. Each SM features an instruction cache, warp schedulers that select warps ready to execute, instruction dispatch units that issue instructions to individual warps, a 32-bit register file, a shared memory, several types of cache and, most importantly, the CUDA cores or Streaming Processors (SP).
On the memory side, GPU memory is organized in a three-level hierarchy. Each level has a defined set of functions, benefits and limitations, and it is the programmer’s responsibility to ensure appropriate use and correct management. All SMs are connected to, and can communicate through, a global memory located off-chip, with a capacity on the order of Gigabytes (GB), linked to the CPU through the Peripheral Component Interconnect Express (PCIe) bus. Being a general-access off-chip memory leads to an important problem: the latency between requesting and retrieving information, which can be as high as 800 clock cycles depending on the device’s compute capability [13]. Accesses to global memory are issued as 32-, 64- or 128-byte memory transactions, which must be aligned to a multiple of the transaction size; e.g. a warp that requests sequential 4-byte words in the address range 116–244 triggers two 128-byte transactions covering addresses 0 to 256. Ideally, a warp’s accesses should be coalesced, meaning that each thread requests a sequential and aligned word, so that the whole warp is served in as few memory transactions as possible given the word and transaction sizes. In more recent architectures, aligned but non-sequential accesses are also treated as coalesced transactions.
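The 116–244 example above can be checked with a small model that maps byte addresses to 128-byte segments; each distinct segment touched costs one transaction (a simplification of the real coalescing rules, which also depend on compute capability):

```python
def transactions_128(addresses):
    """128-byte-aligned segments touched by a set of byte addresses;
    in this simplified model each distinct segment is one transaction."""
    return sorted({a // 128 for a in addresses})

# A warp of 32 threads reading consecutive 4-byte words starting at
# byte 116: addresses 116..240 straddle segments [0,128) and [128,256),
# so two 128-byte transactions are issued.
warp = [116 + 4 * t for t in range(32)]
print(transactions_128(warp))  # [0, 1] -> 2 transactions

# The same warp shifted to an aligned base address needs only one.
print(transactions_128([128 + 4 * t for t in range(32)]))  # [1]
```

Shifting the base address to a 128-byte boundary halves the traffic, which is the essence of the coalescing advice above.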
At the second level, there is a set of caches and an important mechanism for communication between threads: the shared memory. The latter is a fast, high-throughput memory located inside each SM, although only a small capacity is available, on the order of Kilobytes (KB). These advantages come with constraints, mainly on the access pattern of threads. To achieve peak throughput, NVIDIA organized shared memory into a modular structure of equally sized memory modules called banks, with memory lines of either 16 or 32 four-byte banks depending on the compute capability. Maximum memory bandwidth is obtained by performing reads or writes to n addresses that fall in n distinct banks; once m threads execute an instruction whose addresses fall in the same memory bank, an m-way bank conflict is triggered and the conflicting accesses are served serially.
With the exception of the Tesla microarchitecture, two levels of cache, L1 and L2, assist memory transactions between threads and global memory: the L2 cache is mainly used to cache global memory loads, while the L1 cache serves local memory accesses (memory whose size is not known at compile time, such as dynamically sized arrays, or register spills).
At the third and most restricted level, each SM is equipped with a 32-bit register file offering the highest throughput available, dedicated to the private variables of each thread. The limited size of the register file constrains the number of registers used per thread, which can vary from 63 to 255 depending on the microarchitecture. Although threads are allowed to allocate up to this limit, doing so reduces the number of active warps per SM and therefore decreases overall performance.
Over the course of a decade, NVIDIA has been releasing new architectures, improving existing features while providing developers with new techniques to increase parallelism on the GPU. This section presents a brief overview of NVIDIA’s latest GPU generations, with technical aspects related to the GPU architecture and features that enhance parallelism.
With the release of the Tesla microarchitecture in 2006, NVIDIA introduced the world
to a programmable unified architecture. At the top level, Tesla is organized into eight
Texture/Processor Clusters (TPCs), each consisting of one texture unit and two SMs
(later increased to three in the GT200). Each SM is structured with eight CUDA cores;
two Special-Function Units (SFUs) responsible for transcendental functions
(functions that cannot be expressed as a polynomial, such as square
root, exponential and trigonometric operations and their inverses); an instruction
fetch and issue unit with an instruction cache that serves multiple concurrent threads
with zero scheduling overhead; a read-only constant cache; and 16 KB of shared
memory.
The shared memory is split into 16 banks of consecutive four-byte words, yielding
high throughput when each bank is requested by distinct threads in a warp. However,
there is a discrepancy between the number of threads and banks: when a warp accesses
shared memory, the request is divided into independent accesses, one per half-warp,
each of which should be free of bank conflicts. When multiple threads read from the
same bank address, a broadcast mechanism serves all requesting threads
simultaneously [14].
Fermi (2010) brought major changes to both the SM and the memory organization.
The Graphics Processor Cluster (GPC) replaced the TPC as the top-level module:
four dedicated texture units were introduced, removing the now-redundant texture
unit of Tesla, while the number of SMs per cluster was increased from two (three in
the GT200) to four. The SMs now feature 32 CUDA cores and a new configurable cache
with two possible configurations, giving the programmer freedom: for
graphics programs a smaller L1 cache is beneficial, while for compute programs a larger
shared memory allows more cooperation between threads. This cache can be used as
16 KB of L1 cache and 48 KB of shared memory, or as 48 KB of L1 cache and 16 KB
of shared memory. Besides the configurable cache, shared memory underwent internal
changes. Previously, with Tesla, shared memory was organized into 16 four-byte banks
that served a warp in two independent conflict-free transactions; with Fermi, the
number of banks was raised to 32, with one request per warp. Bank conflicts are still
present in Fermi, in addition to the broadcast mechanism introduced with Tesla.
The increase in CUDA cores and the renewed cache were not the only changes
in the SM structure. The number of SFUs was doubled to four, each capable of
one transcendental instruction per thread independently of the other execution units.
This prevents stalls in the GPU pipeline, because the CUDA cores and SFUs are
decoupled from the dispatch unit responsible for serving instructions to each execution
unit, and because Fermi provides two separate dispatch units. The destination
address of a thread's result is now calculated by one of the 16 Load/Store units
available, for a total of 16 thread results per clock. The workload is divided across
two groups of 16 CUDA cores each, and instructions are distributed by two warp
schedulers, allowing two warps to be issued and executed concurrently; this means
that a warp requires two clock cycles to complete execution (transcendental
instructions take eight cycles across the four SFUs).
Table 2.2 Architectural comparison between Fermi, Kepler and Maxwell [13, 15–18]

Specifications                               Fermi - GF100   Kepler - GK104   Maxwell - GM204
Compute capability                           2.0             3.0              5.2
Streaming multiprocessors (SM)               11–16           6–8              13–16
CUDA cores                                   352–512         1152–1536        1664–2048
Theoretical Floating Point Single
Precision (GFLOPS)                           855–1345        2100–3000        3500–4600
Main memory (MB)                             1024–1536       1536–4096        4096
L1 cache (KB)                                48 / 16         16 / 32 / 48     24
Shared memory (KB)                           16 / 48         48 / 32 / 16     96
L1 + shared memory total (KB)                64              64               —
L2 cache (KB)                                768             512              1792–2048
Maximum registers per thread                 63              63               255
Maximum registers per SM                     32768           65536            65536
Threads per warp                             32              32               32
Maximum warps per SM                         48              64               64
Maximum blocks per SM                        8               16               32
Maximum threads per SM                       1536            2048             2048
Maximum threads per block                    1024            1024             1024
In parallel programming, the basic execution unit is the thread. On a CPU, threads
are sub-routines of a main program, scheduled to execute a custom set of instructions
that may include memory accesses to local or shared resources. If necessary, threads
can communicate with one another through a global resource or memory; however,
special attention is required if running threads perform write operations on the same
memory address.
CUDA introduced a general-purpose parallel computing platform and programming
model able to combine well-established programming languages with the highly
parallel architecture of a GPU. Creating a functional CUDA C program for a GPU
is a three-stage process. First, the execution environment must be defined. This
environment consists of a kernel, in which a developer formalizes the routine to be
executed on the GPU and how it should be executed. The kernel definition has four
associated arguments: the number of blocks, the number of threads, the size of dynamic
shared memory per block, and the stream ID. The way a kernel is defined reflects how
the problem is spatially organized; e.g., a parallel sum reduction over an array can be
represented with a 1D kernel, and a multiplication between two matrices with a 2D
kernel. In Fig. 2.7, a kernel is declared with 4 blocks, each with 16 × 16 threads
(256 in total), while the size of dynamic shared memory and the stream ID are
optional, defaulting to 0—example from [17].
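The index space implied by such a declaration can be modelled in plain Python (an illustrative sketch, not CUDA code; the flattening formula mirrors the usual blockIdx/blockDim/threadIdx computation):

```python
# Illustrative model of the index space produced by a launch of 4 blocks
# of 16 x 16 threads each, as in the Fig. 2.7 example.
def enumerate_threads(num_blocks, block_dim):
    """Yield (block, (tx, ty), global_id) for every thread in the grid."""
    threads_per_block = block_dim[0] * block_dim[1]
    for b in range(num_blocks):
        for ty in range(block_dim[1]):
            for tx in range(block_dim[0]):
                # Flat index a kernel would typically derive from
                # blockIdx, blockDim and threadIdx.
                gid = b * threads_per_block + ty * block_dim[0] + tx
                yield b, (tx, ty), gid

grid = list(enumerate_threads(4, (16, 16)))
print(len(grid))    # 1024 threads in total
print(grid[-1])     # (3, (15, 15), 1023)
```

Each thread owns a unique (block, threadIdx) pair, and 4 × 256 = 1024 threads are produced in total.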
Once the kernel is declared, the second stage begins. The program is compiled
through NVIDIA's compiler driver, NVCC, which generates a set of binaries including
the GPU assembly code, Parallel Thread eXecution (PTX), containing the execution
plan for a given thread [19]. Each thread is assigned a unique three-element
identifier (x, y, z coordinates), threadIdx, that locates it in the GPU execution
plan. Through several of the available compilation flags, NVCC can perform
optimizations that increase kernel performance. One of those flags,
-maxrregcount, grants the programmer a way to cap the maximum registers
allowed per thread, which can greatly impact kernel performance. By reducing
register usage per thread, the same register file can effectively accommodate more
blocks per SM, resulting in more warps being dispatched. Another advantage is
preventing register spilling: with complex kernels, NVCC's task of creating PTX code
becomes harder, and eventually there are not enough registers to satisfy a thread's
needs. In those cases, local memory is used to take over the role of registers, and
since this type of memory is addressed in the global memory space, it inherits all
of its characteristics, such as latency. The main issue with the maxrregcount flag
is that it forces the compiler to generate additional instructions, which may not
compensate for the extra one or two blocks per SM.
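The register trade-off can be sketched with a simplified occupancy estimate (our own simplification using the Kepler-like limits of Table 2.2; it ignores shared-memory and warp-granularity constraints that a real occupancy calculator applies):

```python
# Simplified model of register-limited occupancy. The limits below are the
# Kepler-like values from Table 2.2; shared-memory and allocation-granularity
# constraints are ignored for clarity.
def blocks_per_sm(regfile, regs_per_thread, threads_per_block,
                  max_blocks, max_threads):
    by_registers = regfile // (regs_per_thread * threads_per_block)
    by_threads = max_threads // threads_per_block
    return min(by_registers, by_threads, max_blocks)

# 64 registers/thread with 256-thread blocks: the register file allows 4 blocks.
print(blocks_per_sm(65536, 64, 256, 16, 2048))   # 4
# Capping registers at 32 (e.g. -maxrregcount=32) doubles resident blocks.
print(blocks_per_sm(65536, 32, 256, 16, 2048))   # 8
```

The second call shows the intended effect of the flag: halving per-thread registers doubles the number of resident blocks, up to the thread and block limits of the SM.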
Furthermore, NVCC has internal mechanisms able to optimize redundant
code and prevent duplicate operations such as those in Fig. 2.8.
Finally, the program executes on the GPU. At this point, all threads are
organized spatially in a single- or multi-dimensional grid formed by blocks. The
SMs are assigned multiple unique blocks (Fig. 2.9) by a global scheduler, the GigaThread
unit in post-Tesla microarchitectures, from which the SMs schedule and execute smaller
groups of 32 consecutive threads called warps. Threads in a warp execute one common
instruction at a time, which should not invoke conditional branching operations, as
that introduces thread divergence and therefore the serial execution of all threads on
each branching path until they reach a common instruction again (this only applies to
threads in the same warp) [13]. Once a warp finishes executing, a warp scheduler
switches context, with no overhead cost, and replaces the current warp in an SM with a
new one, from the same block or not, that is ready to execute. This mechanism is also used
to mask the latency associated with memory transactions, since it prevents stalling
the pipeline while a warp waits for a transaction to complete.
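The serialization cost of divergence within one warp can be modelled with a toy Python sketch (illustrative only; real hardware re-converges the warp at the first common instruction):

```python
# Toy model of intra-warp divergence: threads of one warp that take
# different branch paths are executed in separate serial passes, so a
# warp's cost grows with the number of distinct paths taken.
def divergent_passes(path_per_thread):
    """Number of serialized passes for one warp's branch outcomes."""
    return len(set(path_per_thread))

print(divergent_passes(['then'] * 32))                   # 1: uniform warp
print(divergent_passes(['then'] * 16 + ['else'] * 16))   # 2: divergent warp
```

A warp whose 32 threads agree on the branch costs a single pass, while an if/else split inside the warp doubles the work; divergence across *different* warps carries no such penalty.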
2.4 Conclusions
This chapter presented an introduction to some basic techniques that are the foundation
for much of the state of the art in pattern matching, the basic concepts of the GA,
and finally a review of NVIDIA's GPUs. The pattern matching techniques can be
divided into two categories, linear and aggregate approximations, which try to create
accurate approximations of a time series using the minimum number of points possible.
The GA belongs to a group of algorithms, EAs, that attempt to solve problems
that do not have a closed-form solution, such as non-convex problems. GPUs are an
alternative execution system to the common multi-core systems that use the CPU
as the main processing unit. The GPU started as a system meant to process 2D
and 3D graphical scenes; however, researchers identified the possibility of using it to
accelerate highly parallel algorithms.
References
1. D.J. Berndt, J. Clifford, Using dynamic time warping to find patterns in time series, in KDD
Workshop (1994), pp. 359–370
2. E. Keogh, S. Chu, D. Hart, M. Pazzani, Segmenting time series: a survey and novel approach.
Data Mining in Time Series Databases (2003), pp. 1–21
3. L. Wei, Sax: N/n not equal an integer case, http://alumni.cs.ucr.edu/wli/
4. J. Lin, E. Keogh, S. Lonardi, B. Chiu, A symbolic representation of time series, with impli-
cations for streaming algorithms, in Proceedings of the 8th ACM SIGMOD Workshop on
Research Issues in Data Mining and Knowledge Discovery, ser. DMKD 2003. (ACM, New
York, NY, USA, 2003), pp. 2–11. https://doi.org/10.1145/882082.882086
5. A. Canelas, R. Neves, N. Horta, A sax-ga approach to evolve investment strategies on financial
markets based on pattern discovery techniques. Expert Syst. Appl. 40(5), 1579–1590 (2013),
http://www.sciencedirect.com/science/article/pii/S0957417412010561
6. B. Lkhagva, Y. Suzuki, K. Kawagoe, Dews2006 4a-i8 extended sax: extension of symbolic
aggregate approximation for financial time series data representation (2006)
7. N. Razali, J. Geraghty, Genetic algorithm performance with different selection strategies in
solving tsp. IEEE Micro 31(2), 50–59 (2011)
8. E. Lindholm, J. Nickolls, S. Oberman, J. Montrym, Nvidia tesla: a unified graphics and com-
puting architecture. IEEE Micro. 28(2), 39–55 (2008)
9. D. Luebke, G. Humphreys, How gpus work. IEEE Comput. Soc. 40(2), 96–100 (2007)
10. C.J. Thompson, S. Hahn, M. Oskin, Using modern graphics architectures for general-purpose
computing: a framework and analysis, in Proceedings 35th Annual IEEE/ACM International
Symposium on Microarchitecture (2002), pp. 306–317
11. T.J. Purcell, I. Buck, W.R. Mark, P. Hanrahan, Ray tracing on programmable graphics hardware,
in Proceedings of ACM SIGGRAPH 2002 ACM Transactions on Graphics (TOG), vol. 21
(2002), pp. 703–712
12. M. Rumpf, R. Strzodka, Level set segmentation in graphics hardware, in Proceedings of Image
Processing, vol. 3 (2001), pp. 1103–1106
13. NVIDIA Corporation, Nvidia cuda compute unified device architecture programming guide
(2015), https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf. Accessed 15
Nov 2015
14. NVIDIA Corporation, Nvidia cuda compute unified device architecture programming guide
(2012), https://www.cs.unc.edu/prins/Classes/633/Readings/CUDA_C_ProgrammingGuide_
4.2.pdf. Accessed 10 Aug 2016
15. NVIDIA Corporation, Whitepaper—nvidia geforce gtx 980 (2014), http://international.download.
nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF
16. C.M. Wittenbrink, E. Kilgariff, A. Prabhu, Fermi gf100 gpu architecture, in Proceedings of the
World Congress on Engineering, vol. 2 (2011)
17. J. Sanders, E. Kandrot, CUDA By Example: An Introduction to General-Purpose GPU Pro-
gramming. (Addison-Wesley, 2012)
18. NVIDIA Corporation, Whitepaper—nvidia geforce gtx 680 (2012), http://people.math.umass.edu/
johnston/M697S12/Nvidia_Kepler_Whitepaper.pdf. Accessed 13 Nov 2015
19. NVIDIA Corporation, Parallel Thread Execution ISA Application Guide v3.2 (2013), http://docs.
nvidia.com/cuda/pdf/Inline_PTX_Assembly.pdf. Accessed 13 Nov 2015
Chapter 3
State-of-the-Art in Pattern Recognition
Techniques
Abstract Pattern recognition, matching or discovery are terms associated with the
comparison of an input query, a pattern, with a time series sequence. These input
queries can be patterns similar to those presented in Chen (Essentials of Technical
Analysis for Financial Markets, 2010 [1]) or user-defined ones. Although the focus
will be on pattern matching techniques applied to financial time series, these techniques
have proved to be very versatile and extendable to different areas, from the medical
sector, with applications in Electrocardiogram (ECG) analysis, Chen et al. (Comput
Methods Programs Biomed 74:11–27, 2004 [2]), to the energy sector, with forecasting
and modelling of buildings' energy profiles, Iglesias and Kastner (Energies 6:579,
2013 [3]).
3.2 Perceptually Important Points
The previous technique tried to find inversion points in the time series in order to
build an equivalent time series. A similar approach is Perceptually Important Points
(PIP). As the name suggests, the PIP technique searches for points that are humanly
identifiable. The process starts with two fixed PIPs, the first point, p1, and the last
point, p2, of the time series P. The next PIP is obtained by maximizing the distance
between the line that unites two consecutive PIPs and a point in the time series
(Fig. 3.1). For instance, p3 is the result of maximizing the distance between the
segment p1p2 and P, while p4 and p5 maximize the distance between P and p1p3 and
p3p2, respectively. This is an iterative process in which each PIP generates two more,
meaning that there is no inherent stopping condition and up to len(dataset) − 1
PIPs are possible.
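The iterative selection just described can be sketched in Python (illustrative names; the vertical distance measure, one of the three discussed next, is used as the point-to-segment distance):

```python
def vd_to_segment(p, a, b):
    """Vertical distance from point p to the line through PIPs a and b."""
    (xa, ya), (xb, yb), (xp, yp) = a, b, p
    if xb == xa:
        return abs(yp - ya)
    return abs(ya + (yb - ya) * (xp - xa) / (xb - xa) - yp)

def find_pips(series, k):
    """Indices of k PIPs of series (values sampled at x = 0, 1, 2, ...)."""
    pts = [(float(i), float(y)) for i, y in enumerate(series)]
    pips = [0, len(pts) - 1]               # p1 and p2 are fixed
    while len(pips) < k:
        best_d, best_i = -1.0, None
        for a, b in zip(pips, pips[1:]):   # every segment between PIPs
            for i in range(a + 1, b):
                d = vd_to_segment(pts[i], pts[a], pts[b])
                if d > best_d:
                    best_d, best_i = d, i
        if best_i is None:                 # no points left to promote
            break
        pips.append(best_i)
        pips.sort()
    return pips

print(find_pips([0, 1, 5, 1, 0], 3))   # [0, 2, 4]: the spike is found first
```

With k = len(series) every point eventually becomes a PIP, matching the observation that the process has no inherent stopping condition.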
Although it was previously mentioned that a distance measure is used, a formal
description has not yet been given. The authors of [5] present three distinct
measures: Euclidean Distance (ED), Perpendicular Distance (PD) and Vertical
Distance (VD). The ED method maximizes the sum of the distances from each pair of
consecutive PIPs (p_i, p_j) to a possible test point, p_test, in the time series (Eq. 3.2).

ED(p_{test}, p_i, p_j) = \sqrt{(x_j - x_{test})^2 + (y_j - y_{test})^2} + \sqrt{(x_i - x_{test})^2 + (y_i - y_{test})^2}   (3.2)
PD instead uses the perpendicular distance from the test point to the line segment
that connects p_i to p_j. The slope of the line segment p_i p_j is given by Eq. 3.3,
while the foot of the perpendicular, p_c, on the line through p_i p_j can be calculated
using Eq. 3.4. Finally, the PIP is found by maximizing Eq. 3.5.

s = Slope(p_i, p_j) = \frac{y_j - y_i}{x_j - x_i}   (3.3)

x_c = \frac{x_{test} + s \cdot y_{test} + s^2 \cdot x_j - s \cdot y_j}{1 + s^2}, \qquad y_c = s \cdot (x_c - x_j) + y_j   (3.4)

PD(p_{test}, p_c) = \sqrt{(x_c - x_{test})^2 + (y_c - y_{test})^2}   (3.5)
The last measure presented is VD: the vertical distance (along the y-axis) between
the test point and the segment p_i p_j, calculated by Eq. 3.6.

VD(p_{test}, p_c) = \left| y_i + (y_j - y_i) \cdot \frac{x_c - x_i}{x_j - x_i} - y_{test} \right|   (3.6)
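Written directly from Eqs. 3.2–3.6, the three measures can be sketched as follows (Python; for VD the foot x_c coincides with the test point's x-coordinate):

```python
import math

def ed(p_test, p_i, p_j):
    """Euclidean distance (Eq. 3.2): sum of distances to both PIPs."""
    (xi, yi), (xj, yj), (xt, yt) = p_i, p_j, p_test
    return math.hypot(xj - xt, yj - yt) + math.hypot(xi - xt, yi - yt)

def pd(p_test, p_i, p_j):
    """Perpendicular distance (Eqs. 3.3-3.5) to the line through p_i, p_j."""
    (xi, yi), (xj, yj), (xt, yt) = p_i, p_j, p_test
    s = (yj - yi) / (xj - xi)                                # Eq. 3.3
    xc = (xt + s * yt + s * s * xj - s * yj) / (1 + s * s)   # Eq. 3.4
    yc = s * (xc - xj) + yj
    return math.hypot(xc - xt, yc - yt)                      # Eq. 3.5

def vd(p_test, p_i, p_j):
    """Vertical distance (Eq. 3.6), evaluated at the test point's x."""
    (xi, yi), (xj, yj), (xt, yt) = p_i, p_j, p_test
    yc = yi + (yj - yi) * (xt - xi) / (xj - xi)
    return abs(yc - yt)

# A point 3 above the midpoint of a flat segment: PD and VD both give 3.
print(pd((2.0, 3.0), (0.0, 0.0), (4.0, 0.0)))   # 3.0
print(vd((2.0, 3.0), (0.0, 0.0), (4.0, 0.0)))   # 3.0
```

On a sloped segment PD and VD diverge: PD measures the shortest distance to the line, VD only the vertical offset, which is why PD tends to be the more accurate (and slightly more expensive) measure.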
So far, only dimensionality reduction techniques for time series have been presented.
Following the same idea as ED, a template of the pattern can be searched for by
minimizing the point-to-point distance (Eq. 3.7) between the time series, P, and the
PIP template, T.

Vertical\ Distance(P, T) = \sqrt{\frac{1}{n} \sum_{k=1}^{n} \left( p_{k,x} - t_{k,x} \right)^2}   (3.7)
With a vertical similarity measure established, only a horizontal measure is missing.
This measure (Eq. 3.8) must take into consideration possible time distortion
between the template and the time series.

Horizontal\ Distance(P, T) = \sqrt{\frac{1}{n-1} \sum_{k=2}^{n} \left( p_{k,y} - t_{k,y} \right)^2}   (3.8)

To determine whether a template matches the time series, a combination of Eq. 3.7 with
Eq. 3.8 is needed. A weighted method can be used, in which a weight factor is assigned
to both measures. Based on experiments, [5] suggests a weight factor of 0.5, where the
horizontal and vertical measures contribute equally to the final distance (Eq. 3.9).
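A sketch of the combined template distance follows; since Eq. 3.9 itself is not reproduced above, the weighted-sum form used here is an assumption, with w = 0.5 as suggested in [5]:

```python
import math

def vertical_distance(P, T):
    # Eq. 3.7; subscript conventions follow the text, with each point as a
    # (x, y) tuple.
    n = len(P)
    return math.sqrt(sum((P[k][0] - T[k][0]) ** 2 for k in range(n)) / n)

def horizontal_distance(P, T):
    # Eq. 3.8: the first point of each sequence is anchored, so k starts at 2.
    n = len(P)
    return math.sqrt(sum((P[k][1] - T[k][1]) ** 2 for k in range(1, n)) / (n - 1))

def template_distance(P, T, w=0.5):
    # Weighted combination of the two measures; the weighted-sum form is an
    # assumption standing in for Eq. 3.9, which is not reproduced here.
    return w * vertical_distance(P, T) + (1 - w) * horizontal_distance(P, T)

identical = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
print(template_distance(identical, identical))   # 0.0: perfect match
```

A distance of zero indicates a perfect match; in practice a threshold on this combined distance decides whether the template matches the framed subsequence.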
For evaluation, [5] used a dataset with 2532 points taken from the HSI index.
To select a default distance measure, a comparative test was performed in which PD
presented the highest accuracy of all methods while only being slower than VD. For
benchmarking, both template- and rule-based techniques were matched against PAA.
Of the three methods, the template-based PIP approach presented the best overall
results with 96% accuracy, followed by PAA with approximately 82% of patterns
correct, while the rule-based PIP had the worst results with an accuracy of around
38%.
The work in [6] introduced a hybrid approach combining a rule-based method
with Spearman's rank correlation coefficient to compare the degree of similarity
between two patterns. The authors use PIP with a sliding window and two types of
displacement: if the subsequence being tested matches a pattern, the window slides
W units, where W is the size of the window; if it does not match any pattern, the
window is shifted by one unit. With this, the authors expect to accelerate the overall
process without skipping most patterns. To determine whether the window should be
shifted or displaced, Spearman's rank correlation coefficient is applied to the PIP
values. For both the time series and the search pattern, the PIP values are converted
into ranks according to their value, so that a low PIP value corresponds to a low rank.
It is then possible to determine the level of similarity between the input time series
and the pattern, using Spearman's correlation coefficient (Eq. 3.10),
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}   (3.10)

where n is the number of ranks, in this case the number of PIPs, and d_i is the
difference between the ranks of PIP_i^{time series} and PIP_i^{pattern}. This
coefficient ranges from −1 to 1: if the magnitude of ρ is near 1, the framed time
series and the pattern are identical, while if it is near 0, they are not a match.
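The ranking-plus-correlation test of [6] can be sketched as follows (Python; tie handling is simplified to rank-by-position):

```python
def ranks(values):
    # Rank 1 = lowest value; ties are broken by position (a simplification).
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman(series_pips, pattern_pips):
    # Eq. 3.10 applied to the two rank sequences.
    n = len(series_pips)
    d2 = sum((a - b) ** 2
             for a, b in zip(ranks(series_pips), ranks(pattern_pips)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Same shape at different scales: ranks agree, so rho = 1.
print(spearman([1.2, 3.4, 2.0, 5.1], [10, 30, 20, 50]))   # 1.0
# Mirrored shape: ranks are reversed, so rho = -1.
print(spearman([1, 2, 3, 4], [4, 3, 2, 1]))               # -1.0
```

Because only ranks are compared, the test is insensitive to amplitude differences between the framed subsequence and the pattern, which is exactly what makes it a cheap pre-filter for deciding the window displacement.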
Two different datasets were used to test the proposed technique, a synthetic one
and a real one. On the synthetic dataset, the rule-based method with Spearman rank
correlation outperforms the simple template method in finding common patterns—Multiple
Top, Head-and-Shoulders and Spikes. In 7 of the 8 input patterns, the technique has
an overall accuracy of 95%, only dropping to around 85% for the Spike Top pattern.
The real dataset consisted of information extracted from the HSI index over the
previous 21 years, and the results are very similar to those obtained on the synthetic
dataset, with an increase in the accuracy for Spike patterns given their low level of
occurrence [6].
A different approach was introduced in [7], where the authors present an evolutionary
pattern discovery approach using GAs and PIP, resorting to a clustering technique
to group similar patterns into corresponding clusters. This method starts with a
randomly generated initial population of size PSize, in which each individual is a
possible time series solution. The time series in a chromosome is then divided into k
clusters and evaluated with an appropriate fitness function, followed by several genetic
operations and individual selection. The process is iterated until the termination
criterion is met or the maximum number of generations is reached. To validate this
approach, the authors use a normalized and smoothed dataset of Taiwanese companies
dated from January 2005 to December 2006. As for the algorithm parameters, the
population size was set to 100 individuals, each with 6 clusters, with a crossover rate
of 0.8, a mutation rate of 0.3, and a stopping criterion of 300 generations.
The proposed technique shows decent results, detecting 2 known patterns, Double
Top and Double Bottom, and, additionally, Uptrend and Downtrend detection,
although the use of clusters does not seem entirely convincing, as some level of
abstraction is required when matching each cluster to the corresponding pattern.
Continuing the trend of evolutionary algorithms, [8] uses Neural Networks (NN)
to accurately detect the Head-and-Shoulders (HS) pattern. This method is based on
a two-layer feed-forward NN mechanism, the Self-Organizing Map (SOM), where
the output layer is formed by nodes or neurons and the input layer by the
training/validation data, with the objective of minimizing the distance between
a node and the input vector. The authors use a SOM with two nodes, meaning that
two possible “clusters” are allowed: one with sequences matching the HS pattern
and another with irrelevant patterns. The input vectors are created by transforming
the rule-based training patterns into rescaled 64 × 64 binary matrices, later
compressed into 16 × 16 matrices by summing the neighbours of the original matrix.
The authors report a recognition rate of 97.1% for the HS pattern, although this can
be disputed given that the result is highly dependent on the quality of the input
patterns: since this method relies on a dataset to train the network, a lower-quality
training set produces a less efficient network and, therefore, worse results.
3.3 Turning Points
As the name states, this method searches for Turning Points (TPs) in a time series;
considering a sliced time series, TPs represent local minima and maxima, indicating
the trend of the stock [9].
While iterating over a time series, a TP is found at t = i if the time series value, p_i,
is lower or higher than both of its neighbours, p_{i−1} and p_{i+1}, so that

f(p_{i-1}) > f(p_i) \text{ and } f(p_{i+1}) > f(p_i) \Rightarrow \text{Minimum}
f(p_{i-1}) < f(p_i) \text{ and } f(p_{i+1}) < f(p_i) \Rightarrow \text{Maximum}   (3.11)
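Basic TP detection per Eq. 3.11, before any simplification rules are applied, can be sketched as:

```python
# Sketch of raw turning-point detection (Eq. 3.11): a point is a TP when it
# is strictly below or strictly above both of its neighbours.
def turning_points(series):
    tps = []
    for i in range(1, len(series) - 1):
        if series[i - 1] > series[i] < series[i + 1]:
            tps.append((i, 'min'))
        elif series[i - 1] < series[i] > series[i + 1]:
            tps.append((i, 'max'))
    return tps

print(turning_points([3, 1, 2, 4, 2]))   # [(1, 'min'), (3, 'max')]
```

On real price data this raw pass produces many insignificant TPs, which is what the filtering simplifications that follow are designed to suppress.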
When compared to PLA, both methods locate the maximum and minimum values;
however, the TP technique applies a filter in which points with a low contribution
to the overall shape of the time series are suppressed. This is achieved through a set
of simplifications (Fig. 3.3):
• Case 1—If a down-trend time series is interrupted by a small temporary up-trend,
the maximum and minimum values, MAX_2 and MIN_1, created by this reversal
can be ignored as long as the difference between MAX_2 and MIN_1 is smaller than
MAX_1 minus MAX_2 plus the difference between MIN_1 and the next minimum,
MIN_2.
• Case 2—Similar to the first case: here an up-trend time series has a small trend
reversal, where MAX_1 and MIN_2 can be suppressed.
• Case 3—If an up-trend time series suffers a more noticeable trend reversal, MAX_1
and MIN_2 can be ignored if their values are close to the nearest maximum or
minimum, respectively.
• Case 4—Similar to the third case, but for a down-trend, with the same condition
applied.
In [9], the authors present a comparative study of TP against PIP, since both
techniques are based on extracting points from the original time series. The dataset
used was taken from the HSI market, dated from January 2000 to May 2010. The first
test evaluated the approximation error of both methods, defined as the sum of the
differences between each point i in the time series and the approximated series. The
TP method produced a reduced time series with a higher error than PIP, around 105%
higher in the worst case and 13% in the best, easily justified by the simplifications
used. The second test consisted of verifying the number of trends preserved; in this
case, TP performed better with, on average, 30% more trends preserved, mainly
because TP “is designed to extract as many trends as possible from the time series” [9].
A different approach was introduced by [10], where a stack is used to organize the
TPs based on their contribution to the overall shape of the time series. This stack is
then converted into an Optimal Binary Search Tree (OBST) in which the root holds
the TP with the highest weight or contribution, while the lower branches hold TPs
with little effect on the time series shape.
The test market used was the HSI with a timespan of 10 years, from 2000 to 2010,
for which a time series with 2586 TPs was created. The authors use rule- and
template-based pattern detection techniques to search for a Head-and-Shoulders
(HS) pattern and divide the tests into 3 categories depending on the size of the TP
time series used: C1 with 75%, C2 with 50% and C3 with 25% of the original TPs.
The first step of this technique is to reconstruct the reduced time series from the
stack, depending on the test category. Similar to other methods, this one also uses
a sliding window, in which the rule- and template-based methods are applied to
match the TP time series against the normalized HS pattern. At the end, all patterns
found by both methods are retrieved by the algorithm.
Table 3.1 Comparison between TP, PIP and PLA. Based on the results presented in [10], it is not
possible to state the error value accurately, although PLA has the lowest error, followed by PIP and
then TP. The error metric is the same as used by [9]: the sum of the differences between all
points in the time series and the approximated series

        Error   # Preserved trends   Execution time (ms)
TP      <100    1280                 406
PIP     <100    840                  1329
PLA     <100    700                  1515
Identical to the results presented by [9], the authors of [10] reached similar
conclusions when comparing TP with PIP and PLA (Table 3.1). TP preserves a higher
number of trends in a time series; however, when the overall error is calculated, PIP
outperforms TP, only losing to PLA when considering a low to medium number of
points, as expected from the previous conclusions, since PLA can recreate a time series
with higher fidelity. It is possible to correlate these results with the execution time,
where the error is inversely proportional to the execution time.
3.5 Shapelets
A supervised technique was introduced in [11] in which, instead of searching for a
specific pattern in a time series, the algorithm searches for the best pattern or
sub-sequence capable of characterizing a time series class (a group of similar time
series, T_i, e.g., species of leaves as used in [11]) with a unique identifier. All possible
solutions, or pivots, are computed from the original time series with fixed length l.
The objective is to divide the original dataset D into two sub-groups, D_L and D_R,
based on a threshold distance of the pivots, θ,

Distance(Pivot_j, T_i) \le \theta, \; \forall T_i \in D_L
Distance(Pivot_j, T_i) > \theta, \; \forall T_i \in D_R   (3.12)

and to maximize the quality of the dataset split, in other words, maximizing the
distance between elements of different classes. The pivot (j, i) of time series T_{i,C}
that maximizes the dataset split and θ will be used as the shapelet that
“identifies” class C.
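The split can be sketched as follows; [11] scores candidate splits by information gain, and a simple entropy-based gain over class labels is used here (illustrative names):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def split_gain(distances, labels, theta):
    """Information gain of splitting D into D_L (dist <= theta, Eq. 3.12)
    and D_R (dist > theta), given each series' distance to one pivot."""
    left = [c for d, c in zip(distances, labels) if d <= theta]
    right = [c for d, c in zip(distances, labels) if d > theta]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

# A threshold that separates the classes perfectly recovers the full entropy.
print(split_gain([0.1, 0.2, 0.9, 1.0], ['A', 'A', 'B', 'B'], 0.5))   # 1.0
```

The shapelet search evaluates this gain for every candidate pivot and threshold, keeping the pair that best separates the classes.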
The shapelet rationale is identical to that used in Support Vector Machines
(SVMs), also a supervised classification algorithm, where there are two classes,
A and B, and the objective is to obtain a weight vector w that maximizes the
distance of each point in A and B from a hyperplane. The weight vector gives an
idea of the relative importance of each point in separating the two classes.
A previously unexplored system was used in [12], where, instead of the CPU,
the authors implement the shapelet algorithm on a GPU. The implementation
targeted the Fermi-based GTX 480 GPU (GF100 in Table 2.2), and their approach
relied on using a CUDA core to perform all calculations related to pivot (j, i), while
the thread blocks are organized in such a way that each block is responsible for
one time series T_i.
3.6 Conclusions
Several pattern recognition methods were presented, all with different approaches
to a common objective: a high-fidelity algorithm for pattern discovery or matching.
However, there is a question to be answered: how easily can these algorithms be
ported to a GPU?
The MPLA method with DTW showed good results in discovering and matching
patterns in a financial time series, although it presents some challenges in terms of
GPU optimization. To obtain the reduced time series through parallel execution, the
original series could be segmented into several time windows, each reduced locally
in a thread or block organization. This optimization introduces a new problem,
segments left undetected due to the transitions between windows, which would
require a new analysis at these transition points. On paper, DTW looks like the ideal
method for a GPU implementation, as it relies on a matrix algorithm capable of
exploiting a two-dimensional kernel organization; however, the dependency between
cells limits the overall number of threads or blocks that can be used.
The PIP approaches also demonstrated great success, with high levels of accuracy,
but they come with a serious setback: the PIP representation is highly dependent on
the total number of points in the dataset, which means that even an implementation
making ineffective use of the available GPU resources may still be faster. The effect
of this problem is minimized in the PIP-GA approach, as the evolution of the genetic
algorithm takes the majority of the execution time. A GPU implementation of TP
faces the same issues as PIP, due to the similarities in their data reduction approaches.
Finally, the SAX/GA approach. Considered separately, SAX and the GA present
the best chances of being ported to a GPU. SAX relies on N independent time
windows of size W, ideal for each window to be calculated in a thread organization,
where N threads are responsible for calculating the reduced series, or in a block
organization, in which W threads in a block calculate the distance between two points
in a window, for a total of N blocks. The GA, as previously explained, is also a good
candidate for implementation on a GPU, even though there are partial dependencies
between operators (Table 3.2).
References
1. J. Chen, Essentials of Technical Analysis for Financial Markets, 1st edn (2010)
2. W.-S. Chen, L. Hsieh, S.-Y. Yuan, High performance data compression method with
pattern matching for biomedical ecg and arterial pulse waveforms. Comput. Methods
Programs Biomed. 74(1), 11–27 (2004), http://www.sciencedirect.com/science/article/pii/
S0169260703000221
3. F. Iglesias, W. Kastner, Analysis of similarity measures in times series clustering for the dis-
covery of building energy patterns. Energies 6(2), p. 579 (2013), http://www.mdpi.com/1996-
1073/6/2/579
4. H. Li, C. Guo, W. Qiu, Similarity measure based on piecewise linear approximation and deriva-
tive dynamic time warping for time series mining. Expert Syst. Appl. 38(12), 14732–14743
(2011), http://www.sciencedirect.com/science/article/pii/S0957417411007901
5. T.-C. Fu, F.-l. Chung, R. Luk, C.-M. Ng, Stock time series pattern matching: template-based
versus rule-based approaches. Eng. Appl. Artif. Intell. 20(3), 347–364 (2007). https://doi.org/
10.1016/j.engappai.2006.07.003
6. Z. Zhang, J. Jiang, X. Liu, R. Lau, H. Wang, R. Zhang, A real time hybrid pattern match-
ing scheme for stock time series, in Proceedings of the Twenty-First Australasian Confer-
ence on Database Technologies—Volume 104, ser. ADC 2010, Darlinghurst, Australia, Aus-
tralia: Australian Computer Society, Inc. (2010), pp. 161–170, http://dl.acm.org/citation.cfm?
id=1862242.1862263
7. C.-H. Chen, V. Tseng, H.-H. Yu, T.-P. Hong, Time series pattern discovery by a pip-based evo-
lutionary approach. Soft Comput. 17(9), 1699–1710 (2013). https://doi.org/10.1007/s00500-
013-0985-y
8. A. Zapranis, P. Tsinaslanidis, Identification of the head-and-shoulders technical analysis pattern
with neural networks, in Artificial Neural Networks—ICANN 2010, ed. by K. Diamantaras, W.
Duch, L. Iliadis. Lecture Notes in Computer Science, vol. 6354 (Springer, Berlin Heidelberg,
2010) pp. 130–136. https://doi.org/10.1007/978-3-642-15825-4_17
9. J. Yin, Y. Si, Z. Gong, Financial time series segmentation based on Turning Points, in 2011
International Conference on System Science and Engineering (ICSSE) (2011) pp. 394–399,
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5961935
10. Y.-W. Si, J. Yin, OBST-based segmentation approach to financial time series. Eng. Appl.
Artif. Intell. 26(10), 2581–2596 (2013), http://www.sciencedirect.com/science/article/pii/
S0952197613001723
11. L. Ye, E. Keogh, Time series shapelets, in Proceedings of the 15th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining—KDD 2009 (2009), p. 947, http://
portal.acm.org/citation.cfm?doid=1557019.1557122
12. K.W. Chang, B. Deka, W.M.W. Hwu, D. Roth, Efficient pattern-based time series classification
on GPU, in Proceedings—IEEE International Conference on Data Mining, ICDM (2012), pp.
131–140, http://www.biplabdeka.net/files/icdm2012.pdf
13. A. Canelas, R. Neves, N. Horta, A SAX-GA approach to evolve investment strategies on
financial markets based on pattern discovery techniques. Expert Syst. Appl. 40(5), 1579–1590
(2013). https://doi.org/10.1016/j.eswa.2012.09.002
Chapter 4
SAX/GA CPU Approach
The SAX/GA algorithm [1, 2] was developed to optimize a trading strategy with multiple entry and exit patterns for the Standard & Poor's 500 (S&P 500) index, in either a long or a short position. In a long position, an investor purchases X shares of a stock at the day-i market value, expecting their value to increase. In a short position, an investor borrows X shares of a stock on day i from a brokerage firm and sells them at the day-i market value, thereby becoming indebted to the firm. The short investor expects a decrease in value; once the price has reached the prospected level, the investor purchases X shares of the same stock at the day i + n price and returns them to the brokerage firm, profiting the difference.
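The long/short profit logic described above can be sketched as follows; the function names, share count, and prices are illustrative assumptions, not values from the source:

```python
def long_profit(shares: int, buy_price: float, sell_price: float) -> float:
    """Long position: buy at day i, sell later; profits when the price rises."""
    return shares * (sell_price - buy_price)


def short_profit(shares: int, sell_price: float, buyback_price: float) -> float:
    """Short position: borrow and sell at day i, buy back at day i + n;
    profits when the price falls."""
    return shares * (sell_price - buyback_price)


# Hypothetical example: 100 shares, price moves from 50.0 down to 45.0.
print(long_profit(100, 50.0, 45.0))   # -500.0 (long loses as the price falls)
print(short_profit(100, 50.0, 45.0))  # 500.0 (short gains the difference)
```

Either position profits from the same price move of opposite sign, which is why the strategy searches for both entry (long) and exit (short) patterns.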
The algorithm uses daily historical data extracted from the S&P 500 index, which is then divided in a sliding-window fashion (Fig. 4.1). The initial training dataset, Dtrain with size Dtsize, is succeeded by a test dataset, Dtest with size Dvsize, used to validate the best strategy obtained during the training phase. Once testing is completed, Dtrain is shifted forward by Dvsize days and the process restarts, until the end of the dataset is reached or a stopping criterion is met.
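A minimal sketch of this sliding-window scheme, assuming a plain Python list of daily prices; the function name and the toy sizes (Dtsize = 4, Dvsize = 2) are illustrative, not from the source:

```python
def sliding_windows(data, train_size, test_size):
    """Yield (train, test) pairs; after each test period the training
    window is shifted forward by test_size days, as in Fig. 4.1."""
    start = 0
    while start + train_size + test_size <= len(data):
        train = data[start : start + train_size]
        test = data[start + train_size : start + train_size + test_size]
        yield train, test
        start += test_size


# Toy series of 10 "daily prices": three train/test splits are produced.
for train, test in sliding_windows(list(range(10)), 4, 2):
    print(train, test)
# [0, 1, 2, 3] [4, 5]
# [2, 3, 4, 5] [6, 7]
# [4, 5, 6, 7] [8, 9]
```

Note that successive training windows overlap by Dtsize − Dvsize days: only the shift size is fixed by the validation length, so each strategy is always validated on data it was never trained on.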
A Sound Offense
We want all of our players and coaches to understand and use our offensive terminology. One or two words either explain the descriptive action we want or identify some segment of the offense or the opposition's defense. The terms are simple, meaningful and descriptive.
The stance for the linemen, with the exception of the center, is basically the same, with allowances made for physical characteristics that vary from individual to individual. The inside foot is forward, the feet staggered in an arch-to-toe relationship. The tackles and ends exaggerate the stagger to heel-to-toe since they are further removed from the center and quarterback.
The feet should not be spread wider than the individual's shoulders, with the weight of the body concentrated on the balls of the feet. The heels should be turned slightly in, with the cleats on the heel of the forward foot almost touching the ground. The ankles should be bent slightly. The knees should be bent slightly more than 90 degrees and turned slightly in. The tail is even with or a little higher than the shoulders, splitting the forward and rear heels. The back is straight, shoulders square, neck relaxed, and eyes open, keeping the defensive linebacker in the line of sight. The hands are placed down slightly outside of the feet, elbows relaxed, and thumbs in and slightly forward of the shoulders.
The center lines up in a left-handed stance with the feet even and
slightly wider than the shoulders. The weight is on the balls of the
feet, heels turned slightly in, with the cleats on the heels of the shoes
almost touching the ground. The knees are slightly in and bent a little
more than 90 degrees. The tail is slightly higher than the shoulders
and about two inches in front of the heels. The center places his left hand inside his legs, down from between his eye and ear, almost directly under the forehead, with the fingers spread and the thumb turned slightly in. The shoulders are square, the back is straight, the neck is relaxed, and the eyes are looking upward. His right hand grasps
the football like a passer. He should reach out as far as possible
without changing his stance. The center is coached to place the ball
on his tail as quickly as possible with a natural turn of the arm. He
should drive out over the ball with his head coming up and tail down,
keeping his shoulders square as he makes his hand-back to the
quarterback.
Quarterback’s Stance
Halfback’s Stance
The feet of the halfback should not be wider than the shoulders, and staggered in a heel-to-toe relationship with each other. The weight should be on the balls of the feet, but will vary slightly depending upon the direction the halfback must move in carrying out his particular assignment. With the snap of the ball he should throw himself in the direction he is going, and he should not use a crossover step.
His knees should be bent a little beyond 90 degrees, with the
knees and heels turned slightly in, and the tail a little higher than the
shoulders. The halfback’s shoulders should be square, with his head
and eyes in a position to see the defensive linebacker on the
opposite side from him. The inside hand should be down, slightly
forward and inside of the knee with the thumb turned a little to the
inside. The body weight should be slightly forward.
Fullback’s Stance
The fullback lines up with the feet even and a little wider than his
shoulders. The cleats on the heels of his shoes should touch the
ground. The heels and knees are turned slightly in with the weight on
the balls of the feet. The head and eyes are in a relaxed position, but
where they can see the second man standing outside of the
offensive end. The hands are directly in front of each foot with the
thumbs turned in. The shoulders are square, the back is straight, the
tail is directly above the heels, with the weight slightly forward, but not to such an extent that he cannot start quickly in a lateral direction to either side.
When the linemen leave the huddle and come up to the line of
scrimmage in a pre-shift position (hands on knees in a semi-upright
stance), the basic split rule for the guards is to split one full man. The
tackles and ends will split slightly more than one full man. As the
linemen go down into their offensive stance, each man (except the
center) will move in, out or remain stationary, depending upon the
particular defensive alignment and the individual’s split rules.
Figure 99a: Even Defense
Figure 99b: Odd Defense
Never emphasize that you are splitting to get good blocking angles; you split in order to isolate a defender. If the defender splits when the offensive man splits, you can isolate him. If his split is static, a good blocking angle will result. Your linemen should never split merely to get the angle, however. It will also help the linemen if they have a clear picture of where the ball crosses the line of scrimmage (the critical point of attack), and of where the ball is being thrown from on a pass play. Then, too, there is no set rule that will cover all defensive situations, and the offensive men must be able to apply the common-sense split rule along with the basic split rule.
Figure 100 illustrates the pre-shift position of the right side of the
offensive line and the application of the guard’s, tackle’s, and end’s
split rules. From the pre-shift stance and position, the offensive men
are allowed to split one-half man either way, according to the
defense. The inside always must be protected. A defensive man
must not be allowed to penetrate or shoot the inside gap as he is
likely to stop the offensive play for a loss.
Figure 100
If the defensive man will move with the offensive man, then the
offense should be able to isolate one man and the point of attack
should be directed toward him. Figures 101a-b illustrate the center’s
man and the offensive right tackle’s man being isolated respectively,
and the critical point of attack being directed at the isolated
defenders.
Figure 101a
Figure 101b
Automatics
On the snap of the ball, the quarterback should dip his hips so his
hands will follow the tail of the center as he charges. This technique
will also help the quarterback push off. The quarterback will take the
ball with his right hand, using the left as a trapper, as was explained
previously. He should make certain he has the ball, without fighting it, before withdrawing his hands from the center's crotch. As
soon as the quarterback has possession of the ball, he should bring
it into his “third hand,” his stomach. Such a procedure will help
prevent a fumble. He then wants to push off and execute his
techniques as quickly as possible.
The quarterback must always be cognizant of the fact he cannot
score without the ball; consequently, he wants to make certain he
has possession of it before pulling out of there. If he gets in a big
hurry, he is likely to drop the ball to the ground. I have seen this
occur many times.
Quarterback Faking
When a ball carrier is in the open field, he should always keep the
tackler guessing. He should not tip off whether he is going to try to
outrun him, run through him, or dodge him, until he is close enough
to the tackler to give him the fake and then get by him. The ball
carrier should never concede he is down, and he should always
keep fighting to gain ground until the whistle stops the play.
The ball carrier should always realize and know exactly where he
is on the field, and just what he must do in order for the play to be
successful. In a majority of cases, a ball carrier should be concerned
only with running for a touchdown.