"Neural-Gas" Network for Vector Quantization and its Application to Time-Series Prediction

Thomas M. Martinetz, Member, IEEE, Stanislav G. Berkovich, and Klaus J. Schulten

Abstract—A neural network algorithm based on a "soft-max" adaptation rule is presented that exhibits good performance in reaching the optimum, or at least coming close, when minimizing a distortion error which, in general, has many local minima. The soft-max rule employed is an extension of the standard K-means clustering procedure and takes into account a "neighborhood ranking" of the reference (weight) vectors. It is shown that the dynamics of the reference (weight) vectors during the input-driven adaptation procedure 1) is determined by the gradient of an energy function whose shape can be modulated through a neighborhood-determining parameter, and 2) resembles the dynamics of Brownian particles moving in a potential determined by the data point density. The network is employed to represent the attractor of the Mackey-Glass equation and to predict its output values. The results obtained for the time-series prediction compare very favorably with the results achieved by back-propagation and radial basis function networks.

I. INTRODUCTION

Adaptive information processing systems play an increasingly important role in technical applications. Data compression, which is required for the transmission and storage of the large amounts of data arising in applications such as speech and image processing, relies on "vector quantization" techniques (see, e.g., [1]).

The aim of vector quantization is to encode a given data manifold $V \subseteq R^D$ utilizing only a finite set $w = (w_1, \dots, w_N)$ of reference or "codebook" vectors (cluster centers) $w_i \in R^D$, $i = 1, \dots, N$. A data vector $v \in V$ is described by the best-matching or "winning" reference vector $w_{i(v)}$ of $w$, for which the distortion error, e.g., the squared error $\|v - w_{i(v)}\|^2$, is minimal. This procedure divides the manifold $V$ into a number of subregions

$$ V_i = \{\, v \in V : \|v - w_i\| \le \|v - w_j\| \ \forall j \,\}, \tag{1} $$

called Voronoi polyhedra, within which each data vector $v$ is described by its corresponding reference vector $w_i$. If the probability distribution of data vectors over the manifold $V$ is described by $P(v)$, then the average distortion or reconstruction error

$$ E(w) = \int d^D v \, P(v) \, (v - w_{i(v)})^2 \tag{2} $$

has to be minimized through an optimal choice of reference vectors $w_i$. Because $E(w)$ in general has many local minima, a straightforward gradient descent on (2) does not yield satisfactory results. As pointed out by several authors [2]-[5], results concerning optimality can only be expected if the set of reference vectors is adapted iteratively: the data vectors $v \in V$ are presented sequentially, and for each presented data vector the reference vectors are adjusted. The standard K-means clustering procedure employs the "winner-take-all" adaptation step

$$ \Delta w_i = \epsilon \, \delta_{i\,i(v)} \, (v - w_i) \tag{3} $$

with $\delta_{ij}$ as the Kronecker delta, i.e., with each presented data vector only the "winner" $w_{i(v)}$ is adjusted. Since each adaptation step is determined solely by the currently closest reference vector, the K-means procedure easily becomes trapped in suboptimal configurations.

For this reason, "soft-max" adaptation rules have been proposed, which adjust not only the winner $w_{i(v)}$ but all reference vectors, each to a degree that decreases with some measure of its "distance" to the current data vector. The maximum-entropy clustering employs the adaptation step

$$ \Delta w_i = \epsilon \, \frac{e^{-(v - w_i)^2 / (2\rho^2)}}{\sum_{j=1}^{N} e^{-(v - w_j)^2 / (2\rho^2)}} \, (v - w_i) \tag{4} $$

with $\rho$ determining the range of the interaction. This rule obeys a stochastic gradient descent on the cost function

$$ E_{me}(w, \rho) = -\rho^2 \int d^D v \, P(v) \, \ln \sum_{j=1}^{N} e^{-(v - w_j)^2 / (2\rho^2)} \tag{5} $$

which, in the limit $\rho \to 0$, becomes equivalent to (2). Starting with a large $\rho$ and slowly decreasing it during the adaptation ("annealing") allows the system to escape local minima. However, the performance of the maximum-entropy clustering in minimizing $E$ depends critically on the annealing schedule: for the cooling to be effective, the decrease of $\rho$ must be slow, which requires large numbers of adaptation steps. Thus, for finite $\rho$ the maximum-entropy clustering in general does not minimize (2) exactly, which may lead to a poor overall performance of the resulting set of reference vectors for practically feasible numbers of iteration steps.

The third approach, Kohonen's topology-conserving feature map, also employs a soft-max adaptation rule of the form

$$ \Delta w_i = \epsilon \, h(i, i(v)) \, (v - w_i) \tag{6} $$

in which not only the winner $w_{i(v)}$ but also its neighbors within a prespecified external lattice are adjusted; the neighborhood function $h(i, i(v))$ is determined by the arrangement of the reference vectors within this lattice, e.g., by the lattice distance between unit $i$ and the winning unit $i(v)$. This establishes topographic relationships between the $w_i$'s and has been utilized for mapping low-dimensional topological structures onto the input signal space. However, if the topology of the chosen lattice does not match the topology of the data manifold, the lattice constraint can hinder the $w_i$'s from converging to a distribution that yields a small distortion error. Furthermore, it has been shown [23] that for $h(i, i(v)) > 0$ one cannot specify a cost function that is minimized by (6).

II. THE "NEURAL-GAS" ALGORITHM

In this paper, we present a neural network model which, applied to the task of vector quantization, 1) converges quickly to low distortion errors, 2) reaches a distortion error $E$ lower than that resulting from K-means clustering, from maximum-entropy clustering (for practically feasible numbers of iteration steps), and from Kohonen's feature map, and 3) at the same time obeys a gradient descent on an energy surface (like the maximum-entropy clustering, and in contrast to Kohonen's feature map algorithm). For reasons we will give later, we call this network model the "neural-gas" network. Similar to the maximum-entropy clustering and Kohonen's feature map, the neural-gas network also uses a "soft-max" adaptation rule. However, instead of the distance $\|v - w_i\|$ or of the arrangement of the $w_i$'s within an external lattice, it utilizes a "neighborhood ranking" of the reference vectors $w_i$ for the given data vector $v$.

Each time data vector $v$ is presented, we determine the "neighborhood ranking" $(w_{i_0}, w_{i_1}, \dots, w_{i_{N-1}})$ of the reference vectors, with $w_{i_0}$ being closest to $v$, $w_{i_1}$ being second closest to $v$, and $w_{i_k}$, $k = 0, \dots, N-1$, being the reference vector for which there are $k$ vectors $w_j$ with $\|v - w_j\| < \|v - w_{i_k}\|$. If we denote the number $k$ associated with each vector $w_i$ by $k_i(v, w)$, which depends on $v$ and the whole set $w = (w_1, \dots, w_N)$ of reference vectors, then the adaptation step we employ for adjusting the $w_i$'s is given by

$$ \Delta w_i = \epsilon \, h_\lambda(k_i(v, w)) \, (v - w_i), \qquad i = 1, \dots, N. \tag{7} $$

The step size $\epsilon \in [0, 1]$ describes the overall extent of the modification, and $h_\lambda(k_i(v, w))$ is unity for $k_i = 0$ and decays to zero for increasing $k_i$, with a characteristic decay constant $\lambda$.
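As a concrete illustration of adaptation rule (7), the following minimal Python sketch performs a single neural-gas step for one presented data vector. It assumes the exponential neighborhood function $h_\lambda(k) = e^{-k/\lambda}$ that is introduced below; the function and array names are our own, and the snippet is a sketch of the rule, not the authors' implementation.

```python
import numpy as np

def neural_gas_step(w, v, eps, lam):
    """Apply adaptation rule (7) to all reference vectors for one data vector.

    w   : (N, D) array of reference vectors w_i
    v   : (D,)  presented data vector
    eps : step size epsilon in [0, 1]
    lam : decay constant lambda of the neighborhood function
    """
    # Neighborhood ranking: ranks[i] = number of reference vectors
    # strictly closer to v than w_i, so rank 0 marks the "winner" w_{i_0}.
    dists = np.linalg.norm(w - v, axis=1)
    ranks = np.argsort(np.argsort(dists))
    # h_lambda(k) = exp(-k / lambda): unity for the winner, decaying with
    # rank; in the limit lambda -> 0 this recovers the K-means rule (3).
    h = np.exp(-ranks / lam)
    return w + eps * h[:, None] * (v - w)
```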
In the simulations we describe in the following, we chose $h_\lambda(k_i(v, w)) = e^{-k_i(v, w)/\lambda}$. For $\lambda \to 0$, (7) becomes equivalent to the K-means adaptation rule (3), whereas for $\lambda > 0$ not only the "winner" $w_{i_0}$ but also the second-closest reference vector $w_{i_1}$, the third-closest reference vector $w_{i_2}$, etc., are updated.

As we show in Appendix I, the dynamics of the $w_i$'s obeys a stochastic gradient descent on the cost function

$$ E_{ng}(w, \lambda) = \frac{1}{2C(\lambda)} \sum_{i=1}^{N} \int d^D v \, P(v) \, h_\lambda(k_i(v, w)) \, (v - w_i)^2 \tag{8} $$

with

$$ C(\lambda) = \sum_{k=0}^{N-1} h_\lambda(k) $$

as a normalization factor that depends only on $\lambda$. $E_{ng}$ is related to the framework of fuzzy clustering [24], [25]. In contrast to hard clustering, where each data point $v$ is deterministically assigned to its closest reference vector $w_{i(v)}$, fuzzy clustering associates $v$ with a reference vector $w_i$ to a certain degree $p_i(v)$, the so-called "fuzzy membership" of $v$ to cluster $i$. In the case of hard clustering, $p_{i(v)}(v) = 1$ and $p_i(v) = 0$ for $i \ne i(v)$ holds. If we choose a "fuzzy" assignment of data point $v$ to reference vector $w_i$ which depends on whether $w_i$ is the nearest, next-nearest, next-next-nearest, etc., neighbor of $v$, i.e., if we choose $p_i(v) = h_\lambda(k_i(v, w))/C(\lambda)$, then the average distortion error we obtain, and which has to be minimized, is given by $E_{ng}$, and the corresponding gradient descent is given by adaptation rule (7).

Through the decay constant $\lambda$ we can modulate the shape of the cost function $E_{ng}$. For $\lambda \to \infty$ the cost function $E_{ng}$ becomes parabolic, whereas for $\lambda \to 0$ it becomes equivalent to the cost function in (2), i.e., the cost function we ultimately want to minimize, but which has many local minima. Therefore, to obtain good results concerning the final set of reference vectors, we start the adaptation process determined by (7) with a large decay constant and decrease $\lambda$ with each adaptation step. By gradually decreasing the parameter $\lambda$ we expect the local minima of $E$ to emerge slowly, thereby preventing the set $w$ of reference vectors from getting trapped in suboptimal states.

III. THE NETWORK'S PERFORMANCE ON A MODEL PROBLEM

To test the performance of the neural-gas algorithm in minimizing $E$ and to compare it with the three other approaches we described (K-means clustering, maximum-entropy clustering, and Kohonen's topology-conserving map), we choose a data distribution $P(v)$ for which 1) the global minimum of $E$ is known for large numbers of reference vectors and 2) which reflects, at least schematically, essential features of data distributions that are typical in applications. Data distributions that arise in applications often consist of several, eventually separated, clusters of data points. Therefore, also for our test we choose a model data distribution that is clustered. To be able to determine the global minimum, in our model data distribution the clusters are of square shape within a two-dimensional input space. Since we choose $N = 4 \times$ (number of clusters) and separate the clusters far enough from each other, the optimal set of $w_i$'s is given when each of the square clusters is represented by four reference vectors, and when the four reference vectors within each cluster are arranged in the known optimal configuration for a single square.

In Fig. 1 we see the neural-gas network adapting to a representation of our model data distribution with 15 clusters and $N = 60$ reference vectors. With each adaptation step, a data point within one of the squares is stochastically chosen, with equal probability over each square. Subsequently, adjustments of the $w_i$'s according to (7) are performed. We show the initial state, the state after 5000 and 15000 steps, and finally the state after 80000 adaptation steps.
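To make the annealed adaptation process of this section concrete, here is a minimal sketch of a full training run on a clustered two-dimensional model distribution of the kind just described. It reuses neural_gas_step from the sketch above. The grid layout of the clusters, the exponential decay schedules, and all numerical values (lam_i, lam_f, eps_i, eps_f, the seed) are illustrative assumptions rather than settings quoted from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# 15 well-separated unit squares on a 5 x 3 grid (illustrative layout).
centers = np.array([(4.0 * i, 4.0 * j) for i in range(5) for j in range(3)])

def sample_v():
    # Each square is equally probable; data points are uniform within it.
    c = centers[rng.integers(len(centers))]
    return c + rng.uniform(0.0, 1.0, size=2)

N, t_max = 60, 80_000                      # N = 4 x number of clusters
w = rng.uniform(0.0, 17.0, size=(N, 2))    # random initial w_i's

# Decrease lambda (and the step size) with each adaptation step, so that
# E_ng deforms slowly from its parabolic shape toward the cost function (2).
lam_i, lam_f = 10.0, 0.01
eps_i, eps_f = 0.5, 0.005

for t in range(t_max):
    frac = t / t_max
    lam = lam_i * (lam_f / lam_i) ** frac
    eps = eps_i * (eps_f / eps_i) ** frac
    w = neural_gas_step(w, sample_v(), eps, lam)   # rule (7), sketched above
```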
In the simulation run depicted in Fig. 1, the neural-gas algorithm was able to find the optimal representation of the data distribution. However, depending on the initial choice of the $w_i$'s (chosen randomly) and on the speed with which the parameter $\lambda$ is decreased, i.e., depending on the total number of adaptation steps $t_{max}$ employed, it might happen that the reference vectors converge to a configuration that is only close to, but not exactly at, the optimum. Therefore, to demonstrate the average performance of the neural-gas algorithm, we show in Fig. 2 the mean distortion error for different total numbers of adaptation steps $t_{max}$. For each of the different total numbers of adaptation steps we averaged over 50 simulation runs, for each of which not only the initialization of the $w_i$'s but also the placement of the 15 clusters of our model data distribution was chosen randomly.

[Fig. 1. The neural-gas network representing a data distribution in 2D that consists of 15 separated clusters of square shape. On each cluster the data point density is homogeneous. The reference vectors $w_i$ are depicted as points. The initial values for the $w_i$'s are chosen randomly, as shown in the top left picture. We show the state after 5000 (top right) and 15000 (bottom left) adaptation steps, and the final state after 80000 adaptation steps (bottom right). At the end of the adaptation process the set of reference vectors has converged to the optimal configuration, i.e., each cluster is represented by four reference vectors.]

Since we know the minimal distortion error $E_0$ that can optimally be achieved for our model data distribution and the number of reference vectors we employ, we choose

$$ \alpha = \frac{E(t_{max}) - E_0}{E_0} $$

as a performance measure, with $E(t_{max})$ as the final distortion error reached. $\alpha = 0$ corresponds to a simulation run that reached the global minimum, whereas, e.g., $\alpha = 1$ corresponds to a very large distortion error, namely one twice as large as the optimum. As we can see in Fig. 2, for $t_{max} = 100000$ the average performance of the neural-gas network is $\alpha = 0.09$, which means that the average distortion error $E$ for $t_{max} = 100000$ is 9% larger than what can optimally be achieved.

For comparison, we also show in Fig. 2 the results achieved by K-means clustering, maximum-entropy clustering, and Kohonen's feature map algorithm. Up to $t_{max} = 8000$, only the distortion error of the K-means clustering is slightly smaller than that of the neural-gas algorithm. For $t_{max} > 8000$, all three procedures perform worse than the neural-gas algorithm. For a total number of 100000 adaptation steps, the distortion error of the maximum-entropy clustering is more than twice as large as the distortion error achieved by the neural-gas algorithm. Theoretically, for the maximum-entropy approach the performance measure $\alpha$ should converge to zero for $t_{max} \to \infty$. However, as mentioned already in the introduction, the convergence might be extremely slow. Indeed, all four clustering procedures, including the maximum-entropy approach and the neural-gas algorithm, do not improve their final distortion error significantly further within the range of adaptation steps studied.
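Continuing the sketches above, the performance measure $\alpha$ can be estimated by Monte Carlo sampling of the distortion error (2). The closed-form $E_0$ used here, 1/24 per unit square for four reference vectors at the quadrant centroids, is our own value for the illustrative geometry in the previous sketch, assuming that this 2 x 2 arrangement is the "known optimal configuration" the text refers to.

```python
import numpy as np

def distortion_error(w, sample_v, n_samples=100_000):
    """Monte Carlo estimate of E in (2): the mean squared distance
    between a data point and its winning (closest) reference vector."""
    total = 0.0
    for _ in range(n_samples):
        v = sample_v()
        total += np.min(np.sum((w - v) ** 2, axis=1))
    return total / n_samples

# Four reference vectors at the quadrant centroids of a unit square give
# a mean squared error of (1/2)^2 / 6 = 1/24 (assumed optimum, see above).
E_0 = 1.0 / 24.0
alpha = (distortion_error(w, sample_v) - E_0) / E_0
print(f"performance measure alpha = {alpha:.3f}")   # alpha = 0: global optimum
```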
