

Copyright 2006 Society of Photo-Optical Instrumentation Engineers. This paper will be published in Intelligent Robots and Computer Vision XXIV: Algorithms, Techniques, and Active Vision and is made available as an electronic reprint with permission of SPIE. One print or electronic copy may be made for personal use only. Systematic or multiple reproduction, distribution to multiple locations via electronic or other means, duplication of any material in this paper for a fee or for commercial purposes, or modification of the content of the paper are prohibited.

Biomimetic sensory abstraction using hierarchical quilted self-organizing maps


Jeffrey W. Miller and Peter H. Lommel
Draper Laboratory, 555 Technology Square, Cambridge, MA 02139-3563, USA
ABSTRACT
We present an approach for abstracting invariant classifications of spatiotemporal patterns presented in a high-dimensionality input stream, and apply an early proof-of-concept to shift- and scale-invariant shape recognition. A model called the Hierarchical Quilted Self-Organizing Map (HQSOM) is developed, using recurrent self-organizing maps (RSOM) arranged in a pyramidal hierarchy, attempting to mimic the parallel/hierarchical pattern of isocortical processing in the brain. The results of experiments are presented in which the algorithm learns to classify multiple shapes, invariant to shift and scale transformations, in a very small (7×7 pixel) field of view.

Keywords: Self-Organizing Maps (SOM), Unsupervised Learning, Pattern Recognition, Cognitive Learning, Image Interpretation, Robot Vision, Computational Neuroscience, Biomimetic

1. INTRODUCTION
The benefits of modeling the brain are abundant. As models of the brain have grown in accuracy, researchers have been able to increase the intelligence our machines exhibit, gain insights leading to neurophysiological discoveries, and even interface directly with the brain. Models of many regions of the mammalian brain are in development, including perceptual areas such as the auditory cortex and visual cortex. We have chosen to focus first on vision, although it is considerably complex, since it offers a rich source of information for use in applications. Computers currently have great difficulty interacting with the world: there is too much variability, too much data, too many options from which to choose. Yet somehow, animals handle this complexity with ease, without even consciously thinking about it.

The isocortex, which constitutes approximately 85% of the human brain (by mass), is responsible for a large part of its advanced activities, including sensory processing (visual, auditory, somatosensory), motor control, language understanding and generation, logic, math, and spatial visualization.[1] The isocortex appears to have a uniform structure, and to perform similar information processing operations throughout its varied functional regions.[2] The ventral stream of the visual cortex is arranged in a hierarchy, increasing in invariance and complexity from bottom to top: from V1 neurons responding to small lines in one part of the field of view, to V2 neurons responding to simple shapes in one part of the field of view, to V4 neurons responding to complex shapes in a large part of the field of view, to IT neurons responding invariantly to complex objects in various orientations in much of the field of view.[3-6]

In landmark studies of the V1 area of the cat and monkey, Hubel and Wiesel observed a system of simple, complex, and hypercomplex cells. Simple cells respond to lines of a specific orientation and location in a single, small receptive field (1/4 × 1/4 to 4 × 4 degrees across). Complex cells respond to lines of a specific orientation anywhere within a single, small receptive field. Hypercomplex cells respond to lines of a specific orientation anywhere within a single, small receptive field, that do not extend beyond the borders of that receptive field. Hierarchically, simple cells converge on complex cells, and complex cells converge on hypercomplex cells.[3,7,8] It is not difficult to envision such a pattern continuing up through V2, V4, and IT.

Several computer models have been built upon this concept, including Neocognitron (Fukushima),[9,10] HMAX (Riesenhuber & Poggio,[11] Serre & Poggio[12]), Neural Abstraction Pyramid (Behnke),[13] HTM (George & Hawkins),[14] ART (Carpenter & Grossberg),[15] and VisNet (Stringer & Rolls).[16] Much progress has been made, but there is much to be done: computer models are not yet capable of extensive visual perception capabilities outside the highly specific tasks for which they are trained. Although our method is still only a proof-of-concept, and has yet to be proven on real-world data, we aim to build upon these successes and make contributions in the following areas:
Author info: jwmiller@draper.com; 1 (617) 258-2945; www.draper.com

- Using learning throughout the model, in order to improve generalization across domains by minimizing required a priori knowledge, and to increase biological realism. Most models use domain knowledge in the form of pre-programmed image features, such as Gabor filters, hand-built edge prototypes, or patches selected from training data. We hope to extend these models by enabling learning of such feature prototypes.

- Using temporal association to form invariant representations, in order to improve generalization across domains and to increase biological realism. VisNet does this, as does HTM; we hope to further develop the method over additional invariant patterns and sequence types.

- Using online learning, in order to enable adaptability to new situations and to increase biological realism. Most of these models separate the training phase from the operational phase. We hope to enable adaptability to new input patterns by making no distinction between training and execution, i.e., the model is always learning, even when being evaluated.

It should be noted that a model that uses learning throughout, such as is presented here, may require longer training and more computational power than one which has certain domain-specific knowledge. However, an accurate cortical model must be able to operate using sensory data from any domain, since it has been demonstrated that the mammalian cortex can learn to understand input data which is very different from what it normally receives.[17-19] We feel that trading off some efficiency for these capabilities is reasonable in the pursuit of models of the brain, since the biological brain also requires lengthy learning and considerable computational power.

2. METHODOLOGY
A model is sought which maximizes the following objectives:

1. Biological functional equivalence (with the visual cortex)
2. Generalizability across sensory domains (visual, auditory, somatosensory, other)
3. Simplicity

In modeling biology, we pursue functional equivalence, in which we assume that there is a basic building block of the cortex which may be treated as a black box, beyond which there is no purpose in modeling additional details that do not affect overall behavior. Generalizability, the ability to function in diverse situations, is a hallmark of human intelligence; when achieved in machines, it will be a critical improvement. Although we have given consideration to computational complexity, efficiency is not a primary objective: we seek first to build a model that works and is as simple as possible, and optimization will be pursued only as necessary for practicality and biological equivalence.

The general approach is to form abstractions based on patterns in the input data (unsupervised learning), which may then be linked to other sensory abstractions, labels, or actions (perhaps using supervised learning, auto-associative memories, or reinforcement learning). We propose that a reduction of sensory input to abstract concepts may be achieved through two processes: spatial clustering and temporal clustering. Spatial clustering forms abstractions based on the similarity of the patterns; temporal clustering forms abstractions based on the temporal proximity of the patterns.

2.1. Spatial clustering


In spatial clustering, each vector x(t) in a given input space V_I is assigned to a group of vectors which are similar. The result is a set of invariant representations for classifying input patterns in the presence of noise or variation. This provides a means for compression from a higher-dimensionality input space V_I to a lower-dimensionality output space V_O. There are many spatial clustering algorithms, such as K-means, Self-Organizing Maps (SOM), Expectation Maximization (EM), and Winner-Take-All (WTA) Neural Networks. The SOM[20] is used here, since it is familiar to the authors, conceptually simple, and well-developed. For more biological realism, neural network models may be used; however, for present purposes the SOM appears functionally equivalent.

The SOM, or Kohonen Neural Network, is a vector quantization algorithm that preserves the topology of the input space: similar input patterns produce similar output patterns. The algorithm constructs a codebook of weight vectors, each representing a map unit occupying a location in map space. Each map unit corresponds to a neuron, and its weight vector corresponds to the strengths of the synapses to it. Weight vectors may be initialized to random values. For iterative learning, at each time step the input vector x(t) is compared with the weight vector w_i for each unit i in map space V_O, and a best-matching unit (BMU) b is chosen according to a measure of similarity such as

$$\|x(t) - w_b\| = \min_{i \in V_O} \{\|x(t) - w_i\|\}, \qquad (1)$$

where $\|\cdot\|$ denotes the Euclidean norm. The w_i are shifted toward x(t) according to the update rule

$$w_i(t+1) = w_i(t) + \eta\, h_{ib}(t)\,(x(t) - w_i(t)), \qquad (2)$$

where i ∈ V_O and η, 0 ≤ η ≤ 1, is the learning rate. Units nearer (in map space) to b are shifted more, according to the neighborhood function h_ib, usually a Gaussian function such as

$$h_{ib}(t) = \exp\!\left(-\frac{\|I_i - I_b\|^2}{2\sigma(t)^2}\right), \qquad (3)$$

where I_i and I_b are the indices of the map units i and b in map space, and σ(t) specifies the span of the function. Normally, η and σ are initialized to higher values to achieve good topological organization and are reduced over time to tune the w_i to precisely correspond to their respective input clusters. However, this has the undesirable effect of requiring a training period separate from operational usage. This does not allow for the neural plasticity displayed by the cortex, and has the practical disadvantage of eliminating the possibility of online adaptation. A solution to this problem is provided by the following adaptive neighborhood function:

$$h_{ib}(t) = \exp\!\left(-\frac{\|I_i - I_b\|^2}{\mu_b(t)\,\gamma^2}\right), \qquad (4)$$

where μ_b(t) is the mean squared error (MSE) of w_b compared with x(t), and is given by

$$\mu_b(t) = \frac{1}{N}\sum_{i=1}^{N}\bigl(x_i(t) - w_{bi}\bigr)^2 = \frac{1}{N}\,\|x(t) - w_b\|^2, \qquad (5)$$
where N is the number of values in the input vector. The effect is to dynamically adjust the neighborhood according to how well the BMU's weight vector w_b is tuned to the input vector x(t). When w_b is a poor match, as is typical early in the learning process, the neighborhood is large. When w_b is well-matched to x(t), the neighborhood is small. In addition to simulating the standard gradual reduction of the neighborhood size, this method automatically increases learning for a novel pattern that is presented later in the learning process, using a larger neighborhood to help find an appropriate map location for the pattern. In this method, γ is a constant which must be selected according to the variance exhibited within the clusters in the input data. Although this introduces an additional parameter, the benefit of enabling online learning and ongoing plasticity is significant. This method has been used in the model presented here.

Another modification to the standard SOM algorithm which we have found to provide significant improvement is the use of Luttrell's method of growing the size of a map.[21] The map is initialized to a small size and new map units are periodically added by inserting them between the existing units, and interpolating to set the values of the new units.

When learning by a SOM is successful, the map unit weight vectors are centered at statistical clusters in the input patterns, and map units representing similar clusters occupy locations that are near in map space (see Fig. 1). The update in Eq. 2 corresponds to a Hebbian learning rule, frequently described in biological neurons as "fire together, wire together." The neighborhood function h_ib, a Gaussian, parallels the center-surround (excitatory-inhibitory) structure that is omnipresent in the cortex. Feature maps produced by the SOM bear striking resemblance to cortical maps, such as the orientation-selective regions in V1, tonotopic maps in auditory cortex area AI, and somatosensory maps in SI.
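To make the update concrete, here is a minimal sketch of Eqs. 1-5 for a 1-dimensional map (the configuration used in all experiments below). It is an illustration rather than the implementation used in this work: the names (som_update, eta, gamma), the in-place update, and the small epsilon guarding against division by zero when the BMU matches exactly are our additions.

```python
import numpy as np

def som_update(W, x, eta, gamma):
    """One online SOM update with the adaptive neighborhood of Eqs. 1-5.

    W: (n_units, N) codebook of a 1-D map; x: length-N input vector;
    eta: learning rate; gamma: neighborhood constant."""
    dists = np.linalg.norm(W - x, axis=1)     # ||x(t) - w_i|| for all units
    b = int(np.argmin(dists))                 # best-matching unit (Eq. 1)
    mse = dists[b] ** 2 / x.size              # mu_b(t) (Eq. 5)
    idx = np.arange(W.shape[0])               # map-space indices I_i
    h = np.exp(-(idx - b) ** 2 / (mse * gamma ** 2 + 1e-12))  # Eq. 4
    W += eta * h[:, None] * (x - W)           # update rule (Eq. 2)
    return b
```

With a well-tuned BMU the MSE shrinks and the neighborhood collapses around b; a novel pattern inflates the MSE, and with it the neighborhood, which is exactly the online-plasticity behavior described above.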

Figure 1. An input pattern is compared to each unit in a map that has been exposed to oriented lines. The unit most similar to the input is the BMU. An activation vector may be calculated for a distributed representation of the output.

2.2. Temporal clustering


Temporal clustering (a subset of Temporal Sequence Processing, or TSP) is very similar to spatial clustering, but instead of grouping patterns only according to similarity, they are also grouped according to temporal proximity. One way to achieve this is by filtering the input with a leaky integrator, providing a recency-weighted trace of the input. This transformed input may then be fed to a spatial clustering algorithm to achieve temporal clustering. As Földiák has shown, the result is a set of invariant representations for patterns which are correlated temporally.[22] Algorithms for temporal sequence processing include ARMA, NARMA, Temporal Kohonen Map (TKM), Recurrent SOM (RSOM), and recurrent neural networks. We currently use the RSOM,[23] selected for its demonstrated performance and the elegant simplicity of using a single algorithm for both spatial and temporal clustering. In the RSOM, at each time step a recursive difference y_i(t) between the input vector x(t) and the weight vector w_i is calculated for each unit i in map space V_O, using

$$y_i(t+1) = (1-\alpha)\,y_i(t) + \alpha\,\bigl(x(t) - w_i(t)\bigr), \qquad (6)$$

where α, 0 ≤ α ≤ 1, is the time decay factor determining the responsiveness to each new input vector. When α is closer to 1, y_i(t) acts as a shorter-term memory, while an α nearer 0 produces a longer-term memory. The recursive best-matching unit b_r at time t is found by

$$\|y_{b_r}(t)\| = \min_{i \in V_O} \{\|y_i(t)\|\}, \qquad (7)$$

and the update rule in Eq. 2 is modified to use the recursive difference y_i(t) and the recursive BMU b_r, as in

$$w_i(t+1) = w_i(t) + \eta\, h_{ib_r}(t)\, y_i(t), \qquad (8)$$

where i ∈ V_O, and η and h_{ib_r}(t) are the same as in Eq. 2, except for the substitution of b_r for b. In the special case when α = 1, Eqs. 7 and 8 are equivalent to Eqs. 1 and 2, respectively, and the RSOM is identical to the SOM. Note that in the model presented here, while b_r is used for the learning update, b (from Eq. 1) is used as the output BMU from the RSOM in order to provide an undelayed response. In the brain, a recursive difference y_i(t) might be represented in each synapse by a chemical potential affecting synaptic modification. Although TSP algorithms such as the RSOM can simultaneously cluster both spatially and temporally, if all of the different input vectors are not orthogonal, considerable ambiguity can result: two very different sequences of patterns may look the same. This is very often the case. To overcome this problem, in the HQSOM spatial clustering with the SOM is performed first, and the output is passed to the RSOM for temporal clustering.

In order to achieve orthogonality, the SOM output may be formulated as an activation vector with the value representing the best-matching unit (BMU) set to 1 and all others set to 0 (an impulse function). Another alternative is to use a continuous, distributed representation such as A(t) in Eq. 10 (also see Fig. 1), which we have found results in better performance in the presence of noise (due to variations in the BMU for a given pattern), even though true orthogonality is compromised:

$$a_i(t) = \frac{1}{\|x_S(t) - w_i\|^2}, \qquad (9)$$

$$x_R(t) = A(t) = \frac{a(t)}{\|a(t)\|}, \qquad (10)$$

where x_S is the input to the SOM and x_R is the input to the RSOM. After processing this signal, the recursive difference represents the recent activity of each spatial pattern group (with respect to each weight vector). The RSOM then finds groups of such activity distributions, i.e., invariant representations of sequences. A SOM coupled with an RSOM using Eqs. 9 and 10 is referred to as a SOM-RSOM pair, and is a basic processing unit in the HQSOM.
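A SOM-RSOM pair can then be sketched by chaining Eqs. 1-10. This is a minimal sketch, not the reference implementation: the class and parameter names are ours, the defaults echo the Series 1 settings in Section 3.1, map growth is omitted, and reusing the adaptive neighborhood (Eq. 4) with the MSE of the recursive difference for the RSOM update is our assumption.

```python
import numpy as np

class SomRsomPair:
    """Spatial clustering (SOM) feeding temporal clustering (RSOM)."""

    def __init__(self, n_in, n_som, n_rsom,
                 eta=0.1, alpha=0.1, gamma_s=4.0, gamma_r=30.0):
        self.Ws = np.random.rand(n_som, n_in)    # SOM codebook
        self.Wr = np.random.rand(n_rsom, n_som)  # RSOM codebook
        self.y = np.zeros((n_rsom, n_som))       # recursive differences y_i(t)
        self.eta, self.alpha, self.gs, self.gr = eta, alpha, gamma_s, gamma_r

    def step(self, x):
        # SOM: find BMU and update weights (Eqs. 1-5)
        d = np.linalg.norm(self.Ws - x, axis=1)
        b = int(np.argmin(d))
        hs = np.exp(-(np.arange(len(d)) - b) ** 2
                    / (d[b] ** 2 / x.size * self.gs ** 2 + 1e-12))
        self.Ws += self.eta * hs[:, None] * (x - self.Ws)
        # activation vector, the RSOM input (Eqs. 9-10)
        a = 1.0 / (np.linalg.norm(self.Ws - x, axis=1) ** 2 + 1e-12)
        A = a / np.linalg.norm(a)
        # RSOM: recursive difference, recursive BMU, update (Eqs. 6-8)
        self.y = (1 - self.alpha) * self.y + self.alpha * (A - self.Wr)
        br = int(np.argmin(np.linalg.norm(self.y, axis=1)))
        hr = np.exp(-(np.arange(len(self.Wr)) - br) ** 2
                    / (np.linalg.norm(self.y[br]) ** 2 / A.size
                       * self.gr ** 2 + 1e-12))
        self.Wr += self.eta * hr[:, None] * self.y
        # undelayed output: plain (Eq. 1 style) BMU of the RSOM for A(t)
        return int(np.argmin(np.linalg.norm(self.Wr - A, axis=1)))
```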

2.3. Hierarchical quilted self-organizing map


It appears that simple cells in the visual cortex perform spatial clustering (responding to edge patterns with a specific orientation and location in their receptive field) and complex cells perform temporal clustering (responding to edge patterns with a specific orientation anywhere in their receptive field). Since during visual experience most movement in receptive fields is shifting (as opposed to rotating), this provides an explanation for how shifted edges (having the same orientation) might be associated together: because they occur near in time. Simple-complex neuron groupings are arranged topologically in V1 in positions corresponding to the locations of their receptive fields in the field of view. These feed neurons in V2, also arranged topologically, and having larger receptive fields. These V2 neurons feed V4 neurons, which, in turn, feed IT neurons, with receptive fields increasing in size until they cover most or all of the field of view (Fig. 2).

The Hierarchical Quilted Self-Organizing Map (HQSOM) is an implementation of such an architecture, using a single algorithm, the RSOM, applied en masse. The RSOM can perform either spatial or temporal clustering, since the SOM is a special case of the RSOM in which the time decay factor is set to 1.0 (such that previous signals have no effect). In the model, the SOM mimics groups of simple cells having the same receptive fields, while the RSOM mimics corresponding groups of complex cells. Similar to Neocognitron, HMAX, and the Neural Abstraction Pyramid, the HQSOM consists of layers of simple-complex cell groups (SOM-RSOM pairs). (Note that the HQSOM is very different from the Hierarchical SOM, HSOM, which consists of SOMs arranged serially, rather than in parallel, and does not involve time.) In the HQSOM, the input is parsed into overlapping receptive fields, each of which is connected to a SOM-RSOM pair in the first layer. Each layer consists of a quilt of SOM-RSOM pairs having overlapping receptive fields.

Figure 2. A simplified depiction of the hierarchical organization of neurons in the ventral stream of the visual cortex.

In each SOM-RSOM pair, the output of the SOM, formulated as an activation vector (Eq. 10), is the input to the RSOM. The output of the RSOM in each SOM-RSOM pair is combined with the outputs of the others in the layer to form a feature image which is presented to the next higher layer (Fig. 3). Rather than activation vectors, the indices of the BMUs from the RSOMs are used to represent their outputs. Although this does not seem biologically realistic, we chose to use BMUs for now due to the simplification of the algorithm. For further simplification, we use only 1-dimensional maps in order to enable a single-valued BMU index to be topologically correct. However, in addition to adding biological realism, it seems likely that passing an activation vector between layers would increase performance, since the representation would be distributed, rather than a grandmother-cell representation. The signals passed from layer to layer represent topological maps of features, with increasing size and complexity after each layer. When the hierarchy is extended to a single SOM-RSOM pair at the apex, the output of the entire hierarchy is a single integer value: a compressed invariant representation of the image presented to the lowest layer. If the output of the HQSOM is to be processed by another algorithm, such as a supervised learner, more information may be obtained by using the outputs of one or more of the lower layers as well (bypass routes).

Figure 3. Hierarchical Quilted Self-Organizing Map.

At the top of the hierarchy, sensory patterns are compressed into representations which are invariant to any transformation (e.g., shift, scale, orientation) as long as related patterns occur near in time. Although not yet tested on signals representative of other sensory domains (e.g., auditory or somatosensory), it is not difficult to envision how a spatial-temporal model like the HQSOM would be able to form invariant representations in those domains as well. For the model to learn auditory pitch invariance for a given signal, that signal must occur at various pitches near in time. Interestingly, such variation occurs naturally. The sound of a zipper, for example, varies in pitch with the speed of the zipping action. By temporal clustering, the zipper signals for all pitches may be associated with a common invariant representation. Likewise, with somatosensory signals, a given sensation is often experienced at multiple bodily locations simultaneously or near in time. For instance, as one holds a pencil, it contacts multiple parts of the hand. The resulting somatosensory invariant representation for "pencil" would enable one to identify a pencil by touching it with any part of the hand. Although we have only used input of two dimensions (at each time step), the algorithm can process input of any dimensionality, enabling color vision, for example. Given hierarchies for diverse sensory domains, sensory fusion may be achieved by joining them with an identical hierarchy at the top.

In this model, only a feed-forward process is developed. However, feedback is ubiquitous in the cortex, and due to competition for resources it is likely to have a purpose. One utility of feedback would be to predict future activity of lower layers, and bias their activity accordingly. This would have the dual benefits of handling noise and variation, as well as providing a mechanism for predicting longer sequences of events for the purpose of planning.
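The quilting itself is then mostly bookkeeping over receptive fields. Below is a sketch, under the same assumptions as the SomRsomPair above, of the two-layer topology used in Section 3.2: nine Layer 1 pairs on overlapping 3×3 fields of a 7×7 image, and one Layer 2 pair over the 3×3 feature image of Layer 1 BMU indices. Map sizes are fixed at the final sizes from Table 1; the gradual growth used in the experiments is omitted.

```python
import numpy as np

class TwoLayerHQSOM:
    """Sketch of the 7x7-input HQSOM of Section 3.2."""

    def __init__(self):
        # top-left corners of nine 3x3 receptive fields, overlapping by 1 pixel
        self.fields = [(r, c) for r in (0, 2, 4) for c in (0, 2, 4)]
        self.layer1 = [SomRsomPair(n_in=9, n_som=65, n_rsom=17)
                       for _ in self.fields]
        self.layer2 = SomRsomPair(n_in=9, n_som=513, n_rsom=17)

    def step(self, image):
        """image: 7x7 array; returns the single-integer output of the apex."""
        bmus = [pair.step(image[r:r + 3, c:c + 3].ravel())
                for pair, (r, c) in zip(self.layer1, self.fields)]
        # the grid of Layer 1 RSOM BMU indices is the Layer 2 feature image
        return self.layer2.step(np.asarray(bmus, dtype=float))
```

Because the maps are 1-dimensional, a BMU index is itself topologically meaningful, which is what allows the feature image to be passed between layers as plain numbers.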

3. RESULTS
The results of two experiments with the HQSOM using simulated data are presented. In each, the model is initialized and then exposed to a series of grayscale images. For the purposes of performance measurement, the end of each series is designated as a testing period. However, the system does not iterate through the data multiple times, since it is meant to represent an online time series. The operation of the algorithm is identical during the learning and testing periods; it is still learning, even during testing.

3.1. Shift-invariant representations of lines


The first series is simpler in order to demonstrate the basic concepts of the model; the second is more difficult. In the first, 3×3-pixel images of shifting vertical and horizontal lines are presented, separated by periods of blank background. Gaussian noise of standard deviation 0.1 is added, and each pixel is a grayscale value from 0 (shown as white) to 1 (black). The desired behavior of the model is to form three abstractions, one for horizontal lines (5, 6, 7 in Fig. 4), one for vertical lines (2, 3, 4), and one for blank images (1), by associating them temporally. Since the images are very small, and there are only a few patterns which must be temporally associated, a single layer with a single SOM-RSOM pair is used. (This correlates to a group of simple and complex cells sharing a small receptive field.) After exposure to 3000 images in this series, the weight vectors w_i of the SOM and RSOM from a typical run are shown in Fig. 4. The following parameters were used for Series 1: for the SOM, η = 0.1, α = 1.0, γ = 4.0, with growths after exposure to 100, 200, and 300 images; for the RSOM, η = 0.1, α = 0.1, γ = 30.0, with a growth after 400 images. In all experiments, 1-dimensional maps were used, initialized to a size of 2 map units and growing from n to 2n-1 units at each growth using interpolation (Luttrell's method).

Figure 4. Series 1 sample data and learned maps.

As shown, the SOM unit weights correlate directly with the statistical centers of the input image patterns, providing invariance to the noise (spatial clustering). For the RSOM, each unit is visualized by displaying the SOM units in order of the strength with which they are associated with that RSOM unit. All three RSOM units are most strongly associated with the SOM unit for blank input (1); however, RSOM unit 1 has a stronger association than units 2 and 3, and is thus the BMU for blank input. Unit 3 has the strongest connections with the three SOM units (2, 5, 8) representing horizontal lines, and thus is the BMU for horizontal input. Unit 2 has the strongest connections with the three (3, 6, 9) for vertical lines, and is the BMU for vertical input. The RSOM provides an invariant representation for vertical and horizontal lines.

While visualization provides insight, in order to quantitatively assess the performance of the model, a metric of performance is required. One method that is frequently used to assess unsupervised abstraction schemes is to couple the scheme with a supervised learner, to provide a mapping directly to the labels for the input patterns. However, if the output of the unsupervised learner is reduced to a single integer, as it is here, a mapping can be found without learning. A matrix is created with the correct label given by the row index i, the unsupervised algorithm output given by the column index j, and each value v_ij in the matrix representing the number of times that output j was given for the input corresponding to label i. The Probability of Correct Classification is then

$$\mathrm{PCC} = \frac{\sum_i \max_j (v_{ij})}{\sum_{i,j} v_{ij}}, \qquad (11)$$

which finds the simplest mapping from outputs to labels and calculates the likelihood that, using such a mapping, a given input would be matched with its correct label. For the first series, the average PCC over 25 runs (in each of which the HQSOM is reinitialized) is 100%.
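Eq. 11 is straightforward to compute from paired label/output streams; the function name and integer coding below are our assumptions.

```python
import numpy as np

def pcc(labels, outputs):
    """Probability of Correct Classification (Eq. 11).

    labels, outputs: equal-length sequences of non-negative integers
    (the true label and the single-integer HQSOM output per frame)."""
    labels, outputs = np.asarray(labels), np.asarray(outputs)
    v = np.zeros((labels.max() + 1, outputs.max() + 1))
    np.add.at(v, (labels, outputs), 1)    # v[i, j]: count of output j for label i
    return v.max(axis=1).sum() / v.sum()  # sum_i max_j(v_ij) / sum_ij v_ij

# e.g. pcc([0, 0, 1, 1], [3, 3, 5, 5]) -> 1.0
```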

3.2. Shift/scale-invariant representations of simple shapes


The second series consists of 7×7-pixel images of three different shapes (a square, a diamond, and an X), each at three different scales: large, medium, and small. The series is constructed as follows. Each shape, at each scale, is presented at each possible position in the field of view in a spiral pattern of motion, such that the position of the shape shifts by at most one pixel at each time step. For example, the first shape and scale are selected (square, large). The shape is presented at each possible position in the field of view for that scale, in a spiral going from the center out. It is then presented again at each position in a spiral back in toward the center, for a total of two presentations in each position for that scale. (Note that for the large shapes, which are 7×7, there is only one possible position.) Then the next scale is chosen for that shape (medium square), and is presented at each position (there are 9 for the medium scale) in a spiral going from the center out, and then back in. The next scale (small) is presented at each position (of 25), spiraling out and then back in. This is repeated 5 times for this shape, making a total of 350 frames for the shape. Then a blank background is presented for 100 frames. Sequentially, each of the other shapes is then presented in the same way (large/medium/small, spiraling out and in). This is repeated 150 times, resulting in a total of 202,500 frames. All frames have Gaussian noise with standard deviation 0.1.

Figure 5. Series 2 sample data and learned maps. Portions of the input sequence are shown at the top (see the text for more description). Below this are shown selected map units from a typical HQSOM after exposure to this sequence. On the left are shown map units from the Layer 1 SOM and RSOM located at the center of the 3×3 grid of SOM-RSOM pairs. (Note that the SOM is 1-dimensional; it wraps around the edge of the figure for presentation purposes.) As in Fig. 4, each unit in the RSOM is displayed as a column of SOM units in order of the strength of association with the unit (only the strongest 20 for each RSOM unit are shown). Similar maps exist (but are not shown) for the 8 other SOM-RSOM pairs. On the right are shown the most active units from the Layer 2 RSOM. These units are difficult to visualize, since each unit represents many possible input patterns. As with Layer 1, each Layer 2 RSOM unit is shown as a column of Layer 2 SOM units. Each Layer 2 SOM unit in the column is displayed as a 3×3 grid of the corresponding Layer 1 RSOM units, for each of which only the strongest Layer 1 SOM unit is actually shown. Since there are many Layer 1 SOM units that could be displayed in each position of each 3×3 grid, there is ambiguity in this down-projection and the grids look ill-formed, even when they are fine. Thus, the best way to interpret the Layer 2 RSOM units is to look for all of the grid units to have features corresponding to a particular shape, but not necessarily to present an integrated picture. The Layer 2 SOM is not explicitly shown, but sample units from it are displayed as components of the Layer 2 RSOM units.

Here, the desired behavior is to form four invariant representations, one for each of the three shapes plus the blank background, by associating all the different shifted positions at all scales for a given shape together. A two-layer HQSOM is used, with layer 1 consisting of nine SOM-RSOM pairs arranged in a 3×3 grid, each having a 3×3-pixel input receptive field overlapping the adjacent fields by 1 pixel (as in Fig. 3). Layer 2 has a single SOM-RSOM pair, with a 3×3 input receptive field, connected to the output of the nine SOM-RSOM pairs in layer 1. The average performance over 5 runs is PCC = 96.44% using the parameters shown in Table 1. Further experimentation was done to determine whether combining multiple such maps in parallel would increase performance. When 5 such maps are combined in parallel, with the combined output determined by a voting scheme, performance is increased to PCC = 99.81%. Sample map units are shown in Fig. 5.
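As an illustration of the series construction, a generator along the following lines would produce the per-scale passes described above. It is a sketch only: the function names are hypothetical, the distance-then-angle sort merely approximates the one-pixel-per-step spiral, and clipping the noisy frames to [0, 1] is our assumption.

```python
import numpy as np

def spiral_positions(n):
    """Top-left offsets of an n x n grid of positions, ordered roughly
    center-out (sorted by distance from the center, then by angle)."""
    c = (n - 1) / 2.0
    pos = [(r, col) for r in range(n) for col in range(n)]
    return sorted(pos, key=lambda p: ((p[0] - c) ** 2 + (p[1] - c) ** 2,
                                      np.arctan2(p[0] - c, p[1] - c)))

def present(shape, field=7, sigma=0.1):
    """Yield noisy field x field frames with `shape` placed at each
    possible position, spiraling out from the center and then back in."""
    s = shape.shape[0]
    order = spiral_positions(field - s + 1)
    for r, c in order + order[::-1]:              # out, then back in
        frame = np.zeros((field, field))
        frame[r:r + s, c:c + s] = shape
        noisy = frame + np.random.normal(0, sigma, frame.shape)
        yield np.clip(noisy, 0.0, 1.0)            # clipping is our assumption

# a small (3x3) shape yields 2 x 25 = 50 frames per pass, as described above
```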

Table 1. Parameters of the 2-layer HQSOM used for Series 2.

Map             η      α     γ     Growths (after frame)             Final size
Layer 1 SOMs    0.1    1.0   4.0   1000, 3000, 5000, ..., 11000      65 units
Layer 1 RSOMs   0.01   0.1   10.0  12000, 16000, 20000, 24000        17 units
Layer 2 SOM     0.1    1.0   2.0   30000, 35000, 40000, ..., 70000   513 units
Layer 2 RSOM    0.001  0.01  50.0  75000, 95000, 115000, 135000      17 units

(η = learning rate, α = time decay factor, γ = neighborhood constant.)

Note that some fine-tuning of parameter values was required to obtain the level of performance reported here. This may be an area for improvement. As a general rule, we have found that increasing the number of map units increases performance, as the representation becomes more distributed and redundant. Of course, this also increases computational expense. An algorithm that grows adaptively, such as the Growing Neural Gas, may reduce the number of parameters that must be selected and increase robustness.
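For reference, the growth step itself (Luttrell's method, as used in Sections 3.1 and 3.2) is simple for a 1-dimensional map; grow_map is a hypothetical name.

```python
import numpy as np

def grow_map(W):
    """Grow a 1-D map from n to 2n - 1 units by inserting an interpolated
    (mean) unit between each adjacent pair of existing units."""
    n = W.shape[0]
    grown = np.empty((2 * n - 1, W.shape[1]))
    grown[0::2] = W                        # existing units keep their order
    grown[1::2] = (W[:-1] + W[1:]) / 2.0   # interpolated new units
    return grown
```

Starting from 2 units, repeated growth gives sizes 3, 5, 9, 17, 33, 65, ..., consistent with the final sizes in Table 1 (17, 65, and 513 = 2^9 + 1).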

4. CONCLUSIONS
There is a great potential benefit to online unsupervised learning from unprocessed data. It is very easy to expose a system to large quantities of raw data (natural images from video, for example); if a system can learn to form high-level invariant abstractions from such data, those abstractions may be efficiently linked to appropriate labels or actions using a supervised learner, and the ability of machines to interact with the real world will be greatly increased. We have presented a hierarchical unsupervised learning system modeled after the visual cortex, with contributions furthering existing models by using learning of pattern components and exploiting temporal context in order to discover invariance. Experimentation with this model has shown success in forming invariant representations of simple shapes in a small field of view. Potential avenues for future work include reduction of the number of parameters that must be selected, using multiple sequential RSOMs in each layer to associate large, diverse clusters in a step-wise approach, using a distributed representation between layers, experimentation with natural images, and experimentation in other sensory domains.

REFERENCES
1. Brain, in Microsoft® Encarta® Online Encyclopedia, 2006. Accessed June 17, 2006.
2. V. Mountcastle, An organizing principle for cerebral function: The unit model and the distributed system, in The Mindful Brain, G. M. Edelman and V. B. Mountcastle, eds., pp. 17-49, The MIT Press, Cambridge, MA, 1982.
3. D. H. Hubel and T. N. Wiesel, Receptive fields and functional architecture of monkey striate cortex, J. Physiol. 195, pp. 215-243, March 1968.
4. J. Hegdé and D. C. Van Essen, Selectivity for complex shapes in primate visual area V2, The Journal of Neuroscience 20, p. RC61, March 2000.
5. A. Pasupathy and C. E. Connor, Shape representation in area V4: Position-specific tuning for boundary conformation, Journal of Neurophysiology 86, pp. 2505-2519, November 2001.
6. K. Tanaka, Inferotemporal response properties, in The Visual Neurosciences, L. M. Chalupa and J. S. Werner, eds., pp. 1151-1164, The MIT Press, Cambridge, MA, 2004.
7. D. H. Hubel, Single unit activity in striate cortex of unrestrained cats, J. Physiol. 147, pp. 226-238, September 1959.
8. D. H. Hubel and T. N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat's visual cortex, J. Physiol. 160, pp. 106-154, March 1962.
9. K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics 36, pp. 193-202, April 1980.
10. K. Fukushima, S. Miyake, and T. Ito, Neocognitron: A neural network model for a mechanism of visual pattern recognition, IEEE Transactions on Systems, Man, and Cybernetics 13, pp. 826-834, 1983.
11. M. Riesenhuber, How a Part of the Brain Might or Might Not Work: A New Hierarchical Model of Object Recognition. Dissertation thesis, Massachusetts Institute of Technology, May 2000.
12. T. Serre, M. Kouh, C. Cadieu, U. Knoblich, G. Kreiman, and T. Poggio, A theory of object recognition: Computations and circuits in the feedforward path of the ventral stream in primate visual cortex, AI Memo 2005-036, Massachusetts Institute of Technology, Cambridge, MA 02139, USA, December 2005.
13. S. Behnke, Hierarchical Neural Networks for Image Interpretation. Dissertation thesis, Freie Universität Berlin, 2002.
14. D. George and J. Hawkins, A hierarchical Bayesian model of invariant pattern recognition in the visual cortex, International Joint Conference on Neural Networks 3, pp. 1812-1817, 2005.
15. G. A. Carpenter and S. Grossberg, Adaptive resonance theory, in The Handbook of Brain Theory and Neural Networks, Second Edition, M. A. Arbib, ed., The MIT Press, Cambridge, MA, 2003.
16. S. M. Stringer and E. T. Rolls, Invariant object recognition in the visual system with novel views of 3D objects, Neural Computation 14, pp. 2585-2596, 2002.
17. P. Bach-y-Rita, C. C. Collins, F. A. Saunders, B. White, and L. Scadden, Vision substitution by tactile image projection, Nature 221, pp. 963-964, March 1969.
18. G. Clark, Cochlear Implants: Fundamentals and Applications, Springer, New York, 2003.
19. J. R. Newton, C. Ellsworth, T. Miyakawa, S. Tonegawa, and M. Sur, Acceleration of visually cued conditioned fear through the auditory pathway, Nature Neuroscience 7, pp. 968-973, August 2004.
20. T. Kohonen, Self-organized formation of topologically correct feature maps, Biological Cybernetics 43, pp. 59-69, January 1982.
21. S. P. Luttrell, Image compression using a multilayer neural network, Pattern Recognition Letters 10, pp. 1-7, 1989.
22. P. Földiák, Learning invariance from transformation sequences, Neural Computation 3, pp. 194-200, March 1991.
23. M. Varsta, J. Heikkonen, and J. del R. Millán, Context learning with the self-organizing map, Tech. Rep. B4, Helsinki University of Technology, P.O. Box 9400, FIN-02015 HUT, Finland, April 1997.
