Baseline CPU platform

A CPU platform is used as the baseline platform to run the simulation of the
sequential implementation. Therefore, devices have to allocate memory and transfer data from the
host memory to the device memory. In the meantime, GPUs have been developing dramatically to be
able to fulfill the requirements of high performance graphic applications as well as intensive
computational applications. Along with the central nervous system, there is a peripheral nervous
system which contains receptors and effectors.
question. Furthermore, various applications can be mapped on GPU.

3.2 CUDA framework

As GPU is designed to target data parallel applications such as graphical applications, parallelism
is handled by GPUs with less effort than by CPU. The simulated application depends considerably on
global memory, hence the L1 cache is only useful when it is used to reduce global memory access
time. The correctness of this index is used to determine the proper access to the external input array.
To be more detailed, an equation of voltage along a passive cable is used with the
assumption that the geometrical and electrical properties of the cable are uniform. Real-time
execution cannot be obtained on the GeForce platform. When the membrane potential reaches the
threshold value, the neuron is said to fire a spike, and is reset to. Discretisation mesh Figure 2.3.
Although the Hodgkin-Huxley model is considered the most biologically
plausible model, it has quite high complexity. Although this function is accurate on the CPU side, it
should be used together with a GPU synchronization API to ensure the robustness of time
Lahaye and Matthias Moller. Besides, the larger the amount of memory used, the longer
the memory access time. The teaching approach, visually presented by some examples and
explained in this paper, consists of three closely related elements: (a) lectures and readings on basic
concepts of architectural, urban and landscape architectural design, (b) a canon of 160 projects
illustrating these concepts and (c) a typomorphological project analysis exercise. This new, integrated
programme was the follow-up of three former, separate study programmes, Basic Concepts of
Architectural Design, Basic Concepts of Urban Design, and History of Architecture, Urbanism and
Art. Besides, the execution time on CPU also increases linearly. In Figure 5.9 and Figure 5.8, it is
observed that the two lines in logarithmic scale are parallel to each other from an input size of about
100,000 cells.

Overlapped execution

In addition to concurrent execution, CUDA also provides
overlapped execution. Three functions operate on the previous states of IO and the external
input, hence they can be executed in parallel. However, this leads to large memory consumption for
this variable. The implementation in C language is investigated to spot its critical
part which directly impacts the simulation on a computer platform. The Inferior Olive (IO) model is a
selected model to achieve the real-time simulation of a large neuron network. The supported data types
include integer and float (single-precision floating point). In the course of further expansion, in 1987
application only allows for spatial parallelism but not for temporal parallelism across multiple
iterations, which also limits parallelizing the application. There are three
In many computing applications, the above synchronizations are indeed not enough. To evaluate the
application performance with single precision on the Tesla platform, we
The simplest form of this model is a leak-free capacitance model as shown in Figure 2.3. DC current is
the input of the capacitance, which acts like a relaxation oscillator or a current-to-frequency
converter. Luckily, the architecture library, containing several thousands of books and maps, as well
as many architecture models, including chairs by Gerrit Rietveld and Le Corbusier, was saved. DUT
Racing is known for constructing lightweight Formula Student racing cars and has been competing
in the global Formula Student competition for over a decade. In the CUDA framework, texture memory
is provided with 1D and 2D management. To solve these problems and utilize GPU performance,
first the CUDA framework, introduced in 2006, and afterward the OpenCL framework were developed to blur
the difference between GPU and CPU programming and abstract the graphics processing from
programmers. The soma conductance depends on a low-threshold calcium current (ICaL), potassium
models have been well-developed so that simulation on those models is providing plenty of insight on
brain operations. This means that the computing resources in the GPU platform are fully occupied, or
the saturation point is reached. In contrast, input sizes larger than the thread block size need
explicit synchronization because the __syncthreads() API does not apply across different thread blocks.
The Fermi architecture, shown in Figure 3.3, uses the third generation of NVIDIA SM, which is more
programmable and efficient than previous architectures. The storage requirement of the
integrate-and-fire model family is the most modest as each model has only one variable. In addition to the 20
neuro-computational features reviewed above, we also consider whether the models have
biophysically meaningful and measurable parameters, and whether they can exhibit autonomous
chaotic activity. Since 2006 all buildings of the university are located outside of the historical city
center of Delft. The second one is the index of time step t, which is required to locate the variable
to update with the newly calculated data. To adapt to different platforms and targets, three variations of
the implementation are used. The parallel model which is used in GPUs is single-instruction,
multiple-thread (SIMT). Scheduling is carried out on groups of threads called warps.
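As a rough illustration of how threads are grouped for SIMT scheduling, the following Python sketch maps a thread index to its warp and counts the warps in a thread block. It assumes a warp size of 32 threads, which is the value on NVIDIA GPUs; the function names warps_per_block and warp_and_lane are hypothetical and are not taken from the implementation described here.

```python
WARP_SIZE = 32  # assumption: threads per warp on NVIDIA GPUs


def warps_per_block(threads_per_block: int) -> int:
    """Number of warps needed to hold all threads of one block (rounded up)."""
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE


def warp_and_lane(thread_idx: int) -> tuple[int, int]:
    """Return (warp index, lane within the warp) for a linear thread index."""
    if thread_idx < 0:
        raise ValueError("thread_idx must be non-negative")
    return divmod(thread_idx, WARP_SIZE)
```

A block whose size is not a multiple of 32 still occupies a whole number of warps, so the last warp is only partly filled.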
The curriculum renewal brought a fresh look at study contents, teaching approach and assessment
strategies, based on the didactic principles of integrated learning. The three compartments also
make the model even more
computationally intensive. On September 1, 1997, the 13 faculties of the TU Delft were merged into
9, to improve the management efficiency of the growing university. The execution time per time step
of the model is fast enough to be considered for real-time simulation of up to 256
cells. They figured out that there are some variables which could be replaced by constants. Any
active inhibitory input to the neuron shuts off the output, while all the active excitatory inputs x_i
are multiplied by their synaptic weights w_i and then added up. Besides, the memory access time
decreases dramatically if the data is cached efficiently. If the voltage exceeds a certain threshold θ, the
cell immediately fires an output and resets the voltage. These conditions allow the texture memory to
be used to load the neighbor dendrite voltage. As explained above, each cell is connected to eight
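The threshold behaviour described above (inhibitory inputs veto the output, excitatory inputs are weighted and summed, and the unit fires when the sum exceeds the threshold) can be sketched in a few lines of Python. This is a minimal illustration and not the model used in the simulation: the function name threshold_unit, the argument names and the strict comparison against the threshold are assumptions.

```python
from typing import Sequence


def threshold_unit(
    x: Sequence[float],          # excitatory input activities x_i
    w: Sequence[float],          # synaptic weights w_i (same length as x)
    inhibitory: Sequence[int],   # inhibitory inputs, non-zero means active
    theta: float,                # firing threshold
) -> int:
    """Return 1 if the unit fires, 0 otherwise."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    # Any active inhibitory input shuts the output off.
    if any(inhibitory):
        return 0
    # Weighted sum of the excitatory inputs.
    v = sum(wi * xi for wi, xi in zip(w, x))
    # Fire only when the summed value exceeds the threshold.
    return 1 if v > theta else 0
```

For example, threshold_unit([1, 1], [0.6, 0.6], [0], 1.0) fires, while the same call with an active inhibitory input does not.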
Besides, the application is difficult to split and map onto different platforms because of the correlation
among the application data at every iteration. At small input sizes, the difference among the
results of different thread block sizes is small. The difference increases with increasing input size.
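To make the role of the thread block size concrete, the sketch below computes how many thread blocks an input size requires and whether the work spans more than one block, in which case block-level synchronization alone cannot cover all cells. It assumes one thread per cell, which is an illustrative assumption, and the helper names blocks_needed and spans_multiple_blocks are hypothetical.

```python
def blocks_needed(n_cells: int, block_size: int) -> int:
    """Thread blocks required when each cell is handled by one thread (assumption)."""
    if n_cells <= 0 or block_size <= 0:
        raise ValueError("n_cells and block_size must be positive")
    return -(-n_cells // block_size)  # ceiling division


def spans_multiple_blocks(n_cells: int, block_size: int) -> bool:
    """True when the cells do not fit into a single thread block."""
    return blocks_needed(n_cells, block_size) > 1
```

With a block size of 1024 threads, 256 cells fit into a single block, whereas 100,000 cells need 98 blocks.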
In that case, the application performance might be improved when the cache is removed. To verify
this assumption, we measure the execution time of the application with double precision and
without L1 cache usage to compare with the same simulation with the maximum size of L1 cache
potential in this area is not electrical transfer but a chemical reaction. Another synchronization is
required at the updating stage of the dendrite voltage. On the other hand, GPU utilizes most of its
Using both techniques, the best achieved performance of double precision simulation on
It finished second.
The change to the Nuna 6 from its predecessor is the change of solar cells to monocrystalline silicon
However, the number of
registers per block prevents this. It costs more synchronization time than the implicit synchronization
because of the API execution time. The brain together with the spinal cord is called the central
nervous system. Besides, the kernel is reloaded every time step, hence the texture memory is also
reloaded at every kernel launch. However, a system that is able to perform the same functions is still a challenge
to scientists in the field. Details on these three functions, which are not related to parallelizing the
implementation, are not discussed thoroughly in this thesis.

4.2 CUDA implementation

As the
compute intensive part is located at the three loops where the calculation and update of the three
compartments are carried out, the CUDA implementation (as shown in Figure 4.8) focuses on
content for every channel. Despite their complexity, SNN models have been well-developed so that
simulation on those models is providing plenty of insight on brain operations. As explained earlier,
the external current is fed into the dendrite compartment at the beginning of every time step. Besides,
the performance is higher with the availability of a larger L1 cache. Fermi architecture implements a
memory hierarchy as shown in Figure 3.7 with a register file, an L1 cache, an L2 cache, a shared
memory and a global DRAM memory. A Project by Leeuwarden 1 Team Jorge, Qian, Melissa,
Natasa, Fulin, Alejandro, Hayagreev, Nanda. As the resistivity of the external medium is assumed to
be negligible, the circuit is connected to ground. Therefore, GeForce GT640 is still able to increase
the block size up to 1024 threads. The
cerebellum regulates the force and range of movement and is related to the learning of motor skills.
The biggest difference compared to the DUT11 is the four-wheel-drive system introduced this year.
The interconnection among cells is represented by getting the dendritic voltages of all neighbor
cells to compute the dendritic voltage. In conclusion, the Tesla C2075 platform is essential for
double precision simulation and the GeForce GT640 platform is more suitable for reducing the
execution time of single precision simulation. As the output from the computation for each time step
is required to be
The potentials of the dendrites and the soma are combined to
generate the neuron's potential. This way we can ensure a strong and consistent TU Delft brand in all
of our visual communication together. In this thesis, two GPU platforms of the two latest NVIDIA
GPU architectures are used to simulate the IO model in a network setting. The performance is
improved significantly on both platforms in comparison with that on the CPU platform. The medulla
oblongata takes over vital autonomic functions such as digestion, breathing and heart beating. To
understand the importance and behaviors of a neuron model in general, fundamental knowledge on
the human brain, central nervous system, neuron dynamics and various neuron models is introduced
in this chapter. The figure shows that speed-up is low for small input sizes and increases with
increasing input size until it reaches a saturation point beyond an input size of 50,000 cells.
In comparison with simulation on a costly server machine, a
GPU platform can offer a cheaper and simpler platform for such problems with high efficiency.
Another method of GPU time measurement is using the CUDA event record. The first step to reach
that goal is to be able to construct these neuron models in real time. This change may result in the
generation of a new pulse in an axon of another cell if the two following conditions are fulfilled: the
potential change at the axon hillock exceeds the threshold; the axon has passed the refractory period
of its preceding firing. In order to make this analytic solution more accurate, the dendritic tree is split
into small cylindrical compartments which can be considered as having an approximately uniform membrane
potential. These results encourage similar research on a more complex SNN model, for example an
SNN model in a network setting. Simulations based on those models could have a large impact on society such as the repair
of damaged parts of the human brain. Therefore, each cell is considered as always having eight
neighbors. The code in the kernel should avoid any conditional instruction so that the execution is
straightforward for all kernels and avoids any delay to read updated values among kernels. Those
additional parameters could be the concentration of free, intra-cellular calcium and slow positive
feedback currents, respectively.

Compartmental Models

In the preceding models, the spatial
extent of a neuron is not taken into consideration. This synchronization forces all the commands of the previous kernel execution to finish
before launching a new kernel. The transmitter will bind to receptors on the postsynaptic membrane
after traveling through a very small synaptic cleft. The summary of our comparison is in Fig. 2.
Throughout this section, denotes the membrane potential and denotes its derivative with respect to
time. It is capable of processing information at a very high speed, low power
