Review of OS-Controlled NoC from IMEC
Jim Stevens, RC Reading Group, 01/30/2008
Today's Papers
Nollet, V.; Marescaux, T.; Verkest, D. Operating-system controlled network on chip. Proceedings of the 41st Design Automation Conference (DAC), 2004, pp. 256-259.
Nollet, V.; Marescaux, T.; Avasare, P.; Verkest, D.; Mignolet, J.-Y. Centralized run-time resource management in a network-on-chip containing reconfigurable hardware tiles. Proceedings of Design, Automation and Test in Europe (DATE), 7-11 March 2005, Vol. 1, pp. 234-239.
Abstract
Managing an NoC is challenging
The OS needs to control the NoC
Tight integration allows for efficiency
The OS can:
  Optimize communication resource usage
  Reduce interference between applications
Introduction
Future systems will consist of tiles of processing elements (PEs)
Tiles are connected by an NoC
Mapping tasks onto tiles and dynamically managing communication is extremely challenging
Goals:
  Ensure that compute power matches communication needs
  Provide the required QoS
Multiprocessor Emulation
System consists of a StrongARM processor in a Compaq iPAQ PDA, connected to an FPGA through the iPAQ expansion port
Two NoCs are built in the FPGA:
  A packet-switched 3x3 bidirectional mesh, called the data NoC
  A second network for OS control messages
Transport Layer
Blocked message count: number of received messages that were blocked in the data router while waiting for the PE input buffer to be released
Injection rate control mechanism: throttles the rate at which a PE sends messages
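A minimal sketch of how the two transport-layer hooks could work together: the OS reads the blocked message count and adjusts a sender's injection window in response. The halve-or-grow policy, the threshold, and the name `adjust_window` are assumptions for illustration, not the policy from the paper.

```python
def adjust_window(window_pct, blocked_count, threshold=10, step=5.0,
                  min_pct=1.0, max_pct=100.0):
    """Return a new send-window size as a percentage of link bandwidth.

    Hypothetical policy: halve the window when the receiving router
    reports more blocked messages than `threshold`; otherwise grow the
    window by `step` percentage points. The result is clamped to
    [min_pct, max_pct].
    """
    if blocked_count > threshold:
        new_pct = window_pct / 2.0
    else:
        new_pct = window_pct + step
    return max(min_pct, min(max_pct, new_pct))
```

Calling `adjust_window(100.0, 50)` would throttle a sender from a full window to half of it, while a quiet sampling interval lets the window recover gradually.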
Operating System
One PE is designated as the master
  Monitors the system and assigns tasks to slave PEs
Slaves contain a basic RPC-like mechanism to execute OS functions for the master
Slaves can also call back to the OS, using similar functionality, for tasks such as synchronization
Case Study
Tested the system with an MJPEG decoder
Decoder consists of four tasks running on PEs
Two tasks run on the StrongARM (tile 3)
The other two tasks are hardware blocks:
  Huffman decoder/dequantisation
  2D-IDCT and YUV to RGB converter
Added message generator/sink modules to put traffic on the network that interferes with the channel from node 7 to node 6
OS samples the cNICs every 20 ms
Decoder Communication
Played the same sequence with two different windowing techniques (window spreading and allocating continuous blocks) with no interference
Decreased the window size from 100% to ~0.02%
For window spreading, throughput of the video decoder does not decrease until the effective window is less than 2% of bandwidth; half throughput occurs at 1.5% of bandwidth
For continuous allocation, half throughput occurs at 75% of bandwidth
When interference is enabled, window spreading helps reduce jitter because communication is more evenly spread
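A minimal sketch of the difference between the two windowing techniques, assuming a slotted model in which a sender may inject in `allowed` out of every `period` time slots. The slot model and all function names are assumptions for illustration; the papers' actual window parameters are not given in these notes.

```python
def continuous_window(period, allowed):
    """Allowed slots form one continuous block at the start of the period."""
    return [slot < allowed for slot in range(period)]


def spread_window(period, allowed):
    """Allowed slots are distributed evenly over the period."""
    return [(slot * allowed) % period < allowed for slot in range(period)]


def longest_idle_run(window):
    """Length of the longest run of slots in which sending is not allowed."""
    longest = current = 0
    for may_send in window:
        current = 0 if may_send else current + 1
        longest = max(longest, current)
    return longest
```

With `period=10` and `allowed=2`, both windows grant the same share of bandwidth, but the continuous window leaves 8 idle slots in a row while the spread window never idles for more than 4. Shorter idle runs are one plausible reason spreading keeps the decoder fed at much smaller window sizes.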
DATE 05
Introduction
Same assumptions and system setup as the previous paper
This paper focuses on the task assignment heuristic and dynamic task migration
Claims to be the first paper to address run-time task migration in an NoC context
System Description
Same as before
Task mapping heuristic must find the best PE for each task
Want to reduce internal fragmentation and optimize communication paths
Heuristic Steps
Calculate requested resource load
Calculate task execution variance
Calculate task communication weight
Sort tasks according to mapping importance
Sort PEs for the most important unmapped task
Map task to best computing resource
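The steps above can be sketched as a greedy loop. This is a simplified illustration, not the heuristic from the paper: the importance score (load plus communication weight), the best-fit PE ranking, and the name `greedy_map` are assumptions, and execution variance is left out entirely.

```python
def greedy_map(tasks, pes):
    """Map each task to a PE, most important task first.

    tasks: {task_name: {"load": int, "comm": int}}  (percent of a PE)
    pes:   {pe_name: free capacity in percent}
    Returns {task_name: pe_name}, or None if some task fits on no PE.
    """
    free = dict(pes)
    mapping = {}
    # Hypothetical importance score: heavier, chattier tasks go first.
    order = sorted(tasks,
                   key=lambda t: tasks[t]["load"] + tasks[t]["comm"],
                   reverse=True)
    for task in order:
        load = tasks[task]["load"]
        # Rank PEs that can host the task; least leftover capacity first.
        candidates = sorted((pe for pe in free if free[pe] >= load),
                            key=lambda pe: (free[pe] - load, pe))
        if not candidates:
            return None
        best = candidates[0]
        mapping[task] = best
        free[best] -= load
    return mapping
```

Returning `None` when a task cannot be placed marks the point where a real implementation would fall back to backtracking.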
Backtracking
Some inputs will result in no valid mapping
Use backtracking to attempt to find another mapping:
  Undo the previous N steps, select the second-best PE instead of the best PE, then remap the remaining N-1 steps
  If that fails, try again with N+1
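A minimal sketch of the backtracking idea, assuming a best-fit ranking of PEs and a fixed task order. The ranking rule, the `max_steps` bound, and all names are assumptions for illustration; the paper's exact procedure may differ.

```python
def rank_pes(free, load):
    """PEs that can host `load`, best fit (least leftover) first."""
    return sorted((pe for pe in free if free[pe] >= load),
                  key=lambda pe: (free[pe] - load, pe))


def place_greedy(order, loads, free, start, mapping):
    """Greedily map order[start:]. Return the index of the first task
    that cannot be placed, or None if every task was placed."""
    for i in range(start, len(order)):
        ranked = rank_pes(free, loads[order[i]])
        if not ranked:
            return i
        mapping[order[i]] = ranked[0]
        free[ranked[0]] -= loads[order[i]]
    return None


def map_with_backtracking(order, loads, pes, max_steps=3):
    free, mapping = dict(pes), {}
    failed = place_greedy(order, loads, free, 0, mapping)
    if failed is None:
        return mapping
    for n in range(1, min(max_steps, failed) + 1):
        # Undo the last n placements made before the failing task.
        trial_free, trial_map = dict(free), dict(mapping)
        for task in order[failed - n:failed]:
            trial_free[trial_map.pop(task)] += loads[task]
        # Give the first undone task its second-best PE.
        pivot = order[failed - n]
        ranked = rank_pes(trial_free, loads[pivot])
        if len(ranked) < 2:
            continue
        trial_map[pivot] = ranked[1]
        trial_free[ranked[1]] -= loads[pivot]
        # Remap everything after the pivot greedily.
        if place_greedy(order, loads, trial_free, failed - n + 1, trial_map) is None:
            return trial_map
    return None
```

For example, with PEs `{"a": 10, "b": 7}` and loads `t1=5, t2=5, t3=7`, the plain greedy pass strands `t3`; undoing two steps and sending `t1` to its second-best PE yields a valid mapping.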
RH Add-ons
For reconfigurable hardware (RH), must take into account internal fragmentation of the reconfigurable area
If both the first- and second-best candidate PEs are reconfigurable, pick the one with the lowest internal fragmentation
Also consider whether a regular PE could be used for this task instead of RH
  Want to map only computationally intensive tasks to RH
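A minimal sketch of the fragmentation rule, assuming internal fragmentation is the tile area left unused once the task is placed. The area units, tile names, and function names are hypothetical.

```python
def internal_fragmentation(tile_area, task_area):
    """Reconfigurable area left unused when the task occupies the tile."""
    return tile_area - task_area


def pick_rh_tile(task_area, tiles):
    """Return the RH tile that fits the task and wastes the least area.

    tiles: {tile_name: reconfigurable area}. Returns None if nothing fits.
    """
    fitting = {name: area for name, area in tiles.items() if area >= task_area}
    if not fitting:
        return None
    return min(fitting,
               key=lambda name: (internal_fragmentation(fitting[name], task_area), name))
```

A task needing 300 units would go to a 400-unit tile rather than a 1000-unit tile, leaving the large tile available for a task that actually needs it.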
Heuristic Performance
Compared to an algorithm that explores the full solution space for a multimedia pipeline application
Defined LIGHT, MEDIUM, and HEAVY computational loads for the previous load of the platform
  If more than 50% of a PE's resources are used, the PE is considered used; otherwise it is considered free
Table 1 shows the success rate of the heuristic, for a varying number of backtracking steps, relative to searching the full mapping solution space
Demonstrates that the RH add-ons to the algorithm improve performance
Uses the hop-bandwidth product to show mapping quality
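A minimal sketch of the hop-bandwidth product as a mapping-quality metric: for every communication channel, multiply the number of hops between the two tiles by the channel's bandwidth, then sum. The row-major tile numbering on the 3x3 mesh, the Manhattan-distance hop count (minimal routing), and the task names are assumptions for illustration.

```python
def hops(src, dst, width=3):
    """Hop count between two tiles of a mesh, assuming minimal routing
    and row-major tile numbering (tile 0 in one corner)."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)


def hop_bandwidth_product(mapping, channels, width=3):
    """Sum of hops * bandwidth over all channels.

    mapping:  {task_name: tile_number}
    channels: iterable of (source_task, destination_task, bandwidth)
    """
    return sum(hops(mapping[src], mapping[dst], width) * bw
               for src, dst, bw in channels)
```

A lower product means heavily communicating tasks sit closer together, so the mapping uses fewer network resources for the same traffic.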
Mapping Success
Migration Process
OS issues a migration request
Wait until the process reaches a checkpoint
  OS does not know how long this will take
When the checkpoint is reached, the OS is signaled
OS tells other processes to stop sending to the process
  The last message sent by each process carries a tag
OS migrates the process, but does not delete the original process
OS tells other processes, including the source tile, to update their routing tables (DLT) to contain the new task location
When all tagged messages have been received at the new task location, the OS tells the other processes to start sending again and frees the original location
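The ordering of the migration steps can be sketched as a small simulation. This is a toy model under stated assumptions: the `Sender` class, the `migrate` function, and the event strings are hypothetical, and message delivery is modeled as always succeeding.

```python
class Sender:
    """A task that sends messages to the migrating task (hypothetical model)."""

    def __init__(self, name, dlt):
        self.name = name
        self.dlt = dict(dlt)   # routing table (DLT): task -> tile
        self.paused = set()    # destinations this sender must not send to

    def stop_sending(self, task):
        self.paused.add(task)
        return (self.name, "tagged")  # the last message sent carries a tag


def migrate(task, old_tile, new_tile, senders, tiles):
    """Walk through the steps that follow the checkpoint signal.

    tiles maps tile number -> set of task names.
    Returns the ordered event log.
    """
    log = ["checkpoint reached"]
    tagged = [s.stop_sending(task) for s in senders]
    log.append("senders stopped")
    tiles[new_tile].add(task)  # copy the task; the original stays for now
    log.append("task copied")
    for s in senders:
        s.dlt[task] = new_tile
    log.append("DLTs updated")
    # Assumed: every tagged message eventually reaches the new location.
    arrived = {name for name, _ in tagged}
    if arrived == {s.name for s in senders}:
        for s in senders:
            s.paused.discard(task)
        tiles[old_tile].discard(task)
        log.append("senders resumed, original freed")
    return log
```

Keeping the original task in place until all tagged messages arrive is what guarantees that no message in flight is lost during the move.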
Benchmarking Migration
Reaction time: time from the migration request until the task is ready to migrate (checkpoint or stateless point)
Freeze time: amount of time the migrating task is suspended
If free resources are available, migration can start during the reaction time for pipelines