
2010

The Stream Processing Paradigm


Research Report for HIT-382


Student Name: Geoff Cruickshank
Supervisor: Rebecca England
Date Submitted: 22/10/2010

Abstract
In today's high-throughput, multiple-input-source environments, traditional methods of storing and querying static data are no longer adequate for all forms of analysis. Daily data input rates will soon reach the exabyte level (8000 million billion bits) for some businesses, and to accommodate the examination of these data streams new computing paradigms are required. One of these, stream processing, allows real-time analysis of multiple input feeds by using clusters of machines that share common resources and run software with a distributed runtime, best described as middleware. This project surveyed the platforms available to implement stream processing and examined how they perform queries on data streams, with a strong focus on the IBM InfoSphere Streams product. In investigating how it distributes work amongst processing elements, using parallel computing and continuous queries, it was found that the results returned by this paradigm proved very accurate. The project also carried out an in-depth analysis of the SPADE programming language and the available development environments, and determined that the write once, use often philosophy it employs, coupled with similarities to SQL, C++ and Java, enables programmers to create flow graphs and complex queries quickly and effectively, without the need to learn a completely new language or development environment.


Table of Contents

Abstract
Table of Contents
List of Figures
List of Tables
1. Introduction
2. Background
3. Project Aim
4. Methodology
5. Research Findings
   5.1 Theory of Operation
   5.2 Major platform variants
      5.2.1 Aurora and Borealis
      5.2.2 STREAM
      5.2.3 StreamBase
      5.2.4 InfoSphere Streams
   5.3 InfoSphere Streams architecture
   5.4 Generating inquiries on data streams
   5.5 Development Environments for InfoSphere Streams
      5.5.1 The SPADE language
      5.5.2 InfoSphere Streams Studio IDE
      5.5.3 The MARIO IDE
6. Conclusions
   Acknowledgements
7. Lessons Learned
8. References
   8.1 Bibliography
Appendix A Glossary

List of Figures
Figure 1: The Research Process (Trant & Bearman, 2010)
Figure 2: Stream processing systems timeline (Fülöp et al. 2010)
Figure 3: Cell processor architecture (Gschwind 2006)
Figure 4: Parallel stream processing using multiple cores (Thomas et al. 2009)
Figure 5: Hardware pipelining time compared to software pipelining (Gordon et al. 2006)
Figure 6: Streams services and components at runtime (InfoSphere Install Guide 2009)
Figure 7: Application Development Paradigm (Gedik 2009)
Figure 8: SPADE source code example (IBM Programming Reference 2010)
Figure 9: Exporting and importing streams (IBM Programming Reference 2010)
Figure 10: InfoSphere Streams Studio layout (Streams Studio Installation Guide, 2010)
Figure 11: MARIO component description language (Ranganathan et al. 2009)
Figure 12: MARIO Architecture and deployment capability (Ranganathan et al. 2009)

List of Tables

Table 1: SPADE relational toolkit operators (IBM module 3 2009)


1. Introduction
In the 21st century, the need for timely, accurate information and situational awareness of local and global events has become one of the pillars of modern society. Economic, social and security structures rely implicitly on knowledge of the external operating environment, and to achieve this, multiple sources and large volumes of data must be analysed and acted upon in real time. At the close of the last century these volumes and data rates were at a level where traditional methods of analysis, storing and querying information in a static database, could still be utilized. However, this common batch data paradigm, in which data is read in, allocated memory space, processed, and some result output, now introduces too great a latency, during which vital information in the continuous data flow may be missed. A new paradigm, stream processing (SP), has emerged that allows real-time analysis of multiple high-volume data flows, using a framework of processing elements that interact with each other and can be spread across a number of commodity machines, whilst logically appearing as one single runtime environment. Stream processing is considered to be part of the data mining, complex event processing (CEP) and predictive analytics (PA) fields of computer science (Fülöp et al. 2010). Instead of using static queries that return instant results, stream processing uses continuous queries (referred to herein as inquiries) that produce results over time. These results can then be used in analysis of future events, such as the likelihood of failure of some device, intrusion detection, or timely business intelligence on stock market movements. Along with these new techniques to extract the required data, stream processing will complement other new computing technologies such as cloud, grid and ubiquitous computing. It is therefore important for information technology students and professionals to keep abreast of this emerging technology and stay ahead of the curve.


2. Background
The Concise Oxford Dictionary (1984) describes the word paradigm as a pattern or example. Literature relating to stream processing almost always refers to it as a new or novel paradigm (Frossard et al. 2006). So how is stream processing a new pattern or example? Does it use processors that are a departure from the von Neumann architecture, the standard in computing since the Second World War? How can it examine and process data without storing it in a database somewhere? Has the technology evolved enough to produce reliable, predictable results? The rapid advancement of information technology in the past 15 years has also produced a plethora of programming and query languages. Will stream processing require technologists to yet again learn new syntax and semantics in order to use the technology effectively? What are the interactive development environments like and are they similar to ones currently available? This paper endeavours to resolve these questions and deliver an insight into the technology.


3. Project Aim
The purpose of this report is to give students and professionals in the information technology and computer science fields an overview of how stream processing works and the components, both logical and physical, that it uses. Instances of stream processing are already in use in a number of enterprises around the world, and with data rates and high-speed Internet access rapidly accelerating, stream processing is very likely to become commonplace. As with any new computing paradigm, the underlying structure is extremely complex, with components for load balancing, scheduling, fault tolerance and error handling. These functions are not discussed in this paper. The research focuses on four of the most integral aspects: (1) platforms and variants of stream processors, (2) architecture and data transport through the system, (3) inquiries and comparisons to traditional static queries, and (4) the interactive development environments (IDEs) and programming languages used to develop workflows. Sections 5.1 and 5.2 give an overview of the research and commercial systems presently available and how they operate. The remaining sections specifically investigate the IBM InfoSphere Streams product, which appears to have the greatest potential for market penetration due to its distributed structure and scalability. Section 5.3 describes how the components interact with each other dynamically at runtime. The section on inquiries, 5.4, examines the methods used to extract the data being searched for, and the accuracy and suitability of the resulting output in certain situations. Section 5.5 reviews the development environments used to configure and compose complex inquiries, and evaluates the ability of persons who are not familiar with the SPADE (Stream Processing Application Declarative Engine) language or the development environments to compose these inquiries and produce meaningful and relevant results.


4. Methodology
The research methodology used was one of an iterative nature, as described by Trant and Bearman (2010). They posit that research is a multi-stage iterative process involving a series of tasks, taking place in two distinct realms. The first realm is that of the information provider (where the researcher is 'getting' information) which has two phases, discovery and retrieval. The second realm is that of the User (where the researcher is "using" information), and this can be further subdivided into collation, analysis and re-presentation.

Figure 1: The Research Process (Trant & Bearman, 2010)

In order to resolve the research questions in finer detail, it was necessary to analyse the questions themselves and discover whether they were suitable for this paper. This procedure has been described as the Goldilocks approach by Clough and Nutbrown (2002), where the metaphor is used to determine whether a question is too big, lacks enough substance to be a worthy subject (too small), or is controversial (too hot). Certainly the technical details and mathematical theory relating to the hashing algorithm used for inquiries on a data stream (Xu 2007) would require several hundred pages
to fully explain how they return results; a general overview of the concept is far more appropriate for the target audience. It was found, however, that a more detailed examination was required to sufficiently answer another question, that of the precision of results returned from inquiries. Drilling down through the original proposal to reveal the essence of the question is termed the Russian Doll Principle, where the question is broken down into a smaller, well-defined focus, in the same way as the familiar child's toy reveals a tiny doll at the centre (Clough & Nutbrown 2002). In undertaking the retrieval phase, keyword and dedicated search string methods were used in the ACM, EbscoHost Academic Premiere and IEEE online databases. As the technology has only existed since late 2001, there are few published works on stream processing other than papers presented at conferences. Most of the information retrieved was less than 5 years old, and much of it was published by the research divisions of the major vendors, such as IBM. Articles of interest were also found in periodicals; however, none of these were referenced in the research findings, and they have been included in the bibliography sub-section. In conjunction, development of an abstract assisted further in defining research questions that could be extensively examined and postulated upon, which in turn helped target specific information during the discovery phase. Once all data had been collected, a review of the literature provided an opportunity to perform radical reading, where discovery of the central argument (what the author is trying to say) and to whom the author is speaking (the target audience) is undertaken (Clough & Nutbrown 2002). Information that was deemed relevant to exploring the research questions was then harvested for further classification. The collate phase involved grouping relevant data from different research papers into one of the four groups on subject cards, with each paragraph colour-coded by author and page number for easy in-text quoting in the main body of the project. With this style of collation, it was immediately apparent with further radical reading of each subject card where two or more authors had converged on the same subject and reached similar conclusions. For example, Dylan (2009) and Fülöp et al. (2010) had arrived at comparable findings in their surveys, although their research was carried out almost simultaneously on opposite sides of the globe. This greatly assisted in validating the arguments postulated, and a common thread in the authors' lines of inquiry was uncovered. The subject card collate and analysis phase also presented the opportunity to carry out gap analysis on the subject matter and re-present it in condensed form. It was found that the areas relating to programming languages and IDEs lacked substantial subject matter, as did the available platforms data. A return to the discovery phase in an effort to gather more data to close the gap in those areas lacking sufficient volumes of data represents an iteration of the research process. In addition to researching how stream processing works and the accuracy of the results, the project was designed to explore the ability of programmers to use the IDEs to compose workflows and inquiries without having an in-depth knowledge of the platforms or language.
The level of expertise required to use each development environment was carefully examined in order to determine whether the oft-stated notion of the analyst as programmer can actually be realized.


5. Research Findings
5.1 Theory of Operation

The stream paradigm is designed to accommodate situations where more data enters continuously through the input ports than the processing system can handle at once. New data is constantly arriving, even whilst current data is being evaluated. In order to keep pace with data arrival, it is critical that the amount of time taken to perform computational operations on the information is kept as low as possible (Babcock et al. 2002). To achieve this low latency, rather than having the data stream hit a single processor head on, as it were, the processors are located along the stream, like a row of houses built on the bank of a river. This topology allows multiple processors to work simultaneously on different parts of the input stream, in parallel, and provides the processing power to ingest high speed data streams. The input feeds ingested into the system can be structured, semi-structured or unstructured. Examples of each include:

Unstructured: radio astronomy data or signals of unknown types
Semi-structured: raw sensor data, intrusion detection systems and video streams
Structured: RSS news feeds, stock market ticker quotes, network packets (IBM mod.1 2009)

For the system to make sense of this data stream, it is necessary to segment it into chunks of some type for analysis. In the case of structured and sometimes semi-structured data, this may be easily achieved by utilising punctuation marks or end of file markers embedded in the stream. The Comma Separated Values (CSV) format is a well known example of structured data. This data unit, or tuple, can be defined as a sequence of named attributes, each holding data of a particular type, and therefore having the same structure or schema (IBM mod 2 2009). According to Dylan (2009), this type of data is able to be manipulated using tuple-based execution, where each individual processor can work on a single tuple or a window of multiple tuples. For unstructured data, however, segmentation must be carried out by the input system using time-based execution. As the name suggests, time-based execution occurs continuously on a pre-determined window of time, using an approach similar to the one employed for flow control of data links. A stream processing instance is not limited to using one execution style, however, and can use as many combinations of each as required to accommodate the number and type of input streams. Given that the data is fed into the system continuously, it follows that queries must also be applied continuously in order to extract the required information. This is perhaps the most radical departure from conventional query practices; rather than relatively instant answers being returned, the answers to continuous queries are returned over a period of time. In order to differentiate between normal queries and continuous ones, the term inquiry is often used. Once stream data is examined it is either processed further or discarded, meaning it is a one-shot-only type of inquiry (Babcock et al. 2002). Normal database information can, however, be introduced as input into the inquiry in most of the platforms presently available. Stream processing system vendors have in most cases extended the Structured Query Language (SQL) by incorporating operators and functions to construct inquiries, thus making it easier for developers and operators to transition into the stream processing environment.
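To make the two execution styles concrete, the following short sketch (written in Java purely for illustration; it is not SPADE or any IBM API, and all names are invented) shows a tumbling window that closes after a fixed tuple count alongside one that closes after a fixed period of time.

    import java.util.ArrayList;
    import java.util.List;

    public class WindowSketch {

        /** Tuple-based execution: emit a tumbling window every `size` tuples. */
        static class CountWindow {
            private final int size;
            private final List<Double> buffer = new ArrayList<>();

            CountWindow(int size) { this.size = size; }

            /** Returns the completed window, or null while it is still filling. */
            List<Double> offer(double value) {
                buffer.add(value);
                if (buffer.size() < size) return null;
                List<Double> window = new ArrayList<>(buffer);
                buffer.clear();                          // flush and start a new window
                return window;
            }
        }

        /** Time-based execution: emit whatever arrived within each time slice. */
        static class TimeWindow {
            private final long periodMillis;
            private long windowStart = System.currentTimeMillis();
            private final List<Double> buffer = new ArrayList<>();

            TimeWindow(long periodMillis) { this.periodMillis = periodMillis; }

            List<Double> offer(double value, long arrivalMillis) {
                List<Double> window = null;
                if (arrivalMillis - windowStart >= periodMillis) {
                    window = new ArrayList<>(buffer);    // close the elapsed window
                    buffer.clear();
                    windowStart = arrivalMillis;
                }
                buffer.add(value);
                return window;
            }
        }

        public static void main(String[] args) {
            CountWindow byCount = new CountWindow(3);
            for (double v : new double[]{1, 2, 3, 4, 5, 6}) {
                List<Double> w = byCount.offer(v);
                if (w != null) System.out.println("count window: " + w);
            }

            TimeWindow byTime = new TimeWindow(1000);
            long start = System.currentTimeMillis();
            long[] arrivals = {0, 200, 400, 1100, 1300, 2500}; // simulated arrival offsets (ms)
            for (int i = 0; i < arrivals.length; i++) {
                List<Double> w = byTime.offer(i, start + arrivals[i]);
                if (w != null) System.out.println("time window: " + w);
            }
        }
    }

Either style can be mixed within one instance, as the text above notes; the only difference is the condition that closes a window.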


5.2 Major platform variants

The following diagram gives a timeline of stream processing platform developments since mid-2001. Notable inclusions are Aurora / Borealis, STREAM, StreamBase and IBM InfoSphere Streams (System S), as they provide an insight into the evolution of the paradigm. A brief overview of these systems is given in the following pages.

Figure 2: Stream processing systems timeline (Fülöp et al. 2010)

5.2.1 Aurora and Borealis

The Aurora software was freeware developed by Brandeis University, Brown University and the Massachusetts Institute of Technology (MIT) between 2003 and 2006 (Fülöp et al. 2010). Babcock and others (2002) describe the core of the Aurora system as a series of software boxes located on a single machine that contains seven well defined inquiry operators and fires triggers every time some condition is met in the incoming stream. It can handle large volumes of asynchronous data, implementing a windowing approach and timestamping every input tuple.
The system also uses a programming interface with arrows and boxes to construct its dataflow graph. The extended SQL used as the inquiry language is a stream query algebra, also known as SQUAL (Dylan, 2009). Borealis is the second generation of Aurora, with its runtime distributed across multiple machines. The middleware used to achieve this also has a scheduler, storage manager and load shedder. It has a free software licence and has been used for applications in network monitoring, large plant facility maintenance and surveillance systems (Fülöp et al. 2010).

5.2.2 STREAM

Whilst Aurora was being developed at MIT, Stanford University was busy developing its own version of a stream processing system, calling the end product STREAM and distributing it as freeware. According to Dylan (2009), STREAM uses a single centralized data stream management system (DSMS) for its runtime environment. It targets environments where data stream rates may be very high and inquiry loads may increase or decrease significantly over time. The system uses inquiries that are issued declaratively using the Continuous Query Language (CQL, a derivative of SQL), which is then compiled to produce an inquiry plan to apply to the stream. CQL also has built-in operators for organising input and output streams, and has the ability to connect inquiry plans together to form aggregated plans. Tuples are timestamped upon inspection, and the DSMS can integrate traditional datasets into inquiries. Load shedding, tuple sampling and dropping can also be accomplished when the data arrival rate is high. STREAM development ceased in 2006, although the free BSD licence is still available. This type of system is suited to streams where a particular event triggers some definitive output, like a network intrusion detection system or closed circuit television (CCTV) network.

5.2.3 StreamBase

StreamBase is commercially licensed stream processing software developed by StreamBase Systems. The company was founded in 2003 by Dr. Mike Stonebraker, who was one of the lead professors involved in the original Aurora project in 2001. It presently has a relatively large customer base, mainly amongst global financial institutions and government / military installations (StreamBase, 2010). StreamBase is installed on a single machine but utilizes a synchronized secondary machine as a backup with automatic failover. It is a tuple-based system, using a windowing approach, and has the ability to aggregate several tuple windows at once into inquiries (Dylan 2009, p.5). In order to compose inquiries with the application, StreamBase Systems developed StreamSQL, which has a similar syntax to the SQUAL language used by Aurora. It provides a graphical event programming interface with several operators that allow data streams and relational databases to be processed at the same time in a uniform way. Connectivity to a wide variety of existing databases and interfaces has contributed tremendously to its commercial success in the telecommunications industry for network monitoring and law enforcement fraud detection operations (Fülöp et al. 2010).

5.2.4 InfoSphere Streams

The previous platforms described (with the exception of Borealis) each had the stream processing application installed on one large machine, with possibly a secondary machine for backup. The InfoSphere Streams system is a departure from this architecture in that it has a distributed runtime.
In this scenario, many individual machines can be connected together logically, using an application layer described as middleware (that is, software sitting between the operating system and the applications) to form one large runtime instance. The benefits of this are obvious: many multi-core processing machines can be linked and interact with each other, sharing resources and providing high scalability and availability by automatically bypassing machines that have failed or are in error.


The IBM Corporation began working on a stream processing paradigm in 2003, under the research name System S. Most of the development occurred at the T.J. Watson Research Center in New York, and System S was commercialized under the product name InfoSphere Streams in 2006 (Gedik 2009). The remainder of this paper examines the InfoSphere Streams system in greater detail, as its distributed runtime and internal communications structure amongst streaming objects allow it to ingest and analyse volumes of data at speeds that would have been unimaginable only a few years ago.


5.3 InfoSphere Streams architecture

As described previously, IBM InfoSphere Streams (referred to herein as Streams) is distributed middleware whose components provide services to enable simultaneous execution of multiple processing applications on a cluster of machines. The goal of Streams is to support stream analytic and data mining applications comprising hundreds or thousands of processing elements (PEs) working co-operatively on numerous source streams (Wu et al. 2007). These PEs are the workhorses of the Streams system; they are the physical deployable units and are generated by defining and then compiling logical objects in the IDE. There are three basic types of these objects:

1. Operators, of which several variations exist, that ingest data through one or more input ports, perform some form of computation, and then write the results to one or more output ports.
2. Sources, into which the raw data feeds (physical sensors, RSS feeds or other web sources) arrive for transformation and annotation into streams (one output only).
3. Sinks, which ingest streams (usually one) and export processed data out of the Streams system into databases, GUIs and other external systems (IBM mod 1 2009).

In order to move information between sources, operators and sinks, the data itself becomes an object. These are termed Streaming Data Objects (SDOs), or simply streams. The operators change the format of the stream by implementing algorithms to filter, parse or classify data. Defining what the operators actually do to the stream is how developers and analysts construct the desired inquiries. The complete collection of sources, operators and sinks connected by streams is called a processing or flow graph, and is composed in the IDE using the SPADE programming language. This graph represents a logical, deployable job, and when compiled it is mapped onto physical execution units, the PEs. An execution unit could be a machine (node) with a single CPU, or, in the case of multi-socket servers, several PEs can exist on one node. A PE cannot, however, be spread across multiple nodes (Gedik 2009). The PEs in an instance of Streams use publish and subscribe messaging via a dedicated management host node. For example, a PE will subscribe to the stream format its input can consume, and then publish the format of the stream it produces (Schulte et al. 2009). This feature allows extreme flexibility: PEs can consume new input streams automatically when they appear in the environment, or they can go idle if no streams matching their subscription are available. All stream format types used are defined in a single XML file, which is shared by all PEs in a Streams runtime. PEs can also be situated in parallel; that is, one stream from a PE can be spread among the input ports of many other PEs (Wu et al. 2007). According to Babcock and others (2002), the key to enabling streams to be processed on the fly is a very low computational time per stream element. New data is constantly arriving even as old data is being processed, and the algorithm must be able to keep pace with the input stream. A traditional approach to increasing the computational performance of a system involves partitioning the work across multiple entities and executing them in parallel. This has become even more achievable with the advent of multicore processors (Thomas et al. 2009).
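As a concrete, if greatly simplified, illustration of the publish and subscribe wiring described above, the following sketch (plain Java, not the Streams API; the schema and attribute names are invented) shows processing elements advertising the stream schema they produce, subscribing to the schema they consume, and a broker standing in for the management host that connects matching pairs.

    import java.util.*;

    public class PubSubSketch {

        /** A tuple is just a bag of named attribute values for this sketch. */
        record Tuple(Map<String, Object> attributes) {}

        /** Each PE declares the schema it consumes and the schema it produces. */
        interface ProcessingElement {
            String subscribesTo();
            String publishes();
            Tuple process(Tuple in);          // return null to drop the tuple
        }

        /** Stand-in for the management host's name service / broker. */
        static class Broker {
            private final Map<String, List<ProcessingElement>> subscribers = new HashMap<>();

            void register(ProcessingElement pe) {
                subscribers.computeIfAbsent(pe.subscribesTo(), k -> new ArrayList<>()).add(pe);
            }

            /** Deliver a tuple published on a schema to every matching subscriber. */
            void publish(String schema, Tuple t) {
                for (ProcessingElement pe : subscribers.getOrDefault(schema, List.of())) {
                    Tuple out = pe.process(t);
                    if (out != null) publish(pe.publishes(), out);
                }
            }
        }

        public static void main(String[] args) {
            Broker broker = new Broker();

            // Filter-style operator: subscribes to "quotes", publishes "bigTrades".
            broker.register(new ProcessingElement() {
                public String subscribesTo() { return "quotes"; }
                public String publishes()    { return "bigTrades"; }
                public Tuple process(Tuple in) {
                    double volume = (double) in.attributes().get("volume");
                    return volume > 1000 ? in : null;     // drop small trades
                }
            });

            // Sink-style operator: consumes "bigTrades" and writes them out.
            broker.register(new ProcessingElement() {
                public String subscribesTo() { return "bigTrades"; }
                public String publishes()    { return "none"; }
                public Tuple process(Tuple in) {
                    System.out.println("large trade: " + in.attributes());
                    return null;
                }
            });

            broker.publish("quotes",
                    new Tuple(Map.of("symbol", "IBM", "volume", 2500.0)));
        }
    }

A new PE registered later with subscribesTo() returning "quotes" would start receiving tuples automatically, which is the flexibility the text attributes to schema-based subscription.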
Whilst Streams will run on normal machines with x86-based processors, for higher performance stream computing the system can exploit specialized processors such as field programmable gate array (FPGA) servers or the Cell Broadband Engine processor (Cell), which was originally designed by a consortium of Sony, Toshiba and IBM for the PlayStation 3 platform (Williams et al. 2005). The Cell has nine cores, one being the power processing element (PPE, or master) and the other eight being synergistic processing units (SPUs, or slaves). Each core is equipped with a direct main memory access engine and its own local memory.
A special high-bandwidth circular data bus, termed the element interconnect bus (EIB), connects the processors. The EIB has four 16-byte single-direction channels that counter-rotate in pairs, and each SPU has its own memory, allowing it to load and store instructions locally. The PPE is responsible for managing this local memory, loading and clearing it to main memory, leaving the SPUs to get on with the job of examining the stream (Williams et al. 2005).

Figure 3: Cell processor architecture (Gschwind 2006)

Figure 4 illustrates the concept; here the source component of a PE operates on a typical input stream. The local sketch is like a clearinghouse area of memory where the local results of the inquiry are stored, and the point query represents the area in main memory where all SPU results are combined. The disjoint blocks are units of selected data, called windows, and will be discussed in section 5.4.
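The partitioning idea behind Figure 4 can be sketched in a few lines of ordinary Java (an illustration only, not Cell or Streams code; all names are invented): each worker keeps a local partial result for its own disjoint block, the equivalent of the local sketch, and a combining step plays the role of the point query.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.stream.IntStream;

    public class ParallelAggregateSketch {
        public static void main(String[] args) throws Exception {
            int[] data = IntStream.rangeClosed(1, 1_000_000).toArray();
            int workers = 8;                                // e.g. one task per core
            int block = data.length / workers;

            ExecutorService pool = Executors.newFixedThreadPool(workers);
            List<Future<Long>> partials = new ArrayList<>();

            for (int w = 0; w < workers; w++) {
                final int from = w * block;
                final int to = (w == workers - 1) ? data.length : from + block;
                // Each worker sums its own disjoint block into a local partial result.
                partials.add(pool.submit(() -> {
                    long local = 0;
                    for (int i = from; i < to; i++) local += data[i];
                    return local;
                }));
            }

            long total = 0;                                 // combine the partial results
            for (Future<Long> f : partials) total += f.get();
            pool.shutdown();

            System.out.println("sum = " + total);           // expected: 500000500000
        }
    }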

Figure 4: Parallel stream processing using multiple cores (Thomas et al. 2009)

The mapping of producer to consumer PEs in the processing graph is a technique called pipelined parallelism, as the stream is piped between the PEs (Gordon et al. 2006). As Streams has a distributed runtime, the middleware is able to pipe streams between PEs even if they are not located on the same physical server. Using software rather than hardware to pipe streams also decreases processing time dramatically, as displayed in Figure 5.

Figure 5: Hardware pipelining time compared to software pipelining (Gordon et al. 2006)

The system on the left does not use a distributed runtime, so each iteration must be synchronized at a point before execution can continue. An analogy is a hurdles race with the execution threads as runners; each thread can run at its own speed, but they must wait until all other threads have reached the same hurdle and then jump together. The synchronization of independent runtimes results in significant time inefficiencies when compared to distributed ones, where one or more hosts provide system control over the entire instance.
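The contrast can be illustrated with a minimal software pipelining sketch (again plain Java, not Streams code): two stages run concurrently and are decoupled by a queue, so neither waits at a global synchronisation point in the way the hurdles analogy describes.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class PipelineSketch {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Integer> pipe = new ArrayBlockingQueue<>(1024);

            // Upstream stage: produces tuples at its own pace.
            Thread producer = new Thread(() -> {
                try {
                    for (int i = 0; i < 10; i++) pipe.put(i);
                    pipe.put(-1);                            // end-of-stream marker
                } catch (InterruptedException ignored) { }
            });

            // Downstream stage: consumes whenever data is available, no global barrier.
            Thread consumer = new Thread(() -> {
                try {
                    int v;
                    while ((v = pipe.take()) != -1) {
                        System.out.println("processed " + (v * v));
                    }
                } catch (InterruptedException ignored) { }
            });

            producer.start();
            consumer.start();
            producer.join();
            consumer.join();
        }
    }

In Streams the queue-like role is played by the middleware transporting SDOs between PEs, whether or not they share a physical host.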


Figure 6: Streams services and components at runtime (InfoSphere Install Guide 2009)

There are three types of hosts in any Streams installation. The first type, the application hosts, have one or more PEs in separate execution containers (PECs) that provide runtime access to the Streams middleware. The PEC also supplies a measure of security, preventing user-written code in the PE from corrupting the middleware or other PEs. At the same time, other PEs cannot access the code in the PEC (unless stipulated in the process graph), which is useful in instances where sensitive data occurs and security is required. Each application machine also has a host controller service (HC) that is responsible for starting, stopping and monitoring PEs (Wu et al. 2007). The second type of host, the management hosts, contain the services required to operate the Streams installation. These services include:

Streams application manager: handles job management tasks, including job submission and cancellation, and can interact with host controllers to deploy or cancel PEs
Streams resource manager: monitors all services and collects metrics on hosts and components
Scheduler: decides job placement in the distributed system infrastructure
Authentication and authorization service
Streams recovery database: records the state of services and allows recovery of services after failure
Name service: stores service references for all components, and is used for communication between components
Streams web service: provides web based access to services (InfoSphere Install Guide 2009)

The third type of host is called a mixed host, as it can contain both management and application services. The various management and application hosts can be located anywhere in the distributed system, as long as the data switching fabric (network devices) allows. Streams can be loaded onto each machine separately in a cluster, or it can be loaded into a shared directory accessible by all hosts. Loading the middleware on each host can improve system performance and prevent network congestion on start-up (InfoSphere Install Guide 2009).


5.4 Generating inquiries on data streams

The InfoSphere Streams platform represents an example of a complex event processing (CEP) system. An event is described as "anything that happens, or is contemplated as happening" (Fülöp et al. 2010, p.6). In the context of stream processing, events can occur in the real world, the virtual world or the future. They can be a stream of physical sensor input values, tuples of data, or storm prediction warnings generated from forecasting models, for example. A complex event is an amalgamation of a series of simple events; the landing of an aeroplane, for instance, comprises simple events such as the pilot setting the flaps, lowering the wheels and so on. CEP systems typically examine every event in a stream using event processing agents, compare the event values with pre-defined values stored in memory, and perform some action if a match occurs (Fülöp et al. 2010). The agents in the case of Streams are the PEs, and the pre-defined values are determined by the operators in the source code. Other events can be generated when patterns of events are observed; these are called derived and composite events. CEP systems can synthesize new streams by enriching them through comparison with historical data stored in a database or by injecting events from other streams (Schulte et al. 2009). For the system to be able to inspect every event or tuple in a stream, some mechanism must exist to allocate the events to each core in the PE. Streams uses a windowing technique to define the set of tuples to operate on, which can take the form of either:

1. a tumbling window, determined by time or number of tuples. Once the window condition is reached, the window is flushed and a new one started.
2. a sliding window, where the tuples are processed for a fixed amount of time and purged afterward.

The boundaries of the windows can be determined by tuple count, time elapsed or some attribute in the stream. If the window boundaries of a stream are unknown, the system allows punctuation markers to accompany the stream out of channel (i.e. not in the stream) to enable downstream PEs to operate on it (IBM mod 3 2009). Inquiries on a data stream have many similarities with queries on a relational database, yet with two important distinctions:

One-time versus continuous queries
Pre-defined versus ad-hoc queries

In the first distinction, one-time queries are those that are evaluated once, like a snapshot of the data set they are applied against, and return an answer immediately. These include the queries used in traditional relational databases. Continuous inquiries, however, are evaluated perpetually for as long as the data stream persists. The results of a continuous inquiry are produced over time, and reflect the data seen so far. The second distinction, between pre-defined and ad-hoc queries, is more subtle. Pre-defined queries are created before any relevant data has been ingested; ad-hoc queries are issued online, after data ingestion has begun. Pre-defined queries are usually continuous ones; ad-hoc queries, on the other hand, can be either one-time or continuous. Ad-hoc queries can cause complications within stream processing systems, as they cannot be easily optimised and may reference data that has already been discarded from the stream (Babcock et al. 2002). It follows that in order to compose inquiries for data stream systems the user must know in advance what the stream data types and attributes are, and should also be prepared for inquiry results that change over time as the stream tuple values change.
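The first distinction can be made concrete with a short sketch (illustrative Java, not SPADE; all names are invented): a one-time query computes its answer once against a stored snapshot, whereas a continuous inquiry maintains a running answer that reflects every tuple seen so far and can be read at any time.

    import java.util.List;

    public class ContinuousQuerySketch {

        /** One-time query: evaluated once against a stored snapshot. */
        static double averageOf(List<Double> storedTable) {
            return storedTable.stream()
                              .mapToDouble(Double::doubleValue)
                              .average()
                              .orElse(0.0);
        }

        /** Continuous inquiry: the result reflects every tuple seen so far. */
        static class RunningAverage {
            private long count;
            private double sum;

            void onTuple(double value) {     // called once per arriving tuple
                count++;
                sum += value;
            }

            double result() {
                return count == 0 ? 0.0 : sum / count;
            }
        }

        public static void main(String[] args) {
            // One-shot answer over a finite, stored data set.
            System.out.println("one-time: " + averageOf(List.of(10.0, 20.0, 30.0)));

            // Continuous answer, updated as each tuple arrives and then discarded.
            RunningAverage continuous = new RunningAverage();
            for (double v : new double[]{10, 20, 30, 40}) {
                continuous.onTuple(v);
                System.out.println("so far: " + continuous.result());
            }
        }
    }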


According to Fülöp et al. (2010), there are three main styles of query language for CEP systems:

data stream query languages: stream events are converted to database relations, operated on by SQL-type queries and then returned to the data stream
production rules: specify the actions to be taken when certain events are observed in the stream
composition operators: complex operations are carried out by a series of standard or user defined operators (joins, splits etc.)

The first style does not lend itself well to instances where data ingest rates are high; converting stream tuples to relations, operating on them and re-converting to a stream has obvious shortcomings in processing latency. The second style is rather simplistic in its approach. It is essentially a set of triggers that fire when certain events are encountered, and has limited usefulness within a complex stream environment. The third style, however, affords flexibility and adaptability in composing inquiries, allowing the user to describe exactly what is required. Streams utilizes this composition operator method by way of the SPADE programming language. IBM developed SPADE around the concept of a write once, use many style of programming, similar to Java. The operators used to compose inquiries can be re-used and shared amongst programmers and analysts, thus vastly reducing development time. Collections of operators are called toolkits, and SPADE comes with a wide range of type-generic built-in operators in its stream relational toolkit, which can manipulate tuples in a similar way to SQL. Also included is an adapter toolkit, which allows connection to static databases using ODBC and other protocols to import and export data (IBM module 2 2009). A user-defined built-in operator, or UBOP, is one that has been written by a programmer for a specific task and is able to be exported and used by any install of Streams. When an operator needs to be built for a one-off task, a user-defined operator, or UDOP, can be developed quickly to satisfy the job requirements. The main difference between the two is that UBOPs are usually created from a template and must include error checking code so they can be shared with other developers. UDOPs are tightly bound to the SPADE workflow they are associated with, and wrap legacy libraries to allow customisation. A user-defined operator written in Java rather than C++ is called a JDOP. User-defined operators are useful where data conversion needs to be undertaken or for integration with legacy systems. Regardless of the type, operators share a common framework in their base code format (IBM Programming Model 2010). The relational toolkit, as its name implies, contains operators that anyone familiar with SQL-type languages would understand. The following table lists these operators with a brief description of their capabilities. Source and Sink (edge) operators are not included, as their functions have been described in section 5.3.

Table 1: SPADE relational toolkit operators (IBM module 3 2009)

Functor: Used to perform tuple-based manipulations such as filtering, projection, mapping, and attribute creation and transformation, similar to the SQL SELECT statement.
Split: Splits a stream into multiple outputs, based on a condition that determines where the streams are exported. Can also incorporate for loops.
Aggregator: Used for grouping and summarizing incoming tuples, similar to the SQL GROUP BY statement. Summarizing features include Min, Max, Avg, Sum and Count.
Delay: Artificially slows down streams; used in system start-up operations.
Join: Correlates two streams, determined by an expression (predicate) or a windowing configuration. Also capable of left and right outer joins.
Sort: Uses algorithms to impose order on incoming streams.
Punctor: Inserts punctuation marks into a stream. Can refer to attributes from previous tuples.
Bundle: Multiplexes tuples from several streams into one output stream. For this to occur, input streams must have the same data types and structures.
Barrier: Used as a synchronization point. It consumes tuples from multiple streams, and only outputs tuples when one has arrived from each stream. This allows parallel operations on tuples.

Using these and other operators, it is therefore possible to create inquiries that return results which are both fine-grained and very accurate. This is because, unlike predictive analytics systems, which sample a data stream for a given amount of time and make predictions against the dataset, InfoSphere Streams examines every tuple for a match against the operator logic. As long as the operators are configured correctly and the input data streams remain uncorrupted, the system will continuously return the desired results. The combination of operators from the source inputs to the sink outputs is known as a flow graph, and constitutes a deployable job on the Streams instance. Users and developers of Streams construct these flow graphs using one or more IDEs, utilizing the toolkits of operators or building their own. An examination of how this is accomplished is undertaken in the next section.

Figure 7: Application Development Paradigm (Gedik 2009)
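Before turning to the development environments, the composition idea can be illustrated with a small sketch (ordinary Java, not SPADE; the operator and variable names are invented): a Functor-like filter and an Aggregator-like windowed count are chained into a sink, in the spirit of the write once, use often approach described above, with each operator usable in any number of graphs.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;
    import java.util.function.Predicate;

    public class FlowGraphSketch {

        /** Functor-like operator: forwards only the tuples matching a predicate. */
        static <T> Consumer<T> functor(Predicate<T> keep, Consumer<T> downstream) {
            return t -> { if (keep.test(t)) downstream.accept(t); };
        }

        /** Aggregator-like operator: emits a count once every `windowSize` tuples. */
        static <T> Consumer<T> aggregator(int windowSize, Consumer<Integer> downstream) {
            List<T> window = new ArrayList<>();
            return t -> {
                window.add(t);
                if (window.size() == windowSize) {
                    downstream.accept(window.size());
                    window.clear();
                }
            };
        }

        public static void main(String[] args) {
            // Wire the graph back to front: sink <- aggregator <- functor.
            Consumer<Integer> sink = count -> System.out.println("window count: " + count);
            Consumer<Double> aggregate = aggregator(2, sink);
            Consumer<Double> graph = functor(price -> price > 100.0, aggregate);

            // The source feeds tuples into the graph; only prices above 100 are counted.
            for (double price : new double[]{99.0, 150.0, 101.0, 50.0, 200.0, 120.0}) {
                graph.accept(price);
            }
        }
    }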


5.5 Development Environments for InfoSphere Streams

According to Nasgaard et al., "stream processing applications are more complex than most programs, due to the use of massive parallelism" (2009, p.312). It therefore follows that the tools used to create stream processing applications must also have an increased level of complexity and incorporate features not normally found in typical development interfaces. In an effort to reduce this complexity, the SPADE language reference model provides commonality with other programming languages such as Java, Perl, C and C++ by having built-in application programming interfaces (APIs), header libraries, pre-processor functions and similar source code structures. As previously stated, SPADE stands for Stream Processing Application Declarative Engine. The stream processing application part is fairly self-explanatory, but what about the Declarative and Engine components? In the context of InfoSphere Streams, the program describes (declares) what an output stream contains, rather than the actual algorithm to compute the output. The SPADE compiler (the engine) generates the code that runs on the underlying hardware (IBM Mod 2 2009). Developers can compose applications either by writing the SPADE source code directly in a text editor (vi, Emacs etc.) or by using one of two IDEs: the Eclipse-based Streams Studio or MARIO, the Mashup Automation with Runtime Invocation and Orchestration environment. Which system is chosen to develop an application is usually determined by the user's technical abilities, and each one has been designed to cater for specific areas of user expertise. The following sub-section gives an overview of the SPADE language, followed by brief descriptions of how Streams Studio and MARIO can be used to compose inquiries and flow graphs.

5.5.1 The SPADE language

There are six main sections to a SPADE program's source code. These are:

1. Application meta information: provides information on the program itself. It usually identifies the program name and can optionally set the debug / trace levels required at runtime. It uses the [Application] identifier.
2. Type definitions [Typedefs]: an optional section that can be used to create aliases for the types that are to be used in the program.
3. External libraries: another optional section, used to create references to libraries and their file paths, as well as header files and interfaces used by user-defined operators (UDOPs). This section uses the [Libdefs] identifier.
4. Node pools: allows definition of the hosts available (the pool) to be used by Streams. Also optional, the creation of node pools gives the designer the ability to explicitly control host assignments if required. It is identified by the [Nodepools] designator.
5. Program body: this is where the application is described, with the streams taking the form of objects. The flow of the application, with its operators and streams, is defined here, and is identified by the [Program] title.
6. Function debugging [FunctionDebug]: an optional section where developers can test functions without having to run an actual application.

The basic form of communication between applications in SPADE is the stream object. The stream is the continuous sequence of tuples, and the structures of the tuples and the stream itself are called the stream and tuple schemas (IBM Programming Reference 2010). They describe what data types and literals can be expected in the stream or tuple, and, as shown in Figure 8, are declared in the program section.
SPADE supports a huge array of data types, including (but not limited to) the same basic types as C++ and Java (int, double, boolean, String etc.), lists and matrix types, typedefs and literals (both numeric and Unicode strings).
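To illustrate what a tuple schema amounts to, the following fragment (plain Java, purely illustrative; the attribute names are invented and this is not SPADE syntax) models a stream whose tuples all share one sequence of named, typed attributes.

    import java.util.List;

    public class SchemaSketch {

        /** One named, typed attribute layout shared by every tuple in the stream. */
        record TradeTuple(String symbol, int volume, double price, List<String> exchanges) {}

        public static void main(String[] args) {
            TradeTuple t = new TradeTuple("IBM", 500, 142.75, List.of("NYSE"));
            System.out.println(t.symbol() + " traded " + t.volume() + " at " + t.price());
        }
    }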

Figure 8: SPADE source code example (IBM Programming Reference 2010)

The input stream in SPADE is defined with the source() function, and the output stream uses the sink() function. It is between these two functions that the tuple and stream manipulations of the intended inquiry take place, using combinations of the operators described in section 5.4 as well as the many other in-built functions available (note: the := operator in SPADE is equivalent to the = operator in C / Java). The example in Figure 8 does not perform any calculations on the input stream; it simply exports it in CSV format to a file named ResultSink.dat. Many applications running on InfoSphere Streams could produce and consume streams with the same schema. For programming convenience, SPADE allows the programmer to declare and name a virtual stream (vstream) at the beginning of the program section, and then reference it later with the keyword schemaFor. This feature is analogous to declaring an abstract class in the Java language. Figure 8 illustrates the concept. Giving each stream running on the system a unique name allows the import and export of streams between different applications. These names are registered with the name server component of the host management services at application instantiation. By including the keywords import and export in the program section, applications can consume and produce the desired streams. As shown in Figure 9, the keyword tapping must be used when importing streams, as this describes the subscription criteria or the application producing the stream. SPADE also has many in-built features to assist in efficient code and algorithm construction. These include helper functions, tools that assist in ensuring operators conform to
the correct syntax (i.e. that they have the right number of input and output ports), and the spadec compiler. Spadec has a large assortment of command-line switches and supports incremental compilation of source code. This feature evaluates the changes made since previous versions of the code and recompiles only those sections, greatly reducing development time. SPADE applications are usually stored in a plain-text file with a .dps extension. Applications developed with a mixture of SPADE and Perl scripts are called mixed-mode and use the .dmm extension, while a set of both types of applications uses the .dpsm extension. The compiler can be invoked from the shell prompt by simply calling it with the -f option, along with the application file name and extension. When invoked, the compiler also creates the necessary sub-directories and files where the source file is located (IBM Programming Reference 2010). The compiler takes the application created by the developer (the declaration) and generates the appropriate machine code to achieve the desired outcome (the engine). The resulting low-level code is therefore abstracted away from developers, who are not required to know of its existence.

Figure 9: Exporting and importing streams (IBM Programming Reference 2010)

The use of one of the common text editors and the spadec compiler for generating source code would normally be restricted to expert programmers with explicit knowledge of the SPADE syntax and semantics.

5.5.2 InfoSphere Streams Studio IDE

In order to extend the commonality of the SPADE language with other well-known languages, Streams applications can also be constructed in the open source Eclipse environment. In November 2001, a consortium of industry leaders including Borland, IBM, MERANT, QNX Software Systems, Rational Software, Red Hat, SuSE, TogetherSoft and Webgain formed the initial eclipse.org to provide a common platform on which their products could be developed. By the end of 2003 this foundation had grown to over 80 members, and in January 2004 the Eclipse Foundation was created. The foundation provides IT infrastructure, intellectual property management, development process and ecosystem services to the Eclipse community (Eclipse 2010). The Eclipse environment of Streams is called InfoSphere Streams Studio, and the default view, or perspective, consists of five main panes, which have a similar look and feel to other
development environments such as NetBeans or the Microsoft Visual Studio software suites. They are:

1. Project explorer: displays the directories, files and folders of the project. It takes the form of a tree-type structure, allowing the user to expand and collapse directories and folders as required.
2. Editor: Streams Studio provides three types of editors, depending on the type of application being constructed (.dps, .dmm or .dpsm). Like other IDEs, the syntax of SPADE is highlighted in different colours. Errors are indicated to the left of the code by a red marker, and mousing over the code brings up information on the error type. The editor pane supports tabs, so multiple applications can be open in the pane at the same time.
3. Outline: the hierarchical structure of the currently selected application is displayed in the outline pane.
4. Live graph outline: the schema and format of any applications being viewed in the Live graph pane are displayed here, in a form similar to the commonly used Unified Modeling Language (UML).
5. General views: this pane displays several tabbed sub-panes:
Live graph: displays a graphical representation of the topology, state and behaviour of all running applications in a Streams instance
Problems: details errors detected by the compiler / launcher
Console: the terminal window of Streams Studio, displaying the trace messages generated by the Streams instance
Application graph: displays the topology of an application in its current state (Streams Studio Installation Guide, 2010)


Figure 10: InfoSphere Streams Studio layout (Streams Studio Installation Guide, 2010)

The Streams Studio toolbar provides many features common to most IDEs. Amongst other things, it is here that projects can be imported / exported, opened and closed, and run and / or debugged. Links to Concurrent Versions System (CVS) or Apache Subversion (SVN) repositories can also be defined, allowing collaboration between developers across the network or across the planet. As Eclipse is open source, plugins for other software such as SAP or Rational Software's ClearCase versioning tool can enable Studio to open in perspectives other than the default Streams Studio view. To create JDOPs, pure Java views are available, with packages, classes and so on displayed in the project explorer pane, and similar views exist for C/C++. Launcher configurations to run, stop, submit and cancel job deployments from the project explorer pane can be created, allowing the developer to test the application in real time with the click of the mouse (Streams Studio Installation Guide, 2010). A typical user of Streams Studio would have previous programming experience using the Microsoft Visual Studio packages or NetBeans.

5.5.3 The MARIO IDE

The MARIO IDE is similar to the Studio environment, with the addition of a Server tab in the General views pane. MARIO has been designed as a drag-and-drop interface through which the user can specify components that have already been constructed, and it deploys jobs that can integrate data flows from other platforms (i.e. not InfoSphere Streams). The end-user therefore does not need to know the underlying characteristics of the source platforms to construct workflows (Bouillet et al. 2009). MARIO uses an automated dataflow assembly service and a deploying agent to take SPADE and other systems' source code, modify and assemble code segments and then deploy the result to the target systems. Once the application is deployed, MARIO is not involved in dataflow execution (Bouillet et al. 2009). It uses a common model of objects called components that wrap fragments of SPADE code within tags of the commonly used Extensible Markup Language (XML). These components don't require knowledge of the specific syntax of the underlying platform; they only recognise the platform language's references to parameters or input / output ports, so they can re-reference them between XML tags using @ delimiters. In this way, inputs and outputs of disparate platforms can be mashed together and inquiries performed on them. The bridging components that accommodate these functions are generated automatically by the MARIO middleware (Ranganathan et al. 2009). Each component in MARIO contains some form of executable code that describes how the input data is manipulated to produce the desired output data. Code fragments of deployment instructions in the native language of the platform the component is to be invoked upon are also associated with it. If the platform happens to be an InfoSphere Streams system, the deployment instructions will be SPADE fragments. Other platforms, such as IBM's DAMIA source aggregator, use BPEL, the Business Process Execution Language, for deployment (Ranganathan et al. 2009).


Figure 11: MARIO component description language (Ranganathan et al. 2009).

To construct MARIO application flows, an instance of InfoSphere Streams must first be running, as well as a MARIO server instance. Users can manually drag and drop tags in the editor, and the flow assembler determines whether the connections are valid by examining the tags in the component descriptions. Tagged components from other applications can also be imported via built-in wizards. If the MARIO program compiles successfully, it is then deployed to the required platform via a web interface provided by the server (Bouillet et al. 2009). Programmers can also construct components in the editor pane from imported SPADE and other platform source code, and then add them to repositories. Components can be tagged and accessed by other developers, in a similar way to the repositories of applications in Streams Studio. It follows, therefore, that once significant repositories exist, the only real expertise needed to use MARIO is knowledge of the end data required and of which tagged sources are to be used (IBM mod 2 2009). The generated flow determines the order of processing amongst the various platforms, and this constitutes the orchestration of the systems. Figure 12 illustrates the composition / deployment process, and when compared with the SPADE development paradigm in Figure 7 of section 5.4, the similarities between the two become quite apparent.


Figure 12: MARIO Architecture and deployment capability (Ranganathan et al. 2009).


6. Conclusions
Stream processing represents a way to process large volumes of information by utilising existing technology in a new form. The underlying physical components are not some radical departure from the familiar, commercially available hardware of today. It is, however, a new methodology that in most cases uses middleware to manage multiple machines working in harmony, distributing the workload amongst them as required. It leverages advances in multi-core processor technology by implementing parallel processing, with input and output pipelined to other processors via the software, thus greatly reducing processing duration. By exploiting these low processing latencies, systems such as InfoSphere Streams are able to examine every tuple in a stream against the inquiry algorithm, returning very accurate and timely results.

The IBM InfoSphere Streams platform has been developed using a language that implements elements of mainstream programming and query languages, thereby greatly assisting the transition from traditional relational database systems to the stream processing paradigm. Similarities between the Streams IDEs and those commonly used by programmers also greatly reduce the developer's learning time and therefore lower deployment time. Some previous exposure to programming is still required to compose source code, operators, components or flow graphs. However, once large repositories of these have been created or shared, users with little or no coding experience would be able to use the MARIO system to compose applications with only a limited amount of training.

Billions of devices have already been embedded with networking hardware, and this trend is expected to continue well into the future. Each of these will emit a stream of data in some form or another, and this data will have to be analysed in real time if it is to be used effectively. Stream processing systems are very likely to be the only way this can realistically be achieved. In today's highly competitive marketplace, the age-old adage that knowledge is power is probably more relevant now than at any point in human history. An enterprise can gain a competitive edge in Business Intelligence by analysing huge data input feeds in real time, and stream processing in some form or another will very likely be one of the core components of nearly every medium to large enterprise and government department within the next decade.

Acknowledgements

I would like to thank the project supervisor, Mrs Rebecca England, for her guidance and direction during this research. I would also like to thank Mr Peter Wearden for providing the SPADE training course manuals and the InfoSphere Streams installation documentation.


7. Lessons Learned
During the course of this research the following points were discovered as a result of the research process itself:

1. Only about half of the research material collected is actually used in any research project. In this instance, this was mainly due to not knowing the direction the project would take in the earlier stages, and also to discovering that information initially considered irrelevant would later resurface in areas that were thought not to be connected.

2. Time management is critical during all stages of any project. Without the milestones provided by lecturer Barbara White, and without translating these into a project plan, it would have been quite easy to get carried away with the research and achieve very little. Guidance from the project supervisor Rebecca England also assisted in keeping the project on track.

3. The project diary provides an excellent means of observing the iterations of the research methodology. On occasion it was felt that the supply of relevant information had dried up, and comments to that effect were posted in the diary. Sometimes, though, a chance discovery of an image assisted greatly in connecting the dots with other research gathered. A prime example was the information on the Cell processor, which was mentioned only once in the literature relating to InfoSphere Streams. However, searching the ACM database with the string "Cell processor" provided a wealth of information. The architecture of this processor is the major reason why data can be examined so quickly. Using a picture of its layout, together with another diagram found during the query research, allows the viewer to see exactly how the mechanism works. I can only assume that the authors of the published literature wished the underlying architecture to remain an abstraction so as not to overly complicate an already complex subject.

4. I had very little exposure to object-oriented programming at the beginning of the semester, and therefore did not fully understand how the SPADE programming model worked. However, as I was also undertaking studies in Java programming at the same time, subjects discussed in the Java class began to appear in the literature relating to SPADE as the semester progressed. This assisted me greatly in answering one of the research questions: if a Java newbie like me could recognise similarities with SPADE, a seasoned, hardcore Java or C++ programmer would have no trouble transitioning to building stream processing applications.


8. References
Amini, L, Andrade, H, Eskesen, F, King, R, Selo, P, Park, Y and Venkatramani, C 2005 The Stream Processing Core Technical Report RSC 23798, IBM T.J. Watson Research Center, Hawthorne, NY, viewed 29 July via ACM portal.

Babcock, B, Babu, S, Datar, M, Motwani, R & Widom, J 2002 Models and Issues in Data Stream Systems ACM Principles of Database Systems symposium, Madison, Wisconsin, 3-6 June, viewed 2 August via ACM portal.

Bouillet, E, Feblowitz, M, Feng, H, Ranganathan, A, Riabov, A, Udrea, O and Liu, Z 2009 MARIO: Middleware for Assembly and Deployment of Multi-platform Flow-Based Applications 10th ACM/IFIP/USENIX International Conference on Middleware, Urbana, Illinois, 30 November - 4 December 2009, viewed 3 August 2010 via ACM portal.

Clough, P & Nutbrown, C 2002, A Students Guide to Methodology, SAGE Publications Company, London, UK.

Dylan, M 2009 An Analysis of Stream Processing Languages Final Workshop Paper for ITEC810, Macquarie University, Semester 2 2009. http://web.science.mq.edu.au/~rdale/teaching/itec810/2009H1/WorkshopPapers/Dylan_Miran_FinalWorkShopPaper.pdf

Eclipse Foundation website 2010, About Us, viewed 28 September 2010. http://www.eclipse.org/org/

Frossard, P, Verscheure, O and Venkatraman, C 2006 Signal Processing Challenges in Distributed Stream Processing Systems IEEE International Conference on Acoustics, Speech and Signal Processing, 14-19 May, Toulouse, France, viewed 25 August 2010 via IEEE Xplore portal.

Fülöp, L, Tóth, G, Rácz, R, Pánczél, J, Gergely, T and Beszédes, 2010 Survey on Complex Event Processing and Predictive Analytics University of Szeged, Hungary, viewed 6 August 2010 http://www.inf.u-szeged.hu/~gtoth/research/cep_pa_tech2010.pdf

Gedik, B 2009 High Performance Event Stream Processing Infrastructures and Applications with System S, Canada Norway Partnership Program in Higher Education Summer School, 3-9 August 2009, viewed 6 August 2010 via Google Scholar.


Gordon, M, Thies, W & Amarasinghe, S 2006 Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs 12th International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, California, 21-25 October, viewed 6 August 2010 via ACM portal.

Gschwind, M 2006, Chip Multiprocessing and the Cell Broadband Engine 3rd Conference on Computing Frontiers, Ischia, Italy 2 - 5 May, viewed 15 August 2010 via ACM portal.

IBM Corporation 2009 IBM InfoSphere Streams: Based on the IBM Research System S Stream Computing System, Somers, NY, viewed 6 August 2010. http://www.monash.com/uploads/IBM-InfoSphere-Streams-Overview.pdf

IBM Corporation 2009 InfoSphere Streams Installation and Administration Manual IBM Publications Center Online, viewed 8 August 2010. http://www-05.ibm.com/e-business/linkweb/publications/servlet/pbi.wss

IBM Corporation 2010 InfoSphere Streams Studio Installation and Users Guide IBM Publications Center Online, viewed 30 August 2010. http://www-05.ibm.com/e-business/linkweb/publications/servlet/pbi.wss

IBM Corporation 2009 Modules 1, 2, 3 & 4 InfoSphere Streams course manuals, obtained 29 August 2010 from Mr Peter Wearden.

IBM Corporation 2010 SPADE Programming Model and Language Reference IBM Publications Center Online, viewed 16 August 2010. http://www-05.ibm.com/e-business/linkweb/publications/servlet/pbi.wss

Nasgaard, H, Gedik, B, Komar, M and Mendell, M 2009 IBM InfoSphere Streams: event processing for a smarter planet Conference of the Center for Advanced Studies on Collaborative Research, Markham, Canada, 2-5 November, viewed 6 August 2010 via ACM portal.

Ranganathan, A, Bouillet, E, Feblowitz, M, Feng, H, Riabov, A, Udrea, O and Liu, Z 2009 MARIO: Middleware for Assembly and Deployment of Multi-platform Flow-Based Applications 10th ACM/IFIP/USENIX International Conference on Middleware, Urbana, Illinois, 30 November - 4 December 2009, viewed 3 August 2010 via ACM portal.

Schulte, W, Natis, V, Pezzini, M, Scholler, D, Gassman, B and Thompson, J 2009 Six Design Patterns for Event-Processing Applications Gartner Research, viewed 21 July 2010 via Gartner portal.


StreamBase Systems website 2010, About Us: Home, viewed 12 August 2010. http://www.streambase.com/about-home.htm

Thomas, D, Bordawekar, R, Aggarwal, C & Yu, P 2009 On Efficient Query Processing of Stream Counts on The Cell Processor IEEE 25th International Conference on Data Engineering, Shanghai, China, 29 March - 2 April, viewed 6 August 2010 via IEEE Xplore portal.

Trant, J and Bearman, D 2010, Archives & Museum Informatics website, Toronto, Canada, viewed 23 August 2010, < http://www.archimuse.com/papers/ukoln98paper/section6.html>.

Williams, S, Shalf, J, Oliker, L, Husbands, P, Kamil, S & Yelick, K 2006 The Potential of the Cell Processor for Scientific Computing 3rd Conference on Computing Frontiers, Ischia, Italy, 2-5 May, viewed 8 August 2010 via ACM portal.

Wu, K-L, Yu, P, Gedik, B, Hildrum, K, Aggarwal, C, Bouillet, E, Fan, W, George, D, Gu, X, Luo, G & Wang, H 2007 Challenges and Experience in Prototyping a Multi-Modal Stream Analytic and Monitoring Application on System S Very Large Data Base Conference, Vienna, Austria, 23-28 August, viewed 29 July via EBSCO portal.

Xu, J 2007 A Tutorial on Network Data Streaming Georgia Institute of Technology, Atlanta, viewed 5 August 2010. http://www.cc.gatech.edu/~jx/reprints/talks/sigm07_tutorial.pdf


8.1 Bibliography

Andrade, H & Turaga, D 2009, Large scale stream processing: systems, applications and research challenges, T.J. Watson Research Institute, Hawthorne, NY, Access Date 24th July 2010 http://sites.google.com/site/streamprocessingspring2010/

Lohman, T 2009, SKA telescope to provide a billion PCs' worth of processing, Computerworld, 18th September 2009, Access Date 21 July 2010 http://www.computerworld.com.au/article/319128/ska_telescope_provide_billion_pcs_worth_process ing_updated_/

Nehme, R, Rundensteiner, E and Bertino, E 2009 Tagging Stream Data for Rich Real-Time Services Very Large Data Base Conference, Lyon, France, 24-28 August, viewed 2 September 2010 via ACM portal.


Appendix A Glossary
Abstraction - reducing the information content of a concept or an observable phenomenon
Algorithm - an effective method for solving a problem expressed as a finite sequence of steps
BPEL (Business Process Execution Language) - OASIS standard execution language
CCTV - Closed Circuit Television system, usually used with security cameras
Collate - to examine and compare carefully in order to note points of disagreement
Complex Event - an event that is an abstraction of other events
Consumer - an object that takes in a stream of data
Continuous Inquiry - a query that continues to be applied as long as a data stream exists
Core - the part of the processor that actually performs the reading and executing of the instruction
CSV - Comma Separated Values file format
CVS - Concurrent Versioning System, a free software revision control system
Engine - software engines drive the functionality of the program
Event - anything that happens, or is contemplated as happening
FPGA (Field Programmable Gate Array) - a type of programmable logic circuit
Graph - an abstract data type that depicts the relationship between two or more variables
IDE - Integrated Development Environment, an application that provides facilities to programmers
Instantiation - creating an instance of an object
JDOP - SPADE operator programmed using the Java language
Latency - a measure of the time delay experienced by a system
Methodology - a description of processes
Middleware - an application that sits in-between software and hardware
Node - an active device attached to a network
ODBC - Open Database Connectivity, a standard interface for accessing databases
Paradigm - a new pattern or example
Parse - the process of analysing data by the compiler to determine its structure
Processing Element (PE) - the device that applies its inquiry algorithm to the stream
Processing Element Container - logic construct that protects source code / other system code
Producer - an object that emits a data stream


Relational - a type of database that stores data in rows and columns
Runtime - the time during which a program is executing
Schema - a way to define the structure and content
Semantics - the meaning of a programming language
Simple Event - an event that is not a composition or abstraction of other events
Sink - an output stream of data, external to an InfoSphere Streams instance
Source - an input stream of data, external to an InfoSphere Streams instance
SQL - Structured Query Language, a programming language used in relational databases
SVN - Subversion, a software revision control system produced by the Apache Foundation
Syntax - the rules of a programming language
Tag - metadata or keyword assigned to or associated with a specific piece of data
Tuple - a sequence of named attributes each holding data of a particular type
UBOP - a SPADE user built operator that can be shared amongst applications
UDOP - a SPADE user defined operator
Workflow - a sequence of connected steps, declared as work performed by a system
XML (eXtensible Markup Language) - a set of rules for encoding data in machine readable form

