
BLUE SKIES

Processing Distributed Internet of Things Data in Clouds

Lizhe Wang, Chinese Academy of Sciences and China University of Geosciences
Rajiv Ranjan, Commonwealth Scientific and Industrial Research Organization

Recent studies by Cisco and IBM show that we generate 2.5 quintillion bytes of data per day, and this is set to explode to 40 zettabytes by 2020, or 5,200 gigabytes for every person on earth.1,2 Much of this data is and will be generated from Internet of Things (IoT) devices and sensors. IoT comprises billions of Internet-connected devices (ICDs) or "things," each of which can sense, communicate, compute, and potentially actuate, and can have intelligence, multimodal interfaces, physical/virtual identities, and attributes. ICDs can be sensors, RFIDs, social media, clickstreams, business transactions, actuators (such as machines and equipment fitted with sensors and deployed for mining, oil exploration, or manufacturing operations), lab instruments (such as a high-energy physics synchrotron), and smart consumer appliances (TVs, phones, and so on).

The IoT vision is to allow things to be connected anytime, anywhere, with anything and anyone, ideally using any path, network, and service. This vision has recently given rise to the notion of IoT big data applications that are capable of producing billions of datastreams and tens of years of historical data to provide the knowledge required to support timely decision making. These applications need to process and manage streaming and multidimensional data from geographically distributed data sources that can be available in different formats, present in different locations, and reliable at different levels of confidence.

IoT Big Data Application Requirements
The current generation of IoT big data applications (such as smart supply chain management, syndromic surveillance, and smart energy grids) combines multiple independent data analytics models, historical data repositories, and real-time datastreams that are likely to be available across geographically distributed datacenters (both private and public). For example, in a smart supply chain management IoT application, advanced analytics provides the next frontier of supply chain innovation. However, data management in supply chains is challenging because:

• datasets span multiple continents and are independently managed by hundreds of suppliers and distributors;
• datasets are updated in real time based on feeds from sensors attached to manufacturing devices and delivery vehicles; and
• customers express their sentiments regarding products via a mix of venues such as social media, product review portals, and blogs.

Companies must combine and analyze this distributed data along with contextual factors such as weather forecasts and pricing positions to establish which factors strongly influence the demand for particular products, and then quickly take action to adapt to competitive and evolving environments. Similarly, syndromic surveillance IoT applications require churning through massive amounts of heterogeneous, real-time information available from social media, emergency rooms, health departments, hospitals, and ambulatory care sites to detect outbreaks of deadly diseases such as SARS, avian flu, cholera, and dengue fever.

Clearly, these IoT applications produce big datasets that can't be transferred over the Internet to be processed by a centralized public or private datacenter. The main reasons for this state of affairs are:

• the datasets have strict privacy, security, and regulatory constraints that prohibit their transfer outside the parent domain;
• the datasets flow at a volume and velocity too large and too fast to be processed by a single centralized datacenter, because doing so would lead to high network communication overhead; and
• the analytics models and intelligence required to process the datasets are available across geographically distributed locations.

Despite the requirements posed by IoT big data applications, the capability of existing big data processing technologies and datacenter computing infrastructure is limited. For example, they can only process data on compute and storage resources within a centralized local area network, such as a single cluster within a datacenter. In addition, they don't provide mechanisms to seamlessly integrate data spread across multiple distributed heterogeneous data sources. Finally, they can't ensure security- and privacy-preserving processing of heterogeneous data governed by heterogeneous policies and access control rules.

State of the Art in Distributed IoT Data Processing
Existing big data processing technologies and datacenter infrastructures have varied capabilities with respect to meeting the distributed IoT data processing challenges.

Datacenter Cloud Computing Infrastructure Service Stack
Commercial and public datacenters such as Amazon Web Services and Microsoft Azure provide computing, storage, and software resources as cloud services, enabled by virtualized software/middleware stacks. Examples include virtual machine management systems such as Eucalyptus and Amazon Elastic Compute Cloud (EC2); image management tools such as the FutureGrid image repository3; massive data storage/file systems such as the Google File System (GFS), the Hadoop Distributed File System (HDFS), and Amazon Simple Storage Service (S3); and data-intensive execution frameworks such as Amazon Elastic MapReduce. In addition, FutureGrid (http://FutureGrid.org) and OpenStack provide software stack definitions for cloud datacenters.

On the other hand, private datacenters typically build basic infrastructure services by combining available software tools and services. This software includes cluster management systems such as Torque, Oscar, and the Simple Linux Utility for Resource Management (Slurm); parallel file/storage systems such as storage area network/network-attached storage (SAN/NAS)4 and Lustre (http://wiki.lustre.org); and data management systems such as the Berkeley Storage Manager (BeStMan, https://sdm.lbl.gov/bestman) and dCache (www.dcache.org). In addition, some private datacenters are enabled for resource sharing with grid computing middleware such as the Globus toolkit, the Uniform Interface to Computing Resources (Unicore, www.unicore.eu), and the Lightweight Middleware for Grid Computing (gLite).

Massive Data Processing Models and Frameworks
The MapReduce paradigm has been widely used for large-scale data-intensive computing within datacenters because of its low cost, massive data parallelism, and fault-tolerant processing. The most popular implementation, the Hadoop framework, allows applications to run on large clusters and provides transparent reliability and data transfer. Other implementations target Compute Unified Device Architecture (CUDA),5 field-programmable gate arrays (FPGAs),6 and virtual machines,7 as well as streaming runtimes,8 grid,8 and opportunistic environments.9 Apache Hadoop on Demand (HOD) provides virtual Hadoop clusters over a large physical cluster based on Torque, and MyHadoop provides on-demand Hadoop instances on high-performance computing (HPC) resources via traditional schedulers.10 Other MapReduce-like projects include Twister (www.iterativemapreduce.org), Sector/Sphere (http://sector.sourceforge.net), and All-Pairs.11
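To make the programming model concrete, the following is a minimal sketch of MapReduce expressed in plain Python as a word count: a user-supplied map function, a user-supplied reduce function, and a tiny driver that simulates the shuffle phase in memory. The sample input lines are invented for illustration; in Hadoop, the framework itself would run the same two functions in parallel across a cluster and handle the shuffle, scheduling, and fault tolerance.

# Minimal illustration of the MapReduce programming model: a word count
# expressed as a map function and a reduce function, plus a tiny driver
# that simulates the shuffle phase on a single machine.
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) pairs for every word in one input record (a line of text).
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Sum the counts collected for one key.
    return key, sum(values)

def run_job(records):
    # Shuffle: group intermediate values by key, as the framework would.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

if __name__ == "__main__":
    lines = ["IoT data in clouds", "IoT data everywhere"]
    print(run_job(lines))  # {'iot': 2, 'data': 2, 'in': 1, 'clouds': 1, 'everywhere': 1}

The appeal of the model is that only map_phase and reduce_phase are application specific; everything else (partitioning, retries, data movement) is the framework's job, which is also why the model is hard to stretch beyond a single cluster, as discussed later.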
Data Management Service across Datacenters
The following four storage service abstractions supported by cloud providers differ in how they store, index, and execute queries:

• Binary Large Object (Blob) storage for unstructured data, such as Amazon S3 and Azure Blob storage;
• key-value storage such as HBase, MongoDB, and BigTable;
• message queuing systems such as SQS and Apache Kafka; and
• relational database management systems such as Oracle and MySQL, which support ACID (atomicity, consistency, isolation, durability) transactional properties.
Accordingly, several research efforts have integrated different cloud data storage services by providing a transparent interface. Examples are the Simple Cloud API (http://simplecloud.com), the Simple API for Grid Applications (SAGA) with an SRM interface, and uniform services such as PDC@KTH's proxy service12 and Open Grid Services Architecture Data Access and Integration (OGSA-DAI, www.ogsadai.org.uk) Web services. In addition, a number of third-party providers (Dropbox, Mozy, and so on) simplify online cloud storage access.

Data-Intensive Workflow Computing
Typical data-intensive scientific workflow frameworks include Pegasus, Kepler, Taverna, Triana, Swift, and Trident. Various business workflow technologies have also been applied to data-intensive workflow systems. Examples include service orchestration with the Business Process Execution Language (BPEL) and YAWL (Yet Another Workflow Language),13 service choreography with the Web Services Choreography Description Language (WS-CDL, www.w3.org/TR/ws-cdl-10), and service-oriented architectures.14

Benchmarks, Application Kernels, Standards, and Recommendations
Several benchmarks and application kernels have been developed, fueled by the need to analyze the performance of different big data technologies. They include Graph 500 (www.graph500.org), Hadoop Sort (http://wiki.apache.org/hadoop/Sort), the Sort Benchmark (http://sortbenchmark.org), MalStone,15 the Yahoo Cloud Serving Benchmark (http://research.yahoo.com/Web_Information_Management/YCSB), the Google cluster workload (http://code.google.com/p/googleclusterdata), the TPC-H benchmark (www.tpc.org/tpch), BigDataBench, BigBench, HiBench, and PigMix. These benchmark suites model workloads for stress-testing one or more categories of big data processing technologies. Among them, BigDataBench is the most comprehensive because it provides workload models for NoSQL stores, database management systems (DBMSs), stream processing engines (SPEs), and batch processing frameworks. Primarily, BigDataBench targets the search engine, social network, and e-commerce application domains.

However, there are limited benchmarks and application kernels for heterogeneous datacenters. In fact, there's no agreed performance benchmark for executing large-scale IoT applications across distributed datacenters, and establishing intercenter benchmarks and standards should be a key item on the future research agenda. Currently, the National Institute of Standards and Technology (NIST), the Open Grid Forum (OGF), the Distributed Management Task Force (DMTF) Cloud working group, the Cloud Security Alliance, and the Cloud Standards Customer Council are all working on cloud standards.

Research Issues
Big IoT data processing across multiple distributed datacenters remains challenging, mainly because of technical issues related to basic service stacks for datacenter computing infrastructures, massive data processing models, trusted data management services, data-intensive workflow computing, and benchmarks.

Service Stacks in a Multidatacenter Computing Infrastructure
Despite significant advances, public cloud computing technologies are still technically challenging to use for serving large-scale IoT applications across datacenters. First, cloud technologies must be integrated into the resource management and file systems of existing private datacenter infrastructures to provision cloud services.

Massive Data Processing Models for Datacenters
There are several limitations in using MapReduce and Hadoop for large-scale distributed massive data processing. First, the framework is limited to compute infrastructures within a local area network or datacenter, and can't be used directly for large-scale IoT applications across geographically distributed datacenters. Second, MapReduce suffers from performance degradation due to the absence of a high-performance parallel and distributed file system that can seamlessly operate across multiple datacenters. Third, MapReduce uses a task "fork" mechanism that can't be directly deployed in traditional private datacenters with local task managers such as Torque and Globus, not to mention the lack of security models. Fourth, the limited semantics of MapReduce can't easily express the diverse parallel patterns of large-scale scientific applications. In addition, MapReduce and Hadoop aren't currently widely supported by data-intensive workflow systems, although there are some preliminary efforts.16
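One commonly discussed way around the single-datacenter assumption is hierarchical aggregation: each site reduces its own raw data locally and ships only compact partial results over the wide-area network to a coordinating site. The Python sketch below illustrates only the idea; it is not part of MapReduce or Hadoop, and the datacenter names and readings are invented for illustration.

# Sketch of hierarchical aggregation across datacenters: each site reduces
# its own data locally, and only small per-site summaries cross the WAN.
from collections import Counter

def local_reduce(readings):
    # Runs inside one datacenter: count events per device without moving raw data.
    return Counter(device_id for device_id, _value in readings)

def global_merge(partials):
    # Runs at the coordinating site: combine the compact per-site summaries.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Invented example data held in two different datacenters.
datacenter_eu = [("sensor-1", 17.3), ("sensor-2", 12.9), ("sensor-1", 18.1)]
datacenter_us = [("sensor-1", 16.4), ("sensor-3", 21.0)]

partials = [local_reduce(datacenter_eu), local_reduce(datacenter_us)]
print(global_merge(partials))  # Counter({'sensor-1': 3, 'sensor-2': 1, 'sensor-3': 1})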
Optimized Data Management across Datacenters
Several research efforts have integrated heterogeneous types of cloud storage services by providing a transparent interface. However, these services and interfaces can't guarantee that data is secured both in motion and at rest, don't support automated ranking of competing storage services, and can't handle uncertainties regarding cloud storage services and network routes.

Processing and storing massive datasets across geographically distributed storage services while providing the required quality-of-service (QoS) guarantees raises several concerns. First, current cloud storage services aren't secure by nature because of the inherent risk of data exposure, tampering, and denial of data access; ensuring data confidentiality, integrity, and availability is therefore a major concern.

Another issue concerns the intelligence needed to automate the choice of the best storage services and network routes for optimal application QoS. Existing quantitative approaches have applied optimization17 and performance measurement techniques18 to select cloud services, while other research focuses on static XML schema matching methods.19

Finally, the uncertainty of cloud storage services and network routes in a multidatacenter environment is another major concern. Several reactive techniques rely on service state monitoring and action triggering to ensure QoS while adapting to run-time variation in resource loading and failures.20,21 Some QoS prediction methods, such as the Network Weather Service, use both monitoring and forecasting, and another proposed network QoS-aware approach uses QoS profiling, modeling, and prediction.22
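The reactive monitor-and-trigger pattern mentioned above reduces to a simple loop: sample a QoS metric, compare it against a target, and invoke a corrective action when the target is violated. The Python sketch below shows only that loop; read_latency_ms and add_storage_replica are placeholder functions standing in for whatever monitoring and provisioning interfaces a given platform actually exposes.

# Generic monitor-and-trigger loop for reactive QoS management.
# `read_latency_ms` and `add_storage_replica` are placeholders for real
# monitoring and provisioning calls, which differ per cloud platform.
import random
import time

LATENCY_TARGET_MS = 200  # assumed QoS target for this sketch

def read_latency_ms():
    # Placeholder: in practice this would query a monitoring service.
    return random.uniform(50, 250)

def add_storage_replica():
    # Placeholder corrective action, e.g., replicate data closer to consumers.
    print("QoS violated: provisioning an additional replica")

def monitor(iterations=5, interval_s=1):
    for _ in range(iterations):
        if read_latency_ms() > LATENCY_TARGET_MS:  # threshold check
            add_storage_replica()
        time.sleep(interval_s)

if __name__ == "__main__":
    monitor()

Prediction-based methods such as the Network Weather Service differ from this sketch mainly in replacing the raw sample with a forecast, so corrective actions can be taken before the target is actually violated.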
Data-Intensive Workflow Computing
IoT applications typically require distributed processing of data as a workflow that spans multiple data processing services and repositories. Several open source products exist for running data-intensive workflows, but current systems suffer from several limitations. For example, there's limited support for workflows that traverse heterogeneous file systems such as Lustre, HDFS, and GFarm. There's also limited support for MapReduce tasks and subworkflows in data-intensive workflows across distributed datacenters. Finally, data storage and management services aren't incorporated into the service-oriented frameworks for data-intensive workflow systems.
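To illustrate what such a workflow looks like, the sketch below wires a few tasks into a small dependency graph, tags each task with the site assumed to hold its input data, and executes the tasks in topological order. The task names and site labels are invented; real systems such as Pegasus or Kepler express the same structure in their own workflow languages and actually submit the tasks for execution.

# Toy data-intensive workflow: tasks annotated with the datacenter assumed to
# hold their input data, executed in dependency order (names are invented).
from graphlib import TopologicalSorter

TASKS = {
    # task name: (site holding its data, list of upstream tasks)
    "ingest_eu":   ("dc-eu", []),
    "ingest_us":   ("dc-us", []),
    "clean_eu":    ("dc-eu", ["ingest_eu"]),
    "clean_us":    ("dc-us", ["ingest_us"]),
    "global_join": ("dc-eu", ["clean_eu", "clean_us"]),
    "report":      ("dc-eu", ["global_join"]),
}

def run_workflow(tasks):
    graph = {name: set(deps) for name, (_site, deps) in tasks.items()}
    for name in TopologicalSorter(graph).static_order():
        site, _deps = tasks[name]
        # A real workflow engine would submit the task to the named datacenter.
        print(f"running {name} at {site}")

if __name__ == "__main__":
    run_workflow(TASKS)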
Benchmarks and Application Kernels
Currently, there's no agreed performance benchmark for executing large-scale IoT applications in distributed datacenters. Even worse, there are currently no intercenter benchmarks, application kernels, or standards for running large-scale IoT applications on distributed datacenters.

Large-scale IoT applications need to process and manage massive datasets across geographically distributed datacenters. These applications need to be provisioned across multiple datacenters to exploit independent and geographically distributed data sources and IT infrastructure. The capability of existing data processing tools (for example, file systems, MapReduce, and workflow technologies), however, is optimized for a single datacenter. Future research efforts will need to tackle the challenge of provisioning IoT applications across multiple datacenters by extending existing big data processing tools with the ability to process data across geographic locations; developing techniques for ensuring the security and privacy of sensitive data; and developing intelligent techniques for application provisioning based on cost, performance, and other QoS requirements.

References
1. R. Pepper and J. Garrity, "The Internet of Everything: How the Network Unleashes the Benefits of Big Data," Global Information Technology Report, Cisco, 2014, pp. 35–42; http://blogs.cisco.com/wp-content/uploads/GITR-2014-Cisco-Chapter.pdf.
2. "Bringing Big Data to the Enterprise," IBM; http://www-01.ibm.com/software/in/data/bigdata.
3. J. Diaz et al., "FutureGrid Image Repository: A Generic Catalog and Storage System for Heterogeneous Virtual Machine Images," Proc. 3rd IEEE Int'l Conf. Cloud Computing Technology and Science (CloudCom 11), 2011, pp. 560–564.
4. P. Radzikowski, "SAN vs DAS: A Cost Analysis of Storage in the Enterprise," Capitalhead, 2008; http://capitalhead.com/articles/san-vs-das-a-cost-analysis-of-storage-in-the-enterprise.aspx.
5. B. He et al., "Mars: A MapReduce Framework on Graphics Processors," Proc. 17th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 08), 2008, pp. 260–269.
6. Y. Shan et al., "FPMR: MapReduce Framework on FPGA," Proc. 18th Ann. ACM/SIGDA Int'l Symp. Field Programmable Gate Arrays (FPGA 10), 2010, pp. 93–102.
7. S. Ibrahim et al., "Cloudlet: Towards MapReduce Implementation on Virtual Machines," Proc. 18th ACM Int'l Symp. High Performance Distributed Computing (HPDC 09), 2009, pp. 65–66.
8. S. Pallickara, J. Ekanayake, and G. Fox, "Granules: A Lightweight, Streaming Runtime for Cloud Computing with Support for Map-Reduce," Proc. Cluster Computing and Workshops (CLUSTER 09), 2009, pp. 1–10.
9. H. Lin et al., "MOON: MapReduce on Opportunistic Environments," Proc. 19th ACM Int'l Symp. High Performance Distributed Computing (HPDC 10), 2010, pp. 95–106.
10. S. Krishnan, M. Tatineni, and C. Baru, MyHadoop—Hadoop-on-Demand on Traditional HPC Resources, tech. report, San Diego Supercomputer Center, Univ. of California, San Diego, 2011.
11. C. Moretti et al., "All-Pairs: An Abstraction for Data-Intensive Cloud Computing," Proc. IEEE Int'l Symp. Parallel and Distributed Processing (IPDPS 08), 2008, pp. 1–11.
12. I. Livenson and E. Laure, "Towards Transparent Integration of Heterogeneous Cloud Storage Platforms," Proc. 4th Int'l Workshop Data-Intensive Distributed Computing (DIDC 11), 2011, pp. 27–34.
13. C. Ouyang, M. Adams, and A.H.M. ter Hofstede, "Yet Another Workflow Language: Concepts, Tool Support, and Application," Handbook of Research on Business Process Modeling, IGI Global, 2009, pp. 92–121.
14. K.K. Droegemeier et al., "Service-Oriented Environments for Dynamically Interacting with Mesoscale Weather," Computing in Science and Engineering, vol. 7, no. 6, 2005, pp. 12–29.
15. C. Bennett, R. Grossman, and J. Seidman, MalStone: A Benchmark for Data Intensive Computing, tech. report TR-09-01, Open Cloud Consortium, 2009; http://malgen.googlecode.com/files/malstone-TR-09-01.pdf.
16. J. Wang, D. Crawl, and I. Altintas, "Kepler + Hadoop: A General Architecture Facilitating Data-Intensive Applications in Scientific Workflow Systems," Proc. 4th Workshop Workflows in Support of Large-Scale Science (WORKS 09), 2009; doi:10.1145/1645164.1645176.
17. M. Hajjat et al., "Cloudward Bound: Planning for Beneficial Migration of Enterprise Applications to the Cloud," ACM SIGCOMM Computer Comm. Rev., vol. 40, no. 4, 2010, pp. 243–254.
18. A. Li et al., "CloudCmp: Comparing Public Cloud Providers," Proc. 10th Ann. Conf. Internet Measurement, 2010, pp. 1–14.
19. A. Ruiz-Alvarez and M. Humphrey, "An Automated Approach to Cloud Storage Service Selection," Proc. 2nd Int'l Workshop Scientific Cloud Computing (ScienceCloud 11), 2011, pp. 39–48.
20. L.M. Vaquero et al., "Dynamically Scaling Applications in the Cloud," ACM SIGCOMM Computer Comm. Rev., vol. 41, no. 1, 2011, pp. 45–52.
21. Z. Shen et al., "CloudScale: Elastic Resource Sharing for Multi-Tenant Cloud Systems," Proc. 2011 ACM Symp. Cloud Computing, 2011.
22. K. Alhamazani et al., "Cloud Monitoring for Optimizing the QoS of Hosted Applications," Proc. 4th IEEE Int'l Conf. Cloud Computing Technology and Science, 2012, pp. 765–770.

LIZHE WANG is a professor at the Institute for Remote Sensing and Digital Earth, Chinese Academy of Sciences, and at the School of Computer Science at the China University of Geosciences. Wang has a PhD in computer science from Karlsruhe University. He is a senior member of IEEE. Contact him at lizhewang@icloud.com.

RAJIV RANJAN is a senior research scientist, Julius Fellow, and project leader at the Commonwealth Scientific and Industrial Research Organization (CSIRO). At CSIRO, he leads research projects related to cloud computing, content delivery networks, and big data analytics for Internet of Things (IoT) and multimedia applications. Ranjan has a PhD in computer science and software engineering from the University of Melbourne. Contact him at rajiv.ranjan@csiro.au.
Demand on Traditional HPC Re- Public Cloud Providers,” Proc. 10th
