
Learner Guide

BIG DATA AND IOT 600


Faculty of Information Technology

Year 2 Semester 2

richfield.ac.za
Table of Contents

Chapter 1: Big Data Analysis and Extraction Techniques
1.1 Big Data
1.2 Big Data Analysis Techniques
Chapter 2: IoT applications and architectures
2.1 IoTs defined
2.2 How IoTs work
2.3 IoT Applications
2.4 IoT Architectures
Chapter 3: Big Data Storage and Security
3.1 Big Data Storage
3.2 Big Data Security
Chapter 4: Big Data Strategies and Legal Compliance
4.1 How big data can help guide your strategy
4.2 Forming your strategy for big data and data science
4.3 Analytics, algorithms and machine learning
4.4 Governance and legal compliance
4.5 Governance for reporting
Case study – Netflix gets burned despite best intentions
Chapter 5: IoT technologies and Standards
5.1 IoT Protocols Background
5.2 The Best Tools for Internet of Things (IoT) Development
5.3 IoT Development Platforms
5.4 IoT Operating Systems
5.5 IoT Programming Languages
5.6 Open-Source Tools for the Internet of Things
5.7 Best IoT Development Kits
5.8 IoT Security
5.9 IoT Statistics and Forecast
5.10 Types of IoT Connections
5.11 Top Seven IoT Platforms
5.12 Most Popular Internet of Things Protocols, Standards and Communication Technologies
5.13 Standards Bodies
PRESCRIBED OR RECOMMENDED BOOKS

Big Data and the Internet of Things: Enterprise Information Architecture for a New Age (Apress), by Robert Stackowiak, Art Licht and Venu Mantha

Big Data Demystified (Pearson), by David Stephenson

Chapter 1: Big Data Analysis and Extraction Techniques


Chapter 2: IoT Architectures and Applications
Chapter 3: Big Data Storage and Security
Chapter 4: Big Data Strategies and Legal Compliance
Chapter 5: IoT technologies and Standards
Chapter 1: Big Data Analysis and Extraction Techniques

LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:

Learning Objectives

 apply various Big Data techniques for analysis of knowledge to support business decisions

1.1 Big Data

The millennium brought with it exponential data volumes driven by a variety of sources of digitised
data. Every individual, company, business and organisation has a digital footprint, as depicted in Figure
1 below.

Figure 1: Data Evolution (Vozabal, 2016)


Big Data is the term used to describe the application of specialized techniques to process very large
sets of data, which are often too complex to process using regular database management tools. Some
common examples of large data sets (at least one terabyte) include ‘web logs, call records, medical
records, military surveillance, photography archives, video archives and large-scale e-commerce’
(Stephenson, 2013).

1.2 Big Data Analysis Techniques


‘Facebook is estimated to store at least 100 petabytes of pictures and videos alone’ (Stephenson, 2013).

The most common Big Data Analysis Techniques include:

Association rule

Association rule learning is an analysis technique adopted to find patterns in data through
correlations between variables in large databases. It was first used by major supermarket
chains to discover relations between products, using data from supermarket point-of-sale
(POS) systems. This adoption has been expanded to other areas, for example to assist in the:

• ‘placement of products in better proximity to each other in order to increase sales
• extraction of information about visitors to websites from web server logs
• analysis of biological data to uncover new relationships
• monitoring of system logs to detect intruders and malicious activity
• identification that people who buy milk and butter are more likely to buy diapers’
(Stephenson, 2013).
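The mechanics can be sketched in a few lines of Python. The baskets and thresholds below are invented purely for illustration; production systems would normally use the Apriori or FP-Growth algorithms from a dedicated library rather than this brute-force pair counting.

from itertools import combinations
from collections import Counter

# Toy point-of-sale baskets (hypothetical data for illustration only)
transactions = [
    {"milk", "butter", "diapers"},
    {"milk", "bread"},
    {"milk", "butter", "bread", "diapers"},
    {"butter", "diapers"},
    {"milk", "butter"},
]

n = len(transactions)
item_counts = Counter()
pair_counts = Counter()

for basket in transactions:
    for item in basket:
        item_counts[item] += 1
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Rule "A -> B": support = P(A and B), confidence = P(B given A)
for (a, b), count in pair_counts.items():
    support = count / n
    confidence = count / item_counts[a]
    if support >= 0.4 and confidence >= 0.7:
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")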

Classification tree analysis

This is a type of machine learning algorithm that adopts a structural mapping of binary
decisions which lead to a decision about the class of an object. Although sometimes referred
to as a decision tree, it is more properly a type of decision tree that leads to categorical
decisions. This statistical classification technique is sometimes used to:
• automatically assign documents to categories
• categorize organisms into groupings
• develop profiles of students who take online courses (Stephenson, 2013).
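As a rough illustration, the sketch below uses scikit-learn to train a small classification tree that “profiles” students. The features, labels and values are entirely hypothetical.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical student records: [hours online per week, age, assignments submitted]
X = [[2, 19, 1], [15, 32, 8], [1, 21, 0], [12, 40, 9], [3, 23, 2], [14, 29, 7]]
y = [0, 1, 0, 1, 0, 1]  # 1 = completed an online course, 0 = did not

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Classify a new student profile as likely or unlikely to complete a course
print(tree.predict([[10, 30, 6]]))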
Genetic algorithms

Genetic algorithms are inspired by inheritance, mutation and natural selection. Essentially
these mechanisms are used to “evolve” useful solutions to problems that require
optimization. Some common applications of Genetic algorithms include the:

• ‘scheduling of doctors for hospital emergency rooms


• developing combinations of optimal materials and engineering
practices required to develop fuel-efficient cars
• generating “artificially creative” content such as puns and jokes’ (Stephenson, 2013).
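The sketch below shows the bare mechanics of a genetic algorithm (selection, crossover/inheritance and mutation) on a toy bit-string problem; a real scheduling or engineering application would differ only in how individuals and the fitness function are encoded.

import random

# Toy problem: evolve a bit string with as many 1s as possible, standing in
# for any optimisation target such as a staff schedule or a material mix.
GENES, POP, GENERATIONS = 20, 30, 40

def fitness(individual):
    return sum(individual)

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]

for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]          # selection: keep the fitter half
    children = []
    while len(children) < POP - len(parents):
        mum, dad = random.sample(parents, 2)
        cut = random.randint(1, GENES - 1)
        child = mum[:cut] + dad[cut:]         # crossover (inheritance)
        if random.random() < 0.1:             # occasional mutation
            i = random.randrange(GENES)
            child[i] = 1 - child[i]
        children.append(child)
    population = parents + children

print(max(fitness(ind) for ind in population))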

Are people who purchase coffee more likely to purchase soft drinks?

Machine Learning (ML)

ML works with computer algorithms that build assumptions from collected data and provide predictions that would be impossible for human analysts to produce. It gives computers the ability to learn without being explicitly programmed, so they can make predictions based on known properties learned from sets of “training data”. Machine learning is applied in:

• ‘distinguishing between spam and non-spam email messages


• learning user preferences and make recommendations based on this information
• determining the best content for engaging prospective customers
• determining the probability of winning a case, and setting legal billing rates’
(Stephenson, 2013).
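The spam example can be sketched with scikit-learn as follows: a naive Bayes model is fitted to a handful of labelled messages (the “training data”) and then predicts the class of an unseen message. The messages are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "Win a free prize now", "Cheap loans, click here",
    "Meeting moved to 3pm", "Please review the attached report",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)   # turn text into word-count features

model = MultinomialNB()
model.fit(X, labels)

# Predict the class of a previously unseen message
new = vectorizer.transform(["Click here to win a free loan"])
print(model.predict(new))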

Regression analysis

Regression analysis is a powerful statistical method that investigates the relationship


between two or more variables. Typically, it examines the influence of one or more
independent variables on a dependent variable, like weight, speed or age. Examples of the
application of regression analysis include determining how the:

• ‘levels of customer satisfaction affect customer loyalty


• number of support calls received may be influenced by the weather forecast given
the previous day
• neighbourhood and size affect the listing price of houses’ (Stephenson, 2013).
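The housing example might be sketched as below with scikit-learn; the sizes, neighbourhood scores and prices are invented, and a real model would use far more observations and variables.

from sklearn.linear_model import LinearRegression

# Hypothetical data: [size in square metres, neighbourhood score] vs listing price
X = [[80, 6], [120, 7], [150, 8], [95, 5], [200, 9], [60, 4]]
y = [900_000, 1_400_000, 1_900_000, 1_000_000, 2_600_000, 650_000]

model = LinearRegression()
model.fit(X, y)

print(model.coef_)                 # influence of each independent variable on price
print(model.predict([[110, 7]]))   # estimated listing price for a new house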
Sentiment analysis

Sentiment Analysis is a type of Natural Language Processing (NLP) technique that automates
the process of understanding an opinion about a given subject from written or spoken
language. Thus it helps researchers determine the sentiments of speakers or writers.
‘Sentiment analysis is being used to help:

• improve service at a hotel chain by analysing guest comments


• customize incentives and services to address what customers are
really asking for
• determine what consumers really think based on opinions from
social media’ (Stephenson, 2013).
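A deliberately simple, lexicon-based sketch of scoring hotel guest comments is shown below. Real sentiment analysis relies on trained NLP models or full lexicons rather than this tiny hand-made word list.

POSITIVE = {"great", "friendly", "clean", "excellent", "comfortable"}
NEGATIVE = {"dirty", "rude", "slow", "noisy", "terrible"}

def sentiment_score(comment):
    words = comment.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

guest_comments = [
    "Excellent stay and the staff were friendly and the room was clean",
    "Terrible service with slow check-in and a noisy room",
]

for comment in guest_comments:
    score = sentiment_score(comment)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    print(label, "-", comment)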

How well is our new exchange policy on sale items being received by our customers?

Social network analysis

Social network analysis maps and measures the relationships


and flows between people, groups, organizations, computers, URLs, and
other connected information or knowledge entities. The nodes in
the network represent the people and groups while the links identify the relationships or
flows between the nodes. Some examples of the application of social network analysis
include:

• ‘understanding how people from different ethnic groups form ties with outsiders
• finding the importance of a particular individual within a group
• determining the social structure of a customer base’ (Stephenson, 2013).
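A minimal sketch using the networkx library is given below: nodes are people, edges are observed ties, and centrality scores offer one way of finding the importance of a particular individual within a group. The names and ties are invented.

import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Ana", "Ben"), ("Ana", "Carla"), ("Ben", "Carla"),
    ("Carla", "David"), ("David", "Eva"), ("Eva", "Frank"),
])

# How connected each person is, and who acts as a bridge between sub-groups
print(nx.degree_centrality(G))
print(nx.betweenness_centrality(G))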

Data Mining

Data mining extracts patterns from large data sets by combining methods from statistics and
machine learning, within database management. It is also referred to as the process of finding
anomalies, patterns and correlations within large data sets to predict outcomes.
An example would be when customer data is mined to determine which market segments are
most likely to react to an offer.
Natural Language Processing (NLP)

NLP is a sub-specialty of computer science, artificial intelligence and linguistics, which uses algorithms to analyse human (natural) language.
For example, if you have shopped online it is most likely that you were interacting with a
chatbot rather than an actual human being. These AI customer service agents are typically
algorithms that use NLP to be able to understand your query and respond to your questions
adequately, automatically, and in real-time.

In 2016, Mastercard launched its own chatbot that was compatible with Facebook Messenger, but compared to Uber’s bot, the Mastercard bot functions more like a virtual assistant.
Chapter 2: IoT applications and architectures

LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:

Learning Objectives

 Demonstrate an understanding of IoT applications and architectures

2.1 IoTs defined

IoT is an acronym for the Internet of Things.

Figure 2: IoTs explained visually


(Source: http://telco.co.zw/telco2017/the-internet-of-things/things/)
IoTs are often referred to as a system of interrelated computing devices, mechanical and digital machines, objects, animals or people that are provided with unique identifiers (UIDs) and the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction (Rouse, 2019). Simply put, it refers to the concept of connecting any device with an on and off switch to the Internet (and/or to each other). This includes everything from mobile phones, coffee makers, washing machines, headphones, lamps and wearable devices, to name a few.

The first internet appliance was a Coke machine at Carnegie Mellon University in the early 1980s. Using the web, programmers could check the status of the machine and determine whether there would be a cold drink available for them (Rouse, 2019).

2.2 How IoTs work

An IoT usually consists of web-enabled smart devices that use embedded processors, sensors
and communication hardware to collect, send and act on data they acquire from their
environments, as depicted in Figure 3 below.

Figure 3: Example of an IoT system


(Source: Rouse, 2019)

IoT devices share the sensor data they collect by connecting to an IoT gateway, where data is
either sent to the cloud to be analysed or analysed locally. Sometimes, these devices
communicate with other related devices and act on the information they get from one
another. The devices do most of the work without human intervention, although people can
interact with the devices - for instance, to set them up, give them instructions or access the
data.
The connectivity, networking and communication protocols used with these web-enabled
devices largely depend on the specific IoT applications deployed.

2.3 IoT Applications

The proper use of IoT technologies in businesses can reduce overall operating costs, help
increase business efficiency, and create additional revenue streams through new markets and
products to enhance their competitive advantage. Some common applications for IoTs are
described below.

Retail and supply chain management

The leaders in “smart” operations are the retail and supply chain industry. This is not limited
to the use of IoT devices and applications for shopping and supply chain management.
Restaurants, hospitality providers, and other businesses have also adopted IoTs to manage
their supplies and gather valuable insights like avoiding over-ordering, effectively restricting
staff members who abuse their privileges, as well as better management of logistical and
merchandising expenses.

Examples of IoTs in the retail and supply chain industry include:


 Proximity-based advertising with Beacons
 Smart inventory management technologies used at the Amazon Go checkout-free store
 Queue Hop, an innovative inventory tracking and theft prevention IoT solution. Smart tags are attached to the items on sale, which unclip automatically only after the payment is made. The system speeds up the checkout process by offering mobile self-service capabilities and allows business owners to manage their inventory in real time. As a result, this technology has the potential of disrupting the whole shopping process, by allowing business owners to reallocate resources for better efficiency and improved customer service.

Home automation

It is impossible to ignore the impact that IoT technologies have had on our homes. Smart appliances, lighting, security, and environment controls, as illustrated in Figure 4 below, make our life easier and more convenient.

Smart homes that are equipped with smart thermostats, smart appliances and connected heating, lighting and electronic devices can be controlled remotely via computers, smartphones or other mobile devices (Rouse, 2019).

Figure 4: Smart Home


(Source: Timokhina, 2017)
As illustrated in Figure 4 above,

 Indoor cameras and alarms help you better manage your home.
 The thermostat learns about your preferences and automatically adjusts the
temperature. In addition to a comfortable environment at home, it will help you save
on heating and use your energy more efficiently.
 Cameras together with smoke and CO2 alarms make your home a safer place.
 You can monitor and manage all of these devices with your smartphone using a
dedicated application.

Healthcare & fitness

Multiple wearable devices that monitor heart rate, calorie intake, sleep, track activity, and
other metrics to help us stay healthy have flooded the IoT market recently. These health and
fitness devices have been made available by Apple, Samsung, Jawbone, and Misfit, to name a
few wearables that represent IoT use.

In some cases, these wearable devices can communicate with third-party applications and
share information about the user’s chronic conditions with a healthcare provider.

In addition to the personal use of health wearable devices, there are some advanced smart
appliances, including scales, thermometers, blood pressure monitors, and even hair brushes.

Smart medication dispensers are also widely used for home treatment and elderly care. The
appliance allows you to load the prescribed pills and monitor the intake. The mobile
application paired with the device sends timely alerts to the family members or caregivers to
inform them when the medicine is taken, or skipped. It also provides useful data on the
medication intake and sends notifications when your medication is running low.
Automotive

Most cars today are becoming increasingly connected through the inclusion of smart sensors. These smart solutions are sometimes provided by the car manufacturer itself, while others are offered as a third-party solution to make your car “smart”, like remote control and monitoring of your car. A mobile application allows you to control such functions of your car as opening/closing the doors, engine metrics, the alarm system, detecting the car’s location and routes, etc.

The concept for the Smart car started in the early 1970s. Mercedes-Benz engineer Johann Tomforde began to explore city car concepts and designs. He created the first concept sketch, but it wasn’t until the 1990s that Mercedes assembled a team to start the design process.

While connected or even self-driven cars have already become a reality, automotive IoT is expanding to other types of ground transport, including railway transport.

Agriculture

Smart farming, although often overlooked, has grown to include many innovative products aimed at progressive farmers. Some of these “smart products” include a distributed network
of smart sensors to monitor various natural conditions, such as humidity, air temperature,
and soil quality. Other products are used to automate irrigation systems. An example of an
IoT agriculture device is a smart watering system that uses real-time weather data and
forecasts to create an optimal watering schedule for the agricultural area. A smart Bluetooth-
powered controller and a mobile application are used to control the system, making it easy
to install, setup, and manage, as is shown in Figure 5 below.

Figure 5: A drone used to collect data about the crops


(Source: Levy, 2017)
Logistics

Freight, fleet management, and shipping represent another promising area of use for IoT.
With smart tags or sensors, attached to the parcels or items being transported, customers
can track their location, speed, and even transportation or storage conditions.

2.4 IoT Architectures

According to Bilal (2018), the term Internet of Things (IoT) can be expressed through a simple formula: IoT = Services + Data + Networks + Sensors. These basic building blocks of an IoT system are illustrated in Figure 6 below, and each of these elements is discussed thereafter:

Figure 6: IoT Architecture


(Source: Grizhnevich, 2018)
‘A “smart” object is equipped with sensors that gather data on the action to be taken, which is transferred over a network, and with actuators that allow things to act. This concept includes fridges, street lamps, buildings, vehicles, production machinery, rehabilitation equipment and all other smart devices. These sensors may not necessarily be physically attached to the object, but may be located in the object’s immediate environment in order to monitor it. The actions to be taken could include switching a light on or off, opening or closing a door, or increasing or decreasing engine rotation speed. The four key technological enablers of IoT are:

- tagging - RFID technology
- sensing - sensor technology
- thinking - smart technology
- shrinking - nanotechnology (Bilal, 2018)

The gateway provides connectivity between the object and the cloud part of the IoT solution. It enables data pre-processing and filtering before moving the data to the cloud (to reduce the volume of data for detailed processing and storing) and transmits control commands going from the cloud to things. The objects then execute commands using their actuators. The advantage of adopting a cloud gateway is that it ensures compatibility with various protocols and communicates with field gateways using different protocols, depending on what protocol is supported by the relevant gateway.

A data lake is used for storing the data generated by connected devices in its original format. This data, which is generated in “batches” or in “streams”, is large in volume and is therefore commonly referred to as “Big data”. When specific data is needed for analysis, it is extracted from the data lake and loaded into a big data warehouse, where it is filtered, cleaned, structured and matched.

Data analysts use data from the big data warehouse, visualized in schemes, diagrams or infographics, to find trends and decide what actions to implement, or to understand the correlations and patterns needed to create more suitable algorithms for control applications.

To create more precise and more efficient models for control applications, machine learning
is often adopted. Models are regularly updated based on historical data accumulated in the
big data warehouse.

Control applications are responsible for sending automatic commands and alerts to
actuators, for example:

 Windows of a smart home can receive an automatic command to open or close depending
on the forecasts taken from the weather service.
 When sensors show that the soil is dry, watering systems get an automatic command to
water plants.
 Sensors help monitor the state of industrial equipment, and in case of a pre-failure
situation, an IoT system generates and sends automatic notifications to field engineers.
The commands sent by control applications to actuators can also be stored in a big data warehouse to help investigate problematic cases.

User applications are the software component of an IoT system which connects IoT users to
the devices and gives them the option to monitor and control their smart object through a
mobile phone or web application’ (Grizhnevich, 2018).
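A minimal sketch of a control-application rule of the kind described above, using the soil-moisture watering example: when a reading falls below a threshold, a command is sent to the actuator and a copy is archived for later analysis. The send_command and store_command helpers are hypothetical stand-ins for an IoT platform’s actuator and warehouse APIs.

MOISTURE_THRESHOLD = 30  # percent; an invented threshold

def send_command(device_id, command):
    print(f"-> {device_id}: {command}")   # would call the actuator API

def store_command(record):
    print("archived:", record)            # would write to the big data warehouse

def on_sensor_reading(device_id, moisture):
    if moisture < MOISTURE_THRESHOLD:
        command = {"action": "start_watering", "duration_min": 10}
        send_command(device_id, command)
        store_command({"device": device_id, "moisture": moisture, **command})

on_sensor_reading("field-7-sensor-2", 22)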

A simple example of a “smart” outside lighting system, as part of a “smart home”, is described in Figure 7 below.

Figure 7: IoT architecture example – Intelligent lighting


(Source: Grizhnevich, 2018)

Different architectures have been proposed as there is no single consensus on architecture


for IoTs. The most basic architecture is a three-layer architecture, as shown in Figure 8.

Figure 8: 3 Layer Architecture of IoTs (Source: Sethi & Sarangi, 2017)

The perception layer is the physical layer, which has sensors for sensing and gathering parameters about the environment, and for identifying other smart objects in the environment.
The network layer is responsible for connecting to other smart things, network devices, and
servers, and is also used for transmitting and processing sensor data.
The application layer is responsible for delivering application specific services to the user. An
extension of the three-layer architecture is the five-layer architecture, which additionally
includes the processing and business layers as depicted in Figure 9 below.

Figure 9: 5 Layer Architecture of IoTs


(Source: Sethi & Sarangi, 2017)

The transport layer transfers the sensor data from the perception layer to the processing
layer and vice versa through networks such as wireless, 3G, LAN, Bluetooth, RFID, and NFC.
The processing layer stores, analyses, and processes huge amounts of data that come from
the transport layer through various technologies such as databases, cloud computing, and big
data processing. The business layer manages the whole IoT system, including applications,
business and profit models, and users’ privacy.

Another architecture proposed by Ning and Wang in Figure 10 below is inspired by the layers
of processing in the human brain. It is inspired by the intelligence and ability of human beings
to think, feel, remember, make decisions, and react to the physical environment. It is
constituted of three parts. First is the human brain, which is analogous to the processing and
data management unit or the data centre. Second is the spinal cord, which is analogous to
the distributed network of data processing nodes and smart gateways. Third is the network
of nerves, which correspond to the networking components and sensors.

Figure 10: Brain-inspired architecture of IoTs


(Source: Ning & Wang, 2011)
Chapter 3: Big Data Storage and Security

LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:

Learning Objectives

 Demonstrate both practical and theoretical understanding of concepts in Big Data storage and security
 Awareness of the ethics, standards and policies governing the application and implementation of Big Data
 Demonstrate understanding of various technologies and standards relating to IoTs

3.1 Big Data Storage

Big data storage is a storage infrastructure designed specifically to store, manage and retrieve
very large amounts of data and enables real-time data analytics. Big data storage enables the
storage and sorting of big data in such a way that it can easily be accessed, used and processed
by both applications and services working on big data.

Companies apply big data analytics to get greater intelligence from metadata. In most cases,
big data storage uses low-cost hard disk drives, or hybrids mixing disk and flash storage.

Although a specific volume size or capacity is not formally defined, big data storage usually
refers to volumes that grow exponentially to terabyte or petabyte scale. Thus, big data
storage is also able to flexibly scale as required.

A big data storage system clusters a large number of commodity servers attached to high-
capacity disk to support analytic software written to crunch vast quantities of data. The
system relies on massively parallel processing databases to analyse data ingested from a
variety of sources.

Interesting facts about HADOOP:


#1 Can be easily controlled
#2 Debugs simply
#3 Analyzes high scale data
#4 Combines voluminous Data
#5 Transfers Data to HDFS form
#6 Data Compression in HDFS
#7 Transformation in Hadoop

The data itself in big data is unstructured because it typically comes from various sources,
which means mostly file-based and object storage is required. The Apache Hadoop
Distributed File System (HDFS) is the most prevalent analytics engine for big data, and is
typically combined with some features of a NoSQL database.

Figure 11: Hadoop Services


(Source: Beyond Corner, 2018)

Hadoop is open source software written in the Java programming language. HDFS spreads the
data analytics across hundreds or even thousands of server nodes without impacting on
performance. Through its MapReduce component (as depicted in Figure 11 above), Hadoop
distributes processing thus acting as a safeguard against catastrophic failure. The multiple
nodes serve as a platform for data analysis at a network's edge. When a query arrives,
MapReduce executes processing directly on the storage node on which the data resides. Once
analysis is completed, MapReduce gathers the collective results from each server and
“reduces” them to present a single cohesive response.
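The classic word-count example illustrates this map-and-reduce flow. Hadoop’s native MapReduce API is Java, but the same pattern can be sketched as two small Hadoop Streaming scripts that read from standard input; the file names below are conventional, not prescribed.

# mapper.py: emit "word<TAB>1" for every word in the input split
import sys
for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py: Hadoop Streaming delivers mapper output sorted by key, so counts
# for the same word arrive consecutively and can simply be summed
import sys
current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")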

Figure 12: Hadoop Architecture


(Source: Beyond Corner, 2018)

Hadoop has two major layers and two other supporting modules, which together make up the building blocks of the Hadoop architecture. Each of these building blocks is briefly explained below:
Hadoop Common is a collection of Java libraries and utilities that are required by/common
for other Hadoop modules, which contain all the necessary Java files and scripts required to
start Hadoop.
Hadoop Yet Another Resource Negotiator (YARN) framework is a resource manager used for job scheduling and efficient cluster resource management. It takes responsibility for providing the computational resources (e.g., CPU, storage, memory, etc.) required for application execution.
Hadoop Distributed File System (HDFS™) is suitable for applications having large data sets
because it is designed to be deployed on low-cost hardware. HDFS is responsible for providing
permanent, reliable and distributed storage, with unrestricted, high-speed access to the data
application. This is typically used for storing Inputs & Outputs.
Hadoop MapReduce framework is the core and integral part of the Hadoop architecture. It
efficiently handles large-volume datasets by breaking them into multiple smaller datasets and assigning them to a cluster of computers to work on in parallel.

Storage for big data is designed to collect voluminous data produced at variable speeds by
multiple sources and in varied formats. Industry experts describe this process as the three Vs:
the variety, velocity and volume of data.

Variety describes the different sources and types of data to be mined. Sources include audio
files, documents, email, file storage, images, log data, social media posts, streaming video and
user clickstreams.

Velocity pertains to the speed at which storage is able to ingest big data volumes and run
analytic operations against it. Volume acknowledges that modern data sets are large and growing larger, outstripping the capabilities of existing legacy storage.
The key requirements of big data storage are that it can handle large volumes of data and keep scaling to keep up with growth, and that it can provide the input/output operations per second (IOPS) necessary to deliver data to analytics tools.

The largest big data practitioners, such as Google, Facebook and Apple, run what are known as hyperscale computing environments.

These comprise vast amounts of commodity servers with direct-attached storage (DAS).
Redundancy is at the level of the entire compute/storage unit, and if a unit suffers an outage
of any component it is replaced wholesale, having already failed over to its mirror.

Statistical big data analysis and modelling is gaining adoption in a cross-section of industries,
including aerospace, environmental science, energy exploration, financial markets, genomics,
healthcare and retailing. A big data platform is built for much greater scale, speed and
performance than traditional enterprise storage. Also, in most cases, big data storage targets
a much more limited set of workloads on which it operates.

Figure 13: Big Data Storage Architecture


(Source Rouse, 2017)

For example, as summarised in Figure 13 above, an enterprise resource planning system


might be attached to a dedicated storage area network (SAN). Meanwhile, the clustered
network-attached storage (NAS) supports transactional databases and corporate sales data,
while a private cloud handles on-premises archiving.

It's not uncommon for larger organizations to have multiple SAN and NAS environments that
support discrete workloads. Each enterprise storage silo may contain pieces of data that
pertain to your big data project.
Big data can bring an organization a competitive advantage from large-scale statistical
analysis of the data or its metadata. In a big data environment, the analytics mostly operate
on a defined set of data, using a series of data mining-based predictive modelling forecasts to
gauge customer behaviours or the likelihood of future events.

3.2 Big Data Security

Big Data security is the process of guarding data and analytics processes, irrespective of where they are housed, from any vulnerabilities that could compromise their confidentiality (Taylor, 2017).

Concerns surrounding the storage, management, transmission, mining and analysing of data are also a key concern with Big Data (Digital Guardian, 2017).

From Big Data adoption arises one question: how can you leverage big data’s potential while effectively mitigating big data security risks? (Digital Guardian, 2017)
Security Risks in Big Data

The main security issues related to Big Data are briefly discussed below:

Since Big Data relies heavily on the cloud, many enterprises fear the sense of loss of control
over data that comes with utilizing cloud storage providers and third-party data management
and analytics solutions. The impact of this is significant, as many regulations hold enterprises
accountable for the security of data that may not be in their direct control.

Furthermore, third-party applications of unknown lineage can easily introduce risks into enterprise networks when their security measures do not meet the standards of established enterprise protocols and data governance policies.

Devices introduce yet another layer of big data security concerns, with workers embracing
mobility and taking advantage of the cloud to work anywhere, at any time. With BYOD, a
multitude of devices may be used to connect to the enterprise network and handle data at
any time, so effective big data security for business must address endpoint security with this
in mind.

Additionally, securing privileged user access must be a top priority for enterprises. Certain
users must be given access to highly sensitive data in certain business processes, but avoiding
potential misuse of data can be tricky. Securing privileged user access requires well-defined
security policies and controls that permit access to data and systems required by specific
employee roles while preventing privileged user access to sensitive data where access is not
necessary – a practice commonly referred to as the “principle of least privilege.” It is critically
important to provide a system in which encrypted authentication/validation verifies that
users are who they say they are, and determine who can see what.
Since Big Data implementations distribute their processing jobs across many systems for faster analysis, there are many more systems where security issues can crop up.

Non-relational data stores like NoSQL, which are used in Big Data systems, usually lack
security.

In Big Data architecture, the data which is usually stored in the cloud, is also typically stored
on multiple tiers, depending on business needs for performance. For example, high-priority
“hot” data will usually be stored on flash media for faster performance. Hence securing
storage will mean creating a tier-conscious security strategy.

Security solutions that draw logs from endpoints will need to validate the authenticity of
those endpoints.

Real-time security tools generate a large amount of information; the key is finding a way to
ignore the false alarms, so human talent can be focused on the true breaches.

Data mining solutions should ensure that they are secured against not just external threats,
but insiders, who abuse network privileges, to obtain sensitive information.

Security Technologies/Approaches in Big Data

Granular auditing can help determine when missed attacks have occurred, what the
consequences were, and what should be done to improve matters in the future. This in itself
is a lot of data, and must be enabled and protected to be useful in addressing big data security
issues.

Data provenance primarily concerns metadata (data about data), which can be extremely
helpful in determining where data came from, who accessed it, or what was done with it.
Usually, this kind of data should be analysed with exceptional speed to minimize the time in
which a breach is active. Privileged users engaged in this type of activity must be thoroughly
vetted and closely monitored to ensure they do not become their own big data security issues.

Encryption tools need to secure data in-transit and data at-rest, and more importantly, this
needs to be achieved across massive data volumes. Furthermore, encryption needs to operate
on many different types of data, both user and machine-generated. Encryption tools also need
to work with different analytics toolsets and their output data, and on common big data storage
formats including relational database management systems (RDBMS), non-relational
databases like NoSQL, and specialized file systems such as Hadoop Distributed File System
(HDFS). Encrypted data is useless to external entities, such as hackers, if they do not have the
key to unlock it. Moreover, encrypting data means that both at input and output, information is
completely protected.
Centralized key management has been a security best practice for many years, and it applies equally in big data environments, especially those with wide geographical distribution. Best practices
include policy-driven automation, logging, on-demand key delivery, and abstracting key
management from key usage.

User access control may be the most basic network security tool, but many companies
practice minimal control because the management overhead can be so high. This is dangerous
at both the network level, as well as on the big data platform. Strong user access control
requires a policy-based approach that automates access based on user and role-based
settings. Policy driven automation manages complex user control levels, such as multiple
administrator settings that protect the big data platform against inside attack.

Controlling who has root access to Business Intelligence tools and analytics platforms is
another key to protecting your data. By developing a tiered access system, the opportunities
for an attack can be reduced.

Intrusion detection and prevention systems are security pillars that apply equally to the big data
platform. Big data’s value and distributed architecture naturally lends itself to intrusion
attempts. Intrusion Prevention Systems (IPS) enable security administrators to protect the big
data platform from intrusion. However, should an intrusion succeed despite the IPS, Intrusion
Detection Systems (IDS) quarantine the intrusion before it does significant damage.

Physical security must not be ignored. Physical security should be deployed when the big data platform in the data centre is being built. If your data centre is cloud-based, carefully do due diligence on the cloud provider’s data centre security. Physical security systems serve an
important role in that they can deny data centre access to both strangers, or to staff members
who should not have access to sensitive areas. Video surveillance and security logs will also
serve the same purpose.

Building a strong firewall is another useful big data security tool. Firewalls are effective at
filtering traffic that both enters and leaves servers. Organizations can prevent attacks before
they happen by creating strong filters that block untrusted third parties or unknown data sources.

Essentially, big data security requires a multi-faceted approach. When it comes to enterprises
handling vast amounts of data, both proprietary and obtained via third-party sources, big data
security risks become a real concern.
A comprehensive, multi-faceted approach to big data security encompasses:
 Visibility of all data access and interactions
 Data classification
 Data event correlation
 Application control
 Device control and encryption
 Web application and cloud storage control
 Trusted network awareness
 Access and privileged user control
As illustrated in Figure 14 below, Taylor (2017) has summarised the essential areas of
security required for big data, that is, to:

 keep out unauthorized users


 identify intrusions with firewalls
 implement strong user authentication
 hold regular end-user training
 integrate intrusion protection systems (IPS) and intrusion detection systems (IDS)
 enforce encryption of your data in-transit and at-rest.

Thus big data security environments must operate during three data stages. These are 1) data
ingress (what data is coming in), 2) stored data (what data is stored), and 3) data output (what
data is going out to applications and reports) (Taylor, 2017).

Figure 14: Big Data - Areas of Security


(Source: Taylor, 2017)

Many enterprises have slowly accumulated a series of point solutions, each addressing a
single component of the full big data security picture. While this approach can address
standalone security concerns, the best approach to big data security integrates these
capabilities into a unified system capable of sharing and correlating security alerts, threat
intelligence, and other activity in real time – an approach not unlike the vast and dynamic
concept of big data itself.
Chapter 4: Big Data Strategies and Legal Compliance

LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:

Learning Objectives

 Awareness of ethics, standards, strategies and legal compliance

4.1 How big data can help guide your strategy

As you evaluate your company’s strategy, perhaps even considering a strategic pivot, you’ll
want to gather and utilize all available data to build a deep understanding of your customers,
your competitors, the external factors that impact you and even your own product. The big
data ecosystem will play an important role in this process, enabling insights and guiding
actions in ways not previously possible.
Your customers
Customer data is one of your most important assets. There is more data available today than
ever before, and it can tell you much more than you’d expect about your customers and
potential customers: who they are, what motivates them, what they prefer and what their
habits are. The more data you collect, the more complete your customer picture will become,
so make it your goal to collect as much data as possible, from as many sources as possible.
Getting the data
Start by identifying all customer interaction points:

 Visits to your digital platforms: websites, apps and kiosks.


 Interactions with customer support: phone, email, online chat, etc.
 Social media, including direct messaging, tweets, and posts on accounts you own or they own.
 Records of physical movement, including store videos and movement logs. There are
several technologies for monitoring movement, including embedded sensors, Wi-Fi,
Bluetooth, beacons, and even light frequencies in combination with smartphone apps.
 In some industries, you will have access to additional (non-geo) sensor data from
sensors, RFID tags, personal fitness trackers, etc., which may provide data such as bio-
medical readings, accelerometer data, external temperature, etc.

For each interaction point, make an inventory of:

 what data you can possibly collect;


 what the potential uses of that data are; and
 what privacy and governance factors you should consider (check 5.5).

For many organizations, the interaction points will consist of physical and web stores, Apple
(iOS) and Android apps, social media channels, and staff servicing customers in stores, call
centres, online chat and social media. I’ll illustrate with a few examples.

Digital
Start with your digital platforms. First the basics (not big data yet).
You’ll probably already have some web analytics tags on your website that record high-level
events. Make sure you also
record the key moments in the
customer journey, such as when
the visitor does an onsite search,
selects filters, visits specific
pages, downloads material,
watches your videos or places
items in the checkout basket.
Record mouse events, such as
scrolls and hovers. Make sure
these moments are tagged in a
way that preserves the details you’ll need later, such as adding the details of product
category, price range, and product ID to the web tags associated with each item description
page. This will allow you to quickly do top-of-mind analysis, such as identifying how often
products in certain categories were viewed, or how effective a marketing campaign was in
driving a desired event. In the end, you’ll probably have several dozen or even several
hundred specific dimensions that you add to your out-of-the-box web analytics data. This isn’t
yet big data.
With the additional detailed tags that you’ve implemented, you’ll be able to analyse and
understand many aspects of the customer journey, giving insights into how different types of
customers interact with the products you’ve presented them. We’ll show examples of this
below.
If you haven’t already done so, set up conversion funnels for sequential events that lead to important conversion events, such as purchases. A basic purchase funnel might, for example, run from a product page view, to adding an item to the basket, to starting and then completing checkout.
Each intermediate goal in the conversion funnel is a micro-conversion, together leading to a
macro-conversion (‘checkout’ in this case). Choose your micro-conversions in a way that
reflects increasing engagement and increased likelihood of a final conversion. The funnels you
set up will enable you to analyse drop-off rates at each stage, allowing you to address
potential problem points and increase the percentage of visitors progressing along each stage
of the funnel, eventually reaching the conversion event at the end of the funnel. Depending
on your product, customer movement down the funnel may span several days, weeks or
months, so you’ll need to decide what to consider ‘drop-off’.
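A simplified sketch of measuring drop-off from exported event data with pandas follows; the visitor IDs and events are invented, and a strict funnel analysis would also check that each visitor performed the steps in order and within your chosen time window.

import pandas as pd

events = pd.DataFrame({
    "visitor_id": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
    "event":      ["view_item", "add_to_basket", "checkout",
                   "view_item", "add_to_basket",
                   "view_item", "add_to_basket", "checkout", "purchase",
                   "view_item"],
})

funnel = ["view_item", "add_to_basket", "checkout", "purchase"]
reached = [events.loc[events.event == step, "visitor_id"].nunique() for step in funnel]

for step, count in zip(funnel, reached):
    print(f"{step}: {count} visitors ({count / reached[0]:.0%} of funnel entrants)")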
For privacy and governance in your website, research and comply with local laws governing
the use of web cookies. Make a list of where you are storing the browsing data that identifies
the individual user (such as IP address) and how
you later use the insights you’ve gathered in
customizing your interactions with each
user. For example, if you personalize your
content and marketing based on the users’
online actions, you’ll need to consider the ethical
and legal implications. Remember the example
from Target.
Now the big data part. You’ve set up the web
analytics to record details most meaningful for
you. Now hook your web page up to a big data
system that will record every online event for
every web visitor. You’ll need a big data storage
system, such as HDFS, and you’ll need to implement code (typically JavaScript) that sends the
events to that storage. If you want a minimum-pain solution, use Google Analytics’ premium
service (GA360), and activate the BigQuery integration. This will send your web data to
Google’s cloud storage, allowing you to analyse it in detail within a few hours. If you need
data in real time, you can change the GA JavaScript method sendHitTask and send the same
data to both Google and to your own storage system.

Customer support
Consider recording and analysing all interactions with sales agents and customer support:
phone calls, online chats, emails and even videos of customers in stores. Most of this data is
easy to review in pieces, but difficult to analyse at scale without advanced tools. As you store
these interactions, your customer support agents should enrich them with additional
information, such as customer ID and time of day, and label them with meaningful categories,
such as ‘order enquiry’, ‘new purchase’, ‘cancellation’ or ‘complaint.’ You can then save the
entire data file in a big data storage system (such as MongoDB or HDFS). We’ll show valuable
ways to use this data later in this chapter.
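Storing such an enriched interaction record might look roughly like the pymongo sketch below, assuming a locally running MongoDB instance; the field names, categories and connection string are illustrative only.

from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["customer_support"]

# Raw transcript plus the labels added by the support agent
interaction = {
    "customer_id": "C-10482",
    "channel": "online_chat",
    "timestamp": datetime.now(timezone.utc),
    "category": "complaint",
    "transcript": "My order arrived damaged and I would like a replacement...",
}
db.interactions.insert_one(interaction)

# Later: pull every complaint for large-scale analysis
complaints = list(db.interactions.find({"category": "complaint"}))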
Physical movements
You have a choice of several technologies for monitoring how your customers are moving
within your stores. In addition to traditional video cameras and break-beam lasers across
entrances, there are technologies that track the movement of smartphones based on cellular,
Bluetooth or Wi-Fi interactions. Specialized firms such as ShopperTrak and Walkbase work in
these areas. Such monitoring will help you understand the browsing patterns of your
customers, such as what categories are considered by the same customers and how much
time is spent before a purchase decision. It will help you direct your register and support staff
where needed. Again, this data is valuable even if the customer is kept anonymous.
When a customer arrives at the register and makes a purchase, possibly with a card that is
linked to that customer, you will be able to see not only what is being purchased, but also
what other areas of the store were browsed. You might use this information in future
marketing or you might use it to redesign your store layout if you realize that the current
layout is hampering cross-sell opportunities.
These are just a few examples. In general, start collecting and storing as much detail as
possible, making sure to consider business value, to respect customer privacy and to comply
with local laws in your collection, storage and use of this data. Be careful not to cross the line
between ‘helpful’ and ‘creepy’. Keep your customers’ best interests in mind and assume any
techniques you use will become public knowledge.

Linking customer data


Link the customer data from your interaction points to give a holistic picture of customer
journey. If a customer phones your call centre after looking online at your cancellation policy
webpage, you should be able to connect those two events in your system. To do this, enter a
unique customer field (such as phone number or user name) along with the call record.
If you are monitoring a customer walking through your store, link that footpath with
subsequent register sales information (subject to privacy considerations). Do this by recording
the timestamp and location of the point of sale with the footpath data. Combine the data
centrally to give the complete picture.
Sometimes you’ll use anonymous customer data, such as when analysing traffic flow. Other
times you’ll use named-customer data, such as when analysing lifetime activity. For the
named-customer applications, you’ll want to de-duplicate customers. This is difficult, and
you’ll probably have limited success. The best situation is when customers always present a
unique customer ID when using your service. In an online setting, this would require a highly
persistent and unique login (as with Facebook). Offline, it typically requires photo ID. In most
situations, you won’t have this luxury, so use best efforts to link customer interactions.
You’ll typically face the following problems in identifying your customers:

 Problem: Customers will not identify themselves (e.g. not logging in).
 Possible solutions: Use web cookies and IP addresses to link visits from the same
visitors, producing a holistic picture of
anonymous customer journeys extended
across sessions. Use payment details to link
purchases to customers. Smartphones may
provide information to installed apps that
allow additional linking. Talk to your app
developers about this.
 Problem: Customers create multiple
logins.
 Possible solutions: Clean your customer
database by finding accounts that share
key fields: name, email address, home
address, date of birth, or IP address. A
graph database such as Neo4J can help in this process. Work
with the business to create logic for which customers to merge and which to associate
using a special database field (e.g. ‘spouse of’). Change your account creation process
to detect and circumvent creation of duplicate accounts, such as by flagging email
addresses from existing accounts.
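A minimal sketch of the “shared key fields” idea with pandas: accounts sharing an email address are flagged as merge candidates. The records are invented, and real de-duplication would combine several fields with fuzzy matching and business rules.

import pandas as pd

accounts = pd.DataFrame({
    "account_id": [101, 102, 103, 104],
    "name":  ["J. Smith", "John Smith", "A. Jones", "Ann Jones"],
    "email": ["jsmith@example.com", "jsmith@example.com",
              "ajones@example.com", "ann.jones@example.com"],
})

# Accounts that share an email address are candidates for merging
duplicates = accounts[accounts.duplicated("email", keep=False)]
print(duplicates.sort_values("email"))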

Using the data


Much of your customer data will be useful even in the aggregate and anonymized format
provided by your standard web analytics tool. You’ll see how many customers came at any
hour of the day as well as useful information such as average time spent onsite, number of
pages viewed, how many entered at each page or from each marketing campaign, etc. You’ll
also see the total transactions made by customer segments, such as geography and
acquisition source. This will give you a picture of how and when your products are being used,
particularly when matched against marketing campaigns, holidays, service downtimes and
new initiatives.
Customer journey
The big data insights get much more useful when you use data to build an understanding of
the intents, preferences and habits of your customers. You should already be segmenting your
customers into personas based on static features such as home address, gender, language,
age and possibly income level. Note that Google Analytics can provide useful demographic
information (from their DoubleClick cookie) if you enable this functionality.
Broaden your segmentation criteria to include customer journey data, such as those listed
below.

What filters do they most often use? Is it Price high-to-low?

Price low-to-high? Highest rated item? Newest product? Customers sorting price low-to-high
are probably price conscious. Those sorting price high-to-low or by highest rating are probably
quality conscious. Those sorting newest first may be quality conscious or perhaps
technophiles or early adopters. Those sorting by rating may be late adopters or quality
conscious. All of this will impact how you interact with them. If a customer is quality conscious
but not price conscious, you should present them with high-quality products in search results
and marketing emails, and you should not be presenting them with clearance sales of entry-
level products. You’ll want to interact with the price-conscious customer segments in exactly
the opposite way.

How many items do they typically consider before making a purchase?

This information will help you decide when to intervene in the shopping process, such as by
offering a discount when a customer is about to leave without making a purchase.

What categories do they visit most often?

This will help you segment the customer and return the most relevant results for ambiguous
search phrases (such as ‘jaguar’ the car vs ‘jaguar’ the animal or Panama City, Florida vs
Panama City, Panama). You’ll also use this information to guide which items you market to
the customer.

Do they change the shipping option to save fees?


Again, a signal of price-conscious customers.
Do they read customer reviews?

Do they write reviews? If they always read reviews, don’t present them with poorly reviewed
items in your marketing emails, search results, or cross-sell frames. If they often write
reviews, or if they own social media accounts with an unusually high number of followers,
make sure they get ‘golden glove’ customer support.

What types of marketing do they best respond to?

Do they open newsletters? Do they respond to flash sales? Don’t fill their inbox with
marketing they never respond to. Give them the most relevant media and you’ll increase the
odds they respond rather than clicking ‘unsubscribe’.

If they are active on social media, what are the topics and hashtags they most frequently
mention?

Can you use this knowledge to market more effectively to them? Here again, you are building
customer personas with all relevant information. Try to get connected to their social media accounts.

Customer segments (personas)


Using data, you’ve collected, decide what factors are most meaningful in dividing your
customers into segments (personas). Examples would be ‘price-sensitive males aged 20 to 30’
or perhaps ‘high-spending technophiles who make quick purchasing decisions’, or ‘customers
who browse daily across specific categories but buy only on discount’. You can construct these
segments in a qualitative way, using the intuition of marketing experts guided by the data, or
you can construct the segments in a quantitative way, using analytic tools such
as clustering and principal component analysis. Both are valid methods, but if you have a lot
of data that can be segmented in many ways, the quantitative approach will probably be more
effective.
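A quantitative segmentation might be sketched as below with scikit-learn: journey features are put on a common scale, compressed with principal component analysis and then clustered with k-means. The features and values are invented for illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical journey features per customer:
# [average order value, visits per month, share of purchases made on discount]
X = np.array([
    [250, 2, 0.1], [300, 3, 0.0], [40, 12, 0.8],
    [35, 10, 0.9], [120, 5, 0.4], [110, 6, 0.5],
])

X_scaled = StandardScaler().fit_transform(X)              # common scale
X_reduced = PCA(n_components=2).fit_transform(X_scaled)   # compress correlated features

segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print(segments)   # a cluster (persona) label for each customer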
Inventory
This customer journey data will give deeper insights into how your inventory impacts
customer experience. As you consider removing items with low sales, the customer journey
data may show a high percentage of profitable transactions by customers who found your
site searching for those non-profitable products. You’ll want to understand more about the
customer journey of those customers, what they looked for and why they made their
decisions, before you make the decision to remove products which may have attracted those
customers to your shop in the first place.
On the other hand, you may be selling a product that is often showing up in online search
results but is of no interest to your customers, as demonstrated by your customer journey
data. In this case, you should either change the display of the product search result, or remove
the item altogether, as it is taking up valuable search real estate.
Critical intervention
Apply both basic analysis and advanced machine learning to your customer data and you’ll
likely find ways to decrease churn and increase sales. Basic analysis of where and when your
customers are active will help you with scheduling the shifts and skill sets of your support
personnel. It will also signal what the customer is likely to do next (a customer visiting your
appliance store twice last week may be preparing for a significant purchase).
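As a minimal sketch of that basic analysis, assuming a contact log with timestamp and customer_id columns (both names are assumptions), a simple pivot of contacts by weekday and hour is often enough to plan support shifts:

import pandas as pd

contacts = pd.read_csv("contacts.csv", parse_dates=["timestamp"])  # hypothetical support-contact log
contacts["weekday"] = contacts["timestamp"].dt.day_name()
contacts["hour"] = contacts["timestamp"].dt.hour
# rows: weekday, columns: hour of day, values: number of contacts
load = contacts.pivot_table(index="weekday", columns="hour", values="customer_id", aggfunc="count")
print(load)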
With a bit more analysis, you’ll start detecting subtle but important signals. A European
telecommunications company recently analysed data on customers cancelling their
subscriptions and found that a large number followed the same three or four steps prior to
cancellation, such as reviewing their contract online, then phoning customer support, then
disputing a bill in person and then cancelling the contract. By linking those events, the
company identified signals of impending churn so it could take action.
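A minimal sketch of that kind of signal detection, assuming an event log with customer_id, event and timestamp columns and purely illustrative event names, is to flag customers whose recent history contains the risky sequence in order:

import pandas as pd

events = pd.read_csv("customer_events.csv", parse_dates=["timestamp"])  # hypothetical event log
risk_pattern = ["viewed_contract", "called_support", "disputed_bill"]   # illustrative step names

def matches_pattern(event_names, pattern=risk_pattern):
    # True if the pattern occurs in order (not necessarily consecutively)
    remaining = iter(event_names)
    return all(step in remaining for step in pattern)

history = events.sort_values("timestamp").groupby("customer_id")["event"].apply(list)
at_risk = history[history.apply(matches_pattern)].index.tolist()
print(f"{len(at_risk)} customers show the churn-risk sequence")

Customers on that list can then be routed to a retention offer or a proactive support call.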
At an even more advanced level, machine learning techniques could detect pending account
cancellation or the likelihood of a sale based on an analysis of text, audio or video. Such a
system might be a significant investment of your time and resources, but you might have the
business case to justify it, or you might find a vendor who has already developed a technology
suitable for your application, as illustrated in the next case study.
Your competitors
It’s especially challenging to get good information about your competitors. Information
brokers such as Nielsen, Comscore and SimilarWeb will sell their estimations of traffic to your
competitors’ sites and apps, possibly including referrer information. The
website trends.google.com gives charts for the number of searches for specific terms, which
in turn give indications of how you compare with your competitors for brand search
(see the figure below).

Figure: Searches in Brazil for ‘McDonalds’ (top line) vs ‘Burger King’ (bottom line), Q2 2017 (Google Trends).

You’ll be able to get information on competitor inventory, services and physical locations by scraping websites.
Your technology team can help with this (you can use a tool such as Selenium). If you are
competing on price, you’ll want to adjust your pricing based on your competition’s proximity
to your customers. For physical locations, that will be based on address and transportation
routes. For online sales, that will be partially influenced by the referrer sending a visitor to
your site. Customers arriving from price comparison sites should be considered price-
conscious and at high risk of buying from your competitors.
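Returning to the brand-search comparison above, the same Google Trends data can be pulled programmatically. Here is a minimal sketch using the unofficial pytrends package (an assumption; a manual CSV export from trends.google.com works just as well):

from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(["McDonalds", "Burger King"], geo="BR",
                       timeframe="2017-04-01 2017-06-30")
trend = pytrends.interest_over_time()  # weekly search-interest index per term
print(trend[["McDonalds", "Burger King"]].head())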
Work to increase your share of wallet, the percentage of a customer’s spend that goes to your
business rather than to competitors’. Start by using the detailed customer data you have
collected and see what categories and what specific products are typically purchased by the
same customer segments. You’ll be able to see which customers are already making those
cross-purchases, which are browsing but not purchasing, and which are active in only one
category.
Identify the products your customers are purchasing elsewhere to see where you are losing
share of wallet. If you sell groceries and your customer only buys fruits and vegetables, you’ll
know they are buying milk and eggs elsewhere. If you sell electronics and they only buy
smartphones, you’ll know they are buying computers elsewhere. This will help identify areas
where you need to compete harder. By using the customer segments you’ve created, you’ll
see if your competition is appealing more to quality-conscious customers, marketing-reactive
customers, high-spenders, etc.
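A minimal sketch of this cross-category analysis, assuming a journey table with segment, category, customer_id and a purchased flag (all column names are assumptions), is to look for categories with many browsers but few buyers:

import pandas as pd

journeys = pd.read_csv("journeys.csv")  # hypothetical: one row per customer/category interaction
summary = journeys.groupby(["segment", "category"]).agg(
    browsers=("customer_id", "nunique"),
    buyers=("purchased", "sum"),
)
summary["conversion"] = summary["buyers"] / summary["browsers"]
# categories with many browsers but low conversion hint at wallet share lost to competitors
print(summary.sort_values("conversion").head(10))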
Monitor online job boards to get insights into competitors’ initiatives. A significant increase
in postings for a location or job function will indicate activity in that area. Create watch lists
of competitor employees’ LinkedIn profiles and monitor them for anomalies in profile
updates. If an unusually high number of employees are updating their LinkedIn profiles, it may
signal turmoil or pending layoffs within the company. Similarly, natural language processing
run on company public statements can uncover unusual activity. This technique has been
used effectively to signal pending initial public offerings.
External factors
Your strategy will be influenced by factors ranging from government regulation to local
weather. If you are in the travel and tourism industry, regional holidays will impact long-range
bookings and weather will influence impulse bookings. The price of commodities will
influence production and transport costs, and exchange rates or political turmoil will
influence cross-border activity.
Much of the impact of external factors will be from traditional (small) data, but newer (big)
data sources will provide additional, valuable signals. Keep an eye on changes in online
customer activity, which may signal unexpected factors requiring your attention. To illustrate,
consider how Google Maps and Waze can detect construction or road closures simply by
studying driver movements.
To give another example, you may not be aware of the release of an innovative new product
until you see it in onsite searches or detect the impact in sales of your other products. If you
are running a hotel chain and have a property in Scranton, Pennsylvania, you may have no
idea there is a major convention about an obscure topic being planned there during the
second week in February. If you are prepared with a forecast of booking rates for February,
you’ll see the unexpected spike in the customer activity in your booking site and call centres
already in October, before you even know about the February conference. By monitoring
customer activity, you can act to raise room rates in October before running out of under-
priced February inventory a few weeks later.
To this end, you should construct regular forecasts of key metrics, including number of visits
and sales projections. You’ll do this by consulting historic figures, projecting growth rates, and
speaking with your business units to consider anything out of the ordinary (holiday cycles,
major events, regulatory or economic changes, etc.). These forecasts should be segmented
down to the levels at which you can steer your operations (such as product and region) and
should preferably be made at daily granularity. If you automatically monitor these figures at
daily or weekly granularities you can raise an alert whenever they move above or below
expected levels, signalling when some external factor is impacting your business in an
unexpected way.
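A minimal sketch of such monitoring, assuming daily actuals and a forecast file with matching dates (the file names, column names and the 15 per cent tolerance are all assumptions to be tuned per metric):

import pandas as pd

actuals = pd.read_csv("daily_visits.csv", parse_dates=["date"]).set_index("date")
forecast = pd.read_csv("visit_forecast.csv", parse_dates=["date"]).set_index("date")
merged = actuals.join(forecast, how="inner")
merged["deviation"] = (merged["visits"] - merged["forecast"]) / merged["forecast"]
alerts = merged[merged["deviation"].abs() > 0.15]  # flag days more than 15% off forecast
for date, row in alerts.iterrows():
    print(f"{date.date()}: visits {row['visits']:.0f} vs forecast {row['forecast']:.0f} ({row['deviation']:+.0%})")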
Your own product
You need to truly understand your own service and product offerings when evaluating your
strategy. You may not understand them as well as you think, and your customers may
perceive them in completely different ways than you’d expect. What is working and what is
not working? How are customers responding to your products? Where are you losing money
through inefficiencies?
If your web offering is a significant part of your business, find out what is working there and
work to make it better. Create and track micro-conversions to see how your items are
performing even before a purchase is made. These provide valuable insights even if they are
not part of a funnel analysis.
Track the customer engagement with your other digital offerings.

 What is the applause rate on your social media? (How many times are your tweets
liked or retweeted? How many comments on your Facebook posts?)
 How many people download your material?
 How many people sign up for your newsletter?

Test changes in your products by running A/B tests, which you’ll do in the following way:

1. Propose one small change that you think may improve your offering. Change one frame,
one phrase, or one banner. Check with your development team to make sure it’s an easy
change.

2. Decide what key performance indicators (KPI) you most want to increase: revenue,
purchases, up-sells, time onsite, etc. Monitor the impact on other KPIs.

3. Run the original and the changed version (A and B) simultaneously. For websites, use an
A/B tool such as Optimizely. View the results using the tool or place the test version ID
in web tags and analyse specifics of each version, such as by comparing lengths of path
to conversion.

4. Check if results are statistically significant using a two-sample hypothesis test. Have an
analyst do this or use an online calculator such as https://abtestguide.com/calc/ (a minimal
code sketch follows this list).

5. Use your big data system for deeper analysis:

a. Were there significant changes in customer journey, such as number of categories


viewed or filters selected?

b. Are there key product or customer segments you should manage differently?
c. Did specific external events influence results?

d. Did KPIs move in different directions?

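To make step 4 concrete, here is a minimal sketch of a two-sample proportion test using the proportions_ztest function from statsmodels; the conversion counts below are placeholders:

from statsmodels.stats.proportion import proportions_ztest

conversions = [480, 530]   # purchases in variant A and variant B (placeholder numbers)
visitors = [10000, 10000]  # visitors exposed to each variant
stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Difference is statistically significant at the 5% level")
else:
    print("No significant difference detected; keep testing or stay with version A")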
Align your assumptions about your product with these new insights. For example:

 Are you trying to compete on price, while most of your revenue is coming from
customers who are quality conscious?
 Are you not taking time to curate customer reviews, while most of your customers are
actively sorting on and reading those reviews?

If your assumptions about your product don’t align with what you learn about your
customers’ preferences and habits, it may be time for a strategic pivot.
Use modern data and data science (analytics) to get the insights you’ll need to determine and
refine your strategy. Choose selectively the areas of (big) data and data science on which to
focus your efforts, and then determine the necessary tools, teams and processes.
In the next section, I’ll talk about how to choose and prioritize your data efforts.

4.2 Forming your strategy for big data and data science

It’s exciting for me to sit with business leaders to explore ways in which data and analytics
can solve their challenges and open new possibilities. From my experience, there are different
paths that lead a company to the point where they are ready to take a significant step forward
in their use of data and analytics.
Companies that have always operated with a minimal use of data may have been suddenly
blindsided by a crisis or may be growing exasperated by:

 lagged or inaccurate reporting;


 wasted marketing spend;
 time lost to poor sales leads;
 wasted inventory; or
 any one of a host of operational handicaps that can result when data is ignored or data
solutions are constructed in a short-sighted manner.

They end up forced to run damage control in these areas, but are ultimately seeking to
improve operations at a fundamental level and lay the groundwork for future growth.
Companies that have been operating with a data-driven mindset may be exploring innovative
ways to grow their use of data and analytics. They are looking for new data sources and
technologies that will give competitive advantages or are exploring ways to quickly scale up
and optimize a proven product by applying advances in parallel computing, artificial
intelligence and machine learning.
Regardless of which description best fits your company, the first step you’ll want to take when
re-evaluating your use of data and analytics is to form a strong programme team.
The programme team
Your data initiative programme team should include individuals representing four key areas
of expertise:

1. strategic,
2. business,
3. analytic; and
4. technical.

Strategic expertise
Business expertise
Analytic expertise
Technical expertise

The kick-off meeting

Once you’ve selected your programme team, plan a programme kick-off meeting to lay the strategic
foundation for the analytics initiative, sketching the framework for business applications,
brainstorming ideas, and assigning follow-on steps, which will themselves lead to initial scoping
efforts. The four skill sets represented in the programme team should all be present if possible,
although the business expert may cover the strategic input and it is also possible (but not ideal) to
postpone technology input until the scoping stage.

Also helpful at this stage is to have detailed financial statements at hand. These figures will help focus
the discussion on areas with the most influence on your financials. Bring your standard reports and
dashboards, particularly those that include your key performance indicators (KPIs).

Strategic input: Start the kick-off meeting by reviewing the purpose and principles that govern
your efforts. Continue by reviewing the strategic goals of the company, distinguishing
between the long- and short-term strategic goals. Since some analytics projects will take
significant time to develop and deploy, it’s important to distinguish the timelines of the
strategic goals. If there is no executive or strategic stakeholder involved in the process, the
team members present should have access to documentation detailing corporate strategy. If
there is no such strategic documentation (as is, sadly, sometimes the case), continue the
brainstorming using an assumed strategy of plucking low-hanging fruit with low initial
investment, low likelihood of internal resistance and relatively high ROI.
Business input: After reviewing the principles and strategy, review the KPIs used within the
organization. In addition to the standard financial KPIs, a company may track any number of
metrics. Marketing will track click-through rate, customer lifetime value, conversion rates,
organic traffic, etc. Human resources may track attrition rates, acceptance rates,
absenteeism, tenure, regretted attrition, etc. Finance will typically track financial lead
indicators, often related to traffic (visits, visitors, searches) as well as third-party data.
At this stage, probe more deeply into why certain KPIs are important and highlight the KPIs
that tie in most closely with your strategic and financial goals. Identify which KPIs you should
most focus on improving.
The business experts should then describe known pain points within the organization. These
could come from within any department and could be strategic, such as limited insight into
competition or customer segments; tactical, such as difficulty setting optimal product prices,
integrating data from recent acquisitions or allocating marketing spend; or operational, such
as high fraud rates or slow delivery times.
Ask the business experts to describe where they would like to be in three years. They may be
able to describe this in terms of data and analytics, or they may simply describe this in terms
of envisioned product offerings and business results. A part of this vision should be features
and capabilities of competitors that they would like to see incorporated into their offerings.
Analytics input: By now your business objectives, principles, and strategic goals should be
completely laid out (and ideally written up in common view for discussion). At this point, your
analytics expert should work through the list and identify which of those business objectives
can be matched to standard analytic tools or models that may bring business value in relieving
a pain point, raising a KPI, or providing an innovative improvement. It’s beneficial to have
cross-industry insight into how companies in other industries have benefited from similar
analytic projects.
To illustrate this process, a statistical model may be proposed to solve forecasting inaccuracy,
a graph-based recommendation engine may be proposed to increase conversion rates or
shorten purchase-path length, a natural language processing tool may provide near-real-time
social media analysis to measure sentiment following a major advertising campaign, or a
streaming analytics framework combined with a statistical or machine learning tool may be
used for real-time customer analytics related to fraud prevention, mitigation of cart
abandonment, etc.
Technical input: If IT is represented in your kick-off meeting, they will be contributing
throughout the discussion, highlighting technical limitations and opportunities. They should
be particularly involved during the analytics phase, providing the initial data input and taking
responsibility for eventual deployment of analytics solutions. If your technical experts are not
present during the initial project kick-off, you’ll need a second meeting to verify feasibility
and get their buy-in.
Output of the kick-off
The first output of your programme kick-off should be a document that I refer to as Impact
Areas for Analytics, structured as a table. The first column in this table should be business
goals, written in terminology understandable to everyone. The next column is the
corresponding analytic project, along the lines of the applications already discussed. The next
three columns contain the data, technology and staffing needed to execute the project. If
possible, divide the table into the strategic focus areas most relevant to your company.
By the end of your kick-off meeting, you should have filled out the first two columns of this
matrix.
The second document you’ll create in the kick-off will be an Analytics Effort document. For
each analytics project listed in the first document, this second document will describe:
1. The development effort required. This should be given in very broad terms (small,
medium, large, XL or XXL, with those terms defined however you’d like).
2. An estimate of the priority and/or ROI.
3. The individuals in the company who:
a. can authorize the project; and
b. can provide the detailed subject-matter expertise needed for implementation.
We are looking here for individuals to speak with, not to carry out the project.
These are the ‘A’ and the ‘C’ in the RASCI model used in some organizations.
Distribute the meeting notes to the programme team members, soliciting and incorporating
their feedback. When this is done, return to the programme sponsor to discuss the Impact
Areas for Analytics document. Work with the programme sponsor to prioritize the projects,
referencing the Analytics Effort document and taking into consideration the company’s
strategic priorities, financial landscape, room for capital expenditure and head-count growth,
risk appetite and the various dynamics that may operate on personal or departmental levels.
Scoping phase

Once the projects have been discussed and prioritized with the programme sponsor, you
should communicate with the corresponding authorizers (from the Analytics Effort
document) to set up short (30–60 min) scoping meetings between the analytics expert and
the subject matter expert(s). The exact methods and lines of communication and
authorization will differ by company and by culture.
During the scoping meetings, speak with the individuals who best understand the data and
the business challenge. Your goal at this stage is to develop a detailed understanding of the
background and current challenges of the business as well as the relevant data and systems
currently in use.

The subject experts and the analytics expert then discuss:

 the proposed analytics solution;


 what data might be used;
 how the model might be built and run; and
 how the results should be delivered to the end user (including frequency, format and
technology).

After each scoping meeting, the analytics expert should update the corresponding project
entry on the Analytics Effort document and add a proposed minimum viable product (MVP)
to the project description.

The MVP is the smallest functional deliverable that can demonstrate the feasibility and
usefulness of the analytics project. It should initially have very limited functionality and
generally will use only a small portion of the available data. Collecting and cleaning your full
data set can be a major undertaking, so focus in your MVP on a set of data that is readily
available and reasonably reliable, such as data over a limited period for one geography or
product.

The description should briefly describe the inputs, methodology and outputs of the MVP, the
criteria for evaluating the MVP, and the resources required to complete the MVP (typically
this is only the staff time required, but it might entail additional computing costs and/or third-
party resources). Utilizing cloud resources should eliminate the need for hardware purchases
for an MVP, and trial software licenses should substitute for licensing costs at this stage.

Feed this MVP into whichever project management framework you use in your company (e.g.
scrum or Kanban). Evaluate the results of the MVP to determine the next steps for that
analytics project. You may move the project through several phases before you finally deploy
it. These phases might include:

 several iterations on the MVP to converge on the desired result;


 further manual application with limited scope;
 documented and repeatable application;
 deployed and governed application; and
 deployed, governed and optimized application,

with each successive stage requiring incremental budgeting of time, resources and technology.

It’s very important to keep in mind that analytic applications are often a form of Research &
Development (R&D). Not all good ideas will work. Sometimes this is due to insufficient or
poor-quality data, sometimes there is simply too much noise in the data, or the process that
we are examining does not lend itself to standard models. This is why it’s so important to start
with MVPs, to fail fast, to keep in close contact with business experts and to find projects that
produce quick wins. We’ll talk more about this in the next chapter when we talk about agile
analytics.

4.3 Analytics, algorithms and machine learning

Four types of analytics


It’s quite possible that your biggest initial wins will be from very basic applications of analytics.
Analytics can be extremely complex, but it can also be very simple, and the most basic of
applications are sometimes the most valuable. The preliminary tasks of collecting and
merging data from multiple sources, cleaning the data and summarizing the results in a
well-designed table or graph can produce substantial business value, eliminating fatal
misconceptions and clearly highlighting performance metrics, costs, trends and opportunities.
Gartner has developed a useful framework for classifying application areas of analytics. Their
Analytics Ascendancy Model divides analytic efforts into four categories: descriptive (what
happened), diagnostic (why it happened), predictive (what will happen) and prescriptive (what
should be done). I find this model to be quite useful in discussing analytics.

Models, algorithms and black boxes


As you move into more advanced analytics, you’ll need to choose analytic models. Models are
sets of formulas that approximately describe events and interactions around us. We apply
models using algorithms, which are sequences of actions that we instruct a computer to
follow, like a recipe. As you employ an analytic model to solve your business problem (such
as predicting customer churn or recommending a product), you’ll need to follow three steps:

1. Design the model;


2. Fit the model to the data (also known as ‘training’ or ‘calibrating’ the model); and
3. Deploy the model.
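As a minimal sketch of those three steps, assuming a historical churn table with illustrative feature columns, a simple scikit-learn workflow might look like this (the ‘deployment’ here is just persisting the fitted model for a production service to load):

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("churn_history.csv")  # hypothetical historical data
X = data[["tenure_months", "support_calls", "monthly_spend"]]
y = data["churned"]
model = LogisticRegression(max_iter=1000)             # 1. design: a simple, explainable model
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)                           # 2. fit (train/calibrate) on historical data
print("hold-out accuracy:", model.score(X_test, y_test))
joblib.dump(model, "churn_model.joblib")              # 3. deploy: persist for a scoring service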
Artificial intelligence and machine learning
Analytic software
Programming languages
Analytic tools
Agile analytics

4.4 Governance and legal compliance

You have three primary concerns for securing and governing your data:

1. Proper collection and safeguarding of personal data.


2. Internal governance of your own data.
3. Complying with local laws and law enforcement in each jurisdiction in which you
operate.

This last one can be a huge headache for multinationals, particularly in Europe, where the
General Data Protection Regulation, effective May 2018, carries with it fines for violations of
up to 4 per cent of global turnover or 20 million euros (whichever is larger). The EU will hold
accountable even companies headquartered outside of Europe if they collect or process data
of sufficient numbers of EU residents.
Regardless of legal risk, you risk reputational damage if society perceives you as handling
personal data inappropriately.

Personal data
When we talk about personal data, we often use the term personally identifiable information
(PII), which, in broad terms, is data that is unique to an individual. A passport or driver’s
license number is PII, but a person’s age, ethnicity or medical condition is not. There is no
clear definition of PII. The IP address of the browser used to visit a website is considered PII
in some but not all legal jurisdictions.
There is increased awareness that identities can be determined from non-PII data using data
science techniques, and hence we speak of ‘quasi-identifiers’, which are not PII but can be
made to function like PII. You’ll need to safeguard these as well, as we’ll see in the Netflix
example below.
Identify all PII and quasi-identifiers that you process and store. Establish internal policies for
monitoring and controlling access to them. Your control over this data will facilitate
compliance with current and future government regulations, as well as some third-party
services which will refuse to process PII.
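One common control is to pseudonymize PII columns before data leaves a protected environment. A minimal sketch, assuming illustrative column names and a salt that would in practice be managed and rotated as a protected secret:

import hashlib
import pandas as pd

SALT = "replace-with-secret-salt"                    # assumption: stored as a managed secret
PII_COLUMNS = ["email", "passport_number", "phone"]  # illustrative column names

def pseudonymize(value: str) -> str:
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

customers = pd.read_csv("customers.csv")
for col in PII_COLUMNS:
    customers[col] = customers[col].astype(str).map(pseudonymize)
customers.to_csv("customers_pseudonymized.csv", index=False)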
PII becomes sensitive when it is linked to private information. For example, a database with
the names and addresses of town residents is full of PII but is usually public data. A database
of medical conditions (not PII) must be protected when the database can be linked to PII.
Jurisdictions differ in their laws governing what personal data must be protected (health
records, ethnicity, religion, etc.). These laws are often rooted in historic events within each
region.
There are two focus areas for proper use of sensitive personal data: data privacy and data
protection.

 Data privacy relates to what data you may collect, store and use, such as whether it is
appropriate to place hidden video cameras in public areas or to use web cookies to
track online browsing without user consent.
 Data protection relates to the safeguarding and redistribution of data you have legally
collected and stored. It addresses questions such as whether you can store private
data of European residents in data centres outside of Europe.

Privacy laws
If you’re in a large organization, you will have an internal privacy officer who should be on a
first name basis with your data and analytics leader. If you don’t have a privacy officer, you
should find resources that can advise you on the privacy and data protection laws of the
jurisdictions in which you have customer bases or data centres.
Each country determines its own privacy and data protection laws, with Europe having some
of the most stringent. The EU’s Data Protection Directive of 1995 laid out recommendations
for privacy and data protection within the EU, but, before the activation of the EU-wide
General Data Protection Regulation (GDPR) in May 2018, each country was left to determine
and enforce its own laws. If you have EU customers, you’ll need to become familiar with the
requirements of the GDPR. The rapid rise in the number of Google searches for the term ‘GDPR’
since January 2017 demonstrates that you won’t be alone in this.

The extent to which privacy laws differ by country has proven challenging for multinational
organizations, particularly for data-driven organizations that rely on vast stores of personal
data to better understand and interact with customers. Within Europe over the past years,
certain data that could be collected in one country could not be collected in a neighbouring
country, and the personal data that could be collected within Europe could not be sent
outside of Europe unless the recipient country provided data protection meeting European
standards.
The European Union’s Safe Harbour Decision in 2000 allowed US companies complying with
certain data governance standards to transfer data from the EU to the US. The ability of
US companies to safeguard personal data came into question following the Edward Snowden
affair, so that, on 6 October 2015, the European Court of Justice invalidated the EC’s Safe
Harbour Decision, noting that ‘legislation permitting the public authorities to have access on a
generalized basis to the content of electronic communications must be regarded as
compromising the essence of the fundamental right to respect for private life.’ A replacement
for Safe Harbour, the EU–US Privacy Shield, was approved by the European Commission nine
months later (July 2016).

Privacy and data protection laws vary by legal jurisdiction, and you may be subject to local
laws even if you don’t have a physical presence there.

Data science and privacy revelations


To protect yourself from legal and reputational risk, you’ll need more than just an
understanding of laws. You’ll need to understand how customers perceive your use of data,
and you’ll need to be conscious of how data science techniques can lead to unintended legal
violations.
When Target used statistical models to identify and target pregnant shoppers, they were not
collecting private data, but they were making private revelations with a high degree of
accuracy. They weren’t breaking laws, but they were taking a public relations risk.
It’s interesting to compare the Target example with the Netflix case study at the end of this
chapter: no laws were broken by Target, but the company took reputational risk through
non-transparent use of personal information. Netflix,
on the other hand, aligned its efforts in a very open and transparent way with the interests
of customers, in this case to arrive at better video recommendations. There was little
reputational risk, but there were legal consequences.
Other companies and even governments have fallen victim to such ‘linkage attacks’, in which
linking data sources allows attackers to compromise privacy measures. If your projects
require you to distribute anonymized personal information, you can apply techniques
in differential privacy, an area of research in methods to protect against linkage attacks while
maintaining data accuracy for legitimate applications. You may need this even for internal use
of data, as laws are increasingly limiting companies’ rights to use personal data without
explicit consent.
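The core idea can be illustrated with a minimal sketch: adding calibrated Laplace noise to an aggregate before it is released. The epsilon value and the query are placeholders; production use calls for a vetted library and a managed privacy budget:

import numpy as np

def private_count(true_count, epsilon=0.5, sensitivity=1.0):
    # Laplace mechanism: noise scaled to sensitivity/epsilon masks any single individual's contribution
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# e.g. releasing how many customers in a small region bought a sensitive product
print(private_count(true_count=42))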
Be aware that the behavioural data you are storing on your customers may hide more
sensitive information than you realize. To illustrate, the Proceedings of the National Academy
of Sciences documented a study conducted on the Facebook Likes of 58,000 volunteers. The
researchers created a model that could, based only on a person’s ‘Likes’, identify with high
accuracy a range of sensitive personal attributes, including:
 Sexual orientation;
 ethnicity;
 religious and political views;
 personality traits;
 intelligence;
 happiness;
 use of addictive substances;
 parental separation;
 age; and
 gender.

By analysing the Facebook Likes of the users, the model could distinguish between Caucasians
and African Americans with 95 per cent accuracy.
So we see that two of the most fundamental tools within data science, the creative linking of
data sources and the creation of insight-generating algorithms, both increase the risk of
revealing sensitive personal details within apparently innocuous data. Be aware of such
dangers as you work to comply with privacy laws in a world of analytic tools that are
increasingly able to draw insights from and identify hidden patterns within big data.

Data governance
Establish and enforce policies within your organization for how employees access and use the
data in your systems. Designated individuals in your IT department, in collaboration with your
privacy officers and the owners of each data source, will grant and revoke access to restricted
data tables using named or role-based authorization policies and will enforce these policies
with security protocols, often keeping usage logs to verify legitimate data usage. If you are in
a regulated industry, you will be subject to more stringent requirements, where data
scientists working with production systems may need to navigate a half dozen layers of
security to get to the source data. In this case, you’ll want to choose an enterprise big data
product with features developed for high standards of security and compliance.
Adding a big data repository to your IT stack may make it more difficult to control access to,
usage of and eventual removal of personal information. In traditional data stores, data is kept
in a structured format and each data point can be assessed for sensitivity and assigned
appropriate access rights. Within big data repositories, data is often kept in unstructured
format (‘schema on read’ rather than ‘schema on write’), so it is not immediately evident
what sensitive data is present.
You may need to comply with right to be forgotten or right to erasure laws, particularly within
Europe, in which case you must delete certain personal data on request. With big data stores,
particularly the prevalent ‘data lakes’ of yet-to-be-processed data, it’s harder to know where
personal data is stored in your systems.
GDPR will limit your use of data from European customers, requiring express consent for
many business applications. This will limit the efforts of your data scientists, and you’ll also
be accountable under ‘right to explanation’ laws for algorithms that impact customers, such
as calculations of insurance risk or credit score. You will likely need to introduce new access
controls and audit trails for data scientists to ensure compliance with GDPR.
A full discussion of GDPR is beyond the scope of this book, and we’ve barely touched on the
myriad other regulations in Europe and around the world. Also (quick disclaimer) I’m not a
lawyer. Connect with privacy experts knowledgeable in the laws of the jurisdictions in which
you operate.

4.5 Governance for reporting

Moving on from the topics of legal compliance and data protection, I’ll briefly touch on an
optional governance framework, which should reduce internal chaos in your organization and
ease the lives of you and your colleagues. You should develop and maintain a tiered
governance model for how internal reports and dashboards are assembled and distributed
within your organization. Most organizations suffer tremendously from not having such a
model. Executives sit at quarter-end staring in dismay at a collection of departmental reports,
each of which defines a key metric in a slightly different way. At other times, a quick analysis
from an intern works its way up an email chain and may be used as input for a key decision in
another department.
From my experience, you’ll spare yourself tremendous agony if you develop a framework for:

1. Unifying definitions used in reports and dashboards.


2. Clarifying the reliability and freshness of all reports and dashboards.

One way to do this is to introduce a multi-tiered certification standard for your reports and
dashboards. The first tier would be self-service analysis and reports that are run against a
development environment. Reports at this level should never leave the unit in which they are
created. A tier one report that demonstrates business value can be certified and promoted to
tier two. Such a certification process would require a degree of documentation and
consistency and possibly additional development, signed off by designated staff. Tier-two
reports that take on more mission-critical or expansive roles may be promoted to a third tier,
etc. By the time a report lands on the desk of an executive, the executive can be confident of
its terminology, consistency and accuracy.
Takeaways

 It is important that you identify and govern your use of personally identifiable
information (PII) and quasi-identifiers.
 Establish and enforce governance and auditing of internal data usage.
 Laws related to privacy and data governance differ greatly by jurisdiction and may
impact your organization even if it does not have a physical presence within that
jurisdiction.
 Europe’s GDPR will have a strong impact on any company with customers in the EU.
 Linkage attacks and advanced analytic techniques can reveal private information
despite your efforts to protect it.
 Creating a tiered system for your internal reports and dashboards can provide
consistency and reliability.
Ask yourself
 What measures are you taking to protect personally identifiable information (PII)
within your systems, including protection against linkage attacks? Make sure you are
compliant with regional laws in this area and are not putting your reputation at risk
from privacy infringement, even if legal.
 If you have customers in Europe, what additional steps will you need to take to
become compliant with GDPR? Remember that GDPR fines reach 4 per cent of global
revenue.
 If your organization does not have a privacy officer, whom can you consult for
questions related to privacy and data protection laws? There are global firms that can
provide advice spanning multiple jurisdictions.
 When was the last time you reviewed an important internal report and realized the
terminology used was unclear or the data was inaccurate? What steps did you take to
address the problem? Perhaps you want to initiate an internal reporting governance
programme, such as the one outlined in this chapter.

Case study – Netflix gets burned despite best intentions


Another American company got itself into trouble by not realizing how data science
techniques could de-anonymize legally protected data. In 2006, video rental
company Netflix was 9 years old and had grown to roughly 6 million subscribers. It
had developed a recommendation engine to increase engagement and was looking for
ways to improve those recommendations. In a stroke of apparent genius, Netflix came
up with the Netflix Prize: $1 million to the team that could develop a recommendation
algorithm capable of beating Netflix’s own by a margin of at least 10 per cent. To
support the effort, Netflix released anonymized rental histories and corresponding
ratings for 480,000 viewers. Remember that the Video Privacy Protection Act of 1988
forbade them to release rental histories linked to individuals, but these were
anonymized.
Things moved quickly following Netflix’s release of data on 2 October 2006. Within
six days, a team had already beaten the performance of Netflix’s own
recommendation algorithm by a small margin. Within a few weeks, however, a team
of researchers from the University of Texas had also hit a breakthrough. They had
de-anonymized some of the anonymized rental histories. The researchers had carried
out what is called a linkage attack, linking nameless customer viewing histories to
named individuals from online forums using reviews common to both Netflix and the
forums.
The saga played out for another three years, at which point a team reached the 10 per
cent improvement mark and won the Netflix Prize. Shortly thereafter, a class action
lawsuit was filed against Netflix, accusing it of violating privacy laws. Netflix settled
out of court and understandably cancelled their scheduled follow-up competition.

Chapter 5: IoT technologies and Standards

LEARNING OUTCOMES
After reading this Section of the guide, the learner should be able to:

Learning Objectives

 Demonstrate an understanding of various technologies and


standards relating to IoTs

A huge ecosystem of connected devices, named the Internet of Things, has been expanding
over the globe for the last two decades. Now, the overwhelming number of objects around
us are enabled to collect, process and send data to other objects, applications or servers. They
span numerous industries and use cases, including manufacturing, medicine, automotive,
security systems, transportation and more.

The IoT system can function and transfer information in the online mode only when devices
are safely connected to a communication network.

What makes such a connection possible? The invisible language allowing physical objects to
“talk” to each other consists of IoT standards and protocols. General protocols used for
personal computers, smartphones or tablets may not suit specific requirements (bandwidth,
range, power consumption) of IoT-based solutions. That is why multiple IoT network
protocols have been developed and new ones are still evolving.

There is a multitude of great choices for connectivity options at the engineers’ disposal. This
section explains the complicated abbreviations and helps you make sense of the Internet of
Things standards.

5.1 IoT Protocols Background

The first device connected to the global net appeared in 1982: a Coca-Cola vending
machine that could control the temperature of the machine and keep track of the number of
bottles in it. The term “Internet of Things” is considered to have been coined in 1999 by Kevin
Ashton, an RFID technology researcher.
In the 1990s, all IoT-related activities came down to theoretical concepts, discussions and
individual ideas. The 2000s and 2010s were a period of rapid development when IoT projects
began to succeed and found certain practical applications. Multiple small and large projects
were created, from intelligent lamps and fitness trackers to self-driving cars and smart cities.
This was made possible because of the emergence of wireless connections that could transfer
information over a long distance and the increased bandwidth of Internet communications.

The IoT grew to a completely “different Internet,” so that not all existing protocols were able
to satisfy its needs and provide seamless connectivity. That’s why it became a vital necessity
to create specialized IoT communication protocols and standards. However, some existing
technologies (e.g. HTTP) are also used by the Internet of Things.

5.2 The Best Tools for Internet of Things (IoT) Development

The Internet of Things is penetrating every aspect of our daily life. The IoT phenomenon is already
around us: it is made up of the ordinary objects we use at home, at work or in the streets. The
difference is that all these objects and devices are computerized. They have embedded network
connectivity, can communicate with phones and other gadgets, get information and remain under
control.

IoT devices can be absolutely anything and anywhere: door locks, plugs, lights,
appliances, vehicles, wearables and
other incredible options. Besides, IoT
technologies extend not only to cyber-
physical systems for domestic use but to
industrial facilities as well. Two more concepts have been developed: Building Internet of
Things (BIoT) and Industrial Internet of Things (IIoT). Such a staggering pace of IoT growth is
possible due to the proliferation of wireless networks, cloud computing, M2M connectivity
and software-defined networking development.

As the IoT trend morphs into an industry, the need for reliable, comprehensive developer
toolkits is increasing. IoT developer toolkits provide teams with the tools they need to access
specific networks, test hardware responses to application changes and manage updates. The
driving force behind the Internet of Things projects is more accessible hardware and more
flexible programming languages.
Tools for the Internet of Things (IoT) Development

IoT development generally requires the management of both an actuator and an endpoint.
The actuator monitors the connected device, searching for a specific value that energizes the
endpoint into action. This may be a connected home environment system that allows the user
to monitor the temperature of the home and adjust the thermostat settings remotely, or it
could be a security system that tracks movements within a building and alerts specified users
of changes.
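As a minimal sketch of such an endpoint, assuming a broker address, topic name and temperature threshold that are purely illustrative, the widely used paho-mqtt client could watch a connected thermostat like this:

import json
import paho.mqtt.client as mqtt

BROKER = "broker.example.com"          # assumption: your own broker address
TOPIC = "home/livingroom/temperature"  # assumption: topic published by the sensor
MAX_TEMP = 26.0                        # assumption: comfort threshold in Celsius

def on_message(client, userdata, message):
    reading = json.loads(message.payload)
    if reading["celsius"] > MAX_TEMP:
        # in a real system this would trigger the thermostat actuator or alert the user
        print(f"Too warm ({reading['celsius']} C) - lower the setpoint")

client = mqtt.Client()
client.on_message = on_message
client.connect(BROKER, 1883)
client.subscribe(TOPIC)
client.loop_forever()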

Developing applications for connected devices generally requires that the solution provide
the following:

 Endpoint authentication
 Session creation
 Session destruction and logout
 User accounts and management
 Individual user billing details as needed
 User recent API activity
 Individual device data plan details
 Individual device details
 Device claiming and activation
 Device ordering
 Incoming and outgoing SMS management

IoT solutions distinguish mainstream technology companies from those on the leading edge.
Even companies that operate primarily outside of the tech sector will see benefits in terms of
marketplace recognition and brand identity when successful smart initiatives are launched.
Connecting with the right talent is necessary for creating valuable solutions that fill this market
need.


5.3 IoT Development Platforms

IoT development tools help create smart objects. The first thing you need in order
to build and launch connected products is a platform. There are plenty available today. Each
platform can be an ideal fit for some applications, but not for others, due to the different
characteristics that come with it. Below we’ve listed some popular IoT development
platforms.

IBM Watson
https://www.ibm.com/watson

This platform enables connecting all types of devices and easily developing custom IoT
solutions. The advantages of IBM Watson are obvious: quick and secure connectivity; an
ability to control your data flows; online data analysis; the visualization of critical risks and
the automation of operational responses to them.

Azure
https://azure.microsoft.com/en-in/
The Azure IoT development platform by Microsoft has some important capabilities. It enables
you to collect, analyse and visualize data, integrate it with back-office systems, store and
process large data sets and manage devices. Azure is highly scalable, and it supports a great
number of devices and operating systems.

AWS (Amazon Web Services)


https://aws.amazon.com

AWS is a managed cloud-based platform that supports billions of devices around the world.
It provides secure and easy interaction even if the devices are offline. Amazon’s data centers
are equipped with multiple security levels and ensure seamless access and the safety of your
data. The main advantage of this platform is that no hardware infrastructure is needed. AWS
offers low prices without long-term commitments.

5.4 IoT Operating Systems

When choosing a platform, it’s necessary to decide on the operating system. There are certain
limitations to be considered: low processing power, a smaller amount of RAM and storage.
The most commonly used operating systems for such built-in computers are Linux, Android
and Ubuntu Core. But there is a great number of other IoT OSs available.

Here’s a non-exhaustive list:


RIOT is a free operating system developed by a community consisting of companies, academia
and hobbyists. RIOT supports most low-power devices and microcontroller architectures. It
implements all relevant standards to ensure that the Internet of Things is connected, secure,
durable and provides privacy protection.

Contiki is an open-source operating system for the Internet of Things created by a worldwide
team of developers. It provides powerful low-power Internet communication, supports fully
standard IPv6 and IPv4, along with the recent low-power wireless standards: 6LoWPAN, RPL
and CoAP. Contiki runs on a range of low-power wireless devices; its applications are written
in standard C; development with Contiki is easy and fast.

ARM mbed OS is an open-source embedded operating system. It includes all the necessary
features to develop a connected product. Mbed OS provides multilayer security and a wide
range of communication options with drivers for Bluetooth Low Energy, Thread, 6LoWPAN,
Ethernet and WiFi. What’s more, necessary libraries are included automatically on your
devices to facilitate code writing.

The ThingBox is a set of software already installed and configured on an SDCard. The
ThingBox allows anyone to graphically create an unlimited number of new applications interacting with
connected objects from a simple web-browser. This OS is suitable and easy-to-use for both
technical people and users with no technical background.

Huawei LiteOS is a lightweight, low energy, efficient operating system. It starts up within
milliseconds and responds within microseconds. LiteOS coordinates multiple sensors and
supports long- and short-distance communication.

Raspbian is one of the most widely used platforms for the Internet of Things. It is a free
system optimized for the Raspberry Pi hardware. Raspbian includes basic programs and
utilities to make the hardware run, but it also bundles more than 35,000 packages of
pre-compiled software for easy installation.

Android Things is an operating system from Google. It lets you build professional, mass-
market products on a trusted platform, without previous knowledge of embedded system
design. Android Things lets you leverage existing Android development tools, APIs and
resources, and it provides regular security updates. Android Things ensures the development
of IoT products at scale.

5.5 IoT Programming Languages

Nowadays, IoT software uses more general programming languages than it used to. The
choice of language for your smart service depends on its compatibility with the system, the
code size and memory, general requirements and whether your developer is familiar with this
or that language. Some languages are suitable for general-purpose projects (e.g. Java), others
are more specific (e.g. Parasail). Here is a list of the main languages in use:

1. C and C++ are quite universal and familiar to many programmers. Both languages are
designed to be written close to the hardware they run on, which helps produce efficient code
for a specific embedded system.
2. Java is highly portable and able to run on various hardware. This is a real advantage for
IoT.
3. JavaScript is the most widespread language on the Internet. As the greater part of the
Internet already speaks JavaScript, it’s a great option for IoT, too. When all of the connected
devices understand the servers, it’s much easier to make them function. It’s also possible
to reuse the same JavaScript functions for different devices.
4. Python is an interpreted language, so it is flexible and easy to use in the IoT world. Python
is especially good for data-heavy applications.
5. Go, Rust, Forth, ParaSail, B# — these languages were not adapted from other uses but were
designed specifically for embedded programming, so they fit the Internet of Things like a glove.

5.6 Open-Source Tools for the Internet of Things

IoT developers have numerous open-source tools for the Internet of Things at their disposal.
Utilizing the tools we’ve listed below, you’ll be able to develop successful solutions with ease.

Arduino Starter Kit. This is a cloud-based system that offers both software and hardware. It
can be used even by beginner programmers.

Home Assistant. This tool is aimed at the smart home market and is great for interaction with
smart sensors in your home. The downside is that it doesn’t have a cloud component.

Zetta. This is a cloud-based platform built on Node.js. Zetta is perfect for turning devices into
APIs.

Device Hive. This tool functions as an M2M communications framework. Device Hive is quite
popular for the development of smart homes.

ThingSpeak. This is one of the oldest and most effective tools for IoT applications in the
market. ThingSpeak can process huge amounts of data; it is used in web design applications and
location-tracking tasks. This tool is able to work with other open-source tools.

Node-RED. This is a browser-based tool for wiring the Internet of Things together. It helps
manage the flow of data and integrates with APIs, services and devices.
5.7 Best IoT Development Kits

IoT development is interesting not only for large organizations but for small businesses and
individual developers as well. Here’s a top list of the best tools for the Internet of Things for
hobbyists and start-ups.

https://www.techworld.com/picture-gallery/apps-wearables/-best-iot-starter-kits-for-developers-
3637481/

1. ARM mBed
2. Relayr
3. Microsoft Azure IoT Starter Kits
4. BrickPi
5. VERVE2
6. Kinoma Create
7. Ninja Sphere
8. AWS IoT Starter Kits
9. Helium Development Kit

5.8 IoT Security

The Internet of Things makes ordinary physical objects smarter and broadens the horizons.
Together with these amazing possibilities, security problems arise, as all the connected
devices are subject to cyber-attacks and data leaks. That’s why security points have to be
integrated at every stage of IoT services development and deployment.

A special organization — The IoT Security Foundation — was launched in 2015 in England.
This is evidence that the world of IoT has become an integral part of modern society and its
safety is on the agenda.


5.9 IoT Statistics and Forecast

The IoT ecosystem is currently experiencing a period of rapid growth. According to Ericsson,
in 2018, the number of smart sensors and devices will exceed the number of mobile phones
and will become the largest category of connected devices.

Analysts of the company predict that by 2022, there will be about 29 billion connected
devices, and around 16 billion of them will be associated with IoT.

According to the statistics portal Statista, the global smart home market will reach almost $60
billion in 2017.

Experts also anticipate an increase of investments in IoT security
technologies. Gartner predicts that by 2020, more than 25% of attacks in enterprises will
involve IoT. It is expected that spending on IoT security will reach $547 million in 2018.

5.10 Types of IoT Connections

An IoT system has a three-level architecture: devices, gateways and data systems. The data
moves between these levels via four types of transmission channels.

1. Device to device (D2D) — direct contact between two smart objects when they share
information instantaneously without intermediaries. For example, industrial robots and
sensors are connected to one another directly to coordinate their actions and perform the
assembly of components more efficiently. This type of connection is not very common yet,
because most devices are not able to handle such processes.

2. Device to gateway — telecommunications between sensors and gateway nodes.


Gateways are more powerful computing devices than sensors. They have two main functions:
to consolidate data from sensors and route it to the relevant data system; to analyse data
and, if some problems are found, return it back to the device. There are various IoT gateway
protocols that may better suit this or that solution depending on the gateway computing
capabilities, network capacity and reliability, the frequency of data generation and its quality.

3. Gateway to data systems — data transmission from a gateway to the appropriate data
system. To determine what protocol to use, you should analyse data traffic (frequency of
burstiness and congestion, security requirements and how many parallel connections are
needed).

4. Between data systems — information transfer within data centres or clouds. Protocols
for this type of connection should be easy to deploy and integrate with existing apps, have
high availability, capacity and reliable disaster recovery.

Types of IoT Networks

Networks are divided into categories based on the distance range they provide.

A nanonetwork — a set of small devices (sized a few micrometres at most) that perform very
simple tasks such as sensing, computing, storing and actuation. Such systems are applied in
biomedical, military and other nanotechnology fields.

NFC (Near-Field Communication) — a low-speed network to connect electronic devices at a
distance within 4 cm from each other. Possible applications are contactless payment systems,
identity documents and key cards.

BAN (Body Area Network) — a network to connect wearable computing devices that can be
worn either fixed on the body, or near the body in different positions, or embedded inside
the body (implants).

PAN (Personal Area Network) — a net to link up devices within a radius of roughly one or a
couple of rooms.

LAN (Local Area Network) — a network covering the area of one building.

CAN (Campus/Corporate Area Network) — a network that unites smaller local area networks
within a limited geographical area (enterprise, university).

MAN (Metropolitan Area Network) — a big network for a certain metropolitan area powered
by microwave transmission technology.

WAN (Wide Area Network) — a network that exists over a large-scale geographical area and
unites different smaller networks, including LANs and MANs.
Mesh Networks
Wireless networks can also be categorized according to their topology, i.e. the configuration of
connections between nodes. There may be various combinations: line, ring, star, mesh, fully
connected, tree and bus.

Mesh networks offer the most benefits compared to other topologies: there is no hierarchy, and
each node is connected to as many other nodes as possible. Information can be routed more
directly and efficiently, and the redundancy of paths prevents communication problems. This
makes mesh networks an excellent solution for connected objects.
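To illustrate why a mesh offers redundant routes, here is a small Python sketch, purely a teaching assumption and not tied to any particular radio technology, that models a star and a mesh as adjacency lists and checks whether traffic can still be delivered after a node fails.

```python
# Toy comparison of star vs. mesh topology: can node "A" still reach node "D"
# after an intermediate node fails? Topologies and node names are illustrative.
from collections import deque


def reachable(graph: dict, start: str, goal: str, failed: frozenset = frozenset()) -> bool:
    """Breadth-first search that ignores failed nodes."""
    if start in failed or goal in failed:
        return False
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbour in graph.get(node, []):
            if neighbour not in seen and neighbour not in failed:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


star = {"hub": ["A", "B", "C", "D"], "A": ["hub"], "B": ["hub"],
        "C": ["hub"], "D": ["hub"]}
mesh = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B", "D"],
        "D": ["B", "C"]}

print(reachable(star, "A", "D", failed=frozenset({"hub"})))  # False: the star dies with its hub
print(reachable(mesh, "A", "D", failed=frozenset({"B"})))    # True: the mesh routes around "B"
```

The star topology fails completely when its central hub fails, while the mesh still finds an alternative path, which is the redundancy the paragraph above describes.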

Requirements for IoT Networks

• The capacity to connect a large number of heterogeneous elements
• High reliability
• Real-time data transmission with minimum delays
• The ability to protect all data flows
• The ability to configure applications
• Monitoring and traffic management at the device level
• Cost-effectiveness for a large number of connected objects

5.11 Top Seven IoT Platforms

Businesses are adopting the Internet of Things ever more actively. The data generated by connected
devices can create efficiencies and take your company to the next level.

Any IoT device should connect to other devices, sensors, apps and data networks to transfer
information. An IoT platform serves as a mediator to unite all of them in one system.
Combining many tools in one place, such a platform stores, analyses and manages the plethora of
data generated by the connected assets.

The most popular IoT platforms are still the solutions by the leading vendors such as Amazon,
Microsoft and IBM. But there are lots of other good options on the market. Here, we provide
a review of the best Internet of Things platforms 2019.

Google Cloud IoT


Google launched its IoT platform on the basis of its end-to-end
Google Cloud Platform. Currently, it's one of the top Internet of
Things platforms. Google Cloud IoT is a set of integrated services:

 Cloud IoT Core — to capture and handle the device data
 Cloud Pub/Sub — to ingest event streams from anywhere for real-time stream analytics
 Google BigQuery — to do ad hoc analysis
 Cloud Machine Learning Engine — to apply ML
 Google Data Studio — to visualize data, making reports and dashboards
 Cloud Functions — a serverless environment to trigger automatic changes and build and
connect cloud services
The main components of Cloud IoT Core are:

 A device manager — to register devices with the service, monitor and configure them
 Protocol bridges (MQTT and HTTP) — to connect devices to Google Cloud Platform

Google Cloud also integrates automatically with Internet of Things hardware producers such as
Intel and Microchip.
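As a hedged illustration of how device or gateway data can enter the pipeline of services listed above, the sketch below publishes one telemetry message to a Cloud Pub/Sub topic using the official google-cloud-pubsub client. The project and topic names are placeholders, authentication is assumed to be configured via application default credentials, and this bypasses the Cloud IoT Core MQTT/HTTP bridges for simplicity.

```python
# Hedged sketch: publish one telemetry event to Google Cloud Pub/Sub.
# Assumes `pip install google-cloud-pubsub` and application default credentials;
# "my-project" and "device-telemetry" are placeholder names.
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "device-telemetry")

payload = json.dumps({"device_id": "sensor-42", "temperature": 21.7}).encode("utf-8")

# publish() returns a future; result() blocks until the server acknowledges the message.
future = publisher.publish(topic_path, payload, origin="learner-guide-example")
print("Published message ID:", future.result())
```

From there the data could be analysed ad hoc in BigQuery or visualized in Data Studio, as described above.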

SAP Leonardo

SAP is a leading German software company. In 2017, it launched Leonardo as a purely IoT
platform, but it was later relaunched as a “digital innovation system” in order to integrate more
emerging technologies in one place, such as Artificial Intelligence, Machine Learning, Big Data,
advanced analytics and blockchain. Since one technology alone is rarely enough to deliver good
outcomes for customers, this integrated approach is worthwhile: when technologies are viewed
and implemented jointly, it's easier to support businesses in any digital aspect and accelerate
time to value.
SAP Leonardo is predicted to be the leading platform for the Internet of Things 2018. The
platform offers accelerator packages. An accelerator is a fixed-price package tailored to
specific industries and functions. It comprises methodologies, the necessary licenses,
development and design services. Accelerators help customers create apps from the initial
prototype to the final solution.
SAP Leonardo offers two kinds of services:

 Ready-made applications (e.g. SAP Service Ticketing)
 Microservices and APIs (e.g. the SAP Streaming Analytics microservice) that can be
integrated into the customer's own applications

Cisco IoT Cloud Connect

Cisco IoT Cloud Connect was originally an offering for mobile operators. This
mobility-cloud-based software suite is on the list of the best Internet of Things cloud platforms.
It allows customers to fully optimize and utilize their networks, and provides real-time visibility
and updates across every level of the network.

Its core features include:

 Voice and data connectivity
 SIM lifecycle management
 IP session control
 Customizable billing and reporting
Cisco has experience in creating agricultural Internet of Things solutions, for example for the
National Farmers Federation in Australia.

Its core competencies are:

 Gateways and edge devices manufacturing
 Management and monitoring through the Cisco Jasper Control Centre
 Launching IoT apps through the Development Centre
 Providing security through IoT Threat Defence

Bosch IoT Suite

The German IT company Bosch has become a full-service provider of connectivity and the
Internet of Things with its own open source IoT platform. Now, it can compete with big players
such as Amazon and IBM.

The Bosch IoT Suite is a flexible open source Platform as a Service (PaaS). Bosch focuses on
efficiency and safety and provides cloud services for typical IoT projects. Prototype applications
can be set up and deployed within minutes, and software developers can operate them at high
availability.

Salesforce IoT

Salesforce specializes in customer relationship management (CRM).
Salesforce IoT Cloud is a service running on AWS infrastructure. It is powered by Thunder, a
real-time event processing engine. This is a good solution for customers who want to monitor
their connected devices online and make instant decisions.

The advantages of the Salesforce offering are high speed, a simple point-and-click UI and an
easy-to-use, meaningful user experience. Even non-technical users can easily derive benefits
from digital projects.

Another important feature is MyIoT — a declarative interface for building apps on top of
connected assets data.

Hitachi Lumada

The Japanese IT vendor Hitachi, with its Lumada platform, is also on the list of the best IoT
platforms 2018. Lumada is a comprehensive service that includes Internet of Things, Artificial
Intelligence and Machine Learning technologies. It therefore delivers advanced opportunities to
turn data into intelligent action, solving customer problems before they occur.

Lumada focuses on industrial IoT deployments, which is why it can be run both on-premises
and in the cloud.

General Electric Predix


General Electric launched its PaaS Predix, which is aimed at the industrial market, in 2016.
Predix offers connectivity and analytics and creates digital industrial applications for sectors
such as aviation, healthcare, energy and transportation. The apps allow customers to use
real-time operational data from connected assets and make faster decisions.

Predix includes:

 A catalog of app templates that can be used off-the-shelf with your data from connected
devices
 A low-code Studio that helps non-technical users build industrial IoT apps

Customers can manage connected devices using Predix as a dashboard and create virtual
models (digital twins) of assets to predict and optimize their performance.

Predix runs on the major public cloud infrastructure providers. GE has also created Predix-ready
devices in partnership with Verizon, Cisco and Intel, and recently the company partnered with
Apple to bring Predix apps to iOS devices.

Choosing an IoT Platform

There's no single best choice here, since no one platform is suitable for every digital project.
The choice will always depend on the specific requirements of your business.

Large enterprises are more likely to turn to giants such as Amazon or Microsoft. Their
offerings are the most established, but also the most expensive. Smaller companies may find
more cost-efficient options that will nevertheless perfectly meet their requirements.

5.12 Most Popular Internet of Things Protocols, Standards and Communication Technologies

Now, let’s get to the specifics of IoT wireless protocols, standards and
technologies. There are numerous options and alternatives, but we’ll discuss
the most popular ones.

MQTT
MQTT (Message Queuing Telemetry Transport) is a lightweight protocol for sending simple data
flows from sensors to applications and middleware.

The protocol functions on top of TCP/IP and involves three components: publisher, subscriber
and broker. The publisher collects data and sends it to subscribers via the broker. The broker
also checks the authorization of publishers and subscribers and ensures security.

MQTT suits small, cheap, low-memory and low-power devices.
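As a hedged illustration (not part of the MQTT specification itself), the sketch below uses the widely used paho-mqtt Python client to publish a sensor reading and receive it back as a subscriber through a broker. The broker host and topic name are placeholders, and the code targets the paho-mqtt 1.x callback API.

```python
# Hedged MQTT sketch using paho-mqtt (`pip install paho-mqtt`), 1.x callback API.
# "test.mosquitto.org" is a public test broker; the topic name is a placeholder.
import time

import paho.mqtt.client as mqtt

BROKER = "test.mosquitto.org"
TOPIC = "learner-guide/room1/temperature"


def on_connect(client, userdata, flags, rc):
    # Subscribe once the connection to the broker is acknowledged.
    client.subscribe(TOPIC)


def on_message(client, userdata, msg):
    print(f"{msg.topic}: {msg.payload.decode()}")


client = mqtt.Client()            # this client acts as both publisher and subscriber
client.on_connect = on_connect
client.on_message = on_message
client.connect(BROKER, 1883, keepalive=60)

client.loop_start()               # handle network traffic in a background thread
time.sleep(1)                     # give the connection and subscription time to settle
client.publish(TOPIC, "21.5")     # the broker routes this to all subscribers
time.sleep(1)                     # allow the echoed message to arrive
client.loop_stop()
client.disconnect()
```

In a real deployment the publisher would typically run on the constrained device and the subscriber in an application or middleware layer, with the broker in between.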

DDS

DDS (Data Distribution Service) is an IoT standard for real-time, scalable and high-
performance machine-to-machine communication. It was developed by the Object
Management Group (OMG).

You can deploy DDS both in low-footprint devices and in the cloud.

The DDS standard has two main layers:

 Data-Centric Publish-Subscribe (DCPS), which delivers the information to subscribers
 Data-Local Reconstruction Layer (DLRL), which provides an interface to DCPS functionalities

AMQP

AMQP (Advanced Message Queuing Protocol) is an application layer protocol for
message-oriented middleware environments. It is approved as an international standard.

The processing chain of the protocol includes three components that follow certain rules.

1. Exchange — receives messages and puts them into the queues
2. Message queue — stores messages until they can be safely processed by the client app
3. Binding — states the relationship between the first and second components
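To make the exchange / queue / binding chain above concrete, here is a hedged sketch using the pika client against a RabbitMQ broker (RabbitMQ natively speaks AMQP 0-9-1; the ISO-standardised AMQP 1.0 differs in detail). The host, exchange, queue and routing key names are placeholders.

```python
# Hedged AMQP sketch with pika (`pip install pika`) against a local RabbitMQ broker.
# Demonstrates the three components described above: exchange, queue and binding.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# 1. Exchange — receives messages from publishers
channel.exchange_declare(exchange="telemetry", exchange_type="direct")

# 2. Message queue — stores messages until a client application consumes them
channel.queue_declare(queue="temperature-readings")

# 3. Binding — the rule that links the exchange to the queue
channel.queue_bind(queue="temperature-readings", exchange="telemetry",
                   routing_key="room1")

# Publish one reading; the exchange routes it to the bound queue.
channel.basic_publish(exchange="telemetry", routing_key="room1", body="21.5")

# Pull it back out (basic_get is the simple polling-style consume).
method, properties, body = channel.basic_get(queue="temperature-readings",
                                             auto_ack=True)
print("Received:", body)

connection.close()
```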

Bluetooth

Bluetooth is a short-range communications technology integrated into most smartphones and
mobile devices, which is a major advantage for personal products, particularly wearables.

Bluetooth is well known to mobile users, but a newer protocol that is significant for IoT apps
has appeared: Bluetooth Low Energy (BLE), or Bluetooth Smart. This technology is an important
foundation for the IoT, as it is scalable and flexible, and it is designed to reduce power
consumption (a short device-scan sketch follows the specification list below).
 Standard: Bluetooth 4.2
 Frequency: 2.4GHz
 Range: 50-150m (Smart/BLE)
 Data Rates: 1Mbps (Smart/BLE)
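As a hedged, minimal illustration of BLE in practice, the sketch below uses the cross-platform bleak library to scan for nearby advertising BLE devices. It only discovers and lists devices, it does not connect to them, and it assumes the machine has a working BLE adapter.

```python
# Hedged BLE scan sketch using bleak (`pip install bleak`).
# Lists nearby advertising Bluetooth Low Energy devices; requires a BLE adapter.
import asyncio

from bleak import BleakScanner


async def main() -> None:
    devices = await BleakScanner.discover(timeout=5.0)   # scan for 5 seconds
    for device in devices:
        # device.address is the MAC/UUID; device.name may be None if not advertised
        print(device.address, device.name)


if __name__ == "__main__":
    asyncio.run(main())
```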

Zigbee

ZigBee 3.0 is a low-power, low data-rate wireless network used mostly in industrial settings.

The Zigbee Alliance even created the universal language for the Internet of Things
— Dotdot — which makes it possible for smart objects to work securely on any network and
seamlessly understand each other.

 Standard: ZigBee 3.0, based on IEEE 802.15.4
 Frequency: 2.4GHz
 Range: 10-100m
 Data Rates: 250kbps

WiFi

Wi-Fi is the technology for radio wireless networking of devices. It offers fast data transfer and
is able to process large amounts of data.

This is the most popular type of connectivity in LAN environments.

Standard: Based on IEEE 802.11
Frequencies: 2.4GHz and 5GHz bands
Range: Approximately 50m
Data Rates: 150-200Mbps, 600Mbps maximum

Cellular

Cellular technology is the basis of mobile phone networks, but it is also suitable for IoT apps
that need to function over longer distances. These can take advantage of cellular communication
capabilities such as GSM, 3G, 4G (and soon 5G).

The technology is able to transfer large quantities of data, but its power consumption and costs
are high too. It is therefore best suited to projects that send small amounts of information over
long distances.

Standard: GSM/GPRS/EDGE (2G), UMTS/HSPA (3G), LTE (4G)
Frequencies: 900/1800/1900/2100MHz
Range: 35km (GSM); 200km (HSPA)
Data Rates: 35-170kbps (GPRS), 120-384kbps (EDGE), 384kbps-2Mbps (UMTS), 600kbps-10Mbps
(HSPA), 3-10Mbps (LTE)

LoRaWAN

LoRaWAN (Long Range Wide Area Network) is a protocol for wide area networks. It is designed
to support huge networks (e.g. smart cities) with millions of low-power devices.

LoRaWAN can provide low-cost mobile and secure bidirectional communication in various
industries.

Standard: LoRaWAN
Frequency: Various
Range: 2-5km (urban area), 15km (suburban area)
Data Rates: 0.3-50 kbps
5.13 Standards Bodies

IEC - International Electrotechnical Commission (http://www.iec.ch)

The IEC has produced numerous IoT white papers, including a paper on Wireless Sensor
Networks (WSNs), and it has taken part in key standards development such as IEC 62056
(DLMS/COSEM) for smart meters and OPC-UA for data exchange among applications.
See also the work of the ISO/IEC joint technical committees below.

IEEE - Institute of Electrical and Electronics Engineers (http://www.ieee.org)

The IEEE Standards Association (IEEE-SA) has produced an IoT Ecosystem Study and is
developing the IEEE P2413 standard for an Internet of Things architectural framework. At
the time of writing, P2413 was in an early draft.
Other IEEE standards are also frequently cited in building out IoT infrastructure.
For example, IEEE 802.15.4 defines a standard for low-data-rate, low-power, short-range
radio frequency transmissions for wireless personal area networks (WPANs).

IETF - Internet Engineering Task Force (http://www.ietf.org)

IETF has focused on network routing protocols (for example, IPv6 packets) and is
working on standards for Constrained RESTful Environments in the IETF CoRE working
group. These efforts are addressing how to deploy self-organizing sensor networks
interconnected with IPv6 and building applications using embedded web service
technology.
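To show what the embedded web service style promoted by the CoRE working group looks like in code, here is a hedged sketch that sends a CoAP GET request with the aiocoap library. The coap.me host is a public CoAP test server used purely for illustration and is an assumption, not part of any standard.

```python
# Hedged CoAP client sketch using aiocoap (`pip install aiocoap`).
# Sends a GET request to a public CoAP test server; the URI is illustrative.
import asyncio

from aiocoap import Context, Message, GET


async def main() -> None:
    protocol = await Context.create_client_context()
    request = Message(code=GET, uri="coap://coap.me/test")
    response = await protocol.request(request).response
    print("Response code:", response.code)
    print("Payload:", response.payload.decode(errors="replace"))


if __name__ == "__main__":
    asyncio.run(main())
```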

ISA - International Society of Automation (http://www.isa.org)

ANSI / ISA-100.11a-2011 for “Wireless Systems for Industrial Automation: Process Control
and Related Applications” was approved in September 2014 and published as IEC 62734.
It provides a definition of reliable and secure wireless operations including monitoring,
alerting, supervisory control, open-loop control, and closed-loop applications. After
initial approval by ANSI in 2011, compliant device production began in earnest and over
130,000 connected devices had appeared by the end of 2012. ISA / IEC 62443 (formerly
ISA-99) provides a standard for automation system security.

ISO - International Organization for Standardization (http://www.iso.org)

ISO standards relevant to IoT include ISO18185 for RFID and numerous other supply
chain and sensor standards (ranging from device interfaces designed to monitor
conditions to sensor networking and network security frameworks). At the time of
publication, ISO/AWI 18575 was planned to address products and product packages for
IoT in the supply chain. ISO is often seen as providing a valuable resource for reference
architectures, specifications, and testing procedures.
ISO/IEC JTC/SWG 5
This joint technical committee (JTC) of ISO and IEC produced a subcommittee / working
group (SWG) that identifies market requirements and standardization gaps. It documents
standardization activity for IoT from groups internal and external to ISO and IEC. Areas
of collaboration this SWG focuses on include accessibility, user interfaces, software
and systems engineering, IT education, IT sustainability, sensor networking, automatic
identification and data capture, geospatial information, shipping, packaging, and thermal
performance and energy usage.

W3C - World Wide Web Consortium (http://www.w3.org)

In early 2015, W3C launched a Web of Things initiative to develop web standards based
on IoT and what it calls “a web of data.” Many previous W3C standards efforts are
fundamental to IoT development including XML, SOAP, WSDL, and REST.

Open Source Projects

Open source projects are based on the notion of a shared code base with multiple
committers or contributors. Within IoT, a number of such projects have emerged. Though
not standards in the classic sense, these efforts can become de facto standards if
widespread adoption occurs.

AllSeen Alliance (http://www.allseenalliance.org)

This alliance of over 140 members (as of early 2015) created “AllJoyn,” an open source
framework used in developing IoT projects. The alliance is largely made up of non-IT
companies interested in building IoT solutions. The framework that was created defines
data and power transports, language bindings, platforms, and security methods, as well
as providing a growing array of common services and interfaces.

Contiki (http://www.contiki-os.org)
Contiki provides an open source development environment (written in C) used to
connect low-cost and low-power micro-controllers to the Internet (IPv6 and IPv4). The
environment includes simulators and regression tests.

Eclipse (http://iot.eclipse.org)
Eclipse provides frameworks for developing IoT gateways including Kura (Java and
OSGi services) and Mihini (written in Lua scripts). Industry services are bundled in a
SmartHome project consisting of OSGi bundles and an Eclipse SCADA offering. Tools
and libraries are provided for Message Queuing Telemetry Transport (MQTT), the
Constrained Application Protocol (CoAP), and OMA-DM and OMA LWM2M device
management protocols.

openHAB (http://www.openhab.org)
An open source project called openHAB produced software capable of integrating home
automation systems and technologies through a common interface. It can be deployed to
any intelligent device in the home that can run a Java Virtual Machine (JVM). It includes
a rules engine enabled through user control and provides interfaces via popular mobile
devices (Android, iOS) or via the web.

ThingSpeak (http://www.thingspeak.org)
ThingSpeak provides APIs for “channels” enabling applications to store and retrieve data
and for “charts” providing visualization. ThingHTTP enables a device to connect to a web
service using HTTP over a network or the Internet. Links into Twitter are also provided for
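As a hedged example of the channel API described above, the sketch below writes one field value to a ThingSpeak channel using its HTTP update endpoint via the requests library. The write API key shown is a placeholder that must be replaced with your own channel's key.

```python
# Hedged ThingSpeak write sketch (`pip install requests`).
# YOUR_WRITE_API_KEY is a placeholder; replace it with your channel's write key.
import requests

response = requests.get(
    "https://api.thingspeak.com/update",
    params={"api_key": "YOUR_WRITE_API_KEY", "field1": 21.5},
    timeout=10,
)

# ThingSpeak returns the new entry ID on success, or "0" if the update failed.
print("Entry ID:", response.text)
```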
