HDF-3.0

HDF: NiFi DataFlow Management
HDF Powered by Apache NiFi

Agenda
Day 1
• Introduction to Enterprise Data Flow
• What's new in HDF-3.0
• HDF-3.0 - NiFi Architecture & Features
• HDF System Requirements
• Installing and Configuring HDF
• NiFi User Interface
• Building Your First DataFlow using NiFi
• Anatomy of a Processor Group
Day 2
• Anatomy of a Remote Processor Group
• Attributes in NiFi
• NiFi Expression Language
• Working with Templates
• NiFi Dataflow Optimization
• Data Provenance in NiFi
• NiFi Cluster and State Management
• NiFi Monitoring
Day 3
• HDF and HDP – A Complete Big Data solution
• HDF Best Practices
• Securing HDF with 2-way SSL, LDAP and Kerberos
• HDF Multi-tenancy
• File Based Authorizer
• Ranger Based Authorizer
© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Introductions
• Your name
• Job responsibilities
• Previous NiFi exposure (if any)
• Your expectations for the course

Class Logistics
▪ Schedule
▪ Facilities, breaks, restrooms
▪ Lunch
▪ Computers and Wireless Access

Peek into Enterprise Data Flow
Where do we find Data Flow?
• Remote sensor delivery (Internet of Things - IoT)
• Intra-site / Inter-site / global distribution (Enterprise)
• Ingest for feeding analytics (Big Data)
• Data Processing (Simple Event Processing)

Simplistic View of Enterprise Data Flow
[Diagram: "The Data Flow Thing" connects three activities: Acquire Data, Process and Analyze Data, Store Data]

Basics of Connecting Systems
For every connection between a producer (P1) and a consumer (C1), these must agree:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
6. Frequency of event
7. Authorization access
8. Relevance
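
The agreement check above can be sketched in a few lines. This is only an illustration: the Endpoint class, its field names, and the sample values are assumptions made for the example and are not part of NiFi or HDF.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical description of one side of a connection."""
    protocol: str
    data_format: str
    schema: str
    priority: str
    max_event_bytes: int
    events_per_second: int
    authorization: str
    relevance: str


def mismatches(producer: Endpoint, consumer: Endpoint) -> list[str]:
    """Return the names of the properties on which the two sides disagree."""
    return [
        f.name
        for f in fields(Endpoint)
        if getattr(producer, f.name) != getattr(consumer, f.name)
    ]


# Illustrative values only: P1 speaks SFTP/CSV, C1 expects HTTP/JSON.
p1 = Endpoint("sftp", "csv", "orders-v1", "normal", 1_000_000, 50, "team-a", "orders")
c1 = Endpoint("http", "json", "orders-v1", "normal", 1_000_000, 50, "team-a", "orders")
print(mismatches(p1, c1))  # ['protocol', 'data_format']
```

Every name returned by mismatches is something a dataflow layer between P1 and C1 would have to mediate.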

Realistic View of Enterprise Data Flow
Different organizations/business units across different geographic locations…

Conducting business in different legal and network domains…

Operating on very different infrastructure (power, space, cooling)…

Capable of different volume, velocity, bandwidth, and latency…

Interacting with different business partners and customers

Do you think organizations struggle with dataflow management?

IoT is Driving New Requirements
IoAT Data Grows Faster Than We Consume It
Internet of Anything
NEW data sources:
• Sensors and machines
• Geolocation
• Server logs
• Clickstream
• Web & social
• Files & emails
TRADITIONAL data sources:
• ERP, CRM, SCM
The Opportunity: Unlock transformational business value from a full fidelity of data and analytics for all data.
Much of the new data exists in-flight between systems and devices as part of the Internet of Anything.

Internet of Anything is Driving New Requirements
• Need trusted insights from data at the very edge to the data lake in real time with full fidelity
– Data generated by sensors, machines, geo-location devices, logs, clickstreams, social feeds, etc.
• Modern applications need access to both data-in-motion and data-at-rest
• IoAT data flows are multi-directional and point-to-point
– Very different from existing ETL, data movement, and streaming technologies, which are generally one-directional
• The perimeter is outside the data center and can be very jagged
– This “Jagged Edge” creates new opportunity for security, data protection, data governance and provenance

Meeting IoAT Edge Requirements
• Small footprints: operate with very little power
• Limited bandwidth: can create high latency
• Data availability: exceeds transmission bandwidth; recoverability
• Data must be secured throughout its journey: both the data plane and control plane
• Track from the edge through to the datacenter
[Diagram labels: GATHER, DELIVER, PRIORITIZE]

The Need for Data Provenance
For Operators
• Traceability, lineage
• Recovery and replay
For Compliance
• Audit trail
• Remediation
For Business
• Value sources
• Value IT investment
[Diagram: lineage traced from BEGIN to END]

The Need for Fine-grained Security and Compliance
It’s not enough to say you have encrypted communications
• Enterprise authorization services: entitlements change often
• People and systems with different roles require different access levels
• Tagged/classified data

Real-time Data Flow
It’s not just how quickly you move data; it’s about how quickly you can change behavior and seize new opportunities

HDF powered by Apache NiFi
HDF Powered by Apache NiFi Addresses Modern Data Flow Challenges
Collect: Bring Together
• Logs
• Files
• Feeds
• Sensors
Aggregate all IoAT data from sensors, geo-location devices, machines, logs, files, and feeds via a highly secure lightweight agent
Conduct: Mediate the Data Flow
• Deliver
• Secure
• Govern
• Audit
Mediate point-to-point and bi-directional data flows, delivering data reliably to real-time applications and storage platforms such as HDP
Curate: Gain Insights
• Parse
• Filter
• Transform
• Fork
• Clone
Parse, filter, join, transform, fork, and clone data in motion to empower analytics and perishable insights

Hortonworks DataFlow Manages Data-in-Motion
Sources → Regional Infrastructure → Core Infrastructure
Toward the sources:
▪ Constrained
▪ High-latency
▪ Localized context
Toward the core:
▪ Hybrid – cloud / on-premises
▪ Low-latency
▪ Global context
Apache NiFi, Apache MiNiFi, Apache Kafka, Apache Storm are trademarks of the Apache Software Foundation
NiFi Developed by the National Security Agency
Developed by the NSA over the last 8 years.
"NSA's innovators work on some of the most challenging national security problems imaginable,"
"Commercial enterprises could use it to quickly control, manage, and analyze the flow of information from geographically dispersed sites – creating comprehensive situational awareness"
-- Linda L. Burger, Director of the NSA Technology Transfer Program

A Brief History
2006
NiagaraFiles (NiFi) was first conceived at the National Security Agency (NSA)
November 2014
NiFi is donated to the Apache Software Foundation
(ASF) through NSA’s Technology Transfer Program
and enters ASF’s incubator.
July 2015
NiFi reaches ASF top-level project status

Designed In Response to Real World Demands
Visual User Interface

Drag and drop for efficient, agile operations
Immediate Feedback
Start, stop, tune, replay dataflows in real-time
Adaptive to Volume and Bandwidth
Any data, big or small
Provenance Metadata
Governance, compliance & data evaluation
Secure Data Acquisition & Transport
Fine grained encryption for controlled data sharing
Apache NiFi
• Powerful and reliable system to process and
distribute data.
• Directed graphs of data routing and transformation.
• Web-based User Interface for creating, monitoring,
& controlling data flows
• Highly configurable - modify data flow at runtime,
dynamically prioritize data
• Data Provenance tracks data through entire system
• Easily extensible through development of custom
components
[1] https://nifi.apache.org/

HDF Use Cases
Optimize Splunk:
Reduce costs by pre-filtering data so that only relevant content is forwarded into Splunk
Ingest Logs for Cyber Security:
Integrated and secure log collection for real-time data analytics and threat detection
Feed Data to Streaming Analytics:
Accelerate big data ROI by streaming data into analytics systems such as Apache Storm or Apache Spark Streaming
Data Warehouse Offload:
Convert source data to streaming data and use HDF for data movement before delivering it for ETL processing. Enable ETL processing to be offloaded to Hadoop without having to change source systems.
Move Data Internally:
Optimize resource utilization by moving data between data centers or between on-premises infrastructure and cloud infrastructure
Capture IoT Data:
Transport disparate and often remote IoT data in real time, despite any limitations in device footprint, power or connectivity, avoiding data loss
Big Data Ingest:
Easily and efficiently ingest data into Hadoop
HDF 3.0 – What's New
Hortonworks DataFlow: Data-in-Motion Platform
Platform to COLLECT, CURATE, ANALYZE and ACT ON
data in motion across the data center and cloud

HDF Data-In-Motion Platform – with HDF 3.0 GA Release

Flow Management with Apache NiFi and Apache MiNiFi
Enable easy ingestion, routing, management and delivery of any data anywhere (edge, cloud, data center) to any downstream system with built-in end-to-end security and provenance

New Features in HDF 3.0
Record Processing
Record Based Processing Mechanism

▪ Why?
– Improve operational efficiency
– Intuitive and flexible filtering/routing strategies powered by ‘QueryRecord’
– Simpler dataflow design
▪ What?
– Introduce ‘record’ based operation model
– ‘RecordReader’ and ‘RecordWriter’ controller services
– A series of processors supporting the reader/writer processing mechanism
• Plug in a record reader to de-serialize bytes to record objects
• Plug in a record writer to serialize record objects to bytes
• Enable operations against in-memory record objects
▪ How?

Record Processing
Record Readers and Writers
▪ Record Readers
– JsonTreeReader
– JsonPathReader
– AvroReader
– CSVReader
– GrokReader
– ScriptedReader
▪ Record Writers
– JsonRecordSetWriter
– AvroRecordSetWriter
– CSVRecordSetWriter
– FreeFormTextRecordSetWriter
– ScriptedRecordSetWriter
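
The reader/writer idea can be illustrated outside NiFi with a minimal sketch. The functions below are hypothetical stand-ins written for this example; they are not NiFi's RecordReader or RecordWriter API, and the sample data is invented.

```python
import csv
import io
import json
from typing import Callable, Iterable, Iterator

Record = dict[str, str]


def csv_reader(raw: bytes) -> Iterator[Record]:
    """Reader role: de-serialize CSV bytes into record objects (one dict per row)."""
    yield from csv.DictReader(io.StringIO(raw.decode("utf-8")))


def json_writer(records: Iterable[Record]) -> bytes:
    """Writer role: serialize record objects back to bytes as a JSON array."""
    return json.dumps(list(records)).encode("utf-8")


def process(raw: bytes, reader, writer, keep: Callable[[Record], bool]) -> bytes:
    """Operate on in-memory records between the pluggable reader and writer."""
    return writer(record for record in reader(raw) if keep(record))


# Invented sample content: keep only the ERROR rows and change the format.
flowfile_content = b"id,level\n1,INFO\n2,ERROR\n3,ERROR\n"
out = process(flowfile_content, csv_reader, json_writer,
              lambda record: record["level"] == "ERROR")
print(out.decode("utf-8"))
```

Because the filter only sees record objects, swapping csv_reader or json_writer for another format leaves the filtering logic untouched, which is the design point of the record-based model.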

Component Version

Component Version
Component Versioning
▪ Why?
– Foundational work to enable extension registry
– Foundational work to enhance flow migration experience
▪ What?
– Support multiple versions of the same NAR in a single NiFi instance
– E.g. Hadoop NAR version A: Apache Hadoop client lib; Hadoop NAR version B: proprietary Hadoop client lib
▪ How?

Component Version
Component Versioning

Change Data Capture

CDC
Entry-Level Change Data Capture (CDC)

▪ Why?
– Capture CDC records, and update target DB in real-time
– Based on transaction logs (NOT DB triggers)
▪ What?
– Entry level CDC solution
– Supported source DB: MySQL; other databases will be supported in later releases
– Supported target DB
• INSERTS/UPDATES/DELETES: supports DBs that take standard SQL out of the box
• DDL: need to customize the template due to SQL syntax differences
▪ How?

CDC
Entry-Level Change Data Capture (CDC)
Flow: CaptureChangeMySQL → Processor 2 → Processor 3 → EnforceOrder → PutDBRecord (PutDatabaseRecord)
• CaptureChangeMySQL: leverage plugins on the source DB side to interpret transaction logs
• Processor 2, Processor 3: could be multi-threaded, for optimized performance
• EnforceOrder: ensure sequence of events; must be single threaded; FirstInFirstOut prioritization
• PutDBRecord: convert input to SQL, and execute SQL directly; rollback at failure; must be single threaded
Entry-Level CDC: flow has to be configured following the rules
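
Why ordering has to be restored before the database write can be shown with a small sketch. This is not the EnforceOrder processor's implementation; it assumes each change event carries a hypothetical integer sequence number, and it buffers early arrivals until the gap before them is filled.

```python
def enforce_order(events, first_seq=1):
    """Yield (seq, payload) pairs in sequence order, buffering early arrivals."""
    waiting = {}
    expected = first_seq
    for seq, payload in events:
        waiting[seq] = payload
        # Release every event that is now contiguous with what was already emitted.
        while expected in waiting:
            yield expected, waiting.pop(expected)
            expected += 1


# Multi-threaded upstream processors may hand events over out of order.
arrived = [(2, "UPDATE"), (1, "INSERT"), (4, "DELETE"), (3, "UPDATE")]
ordered = list(enforce_order(arrived))
print(ordered)  # [(1, 'INSERT'), (2, 'UPDATE'), (3, 'UPDATE'), (4, 'DELETE')]
```

A single consumer draining this generator applies the changes in commit order, which is why the final two steps of the flow are described as single threaded.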

Manageability

Manageability
Single Ambari/Ranger Managing HDF and HDP Services
Ambari-NiFi Integration
▪ NiFi cluster management
– Start/stop NiFi service
– Centralized place for managing config files
▪ Ambari to display NiFi metrics
▪ Ambari to manage Kerberos authentication
▪ Make NiFi available as an add-on service to HDP stack
HDF 3.0 Deployment Model
▪ HDP users can share Ambari server
▪ Single Ambari for cluster management, single Ranger for policy management
▪ Benefit customers that have both HDF and HDP
▪ Optimized manageability and reduced operational overhead
Manageability
▪ Why?
– Optimized manageability
– Reduced operational overhead
▪ What?
– Make NiFi available as an add-on service to HDP stack
– Single Ambari for cluster management, single Ranger for policy management
– Available to customers paying for both HDF and HDP support

Manageability
▪ Prerequisites
– Ambari 2.5.1, HDP 2.6.1, HDF 3.0 management pack
Existing HDP customer | Existing HDF customer | Wants to deploy HDF-NiFi | Wants to deploy HDF-Storm/Kafka/SAM | Wants to deploy HDF-StreamInsight TP (HDP dependency) | Deployment scenario
NO | NO | YES | YES | YES | 1 Ambari/Ranger instance, install HDP 2.6.x, add HDF 3.x services
NO | YES | YES | YES | NO | Upgrade to Ambari 2.5.1, upgrade NiFi, add SAM, etc.
NO | YES | YES | YES | YES | 2 Ambari/Ranger instances: one managing existing NiFi; install a new Ambari 2.5.1 to manage StreamInsight

Manageability
▪ Prerequisites
– Ambari 2.5.1, HDP 2.6.1, HDF 3.0 management pack
Existing HDP customer | Existing HDF customer | Wants to deploy HDF-NiFi | Wants to deploy HDF-Storm/Kafka/SAM | Wants to deploy HDF-StreamInsight TP (HDP dependency) | Deployment scenario
YES | NO | YES | YES | YES | Upgrade to HDP 2.6.1, add HDF 3.x services
YES | YES | YES | YES | YES | 2 Ambari/Ranger instances: one managing existing NiFi, one managing Storm/Kafka/SAM

New Processors - Wait and Notify

New Processors
Wait and Notify

Other NiFi Framework Improvements

Better Implementations for Content/Provenance Repositories
▪ New Implementations of Provenance Repository
– Addresses bottleneck previously encountered; more scalable
– Encrypted implementation
• PersistentProvenanceRepository
• VolatileProvenanceRepository
• WriteAheadProvenanceRepository
• EncryptedWriteAheadProvenanceRepository
▪ New Implementations of Content Repository
– Addresses latency previously encountered
– Stores FlowFile content in memory instead of on disk, at the risk of data loss in the event of power/machine failure
• FileSystemRepository
• VolatileContentRepository
Zero-master Clustering
▪ New clustering paradigm
▪ Zero-master clustering
– Multiple entry points, no master node, no single point of failure
– Auto-elected cluster coordinator for cluster maintenance
– Automatic failover handling

Multi-tenant Flow Editing
▪ Multi-tenant flow editing
– Multiple teams making edits to different components at the same time
– Only the component being modified is locked, not the entire flow
– Collaborative model

Multi-tenant Authorization
▪ Component level authorization
– New authorizer API
– “Read” and “Write” permissions
– Protection against unauthorized usage without losing context
▪ Authorization management
– Internal management (NiFi)
– External management (Ranger, etc.)

MiNiFi
MiNiFi Agent
Java agent
▪ Java implementation
▪ Availability
– GA starting with HDF 2.0 (built from scratch, ~40 MB)
Native agent
▪ C++ implementation
▪ Availability
– TP HDF 3.0
– GA post HDF 3.0
▪ Resource efficient (focus on memory and disk)
MiNiFi Management
Near term (HDF 3.0)
▪ Design & deploy
– Push updates
– Config file driven
Long term
▪ Centralized command and control

MiNiFi vs NiFi
MiNiFi is a good fit, if…
▪ A simple flow needs to be deployed on multiple devices
– Design & deploy makes more sense than interactive command and control
– Could be IoT use cases, where you have a large number of devices, such as connected vehicles
– Could be in a data center, for example a number of log servers. The key is that you want to deploy the same flow, with simple functions, on multiple devices
▪ Agent footprint needs to be small
– Telematics devices, etc.
– Edge servers where multiple applications share limited resources, etc.

220+ Processors for Deeper Ecosystem Integration
[Processor cloud: FTP, SFTP, HL7, UDP, XML, HTTP, WebSocket, Email, HTML, Image, Syslog, AMQP; Hash, Encrypt, GeoEnrich, Merge, Tail, Scan, Extract, Evaluate, Replace, Duplicate, Execute, Translate, Split, Fetch, Convert, Route Text, Distribute Load, Route Content, Generate Table Fetch, Route Context, Jolt Transform JSON, Control Rate, Prioritized Delivery]
All Apache project logos are trademarks of the ASF and the respective projects.
NiFi Positioning
[Diagram: Apache NiFi / MiNiFi positioned among four neighboring categories]
• Enterprise Service Bus (Fuse, Mule, etc.)
• Processing Framework (Storm, Spark, etc.)
• Messaging Bus (Kafka, MQ, etc.)
• ETL (Informatica, etc.)

Apache NiFi / Processing Frameworks
NiFi: Simple event processing
• Primarily feed data into processing frameworks, can process data, with a focus on simple event processing
• Operate on a single piece of data, or in correlation with an enrichment dataset (enrichment, parsing, splitting, and transformations)
• Can scale out, but scale up better to take full advantage of hardware resources, run concurrent processing tasks/threads (processing terabytes of data per day on a single node)
⚠ Not another distributed processing framework, but to feed data into those
Processing Frameworks (Storm, Spark, etc.): Complex and distributed processing
• Complex processing from multiple streams (JOIN operations)
• Analyzing data across time windows (rolling window aggregation, standard deviation, etc.)
• Scale out to thousands of nodes if needed
⚠ Not designed to collect data or manage data flow

Apache NiFi / Messaging Bus Services
NiFi: Provide dataflow solution
• Centralized management, from edge to core
• Great traceability, event level data provenance starting when data is born
• Interactive command and control – real time operational visibility
• Dataflow management, including prioritization, back pressure, and edge intelligence
• Visual representation of global dataflow
⚠ Not a messaging bus, flow maintenance needed when you have frequent consumer side updates
Messaging Bus (Kafka, JMS, etc.): Provide messaging bus service
• Low latency
• Great data durability
• Decentralized management (producers & consumers)
• Low broker maintenance for dynamic consumer side updates
⚠ Not designed to solve dataflow problems (prioritization, edge intelligence, etc.)
⚠ Traceability limited to in/out of topics, no lineage
⚠ Lack of global view of components/connectivities
Apache NiFi / Integration, or ingestion, Frameworks
NiFi: End user facing dataflow management tool
• Out of the box solution for dataflow management
• Interactive command and control in the core, design and deploy on the edge
• Flexible failure handling at each point of the flow
• Visual representation of global dataflow and connectivities
• Native cross data center communication
• Data provenance for traceability
⚠ Not a library to be embedded in other applications
Integration framework (Spring Integration, Camel, etc.), ingestion framework (Flume, etc.): Developer facing integration tool with a focus on data ingestion
• A set of tools to orchestrate workflow
• A fixed design and deploy pattern
• Leverage messaging bus across disconnected networks
⚠ Developer facing, custom coding needed to optimize
⚠ Pre-built failure handling, lack of flexibility
⚠ No holistic view of global dataflow
⚠ No built-in data traceability
Apache NiFi / ETL Tools
NiFi: NOT schema dependent
• Dataflow management for both structured and unstructured data, powered by separation of metadata and payload
• Schema is not required, but you can have schema
• Minimum modeling effort, just enough to manage dataflows
• Do the plumbing job, maximize developers’ brainpower for creative work
⚠ Not designed to do heavy lifting transformation work for DB tables (JOIN datasets, etc.). You can create custom processors to do that, but there is a long way to go to catch up with existing ETL tools from a user experience perspective (GUI for data wrangling, cleansing, etc.)
ETL (Informatica, etc.): Schema dependent
• Tailored for Databases/WH
• ETL operations based on schema/data modeling
• Highly efficient, optimized performance
⚠ Must pre-prepare your data, time consuming to build data modeling, and maintain schemas
⚠ Not geared towards handling unstructured data, PDF, Audio, Video, etc.
⚠ Not designed to solve dataflow problems
Unsupported Features

Unsupported Flow Management Features
▪ MiNiFi C++ [Tech preview]
▪ NiFi Docker image (Apache only)
▪ HDF in Cloudbreak
▪ Community processors/components

What's coming in HDF 3.1?

HDF 3.1 Flow Management Key Release Themes
Ecosystem Integration
Focus on:
• Kafka 0.11.x processors
• Merge/infer/validate record processors
• Azure SAS token
• MoveHDFS
• ExecuteSparkJob processors
Core Enhancements
Focus on:
• TDE enabled repo encryption
• Key management controller service
• Kerberos keytab permission management
• Windows MSI for NiFi and MiNiFi Java
• Containerized deployment (Docker)
• Framework level retry
• Referenceable process group
• MiNiFi C++ GA
• MiNiFi Java GA
• MiNiFi Android/iOS libraries
Cross-Product Integration
Focus on:
• Make NiFi deployable via Cloudbreak and HDCloud
• Better Ambari experience
• Automate adding NiFi nodes to existing cluster
• Rolling restart
• Better Ranger experience: group based policy support
• NiFi-Atlas integration
• NiFi-SmartSense integration
• NiFi-Knox integration
HDF-3.0 - NiFi Architecture and Features
Apache NiFi: The three key concepts
• Manage the flow of information
• Data Provenance
• Secure the control plane and data plane

Apache NiFi – Key Features
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Multi-tenant Authorization
• Designed for extension
• Clustering

Flow Based Programming (FBP)
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of entirely new components simply by composition of its components.
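
To make the mapping concrete, here is a minimal sketch of the first three terms: a FlowFile, a bounded Connection, and a processor that respects back pressure. The classes are simplified stand-ins invented for this example and do not reflect NiFi's Java API; the queue limits are arbitrary.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Information packet: content plus attributes (metadata)."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """Bounded buffer between two processors."""

    def __init__(self, max_queued: int):
        self.max_queued = max_queued
        self._queue = deque()

    def is_full(self) -> bool:
        return len(self._queue) >= self.max_queued

    def offer(self, flowfile: FlowFile) -> bool:
        """Enqueue unless the buffer is full; False signals back pressure."""
        if self.is_full():
            return False
        self._queue.append(flowfile)
        return True

    def poll(self):
        """Dequeue the oldest FlowFile, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


def uppercase_processor(inbound: Connection, outbound: Connection) -> bool:
    """Black box: one scheduling pass that transforms a single FlowFile."""
    if outbound.is_full():
        return False  # back pressure: leave the FlowFile queued upstream
    flowfile = inbound.poll()
    if flowfile is None:
        return False
    outbound.offer(FlowFile(flowfile.content.upper(), dict(flowfile.attributes)))
    return True


source, sink = Connection(max_queued=10), Connection(max_queued=1)
for word in (b"a", b"b"):
    source.offer(FlowFile(word))

# The loop plays the scheduler role: the second pass is refused because sink is full.
results = [uppercase_processor(source, sink) for _ in range(2)]
print(results)  # [True, False]
```

The bounded queue is what lets processors run at differing rates without the faster one overwhelming the slower one.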

NiFi Architecture

NiFi Architecture

Primary Components
NiFi executes within a JVM on a host operating system. The primary components of NiFi within the JVM are as follows:
Web Server
• The purpose of the web server is to host NiFi’s HTTP-based command and control API.
Flow Controller
• The flow controller is the brains of the operation.
• It provides threads for extensions to run on and manages their schedule of when they’ll receive resources to
execute.
Extensions
• There are various types of extensions for NiFi which will be described in other documents.
• But the key point here is that extensions operate/execute within the JVM.

Primary Components (Cont.)
FlowFile Repository
• The FlowFile Repository is where NiFi keeps track of the state of what it knows about a given FlowFile that is
presently active in the flow.
• The default approach is a persistent Write-Ahead Log that lives on a specified disk partition.
Content Repository
• The Content Repository is where the actual content bytes of a given FlowFile live.
• The default approach stores blocks of data in the file system.
• More than one file system storage location can be specified so as to get different physical partitions engaged to
reduce contention on any single volume.
Provenance Repository
• The Provenance Repository is where all provenance event data is stored.
• The repository construct is pluggable with the default implementation being to use one or more physical disk
volumes.
• Within each location event data is indexed and searchable.

NiFi Cluster
Starting with the NiFi 1.x/HDF-2.x release, a Zero-Master Clustering paradigm is employed.
NiFi Cluster Coordinator:
• A Cluster Coordinator is the node in a NiFi cluster that is responsible for managing the nodes in the cluster.
• Determines which nodes are allowed in the cluster.
• Provides the most up-to-date flow to newly joining nodes.
Nodes:
• Each cluster is made up of one or more nodes. The nodes do the actual data processing.
Primary Node:
• Every cluster has one Primary Node. On this node, it is possible to run "Isolated Processors" (see below).
ZooKeeper Server:
• It is used to automatically elect a Primary Node and Cluster Coordinator.
We will learn about NiFi clusters in detail in the following lessons.

NiFi - User Interface
• Drag and drop processors to build a flow

• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processors and connections

NiFi - Provenance
• Tracks data at each point as it flows through the system
• Records, indexes, and makes events available for display
• Handles fan-in/fan-out, i.e. merging and splitting data
• View attributes and content at given points in time
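
How lineage survives fan-out and fan-in can be sketched with a toy event log. The class below is a hypothetical illustration, not NiFi's provenance repository; RECEIVE, FORK and JOIN are used as event type names, and the FlowFile identifiers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: int
    event_type: str
    inputs: tuple   # FlowFile ids consumed by the event
    outputs: tuple  # FlowFile ids produced by the event


class ProvenanceLog:
    def __init__(self):
        self.events = []

    def record(self, event_type, inputs=(), outputs=()):
        """Append one event; ids increase in recording order."""
        event = Event(len(self.events) + 1, event_type, tuple(inputs), tuple(outputs))
        self.events.append(event)
        return event

    def lineage(self, flowfile_id):
        """Return every event that contributed to flowfile_id, oldest first."""
        found, pending, visited = {}, [flowfile_id], set()
        while pending:
            current = pending.pop()
            if current in visited:
                continue
            visited.add(current)
            for event in self.events:
                if current in event.outputs:
                    found[event.event_id] = event
                    pending.extend(event.inputs)
        return [found[i] for i in sorted(found)]


log = ProvenanceLog()
log.record("RECEIVE", outputs=["a"])
log.record("RECEIVE", outputs=["b"])
log.record("FORK", inputs=["a"], outputs=["a1", "a2"])  # fan-out: split
log.record("JOIN", inputs=["a1", "b"], outputs=["m"])   # fan-in: merge

print([e.event_type for e in log.lineage("m")])
# ['RECEIVE', 'RECEIVE', 'FORK', 'JOIN']
```

Walking backwards from the merged FlowFile "m" reaches both original RECEIVE events, which is the traceability the slide refers to.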

NiFi - Queue Prioritization
• Configure a prioritizer per connection
• Determine what is important for your
data – time based, arrival order,
importance of a data set
• Funnel many connections down to a
single connection to prioritize across
data sets
• Develop your own prioritizer if needed
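
A per-connection prioritizer can be pictured as a sort key applied to the queue. The sketch below is an assumption-laden illustration rather than NiFi's prioritizer interface: FlowFiles are plain dicts of attributes, a lower "priority" value is served first, and ties fall back to arrival order.

```python
import heapq
from itertools import count


class PrioritizedQueue:
    """Queue that hands out items according to a pluggable prioritizer."""

    def __init__(self, prioritizer):
        self._prioritizer = prioritizer
        self._heap = []
        self._arrival = count()  # tie-breaker: preserves arrival order

    def put(self, flowfile):
        key = (self._prioritizer(flowfile), next(self._arrival))
        heapq.heappush(self._heap, (*key, flowfile))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def by_priority_attribute(flowfile):
    """Hypothetical prioritizer: lower 'priority' attribute goes first, default 5."""
    return int(flowfile.get("priority", 5))


queue = PrioritizedQueue(by_priority_attribute)
queue.put({"name": "a"})
queue.put({"name": "b", "priority": "1"})
queue.put({"name": "c"})

served = [queue.get()["name"] for _ in range(len(queue))]
print(served)  # ['b', 'a', 'c']
```

Funnelling several connections into one queue like this is what allows prioritization across data sets, because all FlowFiles are then compared by the same key.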

NiFi - Extensibility
Built from the ground up with extensions in mind
Service-loader pattern for…
• Processors
• Controller Services
• Reporting Tasks
• Prioritizers
Extensions packaged as NiFi Archives (NARs)

• Deploy to the NiFi lib directory and restart
• Provides ClassLoader isolation
• Same model as standard components

NiFi - Security
Administration: Central management and consistent security
• Automatic NiFi Cluster Coordinator and Primary Node election with ZooKeeper
• Multiple entry points
Authentication: Authenticate users and systems
• 2-Way SSL support out of the box; LDAP Integration; Kerberos Integration
Authorization: Provision access to data
• Multitenant Authorization
• File-based authority provider – Global and Component level Access policies
• Ranger Based Authority Provider
Audit: Maintain a record of data access
• Detailed logging of all user actions
• Detailed logging of key system behaviors
• Data Provenance enables unparalleled tracking from the edge through the Lake
Data Protection: Protect data at rest and in motion
• Support a variety of SSL/encrypted protocols
• Tag and utilize tags on data for fine grained access controls
• Encrypt/decrypt content using pre-shared key mechanisms
• Encrypted Passwords in Configuration Files
Initial Admin: Manually designate the initial admin user granted access to the UI
Legacy Authorized Users: Convert previously configured users and roles to the multi-tenant model
Cluster Node Identities: Secure identities for each node

System Requirements
System Requirements
• Apache NiFi can run on something as simple as a laptop, but it can also be clustered across many enterprise-
class servers.
• The amount of hardware and memory needed will depend on the size and nature of the dataflow involved.
• Requires Java 8 or newer

• Interoperability Requirements
You cannot install HDF on a system where HDP is already installed.
• Supported Operating Systems:
Red Hat Enterprise Linux / CentOS 6 (64-bit) / CentOS 7 (64-bit)
Ubuntu Precise (12.04) (64-bit) / Trusty (14.04) (64-bit)
Debian 7
SUSE Linux Enterprise Server (SLES) 11 SP3 (64-bit)
• Supported Web Browsers:
Microsoft Edge
Mozilla Firefox 24+
Google Chrome 36+
Safari 8+

Required Software Packages
▪ yum (CentOS or RHEL)
▪ zypper (SLES)
▪ php_curl (SLES)
▪ apt-get (Ubuntu)
▪ reposync
▪ rpm (CentOS, RHEL, or SLES)
▪ scp
▪ curl
▪ wget
▪ unzip
▪ chkconfig
▪ tar
▪ Java software, one of the following:
– Oracle JDK 1.8
– OpenJDK 1.8

Supported Databases
▪ Apache Ambari and Ranger frameworks require a database.
▪ The table lists the frameworks and the database choices.
▪ Choosing a single database type simplifies administration.
▪ Database administrators should implement high availability and regularly back up databases.
Database Ambari Ranger

MySQL 5.6 ✔ ✔
Oracle 11g r2, 12c** ✔ ✔
PostgreSQL 8.x, 9.1.13+, 9.3 default ✔
MariaDB 10* ✔ ✔

Ambari and Metrics Collector Hardware Guidelines
▪ The Ambari host should have at least 1 GB RAM, with 500 MB free.
▪ The Ambari Metrics Collector host memory and disk requirements are based on the cluster size.
▪ The Ambari Server and Metrics Collector can be co-located on the same host.
Number of Cluster Nodes | Memory Size Guideline (MB) | Disk Space (GB)
1 | 1024 | 10
10 | 1024 | 20
50 | 2048 | 50
100 | 4096 | 100
300 | 4096 | 100
500 | 8096 | 200
1000 | 12288 | 200
2000 | 16384 | 500

Configuration Best Practices
• Maximum File Handles: Increase the limits by editing /etc/security/limits.conf to something like:
    * hard nofile 50000
    * soft nofile 50000
• Maximum Forked Processes: Increase the allowable number of threads by editing /etc/security/limits.conf:
    * hard nproc 10000
    * soft nproc 10000
• Increase the number of TCP socket ports available:
    sudo sysctl -w net.ipv4.ip_local_port_range="10000 65000"
• Tell Linux you never want NiFi to swap by adding this line to /etc/sysctl.conf:
    vm.swappiness = 0
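
One way to verify the limits before starting NiFi is a small check like the one below. It is a sketch for Unix-like hosts only (the standard-library resource module is not available on Windows); the function names are invented for the example, and the thresholds are simply the values recommended above.

```python
import resource

# Values taken from the recommendations above.
RECOMMENDED = {"nofile": 50000, "nproc": 10000}


def shortfalls(current, recommended=RECOMMENDED):
    """Return {name: (current, recommended)} for each limit below the recommendation."""
    return {
        name: (current[name], wanted)
        for name, wanted in recommended.items()
        if name in current and current[name] < wanted
    }


def read_soft_limits():
    """Read the soft limits of the current process."""
    limits = {}
    for name, key in (("nofile", "RLIMIT_NOFILE"), ("nproc", "RLIMIT_NPROC")):
        rlimit = getattr(resource, key, None)  # RLIMIT_NPROC is not defined everywhere
        if rlimit is None:
            continue
        soft, _hard = resource.getrlimit(rlimit)
        limits[name] = float("inf") if soft == resource.RLIM_INFINITY else soft
    return limits


# An empty dict means the account running this check meets both recommendations.
print(shortfalls(read_soft_limits()))
```

Run it as the user that will own the NiFi process, since limits.conf entries apply per user and per login session.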

HDF Cluster Types and Recommendations
Cluster Type | Description | Nodes | Node Specification | Network
HDF Sandbox | Evaluate HDF on local machine. | 1 VM | At least 4 GB RAM |
Evaluation Cluster | Evaluate HDF in a clustered environment. | 3 VMs/Nodes | 8 cores/vCores, 16 GB of RAM |
Small Development Cluster | Cluster for DEV environments. | 6 VMs/Nodes | 8 cores/vCores, 16 GB of RAM |
Medium QE Cluster | Use this cluster in QE environments. | 8 VMs/Nodes | 8 - 16 cores/vCores, 32 GB of RAM |
Small Production Cluster | Cluster for small PROD environments. | 15 Nodes | 8 - 16 cores, 64 - 128 GB of RAM | 1 GB Bonded NIC
Medium Production Cluster | Cluster for medium PROD environments. | 24+ Nodes | 8 - 16 cores, 64 - 128 GB of RAM | 10 GB Bonded NIC
Large Production Cluster | Cluster for large PROD environments. | 32+ Nodes | 16 cores, 64 - 128 GB of RAM | 10 GB Bonded NIC
NiFi Clusters Scale Linearly
Throughput Target | Number of NiFi nodes | CPU Cores/node | Number of disks/node, size of each disk (RAID 10/SSD) | RAM/node | Ideal Networking Setup
50 MB/s, 1000 events/s | 1-2 | 8+ | 6+, 1TB | 4+ GB | 1 Gigabit bonded NICs
100 MB/s, 10,000 events/s | 3-4 | 16+ | 6+, 2TB | 8+ GB | 1 Gigabit bonded NICs
200 MB/s, 100,000 events/s | 5-7 | 24+ | 12+, 2TB | 8+ GB | 10 Gigabit bonded NICs
400 MB/s, 100,000+ events/s | 7-10 | 24+ | 12+, 2TB | 8+ GB | 10 Gigabit bonded NICs
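
The table can be read as a lookup: pick the first row whose throughput target covers both the data rate and the event rate. The sketch below encodes that reading; the function and field names are invented for the example, and treating the last row as having no upper bound on events is an interpretation of "100,000+".

```python
from typing import NamedTuple, Optional


class Sizing(NamedTuple):
    max_mb_per_s: int
    max_events_per_s: Optional[int]  # None: no upper bound stated ("100,000+")
    nodes: str
    cores_per_node: str
    disks_per_node: str
    ram_per_node: str
    network: str


# Rows transcribed from the table above.
SIZING_TABLE = [
    Sizing(50, 1_000, "1-2", "8+", "6+ x 1TB", "4+ GB", "1 Gigabit bonded NICs"),
    Sizing(100, 10_000, "3-4", "16+", "6+ x 2TB", "8+ GB", "1 Gigabit bonded NICs"),
    Sizing(200, 100_000, "5-7", "24+", "12+ x 2TB", "8+ GB", "10 Gigabit bonded NICs"),
    Sizing(400, None, "7-10", "24+", "12+ x 2TB", "8+ GB", "10 Gigabit bonded NICs"),
]


def suggest(mb_per_s: float, events_per_s: float) -> Optional[Sizing]:
    """Return the first row covering both rates, or None if the table is exceeded."""
    for row in SIZING_TABLE:
        events_ok = row.max_events_per_s is None or events_per_s <= row.max_events_per_s
        if mb_per_s <= row.max_mb_per_s and events_ok:
            return row
    return None


print(suggest(80, 5_000).nodes)    # 3-4
print(suggest(40, 50_000).nodes)   # 5-7: the event rate, not the byte rate, drives sizing
print(suggest(500, 10))            # None: beyond the published guidance
```

Note that a modest byte rate with a high event rate still lands in a larger row, because both columns of the target have to be satisfied.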

NiFi Disk
▪ RAID1 – Will provide best performance/throughput
▪ RAID10 – Small/negligible performance hit; however, adds an additional level of data loss tolerance
▪ Must have separate disks for the repositories (multiple disks can be configured)
▪ Use typical OS RAID (enterprise requirements), typically RAID1

NiFi H/W Expectations
▪ NiFi is designed to take advantage of:
– all the cores on a machine
– all the network capacity
– all the disk speed
– many GB of RAM (though usually not all) on a system
▪ Average flow defined as:
Those in which data arrives at a fairly consistent rate, with a few operations performed on each event/data item, comprising attribute extraction, routing, transformation, compression, and follow-on delivery.
▪ Most important hardware factors:
– Top-end disk throughput as configured, which is a combination of seek time and raw performance
– Network speed
▪ CPU is only a concern when there is a lot of compression, encryption, or media analytics

Installing and Configuring HDF
Objectives
▪ Deployment Scenarios

Deployment Scenarios
Scenario 1: Installing HDF Services on a New HDP Cluster
This scenario applies to you if you are both an HDP and HDF customer and you want to
install a fresh cluster of HDP and add HDF services.
Overview of Steps to Implement the scenario:
1. Install Ambari
2. Install Databases
3. Install HDP Cluster using Ambari
4. Install HDF Management Pack
5. Update HDF Base URL
6. Add HDF Services to HDP cluster

Scenario 2: Installing HDF Services on an Existing HDP Cluster
You have an existing HDP cluster with Storm and/or Kafka services and want to install NiFi or SAM modules on that cluster.
1. Upgrade Ambari
2. Upgrade HDP
5. Update HDF Base URL
6. Add HDF Services to HDP cluster

Scenario 3: Installing a new HDF Cluster
You want to install the entire HDF platform consisting of all flow management and stream
processing components on a new cluster.
1. Install Ambari
4. Install HDF cluster using Ambari

Objectives
▪ The Ambari Installation Process
▪ Installing a new HDF Cluster with Ambari

Scenario 3 : Installing a new HDF Cluster - Options
Installation options, from less automation to more automation:
1. Manual installation from packages
2. Ambari Web UI interactive installation
3. Ambari Blueprint installation

Ambari Interactive Installation Overview
1. Pre-installation steps – prepare the base operating systems for HDF
2. Installation steps using the Ambari Web UI:
1. Download Ambari repo and Install Ambari software [Or Setup Local Repo]
2. Setup the Ambari Server.
4. Start the Ambari Server.
5. Log in to the Ambari Web UI.
6. Use the Ambari installation wizard to:
• Install and Register Ambari agents on the cluster nodes
• Define the HDF service components to install to each node
• Use Ambari agents to install service components on the nodes
• Validate the installation
• Apply additional configurations, if any (SSL/LDAP/Kerberos/Ranger)

Objectives
• The Ambari Installation Process
• Installing a new HDF Cluster with Ambari
• Installing Ambari

Installing and setting up Ambari
• With the proper Ambari repo file in place:
1. Download and install the Ambari Server software
– For example on CentOS: yum -y install ambari-server
2. Then set up the Ambari Server
– Setup initializes the Ambari database configuration
– For example on CentOS: ambari-server setup -s
– Silent installation: the -s option accepts all default settings. Without -s, the installer asks questions.
3. Install the HDF management pack for your OS
– For example on CentOS: ambari-server install-mpack --mpack=http://.../hdf-mpack-name.tar.gz --purge --verbose
4. Start the Ambari Server
– For example on CentOS: ambari-server start
5. Open a browser and log in to the Ambari Web UI
– http://<Ambari_server_hostname>:8080
– Default user name and password are: admin and admin
Ambari Web UI Log In
The default user name and password are admin and admin.

Welcome Screen

Name Your Cluster

Select the Stack Version

Choose Nodes for Ambari Agents
• Typically use fully qualified domain names
• Password-less SSH configured

Install and Register Ambari Agents

Choose Services to Install

Assign Slaves and Clients

Customize Services
• First drill down and then scroll down to view the configuration issue.

Configuration Warnings
• Ambari will warn of any configuration issues before proceeding with the installation.
• Use caution when choosing to proceed without resolving the warning.

Install, Start, and Test

The Ambari Dashboard

NiFi User Interface

Objectives
• The Ambari Driven Installation Process
• Manual Installation Process

Manually Downloading & Installing NiFi
• NiFi can be downloaded from the NiFi Downloads Page:
http://public-repo-1.hortonworks.com/HDF/3.0.0.0/nifi-1.2.0.3.0.0.0-453-bin.tar.gz
OR
http://nifi.apache.org/download.html
• There are two packaging options available:

- A "tarball" that is tailored more to Linux.
- A zip file that is more applicable for Windows users.
- Mac OSX users may also use the tarball or can install via Homebrew.

Configuring NiFi
Configurations for NiFi may vary for different use cases. Use these sections as advice, but consult your
distribution-specific documentation for how best to achieve these recommendations.
Basic configurations to be considered are:
• OS Configuration Best Practices: Typical Linux defaults are not necessarily well tuned for the needs of an
IO intensive application like NiFi.
• Security Configuration: NiFi provides several different configuration options for security purposes. The most
important properties are those under the "security properties" heading in the nifi.properties file.
• Controlling Levels of Access: Configuring who will have access to the system and what types of access
those people will have. NiFi controls this through the use of an Authorizer.

Configuring NiFi (cont..)
• Clustering Configuration: Overview of NiFi Clustering and instructions on how to set up a basic cluster.
• Bootstrap Properties: Allows users to configure settings for how NiFi should be started.
• Notification Services: When the NiFi bootstrap starts or stops NiFi, or detects that it has died
unexpectedly, it is able to notify configured recipients (currently, only email notification is supported).
• NiFi System Properties: The nifi.properties file in the conf directory is the main configuration file for
controlling how NiFi runs.
We will discuss some of these topics in detail in the coming sessions.

Installing and Starting NiFi
Once NiFi has been downloaded and unzipped as described above, it can be started by using the mechanism
appropriate for your operating system.
1) For Windows Users:
• Navigate to the folder where NiFi was installed.
• Within this folder is a subfolder named bin. Navigate to this subfolder and double-click the run-nifi.bat file.
• This will launch NiFi and leave it running in the foreground.
• To shut down NiFi, select the window that was launched and hold the Ctrl key while pressing C.

Installing and Starting NiFi (Cont..)
2) For Linux/Mac OSX users:
• Navigate to the directory where NiFi was installed. To run NiFi in the foreground, run:
bin/nifi.sh run
• This will leave the application running until you press Ctrl-C. At that time, NiFi initiates a shutdown.
• To run NiFi in the background, instead run:
bin/nifi.sh start
• To check the status and see if NiFi is currently running, execute the command:
bin/nifi.sh status
• NiFi can be shut down by executing the command:
bin/nifi.sh stop
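The commands above can also be driven from a script. The following is a minimal sketch, assuming NiFi lives under /opt/HDF-3.0.0/nifi (the path shown on the directory-structure slides); the helper names nifi_command and run_nifi are hypothetical and are not part of NiFi.

```python
import subprocess
from pathlib import PurePosixPath

# Sub-commands of bin/nifi.sh that this deck mentions.
NIFI_ACTIONS = ("run", "start", "stop", "status", "install")


def nifi_command(action, nifi_home="/opt/HDF-3.0.0/nifi", service_name=None):
    """Build the argument list for bin/nifi.sh (hypothetical helper)."""
    if action not in NIFI_ACTIONS:
        raise ValueError(f"unsupported nifi.sh action: {action!r}")
    cmd = [str(PurePosixPath(nifi_home) / "bin" / "nifi.sh"), action]
    # 'install' accepts an optional service name, for example 'dataflow'.
    if service_name is not None:
        if action != "install":
            raise ValueError("a service name only applies to 'install'")
        cmd.append(service_name)
    return cmd


def run_nifi(action, **kwargs):
    """Run the command and return its exit code (requires a real NiFi install)."""
    return subprocess.run(nifi_command(action, **kwargs)).returncode
```

For example, run_nifi("start") followed by run_nifi("status") mirrors the background start and status check described above.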

Installing NiFi as a Service
• Currently, installing NiFi as a service is supported only for Linux and Mac OSX users.
• To install the application as a service with the default name nifi, navigate to the installation directory and execute:
bin/nifi.sh install
• To specify a custom name for the service, execute the command with an optional second argument that is the
name of the service. For example, to install NiFi as a service with the name dataflow, use the command:
bin/nifi.sh install dataflow
• Once installed, the service can be started, stopped, and checked using the appropriate commands:
sudo service nifi start
sudo service nifi stop
sudo service nifi status

When NiFi Starts Up
When NiFi first starts up, the following files and directories are created:
• content_repository: The location of the Content Repository
• database_repository: The location of the H2 database directory
• flowfile_repository: The location of the FlowFile Repository
• provenance_repository: The location of the Provenance Repository
• work directory: The location of the NiFi working directory (documentation, jetty, and nar directories)
• logs directory: The location of the NiFi logs directory
• flow.xml.gz file and the templates directory: flow.xml.gz is the file describing what is displayed on the NiFi graph

NiFi Installation Directory Structure
Bin directory: This contains the executables for starting, stopping, obtaining NiFi status, and creating a NiFi
dump output.
Conf directory: This contains all the files that can/should be configured to control how your NiFi installation
is bounded.
Docs directory: This contains the various guides that are also available via the NiFi UI after it is running.
Lib directory: This contains all the jar and nar files that are included with the currently installed version
of NiFi.
Directory layout:
/opt/HDF-3.0.0/nifi/
    LICENSE
    NOTICE
    README
    bin/
    conf/
    docs/
    lib/
- Note: Once NiFi is started, the number of subdirectories inside the installation directory increases.

NiFi Installation Directory Structure
Content repository: This repository contains the actual content for every FlowFile that is currently active in
the NiFi instance.
Database repository: This directory contains two H2 databases. One database keeps track of all changes made
within the NiFi graph. The other database tracks all users who have accessed the UI.
Flowfile repository: This repository keeps track of all the FlowFiles currently active in NiFi.
Logs: This directory contains the various logs that NiFi outputs.
Provenance repository: This repository contains events reported at various stages of a FlowFile's life through
a dataflow(s).
Work: This directory is where NiFi explodes the various jar and nar packages used by NiFi.
Directory layout:
/opt/HDF-3.0.0/nifi/
    LICENSE
    NOTICE
    README
    bin/
    conf/
    docs/
    lib/
    work/
    content_repository/
    flowfile_repository/
    logs/
    provenance_repository/
    state/
    database_repository/

I Started NiFi. Now What?
• Now that NiFi has been started, we can bring up the User Interface (UI) in order to create and monitor our
dataflow.
• To get started, open a web browser and navigate to:
http://localhost:8080/nifi
• The port can be changed by editing the nifi.properties file in the NiFi conf directory, but the default port is
8080.
• This will bring up the User Interface, which at this point is a blank canvas for orchestrating a dataflow.
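As a rough illustration of where the port comes from, the sketch below parses nifi.properties-style text and builds the UI address. It assumes the port is stored under the nifi.web.http.port key and falls back to the default of 8080 mentioned above when the key is missing or empty. That fallback is an assumption made for this example only, and the function names are hypothetical.

```python
def read_properties(text):
    """Parse Java-style 'key=value' lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def nifi_ui_url(properties_text, host="localhost"):
    """Return the UI URL, assuming the port lives under nifi.web.http.port."""
    props = read_properties(properties_text)
    # Assumption for this sketch: use 8080 when the key is missing or empty.
    port = props.get("nifi.web.http.port") or "8080"
    return f"http://{host}:{port}/nifi"


sample = """
# web properties #
nifi.web.http.host=
nifi.web.http.port=9090
"""
print(nifi_ui_url(sample))
```

In practice the text would be read from the nifi.properties file in the conf directory rather than from an inline string.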

NiFi User Interface

Lab: Ambari Driven HDF Installation
NiFi User Interface
• When a DFM navigates to the UI for the first time, the canvas is blank.
• Near the top of the UI are a few toolbars that are important for creating your dataflow:

Draggable Components:
• To the left is the Components Toolbar. This toolbar consists of the different components that can be dragged
onto the canvas:
• Processor
• Input Port
• Output Port
• Process Group
• Remote Process Group
• Funnel
• Template
• Label
We will discuss these components in more detail in the coming sections.

Action Toolbar
• This toolbar consists of buttons to manipulate the existing components on the graph. The functional buttons
are:
• Enable
• Disable
• Start
• Stop
• Create Template
• Copy
• Paste
• Group components
• Change Color
• Delete

Search Toolbar
• This toolbar consists of a single Search field that allows users to easily find components on the graph.
• Users are able to search by component name, type, identifier, configuration properties, and their values.

Global Menu
• This toolbar consists of buttons that are used by DFMs to manage the flow as well as by administrators who
manage user access and configure system properties, such as how many system resources should be provided
to the application. Components are:
• Summary
• Counters
• Bulletin Board
• Data Provenance
• Controller Settings
• Flow Configuration History
• Users
• Policies
• Templates
• User Settings
• Cluster
• Help
• About

NiFi User Interface – Navigation Segments
• The UI also has segments that provide capabilities to easily navigate around the graph.

Draggable Components In Detail:

Processor
• The Processor is the most commonly used component.
• It is responsible for data ingress, egress, routing, and manipulation.
• There are many different types of Processors.
• In fact, this is a very common Extension Point in NiFi.
• Many vendors may implement their own Processors to perform whatever functions are necessary for their use
case.
• When a Processor is dragged onto the graph, the user is presented with a dialog to choose which type of
Processor to use.

Input Port
• Input Ports provide a mechanism for transferring data into a Process Group.
• When an Input Port is dragged onto the canvas, the DFM is prompted to name the Port.
• All Ports within a Process Group must have unique names.
• If the Input Port is dragged onto the Root Process Group, the Input Port provides a mechanism to receive data
from remote instances of NiFi via Site-to-Site.
• If NiFi is configured to run securely, the Input Port can be configured to restrict access to appropriate users.

Output Port
• Output Ports provide a mechanism for transferring data from a Process Group to destinations outside of the
Process Group.
• When an Output Port is dragged onto the canvas, the DFM is prompted to name the Port.
• If the Output Port is dragged onto the Root Process Group, the Output Port provides a mechanism for sending
data to remote instances of NiFi via Site-to-Site.
• In this case, the Port acts as a queue. As remote instances of NiFi pull data from the port, that data is removed
from the queues of the incoming Connections.
• If NiFi is configured to run securely, the Output Port can be configured to restrict access to appropriate users.

Process Group
• Process Groups can be used to logically group a set of components so that the dataflow is easier to understand
and maintain.
• When a Process Group is dragged onto the canvas, the DFM is prompted to name the Process Group.
• All Process Groups within the same parent group must have unique names.
• The Process Group will then be nested within that parent group.

Remote Process Group
• Remote Process Groups appear and behave similarly to Process Groups.
• However, the Remote Process Group (RPG) references a remote instance of NiFi.
• When an RPG is dragged onto the canvas, the DFM is prompted for the URL of the remote NiFi instance.
• If the remote NiFi is a clustered instance, the URL that should be used is the URL of any node in that cluster
(NiFi 1.x clusters have no separate NiFi Cluster Manager).
• When data is transferred to a clustered instance of NiFi via an RPG, the RPG will connect to the remote
instance’s coordinator node to determine which nodes are busy.
• This information is then used to load balance the data that is pushed to each node.

Funnel
• Funnels are used to combine the data from many Connections into a single Connection.
• This has two advantages:
1) Make the graph look better
- If many Connections are created with the same destination, the canvas can become cluttered
if those Connections have to span a large space.
- By funneling these Connections into a single Connection, that single Connection can then be
drawn to span that large space instead.
2) Configure priorities
- Connections can be configured with FlowFile Prioritizers.
- Data from several Connections can be funneled into a single Connection, providing the ability
to Prioritize all of the data on that one Connection.

Templates
• Templates can be created by DFMs from sections of the flow, or they can be imported from other dataflows.
• These Templates provide larger building blocks for creating a complex flow quickly.
• When the Template is dragged onto the canvas, the DFM is provided a dialog to choose which Template to add
to the canvas:

Labels
• Labels are used to provide documentation to parts of a dataflow.
• When a Label is dropped onto the canvas, it is created with a default size.
• The Label can then be resized by dragging the handle in the bottom-right corner.
• The Label has no text when initially created. The text of the Label can be added by right-clicking on the Label
and choosing Configure...

Summary & History
Summary Page
• While the NiFi canvas is useful for understanding how the configured DataFlow is laid out, this view is not
always optimal when trying to discern the status of the system.
• In order to help the user understand how the DataFlow is functioning at a higher level, NiFi provides a
Summary page.
• This page is available in the Global Menu in the top-right corner of the User Interface.
• The Summary Page is opened by clicking the Summary icon in the Global Menu.
• The Summary dialog provides a great deal of information about each of the components on the graph.

Summary Page

Summary Page
• Bulletin Indicator: As in other places throughout the User Interface, when this icon is present, hovering over
the icon will provide information about the Bulletin that was generated, including:
– the message,
– the severity level,
– the time at which the Bulletin was generated,
– the node that generated the Bulletin
• Details: Clicking the Details icon will provide the user with the details of the component. This dialog is the same
as the dialog provided when the user right-clicks on the component and chooses the “View configuration”
menu item.
• Go To: Clicking this button will close the Summary page and take the user directly to the component on the
NiFi canvas.

Summary Page
• Refresh: The Refresh button allows the user to refresh the information displayed without closing the dialog and
opening it again.
• Filter: The Filter element allows users to filter the contents of the Summary table by typing in all or part of
some criteria, such as a Processor Type or Processor Name.
• Pop-Out: When monitoring a flow, it is helpful to be able to open the Summary table in a separate browser tab
or window.
• Stats History: Clicking the Stats History icon will open a new dialog that shows a historical view of the statistics
that are rendered for this component.

Summary Page
• System Diagnostics: The System Diagnostics window provides information about how the system is performing
with respect to system resource utilization.
• JVM Tab: heap/non-heap memory usage and GC details
• System Tab: load average for the last minute, and FlowFile and Content Repository disk usage
• Version Tab: NiFi version details, Java version details, and OS version details
Historical Statistics of a Component
• While the Summary table and the canvas show numeric statistics pertaining to the performance of a
component over the past five minutes, it is often useful to have a view of historical statistics as well.
• This information is available by right-clicking on a component and choosing the “Stats” menu option or by
clicking on the Stats History in the Summary page.
• The amount of historical information that is stored is configurable in the NiFi properties but defaults to 24
hours.
• When the Stats dialog is opened, it provides a graph of historical statistics:


The left-hand side of the dialog provides information about the component that the stats are for:
• Id: The ID of the component for which the stats are being shown.
• Group Id: The ID of the Process Group in which the component resides.
• Name: The Name of the Component for which the stats are being shown.
• Component-Specific Entries: Information is shown for each different type of component. For example, for a
Processor, the type of Processor is displayed. For a Connection, the source and destination names and IDs are
shown.
• Start: The earliest time shown on the graph.
• End: The latest time shown on the graph.
• Min/Max/Mean: The minimum, maximum, and mean (arithmetic mean, or average) values are shown.

• The right-hand side of the dialog provides a drop-down list of the different types of metrics to render in the
graphs.
• The top graph is larger so as to provide an easier-to-read rendering of the information.
• The bottom graph is much shorter and provides the ability to select a time range.
• Selecting a time range here will cause the top graph to show only the time range selected, but in a more
detailed manner.
• Additionally, this will cause the Min/Max/Mean values on the left-hand side to be recalculated.
• Once a selection has been created by dragging a rectangle over the graph, double-clicking on the selected
portion will cause the selection to fully expand in the vertical direction.
• Clicking on the bottom graph without dragging will remove the selection.
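The recalculation of Min/Max/Mean for a selected time range can be pictured with a small sketch. It assumes the history is available as (timestamp, value) pairs; the function name and the sample numbers are hypothetical and only illustrate the idea, not how NiFi stores or computes its statistics.

```python
from statistics import mean


def summarize(points, start=None, end=None):
    """Return (min, max, mean) of the values whose timestamp is in [start, end].

    points is a list of (timestamp, value) pairs; a bound of None is open.
    """
    selected = [
        value
        for timestamp, value in points
        if (start is None or timestamp >= start) and (end is None or timestamp <= end)
    ]
    if not selected:
        return None
    return min(selected), max(selected), mean(selected)


# Hypothetical samples: seconds since the start of the graph, FlowFiles out.
history = [(0, 10), (60, 40), (120, 20), (180, 30)]
whole_graph = summarize(history)        # min 10, max 40, mean 25
selection = summarize(history, 60, 120)  # min 20, max 40, mean 30
print(whole_graph, selection)
```

Narrowing the range changes all three values, which matches the recalculation on the left-hand side of the dialog when a selection is made in the bottom graph.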

Anatomy of a Processor
NiFi provides a significant amount of information about each Processor on the canvas. The following diagram
shows the anatomy of a Processor.
The elements are discussed in the following slides:

Elements of a Processor
1) Processor Type:
• NiFi provides several different types of Processors in order to allow for a wide range of tasks to be performed.
• Each type of Processor is designed to perform one specific task.
• The Processor type (PutFile, in this example) describes the task that this Processor performs.
• In the above diagram, the Processor writes a FlowFile to disk - or “Puts” a FlowFile to a File.
2) Processor Name:
• This is the user-defined name of the Processor.
• By default, the name of the Processor is the same as the Processor Type.
• In the example, this value is "Copy to /review".

Elements of a Processor(Cont..)
3) Bulletin Indicator:
• When a Processor event occurs, it generates a Bulletin to notify those who are monitoring NiFi via the User
Interface.
• The DFM is able to configure which bulletins should be displayed in the User Interface.
• The default value is WARN.
• This icon is not present unless a Bulletin exists for this Processor.
• If the instance of NiFi is clustered, it will also show the Node that emitted the Bulletin.
• Bulletins automatically expire after five minutes.

4) Active Tasks:
• The number of tasks that this Processor is currently executing.
• This number is constrained by the “Concurrent tasks” setting in the “Scheduling” tab of the Processor
configuration dialog.
• In the example we can see that the Processor is currently performing two tasks.
• If the NiFi instance is clustered, this value represents the number of tasks that are currently executing across all
nodes in the cluster.

5) Status Indicator:
Shows the current Status of the Processor. The following indicators are possible:
Running: The Processor is currently running.
Stopped: The Processor is valid and enabled but is not running.
Invalid: The Processor is enabled but is not currently valid and cannot be started. Hovering over this icon
will provide a tooltip indicating why the Processor is not valid.
Disabled: The Processor is not running and cannot be started until it has been enabled. This status does
not indicate whether or not the Processor is valid.
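The four indicators follow from three yes/no facts about a Processor. A minimal sketch of that decision, with a hypothetical function name and boolean inputs chosen for illustration:

```python
def processor_status(enabled, valid, running):
    """Map a Processor's flags to the status indicator described above."""
    if not enabled:
        # Disabled says nothing about validity, so 'valid' is ignored here.
        return "Disabled"
    if not valid:
        # An invalid Processor cannot be started.
        return "Invalid"
    return "Running" if running else "Stopped"


print(processor_status(enabled=True, valid=True, running=False))
```

Checking enabled first mirrors the note that the Disabled status does not indicate whether the Processor is valid.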

6) 5-Minute Statistics:
• The Processor shows several different statistics in tabular form.
• Each of these statistics represents the amount of work that has been performed in the past five minutes.
• If the NiFi instance is clustered, these values indicate how much work has been done by all of the Nodes
combined in the past five minutes.
• These metrics are:
• In: The amount of data that the Processor has pulled from the queues of its incoming Connections.
• Read/Write: The total size of the FlowFile content that the Processor has read/written to disk.
• Out: The amount of data that the Processor has transferred to its outbound Connections.
• Tasks/Time: The number of tasks the Processor executed in the past 5 minutes, and the amount of time
taken to perform those tasks.

What Processors are Available?
• In order to create an effective dataflow, the users must understand what types of Processors are available to
them.
• NiFi contains many different Processors out of the box.
• These Processors provide capabilities to ingest data from numerous different systems, route, transform,
process, split, and aggregate data, and distribute data to many systems.
• The number of Processors that are available increases in nearly every release of NiFi.

A Few Categories of Processors
Some of the most frequently used Processors, categorized by function:
• Data Transformation
• Routing and Mediation
• Database Access
• Attribute Extraction
• System Interaction
• Data Ingestion
• Data Egress / Sending Data
• Splitting and Aggregation
• HTTP
• Amazon Web Services
Note: Many more are available, and the list keeps growing.

Data Transformation Processors
• CompressContent: Compress or Decompress Content
• ConvertCharacterSet: Convert the character set used to encode the content from one character set to another
• EncryptContent: Encrypt or Decrypt Content
• ReplaceText: Use Regular Expressions to modify textual Content
• TransformXml: Apply an XSLT transform to XML Content

Database Access Processors
• ConvertJSONToSQL: Convert a JSON document into a SQL INSERT or UPDATE command.
• ExecuteSQL: Executes a user-defined SQL SELECT command, writing the results in Avro format.
• PutSQL: Updates a database by executing the SQL DML statement defined by the FlowFile’s content.
• GetHBase: This Processor polls HBase for any records in the specified table.
• PutHBaseCell: Adds the Contents of a FlowFile to HBase as the value of a single cell
• PutHBaseJSON: Adds rows to HBase based on the contents of incoming JSON documents.

Routing and Mediation Processors
• ControlRate: Throttle the rate at which data can flow through one part of the flow
• DetectDuplicate: Monitor for duplicate FlowFiles, based on some user-defined criteria.
• DistributeLoad: Load balance by distributing only a portion of data to each user-defined Relationship
• MonitorActivity: Sends a notification when a user-defined period of time elapses without any data.
• RouteOnAttribute: Route FlowFile based on the attributes that it contains.
• ScanAttribute: Scans the user-defined set of Attributes on a FlowFile.
• RouteOnContent: Searches the Content of a FlowFile; if it matches, the FlowFile is routed to the configured Relationship.
• ScanContent: Search Content of a FlowFile for terms that are present in a user-defined dictionary.
• ValidateXml: Validates XML Content against an XML Schema.

Attribute Extraction Processors
• EvaluateJsonPath: The user supplies JSONPath Expressions, which are evaluated against the JSON Content.
• ExtractText: Contents of a FlowFile are extracted using Regular Expressions.
• HashAttribute: Performs a hashing function against the concatenation of existing Attributes.
• HashContent: Performs a hashing function against the content of a FlowFile and adds it as an Attribute.
• IdentifyMimeType: Evaluates the content of a FlowFile to determine what MIME type of file the FlowFile contains.
• UpdateAttribute: Adds or updates any number of user-defined Attributes to a FlowFile.

System Interaction Processors
• ExecuteProcess: Runs a user-defined Operating System command. This Processor is a Source Processor.
• ExecuteStreamCommand: Runs the user-defined Operating System command. It must be fed incoming FlowFiles
in order to perform its work.

Data Ingestion Processors
• GetKafka: Consumes messages from Apache Kafka.
• GetMongo: Executes a user-specified query against MongoDB and writes the contents to a new FlowFile.
• GetTwitter: Listens to a Twitter endpoint with a user-defined filter and creates a FlowFile for each tweet that is received.
• GetHDFS: Monitors a directory in HDFS. When a file enters HDFS, it is copied into NiFi and deleted from HDFS.
• ListHDFS: Monitors a directory in HDFS and emits a FlowFile for each file with filename as its content.
• FetchHDFS: On receiving FlowFile from ListHDFS, it fetches the actual files from HDFS to NiFi.
• GetFTP: Downloads the contents of a remote file via FTP into NiFi and then deletes the original file.
• GetSFTP: Downloads the contents of a remote file via SFTP into NiFi and then deletes the original file
• GetFile: Streams the contents of a file from a local disk into NiFi and then deletes the original file.
• ListenHTTP: Starts an HTTP (or HTTPS) Server and listens for incoming connections.
• ListenUDP: Listens for incoming UDP packets, creates a FlowFile, and emits it to the success relationship.

Data Egress / Sending Data Processors
• PutEmail: Sends an E-mail to the configured recipients.
• PutFTP: Copies the contents of a FlowFile to a remote FTP Server.
• PutSFTP: Copies the contents of a FlowFile to a remote SFTP Server.
• PutSQL: Executes the contents of a FlowFile as a SQL DML statement (INSERT, UPDATE, or DELETE).
• PutKafka: Sends the contents of a FlowFile to Kafka as a message.
• PutMongo: Sends the contents of a FlowFile to Mongo as an INSERT or an UPDATE.

HTTP Processors
• GetHTTP: Downloads the contents of a remote HTTP- or HTTPS-based URL into NiFi.
• ListenHTTP: Starts an HTTP (or HTTPS) Server and listens for incoming connections.
• InvokeHTTP: Performs an HTTP Request that is configured by the user
• PostHTTP: Performs an HTTP POST request, sending the contents of the FlowFile as the body
• HandleHttpRequest: A Source Processor that starts an HTTP(S) server, similar to ListenHTTP.
• HandleHttpResponse: Sends a response back to the client after the FlowFile has finished processing.

Amazon Web Services Processors
• FetchS3Object: Fetches the content of an object stored in Amazon Simple Storage Service (S3).
• PutS3Object: Writes the contents of a FlowFile to an Amazon S3 object as configured.
• PutSNS: Sends the contents of a FlowFile as a notification to the Amazon Simple Notification Service (SNS).
• GetSQS: Pulls a message from the Amazon Simple Queuing Service (SQS) and writes to FlowFile.
• PutSQS: Sends the contents of a FlowFile as a message to the Amazon Simple Queuing Service (SQS).
• DeleteSQS: Deletes a message from the Amazon Simple Queuing Service (SQS).

Anatomy of a Connection
• Once processors and other components are added, the next step is to connect them to one another so that NiFi
knows what to do with each FlowFile after it has been processed.
• This is accomplished by creating a Connection between each component.
• When the user hovers the mouse over the center of a component, a new Connection icon appears.
• The user drags the Connection bubble from one component to another until the second component is
highlighted.
• When the user releases the mouse, a ‘Create Connection’ dialog appears.
• This dialog consists of two tabs: ‘Details’ and ‘Settings’. They are discussed in detail below.

Anatomy of a Connection (cont..)
• It is possible to draw a connection so that it loops back on the same processor.
• This can be useful if the DFM wants the processor to try to re-process FlowFiles if they go down a failure
Relationship.
• To create this type of looping connection, simply drag the connection bubble away and then back to the same
processor until it is highlighted.
• Then release the mouse and the same Create Connection dialog appears.

Details Tab
• The Details Tab of the Create Connection dialog provides information about a connection.
• At least one connection Relationship must be selected. If only one is available, it is selected automatically.
• If multiple Connections are added with the same Relationship, the FlowFile is automatically ‘cloned’, and a copy
will be sent to each of those Connections.
The dialog shows:
• Source Component and its type
• Source Component Process Group
• Relationship included
• Destination Component and its type
• Destination Component Process Group
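The cloning behavior described for the Details Tab can be pictured with a toy model. This is only a sketch: FlowFiles are plain dictionaries, queues are Python lists, and the route function is hypothetical. It is not how NiFi implements cloning internally.

```python
import copy


def route(flowfile, relationship, connections):
    """Send a FlowFile to every Connection that includes the relationship.

    connections is a list of (relationship_name, queue) pairs, where each
    queue is a plain list. Returns the number of Connections that received it.
    """
    targets = [queue for name, queue in connections if name == relationship]
    for index, queue in enumerate(targets):
        # The first Connection gets the original; each further one gets a clone.
        queue.append(flowfile if index == 0 else copy.deepcopy(flowfile))
    return len(targets)


success_a, success_b, failure = [], [], []
connections = [("success", success_a), ("success", success_b), ("failure", failure)]
flowfile = {"attributes": {"filename": "a.txt"}, "content": b"hello"}
delivered = route(flowfile, "success", connections)
print(delivered, len(success_a), len(success_b), len(failure))
```

Both success queues end up holding an equal copy of the FlowFile, while the failure queue stays empty.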

Settings Tab
• The Settings Tab provides the ability to configure the Connection’s name, FlowFile expiration, Back Pressure
thresholds, and Prioritization.
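The fields are:
• Id: generated automatically once the connection is configured
• Name: name for the connection (optional)
• FlowFile expiration: expire FlowFiles after a period
• Back pressure object threshold: maximum number of FlowFiles the connection can hold
• Back pressure data size threshold: maximum amount of data the connection can hold
• Prioritizers: how to prioritize the data in the queue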

FlowFile expiration
• FlowFile expiration is a concept by which data that cannot be processed in a timely fashion can be automatically
removed from the flow.
• The expiration period is based on the time that the data entered the NiFi instance.
• In other words, if the file expiration on a given connection is set to 1 hour, and a file that has been in the NiFi
instance for one hour reaches that connection, it will expire.
• The default value of 0 sec indicates that the data will never expire.
• When a file expiration other than 0 sec is set, a small clock icon appears on the connection label, so the DFM
can see it at-a-glance when looking at a flow on the graph.
Clock symbol to indicate expiration
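The expiration rule can be summarized in a few lines. This is a minimal sketch, assuming the age is measured from the time the data entered the NiFi instance (as stated above) and that a FlowFile expires once its age reaches the configured period; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def is_expired(entered_nifi_at, now, expiration):
    """Return True if a FlowFile should expire on a connection.

    expiration is a timedelta; timedelta(0) models the default '0 sec',
    which means the data never expires.
    """
    if expiration == timedelta(0):
        return False
    # Age counts from entry into NiFi, not from arrival at this connection.
    return now - entered_nifi_at >= expiration


entered = datetime(2017, 1, 1, 12, 0, 0)
print(is_expired(entered, datetime(2017, 1, 1, 13, 0, 0), timedelta(hours=1)))
```

With a one-hour expiration, a FlowFile that entered NiFi at 12:00 is expired when it reaches the connection at 13:00, but not at 12:59.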

Back Pressure
• This allows the system to avoid being overrun with data.
• These thresholds indicate how much data should be allowed to exist in the queue before the component that is
the source of the Connection is no longer scheduled to run.
• NiFi provides two configuration elements for Back Pressure.
1) Back pressure object threshold:
– This is the number of FlowFiles that can be in the queue before back pressure is applied.
2) Back pressure data size threshold:
– This specifies the maximum amount of data (in size) that should be queued up before applying back
pressure.
– This value is configured by entering a number followed by a data size (B for bytes, KB for kilobytes, MB for
megabytes, GB for gigabytes, or TB for terabytes).
Note: Starting with HDF-2.0.1, the object threshold defaults to 10000 and the data size threshold defaults to
1 GB, to avoid overwhelming the system.
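The two thresholds can be combined into a single check, sketched below. The sketch assumes that back pressure applies as soon as either threshold is reached and that each unit is 1024 times the previous one; both are assumptions of this example, and the function names are hypothetical.

```python
import re

# Assumption for this sketch: binary units (1 KB = 1024 B).
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_data_size(text):
    """Convert a string such as '1 GB' or '512MB' to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(B|KB|MB|GB|TB)\s*", text, re.IGNORECASE)
    if not match:
        raise ValueError(f"not a data size: {text!r}")
    return int(float(match.group(1)) * UNITS[match.group(2).upper()])


def back_pressure_applied(queued_count, queued_bytes,
                          object_threshold=10000, size_threshold="1 GB"):
    """Return True when either threshold has been reached."""
    return (queued_count >= object_threshold
            or queued_bytes >= parse_data_size(size_threshold))


print(back_pressure_applied(queued_count=9999, queued_bytes=0))
print(back_pressure_applied(queued_count=10000, queued_bytes=0))
```

The defaults in the signature are the HDF-2.0.1 defaults quoted in the note above.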
Back Pressure Indicators
• When back pressure is enabled, small progress bars appear on the connection label.
• The DFM can see it at-a-glance when looking at a flow on the canvas.
• The progress bars change color based on the queue percentage:
• Green (0-60%)
• Yellow (61-85%)
• Red (86-100%).
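The color bands translate directly into a small lookup. A sketch with a hypothetical function name, assuming that any fill level above 60% and up to 85% counts as yellow, since the slide lists whole percentages only:

```python
def indicator_color(queued, threshold):
    """Return the progress-bar color for a queue's fill level."""
    percent = 100 * queued / threshold
    if percent <= 60:
        return "green"
    if percent <= 85:
        return "yellow"
    return "red"


for queued in (6000, 6100, 8600):
    print(queued, indicator_color(queued, threshold=10000))
```

The same function applies to either bar: pass FlowFile counts for the object threshold, or byte counts for the data size threshold.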

Prioritizers
• The right-hand side of the tab provides the ability to prioritize the data in the queue so that higher priority data
is processed first.
• Prioritizers can be dragged from the top (‘Available prioritizers’) to the bottom (‘Selected prioritizers’).
• Multiple prioritizers can be selected.
• The prioritizer that is at the top of the ‘Selected prioritizers’ list is the highest priority.
• If two FlowFiles have the same value according to this prioritizer, the second prioritizer will determine which
FlowFile to process first, and so on.
• If a prioritizer is no longer desired, it can then be dragged from the ‘Selected prioritizers’ list to the ‘Available
prioritizers’ list.

Types of Prioritizers
• FirstInFirstOutPrioritizer: Given two FlowFiles, the one that reached the connection first will be processed first.
• NewestFlowFileFirstPrioritizer: Given two FlowFiles, the one that is newest in the dataflow will be processed
first.
• OldestFlowFileFirstPrioritizer: Given two FlowFiles, the one that is oldest in the dataflow will be processed first.
This is the default scheme that is used if no prioritizers are selected.
• PriorityAttributePrioritizer: Given two FlowFiles that both have a "priority" attribute, the one that has the
highest priority value will be processed first.
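How several selected prioritizers combine can be sketched in Python as a multi-level sort. This is a model, not NiFi's code: the `FlowFile` fields are hypothetical, "oldest/newest in the dataflow" is assumed to mean the time the lineage started, and PriorityAttributePrioritizer is left out because it depends on how priority values are compared.

```python
from dataclasses import dataclass


@dataclass
class FlowFile:
    name: str
    queued_at: float      # when it reached this connection (hypothetical field)
    lineage_start: float  # when it, or its oldest ancestor, entered the dataflow


# Each prioritizer is modelled as a sort key: smaller keys are processed first.
def first_in_first_out(ff: FlowFile) -> float:
    return ff.queued_at


def oldest_flowfile_first(ff: FlowFile) -> float:
    return ff.lineage_start


def newest_flowfile_first(ff: FlowFile) -> float:
    return -ff.lineage_start


def order_queue(flowfiles, selected=()):
    """Order a queue by the selected prioritizers, top of the list first.

    Ties on the first prioritizer are broken by the second, and so on.
    With nothing selected, oldest-first is used, as stated above.
    """
    keys = list(selected) or [oldest_flowfile_first]
    return sorted(flowfiles, key=lambda ff: tuple(key(ff) for key in keys))
```

For example, `order_queue(queue, [first_in_first_out, newest_flowfile_first])` orders by arrival at the connection and uses newest-first only to break ties.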

Empty queue
• This option allows the DFM to clear the queue of FlowFiles that may be waiting to be processed.
• This option can be especially useful during testing, when the DFM is not concerned about deleting data from
the queue.
• When this option is selected, users must confirm that they want to delete the data in the queue.
Right-click and select ‘Empty queue’
Click on Empty
Click OK on confirmation

List queue
• This option allows the DFM to view/list the queue of FlowFiles that may be waiting to be processed.
Click on List queue
View details of each FlowFile
View Provenance

Explicit Processor Connectivity
• Processors that do not expect incoming data will no longer allow incoming Connections.
• Attempting to draw one will show a red, dotted line, and will not allow the Connection to be made.
• Likewise, Processors that require input to perform work will be invalid until they have an incoming Connection.
Red lines, and No connections Made
Click and drag connection

Bending Connections
• To add a bend point (or elbow) to an existing connection, simply double-click on the connection in the spot
where you want the bend point to be.
• Then, you can use the mouse to grab the bend point and drag it so that the connection is bent in the desired
way.
• You can add as many bend points as you want. You can also use the mouse to drag and move the label on the
connection to any existing bend point.
• To remove a bend point, simply double-click it again.

Controller Services and
Reporting Tasks
Working With Controller Settings
• There is also a central place within the User Interface for adding and configuring both Controller
Services and Reporting Tasks.
• Clicking Controller Settings opens a window with the following three tabs: General, Controller Services, and
Reporting Tasks.
• The first tab in the Controller Settings window is General:
General:
-- Name of the flow.
-- Comments that describe the parent flow.
-- Maximum thread counts of the instance.
-- Info here will be visible to every user.
-- Backup/Archive your current flow.

Working With Controller Settings(cont..)
• The next tab in the Controller Settings window is Controller Services:
Controller Services:
-- View all the Controller Services added.
-- Click the + button to add new Controller Services.
-- Then configure the Controller Services.
-- Edit, remove, enable, and usage buttons are also available.

Working With Controller Settings(cont..)
• The next tab in the Controller Settings window is Reporting Tasks:
Reporting Tasks:
-- View all the Reporting Tasks added.
-- Click the + button to add new Reporting Tasks.
-- Then configure the Reporting Tasks.
-- Edit, remove, enable, and usage buttons are also available.

Reporting Task
• Reporting Tasks run in the background to provide statistical reports about what is happening in the NiFi
instance.
• The DFM adds and configures Reporting Tasks in the User Interface as desired.
• Available Reporting Tasks include:
1) ControllerStatusReportingTask: Logs the 5-minute stats that are shown in the NiFi Summary Page.
2) MonitorDiskUsageReportingTask: Checks the storage space available for the Repositories and warns when usage
exceeds a configured threshold.
3) MonitorMemoryReportingTask: Checks the Java Heap available in the JVM for a JVM Memory Pool.
4) StandardGangliaReporter: Reports metrics to Ganglia for external monitoring of the application.
5) AmbariReportingTask: Publishes metrics from NiFi to Ambari.
6) DataDogReportingTask: Publishes metrics from NiFi to Datadog.
7) SiteToSiteProvenanceReportingTask: Publishes Provenance events using the Site-to-Site protocol.
Note: Each task is discussed in detail under ‘Monitoring NiFi’.

Controller Services
• Controller Services are extension points that, after being added and configured by a DFM in the User Interface,
start up when NiFi starts up and provide information for use by other components.
• A commonly used one is StandardSSLContextService, which holds the keystore/truststore configuration for reuse.
• The idea is to configure once and use it from multiple Processors and other components.
• Common Controller Services are:
• DBCPConnectionPool • StandardSSLContextService
• DistributedMapCacheClientService • AWSCredentialsProviderControllerService
• DistributedMapCacheServer • JMSConnectionFactoryProvider
• DistributedSetCacheClientService • HBase_1_1_2_ClientService
• DistributedSetCacheServer • HiveConnectionPool
• StandardHttpContextMap • CouchbaseClusterService

List of Controller Services
• AvroReader • GrokReader • ScriptedReader
• AvroRecordSetWriter • HBase_1_1_2_ClientMapCacheService • ScriptedRecordSetWriter
• AvroSchemaRegistry • HBase_1_1_2_ClientService • SimpleCsvFileLookupService
• AWSCredentialsProviderControllerService • HiveConnectionPool • SimpleKeyValueLookupService
• CouchbaseClusterService • HortonworksSchemaRegistry • StandardHttpContextMap
• CSVReader • IPLookupService • StandardSSLContextService
• CSVRecordSetWriter • JettyWebSocketClient • XMLFileLookupService
• DBCPConnectionPool • JettyWebSocketServer
• DistributedMapCacheClientService • JMSConnectionFactoryProvider
• DistributedMapCacheServer • JsonPathReader
• DistributedSetCacheClientService • JsonRecordSetWriter
• DistributedSetCacheServer • JsonTreeReader
• FreeFormTextRecordSetWriter • PropertiesFileLookupService
• GCPCredentialsControllerService • ScriptedLookupService
Few Common Controller Services
DBCPConnectionPool
• Provides Database Connection Pooling Service.
• Connections can be requested from the pool and returned after use. Below are the required properties

HBase_1_1_2_ClientService
• Implementation of HBaseClientService for HBase 1.1.2.
• This service can be configured by providing a comma-separated list of configuration files, or by specifying
values for the other properties.

AWSCredentialsProviderControllerService
• Defines credentials for Amazon Web Services processors.
• It is an AWS credentials provider service that acts as an interface for creating AWS clients.

JMSConnectionFactoryProvider
• Provides a generic service to create vendor specific javax.jms.ConnectionFactory implementations.
• ConnectionFactory can be served once this service is configured successfully

HiveConnectionPool
• Provides Database Connection Pooling Service for Apache Hive.
• Connections can be requested from the pool and returned after use.

Demo: NiFi User Interface
Building a NiFi DataFlow
Adding Components to the Canvas
• A DFM is able to build an automated dataflow using the NiFi User Interface (UI).
• Simply drag components from the toolbar to the canvas
• Configure the components to meet specific needs
• Connect the components together
Add Processors
Add a connection
Fix Processor Configurations
Start the processors

Adding a Processor to the Canvas
• The Processor is the most commonly used component.
• It is capable of data ingress, egress, routing, and manipulation.
• When a Processor is dragged onto the graph, the user is presented with a dialog to choose type:
Select Source
Apply filter if you need
Processors with tags hadoop & ingest
Click hadoop
Click Ingest
Click add/double click

Configuring a Processor
• Once a Processor has been dragged onto the Canvas, it is ready to configure.
• This is done by right-clicking on the Processor and clicking the Configure option.
• The configuration dialog is opened with four different tabs, each will be discussed.
• Once finished configuring you can click Apply button to save changes or press Cancel to exit without applying
any changes.
• While a Processor is running, you can only View configuration.
Right-click and select Configure

Configuring a Processor – Settings Tab
Change the name of the Processor Enable/disable processor
Unique identifier
Relationship termination
Type
Run Schedule interruption
To penalize flowfile
Display WARN and above

Configuring a Processor – Scheduling Tab
Timer/Event/CRON driven
All nodes/Primary node
How long Processor should run
How often Processor should run
Max threads a processor can use

Configuring a Processor – Properties Tab
• The Properties Tab provides a mechanism to configure Processor-specific behavior.
• Different processors have different properties by default.
• Some support dynamically created properties
Click to add property
Property
Save Changes
Enter Values

Configuring a Processor – Comments Tab
• The last tab in the Processor configuration dialog is the Comments tab.
• Provides an area for users to include comments that are appropriate for this component.
• Use of the Comments tab is optional.
Apply/Save Changes
Comments for your processor

Adding and Configuring Input Port
• Input Ports provide a mechanism for transferring data into a Process Group.
• When an Input Port is dragged onto the canvas, the DFM is prompted to name the Port.
• If the Input Port is dragged onto the Root Process Group, it is used for Site-to-Site.
• It can be configured to restrict access to appropriate users by configuring to run securely.
Drag Input Port to Canvas
Give a Name & Add Configure, Connect and Start

Adding and Configuring Output Port
• Output Ports provide a mechanism for transferring data outside of a Process Group.
• When an Output Port is dragged onto the canvas, the DFM is prompted to name the Port.
• If the Output Port is dragged onto the Root Process Group, it is used for Site-to-Site.
• It can be configured to restrict access to appropriate users by configuring to run securely.
Drag Output Port to Canvas
Give a Name & Add

Configure, Connect and Start
Adding and Configuring Process Group
• It is used to group a set of components so that the dataflow is easier to understand and maintain.
• When a Process Group is dragged onto the canvas, the DFM is prompted to name it.
• All Process Groups within the same parent group must have unique names.
• The Process Group will then be nested within that parent group.
Drag Process Group to Canvas
Configure/Enter Group
Give a Name & Add
Notice on the bottom, you are one level inside

Adding and Configuring Remote Process Group
• Remote Process Groups look and behave much like Process Groups.
• However, a Remote Process Group (RPG) references a remote instance of NiFi.
• When an RPG is dragged onto the canvas, it asks for the URL of the remote NiFi instance rather than a name.
• If the remote instance is a cluster, you can specify any node’s URL.
Drag Remote Process Group to Canvas
Give a remote NiFi URL & Add
Right-click to configure, manage ports to establish Site-to-Site

Adding and Configuring Funnel
• Funnels are used to combine the data from many Connections into a single Connection
• It keeps your canvas tidy.
• Connections can be configured with FlowFile Prioritizers.
• Data from several Connections can be funneled into a single Connection, and prioritize them together.
Drag Funnel to Canvas
You may configure output based on priority
Start connecting processors through this

Adding Templates to Canvas
• When a Template is dragged onto the canvas, the DFM is asked to choose which Template to add to the canvas:
Drag Template to Canvas
Select the required Template to Import
Now you can reuse or export the templates
Adding Labels
• Labels are used to provide documentation to parts of a dataflow.
• When a Label is dropped onto the canvas, it is created with a default size.
• The Label can then be resized by dragging the handle in the bottom-right corner.
• The Label has no text when initially created. The text of the Label can be added by right-clicking on the Label
and choosing Configure...
Drag Label to Canvas
Click Configure and add a Label name
Now we have a dataflow labeled.

Command and Control of
DataFlow
Command and Control of DataFlow
• When a component is added to the NiFi canvas, it is in the Stopped state.
• In order to cause the component to be triggered, the component must be started.
• Once started, the component can be stopped at any time.
• From a Stopped state, the component can be configured, started, or disabled.
• We will learn more about control of the dataflow in the following slides:

Starting/Stopping a Component
In order to start a component, the following conditions must be met:
• The component’s configuration must be valid.
• The component must be enabled and have no active tasks.
Once started, the component can be stopped at any time:
• If a Process Group is stopped, all of the components within the Process Group will be stopped.
• Stopping a component does not interrupt its currently running tasks; rather, it stops scheduling new tasks.
Fix configurations and make it valid
You can right click and start.. or
You can right-click and stop if running
Enabling/Disabling a Component
• When a component is enabled, it is able to be started.
• Users may choose to disable components when they are part of a dataflow that is still being assembled.
• This helps to distinguish between components that are intentionally not running and those that were only stopped
temporarily.
• A component can be enabled by clicking Enable icon in the Actions Toolbar, or in configuration.
• Only Ports and Processors can be enabled and disabled.
Component is in Disabled state
Click Enable to use this processor
Now it is enabled and can be started/disabled
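The start/stop and enable/disable rules from the last two slides can be summarized as a small state machine. The Python sketch below is only a model of those rules; the class, state names, and error messages are hypothetical and not part of NiFi.

```python
class Component:
    """Models the Stopped / Running / Disabled states described above."""

    def __init__(self, valid: bool = False):
        self.state = "STOPPED"   # a component is added in the Stopped state
        self.valid = valid       # whether its configuration is valid
        self.active_tasks = 0

    def start(self):
        if self.state != "STOPPED":
            raise RuntimeError("only a stopped, enabled component can be started")
        if not self.valid:
            raise RuntimeError("configuration is not valid")
        if self.active_tasks:
            raise RuntimeError("component still has active tasks")
        self.state = "RUNNING"

    def stop(self):
        # Running tasks are not interrupted; no new tasks are scheduled.
        if self.state == "RUNNING":
            self.state = "STOPPED"

    def disable(self):
        if self.state != "STOPPED":
            raise RuntimeError("stop the component before disabling it")
        self.state = "DISABLED"

    def enable(self):
        if self.state == "DISABLED":
            self.state = "STOPPED"
```

A disabled component has to be enabled (returning it to Stopped) before `start()` succeeds, which matches the rule that a component must be enabled to be started.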
Example Dataflow
• Now, let's try to put it all together.
• The following example dataflow consists of just two processors: GenerateFlowFile and LogAttribute.
• These processors are normally used for testing, but they can also be used to build a quick flow for
demonstration purposes and see NiFi in action.
• After you drag the GenerateFlowFile and LogAttribute processors to the graph and connect them, configure them as
follows:
• GenerateFlowFile
On the Scheduling tab: set Run schedule to: 5 sec.
On the Properties tab: set File Size to: 10 KB
• LogAttribute
On the Settings tab: under Auto-terminate relationships, select the checkbox next to Success.
On the Settings tab: set the Bulletin level to Info.

Example Dataflow
• The dataflow should look like the following:
• Now you can try starting the dataflow.

• When the dataflow is running, be sure to note the statistical information that is displayed on the face of
each processor
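As a rough guide to the numbers you should expect on the processor faces, the arithmetic can be sketched in Python. This assumes the statistics cover a 5-minute window and that GenerateFlowFile emits exactly one 10 KB FlowFile per run; the function name is hypothetical and real figures will vary slightly with timing.

```python
RUN_SCHEDULE_SEC = 5     # GenerateFlowFile run schedule from the example
FILE_SIZE_KB = 10        # GenerateFlowFile file size from the example
WINDOW_SEC = 5 * 60      # assumed length of the statistics window


def expected_window_stats(run_schedule_sec: int, file_size_kb: int, window_sec: int):
    """Return (FlowFile count, total KB) produced in one statistics window."""
    flowfiles = window_sec // run_schedule_sec
    return flowfiles, flowfiles * file_size_kb


count, total_kb = expected_window_stats(RUN_SCHEDULE_SEC, FILE_SIZE_KB, WINDOW_SEC)
print(count, total_kb)   # 60 600
```

So after the flow has run for five minutes, roughly 60 FlowFiles and about 600 KB should appear in the Out statistic of GenerateFlowFile and the In statistic of LogAttribute.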

How about a complex Dataflow??
• Let's imagine we have a requirement that consists of 4 dataflows, each with a different source:
Pulling data from Kafka | Pulling data from X | Pulling data from Y | Pulling HTTP data
Each flow then performs the same steps:
Check if it's compressed
Decompress
Push it to HDFS

• In order to have a good visual representation, use Process Groups to provide a nice logical separation.
• If we do that for each of those, we end up with something like:
• Now, let's say that you've got a new requirement.

• While sending text data to HDFS, each file pushed to HDFS needs to have 1,000 lines of text or fewer.
• Now, consider how much work it is to make all of those modifications, as we have several different dataflows
side-by-side on the same graph.
• We can then double-click each of those Process Groups and edit what's inside to meet the requirement.

• So let us consider the alternate approach of merging it all into a single dataflow, and we end up with:
Add SplitText Processor here to make sure HDFS files have <=1000 lines

• We don't have to insert a SplitText processor 3 more times, as we restructured the flow like below:
SplitText Processor added
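What the new requirement asks of the split step can be illustrated with a few lines of Python. This is a sketch of the idea only (chunks of at most 1,000 lines), not the SplitText processor itself; the function name is hypothetical.

```python
def split_lines(text: str, max_lines: int = 1000) -> list[str]:
    """Split text into pieces that each contain at most max_lines lines."""
    lines = text.splitlines(keepends=True)
    return ["".join(lines[i:i + max_lines])
            for i in range(0, len(lines), max_lines)]
```

A 2,500-line input yields three pieces of 1,000, 1,000, and 500 lines, and joining the pieces gives back the original text.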

Lab: Building A DataFlow
Anatomy of a Processor Group
Anatomy of a Processor Group
• The Process Group provides a mechanism for grouping components together into a logical construct to make
the dataflow more understandable from a higher level.
• The following image highlights the different elements that make up the anatomy of a Process Group:
The Process Group consists of the following elements:

Anatomy of a Processor Group (cont)
1) Name:
• This is the user-defined name of the Process Group.
• This name is set when the Process Group is added to the canvas.
• The name can later be changed by right-clicking on the Process Group and clicking the “Configure” option.
• In this example, the name of the Process Group is “Process Group ABC.”
2) Bulletin Indicator:
• When a child component of a Process Group emits a bulletin, that bulletin is propagated to the component’s
parent Process Group, as well.
• When any component has an active Bulletin, this indicator will appear, allowing the user to hover over the icon
with the mouse to see Bulletin.

3) Active Tasks:
• The number of tasks that are currently being executed by the components within this Process Group.
• Here, we can see that the Process Group is currently performing one task.
• If the NiFi instance is clustered, this value represents the number of tasks that are currently executing across all
nodes in the cluster.
4) Comments:
• When the Process Group is added to the canvas, the user is given the option of specifying Comments in order
to provide information about the Process Group.
• The comments can later be changed by right-clicking on the Process Group and clicking the “Configure” menu
option.
• In this example, the Comments are set to “Example Process Group.”

5) Statistics:
• Process Groups provide statistics about the amount of data that has been processed by the Process Group in
the past 5 minutes as well as the amount of data currently enqueued within the Process Group.
• The following elements comprise the “Statistics” portion of a Process Group:
• Queued: The number of FlowFiles currently enqueued within the Process Group.
• In: The number of FlowFiles that have been transferred into the Process Group through all of its Input
Ports over the past 5 minutes.
• Read/Write: The total size of the FlowFile content that the components within the Process Group have
read from disk and written to disk.
• Out: The number of FlowFiles that have been transferred out of the Process Group through its Output
Ports over the past 5 minutes.
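The "past 5 minutes" statistics behave like a rolling window. The Python sketch below shows one way such a counter could work; the slides do not describe how NiFi computes these values, so the class and its method names are assumptions for illustration.

```python
from collections import deque


class RollingStat:
    """Counts events that happened within the last window_sec seconds."""

    def __init__(self, window_sec: float = 300):
        self.window_sec = window_sec
        self.events = deque()   # (timestamp, count) pairs, oldest first

    def record(self, timestamp: float, count: int = 1):
        self.events.append((timestamp, count))

    def total(self, now: float) -> int:
        # Drop events that have fallen out of the window, then sum the rest.
        while self.events and self.events[0][0] <= now - self.window_sec:
            self.events.popleft()
        return sum(count for _, count in self.events)
```

For example, 10 FlowFiles recorded at t=0 and 5 at t=200 give a total of 15 at t=250, but only 5 at t=400, because the first batch is then older than five minutes.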

6) Component Counts:
• The Component Counts element provides information about how many components of each type exist within the
Process Group.
• The following provides information about each of these icons and their meanings:
• Transmitting Ports: The number of RPG Ports that currently are configured to transmit/pull data.
• Non-Transmitting Ports: The number of RPG Ports that currently have their transmission disabled.
• Running Components: The number of Processors, Ports that are currently running.
• Stopped Components: The number of Processors, Ports that are currently not running.
• Invalid Components: The number of Processors, Ports that are enabled but are not in a valid state.
• Disabled Components: The number of Processors, Ports that are currently disabled.

Lab: Processor Group
Anatomy of a Remote
Processor Group
Anatomy of a Remote Processor Group
• When creating a DataFlow, it is often necessary to transfer data from one instance of NiFi to another.
• For this reason, NiFi provides the concept of a Remote Process Group.
• From the User Interface, the Remote Process Group looks similar to the Process Group.
• The information rendered about a RPG is related to the interaction that occurs between this instance of NiFi
and the remote instance.
The Remote Process Group consists of the following elements:

Anatomy of a Remote Processor Group (cont)
1) Transmission Status:
• The Transmission Status indicates whether or not data Transmission between this instance of NiFi and the
remote instance is currently enabled.
• The icon shown will be the Transmission Enabled icon if any of the Input Ports or Output Ports is currently
configured to transmit or the Transmission Disabled icon if all of the Input Ports and Output Ports that are
currently connected are stopped.
2) Remote Instance Name:

• This is the name of the NiFi instance that was reported by the remote instance.
• When the Remote Process Group is first created, before this information has been obtained, the URL of the
remote instance will be shown here instead.

3) Remote Instance URL:
• This is the URL of the remote instance that the Remote Process Group points to.
• This URL is entered when the Remote Process Group is added to the canvas and it cannot be changed.
4) Secure Indicator:
• This icon indicates whether or not communications with the remote NiFi instance are secure.
• If communications with the remote instance are secure, this will be indicated by the “locked” icon.
• If the communications are not secure, this will be indicated by the “unlocked” icon.
• If the communications are secure, this instance of NiFi will not be able to communicate with the remote
instance until an administrator for the remote instance grants access.

7) 5-Minute Statistics: Two statistics are shown for Remote Process Groups: Sent and Received.
8) Comments: The Comments that are provided for a Remote Process Group are not comments added by the
users of this NiFi but rather the Comments added by the administrators of the remote instance.
9) Last Refreshed Time: The information that is pulled from a remote instance and rendered on the Remote
Process Group in the User Interface is periodically refreshed in the background.

Remote Process Group
Transmission
Individual Port Transmission
• There are times when the DFM may want to enable/disable transmission for a specific Port in the RPG.
• This can be accomplished by right-clicking on the RPG and choosing the “Remote ports” menu item.
• This provides a configuration dialog from which each Port can be configured:
To view configurations
Enable/Disable whole transmission
Enable/Disable Remote ports to control transmission

Port Transmission Configuration
URL of remote instance (https:// if secure)
Left side: list of input ports
Right side: list of output ports
Connected input port: transmitting
Output port: Not Connected
Data transmitted is compressed or not
Batch Settings
NiFi Site-to-Site
NiFi Site-To-Site
• Direct communication between two NiFi instances
• Push to Input Port on receiver, or Pull from Output Port on source
• Communicate between clusters, standalone instances, or both
• Handles load balancing and reliable delivery
• Secure connections using certificates (optional)
http://node1:8080/nifi http://node5:8080/nifi

Site-To-Site Push
• Source connects Remote Process Group to Input Port on destination
• Site-To-Site takes care of load balancing across the nodes in the cluster
Diagram: the RPG on a standalone NiFi pushes to the Input Port on Node 1, Node 2, and Node 3 of a cluster.

Site-To-Site Pull
• Destination connects Remote Process Group to Output Port on the source
• If source was a cluster, each node would pull from each node in cluster
Diagram: the RPGs on Node 1, Node 2, and Node 3 of a cluster pull from the Output Port on a standalone NiFi.

Site-To-Site Client
• Code for Site-To-Site broken out into reusable module
• https://github.com/apache/nifi/tree/master/nifi-commons/nifi-site-to-site-client
• Can be used from any Java program to push/pull from NiFi
Diagram: a Java program using the Site-To-Site Client connects to the Output Port on Node 1, Node 2, and Node 3 of a cluster.

Site-to-Site : How to..
• To communicate with a remote NiFi instance via Site-to-Site, simply drag a Remote Process Group onto the
graph and enter the URL of the remote NiFi instance.
• The URL is the same URL you would use to go to that instance’s User Interface.
• If the remote instance is a cluster, enter any node’s URL for the RPG.
• When you drag the connection, you will have a chance to choose which Port to connect to.
• Note that it may take up to one minute for the RPG to determine which ports are available.
• If the connection starts from the RPG, you are pulling data from the remote instance.
• If the connection ends on the RPG, you are pushing data to the remote instance.

Site-to-Site : Benefits
Using Site-to-Site provides the following benefits:
• Easy to configure: After entering the URL of the remote NiFi, the ports are automatically discovered.
• Secure: Site-to-Site optionally lets us encrypt data and provide authentication and authorization.
• Scalable: As nodes in the remote cluster change, the changes are automatically detected and handled.
• Efficient: Site-to-Site allows batches of FlowFiles to be sent at once in order to avoid the overhead of
establishing connections and making multiple round-trip requests.
• Reliable: Checksums are automatically calculated by sender and receiver and compared after the data has
been transmitted.

Site-to-Site : Benefits
• Automatically load balanced: As nodes come online or drop out of the remote cluster, or a node’s load
becomes heavier or lighter, the amount of data that is directed to that node will automatically be adjusted.
• Adaptable: When a connection is made to a remote NiFi instance, a handshake is performed in order to
negotiate which protocol and which version of the protocol will be used.
• FlowFile Attributes: When a FlowFile is transferred over this protocol, all of the FlowFile’s attributes are
automatically transferred with it.
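The "Efficient" and "Reliable" points can be pictured together: a batch is sent in one go, and both sides compare checksums afterwards. The Python sketch below simulates that idea in-process. It is not the Site-to-Site protocol; the use of CRC32 and the function names are assumptions made for the example.

```python
import zlib


def checksum(batch: list[bytes]) -> int:
    """Running CRC32 over the contents of every FlowFile in the batch."""
    crc = 0
    for content in batch:
        crc = zlib.crc32(content, crc)
    return crc


def transfer(batch: list[bytes], corrupt: bool = False) -> bool:
    """Simulate sending one batch; return True if the checksums agree."""
    sender_crc = checksum(batch)
    received = [bytes(content) for content in batch]
    if corrupt:
        # Simulate damage in transit by replacing the last byte of the first item.
        received[0] = received[0][:-1] + b"?"
    receiver_crc = checksum(received)
    return sender_crc == receiver_crc
```

In this model a mismatch would mean the batch is not confirmed and has to be sent again, while a match lets both sides commit the transfer.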

Lab: Remote Processor Group
Working With Attributes
Working With Attributes
• Each FlowFile is created with several Attributes, and these may change over the life of the FlowFile.
• The concept of a FlowFile is extremely powerful. It provides three primary benefits:
1) Decision while routing:
• It allows the user to make routing decisions in the flow so that FlowFiles that meet some criteria can be
handled differently than other FlowFiles.
• This is done using the RouteOnAttribute and similar Processors.
2) Processors configuration:
• Attributes are used in order to configure Processors in such a way that the configuration of the Processor is
dependent on the data itself.
• For instance, the PutFile Processor is able to use the Attributes in order to know where to store each FlowFile.
3) Information about Data
• The Attributes provide extremely valuable context about the data.
• This is useful when reviewing the Provenance data for a FlowFile.

Common Attributes
Each FlowFile has a minimum set of Attributes:
• filename: A filename that can be used to store the data to a local or remote file system.
• path: The name of a directory that can be used to store the data to a local or remote file system.
• uuid: A Universally Unique Identifier that distinguishes the FlowFile from other FlowFiles in the system.
• entryDate: The date and time at which the FlowFile entered the system (i.e., was created).
• lineageStartDate: Any time that a FlowFile is cloned, merged, or split, this results in a "child" FlowFile being
created. This value represents the date and time at which the oldest ancestor entered the system.
• fileSize: This attribute represents the number of bytes taken up by the FlowFile’s Content.
Note: the uuid, entryDate, lineageStartDate, and fileSize attributes are system-generated and cannot be changed.

Extracting Attributes
• NiFi provides several different Processors out of the box for extracting Attributes from FlowFiles.
• A number of Processors are available for attribute extraction.
• This is a very common use case for building custom Processors, as well.
• Many custom Processors are written to:
• Understand a specific data format.
• Extract pertinent information from a FlowFile’s content.
• Create Attributes to hold that information.
• Decisions can then be made about how to route or process the data.
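The extract-then-decide pattern can be sketched in Python using regular expressions. This is an illustration of the idea only, not a NiFi Processor; the function name, the sample content, and the attribute names are hypothetical.

```python
import re


def extract_attributes(content: str, patterns: dict[str, str]) -> dict[str, str]:
    """Build new attributes from content.

    patterns maps an attribute name to a regular expression; the first
    capture group of the first match becomes the attribute value.
    """
    attributes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, content)
        if match:
            attributes[name] = match.group(1)
    return attributes


content = "order_id=42 region=EU\npayload follows"
patterns = {"order.id": r"order_id=(\d+)", "region": r"region=(\w+)"}
print(extract_attributes(content, patterns))  # {'order.id': '42', 'region': 'EU'}
```

Once the values are held as Attributes, later steps can route or process the FlowFile without reading its content again.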

Routing on Attributes
• One of the most powerful features of NiFi is the ability to route FlowFiles based on their Attributes.
• The UpdateAttribute and RouteOnAttribute Processors help with adding Attributes and routing on them, respectively.
How Attributes are analyzed:

• Each FlowFile’s Attributes will be compared against the configured properties to determine whether or not the
FlowFile meets the specified criteria.
• The value of each property is expected to be an Expression Language expression that returns a boolean.
How to Route FlowFile:
• Processor evaluates Expression Language expressions provided against the FlowFile’s Attributes.
• The Processor determines how to route the FlowFile based on the Routing Strategy selected.
• The most common strategy is the "Route to Property name" strategy.
• With this strategy selected, the Processor will expose a Relationship for each property configured.
• If Attributes satisfy the expression, a copy of the FlowFile will be routed to corresponding Relationship.
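The "Route to Property name" strategy can be modelled in a few lines of Python. In this sketch each property value is a Python predicate standing in for an Expression Language expression that returns a boolean; the function name and the `unmatched` fallback relationship are assumptions for the example.

```python
def route_to_property_name(attributes: dict, properties: dict) -> list[str]:
    """Return the relationships a FlowFile is routed to.

    properties maps a property name (which is also the relationship name)
    to a predicate over the FlowFile's attributes.
    """
    matched = [name for name, predicate in properties.items()
               if predicate(attributes)]
    return matched or ["unmatched"]


properties = {
    "large": lambda a: int(a["fileSize"]) > 1000,
    "text": lambda a: a["filename"].endswith(".txt"),
}
print(route_to_property_name({"filename": "a.txt", "fileSize": "5000"}, properties))
# ['large', 'text']
```

A FlowFile that satisfies several expressions is routed to each matching relationship, which is why the example returns two names.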

Adding User-Defined Attributes
• It is common for users to want to add their own user-defined Attributes to each FlowFile at a particular place in
the flow.
• The UpdateAttribute Processor is designed specifically for this purpose.
• Any number of properties can be added by clicking the "New Property" icon in Properties tab.
• The user should enter the name of the property and then a value.
• For each FlowFile that is processed by this UpdateAttribute Processor, an Attribute will be added.
• The value of the property may contain the Expression Language, as well.
• This allows Attributes to be modified based on other Attributes/external values/flowfile content itself.
• In addition, the UpdateAttribute Processor has an Advanced UI that allows the user to configure a set of rules for
which Attributes should be added when.

Expression Language / Using Attributes in Property
• As we extract Attributes from FlowFiles' contents and add user-defined Attributes, they don’t do us much good
as an operator unless we have some mechanism by which we can use them.
• The NiFi Expression Language (NEL) allows us to access and manipulate FlowFile Attribute values as we configure our flows.
• Not all Processor properties allow the Expression Language to be used, but many do.
• For properties that do support the NEL, it is used by adding an expression within the opening ${ tag and the
closing } tag.
• An expression can be as simple as an attribute name. For example, to reference the uuid Attribute, we can
simply use the value ${uuid}.
• We can perform a number of functions and comparisons on Attributes.
• We can also embed one expression within another, for example: ${attr1:equals( ${attr2} )}
• The Expression Language contains many different functions that can be used in order to perform the tasks
needed for routing and manipulating Attributes.
• We will learn more about EL in coming sessions.

Custom Properties Within Expression Language
• In addition to using FlowFile attributes, you can also define custom properties for Expression Language use.
• Defining custom properties gives you additional flexibility in processing and configuring dataflows.
• For example, you can refer to custom properties for connection, server, and service properties.
• Once you create custom properties, you can identify their location in the nifi.variable.registry.properties field
in the nifi.properties file.
• After you have updated the nifi.properties file and restarted NiFi, you are able to use custom properties as
needed.

Demo: Attributes
NiFi Expression Language
NiFi Expression Language - Introduction
• The NiFi expression language is a flexible, consistent mechanism for manipulating FlowFile attributes.
• In addition, the expression language provides access to system environment variables and JVM properties.
• One of the first questions you may ask is, “Can I use the expression language anywhere or is it only supported
on particular processors?”.
• While expression language statements are written in the value field on the properties tab of a processor’s
configuration window, not every ‘Property’ supports the use of expression language.
• NiFi has built-in ‘tooltips’ that appear on both pre-defined and user-defined processor properties.
• If you hover over the tooltip icon, it states whether or not expression language is supported.

Structure of a NiFi Expression
• The NiFi Expression Language always begins with the start delimiter ${ and ends with the end delimiter }.
• Between the start and end delimiters is the text of the Expression itself.
• In its most basic form, the Expression can consist of just an attribute name.
 For example, ${filename} will return the value of the “filename” attribute.
• In a slightly more complex example, we can instead return a manipulation of this value.
 For example, ${filename:toUpper()} will return the upper-case value of the “filename” attribute.
• Continuing with our example, we can chain together multiple functions by using the expression:
 ${filename:toUpper():equals('HELLO.TXT')} will return true if the file name is HELLO.TXT
• There is no limit to the number of functions that can be chained together.
• Any FlowFile attribute can be referenced using the Expression Language.
• If the attribute name contains a “special character,” the attribute name must be escaped by quoting it.
 These are considered special characters: $ | { } ( ) [ ] , * ; / : ' \t \r \n and space.

• Let us consider an example of adding an attribute and modifying its value using the Expression Language.
Adding an attribute myAttribute with the value ‘My-new-file’:
Logging the FlowFile as it passes through shows the new attribute:
Changing the value of myAttribute to ‘an old file!’ using NEL:
Logging the FlowFile again shows that the value has been updated:

• We successfully changed the value of ‘myAttribute’ from ‘My-new-file’ to ‘an old file!’.
• Expression language statements are generally processed left to right.
• NiFi must know where an expression language statement starts and where it ends.
• The ‘${‘ tells NiFi you are starting a statement and the ‘}’ tells NiFi where that statement ends.
• NiFi will evaluate the statement that falls between them.
• In this example the subject is myAttribute, and the function applied to it is replace().

Expression Language functions:
• Functions provide a convenient way to manipulate and compare values of attributes.
• The Expression Language provides different functions to meet the needs of an automated dataflow.
• Each function takes zero or more arguments and returns a single value.
• These functions can then be chained together to create powerful Expressions to evaluate conditions and
manipulate values.
• Function inputs can have one of 4 different data types:
1. String literal – Must be enclosed in matching quotes. Both single quotes and double quotes are
supported.
- Valid: 'myAttribute' or "myAttribute"
- Invalid: 'myAttribute, 'myAttribute", myAttribute", myAttribute (quotes missing or mismatched)
2. Boolean literal – must be true or false
3. Number literal – any length number consisting of only 0s through 9s
- Valid: 12345
- Invalid: 123.45 (decimal and negative numbers are not supported as function inputs)
4. Embedded Expressions – A NiFi expression language statement serving as an input to a function.

Embedded Expressions
• Up to this point, all the expression language functions we have shown used static values as inputs.
• What if we want to use the value of some other attribute, or the result of another expression language
statement, as an input to a function?
• Expression language statements are generally processed left to right, but in the case of embedded
expressions that rule of thumb does not exactly apply.
• The embedded expression(s) must evaluate to a result before the function that contains them can be
evaluated.
• For example, in ${attr1:equals( ${attr2} )} the embedded ${attr2} is resolved first, and its value is then
passed to equals().

Expression Language functions
• Expression language functions fall into the following groups:
 Boolean logic: Compares an attribute value against some other value and returns a Boolean value.
 String manipulation: Manipulates a String in some way.
 Searching: Searches the subject for some value.
 Mathematical operations & numeric manipulation: Performs arithmetic calculations.
 Date manipulation: Manipulates dates or converts them from one format to another.
 Data type coercion: Converts from string to number or vice versa.
 Subjectless functions: Functions that are not expected to have a subject.
 Evaluating multiple attributes: Helps evaluate the same condition on multiple attributes.
• The below table provides a quick listing of these various expression language functions:

Expression Language functions:

The Expression Language Editor:
• NiFi has an embedded expression language editor that helps users with the syntax of the expression language
statements.
• It also provides users with the ability to add comments directly in to their statements.

• The editor provides many features, which can help greatly with creating valid expression language statements.
• These features include:
1) Syntax color coding:
• Various elements of the expression language statement are color coded (subjects, functions, function
inputs, etc.).
• If invalid syntax is detected, the coloring is removed.
2) Structure Highlighting:
• When an open curly bracket, open square bracket, or open parenthesis is highlighted, the
corresponding close curly bracket, close square bracket, or close parenthesis is also highlighted.
• You can avoid the most common syntax error by adding the closing character immediately after the opening
one, then moving your cursor back one space to continue your expression language statement.

3) Multi-line support:
• A NiFi expression language statement can be written across multiple lines.
• This helps with the readability of more complex statements.
• To add a new line, hold Shift while pressing Enter.
• Pressing Enter alone closes the editor.
4) Comments:
• Comments can be added at the end of any line in the editor.
• Use the pound/hash symbol (#) to designate where a comment begins.
• Comments continue to the end of the current line.
• If you want to wrap your comment over multiple lines, every line will need to start with a #; otherwise,
each new line will be treated as part of the expression language statement.

5) Auto-complete:
• Functions are case sensitive so getting them syntactically correct can be challenging at first.
• The auto-complete function can be used to get the syntax correct.
• Single clicking on a function will open a window that provides a brief description of the function.
• You can also reduce the size of the returned list by continuing to type the name of a function after the
auto-complete window is displayed.
• After typing ${ , use Ctrl + spacebar to see a list of available functions. In this case, only subjectless
functions will be shown.
• After typing ${<subject>: , use Ctrl + Spacebar to see a list of functions that require a subject.

If/then/else and the expression language:
• The expression language does not support writing if/then statements.
• For this purpose an advanced tab feature was added to the UpdateAttribute processor.
• From the configuration window of an UpdateAttribute processor select the Advanced button.
Figure: UpdateAttribute Advanced tab, showing the rule created, its condition (NEL), and its action(s)

• The advanced tab of the UpdateAttribute processor allows you to manipulate attributes based on some
defined condition.
• The Advanced tab supports two FlowFile Policies:
1) Use clone – (default)
If more than one rule’s conditions are true for a single FlowFile, a new FlowFile (clone) will be created
for each.
2) Use original
If more than one rule’s conditions are true for a single FlowFile, the action(s) for each rule will be
applied to the same FlowFile.

• The advanced tab allows you to create as many “Rules” as you want.
• Rules are evaluated from the top down and can be re-ordered simply by drag and drop.
• The order is important when the FlowFile Policy is set to use original.
• If multiple rules try to update the same attribute, the very last rule that is evaluated will have its value
applied to that attribute.
• Each rule has a defined condition(s) and corresponding action(s).
• All Conditions defined for a single Rule must evaluate to true before Action(s) will be applied.

What about the else?
• You can think of the Properties tab (not inside advanced) of the UpdateAttribute processor as the else
condition.
• When an attribute/property is defined both in a rule within the advanced tab and in the properties tab, what
was defined in the properties tab will only be applied if it was not updated by the rule.
Figure: two example flows. In each, myAttribute is first set to My-new-file. The Advanced tab rule (if
myAttribute=My-new-file) says update to ‘My-if-else-file’, and the Properties tab says update to ‘an old file!’.
Where the rule condition passes, the rule's update is applied; where the condition fails, the Properties tab
value is applied and myAttribute is updated to ‘an old file!’.
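
The if/else behaviour described above can be illustrated outside NiFi. The Python sketch below is only a model of the 'use original' policy as these slides describe it, not NiFi's implementation: the function update_attribute and its data structures are hypothetical, and it assumes rule conditions are checked against the incoming attribute values.

```python
from typing import Callable, Dict, List, Tuple

Attributes = Dict[str, str]
Condition = Callable[[Attributes], bool]
# A rule pairs its conditions with the attribute updates (actions) it applies.
Rule = Tuple[List[Condition], Attributes]


def update_attribute(attrs: Attributes, rules: List[Rule], properties: Attributes) -> Attributes:
    """Hypothetical model of UpdateAttribute with the 'use original' policy."""
    result = dict(attrs)
    updated_by_rule = set()
    for conditions, actions in rules:  # rules are evaluated from the top down
        if all(condition(attrs) for condition in conditions):
            result.update(actions)  # a later matching rule overwrites an earlier one
            updated_by_rule.update(actions.keys())
    for name, value in properties.items():  # the Properties tab acts as the "else"
        if name not in updated_by_rule:
            result[name] = value
    return result


rules = [([lambda a: a.get("myAttribute") == "My-new-file"], {"myAttribute": "My-if-else-file"})]
properties = {"myAttribute": "an old file!"}

matched = update_attribute({"myAttribute": "My-new-file"}, rules, properties)
unmatched = update_attribute({"myAttribute": "something-else"}, rules, properties)
print(matched["myAttribute"], "|", unmatched["myAttribute"])
```

When the rule's condition holds, the rule's value wins; otherwise the Properties tab value is applied, which is the "else" branch.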

Adding text using Expression Language
• A NiFi expression language statement can be as simple as ${filename}.
• In the case where only a subject is provided, the expression language will simply return the value assigned to
that subject/attribute.
• This simple expression ${someAttribute} can be used to return the value of someAttribute even when ‘Supports
expression language: false’
• Since NiFi understands that an Expression language statement starts with ${ and ends with }, additional text
can be pre-pended or appended to the result.
• Take this simple example from an UpdateAttribute processor, where the filename property is set to
${filename}.packaged:

Adding text using Expression Language
• NiFi will resolve ${filename} to the existing value of filename on the incoming FlowFile.
• It will then append .packaged to the end of that value and update the attribute filename on the outgoing
FlowFile.
• You can just as easily pre-pend text. The same expression language statement can be extended to pre-pend
‘systemA-’ and append ‘.packaged’ to the filename:
Filename set to testFile-01
• systemA-${filename}.packaged
Filename set to testFile-01.packaged
Using Multiple expressions
• Now that we know what a basic expression language statement looks like and have learned that it can be
wrapped with strings, let’s take a look at using multiple expressions together.
• Consider a system running NiFi that ingests the syslog file many times per day.
• Every syslog file that is ingested will have the exact same filename.
• Assuming the original filename is always ‘system.log’ and what we want is ‘system.<some uuid>.log’ as the
new filename, we can use multiple expressions like so: ${filename:substringBeforeLast('.')}.${uuid}.log
• We used the substringBeforeLast function so that we only capture the filename up to the last ‘.’. We then
appended a ‘.’ followed by our second expression, which returns the uuid assigned to the FlowFile. Finally, we
added the .log back by appending it to the end.
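
The renaming described above can be mimicked outside NiFi. The following Python sketch is only an illustration: substring_before_last is a hypothetical stand-in for the Expression Language function substringBeforeLast, the UUID is generated locally rather than taken from a FlowFile, and the expression is assumed to have the form ${filename:substringBeforeLast('.')}.${uuid}.log, as the description suggests.

```python
import uuid


def substring_before_last(value: str, delimiter: str) -> str:
    """Hypothetical stand-in for substringBeforeLast: text before the last delimiter.

    Assumption: if the delimiter is absent, the value is returned unchanged.
    """
    head, sep, _tail = value.rpartition(delimiter)
    return head if sep else value


def unique_log_name(filename: str, flowfile_uuid: str) -> str:
    # Mirrors the assumed expression: ${filename:substringBeforeLast('.')}.${uuid}.log
    return f"{substring_before_last(filename, '.')}.{flowfile_uuid}.log"


# Locally generated UUID, standing in for the FlowFile's uuid attribute.
example_uuid = str(uuid.uuid4())
new_name = unique_log_name("system.log", example_uuid)
print(new_name)
```

Because every FlowFile has its own uuid, each ingested copy of system.log ends up with a distinct name.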

Using Multiple expressions
• The UpdateAttribute processor would be configured as follows:
• As you can see from the output, each file now has a unique name.
• NiFi has no limit on the number of expression language statements you can chain together.

Function Chaining an example
• Assume your NiFi instance is receiving application log files with the following name structure:
filename = application-hostname-XYZ-20151009.log
• What if we want to extract the date (20151009 in yyyyMMdd format) from this filename and put it in its own
attribute named logDate?
• We would use the UpdateAttribute processor with a new property added as follows:

Function Chaining an example
• What if you wanted to take that expression one step further and…
• Extract the year (2015) and put it in an attribute named logYEAR.
• Extract the month (10) and put it in an attribute named logMonth.
• Extract the day (09) and put it in an attribute named logDay.
• How do we break this apart? We would use the function substring() here.
• The substring begins at the starting index and extends to the character at ending index -1.
• Remember that we can continue to chain as many functions as needed in a single expression language
statement.
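
The index arithmetic behind substring() can be checked with a short Python sketch. It is an illustration only: the chain shown in the first comment is one possible way to obtain logDate and is an assumption, not necessarily the expression used on the slide. Python slicing is used because it follows the same rule as described above (zero-based start, end index exclusive).

```python
filename = "application-hostname-XYZ-20151009.log"

# One possible chain (an assumption, not necessarily the one on the slide):
#   ${filename:substringAfterLast('-'):substringBefore('.')}
log_date = filename.rsplit("-", 1)[-1].split(".", 1)[0]

# substring(start, end) starts at index `start` and stops at index end - 1,
# which matches Python slicing value[start:end].
log_year = log_date[0:4]    # like ${logDate:substring(0,4)}
log_month = log_date[4:6]   # like ${logDate:substring(4,6)}
log_day = log_date[6:8]     # like ${logDate:substring(6,8)}

print(log_date, log_year, log_month, log_day)
```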

LAB: NiFi Expression language
Working With Templates
Templates
• Processors, connections, funnels, etc. can be thought of as the most basic building blocks for constructing a
DataFlow.
• At times, using small building blocks can become tedious if the same logic gets repeated several times.
• To solve this issue, NiFi provides the concept of a Template.
• A Template is a way of combining these basic building blocks into larger building blocks.
• Once a DataFlow has been created, parts of it can be formed into a Template.
• A Template can be dragged onto the canvas, or exported as an XML file and shared with others.
• Templates received from others can be imported into an instance of NiFi and dragged onto the canvas.

Creating a Template
• Select the components to include in the template.
• Use the Shift key to select multiple components.
• Select the Create Template Icon from the middle toolbar at the top of the screen.
• Provide a name and optionally comments about the template.
• Click the Create button.
Enter Name & Description
Select all components and click save

Instantiating a Template
• When a Template is dragged onto the canvas, the DFM is asked to choose which Template to add to the canvas:
Drag and drop on canvas
Choose Template, click OK
Template Added
Managing Templates
• One of the most powerful features of NiFi Templates is the ability to easily export a Template to an XML file and
to import a Template that has already been exported.
• This provides a very simple mechanism for sharing parts of a DataFlow with others.
• You have options to:
• Import a Template
• Export a Template
• Remove a Template

Import a Template
Click Select
Select the template to import, click Open
Click Import
Template imported

Export a Template
Click the Template Management button
In Template Management, click the export button
Template saved as XML to the local machine

Remove a Template
Click the Template Management button
Choose the template, click the remove button
Click Yes to confirm

Demo: Working With Templates
HDF Dataflow Optimization
What is meant by Dataflow optimization?
• Dataflow Optimization isn’t an exact science with hard and fast rules that will always apply.
• But it is more of a balance between:
• System resources (memory, network, disk space, disk speed and CPU),
• Number of files
• Size of those files
• The types of processors used,
• The size of dataflow that has been designed and
• The underlying configuration of NiFi on the system.

Grouping Common Functionality
• Group common functionality when and where it makes sense
• A simple approach to dataflow optimization is to group repeated operations into a Process Group.
• Then pass the data through the group and then continue through the flow.
• When repeating the same process in multiple places on the graph, try to put the functionality into a single
group.

Use the fewest number of processors
• The simplest approach to optimization is to use the fewest processors needed to accomplish the task.
• Use one processor instead of many of the same type.
• For example, if you have multiple GetSFTP processors pulling from the same server, use one and use the
attribute path on the flow file to determine how to separate the data.
• Here is a simple example using the GetSFTP processor:

How do I prevent my system from being overwhelmed?
• Another approach to optimization is to prevent the dataflows from overwhelming the underlying system and
affecting NiFi software stability and/or performance.
• NiFi corruption can occur in some cases.
• For example, suppose you have 50GB for the content_repository partition and data is normally pulled from the
server in groups of 10GB.
• What would happen if the data source was down for a few hours and then came back online with a backlog of
75GB?
• Once the content_repository filled, the GetSFTP processor would start generating errors trying to write the files
it was attempting to pull.

Preventing my system from being overwhelmed
• The performance of NiFi would drop because the disk(s) where the content_repository resides would be
trying to write new files at the same time NiFi is trying to access the disk(s) to deal with the current
FlowFiles.
• Back pressure can be configured based on the number of objects and/or the size of the files in the connection.
• Using back pressure is one way to prevent this scenario. Below is an example:

Preventing my system from being overwhelmed (cont.)
• Back pressure is set in the connection from the GetSFTP processor to the ControlRate processor:
• If the backlog of data reaches 1GB, then the GetSFTP processor would stop pulling data until the backlog
dropped below the threshold and then it would resume pulling data from the source system.
• This would allow NiFi to pull the backlog of data at a rate that would not overutilize the system resources.
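
The stop-and-resume behaviour of back pressure can be sketched in a few lines of Python. This is a hypothetical model, not NiFi code: the Connection class and pull_from_source function are invented for illustration, the 1GB threshold is taken as 1,000,000,000 bytes, and the 400MB file size is an arbitrary assumption.

```python
from collections import deque
from typing import List

BACK_PRESSURE_BYTES = 1_000_000_000  # 1GB data size threshold from the example (decimal GB assumed)


class Connection:
    """Hypothetical model of a connection with a data size back pressure threshold."""

    def __init__(self, threshold_bytes: int) -> None:
        self.threshold_bytes = threshold_bytes
        self.queue: deque = deque()
        self.queued_bytes = 0

    def back_pressure_applied(self) -> bool:
        return self.queued_bytes >= self.threshold_bytes

    def enqueue(self, size: int) -> None:
        self.queue.append(size)
        self.queued_bytes += size

    def dequeue(self) -> int:
        size = self.queue.popleft()
        self.queued_bytes -= size
        return size


def pull_from_source(conn: Connection, backlog: List[int]) -> int:
    """Upstream processor: pulls files only while back pressure is not applied."""
    pulled = 0
    while backlog and not conn.back_pressure_applied():
        conn.enqueue(backlog.pop(0))
        pulled += 1
    return pulled


conn = Connection(BACK_PRESSURE_BYTES)
backlog = [400_000_000] * 10                 # ten 400MB files waiting on the source system
first_pull = pull_from_source(conn, backlog)   # stops once the queue reaches the threshold
conn.dequeue()                                 # the downstream processor consumes one file
second_pull = pull_from_source(conn, backlog)  # pulling resumes below the threshold
print(first_pull, second_pull, len(backlog))
```

In this model the threshold is only checked before each pull, so the queue can overshoot it by one file before the source pauses.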

ControlRate processor to Restrict the rate
• Adding the ControlRate processor to the flow will ensure
that the backlog of data will not overwhelm any
processors further down the flow path.
• This method of combining back pressure with using the
ControlRate processor would be easier than trying to set
backpressure in every connection through the complete
flow path.

ControlRate processor to Restrict the rate
• In this flow snippet, the UpdateAttribute processor is used to add the filesize attribute to be used in the
ControlRate processor.
• The configuration of the ControlRate processor is shown below.
• It is set to allow ~50000 bytes/minute, based on the cumulative size of files through the processor.
• The rate control criteria could be data rate, FlowFile count, or attribute value.
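
To get a feel for a data rate limit of roughly 50000 bytes per minute, the Python sketch below assigns queued files to one-minute periods. It is a simplified fixed-window throttle written for illustration and is not how the ControlRate processor is implemented; the function assign_periods and the file sizes are hypothetical.

```python
from typing import List


def assign_periods(file_sizes: List[int], max_bytes_per_period: int) -> List[int]:
    """Return, for each file in queue order, the index of the period in which it is released.

    Simplified model: a file is held for the next period when releasing it would
    push the current period over the limit. A file larger than the limit still
    gets a period of its own so the flow never stalls.
    """
    periods = []
    current_period = 0
    used = 0
    for size in file_sizes:
        if used > 0 and used + size > max_bytes_per_period:
            current_period += 1
            used = 0
        periods.append(current_period)
        used += size
    return periods


schedule = assign_periods([20_000, 20_000, 20_000, 60_000, 10_000], 50_000)
print(schedule)
```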

Understanding what resources a processor uses
• Another method to use in dataflow optimization is to understand the resources needed by each processor.
• For instance, the CompressContent processor will use 1 CPU per concurrent task, so if this
processor has 4 concurrent tasks and there are four files in the queue, then 4 CPUs will be utilized by
this processor until the files have been compressed.
• For small files this becomes less of a resource bottleneck than it does for large files.
• A good example might be separating small files from medium and large files and feeding 3 CompressContent
processors: 2 threads for small files, and 1 thread each for medium and large files.

Understanding what resources a processor uses
• As discussed, let’s take the flow below as an example, in which files are sent to different CompressContent
processors based on size.
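
The routing decision in such a flow amounts to a size classification. The Python sketch below illustrates it; the 1 MB and 100 MB thresholds are hypothetical (the slides do not give any), and the concurrent task counts follow the suggestion of 2 threads for small files and 1 thread each for medium and large files.

```python
SMALL_MAX_BYTES = 1 * 1024 * 1024      # hypothetical threshold: up to 1 MB counts as "small"
MEDIUM_MAX_BYTES = 100 * 1024 * 1024   # hypothetical threshold: up to 100 MB counts as "medium"

# Concurrent tasks per CompressContent processor, following the suggestion above.
CONCURRENT_TASKS = {"small": 2, "medium": 1, "large": 1}


def size_class(file_size: int) -> str:
    """Pick the CompressContent processor a file of this size would be routed to."""
    if file_size <= SMALL_MAX_BYTES:
        return "small"
    if file_size <= MEDIUM_MAX_BYTES:
        return "medium"
    return "large"


routes = {size: size_class(size) for size in (4_096, 5_000_000, 2_000_000_000)}
for size, route in routes.items():
    print(size, route, CONCURRENT_TASKS[route])
```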

Processor Status Helps to Optimize
• Each processor provides a great deal of information that can assist the DFM in determining
where the trouble spots could be in the flow; see the description below:
• Use the information provided by the processors (number of reads/writes and tasks/time per task) to find
“hot spots” on the graph.
• For instance, if there is a large number of tasks but the amount of data traversing the processor is low, then
the processor might be configured to run too often or with too many concurrent tasks.
• Few completed tasks along with a high task time indicates that the processor is CPU intensive.
• If the dataflow volume is high and a processor shows a high number of completed tasks and a high task time,
performance can be improved by increasing the run duration in the processor scheduling.

Data backlog in Connection
• If there is a connection in the flow where data is always backlogged, it can be a point of concern if any delay in
processing the data is unacceptable.
• But simply adding more concurrent tasks to the processor with the backlog can lead to thread starvation
in another part of the graph.
• Here again, the DFM must take care to understand why the data is backing up at this particular point in
the graph.
• It could be a processor that is very CPU intensive.
• The files might be so large that each one requires an expensive read and write.
• If resources aren’t an issue, then add more concurrent tasks and see if the backlog is resolved.
• If resources are an issue, then either the flow will have to be redesigned to better utilize what is available, or
else the workload will have to be spread across a cluster.
• Files could be backlogging because the processor that is working on them is I/O intensive.
• Adding more threads will not help process files faster but instead will lead to thread starvation.

When to Cluster
• Physical resource exhaustion can and does occur even with an optimized dataflow.
• When this occurs the best approach is to spread the data load out across multiple NiFi instances in a NiFi
cluster.
• NiFi Administrators or Dataflow Managers (DFMs) may find that using one instance of NiFi on a single server
is not enough to process the amount of data they have. So, one solution is to run the same dataflow on
multiple NiFi servers.
• However, this creates a management problem, because each time DFMs want to change or update the
dataflow, they must make those changes on each server and then monitor each server individually.
• By clustering the NiFi servers, it’s possible to have that increased processing capability along with a single
interface through which to make dataflow changes and monitor the dataflow.
• Clustering allows the DFM to make each change only once, and that change is then replicated to all the
nodes of the cluster.

Data Provenance
Data Provenance
• While monitoring a dataflow, users often need a way to determine what happened to a particular data object
(FlowFile).
• NiFi keeps a very granular level of detail about each piece of data that it ingests.
• As the data is processed through the system and is transformed, routed, split, aggregated, and distributed to
other endpoints, this information is all stored within NiFi’s Provenance Repository.
• Because NiFi records and indexes data provenance details as objects flow through the system, users may
perform searches, conduct troubleshooting, and evaluate things like dataflow compliance and
optimization in real time.
• By default, NiFi updates this information every five minutes, but that is configurable.

Data Provenance

Data Provenance Events
• Each point in a dataflow where a FlowFile is processed in some way is considered a "processing event".
• Various types of processing events occur, depending on the dataflow design:
 A RECEIVE event occurs when data is brought into the flow.
 A SEND event occurs when data is sent out of the flow.
 A CLONE event occurs when data is cloned.
 A ROUTE event occurs when data is routed.
 A CONTENT_MODIFIED or ATTRIBUTES_MODIFIED event occurs when the content or an attribute of a FlowFile is
changed.
 A FORK or JOIN event occurs when data objects are split or combined with other data objects.
 A DROP event occurs when a data object is removed from the flow.
 A FETCH event occurs when an existing FlowFile's contents are modified as a result of obtaining
data from an external resource.
• It is also possible to open additional dialog windows to see event details, replay data at any point within the
dataflow, and see a graphical representation of the data’s lineage, or path through the flow.

Searching for Events
• One of the most common tasks performed on the Data Provenance page is a search for a given FlowFile to
determine what happened to it.
• To do this, click the Search button in the upper-right corner of the Data Provenance page.
• This opens a dialog window with parameters that the user can define for the search.
• For example, to determine whether a particular FlowFile was received, the search shown in the following image
could be performed. It looks for:
• An Event Type of "RECEIVE"
• A FlowFile with "ABC" anywhere in its filename
• Received at any time on Jul. 28, 2016
• [The asterisk (*) may be used as a wildcard.]
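
The effect of the asterisk wildcard can be illustrated with Python's standard fnmatch module. This is only an analogy: the filenames are made up, and the provenance search may differ from shell-style matching in details such as case sensitivity.

```python
from fnmatch import fnmatchcase

# Hypothetical filenames; "*ABC*" matches "ABC" anywhere in the name.
filenames = ["report-ABC-01.csv", "ABC.txt", "summary.csv", "abc-lowercase.log"]
matches = [name for name in filenames if fnmatchcase(name, "*ABC*")]
print(matches)
```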

Details of an Event
• In the far-left column of the Data Provenance page, there is a View Details icon for each event
• Clicking this button opens a dialog window with three tabs: Details, Attributes, and Content.
Details Tab
Attributes Tab
Content Tab
Download/View content

Replaying a FlowFile
• A DFM may need to inspect a FlowFile’s content at some point in the dataflow to ensure that it is being
processed as expected.
• And if it is not being processed properly, the DFM may need to make adjustments to the dataflow and replay
the FlowFile again.
• The Content tab of the View Details dialog window is where the DFM can do these things.
• The user may also click the REPLAY button to replay the FlowFile at this point in the flow.
• Upon clicking REPLAY, the FlowFile is sent to the connection feeding the component that produced this
processing event.
Replay the flow file

Viewing FlowFile Lineage
• It is useful to see a graphical representation of the lineage or path a FlowFile took within the dataflow.
• To view lineage, click on the "Show Lineage" icon in the far-right column of the Data Provenance table.
• This opens a graph displaying the FlowFile and the various processing events that have occurred.
• The selected event will be highlighted in yellow.

FlowFile Lineage : Find Parents
• Sometimes a user may need to track the original FlowFile that another FlowFile was spawned from.
• When a FORK or CLONE event occurs, NiFi keeps track of the parent FlowFile that produced other FlowFiles.
• It is possible to find that parent FlowFile in the Lineage.
• Right-click on the event in the lineage graph and select "Find parents" from the context menu.
Click on Find parents
The graph is re-drawn with the parent FlowFile and its lineage as well.

FlowFile Lineage : Expanding an Event
• In the same way that it is useful to find a parent FlowFile, the user may also want to determine what children
were spawned from a given FlowFile.
• To do this, right-click on the event in the lineage graph and select "Expand" from the context menu.
Click on Expand
The graph is re-drawn to show the children and their lineage.

Demo: Data Provenance
Working With NiFi Cluster
NiFi Clustering
• NiFi employs a Zero-Master Clustering
paradigm.
• Each node in the cluster performs the same
tasks on the data, but each operates on a
different set of data.
• One of the nodes is automatically elected (via
Apache ZooKeeper) as the Cluster Coordinator.
• All nodes in the cluster will send
heartbeat/status information to this node, and
this node is responsible for disconnecting nodes
that do not report any heartbeat status for
some amount of time.

Why Cluster?
• NiFi Administrators or Dataflow Managers (DFMs) may find that using one instance of NiFi on a single server is
not enough to process the amount of data they have.
• So, one solution is to run the same dataflow on multiple NiFi servers.
• However, this creates a management problem, because each time DFMs want to change or update the
dataflow, they must make those changes on each server and then monitor each server individually.
• By clustering the NiFi servers, it’s possible to have that increased processing capability along with a single
interface through which to make dataflow changes and monitor the dataflow.
• Clustering allows the DFM to make each change only once, and that change is then replicated to all the nodes
of the cluster.
• Through the single interface, the DFM may also monitor the health and status of all the nodes.

NiFi Cluster Terminology
NiFi Cluster Coordinator
• A NiFi Cluster Coordinator is the node in a NiFi cluster that is responsible for carrying out tasks to
manage which nodes are allowed in the cluster and providing the most up-to-date flow to newly joining nodes.
• When a DataFlow Manager manages a dataflow in a cluster, they are able to do so through the User Interface
of any node in the cluster.
• Any change made is then replicated to all nodes in the cluster.

NiFi Cluster Node
• The nodes do the actual data processing.
• While nodes are connected to a cluster, the DFM may access the UI for any of the individual nodes.
• In a NiFi cluster, the same dataflow runs on all the nodes.
• As a result, every component in the flow runs on every node.
• If a node is disconnected from the cluster for any reason, the DFM cannot make changes to the graph.

Primary Node
• Every cluster has one Primary Node. On this node, it is possible to run "Isolated Processors" (see below).
• ZooKeeper is used to automatically elect a Primary Node.
• If that node disconnects from the cluster for any reason, a new Primary Node will automatically be elected.
• Users can determine which node is currently elected as the Primary Node by looking at the Cluster
Management page of the User Interface.

Isolated Processors
• In a NiFi cluster, the same dataflow runs on all the nodes.
• As a result, every component in the flow runs on every node.
• However, there may be cases when the DFM would not want every processor to run on every node.
• The most common case is when using a processor that communicates with an external service using a protocol
that does not scale well.
• For example, the GetSFTP processor pulls from a remote directory, and if the GetSFTP processor on every
node in the cluster tries simultaneously to pull from the same remote directory, there could be race conditions.
• Therefore, the DFM could configure the GetSFTP on the Primary Node to run in isolation, meaning that it only
runs on that node.
• It could pull in data and - with the proper dataflow configuration - load-balance it across the rest of the nodes
in the cluster.
• Note that while this feature exists, it is also very common to simply use a standalone NiFi instance to pull data
and feed it to the cluster.

Heartbeats
• The nodes communicate their health and status to the currently elected Cluster Coordinator via "heartbeats",
which let the Coordinator know they are still connected to the cluster and working properly.
• By default, the nodes emit heartbeats every 5 seconds, and if the Cluster Coordinator does not receive a
heartbeat from a node within 40 seconds, it disconnects the node due to "lack of heartbeat".
• The reason that the Cluster Coordinator disconnects the node is because the Coordinator needs to ensure that
every node in the cluster is in sync, and if a node is not heard from regularly, the Coordinator cannot be sure it
is still in sync with the rest of the cluster.
• If, after 40 seconds, the node does send a new heartbeat, the Coordinator will automatically request that the
node re-join the cluster, to include the re-validation of the node’s flow.
• Both the disconnection due to lack of heartbeat and the reconnection once a heartbeat is received are
reported to the DFM in the User Interface.
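
The heartbeat bookkeeping described above can be modelled in a few lines of Python. This is a hypothetical sketch, not NiFi's implementation: the class CoordinatorSketch and its methods are invented for illustration, time is represented as plain seconds, and re-joining is reduced to marking the node as connected again (the flow re-validation is not modelled).

```python
from typing import Dict, List, Set

HEARTBEAT_INTERVAL_SECS = 5   # default heartbeat interval from the slide
HEARTBEAT_TIMEOUT_SECS = 40   # disconnect threshold from the slide


class CoordinatorSketch:
    """Hypothetical model of the Cluster Coordinator's heartbeat tracking."""

    def __init__(self) -> None:
        self.last_heartbeat: Dict[str, int] = {}
        self.connected: Set[str] = set()

    def receive_heartbeat(self, node: str, now: int) -> None:
        # A heartbeat from a disconnected node leads to the node re-joining.
        self.last_heartbeat[node] = now
        self.connected.add(node)

    def disconnect_silent_nodes(self, now: int) -> List[str]:
        """Disconnect every node not heard from within the timeout; return their names."""
        silent = sorted(
            node for node in self.connected
            if now - self.last_heartbeat[node] > HEARTBEAT_TIMEOUT_SECS
        )
        self.connected.difference_update(silent)
        return silent


coordinator = CoordinatorSketch()
coordinator.receive_heartbeat("node2", 0)           # node2 is heard from once, then goes silent
for second in range(0, 45, HEARTBEAT_INTERVAL_SECS):
    coordinator.receive_heartbeat("node1", second)  # node1 keeps reporting every 5 seconds

dropped = coordinator.disconnect_silent_nodes(now=45)
coordinator.receive_heartbeat("node2", 50)          # a late heartbeat leads to a re-join
print(dropped, sorted(coordinator.connected))
```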

Communication within the Cluster
• As noted, the nodes communicate with the Cluster Coordinator via heartbeats.
• When a Cluster Coordinator is elected, it updates a well-known ZNode in Apache ZooKeeper with its
connection information so that nodes understand where to send heartbeats.
• If one of the nodes goes down, the other nodes in the cluster will not automatically pick up the load of the
missing node.
• It is possible for the DFM to configure the dataflow for failover contingencies; however, this is dependent on
the dataflow design and does not happen automatically.
• When the DFM makes changes to the dataflow, the node that receives the request to change the flow
communicates those changes to all nodes and waits for each node to respond, indicating that it has made the
change on its local flow.

Dealing with Disconnected Nodes
• A DFM may manually disconnect a node from the cluster.
• If any cluster nodes are disconnected, a UI notification will be triggered.
• No changes can be made to the dataflow until the issue of the disconnected node is resolved.
• The DFM or the Administrator may troubleshoot and fix the issue in order to proceed.
• There are cases where a DFM may wish to continue making changes to the flow, even though a node is not
connected to the cluster.
• In this case, the DFM may elect to remove the node from the cluster entirely through the Cluster
Management dialog. Once removed, the node cannot be rejoined to the cluster until it has been restarted.

Basic Cluster Setup
• For each instance, certain properties in the nifi.properties file will need to be updated.
• For example, in a 3-node cluster, the Cluster Common Properties can be left at their default settings on all
three instances.
• # cluster common properties (all nodes must have same values) #
• nifi.cluster.protocol.heartbeat.interval=5 sec
• nifi.cluster.protocol.is.secure=false

Basic Cluster Setup (cont..)
• For every node (node1, node2, ... node(n)), the minimum properties to configure are as follows:
• # cluster node properties (only configure for cluster nodes) #
• nifi.cluster.is.node=true
• nifi.cluster.node.address=node1
• nifi.cluster.node.protocol.port=8055
• # zookeeper properties, used for cluster management #
• nifi.zookeeper.connect.string=node1:2181,node2:2181,node3:2181
• nifi.zookeeper.connect.timeout=3 secs
• nifi.zookeeper.session.timeout=3 secs
• nifi.zookeeper.root.node=/nifi
• Set nifi.cluster.node.address to each node's own hostname (node1 is shown here).
• Now, it is possible to start up the cluster.
• Technically, it does not matter which instance starts up first.
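
As a quick sanity check before starting a node, the minimum properties listed above can be verified with a short script. This is a minimal sketch: the helper names are hypothetical, the sample text simply repeats the values from this slide, and the parser only handles plain key=value lines.

```python
REQUIRED_NODE_PROPERTIES = (
    "nifi.cluster.is.node",
    "nifi.cluster.node.address",
    "nifi.cluster.node.protocol.port",
    "nifi.zookeeper.connect.string",
)

# Sample values copied from the slide; a real file has many more properties.
SAMPLE = """
# cluster node properties (only configure for cluster nodes) #
nifi.cluster.is.node=true
nifi.cluster.node.address=node1
nifi.cluster.node.protocol.port=8055
# zookeeper properties, used for cluster management #
nifi.zookeeper.connect.string=node1:2181,node2:2181,node3:2181
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def missing_cluster_properties(props):
    """Return required properties that are absent, empty, or wrongly set."""
    missing = [name for name in REQUIRED_NODE_PROPERTIES if not props.get(name)]
    if props.get("nifi.cluster.is.node") not in (None, "", "true"):
        missing.append("nifi.cluster.is.node")  # present, but not set to true
    return missing
```

An empty list from missing_cluster_properties(parse_properties(text)) means the minimum cluster settings are present.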

NiFi Cluster - User Interface

Cluster Management Window
[Screenshot: Cluster Management window. Callouts mark the connected nodes, the disconnected nodes, the Primary and Coordinator nodes, and the other tabs.]

Nodes in Cluster

State Management
• NiFi provides a mechanism for Processors, Reporting Tasks, Controller Services, and the framework itself to
persist state.
• This allows a Processor, for example, to resume from the place where it left off after NiFi is restarted.
• Additionally, it allows for a Processor to store some piece of information so that the Processor can access that
information from all of the different nodes in the cluster.
• This allows one node to pick up where another node left off, or to coordinate across all of the nodes in a
cluster.
• There are two different ways to store state:
• local-provider : Persists the data to the $NIFI_HOME/state/local directory.
• zk-provider : When clustered, state can be stored in ZooKeeper.

Configuring State Providers
• When a component decides to store or retrieve state, it does so by providing a "Scope" - either Node-local or
Cluster-wide.
• The nifi.properties file contains three different properties that are relevant to configuring these State
Providers.
Property : Description [default value]
• nifi.state.management.configuration.file : XML file that is used for configuring the local and/or cluster-wide
State Providers. [./conf/state-management.xml]
• nifi.state.management.provider.local : Property that provides the identifier of the local State Provider
configured in this XML file. [local-provider]
• nifi.state.management.provider.cluster : Property that provides the identifier of the cluster-wide State Provider
configured in this XML file. [zk-provider]

Configuring State Providers (cont..)
• Once the State Provider ids have been set in nifi.properties, we have to configure state-management.xml with
the details of each provider.
• The local-provider element must always be present and populated.
• If NiFi is clustered and uses the zk-provider, its ZooKeeper connection string and related details must also be
provided.
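
To make the relationship between the provider ids and their details concrete, the following sketch parses a sample state-management.xml and summarizes each provider. The XML is modelled on the default file shipped with NiFi as best recalled here, and the three-node connect string is an assumed example; check the element and property names against your own conf/state-management.xml.

```python
import xml.etree.ElementTree as ET

# Sample content modelled on the default conf/state-management.xml; the
# connect string below is an assumed three-node example.
STATE_MANAGEMENT_XML = """
<stateManagement>
    <local-provider>
        <id>local-provider</id>
        <class>org.apache.nifi.controller.state.providers.local.WriteAheadLocalStateProvider</class>
        <property name="Directory">./state/local</property>
    </local-provider>
    <cluster-provider>
        <id>zk-provider</id>
        <class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
        <property name="Connect String">node1:2181,node2:2181,node3:2181</property>
        <property name="Root Node">/nifi</property>
        <property name="Access Control">Open</property>
    </cluster-provider>
</stateManagement>
"""


def provider_summary(xml_text):
    """Return {provider id: {property name: value}} for each State Provider."""
    root = ET.fromstring(xml_text.strip())
    summary = {}
    for provider in root:
        props = {p.get("name"): (p.text or "") for p in provider.findall("property")}
        summary[provider.findtext("id")] = props
    return summary
```

The ids returned here are the values that nifi.state.management.provider.local and nifi.state.management.provider.cluster must reference.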

Embedded ZooKeeper Server
• As mentioned above, the default State Provider for cluster-wide state is the ZooKeeperStateProvider.
• This means that NiFi depends on ZooKeeper in order to behave as a cluster.
• To avoid the burden of forcing administrators to also maintain a separate ZooKeeper instance, NiFi provides the
option of starting an embedded ZooKeeper server.
• The properties in nifi.properties that enable or disable the embedded ZooKeeper are as follows:
• nifi.state.management.embedded.zookeeper.start : Should NiFi run an embedded ZooKeeper server?
[true/false]
• nifi.state.management.embedded.zookeeper.properties : Properties file that provides the ZooKeeper
properties. [./conf/zookeeper.properties]

Embedded ZooKeeper Configuration
• Embedded ZooKeeper is configured in zookeeper.properties.
• Generally, it is advisable to run ZooKeeper on either 3 or 5 nodes.
• Each node that is supposed to run an embedded ZooKeeper server needs this configuration.
• An Ambari-managed cluster does not support embedded ZooKeeper.
• Each ZooKeeper server is listed in the form <hostname>:<quorum port>[:<leader election port>]
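
The server format above can be illustrated with a small parser. This is a sketch only: the function name is hypothetical, and the host names and ports in the usage note are example values rather than anything taken from this deck.

```python
def parse_server_entry(entry):
    """Split '<hostname>:<quorum port>[:<leader election port>]' into its parts."""
    parts = entry.split(":")
    if len(parts) not in (2, 3):
        raise ValueError("expected <hostname>:<quorum port>[:<leader election port>]")
    host = parts[0]
    quorum_port = int(parts[1])
    # The leader election port is optional in the format above.
    election_port = int(parts[2]) if len(parts) == 3 else None
    return host, quorum_port, election_port
```

For example, parse_server_entry("node1:2888:3888") yields the host, the quorum port, and the leader election port, while "node2:2888" leaves the election port unset.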

ZooKeeper Access Control
• ZooKeeper provides Access Control to its data via an Access Control List (ACL) mechanism.
• Which ACL is used depends on the value of the Access Control property for the ZooKeeperStateProvider:
• CreatorOnly : NiFi will provide an ACL that indicates that only the user that created the data is allowed to
access the data.
• Open : When data is written to ZooKeeper, NiFi will provide an ACL that indicates that any user is allowed to
have full permissions to the data.
• For the CreatorOnly ACL to work, we need to tell ZooKeeper who the Creator is. We have two options for this:
• The first mechanism is to provide authentication using Kerberos.
• The second option is to use a user name and password.

Lab: Installing and Working with NiFi Cluster

Monitoring NiFi
• As data flows through your dataflow in NiFi, it is important to understand how well your system is performing
in order to assess if you will require more resources and in order to assess the health of your current resources.
• NiFi provides a few mechanisms for monitoring your system.
• Status Bar
• Component Statistics
• Bulletins
• NiFi also supports several Reporting Tasks that we can use to monitor effectively:
• AmbariReportingTask
• ControllerStatusReportingTask
• DataDogReportingTask
• MonitorDiskUsage
• MonitorMemory
• StandardGangliaReporter
• ScriptedReportingTask
• SiteToSiteBulletinReportingTask
• SiteToSiteProvenanceReportingTask
• SiteToSiteStatusReportingTask

Monitoring via Status Bar
• Near the top of the NiFi screen is a blue bar that is referred to as the Status Bar.
• It contains a few important statistics about the current health of NiFi.
• The number of Active Threads can indicate how hard NiFi is currently working.
• The Queued stat indicates how many FlowFiles are currently queued across the entire flow, as well as the
total size of those FlowFiles.
• It also shows the number of components in each state, as well as the time the status was last refreshed.
• If the NiFi instance is in a cluster, we will also see an indicator here telling us how many nodes are in the cluster
and how many are currently connected.
• In this case, the number of active threads and the queue size are the sum across all nodes that are
currently connected.

Monitoring via Component Statistics
• Each Processor, Process Group, and Remote Process Group on the canvas provides several statistics about how
much data has been processed by the component.
• These statistics provide information about how much data has been processed in the past five minutes.
• This is a rolling window and allows us to see things like the number of FlowFiles that have been consumed by a
Processor, as well as the number of FlowFiles that have been emitted by the Processor.
• The connections between Processors also expose the number of items that are currently queued.

Monitoring via Component Statistics (cont..)
• It may also be valuable to see historical values for these metrics and, if clustered, how the different nodes
compare to one another.
• In order to see this information, we can right-click on a component and choose the Stats menu item. This will
show us a graph that spans the time since NiFi was started, or up to 24 hours, whichever is less.
• The amount of time that is shown here can be extended or reduced by changing the configuration in the
properties file.
• In the top-right corner is a drop-down that allows the user to select which metric they are viewing.
• The graph on the bottom allows the user to select a smaller portion of the graph to zoom in.

Monitoring via Bulletins
• Along with statistics provided by each component, a user may want to know if any problems occur.
• While we could monitor the logs for anything interesting, it is much more convenient to have notifications pop
up on the screen.
• If a Processor logs anything as a WARNING or ERROR, we will see a "Bulletin Indicator" show up in the
top-right-hand corner of the Processor.
• This indicator looks like a sticky note and will be shown for five minutes after the event occurs.
• Hovering over the bulletin provides information about what happened so that the user does not have to sift
through log messages to find it.
• If in a cluster, the bulletin will also indicate which node in the cluster emitted the bulletin.
• We can also change the log level at which bulletins will occur in the Settings tab of the Configure dialog for a
Processor.

AmbariReportingTask
• Apache NiFi 0.3.0 added a Reporting Task to send metrics to Ambari; the ticket for this work is NIFI-790.
• Ambari Reporting Task directly publishes metrics from NiFi to Ambari.
• This ReportingTask sends the following metrics to Ambari:
• FlowFilesReceivedLast5Minutes
• BytesReceivedLast5Minutes
• FlowFilesSentLast5Minutes
• BytesSentLast5Minutes
• FlowFilesQueued
• BytesQueued
• BytesReadLast5Minutes
• BytesWrittenLast5Minutes
• ActiveThreads
• TotalTaskDurationSeconds
• jvm.uptime
• jvm.heap_used
• jvm.heap_usage
• jvm.non_heap_usage
• jvm.thread_states.runnable
• jvm.thread_states.blocked
• jvm.thread_states.timed_waiting
• jvm.thread_states.terminated
• jvm.thread_count
• jvm.daemon_thread_count
• jvm.gc.runs
• jvm.gc.time
• In order to make use of these metrics in Ambari, a NIFI service must be created and installed in Ambari.

AmbariReportingTask
• Publishes metrics from NiFi to Ambari Metrics Service (AMS).
• Due to how the Ambari Metrics Service works, this reporting task should be scheduled to run every 60 seconds.
• Each iteration it will send the metrics from the previous iteration, and calculate the current metrics to be sent
on next iteration.
• Scheduling this reporting task at a frequency other than 60 seconds may produce unexpected results
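
The one-iteration lag described above can be sketched as follows. The class is hypothetical and only illustrates the "send previous, keep current" pattern; in particular, sending nothing on the very first run is an assumption of this sketch, not documented behavior.

```python
class LaggedReporter:
    """Each run sends the previous run's metrics and stores the current ones."""

    def __init__(self, send):
        self.send = send          # callable that delivers a metrics dict
        self.previous = None      # metrics calculated on the previous iteration

    def on_trigger(self, current_metrics):
        if self.previous is not None:
            self.send(self.previous)
        # Keep a copy so later changes by the caller do not alter what is sent.
        self.previous = dict(current_metrics)
```

Because every value reaches Ambari one iteration late, a schedule other than 60 seconds shifts the metrics relative to the interval the Ambari Metrics Service expects.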

AmbariReportingTask
• Assuming you have the AmbariReportingTask running on a NiFi instance that points to the given
Ambari instance, you should see metrics on the NIFI Service page within a few minutes.

ControllerStatusReportingTask
• ControllerStatusReportingTask logs the 5-minute stats that are shown in the NiFi Summary Page for Processors and
Connections, as well as optionally logging the deltas between the previous iteration and the current iteration.
• Processors' stats and Connections' stats are logged to ./nifi-app.log by default.
• These can be configured in the NiFi logging configuration to log to different files, if desired.
• For Processors, the following information is included (sorted by descending Processing Time):
• Processor Name
• Processor ID
• Processor Type
• Run Status
• FlowFiles In (5 mins)
• FlowFiles Out (5 mins)
• Bytes Read from Disk (5 mins)
• Bytes Written to Disk (5 mins)
• Number of Tasks Completed (5 mins)
• Processing Time (5 mins)

• For Connections, the following information is included (sorted by descending size of queued FlowFiles):
• Connection Name
• Connection ID
• Source Component Name
• Destination Component Name
• FlowFiles In (5 mins)
• FlowFiles Out (5 mins)
• FlowFiles Queued

Setting up ControllerStatusReportingTask
• To set up and start ControllerStatusReportingTask, click on the Controller Settings tab in the NiFi UI.
• Click on the add button to add the reporting task.
• You can start or stop the task whenever you want.
[Screenshot callouts: View Details/Help; Start/Edit/Delete.]

MonitorDiskUsage
• Checks the amount of storage space available for the Content Repository and FlowFile Repository.
• Warns via a log message and a System-Level Bulletin if the partition on which either repository resides exceeds
a configurable threshold of storage space.
• The parameters can be set while configuring the reporting task.

Setting up MonitorDiskUsage
• To set up and start MonitorDiskUsage, click on the Controller Settings tab in the NiFi UI.
• Click + to add the reporting task, then click the edit button to configure it.
• Click Start; the monitor is now running.
• A notification can be viewed if the threshold is exceeded.
[Screenshot callouts: Delete/Start.]

MonitorMemory
• Watches the amount of Java Heap available in the JVM for a particular JVM Memory Pool.
• MonitorMemory Reporting Task checks how much Java Heap Space is available after Full Garbage Collections.
• When the heap exceeds a specified threshold immediately following a Full Garbage Collection, Reporting Task will
create a WARNING level log message and create a System-Level bulletin to notify the user.
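
The threshold check described above amounts to a simple comparison. The sketch below is illustrative only: the function name, the byte counts, and the message format are hypothetical and are not the task's real output.

```python
def memory_warning(used_bytes, max_bytes, threshold_percent):
    """Return a warning string if usage after a full GC exceeds the threshold, else None."""
    used_percent = 100.0 * used_bytes / max_bytes
    if used_percent > threshold_percent:
        return "WARNING: memory pool at %.1f%% (threshold %s%%)" % (used_percent, threshold_percent)
    return None
```

Measuring immediately after a Full Garbage Collection matters because usage at that point reflects live objects rather than garbage waiting to be collected.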

Setting up MonitorMemory
• To set up and start MonitorMemory, click on the Controller Settings tab in the NiFi UI.
• Click the Add button, then click the Edit button to configure the task.
• Select the memory pool to monitor.
[Screenshot callouts: Start/Delete.]

StandardGangliaReporter
• Reports metrics to Ganglia so that Ganglia can be used for external monitoring of the application.
• Metrics reported include 5-minute NiFi statistics and JVM Metrics (optional).

• Reporting Task that reports metrics to a Ganglia server.
• The following metrics are reported:
• FlowFiles In: The number of FlowFiles received via Site-to-Site in the last 5 minutes
• Bytes In: The number of bytes received via Site-to-Site in the last 5 minutes
• FlowFiles Out: The number of FlowFiles pulled from Output Ports via Site-to-Site in the last 5 minutes
• Bytes Out: The number of bytes pulled from Output Ports via Site-to-Site in the last 5 minutes
• Bytes Read: The number of bytes read from disk by NiFi in the last 5 minutes
• Bytes Written: The number of bytes written to disk by NiFi in the last 5 minutes
• FlowFiles Queued: The total number of FlowFiles currently queued on the system at the point in time at
which the Reporting Task is run
• Bytes Queued: The total number of bytes allocated by the FlowFiles that are currently queued on the
system at the point in time at which the Reporting Task is run
• Active Threads: The number of threads actively running at the point in time at which the Reporting Task is
run
• The default JVM metrics are also made available to Ganglia.

Setting up StandardGangliaReporter
• To set up and start StandardGangliaReporter, click on the Controller Settings tab in the NiFi UI.
• Click the Add button, then click the Edit button to configure the task.
[Screenshot callouts: Start/Delete.]

Setting up StandardGangliaReporter (cont..)
• Once the task is set up and started, navigate to the Ganglia UI to see the updated metrics.

DataDogReportingTask
• Publishes metrics from NiFi to DataDog.
• For accurate and informative reporting, components should have unique names.

• This ReportingTask sends the following metrics to DataDog:
• FlowFilesReceivedLast5Minutes
• BytesReceivedLast5Minutes
• FlowFilesSentLast5Minutes
• BytesSentLast5Minutes
• FlowFilesQueued
• BytesQueued
• BytesReadLast5Minutes
• BytesWrittenLast5Minutes
• ActiveThreads
• TotalTaskDurationSeconds
• jvm.uptime
• jvm.heap_used
• jvm.heap_usage
• jvm.non_heap_usage
• jvm.thread_states.runnable
• jvm.thread_states.blocked
• jvm.thread_states.timed_waiting
• jvm.thread_states.terminated
• jvm.thread_count
• jvm.daemon_thread_count
• jvm.file_descriptor_usage
• jvm.gc.runs
• jvm.gc.time

SiteToSiteProvenanceReportingTask
• Publishes Provenance events using the Site To Site protocol.
[Screenshot callouts: Click Add Button; Click Edit Button to configure; Start/Delete; View last sent event_id.]

SiteToSiteBulletinReportingTask
• Publishes Bulletin events using the Site To Site protocol.
• Note: only up to 5 bulletins are stored per component and up to 10 bulletins at controller level, for a duration of
up to 5 minutes. If this reporting task is not scheduled frequently enough, some bulletins may not be sent.
[Screenshot callouts: Click Add Button; Button to configure; View last sent event_id.]
• LOCAL STATE STORE: Stores the Reporting Task's last bulletin ID so that on restart the task knows where it left off.
• Gives the operator the ability to send sensitive details contained in bulletin events to any external system.

SiteToSiteStatusReportingTask
• Publishes Status events using the Site To Site protocol.
• The component type and name filter regexes are applied together: only components matching both regexes will
be reported.
• However, all process groups are recursively searched for matching components, regardless of whether the process
group matches the component filters.
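
The filtering rule above can be sketched as a recursive search. This is an illustration only: the nested-dictionary flow model and the function name are hypothetical, and whole-string matching (fullmatch) is an assumption of the sketch rather than a statement about how the reporting task applies its regexes.

```python
import re


def matching_components(group, type_regex, name_regex):
    """Yield (type, name) for components matching BOTH regexes, searching every
    nested process group regardless of whether the group itself matches."""
    type_pattern = re.compile(type_regex)
    name_pattern = re.compile(name_regex)

    def walk(node):
        for comp_type, comp_name in node.get("components", []):
            if type_pattern.fullmatch(comp_type) and name_pattern.fullmatch(comp_name):
                yield comp_type, comp_name
        # Child groups are always searched; the filters apply to components only.
        for child in node.get("groups", []):
            yield from walk(child)

    yield from walk(group)
```

A component nested several groups deep is still reported when its own type and name match, even if none of its parent groups would.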
[Screenshot callouts: Click Add Button; Button to configure.]

• This component does not store state.

ScriptedReportingTask
• Provides reporting and status information to a script. ReportingContext, ComponentLog, and
VirtualMachineMetrics objects are made available as variables (context, log, and vmMetrics, respectively) to the
script for further processing.
• The context makes various information available such as events, provenance, bulletins, controller services, process
groups, Java Virtual Machine metrics, etc.
[Screenshot callouts: Click Add Button; Button to configure.]

• This component does not store state.

NiFi Notification Services
• We now know how to monitor processes and metrics in NiFi while it is running. But what if something goes wrong
with NiFi itself? For that, we have the NiFi Notification Services.
• When the NiFi bootstrap starts or stops NiFi, or detects that it has died unexpectedly, it is able to notify configured
recipients.
• At this point the only mechanism supplied is to send an e-mail notification.
• The notification services configuration file, however, is a configurable XML file so that as new notification
capabilities are developed, they will be configured similarly.
• The default location of the XML file is conf/bootstrap-notification-services.xml, but this value can be changed in
the conf/bootstrap.conf file.

Notification Services configuration
• Once the desired services have been configured, they can then be referenced in the bootstrap.conf file.
• Currently, the only implementation is org.apache.nifi.bootstrap.notification.email.EmailNotificationService.
• The services are defined in the bootstrap-notification-services.xml configuration file.

Notification Services configuration
• When NiFi is managed by Ambari, you can configure Notification Services from Ambari:
• Configure them in the Ambari UI under Services > NiFi > Advanced nifi-bootstrap-notification-services-env.

Notification Services configuration (Cont..)
• Once bootstrap-notification-services.xml has been configured, we have to make sure the services are referenced in
the bootstrap.conf file.
• You can configure which service will be used and when it should be used.
• The references are made by service id in bootstrap.conf.
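
The link between the two files can be checked with a short script. The sample XML and bootstrap.conf text below follow the commented-out defaults shipped with NiFi as best recalled here; the SMTP host and e-mail addresses are placeholder assumptions, and the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed sample of conf/bootstrap-notification-services.xml (placeholder values).
SERVICES_XML = """
<services>
    <service>
        <id>email-notification</id>
        <class>org.apache.nifi.bootstrap.notification.email.EmailNotificationService</class>
        <property name="SMTP Hostname">smtp.example.com</property>
        <property name="SMTP Port">25</property>
        <property name="From">nifi@example.com</property>
        <property name="To">ops@example.com</property>
    </service>
</services>
"""

# Assumed sample of the notification section of conf/bootstrap.conf.
BOOTSTRAP_CONF = """
notification.services.file=./conf/bootstrap-notification-services.xml
nifi.start.notification.services=email-notification
nifi.stop.notification.services=email-notification
nifi.dead.notification.services=email-notification
"""


def undefined_service_references(services_xml, bootstrap_conf):
    """Return {bootstrap.conf key: [service ids not defined in the XML]}."""
    root = ET.fromstring(services_xml.strip())
    defined = {service.findtext("id") for service in root.findall("service")}
    undefined = {}
    for line in bootstrap_conf.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.endswith(".notification.services"):
            ids = [v.strip() for v in value.split(",") if v.strip()]
            missing = [service_id for service_id in ids if service_id not in defined]
            if missing:
                undefined[key] = missing
    return undefined
```

An empty result means every service id referenced for the start, stop, and dead events is defined in the XML file.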

Notification Services configuration (Cont..)
• If Ambari manages NiFi, configure the bootstrap.conf references in the following section:
• Ambari UI > Services > NiFi > Advanced nifi-bootstrap-env.
• Once updated, click Save and restart the service as required.

Demo: NiFi Notification Services

Lab: Monitoring NiFi

HDF with HDP – A Complete Big Data solution

HDF Complements Hortonworks Data Platform
• HDF dynamically connects and conducts data into HDP.
• HDF secures and encrypts data before it arrives in HDP.
• HDF offers traceability on the data’s flow from the source, with lineage and audit trails before it reaches HDP.
• HDF models flows graphically to dynamically adjust data coming to HDP.
• HDF includes mature IoAT data protocols that improve device extensibility.
• HDF manages IoAT flows bi-directionally with easy optimization and adjustment.

HDF with HDP – A Complete Big Data solution
[Diagram: data from the Internet of Anything flows through Hortonworks DataFlow (HDF), powered by Apache NiFi, which yields perishable insights, and into the Hortonworks Data Platform (HDP), powered by Apache Hadoop, which yields historical insights. Labels between the two: Store Data and Metadata; Enrich Context.]
Hortonworks DataFlow and the Hortonworks Data Platform deliver the industry’s most complete Big Data solution.

HDF Makes Big Data Ingest Easy
[Diagram: two panels of ingest into the Hortonworks Data Platform (HDP, powered by Apache Hadoop). First panel: complicated, messy, and takes weeks to months to move the right data into Hadoop. Second panel: streamlined, efficient, easy.]

BigData Ingestion with HDF a Closer Look
[Diagram: HDF and core Hadoop. Inputs: raw network stream, network metadata stream, syslog, raw application logs, SIEM, other streaming telemetry. Components shown: NiFi, Kafka, Storm, Spark, Phoenix, HBase, Hive, SOLR, YARN, HDFS, with groups labelled Streaming Options, Data Stores, and Service Management / Workflow.]

HDF Put/Get Data Directly to/from HDFS
[Diagram: HDF Instance > HDFS > HDF Instance]
• Processors: GetHDFS, PutHDFS, ListHDFS, FetchHDFS, CreateHadoopSequenceFile, GetHDFSSequenceFile
• Provide the locations of the Hadoop configuration files to the processors: hdfs-site.xml, core-site.xml.
• Provide the HDFS file/directory path to read from or write to.
• *Optional: Kerberos keytab + principal, if connecting to a kerberized environment.

HDF Can Store/Read Data to Solr
[Diagram: HDF Instance > PutSolrContentStream > Solr > GetSolr > HDF Instance]
• Provide the type and URL of the Solr instance [ZooKeeper URL if the type is cloud].
• Provide the collection name if the type is cloud.
• Provide the path to post the ContentStream to and the type of content [e.g. JSON].
• Provide the Solr query and filter options for GetSolr.

HDF Can Store/Read Data to Hive
[Diagram: HDF Instance > PutHiveStreaming / PutHiveQL > Hive > SelectHiveQL > HDF Instance]
• Create a Hive connection pool to connect to the Hive server.
• SelectHiveQL queries Hive tables and provides output in Avro or CSV format.
• PutHiveQL executes a DDL/DML command on a Hive database.
• PutHiveStreaming uses Hive Streaming to send FlowFile data to Hive tables.

HDF Can Push/Read messages to Kafka
[Diagram: HDF Instance > PublishKafka / PutKafka > Kafka > ConsumeKafka / GetKafka > HDF Instance]
• Provide the ZooKeeper connection string associated with the Kafka cluster to GetKafka.
• Provide the topic name.
• Provide the Kafka broker list, topic and partition to PutKafka.

HDF Expose Streaming data to Spark
[Diagram: HDF Instance > Output Port with Site-2-Site > nifi-spark-receiver > Spark]
• Configure HDF to run Site-2-Site.
• Create an Output Port.
• Configure the Spark application to run with NiFi-Spark-Receiver and the Site-2-Site client.
• Spark will start pulling stream data from the NiFi port.

HDF Expose data to Storm Directly
[Diagram: HDF Instance > Output Port with Site-2-Site > NiFi-Storm-Spout > Storm]
• Configure HDF to run Site-2-Site.
• Create an Output Port.
• Configure the Storm application to run with NiFi-Storm-Spout and the Site-2-Site client.
• Storm will start pulling data from the NiFi port.

HDF Stores/Reads Data to HBase
[Diagram: HDF Instance > PutHBaseCell / PutHBaseJSON > HBase > GetHBase > HDF Instance]
• Configure the HBase Client Service and the Distributed Cache Services.
• Provide the table names, column information, etc. needed to pull and push data.
• You can use a Phoenix JDBC connection to query HBase via regular SQL connectors.

Lab: HDF Integration with HDP

HDF Best Practices
• It is important to understand your dataflow's behavior when it comes to resource consumption.
• NiFi is pre-configured to run with very minimal configuration to get started.
• This may get you up and running, but that basic configuration is far from ideal for high
volume/high performance dataflows.
• Some NiFi processors can be CPU, I/O or memory intensive.
• In fact some can be intensive in all three areas.
• Here we will focus on the areas where you can improve performance by changing some out of
the box default values.

nifi.properties file
• We will start by looking at the nifi.properties file located in the conf directory of the NiFi installation.
• The file is broken up into following sections
• Core Properties
• H2 Settings
• FlowFile Repository
• Content Repository
• Provenance Repository
• Component Status Repository
• Site to Site properties
• Web Properties
• Security properties
• Cluster properties
• The various properties that make up each of these sections come pre-configured with default values.

nifi.properties - Core Properties
1) nifi.bored.yield.duration [Default value is 10 millis]
• This property is designed to help with CPU utilization by preventing processors, that are using the timer driven
scheduling strategy, from using excessive CPU when there is no work to do.
• The default 10-millisecond value already makes a huge impact on cutting down on CPU utilization.
• Smaller values equate to lower latency, but higher CPU utilization.
• So depending on how important latency is to your overall dataflow, increasing the value here will cut down on
overall CPU utilization even further.

nifi.properties - Core Properties (Cont..)
2) nifi.ui.autorefresh.interval [Default Value is 30 sec]
• It does not have an impact on NiFi performance but can have an impact on browser performance.
• This property sets the interval at which the latest statistics, bulletins and flow revisions are pushed to
connected browser sessions.
• In order to reload the complete dataflow the user must trigger a refresh.
• Decreasing the time between refreshes will allow bulletins to present themselves to the user in a timelier
manner; however, doing so will increase the network bandwidth used.
• The number of concurrent users accessing the UI compounds this.
• We suggest keeping the default value and only changing it if closer to real–time bulletin or statistics reporting
in the UI is needed.
• The user can always manually trigger a refresh at any time by right clicking on any open space on the graph and
selecting “refresh status”.

nifi.properties - H2 Settings
• There are two H2 databases used by NiFi.
• A user DB - keeps track of user logins when NiFi is secured.
• A history DB - keeps track of all changes made on the graph.
• They stay relatively small and require very little hard drive space.
• The default installation path of <root-level-nifi-dir>/database_repository would result in the directory being
created at the root level of your NiFi installation (same level as conf, bin, lib, etc directories).
• While there is little to no performance gain by moving this to a new location, we do recommend moving all
repositories to a location outside of the NiFi install directories to simplify upgrading.
• This allows you to retain the user and component history information after upgrading.

nifi.properties - FlowFile Repository
1) nifi.flowfile.repository.directory [Default Value is ./flowfile_repository ]
• FlowFile repo maintains state on all FlowFiles located anywhere in the data flows on the NiFi UI.
• The most common cause of FlowFile corruption is running out of disk space.
• The default configuration again has the repository located in <root-level-nifi-dir>/flowfile_repository.
• Recommendation is to move this repository out of the base install path.
• You will also want to have the FlowFile repository located on a disk (high performance RAID preferably) that is
not shared with other high I/O software.
• On high performance systems, the FlowFile repository should never be located on the same hard disk/RAID as
either the content repository or provenance repository if at all possible.

nifi.properties - FlowFile Repository (Cont..)
2) nifi.queue.swap.threshold [Default Value is 20000 ]
• NiFi does not move the physical file (content) from processor to processor; FlowFiles serve as the unit of
transfer from one processor to the next.
• In order to make that as fast as possible, FlowFiles live inside the JVM memory.
• This is great until you have so many FlowFiles in your system that you begin to run out of JVM memory and
performance takes a serious dive.
• To reduce the likelihood of this happening, NiFi has a threshold that defines how many FlowFiles can live in
memory on a single connection queue before being swapped out to disk.
• If the total number of FlowFiles in any one connection queue exceeds this value, swapping will occur.
• Depending on how much swapping is occurring, performance can be affected.
• If queues holding more than 20,000 FlowFiles are the norm rather than an occasional data surge for your
dataflow, it may be wise to increase this value.
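
The swapping behavior described above can be pictured with a toy queue. This is a deliberately simplified sketch: the class is hypothetical, it moves FlowFiles one at a time, and it ignores how NiFi actually batches and writes swap files.

```python
from collections import deque

SWAP_THRESHOLD = 20000  # nifi.queue.swap.threshold default


class ConnectionQueueSketch:
    """Toy queue: FlowFiles beyond the threshold are held in a 'swapped' list."""

    def __init__(self, threshold=SWAP_THRESHOLD):
        self.threshold = threshold
        self.in_memory = deque()
        self.swapped = deque()   # stands in for swap files on disk

    def enqueue(self, flowfile):
        # Once anything is swapped out, new arrivals queue behind it to keep order.
        if len(self.in_memory) < self.threshold and not self.swapped:
            self.in_memory.append(flowfile)
        else:
            self.swapped.append(flowfile)

    def dequeue(self):
        flowfile = self.in_memory.popleft()
        if self.swapped:
            self.in_memory.append(self.swapped.popleft())  # swap back in
        return flowfile
```

Every FlowFile that passes through the swapped list costs extra disk I/O in a real system, which is why a flow that routinely exceeds the threshold may benefit from a higher value, provided the heap can hold the additional FlowFiles.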

nifi.properties - Content Repository
1) nifi.content.repository.directory.default [Default Value is ./content_repository ]
• Since the content for every FlowFile that is consumed by NiFi is placed inside the content repository, the hard
disk that this repository is loaded on will experience high I/O on systems that deal with high data volumes.
• As you can see, once again the repository is created by default inside the NiFi installation path.
• The content repository should be moved to its own hard disk/RAID.
• Sometimes even having a single dedicated high performance RAID is not enough.
• NiFi allows you to configure multiple content repositories within a single instance of NiFi.
• NiFi will then round robin files to however many content repositories are defined:
nifi.content.repository.directory.contS1R1=/cont-repo1/content_repository
• In a NiFi cluster, every node can be configured to use the same names for its various repositories, but it is
recommended to use different names.
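
Round-robin distribution across several content repositories can be sketched as below. The function and the claim names are hypothetical, and the second repository path in the usage note is an assumed example in the style of the property shown on this slide.

```python
from itertools import cycle


def assign_round_robin(claims, repositories):
    """Map each content claim to a repository, cycling through them in order."""
    repo_cycle = cycle(repositories)
    return {claim: next(repo_cycle) for claim in claims}
```

With two repositories such as /cont-repo1/content_repository and a hypothetical /cont-repo2/content_repository, successive claims alternate between the two, so the write load is spread evenly across the disks.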

nifi.properties - Provenance Repository
1) nifi.provenance.repository.directory.default [Default Value is ./provenance_repository]
• Similar to the content repository, the provenance repository can use a large amount of disk I/O for writing and
reading provenance events.
• Every transaction within the entire dataflow that affects either the FlowFile or content has a provenance event
created.
• The default configuration has the provenance repository being created inside the NiFi installation path.
• It is recommended that the provenance repository is also located on its own hard disk /RAID and does not
share its disk with any of the other repositories (database, FlowFile, or content).
• Multiple provenance repositories can be defined by providing unique names in place of ‘default’ and paths:
nifi.provenance.repository.directory.provS1R1=/prov-repo1/provenance_repository
nifi.provenance.repository.directory.provS1R2=/prov-repo2/provenance_repository

nifi.properties - Provenance Repository (Cont..)
2) nifi.provenance.repository.query.threads [The default value is 2 ]
• The number of threads that are available to conduct provenance queries is defined by this property.
• On systems where numerous users may be making simultaneous queries against the provenance repository, it
may be necessary to increase the number of threads allocated to this process.
3) nifi.provenance.repository.index.threads [The default value is 1 ]
• The number of threads to use for indexing Provenance events so that they are searchable can be adjusted by
editing this.
• For flows that operate on a very high number of FlowFiles, the indexing of Provenance events could become a
bottleneck.
• If this is the case, a bulletin will appear indicating, "The rate of the dataflow is exceeding the provenance
recording rate. Slowing down flow to accommodate."
• If this happens, increasing the value of this property may increase the rate at which the Provenance Repository
is able to process these records, resulting in better overall throughput.

nifi.properties - Provenance Repository (Cont..)
4) nifi.provenance.repository.index.shard.size [The default value is 500MB ]
• When provenance queries are performed, the configured shard size has an impact on how much of the heap is
used for that process.
• Large values for the shard size will result in more Java heap usage when searching the Provenance Repository
but should provide better performance.
• The default is 500 MB; this does not mean that 500 MB of heap is used.
• Keep in mind that if you increase the size of the shard, you may also need to increase the size of your overall
heap, which is configured in the bootstrap.conf file.

Bootstrap.conf file
• The bootstrap.conf file in the conf directory allows users to configure settings for how NiFi should be started.
• This includes parameters, such as the size of the Java Heap, what Java command to run, and Java System
Properties.
• This file comes pre-configured with default values.
• We will focus on just the properties that should be changed when installing NiFi for the purpose of high volume
/ high performance dataflows.
• The bootstrap.conf file is broken into sections, just like the nifi.properties file.
• Since NiFi is a Java application it requires that Java be installed. NiFi requires Java 7 or later.
• We will only highlight the sections that need changing.

Bootstrap.conf - JVM memory settings
• This section controls the amount of heap memory used by the JVM running NiFi.
• Xms defines the initial memory allocation, while Xmx defines the maximum memory allocation for the JVM.
• As you can see the default values are very small and not suitable for dataflows of any substantial size.
• We recommend increasing both the initial and maximum heap memory allocations to at least 4 GB or 8 GB for
starters.
java.arg.2=-Xms8g
java.arg.3=-Xmx8g
• If you should encounter any “out of memory” errors in your NiFi app log, this is an indication that you have
either a memory leak or simply insufficient memory allocation to support your dataflow.
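As a rough illustration of the heap recommendation above, the sketch below reads the -Xms and -Xmx values from bootstrap.conf text and checks them against the 4 GB lower bound mentioned on the slide. The function name `heap_settings` is hypothetical, and the parsing assumes the simple `java.arg.N=-Xms8g` form shown above.

```python
import re

# Multipliers for the JVM size suffixes k, m and g.
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def heap_settings(conf_text):
    """Return {'Xms': bytes, 'Xmx': bytes} from uncommented java.arg.N lines."""
    found = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue
        m = re.match(r"java\.arg\.\d+=-(Xms|Xmx)(\d+)([kKmMgG]?)$", line)
        if m:
            flag, number, unit = m.groups()
            found[flag] = int(number) * UNITS.get(unit.lower(), 1)
    return found


SAMPLE = """\
# JVM memory settings
java.arg.2=-Xms8g
java.arg.3=-Xmx8g
"""

heap = heap_settings(SAMPLE)
FOUR_GB = 4 * 1024 ** 3
meets_minimum = all(heap.get(flag, 0) >= FOUR_GB for flag in ("Xms", "Xmx"))
```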

Bootstrap.conf - JVM memory settings (Cont..)
• When using large heap allocations, garbage collection can be a performance killer (most noticeable when
Major Garbage Collection occurs against the Old Generation objects).
• When using larger heap sizes it is recommended that a more efficient garbage collector is used to help reduce
the impact to performance should major garbage collection occur.
• You can configure NiFi to use the G1 garbage collector by uncommenting the following line:
java.arg.13=-XX:+UseG1GC
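Uncommenting a line in bootstrap.conf simply means removing the leading # from it. A small sketch of doing that programmatically is shown below; the function name `uncomment_java_arg` is hypothetical, and the sample text assumes the commented line has the form `#java.arg.13=-XX:+UseG1GC`.

```python
def uncomment_java_arg(conf_text, jvm_option):
    """Remove the leading '#' from a commented java.arg.N line carrying jvm_option."""
    lines = []
    for raw in conf_text.splitlines():
        candidate = raw.lstrip()
        if candidate.startswith("#"):
            body = candidate.lstrip("#").strip()
            key, sep, value = body.partition("=")
            if sep and key.startswith("java.arg.") and value == jvm_option:
                lines.append(body)  # emit the line without the comment marker
                continue
        lines.append(raw)
    return "\n".join(lines) + "\n"


SAMPLE = """\
java.arg.2=-Xms8g
java.arg.3=-Xmx8g
#java.arg.13=-XX:+UseG1GC
"""

enabled = uncomment_java_arg(SAMPLE, "-XX:+UseG1GC")
```

Other comment lines and already active arguments pass through unchanged.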

Bootstrap.conf – Java 8 and above only
• Increase the Code Cache size by uncommenting this line:
• java.arg.7=-XX:ReservedCodeCacheSize=256m
• The code cache is memory separate from the heap that contains all the JVM bytecode for a method compiled
down to native code. If the code cache fills, the compiler will be switched off and will not be switched back on
again. This will impact the long running performance of NiFi.
• The only way to recover performance is to restart the JVM (restart NiFi). Removing the comment on this
line increases the code cache size to 256m, which should be sufficient to prevent the cache from filling up.
• The following parameters establish a boundary for how much of the code cache can be used before the
code cache is flushed, which prevents it from filling and stopping the compiler.
• java.arg.8=-XX:CodeCacheMinimumFreeSpace=10m
• java.arg.9=-XX:+UseCodeCacheFlushing

Security: HDF Authentication
HDF Authentication
• NiFi provides several different configuration options for security purposes.
• The most important properties are those under the "security properties" heading in the
nifi.properties file.
• Options are to:
 Secure the instance using 2-Way-SSL authentication.
 Optionally integrate with LDAP to authenticate users over HTTPS.
 Optionally integrate with a Kerberos Key Distribution Center (KDC) to
authenticate users over HTTPS.
 Once users are authenticated, an administrator can assign roles to them to determine who can do
what.

SSL User Authentication
Securing NiFi with 2-Way-SSL
• NiFi provides a 2-Way-SSL (mutual SSL) authentication option to secure NiFi.
• Mutual SSL authentication, or certificate-based mutual authentication, refers to two parties authenticating each
other by verifying the provided digital certificates, so that both parties are assured of the other's identity.
• In our terms, it refers to a client (a web browser) authenticating itself to a NiFi node and that NiFi node
also authenticating itself to the client, each by verifying the other's public key certificate (digital certificate)
issued by a trusted Certificate Authority (CA).
• Because authentication relies on digital certificates, certification authorities such as Verisign or Microsoft
Certificate Server are an important part of the mutual authentication process.

2-Way-SSL – How it works
• From a high-level point of view, the process of authenticating and establishing an encrypted channel using
certificate-based mutual authentication involves the following steps:
 A Web Browser client requests access to a protected resource.
 The NiFi server presents its certificate to the client.
 The client verifies the NiFi server’s certificate.
 If successful, the client sends its certificate to the NiFi server.
 The NiFi server verifies the client’s credentials.
 If successful, the NiFi server grants access to the protected resource requested by the client.
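The handshake steps above can be sketched with Python's standard ssl module. This is a generic illustration of mutual TLS, not NiFi's own implementation (NiFi is a Java application); the function names and the optional file-path parameters are hypothetical.

```python
import ssl


def make_server_context(certfile=None, keyfile=None, cafile=None):
    """Server side: present a certificate and REQUIRE one from the client."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # CERT_REQUIRED on a server context is what makes the TLS "2-way":
    # the handshake fails unless the client sends a certificate the
    # server can verify against its trusted CAs.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if certfile:
        ctx.load_cert_chain(certfile, keyfile)  # server certificate and key
    if cafile:
        ctx.load_verify_locations(cafile)       # CAs trusted for client certs
    return ctx


def make_client_context(certfile=None, keyfile=None, cafile=None):
    """Client side: verify the server and offer the client's own certificate."""
    # PROTOCOL_TLS_CLIENT verifies the server certificate and hostname by default.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if cafile:
        ctx.load_verify_locations(cafile)       # CAs trusted for the server cert
    if certfile:
        ctx.load_cert_chain(certfile, keyfile)  # client certificate and key
    return ctx


server_ctx = make_server_context()
client_ctx = make_client_context()
```

In NiFi the same roles are played by the keystore (the node's own certificate and key) and the truststore (the certificates it is willing to trust).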


Important NiFi SSL Configuration properties
• NiFi provides several different configuration options for security purposes.
• The most important properties are those under the "security properties" heading in the nifi.properties file. In
order to run securely, the following properties must be set:
Property Name | Description
nifi.security.keystore | Keystore that contains the server’s private key.
nifi.security.keystoreType | The type of Keystore. Must be either PKCS12 or JKS.
nifi.security.keystorePasswd | The password for the Keystore.
nifi.security.keyPasswd | The password for the certificate in the Keystore.
nifi.security.truststore | Truststore that will be used to authorize those connecting to NiFi.
nifi.security.truststoreType | The type of the Truststore. Must be either PKCS12 or JKS.
nifi.security.truststorePasswd | The password for the Truststore.
nifi.security.needClientAuth | Specifies whether or not connecting clients must authenticate themselves.

Important NiFi SSL Configuration properties
• Once the above properties have been configured, we can enable the User Interface to be accessed over HTTPS
instead of HTTP.
• This is accomplished by setting the nifi.web.https.host and nifi.web.https.port properties.
• The nifi.web.https.host property indicates which hostname the server should run on.
• It is important when enabling HTTPS that the nifi.web.http.port property be unset.
• Now that the User Interface has been secured, we can easily secure Site-to-Site connections and intra-cluster
communications as well.
• This is accomplished by setting the nifi.remote.input.secure and nifi.cluster.protocol.is.secure properties,
respectively, to true.
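The rules above lend themselves to a simple consistency check. The sketch below takes already parsed properties as a dictionary and lists anything that contradicts the slides; the function name `check_https_config`, the sample paths, the port, and the passwords are all hypothetical.

```python
REQUIRED_KEYS = (
    "nifi.security.keystore",
    "nifi.security.keystoreType",
    "nifi.security.keystorePasswd",
    "nifi.security.keyPasswd",
    "nifi.security.truststore",
    "nifi.security.truststoreType",
    "nifi.security.truststorePasswd",
)
STORE_TYPES = {"JKS", "PKCS12"}


def check_https_config(props):
    """Return a list of problems found in a dict of nifi.properties values."""
    problems = []
    for key in REQUIRED_KEYS:
        if not props.get(key):
            problems.append(f"{key} is not set")
    for key in ("nifi.security.keystoreType", "nifi.security.truststoreType"):
        value = props.get(key)
        if value and value.upper() not in STORE_TYPES:
            problems.append(f"{key} must be PKCS12 or JKS")
    if not props.get("nifi.web.https.port"):
        problems.append("nifi.web.https.port is not set")
    if props.get("nifi.web.http.port"):
        problems.append("nifi.web.http.port should be unset when HTTPS is enabled")
    return problems


# Hypothetical values for illustration only.
secure = {
    "nifi.security.keystore": "/opt/nifi/conf/keystore.jks",
    "nifi.security.keystoreType": "JKS",
    "nifi.security.keystorePasswd": "changeit",
    "nifi.security.keyPasswd": "changeit",
    "nifi.security.truststore": "/opt/nifi/conf/truststore.jks",
    "nifi.security.truststoreType": "JKS",
    "nifi.security.truststorePasswd": "changeit",
    "nifi.web.https.host": "node1.hortonworks.com",
    "nifi.web.https.port": "9443",
    "nifi.web.http.port": "",
}
problems = check_https_config(secure)
```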

NiFi SSL Configuration Options
• To configure certificate-based security in NiFi, administrators have a few options:
1. Configure SSL for NiFi with your own certificate
2. Use NiFi provided TLS Generation Toolkit manually
3. Use Ambari to Automatically generate Certificates with TLS Generation Toolkit

1. Configure SSL for NiFi with your own certificate
To configure NiFi to use external certificates, follow the steps below:
• Obtain your own certificates from a certificate authority such as Verisign or Microsoft Certificate Server.
• Determine an Initial Admin Identity when you configure NiFi for the first time (as well as an identity for each
cluster node, so that the nodes can communicate with each other).
• Configure the properties for SSL authentication:
– Configure the properties manually in the NiFi property files, OR
– Leverage Ambari to configure NiFi once you have created the certificates.
• Restart NiFi manually or via Ambari for the changes to take effect.
• Log in as the Initial Admin after loading the certificates into the browser.

2. Use NiFi provided TLS Generation Toolkit manually
To use the NiFi-provided TLS toolkit, follow the steps below:
• Generate certificates with the TLS toolkit, for example:

$ bin/tls-toolkit.sh standalone -n 'node1.hortonworks.com' -C 'CN=node1,OU=NIFI'
• Determine an Initial Admin Identity when you configure NiFi for the first time (as well as an identity for each
cluster node, so that the nodes can communicate with each other).
• Configure the properties manually in the NiFi property files.
• Restart NiFi manually or via Ambari for the changes to take effect.
• Log in as the Initial Admin after loading the certificates into the browser.

3. Use Ambari to Automatically generate Certificates
To configure NiFi SSL with Ambari, follow the steps below:
• In Ambari, select Services > NiFi > Config > Advanced nifi-ambari-ssl-config, then set:
– Initial Admin Identity
– Enable SSL
– Keystore path, type, and password
– Truststore path, type, and password
– Force regenerate CA
– Cluster node identities

Initial Admin Identity
• When you set up a secured NiFi instance for the first time, you must manually designate an “Initial Admin
Identity”.
• This initial admin user is granted access to the UI and given the ability to create additional users, groups, and
policies.
• You can add it manually in the “authorizers.xml” file or use Ambari to update it.
• Restart NiFi once updated.
Configure Initial Admin Manually
Configure Initial Admin via Ambari

Node Identity
• The identity of a NiFi cluster node.
• When clustered, a property for each node should be defined, so that every node knows about every other
node.
• The authorization policies required for the nodes to communicate are created during startup.
• If not clustered, these properties can be ignored.
• If you are using Ranger, you can ignore these properties.
• Restart NiFi once updated.
Configure node identity Manually
Configure node identity via Ambari

LDAP User Authentication
• NiFi supports user authentication via client certificates or via username/password.
• Username/password authentication is performed by a Login Identity Provider.
• The Login Identity Provider is a pluggable mechanism for authenticating users via their username/password.
• Which Login Identity Provider to use is configured in two properties in the nifi.properties file.
– nifi.login.identity.provider.configuration.file: specifies the configuration file for Login Identity
Providers.
– nifi.security.user.login.identity.provider: indicates which of the configured Login Identity
Providers should be used.
• Set nifi.security.user.login.identity.provider=ldap-provider to use LDAP authentication.
• NiFi does not perform user authentication over HTTP. Over HTTP, all users are granted all roles.
• You can configure it manually or via Ambari:
1. Configure the login identity provider details and the details needed to connect to the LDAP server.
2. Set the login identity provider to ldap-provider.
3. Restart NiFi, log in as admin, create users in NiFi, and assign roles.

LDAP User Authentication Configuration
Property | Description
Authentication Strategy | How the connection to the LDAP server is authenticated.
Manager DN | The DN of the manager that is used to bind to the LDAP server to search for users.
Manager Password | The password of the manager that is used to bind to the LDAP server to search for users.
TLS - Keystore | Path to the Keystore that is used when connecting to LDAP using START_TLS.
TLS - Keystore Password | Password for the Keystore that is used when connecting to LDAP using START_TLS.
TLS - Keystore Type | Type of the Keystore that is used when connecting to LDAP using START_TLS (e.g. JKS or PKCS12).
TLS - Truststore | Path to the Truststore that is used when connecting to LDAP using START_TLS.
TLS - Truststore Password | Password for the Truststore that is used when connecting to LDAP using START_TLS.
TLS - Truststore Type | Type of the Truststore that is used when connecting to LDAP using START_TLS (e.g. JKS or PKCS12).
TLS - Client Auth | Client authentication policy when connecting to LDAP using START_TLS.
TLS - Protocol | Protocol to use when connecting to LDAP using START_TLS (e.g. TLS, TLSv1.1, TLSv1.2).
TLS - Shutdown Gracefully | Specifies whether the TLS should be shut down gracefully before the target context is closed. Defaults to false.
Referral Strategy | Strategy for handling referrals. Possible values are FOLLOW, IGNORE, THROW.
Url | URL of the LDAP server (e.g. ldap://<hostname>:<port>).
User Search Base | Base DN for searching for users (e.g. CN=Users,DC=example,DC=com).

Kerberos User Authentication
Kerberos User Authentication
• NiFi can be configured to use Kerberos SPNEGO (or "Kerberos Service") for authentication
• NiFi will only respond to Kerberos SPNEGO negotiation over an HTTPS connection.
• Which Login Identity Provider to use is configured in two properties in the nifi.properties file.
• Set nifi.security.user.login.identity.provider=kerberos-provider to use Kerberos authentication.
• Set nifi.login.identity.provider.configuration.file=./conf/login-identity-providers.xml to point to login
identity provider file.
Property Name | Description
nifi.kerberos.krb5.file | The location of the krb5 file. [e.g. /etc/krb5.conf]
nifi.kerberos.service.principal | The name of the NiFi Kerberos service principal. [e.g. nifi/HDF@EXAMPLE.COM]
nifi.kerberos.keytab.location | The file path of the NiFi Kerberos keytab. [e.g. /opt/nifi-HDF.keytab]
nifi.kerberos.authentication.expiration | The expiration duration of a Kerberos user authentication. [e.g. 12 hours]

Manually Configuring Kerberos User Authentication
• Below is an example and description of configuring a Login Identity Provider that integrates with a Kerberos
Key Distribution Center (KDC) to authenticate users.
• Configure ./conf/login-identity-providers.xml as below to enable username/password authentication.
• Set the following property in nifi.properties:

nifi.security.user.login.identity.provider=kerberos-provider

Enabling Kerberos User Authentication via Ambari
1. Enable Kerberos from Ambari.
2. Configure the KDC and proceed until Kerberos is enabled.
3. Complete the Kerberos enablement.
4. As admin, add the Kerberos users in NiFi and grant them access policies; they can then log in.

Lab: NiFi Security with 2-Way SSL
Lab: Integrating LDAP
Lab: Integrating Kerberos
Security: HDF Authorization and Multi-Tenancy
Enabling Authentication & Authorization
Authorization Support Starting HDF-2.0:

• File Based Policies (NiFi Managed)
• Ranger Policies

HDF 3.0 – NiFi Managed Authorizer vs. External Authorizer
 Managed Authorizer
– File based persistence
• Could be extended to other persistence mechanisms
– NiFi UI to manage policies
– NiFi controls authorization logic
 External Authorizer
– Ranger integration
– Ranger UI to manage policies
– Ranger controls authorization logic

HDF 3.0 - Authorization Model
 HDF 2.0 introduces a new delegated authorization model

 Delegate authorization to a pluggable Authorizer interface
AuthorizationResult authorize(AuthorizationRequest request)
 Authorize each request based on user identity, action, and resource
– Example for user1 modifying properties on processor1:
• User Identity: user1
• Action: WRITE
• Resource: processor1 (uuid)
 Authorizer determines if the user can perform the action on the given resource
 If authorizer says resource not found, parent is checked… if parent isn’t found, parent’s
parent is checked, and so on…
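The lookup described above can be sketched as follows. This is a simplified model, not NiFi's Java Authorizer interface: the class `PolicyAuthorizer`, the function `authorize_with_inheritance`, and the resource names are hypothetical, and falling back to "denied" when no ancestor has a policy is an assumption made for the example.

```python
from enum import Enum


class Result(Enum):
    APPROVED = "approved"
    DENIED = "denied"
    RESOURCE_NOT_FOUND = "resource not found"


class PolicyAuthorizer:
    """Holds policies as {(resource, action): set of user identities}."""

    def __init__(self, policies):
        self.policies = policies

    def authorize(self, user, action, resource):
        users = self.policies.get((resource, action))
        if users is None:
            return Result.RESOURCE_NOT_FOUND  # no policy defined on this resource
        return Result.APPROVED if user in users else Result.DENIED


def authorize_with_inheritance(authorizer, parents, user, action, resource):
    """Walk up the parent chain until some resource has a policy for the action."""
    current = resource
    while current is not None:
        result = authorizer.authorize(user, action, current)
        if result is not Result.RESOURCE_NOT_FOUND:
            return result
        current = parents.get(current)
    return Result.DENIED  # assumption: nothing found anywhere means no access


# processor1 lives in group-a, which lives in the root group.
parents = {"processor1": "group-a", "group-a": "root", "root": None}
authorizer = PolicyAuthorizer({
    ("root", "WRITE"): {"admin"},
    ("group-a", "WRITE"): {"user1"},
})

user1_write = authorize_with_inheritance(authorizer, parents, "user1", "WRITE", "processor1")
admin_write = authorize_with_inheritance(authorizer, parents, "admin", "WRITE", "processor1")
```

In this model the closest ancestor with a policy decides the outcome, so the policy on group-a shadows the one on root: user1 is approved and admin is denied for processor1.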

Managed/File Based Authorizer
Secured UI Overview
• Users icon in the Global Menu: used to access Users/Groups
• Lock icon in the Global Menu: used to access Global policies
• Lock icon in the palette: used to access policies for the currently selected component
• Selected root process group

Secured NiFi - Managing Users
 Clicking the new user icon allows the admin to create Users and Groups
– Individual Users can be grouped
– Groups can be assigned members
 Clicking the edit user icon allows the admin to update a specific User/Group

Managing Global Policies
 Select the policy in question
– Optionally select if the action is view or modify
 Click the add user button to search for the desired User
 Controller policies extend to Reporting Tasks and Controller-level Services unless explicitly overridden
 Global admin policies extend to all components (cannot be overridden)

Global Policies
Policy | Privilege | Global Menu Selection
View the UI | Allows users to view the UI | N/A
Access the controller | Allows users to view/modify the controller, including Reporting Tasks, Controller Services, and Nodes in the Cluster | Controller Settings
Query provenance | Allows users to submit a Provenance Search and request Event Lineage | Data Provenance
Access restricted components | Allows users to create/modify restricted components, assuming otherwise sufficient permissions | N/A
Access all policies | Allows users to view/modify the policies for all components | Policies
Access users/user groups | Allows users to view/modify the users and user groups | Users
Retrieve site-to-site details | Allows other NiFi instances to retrieve Site-To-Site details | N/A
View system diagnostics | Allows users to view System Diagnostics | Summary
Proxy user requests | Allows proxy machines to send requests on the behalf of others | N/A
Access counters | Allows users to view/modify Counters | Counters

Managing Component Policies
 Clicking the lock icon brings up policies for the selected component
 Drop-down specifies the action
 Click Create to define a new policy if none is defined, then add Users and Groups
 Data policies require the entire request chain to be authorized
– Zero master clustering: user requests are replicated through other nodes

Component Policies
Policy | Privilege
View the component | Allows users to view component configuration details
Modify the component | Allows users to modify component configuration details
View the data | Allows users to view metadata and content for this component through provenance data and flowfile queues in outbound connections
Modify the data | Allows users to empty flowfile queues in outbound connections and submit replays
View the policies | Allows users to view the list of users who can view/modify a component
Modify the policies | Allows users to modify the list of users who can view/modify a component
Receive data via site-to-site | Allows a port to receive data from NiFi instances
Send data via site-to-site | Allows a port to send data to NiFi instances

Viewing Policies on Users
 From the UI, select “Users” from the Global Menu. This opens the NiFi Users dialog.
 Select the View User Policies icon

Overriding Component Policies
 Components inherit policies from the closest ancestor Process Group with policies defined
 View/Modify policies are handled independently
 Click Override to define a new policy, then add Users and Groups
 New Users and Groups override the inherited policies (whitelisting)

Multi-Tenancy Example
 Assume two development teams – Team 1 and Team 2
 Each team gets a Process Group and shouldn’t be able to interfere with the other team
 Create a Group for Team 1 and a Group for Team 2
 Give Team 1 view & modify for Process Group 1
 Give Team 2 view & modify for Process Group 2
 A user from Team 1 would see:
– For the other team’s Process Group, the user can’t see the name of the group and can’t right-click to
configure the group, but can enter the group

Inside Team2 Process Group
 If the user from Team 1 entered the group for
Team 2 they can see the structure of the
dataflow and status but not any configuration
– Processor type, properties, etc
 All components underneath Team 2’s process
group inherit the policies of the process group,
unless more specific policies are defined

Additional changes to promote Multi-Tenancy
 Controller level scoping for
– Reporting Tasks
– Controller Services
 Introducing Process Group level scoping for
– Templates
– Controller Services

Controller Settings
 Authorization based on policies for accessing the Controller
– Can be overridden for a given Controller Service or Reporting Task
 Controller Services defined here are accessible to Reporting Tasks
– Are not accessible to components on the canvas

Process Group Settings
 Authorization based on policies for accessing the Process Group
– Can be overridden for a given Controller Service
 Lists all Controller Services created in this Process Group and above (ancestor
Process Groups)
 Controller Services defined here are accessible to Processors in this Process Group
and below (descendant Processors)

Templates
 Templates stored per Process Group
– Establishes base authorization but can be overridden
 Available Template listing still accessed through the Global Menu
 If authorized, a Template can be instantiated in any Process Group

Revisions
 Revision per component
 Supports concurrent editing of different components without need for refreshing

Lab: File Based Authorizer
External/Ranger Based Authorizer
Architecture: Authorization via Ranger-NiFi Plugin
• Ranger-NiFi plugin supports policy management via Ranger for NiFi
• Supports two-way communication: policy retrieval from Ranger (by NiFi) and REST /resources
information from NiFi (by Ranger)
• Audits logged to Solr (Ambari Infra on HDF)
• Can be used with/without Ambari
[Diagram: Apache Ranger (Policies, Audit Logs in Solr, Resource Lookup) and NiFi (Ranger Authorizer with
Policy Refresher and Policy Cache, REST /resources). NiFi retrieves policies from Ranger; Ranger retrieves
resources from NiFi.]

HDF - Apache Ranger Integration
• Apache Ranger provides a centralized platform to define, administer and
manage security policies consistently across Hadoop components.
• In the case of HDF, it enables the administrator to create/manage
authorization policies for Kafka, Storm and NiFi from the same web interface
(or REST APIs).
• The high-level steps for NiFi integration are:
1. Ranger install prerequisites
2. Ranger install
3. Update NiFi Ranger repo
4. Test Ranger plugin
5. Create Ranger users and policies
6. Test NiFi access as nifiadmin user

1. Ranger install prerequisites:
 Make sure Logsearch or an external Solr is installed and running before installing
Ranger (used to store audits).
 A MySQL, Oracle, or PostgreSQL database instance must be running and
available to be used by Ranger.
 Configure the RDBMS for Ranger (used to store policies) with the set of RDBMS scripts.
 Set up Ambari with the Ranger database jars.

2. Ranger install:
• Click Add Service.
• Select Ranger and click Next.
• Configure the Ranger Admin and user sync sections.
• Configure the Ranger plugin and Ranger Audit sections. Click Next to complete the installation.

3. Update NiFi Ranger repo:
 This is needed to enable auto-completion when creating policies in Ranger for
NiFi.
 Note that if this step is skipped, the Ranger plugin will still work as usual; it only
affects lookups when creating NiFi policies from the Ranger web interface.
 To access the NiFi repo in Ranger:
– Open the Ranger UI from Ambari
– Go to Ranger > Access Manager > NiFi and click the Edit icon
– Update the configs, test, and save

4. Test Ranger plugin:
 The first attempt to open the NiFi UI results in "Access denied" due to insufficient
permissions.
 Navigate to the ‘Audit’ tab to verify that the Ranger NiFi plugin is working and auditing
is done properly.

5. Create Ranger users and policies:
 Create a NiFi admin user and proxy users for each node in the cluster, and create
policies for them to access the NiFi UI:
– Create/sync users in Ranger
– Create access policies for those users

6. Test NiFi access as nifiadmin user:
 Note that it may take up to 30s after creating the policies in the Ranger UI for
them to take effect.
 You can now log in as the admin user who was given privileges in Ranger.
 The audit log in Ranger is populated with the latest access.
 Now you can add more users and policies for their access.
Lab: Ranger Based Authorizer
Thank You!!!
