HDF 3.0
• Your name
• Job responsibilities
• Previous NiFi exposure (if any)
• Your expectations for the course
Schedule
Facilities, breaks, restrooms
Lunch
Computers and Wireless Access
Store Data
Internet of Anything
Clickstream
The perimeter is outside the data center and can be very jagged
– This "Jagged Edge" creates new opportunities for security, data protection, data governance, and provenance
• Small footprints: operate with very little power (DELIVER)
• Limited bandwidth: can create high latency (PRIORITIZE)
• Data availability: exceeds transmission bandwidth; recoverability matters (GATHER)
Collect: Bring Together
• Sensors
• Logs
• Files
• Feeds
Aggregate all IoAT data from sensors, geo-location devices, machines, logs, files, and feeds via a highly secure lightweight agent.
NiFi: Developed by the National Security Agency
• Developed by the NSA over the last 8 years.
• 2006: NiagaraFiles (NiFi) was first incepted at the National Security Agency (NSA).
• November 2014: NiFi was donated to the Apache Software Foundation (ASF) through NSA's Technology Transfer Program and entered ASF's incubator.
• July 2015: NiFi reached ASF top-level project status.
What?
– Introduce a 'record'-based operation model
– 'RecordReader' and 'RecordWriter' controller services
– A series of processors supporting the reader/writer processing mechanism
• Plug in a record reader to de-serialize bytes into record objects
• Plug in a record writer to serialize record objects back to bytes
• Enable operations against in-memory record objects
How?
Record Readers
– JsonTreeReader
– JsonPathReader
– AvroReader
– CSVReader
– GrokReader
– Scripted Reader
Record Writers
– JsonWriter
– AvroWriter
– CSVWriter
– FreeFormTextWriter
– ScriptedRecordSetWriter
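As a simple illustration of the reader/writer mechanism, a ConvertRecord processor pairs any reader with any writer. A minimal sketch (service names follow the lists above; the schema-strategy properties each service needs are omitted):

ConvertRecord
  Record Reader = CSVReader    (de-serializes CSV bytes into record objects)
  Record Writer = JsonWriter   (serializes the record objects back out as JSON bytes)

The same flow then converts CSV to JSON without any per-format transformation code.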
Component Versioning
Why?
– Foundational work to enable extension registry
– Foundational work to enhance flow migration experience
What?
– Support multiple versions of the same NAR in a single NIFI instance
– E.g. Hadoop NAR version A: Apache Hadoop client lib; Hadoop NAR version B: proprietary Hadoop
client lib
How?
Change Data Capture (CDC)
What?
– Entry level CDC solution
– Supported source DB: MySQL; others will be supported in following releases
– Supported target DBs
• INSERTS/UPDATES/DELETES: supports DBs that accept standard SQL out of the box
• DDL: the template needs to be customized due to SQL syntax differences
How?
CaptureChangeMySQL → Processor 2 → Processor 3 → EnforceOrder → PutDBRecord
Why?
– Optimized manageability
– Reduced operational overhead
What?
– Make NIFI available as an add-on service to HDP stack
– Single Ambari for cluster management, single Ranger for policy management
– Available to customers paying for both HDF and HDP support
Pre-requisite
– Ambari 2.5.1, HDP 2.6.1, HDF 3.0 management pack
Deployment scenarios (existing HDP customer / existing HDF customer / wants to deploy HDF-NiFi / wants to deploy HDF-Storm/Kafka/SAM / wants to deploy HDF-StreamInsight TP, which has an HDP dependency):
– Neither an HDP nor an HDF customer; wants HDF-NiFi, HDF-Storm/Kafka/SAM, and HDF-StreamInsight TP: 1 Ambari/Ranger instance; install HDP 2.6.x, then add the HDF 3.x services.
– Existing HDF customer (no HDP); wants HDF-NiFi and HDF-Storm/Kafka/SAM, but not StreamInsight: 1 Ambari/Ranger instance; upgrade to Ambari 2.5.1, upgrade NIFI, add SAM, etc.
– Existing HDF customer (no HDP); wants HDF-NiFi, HDF-Storm/Kafka/SAM, and HDF-StreamInsight TP: 2 Ambari/Ranger instances; one manages the existing NIFI, and a new Ambari 2.5.1 is installed to manage StreamInsight.
– Existing HDP customer (no HDF); wants HDF-NiFi, HDF-Storm/Kafka/SAM, and HDF-StreamInsight TP: 1 Ambari/Ranger instance; upgrade to HDP 2.6.1, then add the HDF 3.x services.
– Existing HDP and HDF customer; wants HDF-NiFi, HDF-Storm/Kafka/SAM, and HDF-StreamInsight TP: 2 Ambari/Ranger instances; one manages the existing NIFI, one manages Storm/Kafka/SAM.
Zero-master clustering
– Multiple entry points, no master node, no
single point of failure
– Auto-elected cluster coordinator for cluster
maintenance
– Automatic failover handling
Authorization management
– Internal management (NIFI)
– External management (Ranger, etc.)
– Could be IoT use cases, where you have a large number of devices: connected vehicles, etc.
– Could be in the data center; think of a number of log servers. The key is that you want to deploy the same flow, with simple functions, on multiple devices.
Formats and protocols: HTTP, WebSocket, Email, HTML, Image, Syslog, AMQP
Flow operations: Route Text, Distribute Load, Route Content, Generate Table Fetch, Route Context, Jolt Transform JSON, Control Rate, Prioritized Delivery
NiFi Positioning
Apache NiFi / MiNiFi sits alongside:
• Enterprise Service Bus (Fuse, Mule, etc.)
• Processing Framework (Storm, Spark, etc.)
• ETL (Informatica, etc.)
• Messaging Bus (Kafka, MQ, etc.)
Connection (bounded buffer): The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Flow Controller (scheduler): Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Process Group (subnet): A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
Web Server
• The purpose of the web server is to host NiFi’s HTTP-based command and control API.
Flow Controller
• The flow controller is the brains of the operation.
• It provides threads for extensions to run on and manages their schedule of when they’ll receive resources to
execute.
Extensions
• There are various types of extensions for NiFi which will be described in other documents.
• But the key point here is that extensions operate/execute within the JVM.
Content Repository
• The Content Repository is where the actual content bytes of a given FlowFile live.
• The default approach stores blocks of data in the file system.
• More than one file system storage location can be specified so as to get different physical partitions engaged to
reduce contention on any single volume.
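A sketch of how additional storage locations are declared in nifi.properties (the first entry shows the stock default; the second directory name and path are assumptions for illustration):

nifi.content.repository.directory.default=./content_repository
nifi.content.repository.directory.content2=/mnt/disk2/content_repository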
Provenance Repository
• The Provenance Repository is where all provenance event data is stored.
• The repository construct is pluggable with the default implementation being to use one or more physical disk
volumes.
• Within each location event data is indexed and searchable.
Nodes:
• Each cluster is made up of one or more nodes. The nodes do the actual data processing.
Primary Node:
• Every cluster has one Primary Node. On this node, it is possible to run "Isolated Processors" (see below).
ZooKeeper Server:
• It is used to automatically elect a Primary Node and cluster coordinator.
Authentication
Authenticate users and systems: 2-Way SSL support out of the box; LDAP Integration; Kerberos Integration
• Initial Admin: Manually designate an initial admin user granted access to the UI.
• Legacy Authorized Users: Convert previously configured users and roles to the multi-tenant model.
• Cluster Node Identities: Secure identities for each node.
• Maximum Forked Processes: increase the allowable number of threads by editing /etc/security/limits.conf
• hard nproc 10000
• soft nproc 10000
Objectives
• Password-less SSH configured
http://public-repo-1.hortonworks.com/HDF/3.0.0.0/nifi-1.2.0.3.0.0.0-453-bin.tar.gz
OR
http://nifi.apache.org/download.html
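A possible way to fetch and unpack the HDF build on Linux (the URL follows the first link above; the extracted directory name may differ):

# download the NiFi 1.2.0 (HDF 3.0.0.0) tarball and extract it
wget http://public-repo-1.hortonworks.com/HDF/3.0.0.0/nifi-1.2.0.3.0.0.0-453-bin.tar.gz
tar -xzf nifi-1.2.0.3.0.0.0-453-bin.tar.gz
cd nifi-1.2.0.3.0.0.0-453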
• OS Configuration Best Practices: Typical Linux defaults are not necessarily well tuned for the needs of an
IO intensive application like NiFi.
• Security Configuration: NiFi provides several different configuration options for security purposes. The most important properties are those under the "security properties" heading in the nifi.properties file.
• Controlling Levels of Access: Configuring who will have access to the system and what types of access those people will have. NiFi controls this through the use of an Authority Provider.
• Bootstrap Properties: Allows users to configure settings for how NiFi should be started.
• Notification Services: When the NiFi bootstrap starts or stops NiFi, or detects that it has died unexpectedly, it is able to notify configured recipients (currently, only email notification is supported).
• NiFi System Properties: The nifi.properties file in the conf directory is the main configuration file for
controlling how NiFi runs.
• For Windows users, navigate to the folder where NiFi was installed.
• Within this folder is a subfolder named bin. Navigate to this subfolder and double-click the run-nifi.bat file.
• To shut down NiFi, select the window that was launched and hold the Ctrl key while pressing C.
• Navigate to the directory where NiFi was installed. To run NiFi in the foreground, run:
bin/nifi.sh run.
• This will leave the application running in the foreground. To shut down, press Ctrl-C.
• To check the status and see if NiFi is currently running, execute the command:
bin/nifi.sh status.
bin/nifi.sh install
• To specify a custom name for the service, execute the command with an optional second argument that is the name of the service. For example, to install NiFi as a service with the name dataflow, use the command: bin/nifi.sh install dataflow
• Once installed, the service can be started and stopped using the appropriate commands:
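For example (assuming the service was installed with the default name nifi):

sudo service nifi start
sudo service nifi stop
sudo service nifi status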
- Note: Once NiFi is started, you will see that the number of subdirectories inside the installation directory increases.
http://localhost:8080/nifi
• The port can be changed by editing the nifi.properties file in the NiFi conf directory, but the default port is
8080.
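The relevant entry in conf/nifi.properties (default value shown):

nifi.web.http.port=8080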
• This will bring up the User Interface, which at this point is a blank canvas for orchestrating a dataflow:
• Processor
• Input port
• Output Port
• Process Groups
• Remote Process Groups
• Funnels
• Template
• Label
• Enable
• Disable
• Start
• Stop
• Create Template
• Copy
• Paste
• Group components
• Change Color
• Delete
Users are able to search by component name, type, identifier, configuration properties, and their values.
• Summary
• Counters
• Bulletin Board
• Data Provenance
• Controller Settings
• Flow Configuration History
• Users
• Policies
• Templates
• User Settings
• Cluster
• Help
• About
To the left is the Components Toolbar. This toolbar consists of the different components that can be dragged
onto the canvas like:
• Processor
• Input port
• Output Port
• Process Groups
• Remote Process Groups
• Funnels
• Template
• Label
• Process Groups can be used to logically group a set of components so that the dataflow is easier to understand
and maintain.
• When a Process Group is dragged onto the canvas, the DFM is prompted to name the Process Group.
• All Process Groups within the same parent group must have unique names.
• The Process Group will then be nested within that parent group.
2) Configure priorities
- Connections can be configured with FlowFile Prioritizers.
- Data from several Connections can be funneled into a single Connection, providing the ability
to Prioritize all of the data on that one Connection.
NiFi Version
System Tab Details
GC details
Flow File and Content Repo Disk usage
Historical Statistics of a Component
• While the Summary table and the canvas show numeric statistics pertaining to the performance of a
component over the past five minutes, it is often useful to have a view of historical statistics as well.
• This information is available by right-clicking on a component and choosing the “Stats” menu option or by
clicking on the Stats History in the Summary page.
• The amount of historical information that is stored is configurable in the NiFi properties but defaults to 24
hours.
• When the Stats dialog is opened, it provides a graph of historical statistics:
2) Processor Name:
• This is the user-defined name of the Processor.
• By default, the name of the Processor is the same as the Processor Type.
• In the example, this value is "Copy to /review".
Shows the current Status of the Processor. The following indicators are possible:
Invalid: The Processor is enabled but is not currently valid and cannot be started. Hovering over this icon
will provide a tooltip indicating why the Processor is not valid.
Disabled: The Processor is not running and cannot be started until it has been enabled. This status does
not indicate whether or not the Processor is valid.
• Data Transformation
• Routing and Mediation
• Database Access
• Attribute Extraction
• System Interaction
• Data Ingestion
• Data Egress / Sending Data
• Splitting and Aggregation
• HTTP
• Amazon Web Services
Data Ingestion Processors
• GetKafka: Consumes messages from Apache Kafka.
• GetMongo: Executes a user-specified query against MongoDB and writes the contents to a new FlowFile.
• GetTwitter: Allows a filter to listen to the Twitter endpoint, creating a FlowFile for each tweet that is received.
• GetHDFS: Monitors an HDFS directory. When a file enters HDFS, it is copied into NiFi and deleted from HDFS.
• ListHDFS : Monitors a directory in HDFS and emits a FlowFile for each file with filename as its content.
• FetchHDFS: On receiving FlowFile from ListHDFS, it fetches the actual files from HDFS to NiFi.
• GetFTP: Downloads the contents of a remote file via FTP into NiFi and then deletes the original file.
• GetSFTP: Downloads the contents of a remote file via SFTP into NiFi and then deletes the original file
• GetFile: Streams the contents of a file from a local disk into NiFi and then deletes the original file.
• ListenHTTP: Starts an HTTP (or HTTPS) Server and listens for incoming connections.
• ListenUDP: Listens for incoming UDP packets and creates a FlowFile, emits to the success relationship.
HTTP Processors
• GetHTTP: Downloads the contents of a remote HTTP- or HTTPS-based URL into NiFi.
• ListenHTTP: Starts an HTTP (or HTTPS) Server and listens for incoming connections.
• InvokeHTTP: Performs an HTTP Request that is configured by the user
• PostHTTP: Performs an HTTP POST request, sending the contents of the FlowFile as the body
• HandleHttpRequest : Is a Source Processor that starts an HTTP(S) server similarly to ListenHTTP.
• HandleHttpResponse: Sends a response back to the client after the FlowFile has finished processing.
Relationship included
Source Component Process Group
Click ok on
confirmation
Click on Empty
• Once you click on Controller Settings, a window opens with the following three tabs:
• The first tab in the Controller Settings window is General:
General:
-- Name of the flow.
-- Comments that describe the parent flow.
-- Maximum thread counts of the instance.
-- Info here will be visible to every user.
-- Backup/archive your current flow.
Controller Services:
-- View all the controller services added.
-- Click the + button to add new controller services.
-- Then configure the controller services.
-- Edit, remove, enable, and see-usage buttons are also available.
Reporting Tasks:
-- View all the reporting tasks added.
-- Click the + button to add new reporting tasks.
-- Then configure the reporting tasks.
-- Edit, remove, enable, and see-usage buttons are also available.
1) ControllerStatusReportingTask : Logs the 5-minute stats that are shown in the NiFi Summary Page
2) MonitorDiskUsageReportingTask : Checks storage space available for Repositories and warns
3) MonitorMemoryReportingTask: Checks Java Heap available in the JVM for a JVM Memory Pool.
4) StandardGangliaReporter: Reports metrics to Ganglia for external monitoring of the application.
5) AmbariReportingTask: Publishes metrics from NiFi to Ambari.
6) DataDogReportingTask: Publishes metrics from NiFi to Datadog.
7) SiteToSiteProvenanceReportingTask: Publishes Provenance events using the Site-To-Site protocol.
Note: Each task is discussed in detail in 'Monitoring NiFi'.
• DBCPConnectionPool • StandardSSLContextService
• DistributedMapCacheClientService • AWSCredentialsProviderControllerService
• DistributedMapCacheServer • JMSConnectionFactoryProvider
• DistributedSetCacheClientService • HBase_1_1_2_ClientService
• DistributedSetCacheServer • HiveConnectionPool
• StandardHttpContextMap • CouchbaseClusterService
Add a connection
Click hadoop
Click Ingest
Unique identifier
Relationship termination
Type
To penalize flowfile
Property
Save Changes
Enter Values
Apply/Save Changes
Configure/Enter Group
Select the required Template to Import
Adding Labels
• Labels are used to provide documentation to parts of a dataflow.
• When a Label is dropped onto the canvas, it is created with a default size.
• The Label can then be resized by dragging the handle in the bottom-right corner.
• The Label has no text when initially created. The text of the Label can be added by right-clicking on the Label and choosing Configure...
Drag Input Port to Canvas
You can right-click and start… or
Enabling/Disabling a Component
• When a component is enabled, it is able to be started.
• Users may choose to disable components when they are part of a dataflow that is still being assembled.
• This helps to distinguish between components intentionally stopped and stopped temporarily.
• A component can be enabled by clicking Enable icon in the Actions Toolbar, or in configuration.
• Only Ports and Processors can be enabled and disabled.
Now it is enabled and can be started/disabled.
Pulling data from Kafka / Pulling data from X / Pulling data from Y / Pulling HTTP data
Decompress
Push it to HDFS
1) Name:
• This is the user-defined name of the Process Group.
• This name is set when the Process Group is added to the canvas.
• The name can later be changed by right-clicking on the Process Group and clicking the “Configure” option.
• In this example, the name of the Process Group is “Process Group ABC.”
2) Bulletin Indicator:
• When a child component of a Process Group emits a bulletin, that bulletin is propagated to the component’s
parent Process Group, as well.
• When any component has an active Bulletin, this indicator will appear, allowing the user to hover over the icon
with the mouse to see Bulletin.
3) Active Tasks:
• The number of tasks that are currently being executed by the components within this Process Group.
• Here, we can see that the Process Group is currently performing one task.
• If the NiFi instance is clustered, this value represents the number of tasks that are currently executing across all
nodes in the cluster.
4) Comments:
• When the Process Group is added to the canvas, the user is given the option of specifying Comments in order
to provide information about the Process Group.
• The comments can later be changed by right-clicking on the Process Group and clicking the “Configure” menu
option.
• In this example, the Comments are set to “Example Process Group.”
1) Transmission Status:
• The Transmission Status indicates whether or not data Transmission between this instance of NiFi and the
remote instance is currently enabled.
• The icon shown will be the Transmission Enabled icon if any of the Input Ports or Output Ports is currently
configured to transmit or the Transmission Disabled icon if all of the Input Ports and Output Ports that are
currently connected are stopped.
4) Secure Indicator:
• This icon indicates whether or not communications with the remote NiFi instance are secure.
• If communications with the remote instance are secure, this will be indicated by
• If the communications are not secure, this will be indicated by
• If the communications are secure, this instance of NiFi will not be able to communicate with the remote
instance until an administrator for the remote instance grants access.
8) Comments: The Comments that are provided for a Remote Process Group are not comments added by the
users of this NiFi but rather the Comments added by the administrators of the remote instance.
9) Last Refreshed Time: The information that is pulled from a remote instance and rendered on the Remote
Process Group in the User Interface is periodically refreshed in the background.
To view configurations
Enable/Disable whole
transmission
Left side: list of input ports Right side: list of output ports
Batch Settings
NiFi Site-to-Site
NiFi Site-To-Site
• Direct communication between two NiFi instances
• Push to Input Port on receiver, or Pull from Output Port on source
• Communicate between clusters, standalone instances, or both
• Handles load balancing and reliable delivery
• Secure connections using certificates (optional)
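The receiving instance advertises a Site-to-Site port in conf/nifi.properties; a minimal sketch (the port number is an example value):

nifi.remote.input.socket.port=10000
nifi.remote.input.secure=false

Set nifi.remote.input.secure=true to require certificate-based (secure) Site-to-Site connections.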
[Diagram: Site-to-Site between two clusters, http://node1:8080/nifi and http://node5:8080/nifi. A Remote Process Group (RPG) on the sending cluster pushes to an Input Port on each receiving node, or an RPG on the receiving cluster pulls from an Output Port on each source node; a standalone Java program can also pull from an Output Port.]
• filename: A filename that can be used to store the data to a local or remote file system.
• path: The name of a directory that can be used to store the data to a local or remote file system.
• uuid: A Universally Unique Identifier that distinguishes the FlowFile from other FlowFiles in the system.
• entryDate: The date and time at which the FlowFile entered the system (i.e., was created).
• lineageStartDate: Any time that a FlowFile is cloned, merged, or split, this results in a "child" FlowFile being
created. This value represents the date and time at which the oldest ancestor entered the system.
• fileSize: This attribute represents the number of bytes taken up by the FlowFile’s Content.
Note: the uuid, entryDate, lineageStartDate, and fileSize attributes are system-generated and cannot be changed.
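These attributes can be referenced from the Expression Language; a couple of illustrative expressions (hypothetical usage, e.g. inside UpdateAttribute properties):

${path}/${filename}        # relative path plus filename
${fileSize:divide(1024)}   # FlowFile size in KB (fileSize is in bytes)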
• In the case of embedded expressions, that rule of thumb does not exactly apply.
• The embedded expression(s) must evaluate to a result before the function it is contained within can be
evaluated.
• The below table provides a quick listing of these various expression language functions:
2) Structure Highlighting:
• When an open curly bracket, open square bracket, or open parentheses are highlighted, the
corresponding close curly bracket, close square bracket, or close parentheses is also highlighted.
• You can avoid the most common syntax error by immediately adding the close right after the open, then
backing up your cursor one space to continue your expression language statement.
4) Comments:
• Comments can be added at the end of any line in the editor.
• Use the pound/hash symbol (#) to designate where a comment begins.
• Comments continue to the end of the current line.
• If you want to wrap your comment over multiple lines, every line will need to start with a # otherwise,
each new line will be treated as part of the expression language statement.
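For example (toUpper is a standard Expression Language function; the attribute usage is illustrative):

${filename:toUpper()}   # upper-case the filename
# this entire line is a comment
# a wrapped comment needs a leading # on every line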
Action(s)
Filename set to testFile-01.packaged
Using Multiple Expressions
• Now that we know what a basic expression language statement looks like and have learned that it can be wrapped with strings, let's take a look at using multiple expressions chained together.
• Consider a system running NiFi that ingests the syslog file many times per day.
• Every syslog file that is ingested will have the exact same filename.
• Assuming the original filename is always ‘system.log’ and what we want is ‘system.<some uuid>.log’ as the new filename, we can use multiple expressions like so (see the sketch below):
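A sketch of the chained expression (set as the new filename, for example in an UpdateAttribute processor):

${filename:substringBeforeLast('.')}.${uuid}.log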
• We used the ‘substringBeforeLast’ function so we would only capture the filename up until the last ‘.’. We then appended a ‘.’ followed by our second expression, which returns the uuid assigned to the FlowFile. Finally, we added the ‘.log’ back on by appending it to the end.
• As you can see from our output, each file now has a unique name.
• NiFi has no limit on the number of expression language statements you can chain together.
filename = application-hostname-XYZ-20151009.log
• What if we want to extract the date (20151009 in yyyyMMdd format) from this filename and put it in its own attribute named logDate? One possible expression is sketched below.
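One way to write it (assuming the filename always ends in -<yyyyMMdd>.log, as in the example above):

${filename:substringAfterLast('-'):substringBefore('.')}

Given application-hostname-XYZ-20151009.log, this evaluates to 20151009, which can be stored in the logDate attribute via UpdateAttribute.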
Template Added
Managing Templates
• One of the most powerful features of NiFi Templates is the ability to easily export a Template to an XML file and
to import a Template that has already been exported.
• This provides a very simple mechanism for sharing parts of a DataFlow with others.
• You have options to:
• Import a Template
• Export a Template
• Remove a Template
Template Imported
Click Import
• If the backlog of data reaches 1GB, then the GetSFTP processor would stop pulling data until the backlog
dropped below the threshold and then it would resume pulling data from the source system.
• This would allow the NiFi to pull the backlog of data at a rate that wouldn’t over utilize the system resources.
Criteria could be:
- Data rate
- FlowFile count
- Attribute value
• Utilize the information provided by the processors, number of reads/write and tasks/time per task to find
“hot spots” on the graph.
• For instance, if there is a large number of tasks but the amount of data traversing the processor is low, then
the processor might be configured to run too often or with too many concurrent tasks.
• Few completed tasks along with high task time indicates that this processor is CPU intensive.
• If the dataflow volume is high and a processor shows a high number of completed threads and high task time, performance can be improved by increasing the run duration in the processor's scheduling settings.
Details Tab
Attributes Tab
Content Tab
Download/View content
Click on Expand
connected Node
nifi.state.management.configuration.file: XML file that is used for configuring the local and/or cluster-wide State Providers. [./conf/state-management.xml]
nifi.state.management.provider.local: Provides the identifier of the local State Provider configured in this XML file. [local-provider]
nifi.state.management.provider.cluster: Provides the identifier of the cluster-wide State Provider configured in this XML file. [zk-provider]
• If it is clustered with the zk-provider, the ZooKeeper connection string and details should be provided:
nifi.state.management.embedded.zookeeper.start: Whether NiFi should run an embedded ZooKeeper server. [true/false]
nifi.state.management.embedded.zookeeper.properties: Properties file that provides the ZooKeeper properties. [./conf/zookeeper.properties]
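Taken together, the relevant entries in conf/nifi.properties (values shown follow the defaults listed above; set the embedded ZooKeeper flag to true only if NiFi should run its own ZooKeeper):

nifi.state.management.configuration.file=./conf/state-management.xml
nifi.state.management.provider.local=local-provider
nifi.state.management.provider.cluster=zk-provider
nifi.state.management.embedded.zookeeper.start=false
nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties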
• Which ACL is used depends on the value of the Access Control property for the ZooKeeperStateProvider
CreatorOnly: ACL that indicates that only the user that created the data is allowed to access the data.
Open: When data is written to ZooKeeper, NiFi will provide an ACL that indicates that any user is allowed to have full permissions to the data.
• For the CreatorOnly ACL to work, we need to tell ZooKeeper who the creator is. We have two options for this:
• If the NiFi instance is in a cluster, we will also see an indicator here telling us how many nodes are in the cluster
and how many are currently connected.
• In this case, the number of active threads and the queue size are the sums across all nodes that are currently connected.
• The connections between Processors also expose the number of items that are currently queued.
• FlowFilesReceivedLast5Minutes • jvm.heap_used
• BytesReceivedLast5Minutes • jvm.heap_usage
• FlowFilesSentLast5Minutes • jvm.non_heap_usage
• BytesSentLast5Minutes • jvm.thread_states.runnable
• FlowFilesQueued • jvm.thread_states.blocked
• BytesQueued • jvm.thread_states.timed_waiting
• BytesReadLast5Minutes • jvm.thread_states.terminated
• BytesWrittenLast5Minutes • jvm.thread_count
• ActiveThreads • jvm.daemon_thread_count
• TotalTaskDurationSeconds • jvm.gc.runs
• jvm.uptime • jvm.gc.time
• In order to make use of these metrics in Ambari, a NIFI service must be created and installed in Ambari.
Delete/Start
Select memory
pool to monitor
Start/Delete
Click Edit
Button to
configure
• FlowFilesReceivedLast5Minutes • jvm.heap_usage
• BytesReceivedLast5Minutes • jvm.non_heap_usage
• FlowFilesSentLast5Minutes • jvm.thread_states.runnable
• BytesSentLast5Minutes • jvm.thread_states.blocked
• FlowFilesQueued • jvm.thread_states.timed_waiting
• BytesQueued • jvm.thread_states.terminated
• BytesReadLast5Minutes • jvm.thread_count
• BytesWrittenLast5Minutes • jvm.daemon_thread_count
• ActiveThreads • jvm.file_descriptor_usage
• TotalTaskDurationSeconds • jvm.gc.runs
• jvm.uptime • jvm.gc.time
• jvm.heap_used
View last
sent
event_id
Click Add Button
Internet of Anything
• Hortonworks DataFlow (HDF), powered by Apache NiFi: Perishable Insights
• Hortonworks Data Platform (HDP), powered by Apache Hadoop: Historical Insights
Hortonworks DataFlow and the Hortonworks Data Platform
deliver the industry’s most complete Big Data solution
HDF Makes Big Data Ingest Easy
• Before: complicated, messy, and takes weeks to months to move the right data into Hadoop (HDP).
• With HDF: streamlined, efficient, easy.
[Diagram: sources such as service management data, network metadata, syslog, and data stores flow through NiFi, Kafka, and Storm into HDP, powered by Apache Hadoop (Phoenix, Spark, HBase, Hive, SOLR, stream and workflow processing).]
GetHDFS
PutHDFS
ListHDFS
GetHDFSSequenceFile
• Provide type and url of Solr Instance [zookeeper url if cloud type]
• Collection Name if cloud type
• Path to post ContentStream and type of content [eg : Json]
• Solr query and Filter options for GetSolr
PutHiveStreaming
PutHiveQL
PublishKafka ConsumeKafka
nifi-spark-receiver
PutHbaseShell
• They stay relatively small and require very little hard drive space.
• The default installation path of <root-level-nifi-dir>/database_repository would result in the directory being
created at the root level of your NiFi installation (same level as conf, bin, lib, etc directories).
• While there is little to no performance gain by moving this to a new location, we do recommend moving all
repositories to a location outside of the NiFi install directories to simplify upgrading.
• This allows you to retain the user and component history information after upgrading.
• As you can see the default values are very small and not suitable for dataflows of any substantial size.
• We recommend increasing both the initial and maximum heap memory allocations to at least 4 GB or 8 GB for
starters.
java.arg.2=-Xms8g
java.arg.3=-Xmx8g
• If you should encounter any “out of memory” errors in your NiFi app log, this is an indication that you have
either a memory leak or simply insufficient memory allocation to support your dataflow.
• You can configure your NiFi to use the G1 garbage collector by uncommenting the above line.
java.arg.13=-XX:+UseG1GC
• The code cache is memory separate from the heap that contains all the JVM bytecode for a method compiled
down to native code. If the code cache fills, the compiler will be switched off and will not be switched back on
again. This will impact the long running performance of NiFi.
• The only way to recover performance is to restart the JVM (restart NiFi). By removing the comment on this line, the code cache size is increased to 256 MB, which should be sufficient to prevent the cache from filling up.
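The commented-out line being referred to looks roughly like this in conf/bootstrap.conf (the exact argument number is an assumption and may differ in your file):

java.arg.7=-XX:ReservedCodeCacheSize=256m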
• The parameters below establish a boundary for how much of the code cache can be used before flushing of the code cache occurs, preventing it from filling up and stopping the compiler.
• java.arg.8=-XX:CodeCacheMinimumFreeSpace=10m
• java.arg.9=-XX:+UseCodeCacheFlushing
Force regenerate CA
Cluster node Identities
Initial Admin Identity
• When you setup a secured NiFi instance for the first time, you must manually designate an “Initial Admin
Identity”.
• This initial admin user is granted access to the UI and given the ability to create additional users, groups, and
policies.
• You can manually add it in the “authorizers.xml” file (see the sketch below) or use Ambari to update it.
• Restart NiFi once updated.
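A minimal sketch of the relevant entry in conf/authorizers.xml for this NiFi 1.2 / HDF 3.0 generation (the DN shown is only an example; use the identity from your admin certificate or LDAP):

<authorizer>
    <identifier>file-provider</identifier>
    <class>org.apache.nifi.authorization.FileAuthorizer</class>
    <property name="Authorizations File">./conf/authorizations.xml</property>
    <property name="Users File">./conf/users.xml</property>
    <property name="Initial Admin Identity">cn=admin,dc=example,dc=com</property>
    <property name="Legacy Authorized Users File"></property>
</authorizer>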
• NiFi does not perform user authentication over HTTP. When using HTTP, all users are granted all roles.
• You can configure it manually or via Ambari.
Manager DN: The DN of the manager that is used to bind to the LDAP server to search for users.
Manager Password: The password of the manager that is used to bind to the LDAP server to search for users.
TLS - Keystore: Path to the Keystore that is used when connecting to LDAP using START_TLS.
TLS - Keystore Password: Password for the Keystore that is used when connecting to LDAP using START_TLS.
TLS - Keystore Type: Type of the Keystore that is used when connecting to LDAP using START_TLS (i.e. JKS or PKCS12).
TLS - Truststore: Path to the Truststore that is used when connecting to LDAP using START_TLS.
TLS - Truststore Password: Password for the Truststore that is used when connecting to LDAP using START_TLS.
TLS - Truststore Type: Type of the Truststore that is used when connecting to LDAP using START_TLS (i.e. JKS or PKCS12).
TLS - Client Auth: Client authentication policy when connecting to LDAP using START_TLS.
TLS - Protocol: Protocol to use when connecting to LDAP using START_TLS (i.e. TLS, TLSv1.1, TLSv1.2, etc.).
TLS - Shutdown Gracefully: Specifies whether the TLS should be shut down gracefully before the target context is closed. Defaults to false.
Referral Strategy: Strategy for handling referrals. Possible values are FOLLOW, IGNORE, THROW.
User Search Base: Base DN for searching for users (i.e. CN=Users,DC=example,DC=com).
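A minimal sketch of an LDAP provider entry in conf/login-identity-providers.xml using the properties above (host names, DNs, and the search filter are placeholders for your environment):

<provider>
    <identifier>ldap-provider</identifier>
    <class>org.apache.nifi.ldap.LdapProvider</class>
    <property name="Authentication Strategy">SIMPLE</property>
    <property name="Manager DN">cn=manager,dc=example,dc=com</property>
    <property name="Manager Password">password</property>
    <property name="Referral Strategy">FOLLOW</property>
    <property name="Url">ldap://ldap.example.com:389</property>
    <property name="User Search Base">cn=Users,dc=example,dc=com</property>
    <property name="User Search Filter">sAMAccountName={0}</property>
    <property name="Authentication Expiration">12 hours</property>
</provider>

NiFi is then pointed at this provider via nifi.security.user.login.identity.provider=ldap-provider in nifi.properties.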
As admin, add Kerberos users in NiFi and provide access policies; those users can then log in.
Complete Kerberos
Enablement
Lab: NiFi Security with 2-Way SSL
Lab: Integrating LDAP
Lab: Integrating Kerberos
Security: HDF Authorization
and Multi-Tenancy
Enabling Authentication & Authorization
Managed Authorizer
– File based persistence
• Could be extended to other persistence mechanisms
– NiFi UI to manage policies
– NiFi controls authorization logic
External Authorizer
– Ranger integration
– Ranger UI to manage policies
– Ranger controls authorization logic
Selected root
process group
Access users/user groups: Allows users to view/modify the users and user groups. (Component: Users)
Retrieve site-to-site details: Allows other NiFi instances to retrieve Site-To-Site details. (Component: N/A)
Modify the data: Allows users to empty FlowFile queues in outbound connections and submit replays.
View the policies: Allows users to view the list of users who can view/modify a component.
Modify the policies: Allows users to modify the list of users who can view/modify a component.
Receive data via site-to-site: Allows a port to receive data from NiFi instances.
Send data via site-to-site: Allows a port to send data from NiFi instances.
Configure an RDBMS for Ranger (used to store policies); a set of RDBMS scripts is provided.
Ranger > Access Manager > NiFi > click the Edit icon
Navigate to the 'Audit' tab to verify that the Ranger NiFi plugin is working and auditing is done properly:
Now you can add more users and policies for their access
Lab: Ranger Based Authorizer
Thank You!!!