Flume with Twitter Integration
by Swathi Kotturu
Date: 03/3/2014
Professor: Thanh Tran
Flume Agents
Flume can deploy any number of agents. An agent is a
container for a Flume data flow. It can run any number of
sources, channels, and sinks, but it must have at least
one of each.
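The source, channel, and sink relationship can be sketched as a toy model in Python. This is only an illustration of the data flow, not Flume's API: the class ToyAgent and its method run_once are hypothetical names, and a real agent is configured through a properties file rather than written like this.

```python
from collections import deque


class ToyAgent:
    """Toy stand-in for a Flume agent (hypothetical, not the Flume API)."""

    def __init__(self, source, channel, sink):
        # An agent needs a source, a channel, and a sink.
        if source is None or channel is None or sink is None:
            raise ValueError("an agent needs a source, a channel, and a sink")
        self.source = source
        self.channel = channel
        self.sink = sink

    def run_once(self):
        # Source side: write every new event into the channel.
        for event in self.source():
            self.channel.append(event)
        # Sink side: drain the channel in arrival order.
        while self.channel:
            self.sink(self.channel.popleft())


delivered = []
agent = ToyAgent(
    source=lambda: ["event-1", "event-2"],  # made-up events
    channel=deque(),
    sink=delivered.append,
)
agent.run_once()
print(delivered)
```

The sink never talks to the source directly; everything passes through the channel, which is the point of the agent layout.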
Flume Sources
Sources are not necessarily restricted to log data.
It is possible to use Flume to transport event data such as
network traffic data, social-media-generated data,
e-mail messages, etc.
The events can be HTTP POSTs, RPC calls, strings in
stdout, etc.
After an event occurs, Flume sources write the event to a
channel as a transaction.
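What "as a transaction" means can be sketched in a few lines of Python: either the whole batch of events reaches the channel or none of it does. ToyChannel and put_all are hypothetical names for illustration, and the None check is a made-up stand-in for any failure during the write; Flume's own transaction handling is more involved.

```python
class ToyChannel:
    """Toy channel with all-or-nothing writes (hypothetical, not the Flume API)."""

    def __init__(self):
        self.events = []

    def put_all(self, batch):
        staged = []
        for event in batch:
            if event is None:
                # Simulated failure: the staged events are discarded (rollback).
                raise ValueError("bad event, transaction rolled back")
            staged.append(event)
        # Commit: the batch becomes visible to sinks only now.
        self.events.extend(staged)


channel = ToyChannel()
channel.put_all(["a", "b"])        # commits both events

try:
    channel.put_all(["c", None])   # fails, so "c" is not kept either
except ValueError:
    pass

print(channel.events)
```

After the failed second call the channel still holds only the events from the first, committed batch.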
Flume Channels
Channels are passive internal stores, each with its own
characteristics (for example, whether events survive a crash).
They buffer events between a source and a sink, which allows
the two to run asynchronously.
Two Main Types of Channels
Memory Channels
- Volatile channel that buffers events in memory
only. If the JVM crashes, all data is lost.
File Channels
- Persistent channel that is stored to disk.
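The difference between the two channel types can be shown with a toy Python sketch. ToyMemoryChannel and ToyFileChannel are hypothetical classes, the file layout (one JSON value per line) is an assumption made for the example, and re-creating the objects stands in for a JVM restart. Flume's real file channel is considerably more elaborate.

```python
import json
import os
import tempfile


class ToyMemoryChannel:
    """Keeps events only in process memory."""

    def __init__(self):
        self.events = []

    def put(self, event):
        self.events.append(event)


class ToyFileChannel:
    """Appends every event to a file so it outlives the process."""

    def __init__(self, path):
        self.path = path

    def put(self, event):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the write down to the disk

    def read_all(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]


path = os.path.join(tempfile.mkdtemp(), "channel.log")
mem = ToyMemoryChannel()
disk = ToyFileChannel(path)
mem.put("tweet-1")
disk.put("tweet-1")

# Simulated crash and restart: fresh objects, same file on disk.
mem = ToyMemoryChannel()
disk = ToyFileChannel(path)
recovered_from_memory = mem.events
recovered_from_disk = disk.read_all()
print(recovered_from_memory, recovered_from_disk)
```

The trade-off is the usual one: the memory channel avoids disk I/O on every event, while the file channel pays for it and keeps the data.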
Flume in Cloudera
Download flume-sources-1.0-SNAPSHOT.jar and add it to the
Flume classpath: http://files.cloudera.com/samples/flume-sources-1.0-SNAPSHOT.jar
In Cloudera Manager, you can add it to the classpath under:
Services -> flume1 -> Configuration -> Agent(Default) ->
Advanced -> Java Configuration Options for Flume Agent. Add:
classpath /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/flume-ng/lib/flume-sources-1.0-SNAPSHOT.jar
Example Tweet
We loaded raw tweets into HDFS. Each tweet is represented
as a chunk of JSON.
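To give a feel for what one of these JSON chunks looks like to a program, here is a minimal Python sketch that parses a single tweet. The sample tweet is made up for illustration and heavily trimmed (real tweets carry many more fields); created_at, id, text, and user.screen_name are field names from Twitter's tweet format.

```python
import json

# Made-up, trimmed-down tweet; real tweets carry many more fields.
raw_tweet = (
    '{"created_at": "Mon Mar 03 12:00:00 +0000 2014", '
    '"id": 1, '
    '"text": "Learning Flume", '
    '"user": {"screen_name": "example_user"}}'
)

tweet = json.loads(raw_tweet)

# Nested objects become nested dictionaries.
summary = (tweet["user"]["screen_name"], tweet["text"])
print(summary)
```

The nesting (a user object inside the tweet) is what makes the data awkward to query as flat rows, which is why the reading step needs to be told about the JSON structure.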
Next Steps
Tell Hive how to read the data
Flume Resources
Learn More
https://dev.twitter.com/docs/streaming-apis/parameters
https://cwiki.apache.org/confluence/display/FLUME/Home
http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
Thank you!
Q/A