Using Flume to ingest data into HDFS
Introduction
Flume is a project in the Hadoop ecosystem that ingests log data from external systems into Hadoop. To ingest data, Flume runs one or more agents, and each agent has three mandatory components:
§ Sources receive data from external systems and pass it to channels.
§ Channels buffer the data in a queue between sources and sinks.
§ Sinks drain the data collected in channels and deliver it to a destination such as HDFS.
Environment
Java: JDK 1.7
Cloudera version: CDH 4.6.0
Initial steps
- We need to make sure we have some log files on our Linux system (this walkthrough tails /var/log/system.log; see the quick check below).
- Create the configuration file for the Flume agent with the contents shown in the code walkthrough below.
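As a quick sanity check before writing the configuration, confirm that the source log exists and is still being written to, then create the agent configuration file. The commands below are a minimal sketch, assuming the paths used in this walkthrough (/var/log/system.log and /mylocalconfig.conf):
# Confirm the source log file exists and see its latest entries
ls -l /var/log/system.log
tail -n 5 /var/log/system.log
# Create the Flume agent configuration file (contents in the next section)
vi /mylocalconfig.conf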
Code walkthrough
This configuration file collects the log in real time by tailing /var/log/system.log and delivers it to a destination location in HDFS.
# Define a source on the agent that runs the Linux tail command on the system log file and writes its output to the memory-channel channel
myagent.sources.tail-source.type = exec
myagent.sources.tail-source.command = tail -F /var/log/system.log
myagent.sources.tail-source.channels = memory-channel
# Define a sink that writes events from the channel to the Flume logger (useful for checking the stream while testing)
myagent.sinks.log-sink.channel = memory-channel
myagent.sinks.log-sink.type = logger
# Define a sink that writes the events to an HDFS location as a data stream file
myagent.sinks.hdfs-sink.channel = memory-channel
myagent.sinks.hdfs-sink.type = hdfs
myagent.sinks.hdfs-sink.hdfs.writeFormat = Text
myagent.sinks.hdfs-sink.hdfs.path = hdfs:///mydata/destinationLog
myagent.sinks.hdfs-sink.hdfs.fileType = DataStream
# Define the channel that buffers events in memory between the source and the sinks
myagent.channels.memory-channel.type = memory

# Set the channel, source and sink components for this agent
myagent.channels = memory-channel
myagent.sources = tail-source
myagent.sinks = log-sink hdfs-sink
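Before starting the agent, it helps to prepare the destination on the HDFS side. This is a minimal sketch, assuming the path from the configuration above and that the user running the agent has write access to it:
# Create the HDFS destination directory used by hdfs-sink
hadoop fs -mkdir -p /mydata/destinationLog
# Confirm the directory exists
hadoop fs -ls /mydata
Note that because log-sink and hdfs-sink both drain the same memory-channel, they compete for events rather than each receiving a copy; the hdfs-sink output is what we verify below.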
Run this command to start the agent:
flume-ng agent -f /mylocalconfig.conf -n myagent
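While testing, it can also be useful to pass the Flume configuration directory and send the agent's own log output to the console. The variant below assumes a CDH-style install where the Flume configuration directory is /etc/flume-ng/conf:
flume-ng agent --conf /etc/flume-ng/conf -f /mylocalconfig.conf -n myagent -Dflume.root.logger=INFO,console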
Verify the result
Now we will do some operations on our Linux system, like creating and removing files:
vi a
rm a
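Depending on the distribution and syslog configuration, simple file operations may not add anything to /var/log/system.log. A more reliable way to produce a test entry is the standard logger utility; the file it writes to depends on your syslog setup, so adjust the source path in the configuration if your system logs elsewhere:
# Append a tagged test message to the system log via syslog
logger -t flume-test "test event for the Flume tail source"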
After these operations, the Linux system log is updated and the tail -F command ingests those changes into the HDFS location we configured above. We can check the HDFS location to see the output:
hadoop fs -text /mydata/destinationLog/* | head -n 10
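We can also list the destination directory to see the files the HDFS sink has rolled out; by default the sink names them with the FlumeData prefix, since the configuration above does not override hdfs.filePrefix:
# List the files written by the HDFS sink (default file prefix is FlumeData)
hadoop fs -ls /mydata/destinationLog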
This shows the data from the local Linux log file that has been written into our HDFS files.
Hope this blog helps you understand the steps to configure Flume to ingest data from other systems into HDFS for big data applications.