How To Stream Log Files To HDFS Using Flume In Big Data Applications

In this post from a big data analytics services provider, you will learn how to stream log files to HDFS with Flume. It introduces Flume and everything else needed to stream log files in big data applications.

Introduction: using Flume to ingest data into HDFS

    In big data applications, raw data is essential for further analytic operations. In this blog, I will introduce Apache Flume, which helps ingest data from many sources into HDFS so it can be processed.
Flume is a project in the Hadoop ecosystem that ingests log data from outside systems into Hadoop. To ingest data, Flume runs one or more agents, and each agent has the three mandatory components below (a minimal configuration skeleton follows the list):
•  Sources receive data and send it to channels.
•  Channels queue the data while it is handed off between sources and sinks.
•  Sinks take the data collected in the channel queues and deliver it, for example to HDFS.
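
These three components map directly onto an agent's configuration file. As a rough sketch, using placeholder names (agent1, src1, ch1, sink1) and a hypothetical log path rather than the real configuration used later in this post:

     # Minimal shape of a Flume agent configuration (all names and the path below are placeholders)
       agent1.sources = src1
       agent1.channels = ch1
       agent1.sinks = sink1

     # Every component declares a type; sources and sinks are wired to channels
       agent1.sources.src1.type = exec
       agent1.sources.src1.command = tail -F /tmp/example.log
       agent1.sources.src1.channels = ch1
       agent1.channels.ch1.type = memory
       agent1.sinks.sink1.type = logger
       agent1.sinks.sink1.channel = ch1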

Flow of Flume

      Environment

Java: JDK 1.7

Cloudera version:  CDH4.6.0

Initial steps

  1. We need to make sure we have some log files in our Linux system (a quick prerequisite check is sketched after this list).
  2. Create the configuration file for the Flume agent as shown in the walkthrough below.
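
Assuming the same paths used later in this walkthrough, a quick prerequisite check could look like this (adjust the log path for your environment):

   # Confirm the local log file exists and is still being written to
   ls -l /var/log/system.log
   tail -n 5 /var/log/system.log

   # Confirm HDFS is reachable from this machine
   hadoop fs -ls /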

   Code walkthrough

    This configuration file collects the real-time log produced by the tail command from /var/log/system.log and delivers it to a destination location in HDFS.

     # Define a source on this agent that runs the Linux tail command against the system log file and feeds the events to the memory-channel channel

       myagent.sources.tail-source.type = exec
       myagent.sources.tail-source.command = tail -F /var/log/system.log
       myagent.sources.tail-source.channels = memory-channel

     # Define a sink that writes the incoming stream of events to the logger

       myagent.sinks.log-sink.channel = memory-channel
       myagent.sinks.log-sink.type = logger 

     # Define a sink that writes to an HDFS location using the DataStream file type

       myagent.sinks.hdfs-sink.channel = memory-channel
       myagent.sinks.hdfs-sink.type = hdfs
       myagent.sinks.hdfs-sink.hdfs.writeFormat = Text
       myagent.sinks.hdfs-sink.hdfs.path = hdfs:///mydata/destinationLog
       myagent.sinks.hdfs-sink.hdfs.fileType = DataStream
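
     # (Optional, an assumption beyond the original walkthrough) roll settings supported by the HDFS sink;
     # the values below are illustrative and control how often Flume closes the current file in HDFS
       myagent.sinks.hdfs-sink.hdfs.rollInterval = 30
       myagent.sinks.hdfs-sink.hdfs.rollSize = 0
       myagent.sinks.hdfs-sink.hdfs.rollCount = 0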

     # Define the memory channel and register the channel, source and sink components on this agent

       myagent.channels = memory-channel
       myagent.channels.memory-channel.type = memory
       myagent.sources = tail-source
       myagent.sinks = log-sink hdfs-sink

     Run this command to start the agent:

       flume-ng agent -f /mylocalconfig.conf -n myagent 
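
     If nothing shows up on the console, a commonly used variant passes the configuration directory and a console logger explicitly (the /etc/flume-ng/conf path below is the CDH packaging default and is an assumption; adjust it for your installation):

       flume-ng agent --conf /etc/flume-ng/conf -f /mylocalconfig.conf -n myagent -Dflume.root.logger=INFO,console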

     

    Verify the result


     We will perform some operations on our Linux system, such as creating and removing files, so that new entries appear in the system log:

       

vi a
rm a 

After these operations, the Linux system log is updated and the tail -F command ingests those changes into the HDFS location we configured above. We can check the HDFS location to see the output:

  hadoop fs -text /mydata/destinationLog/* | head -n 10

It will show the data changes from the local Linux log file now stored in our HDFS files.
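
You can also list the destination directory to see the files that the HDFS sink has written so far (by default they are created with a FlumeData name prefix):

  hadoop fs -ls /mydata/destinationLog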

The goal of this post was to help you understand Flume and how to use it for streaming log files to HDFS. Hope that this blog helps you configure Flume to ingest data from other systems into HDFS for your big data applications. For any queries, kindly contact our experts.


Ethan Millar
