Wednesday, July 1, 2015

Playing with Apache Nifi

I decided to play around a little with Apache NiFi.

What is NiFi?

I had some problems with a Flume agent, and someone wrote that I should check out NiFi.

I can't say that it replaces Flume, or any other tool for that matter.
According to the official website, it is an easy to use, powerful, and reliable system to process and distribute data.

Well, that is too general for me. You can read more here.

To me, it is a highly configurable streaming engine with some common built-in features, plus a nice graphical interface that is very helpful for building and monitoring the data flow.


At first look, these guys have thought of almost everything: merging events before writing to HDFS, success/failure routes, easy development of custom Java processors, a data provenance interface, the ability to store events on disk for recoverability, and many more configurations and features.

This is how it looks:

[Image: Updated Merged Flow]




In my case, I only had to replace one Flume agent for a specific data source. Flume is probably the best and simplest tool for ingesting data into HDFS, but it can run into difficulties in some cases. One of them is ingesting large events (more than 100 MB).

So, I needed NiFi to pull data from IBM MQ. Out of the box it only supports ActiveMQ right now, so I had to build a new IBM MQ processor.

With good instructions for developing custom processors (here), I managed to add my own GetDataFromIbmMq processor (much simpler to use than Flume's JMS source implementation, which requires a binding file). I hope to upload the source code as soon as possible.
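
To give a feel for what's involved, here is a minimal sketch of such a processor. The class name GetDataFromIbmMq matches mine, but the connection settings, queue name, and error handling below are illustrative assumptions rather than my actual code; a real processor would expose the settings as PropertyDescriptors and cache one JMS connection across triggers instead of opening one each time.

import java.util.Collections;
import java.util.Set;

import javax.jms.Connection;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.nifi.annotation.documentation.CapabilityDescription;
import org.apache.nifi.annotation.documentation.Tags;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

import com.ibm.mq.jms.MQQueueConnectionFactory;

@Tags({"ibm", "mq", "jms", "get"})
@CapabilityDescription("Pulls messages from an IBM MQ queue and emits each one as a FlowFile")
public class GetDataFromIbmMq extends AbstractProcessor {

    public static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("Messages successfully read from the queue")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session)
            throws ProcessException {
        try {
            // Hypothetical connection settings; a real processor would expose
            // these as PropertyDescriptors and reuse one cached connection.
            final MQQueueConnectionFactory factory = new MQQueueConnectionFactory();
            factory.setHostName("mq-host");
            factory.setPort(1414);
            factory.setQueueManager("QMGR");
            factory.setChannel("SYSTEM.DEF.SVRCONN");
            factory.setTransportType(1); // 1 = client transport over TCP/IP

            final Connection connection = factory.createConnection();
            try {
                connection.start();
                final Session jmsSession =
                        connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                final MessageConsumer consumer =
                        jmsSession.createConsumer(jmsSession.createQueue("MY.QUEUE"));

                // Wait briefly for a single message; text messages only, for brevity.
                final Message message = consumer.receive(1000);
                if (message instanceof TextMessage) {
                    final byte[] body = ((TextMessage) message).getText().getBytes();
                    FlowFile flowFile = session.create();
                    flowFile = session.write(flowFile, out -> out.write(body));
                    session.transfer(flowFile, REL_SUCCESS);
                }
            } finally {
                connection.close();
            }
        } catch (JMSException e) {
            throw new ProcessException("Failed to read from IBM MQ", e);
        }
    }
}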

The new data stream is actually working very well. Large events come out of the IBM MQ processor, get merged, get transformed into sequence files, and get written to HDFS. In addition, it is now very easy to send the data anywhere else, or to rework the topology in any way we wish (adding more data sources, more ETL processes, and more data stores to write to). It is also fantastic to be able to see how many events went through each processor, how many succeeded, and how many failed.
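
Assuming the standard processors that ship with NiFi (MergeContent, CreateHadoopSequenceFile, and PutHDFS), the topology is roughly:

GetDataFromIbmMq -> MergeContent -> CreateHadoopSequenceFile -> PutHDFS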


Our current approach is to store data on HDFS first, which is somewhat the opposite of NiFi's focus on stream processing. Still, even for the simple use case of fetching, compressing, and storing data, it is very easy to use and enables new data monitoring and provenance capabilities.


Must-read posts regarding NiFi:
https://blogs.apache.org/nifi/entry/basic_dataflow_design 

https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark


10 comments:

  1. Can you post your processor for the community?

  2. Any update as to when you can share the MQ processor code?

  3. Hi,
    Unfortunately, I left the code in the closed network that I was working with.

    I remember that I had six classes that I took from the JMS processor and modified
    (adding an MQ prefix as well):

    GetJmsProcessor
    PutJms
    JmsProperties
    JmsConsumer
    JmsFactory
    JmsProcessingSummary

    and the package should look like this: mosco.nifi.ibmmq.processors (ending with "processors")

    I also used about seven IBM MQ jars:
    com.ibm.mq, com.ibm.mqjms, and friends.

    I am not rewriting it because I don't have time to install IBM MQ to verify the code. Sorry about that.
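
    One more tip for anyone rebuilding this: NiFi discovers processors via the standard Java service loader, so the jar also needs a services file listing each processor class. A sketch (the class name here is a hypothetical example following the MQ-prefix convention above):

    src/main/resources/META-INF/services/org.apache.nifi.processor.Processor:
    mosco.nifi.ibmmq.processors.MQGetJmsProcessor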

  4. This comment has been removed by the author.

  5. Hey,
    Great post!
    Actually, I am trying to use the PutHDFS processor but I am getting an UnresolvedAddressException. I have done the basic steps:
    1. Copied core-site.xml and hdfs-site.xml into a local NiFi directory and set that path in Hadoop Configuration Resources.
    2. Set the default directory path to '/root'.
    Since I am using the Hortonworks sandbox to connect to HDFS, I am able to access the namenode and datanode from my local machine at 127.0.0.1:8000 and 127.0.0.1:50070.
    Am I missing something? Thanks in advance.

  6. Hi, sorry, I hadn't seen the comment.
    Are you able to use these core-site and hdfs-site files to execute "hdfs dfs -ls /root" on your datanode?
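
    (For reference: an UnresolvedAddressException from PutHDFS usually means the namenode hostname in fs.defaultFS is not resolvable from the machine running NiFi. With the Hortonworks sandbox, core-site.xml typically contains an entry like the one below; the hostname and port are assumptions, so check your own file and make sure that hostname resolves from the NiFi host.)

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://sandbox.hortonworks.com:8020</value>
    </property>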

  7. Hi Gilad,

    No problem. The issue was resolved, and I am now able to communicate with Hadoop via NiFi. Thanks.

  8. Hi Gilad,

    Nice post!
    Actually, I am trying to connect Hadoop with the NiFi server and need some guidance on connecting Apache NiFi to a Hadoop cluster. NiFi is running on a machine that is not part of the Hadoop cluster, and I want to put files into HDFS. To write into HDFS I have created a PutHDFS processor in NiFi. As per my understanding
