I decided to play a little with Apache NiFi.
What is NiFi?
I had some problems with a Flume agent, and someone wrote that I should check out NiFi.
I can't say that it replaces Flume, or any other tool for that matter.
According to the official website, it is an easy to use, powerful, and reliable system to process and distribute data.
Well, too general for me. You can read more here.
To me, it is a highly configurable streaming engine with common built-in features and a nice graphical interface that is very helpful for building and monitoring the data flow.
At first glance, these guys have thought of almost everything: merging events before writing to HDFS, success/failure routes, easy development of custom Java processors, a data provenance interface, the ability to store events on disk for recoverability, and many more configurations and features.
This is how it looks:
In my case, I only had to replace one Flume agent for a specific data source. Flume is probably the best and simplest tool for ingesting data into HDFS, but it can run into difficulties in some cases. One of them is ingesting large events (more than 100 MB).
So I needed NiFi to get data from IBM MQ. Out of the box it only supports ActiveMQ right now, so I had to build a new IBM MQ processor.
With good instructions for developing custom processors (here), I managed to add my own GetDataFromIbmMq processor (which is much simpler to use than Flume's JMS source implementation, which requires a binding file). I hope to upload the source code as soon as possible.
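To give a feel for what such a processor involves, here is a rough, self-contained Java sketch of the "pull from a queue and route" shape. To keep it runnable on its own it does not use the real nifi-api types (AbstractProcessor, ProcessSession, relationships, and so on); the class name, the MessageSource interface, and the empty-message check below are made-up stand-ins for the MQ client and for real validation logic.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of a "get from a queue" processor's core loop.
// In real NiFi this logic would live in AbstractProcessor.onTrigger().
public class GetFromQueueSketch {

    /** Stand-in for the IBM MQ client: next() returns a message, or null when the queue is empty. */
    interface MessageSource { byte[] next(); }

    private final MessageSource source;
    // Stand-ins for NiFi's success/failure relationships.
    private final BlockingQueue<byte[]> successRoute = new LinkedBlockingQueue<>();
    private final BlockingQueue<byte[]> failureRoute = new LinkedBlockingQueue<>();

    public GetFromQueueSketch(MessageSource source) { this.source = source; }

    /** One scheduling round: pull a message and route it to success or failure. */
    public void onTrigger() {
        byte[] msg = source.next();
        if (msg == null) return;       // nothing to do this round
        if (msg.length == 0) {
            failureRoute.add(msg);     // placeholder for real validation/parse failure
        } else {
            successRoute.add(msg);
        }
    }

    public int pendingSuccess() { return successRoute.size(); }
    public int pendingFailure() { return failureRoute.size(); }
}
```

The point of the shape is that routing is explicit: every event leaves the processor along a named relationship, which is what makes the success/failure counts visible in the UI.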
The new data stream is actually working very well. Large events come out of the IbmMq processor, get merged, get transformed into a sequence file, and get written to HDFS. In addition, it is now very easy to send the data anywhere else, or to change the topology any way we wish (adding more data sources, more ETL processes, and more data stores to save the data in). It is also fantastic that we can see how many events went through each processor, how many succeeded, and how many failed.
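As a rough illustration of the merge step (NiFi's built-in MergeContent processor does this with many more knobs, such as minimum/maximum entries and bin age), here is a minimal Java sketch of the idea: buffer incoming events until a size threshold is crossed, then flush them downstream as one batch. The class name and threshold are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of merge-before-write binning: accumulate small events in a
// bin, then hand the whole bin downstream as one unit once it is big enough.
public class EventBinner {
    private final int minBytes;                 // flush once the bin reaches this size
    private final List<byte[]> bin = new ArrayList<>();
    private int binBytes = 0;

    public EventBinner(int minBytes) { this.minBytes = minBytes; }

    /** Add one event; returns the merged batch when the bin is full, else null. */
    public byte[] offer(byte[] event) {
        bin.add(event);
        binBytes += event.length;
        if (binBytes < minBytes) return null;   // keep accumulating
        byte[] merged = new byte[binBytes];
        int pos = 0;
        for (byte[] e : bin) {
            System.arraycopy(e, 0, merged, pos, e.length);
            pos += e.length;
        }
        bin.clear();
        binBytes = 0;
        return merged;                          // one large unit, e.g. destined for HDFS
    }
}
```

Merging like this is what keeps HDFS from drowning in small files when the source emits many small events.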
Our current approach is to store on HDFS first, which is somewhat the opposite of NiFi's focus on stream processing. Still, even for the simple use case of getting data, compressing it, and storing it, NiFi is very easy to use and enables new data monitoring and provenance capabilities.
Must-read posts regarding NiFi:
https://blogs.apache.org/nifi/entry/basic_dataflow_design
https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark