Flume vs Kafka
| ID | Kafka | Flume |
|---|---|---|
| 1 | Kafka is a publish-subscribe messaging system that offers strong durability, scalability, and fault tolerance. | Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store, such as HDFS. |
| 2 | Kafka provides back pressure to prevent overflowing a broker. | Flume/Flume NG does not provide such functionality. |
| 3 | With Kafka you pull data, so each consumer manages its own read pointer. This allows a large number of consumers on each Kafka queue, each pulling data at its own pace; you could deliver your event streams to HBase, Cassandra, Storm, Hadoop, and an RDBMS all in parallel. | To get data out of Flume, you use a sink, which writes to your target store (HDFS, HBase, Cassandra, etc.). Flume will retry connections to your sinks if they are offline. Because Flume pushes data, you have to do some interesting work to sink data to two data stores. |
| 4 | With Kafka 0.8+ you get replication of your event data. If you lose a broker node, others will take up the slack and deliver your events without loss. | With Flume/Flume NG and a file channel, if you lose a node you will lose access to those events until you recover that disk. The database channel with Flume is reported to be too slow for any production use case at volume. |
| 5 | Kafka provides just messaging. | Flume provides a number of pre-built collectors. |
| 6 | Kafka's main use case is as a distributed publish-subscribe messaging system. Most of the development effort goes into letting subscribers read exactly the messages they are interested in, and into keeping the distributed system scalable and reliable under many different conditions. It was not written to stream data specifically into Hadoop, and using it to read and write data to Hadoop is significantly more challenging than it is with Flume. | Flume's main use case is ingesting data into Hadoop. It is tightly integrated with Hadoop's monitoring system, file system, file formats, and utilities such as Morphlines, and a lot of the Flume development effort goes into maintaining compatibility with Hadoop. Flume's design of sources, sinks, and channels means it can flexibly move data between other systems as well, but the Hadoop integration is its key feature. |
| 7 | Use Kafka if you need a highly reliable and scalable enterprise messaging system connecting multiple systems, one of which may be Hadoop. | Use Flume if you have non-relational data sources, such as log files, that you want to stream into Hadoop. |
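On the Flume side, the source/channel/sink design from rows 3 and 5 is all declared in an agent configuration file. The sketch below tails a log file into HDFS; the agent, source, channel, and sink names and all paths are hypothetical and would need to match your environment:

```properties
# Hypothetical Flume NG agent: tail an application log and push events into HDFS.
agent1.sources  = tailSrc
agent1.channels = memCh
agent1.sinks    = hdfsSink

# Source: read new lines from a log file (exec source).
agent1.sources.tailSrc.type = exec
agent1.sources.tailSrc.command = tail -F /var/log/app/app.log
agent1.sources.tailSrc.channels = memCh

# Channel: buffer events in memory between source and sink.
agent1.channels.memCh.type = memory
agent1.channels.memCh.capacity = 10000

# Sink: write events to date-partitioned directories in HDFS.
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.channel = memCh
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
agent1.sinks.hdfsSink.hdfs.fileType = DataStream
agent1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
```

Note how the sink pushes to exactly one target; delivering the same events to a second store means wiring up additional channels and sinks, which is the "interesting work" mentioned in row 3.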
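The pull model in row 3 is the crux of the difference: the broker keeps an append-only log, and each consumer owns its read pointer, so many consumers can read the same data at different speeds. Here is a minimal in-memory sketch of that idea (a toy illustration, not Kafka's actual API or storage):

```python
class Log:
    """Append-only message log, like a single Kafka partition."""
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)


class Consumer:
    """Each consumer manages its own read pointer (offset) into the log."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        # Read at most max_records messages starting at our own offset,
        # then advance the offset; other consumers are unaffected.
        batch = self.log.messages[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch


log = Log()
for i in range(5):
    log.append(f"event-{i}")

fast = Consumer(log)
slow = Consumer(log)

print(fast.poll())    # the fast consumer reads all 5 events at once
print(slow.poll(2))   # the slow consumer reads only the first 2
print(slow.poll(2))   # and later picks up exactly where it left off
```

Because the broker never deletes a message just because one consumer read it, you can fan the same event stream out to HBase, Storm, and Hadoop in parallel, each at its own pace.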
About me
I am a Java programmer with an avid interest in big data, Hadoop & the Internet of Things.