Flume vs Kafka

ID | KAFKA | FLUME |
---|---|---|
1 | Kafka is a publish-subscribe messaging system that offers strong durability, scalability, and fault-tolerance support. | Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store, such as HDFS. |
2 | Kafka provides back pressure to prevent overflowing a broker. | Flume/Flume NG does not provide such functionality. |
3 | With Kafka you pull data, so each consumer has and manages its own read pointer. This allows a large number of consumers on each Kafka topic, each pulling data at its own pace. With this, you could deliver your event streams to HBase, Cassandra, Storm, Hadoop, and an RDBMS all in parallel. | To get data out of Flume, you use a sink, which writes to your target store (HDFS, HBase, Cassandra, etc.). Flume will retry connections to your sinks if they are offline. Because Flume pushes data, you have to do some extra work to sink data to two data stores. |
4 | With Kafka 0.8+ you get replication of your event data. If you lose a broker node, others will take up the slack and deliver your events without loss. | With Flume and Flume NG using a file channel, if you lose a broker node you lose access to those events until you recover that disk. The database channel in Flume is reported to be too slow for production use at volume. |
5 | Kafka just provides messaging. | Flume provides a number of pre-built collectors. |
6 | Kafka's main use case is as a distributed publish-subscribe messaging system. Most of the development effort goes into letting subscribers read exactly the messages they are interested in, and into making sure the distributed system is scalable and reliable under many different conditions. It was not written to stream data specifically into Hadoop, and using it to read and write data to Hadoop is significantly more challenging than it is with Flume. | Flume's main use case is to ingest data into Hadoop. It is tightly integrated with Hadoop's monitoring system, file system, file formats, and utilities such as Morphlines. A lot of the Flume development effort goes into maintaining compatibility with Hadoop. Its design of sources, sinks, and channels means it can move data between other systems flexibly, but its key feature is its Hadoop integration. |
7 | Use Kafka if you need a highly reliable and scalable enterprise messaging system to connect multiple systems, one of which is Hadoop. | Use Flume if you have non-relational data sources, such as log files, that you want to stream into Hadoop. |
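To make the pull model from row 3 concrete, here is a minimal sketch of a Kafka consumer in Java. It uses the modern kafka-clients consumer API rather than the 0.8-era API mentioned in the table, and the broker address, group id, and topic name are placeholder assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "hbase-loader");               // each consumer group keeps its own offsets
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // hypothetical topic name
            while (true) {
                // Pull model: the consumer fetches at its own pace and owns its read position.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                                      record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```

Because each consumer group tracks its own offsets, you could run one group like this writing to HBase and another writing to HDFS, both reading the same topic in parallel at their own pace.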
How do you get started with Hadoop?
A year ago, I had to start a POC on Hadoop, and I had no idea what Hadoop was.
Here is the approach I started with; it has helped others as well.
1. Go through some introductory videos on Hadoop
It's very important to have a high-level idea of Hadoop before starting to work on it directly. These introductory videos will help you understand the scope of Hadoop and the use cases where it can be applied. There are a lot of resources available online, and going through any of the videos will be beneficial.
2. Understanding MapReduce
The second thing that helped me was understanding what MapReduce is and how it works. It is explained very nicely in this paper: http://static.googleusercontent....
Another nice tutorial is available here: http://ksat.me/map-reduce-a-real...
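As a concrete illustration of the map and reduce steps described in that paper, here is a minimal word-count sketch using Hadoop's Java MapReduce API; the class names are just illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in each input line.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: the framework groups the (word, 1) pairs by word; this just sums them.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

The mapper emits a count of 1 for every word it sees, the shuffle phase groups those pairs by word, and the reducer adds them up, which is exactly the flow the paper describes.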
3. Getting started with Cloudera VM
Once you understand the basics of Hadoop, you can download the VM provided by Cloudera and start running some Hadoop commands on it. You can download the VM from this link: http://www.cloudera.com/content/...
It would be good to get familiar with the basic Hadoop commands on the VM and to understand how it works.
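If you prefer to drive those file-system commands from code instead of the shell, here is a small sketch using Hadoop's Java FileSystem API; the paths are assumptions and would need to exist on your VM.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExplorer {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the core-site.xml on the classpath (the VM is preconfigured).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Roughly equivalent to `hadoop fs -put localfile.txt /user/cloudera/` (example paths).
        fs.copyFromLocalFile(new Path("localfile.txt"), new Path("/user/cloudera/localfile.txt"));

        // Roughly equivalent to `hadoop fs -ls /user/cloudera`.
        for (FileStatus status : fs.listStatus(new Path("/user/cloudera"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```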
4. Setting up the standalone/Pseudo distributed Hadoop
I would recommend setting up your own standalone Hadoop on your machine once you are familiar with Hadoop on the VM. The installation steps are explained very nicely in this blog post by Michael G. Noll: Running Hadoop On Ubuntu Linux (Single-Node Cluster) - Michael G. Noll
5. Understanding the Hadoop Ecosystem
It would be good to get familiar with the other components in the Hadoop ecosystem, like Apache Pig, Hive, HBase, Flume NG, and Hue. They all serve different purposes, and having some knowledge of each will be really helpful when building any product around the Hadoop ecosystem. You can install all of them easily on your machine and get started with them; the Cloudera VM has most of them installed already.
6. Writing Map Reduce Jobs
Once you are done with steps 1-5, I don't think writing MapReduce jobs will be a challenge. They are explained thoroughly in Hadoop: The Definitive Guide. If MapReduce really interests you, I would also suggest reading Mining Massive Datasets by Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman: Page on Stanford
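As a starting point for this step, here is a minimal job driver that wires the mapper and reducer from the earlier word-count sketch into a runnable MapReduce job; the input and output paths are passed on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);     // mapper from the earlier sketch
        job.setCombinerClass(IntSumReducer.class);     // combiner reuses the reducer logic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this into a jar and run it with something like `hadoop jar wordcount.jar WordCountDriver /user/me/input /user/me/output` (the jar name and paths are just examples).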
I might have missed some points here, but that is how I started with Hadoop, and my journey so far has been really interesting and rewarding.
What are the best blogs to follow for Hadoop and big data?
Here is a list of some of the blogs you can follow to learn about Hadoop and big data.
Developer Content
Hortonworks Blog
Cloudera Engineering Blog
Hadoop - YDN
http://www.michael-noll.com/
http://gethue.com/blog/
Download Hortonworks or Cloudera Sandbox
Cloudera QuickStart VM
Hortonworks Sandbox
Data Analytics
Data Mining - Research at Google
Operations Content
I would recommend reading research papers for operational content:
allthingsdistributed.com - This blog is awesome for distributed computing topics
Ganglia Monitoring System
Page on googleusercontent.com
Distributed Systems and Parallel Computing
Research Publications at Facebook
Capgemini Supertechies: Heathrow Challenge
Challenge:
Use SMAC (Social, Mobility, Analytics, and Cloud) tools to provide a solution that changes passenger behavior prior to the security check and speeds up the security check itself.
As per 2013 data, a total of 72.3 million passengers traveled through Heathrow airport, which works out to roughly 198,082 people per day, or about 8,253 per hour. This causes long queues at the security checkpoints and creates delays that are really tiresome for passengers.
This problem can be solved, or at least optimized, by using technology to reduce the delays and improve the passenger experience.
Solution:
Use the SMAC stack to provide the best possible passenger experience.
Pre-requisites:
- Integration with the airport ecosystem.
- An airport-airline interface through which airlines inform the airport about passenger bookings.
- Integration with a government biometric identification system, if one is available (not mandatory; this system can build its own biometric database as passengers check in).
System Architecture:
- Centralized application server hosted on the cloud:
- The centralized application server works as middleware, managing the cloud-hosted data analytics platform
- Provides an administrative UI and dashboard for airport management authorities
- Interfaces with airlines to get passenger booking details
- Integrates with social media
- Serves requests from the mobile application
- Generates heatmaps by analyzing queues and redirects further passengers to less crowded queues
- Mobile Application:
- Gives passengers an application where they can view their boarding information and alert notifications
- Shows security counter availability based on free or less crowded queues
- Provides various offers
Working:
Assuming the passenger has the mobile app installed, they will need to enter the belongings they are carrying for the journey. The app will then tell them which belongings they can keep in their cabin bag, which should go in checked luggage, and which are banned. They will be notified about their schedule through the mobile app and informed about the likely security counters a couple of hours before check-in.
When passengers enter the airport, they will go through security metal detectors at the entrance and then through a self-assisted biometric check, where their identity is verified against the data available in the database (this can use a thumb impression or retina scan). They will also scan the barcode of their boarding pass.
Once verification is done, passengers are divided into three classes based on the biometric checks: fully verified, satisfactory, and not verified.
At the same time, mobile geolocation monitoring will start, using Wi-Fi-based tracking or iBeacon-like tools (Bluetooth Low Energy). This will monitor passenger movement, and heatmaps will be created to identify crowded and busy lines. Based on the analytics, subsequent passengers will be redirected to less busy queues, as sketched below.
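A minimal sketch of how that redirection logic could look, assuming the beacon or Wi-Fi tracking layer reports a passenger count per security lane; all class and method names here are hypothetical, not part of any existing system.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: pick the least crowded security lane from beacon counts.
public class QueueBalancer {
    // Lane id -> number of passengers currently detected near that lane
    // (fed by BLE beacon or Wi-Fi tracking events; names are illustrative).
    private final Map<String, Integer> laneCounts = new ConcurrentHashMap<>();

    /** Called whenever the tracking layer reports an updated count for a lane. */
    public void onBeaconEvent(String laneId, int passengersDetected) {
        laneCounts.put(laneId, passengersDetected);
    }

    /** Returns the lane a newly arriving passenger should be directed to. */
    public String leastCrowdedLane() {
        return laneCounts.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalStateException("no lanes registered"));
    }
}
```

The heatmap generation described above would be a visualization over the same per-lane counts, so the mobile app and the admin dashboard could both be driven from this one data structure.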
Fully verified and satisfactory-class passengers will go through queues with lighter security and baggage checks. Non-verified passengers will go through the normal, stringent security check. This avoids or reduces random security checks and provides faster passage for most passengers.
Each time a passenger passes through the Heathrow security check, their information will be stored for further analysis, and every passenger will be given a rating.
Airport management will continuously monitor the heatmaps and direct their staff to the most crowded queues so passengers can be served more quickly. They will also monitor alarms raised by the analytics system.
The analytics platform will analyze social media and biometric data, generate heatmaps, and predict security threats using passenger information.
Each passenger will be given a rating for each journey based on the categories selected, and this rating will be used in further security-check analytics.
How to encourage people to adopt this solution?
Encourage people to use this app by providing various incentives:
- Offer them discounts and coupons at outlets in and around the airport
- Offer them air-miles credits, etc.
Features of each SMAC component:
Social Media:
Grab the passenger's social media feeds, which can be used in the following ways:
- This information will be used to analyze passenger-specific trends, based on publicly available information, to raise alerts about suspected passengers prior to the journey (security purpose)
- It will also be used to provide product and discount offerings at the airport based on the analyzed interests
Mobility: Provide a mobile app with the following features:
- Boarding information
- Flight schedule details (Arrival/departure)
- Security check counter information (passengers will be directed to less busy counters)
- Notifications according to the flight timings
- A form where users list the belongings they are carrying
- An FAQ section containing the list of banned items (this can be presented in a gamified way with a questionnaire)
- A discount offers section for people who start using the app early on
Analytics: This is the heart of the security system, providing the following features:
- Analyze passenger movement using tools like iBeacon, which uses Bluetooth Low Energy (BLE), as well as tracking based on geolocation data (GPS)
- Analyze the queues with heatmaps and accordingly tell passengers which queues to join
- Verify passengers' identities using biometrics, by comparison with previously stored biometric data
- Analyze passengers' social data available in the public domain, raise alerts prior to the journey, and provide product offerings or discounts based on passenger interests
Cloud:
- The cloud will store all passenger-related data, social analytics data, biometric data, etc.
- It will also host the central application, which serves all mobile clients (through the mobile app) and airport management (through the admin UI).
Public Big data sets available for download?
"Where can I find huge data sets?" is a question faced by everyone aiming to develop, test, or study big data analytics tools. Here is a list of data sets available in the public domain.
More Than 1 TB data
1000 Genomes
This project makes 260 TB of human genome data available to users.
Internet Archive
More than 80TB of data is available for research of internet web crawling
ClueWeb09
The TREC conference provided this data set to the public. You will need to sign an agreement and pay a fee (up to $610) to cover sneakernet data transfer. A total of 5 TB of data is available.
http://lemurproject.org/clueweb09.php/
ClueWeb12
ClueWeb12 is also available, along with Freebase annotations of the corpus.
CNetS
Indiana University makes a 2.5 TB click dataset available
ICWSM
ICWSM has made its 2011 conference blog post dataset available to the public. You will have to fill out a registration form, but the data is free; it is 2.1 TB in compressed format.
Proteome Commons
Proteome Commons has made several large datasets available. The largest, from the Personal Genome Project, is 1.1 TB in size. There are several others over 100 GB in size.
More than 1 GB
Reference Energy Disaggregation Data Set
This dataset contains data on home energy usage; it is about 500 GB compressed.
Tiny Images
This dataset has 227 GB of image data and 57 GB of metadata.
ImageNet
This image dataset, organized according to the WordNet hierarchy, is quite large.
MOBIO
The dataset is about 135 GB of video and audio data
Yahoo! Webscope
The program makes several 1 GB+ datasets available to academic researchers, including an 83 GB data set of Flickr image features and the dataset used for the 2011 KDD Cup, from Yahoo! Music, which is a bit over 1 GB.
Google
made a dataset mapping words to Wikipedia URLs (i.e., concepts) [15]. The dataset is about 10 GB compressed.
Yandex
has recently made a very large web search click dataset available [1]. You'll have to register online for the contest to download. It's about 5.6 GB compressed.
Freebase
Freebase provides data dumps on a regular basis. The largest is their Quad dump, which is about 3.6 GB compressed.
Open American National Corpus
This is about 4.8 GB uncompressed.
Wikipedia
made a dataset containing information about edits available for a recent Kaggle competition. The training dataset is about 2.0 GB uncompressed.
Research and Innovative Technology Administration (RITA)
RITA has made available a dataset about the on-time performance of domestic flights operated by large carriers. The ASA compressed this dataset and makes it available for download.
Wiki-links
data made available by Google is about 1.75 GB total [20].
Amazon Public datasets
Amazon provides a centralized repository of public data sets, hosted on the Amazon cloud, that are easy to integrate and download.
Difference between Big Data and Big Data Analytics?
Big Data and Big data analytics:
Analytics: a set of techniques for deriving insights from data.
Big Data: huge data sets that are difficult to analyze in traditional ways.