Flume vs Kafka

ID | KAFKA | FLUME |
---|---|---|
1 | Kafka is a publish-subscribe messaging system that offers strong durability, scalability, and fault-tolerance support. | Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store, such as HDFS. |
2 | Kafka provides back pressure to prevent overflowing a broker. | Flume/Flume NG does not provide such functionality. |
3 | With Kafka you pull data, so each consumer has and manages its own read pointer. This allows a large number of consumers on each Kafka topic, each pulling data at its own pace. With this, you could deliver your event streams to HBase, Cassandra, Storm, Hadoop, and an RDBMS all in parallel. | To get data out of Flume, you use a sink, which writes to your target store (HDFS, HBase, Cassandra, etc.). Flume will retry connections to your sinks if they are offline. Because Flume pushes data, you have to do some extra work to sink data to two data stores. |
4 | With Kafka 0.8+ you get replication of your event data. If you lose a broker node, others will take up the slack and deliver your events without loss. | With Flume and Flume NG using a file channel, if you lose a broker node you lose access to those events until you recover that disk. The database channel in Flume is reported to be too slow for production use at volume. |
5 | Kafka just provides messaging. | Flume provides a number of pre-built collectors. |
6 | Kafka's main use case is as a distributed publish-subscribe messaging system. Most of the development effort goes into letting subscribers read exactly the messages they are interested in, and into making sure the distributed system is scalable and reliable under many different conditions. It was not written to stream data specifically into Hadoop, and using it to read and write data to Hadoop is significantly more challenging than it is with Flume. | Flume's main use case is to ingest data into Hadoop. It is tightly integrated with Hadoop's monitoring system, file system, file formats, and utilities such as Morphlines. A lot of the Flume development effort goes into maintaining compatibility with Hadoop. Its design of sources, sinks, and channels means it can move data between other systems flexibly, but its key feature is its Hadoop integration. |
7 | Use Kafka if you need a highly reliable and scalable enterprise messaging system to connect multiple systems, one of which is Hadoop. | Use Flume if you have non-relational data sources, such as log files, that you want to stream into Hadoop. |
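To make the pull model from row 3 concrete, here is a minimal sketch of a Kafka consumer in Java. It uses the modern kafka-clients consumer API rather than the 0.8-era API mentioned in the table, and the broker address, group id, and topic name are placeholder assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "hbase-loader");               // each consumer group keeps its own offsets
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // hypothetical topic name
            while (true) {
                // Pull model: the consumer fetches at its own pace and owns its read position.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                                      record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```

Because each consumer group tracks its own offsets, you could run one group like this writing to HBase and another writing to HDFS, both reading the same topic in parallel at their own pace.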
How do you get started with Hadoop?
A year ago, I had to start a POC on Hadoop, and I had no idea what Hadoop was.
Here is the approach I started with; it has helped others as well.
1. Go through some introductory videos on Hadoop
It's very important to have a high-level idea of Hadoop before starting to work on it directly. These introductory videos will help you understand the scope of Hadoop and the use cases where it can be applied. There are a lot of resources available online, and going through any of the videos will be beneficial.
2. Understanding MapReduce
The second thing that helped me was understanding what MapReduce is and how it works. It is explained very nicely in this paper: http://static.googleusercontent....
Another nice tutorial is available here: http://ksat.me/map-reduce-a-real...
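As a concrete illustration of the map and reduce steps described in that paper, here is a minimal word-count sketch using Hadoop's Java MapReduce API; the class names are just illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in each input line.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: the framework groups the (word, 1) pairs by word; this just sums them.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

The mapper emits a count of 1 for every word it sees, the shuffle phase groups those pairs by word, and the reducer adds them up, which is exactly the flow the paper describes.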
3. Getting started with Cloudera VM
Once you understand the basics of Hadoop, you can download the VM provided by Cloudera and start running some Hadoop commands on it. You can download the VM from this link: http://www.cloudera.com/content/...
It would be good to get familiar with the basic Hadoop commands on the VM and to understand how it works.
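If you prefer to drive those file-system commands from code instead of the shell, here is a small sketch using Hadoop's Java FileSystem API; the paths are assumptions and would need to exist on your VM.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExplorer {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the core-site.xml on the classpath (the VM is preconfigured).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Roughly equivalent to `hadoop fs -put localfile.txt /user/cloudera/` (example paths).
        fs.copyFromLocalFile(new Path("localfile.txt"), new Path("/user/cloudera/localfile.txt"));

        // Roughly equivalent to `hadoop fs -ls /user/cloudera`.
        for (FileStatus status : fs.listStatus(new Path("/user/cloudera"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```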
4. Setting up the standalone/Pseudo distributed Hadoop
I would recommend setting up your own standalone Hadoop on your machine once you are familiar with Hadoop on the VM. The installation steps are explained very nicely in this blog post by Michael G. Noll: Running Hadoop On Ubuntu Linux (Single-Node Cluster) - Michael G. Noll
5. Understanding the Hadoop Ecosystem
It would be good to get familiar with the other components in the Hadoop ecosystem, like Apache Pig, Hive, HBase, Flume NG, and Hue. They all serve different purposes, and having some knowledge of each will be really helpful when building any product around the Hadoop ecosystem. You can install all of them easily on your machine and get started with them; the Cloudera VM has most of them installed already.
6. Writing Map Reduce Jobs
Once you are done with steps 1-5, I don't think writing MapReduce jobs will be a challenge. They are explained thoroughly in Hadoop: The Definitive Guide. If MapReduce really interests you, I would also suggest reading Mining Massive Datasets by Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman: Page on Stanford
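As a starting point for this step, here is a minimal job driver that wires the mapper and reducer from the earlier word-count sketch into a runnable MapReduce job; the input and output paths are passed on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);     // mapper from the earlier sketch
        job.setCombinerClass(IntSumReducer.class);     // combiner reuses the reducer logic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this into a jar and run it with something like `hadoop jar wordcount.jar WordCountDriver /user/me/input /user/me/output` (the jar name and paths are just examples).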
I might have missed some points here, but that is how I started with Hadoop, and my journey so far has been really interesting and rewarding.
What are the best blogs to follow for Hadoop and big data?
Here is a list of some of the blogs you can follow to learn about Hadoop and big data.
Developer Content
Hortonworks Blog
Cloudera Engineering Blog
Hadoop - YDN
http://www.michael-noll.com/
http://gethue.com/blog/
Download Hortonworks or Cloudera Sandbox
Cloudera QuickStart VM
Hortonworks Sandbox
Data Analytics
Data Mining - Research at Google
Operations Content
I would recommend reading research papers for operational content:
allthingsdistributed.com - This blog is awesome for distributed computing topics
Ganglia Monitoring System
Page on googleusercontent.com
Distributed Systems and Parallel Computing
Research Publications at Facebook
Capgemini Supertechies: Heathrow Challenge
Challenge:
Use SMAC (Social, Mobility, Analytics, and Cloud) tools to provide a solution that changes passenger behavior prior to the security check and speeds up the security check itself.
As per 2013 data, a total of 72.3 million passengers traveled through Heathrow airport, which works out to roughly 198,082 people per day, or about 8,253 per hour. This causes long queues at the security checkpoints and creates delays that are really tiresome for passengers.
This problem can be solved, or at least optimized, by using technology to reduce the delays and improve the passenger experience.
Solution:
Use the SMAC stack to provide the best possible passenger experience.
Pre-requisites:
- Integration with the airport ecosystem.
- An airport-airline interface through which airlines inform the airport about passenger bookings.
- Integration with a government biometric identification system, if one is available (not mandatory; this system can build its own biometric database as passengers check in).
System Architecture:
- Centralized application server hosted on the cloud:
- The centralized application server works as middleware, managing the cloud-hosted data analytics platform
- Provides an administrative UI and dashboard for airport management authorities
- Interfaces with airlines to get passenger booking details
- Integrates with social media
- Serves requests from the mobile application
- Generates heatmaps by analyzing queues and redirects further passengers to less crowded queues
- Mobile Application:
- Gives passengers an application where they can view their boarding information and alert notifications
- Shows security counter availability based on free or less crowded queues
- Provides various offers
Working:
Assuming the passenger has the mobile app installed, they will need to enter the belongings they are carrying for the journey. The app will then tell them which belongings they can keep in their cabin bag, which should go in checked luggage, and which are banned. They will be notified about their schedule through the mobile app and informed about the likely security counters a couple of hours before check-in.
When passengers enter the airport, they will go through security metal detectors at the entrance and then through a self-assisted biometric check, where their identity is verified against the data available in the database (this can use a thumb impression or retina scan). They will also scan the barcode of their boarding pass.
Once verification is done, passengers are divided into three classes based on the biometric checks: fully verified, satisfactory, and not verified.
At the same time, mobile geolocation monitoring will start, using Wi-Fi-based tracking or iBeacon-like tools (Bluetooth Low Energy). This will monitor passenger movement, and heatmaps will be created to identify crowded and busy lines. Based on the analytics, subsequent passengers will be redirected to less busy queues, as sketched below.
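A minimal sketch of how that redirection logic could look, assuming the beacon or Wi-Fi tracking layer reports a passenger count per security lane; all class and method names here are hypothetical, not part of any existing system.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: pick the least crowded security lane from beacon counts.
public class QueueBalancer {
    // Lane id -> number of passengers currently detected near that lane
    // (fed by BLE beacon or Wi-Fi tracking events; names are illustrative).
    private final Map<String, Integer> laneCounts = new ConcurrentHashMap<>();

    /** Called whenever the tracking layer reports an updated count for a lane. */
    public void onBeaconEvent(String laneId, int passengersDetected) {
        laneCounts.put(laneId, passengersDetected);
    }

    /** Returns the lane a newly arriving passenger should be directed to. */
    public String leastCrowdedLane() {
        return laneCounts.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalStateException("no lanes registered"));
    }
}
```

The heatmap generation described above would be a visualization over the same per-lane counts, so the mobile app and the admin dashboard could both be driven from this one data structure.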
Fully verified and satisfactory-class passengers will go through queues with lighter security and baggage checks. Non-verified passengers will go through the normal, stringent security check. This avoids or reduces random security checks and provides faster passage for most passengers.
Each time a passenger passes through the Heathrow security check, their information will be stored for further analysis, and every passenger will be given a rating.
Airport management will continuously monitor the heatmaps and direct their staff to the most crowded queues so passengers can be served more quickly. They will also monitor alarms raised by the analytics system.
The analytics platform will analyze social media and biometric data, generate heatmaps, and predict security threats using passenger information.
Each passenger will be given a rating for each journey based on the categories selected, and this rating will be used in further security-check analytics.
How to encourage people to adopt this solution?
Encourage people to use this app by providing various incentives:
- Offer them discounts and coupons at outlets in and around the airport
- Offer them air-miles credits, etc.
Features of each SMAC component:
Social Media:
Grab the passenger's social media feeds, which can be used in the following ways:
- This information will be used to analyze passenger-specific trends, based on publicly available information, to raise alerts about suspected passengers prior to the journey (security purpose)
- It will also be used to provide product and discount offerings at the airport based on the analyzed interests
Mobility: Provide a mobile app with the following features:
- Boarding information
- Flight schedule details (Arrival/departure)
- Security check counter information (passengers will be directed to less busy counters)
- Notifications according to the flight timings
- A form where users list the belongings they are carrying
- An FAQ section containing the list of banned items (this can be presented in a gamified way with a questionnaire)
- A discount offers section for people who start using the app early on
Analytics: This is the heart of the security system, providing the following features:
- Analyze passenger movement using tools like iBeacon, which uses Bluetooth Low Energy (BLE), as well as tracking based on geolocation data (GPS)
- Analyze the queues with heatmaps and accordingly tell passengers which queues to join
- Verify passengers' identities using biometrics, by comparison with previously stored biometric data
- Analyze passengers' social data available in the public domain, raise alerts prior to the journey, and provide product offerings or discounts based on passenger interests
Cloud:
- The cloud will store all passenger-related data, social analytics data, biometric data, etc.
- It will also host the central application, which serves all mobile clients (through the mobile app) and airport management (through the admin UI).
Public Big data sets available for download?
"Where can I find huge data sets?" is a question faced by everyone aiming to develop, test, or study big data analytics tools. Here is a list of data sets available in the public domain.
More Than 1 TB data
1000 Genomes
This project makes 260 TB of human genome data available to users.
Internet Archive
More than 80TB of data is available for research of internet web crawling
ClueWeb09
The TREC conference provided this data set to the public. You will need to sign an agreement and pay a fee (up to $610) to cover sneakernet data transfer. A total of 5 TB of data is available.
http://lemurproject.org/clueweb09.php/
ClueWeb12
ClueWeb12 is also available, along with Freebase annotations of the corpus.
CNetS
Indiana University makes a 2.5 TB click dataset available
ICWSM
ICWSM has made its 2011 conference blog post dataset available to the public. You will have to fill out a registration form, but the data is free; it is 2.1 TB in compressed format.
Proteome Commons
Proteome Commons has made several large datasets available. The largest, from the Personal Genome Project, is 1.1 TB in size. There are several others over 100 GB in size.
More than 1 GB
Reference Energy Disaggregation Data Set
This dataset contains data on home energy usage; it is about 500 GB compressed.
Tiny Images
This dataset has 227 GB of image data and 57 GB of metadata.
ImageNet
This image dataset, organized according to the WordNet hierarchy, is quite large.
MOBIO
The dataset is about 135 GB of video and audio data
Yahoo! Webscope
The program makes several 1 GB+ datasets available to academic researchers, including an 83 GB data set of Flickr image features and the dataset used for the 2011 KDD Cup, from Yahoo! Music, which is a bit over 1 GB.
Google
made a dataset mapping words to Wikipedia URLs (i.e., concepts) [15]. The dataset is about 10 GB compressed.
Yandex
has recently made a very large web search click dataset available [1]. You'll have to register online for the contest to download. It's about 5.6 GB compressed.
Freebase
Freebase provides data dumps on a regular basis. The largest is their Quad dump, which is about 3.6 GB compressed.
Open American National Corpus
This is about 4.8 GB uncompressed.
Wikipedia
made a dataset containing information about edits available for a recent Kaggle competition. The training dataset is about 2.0 GB uncompressed.
Research and Innovative Technology Administration (RITA)
RITA has made available a dataset about the on-time performance of domestic flights operated by large carriers. The ASA compressed this dataset and makes it available for download.
Wiki-links
data made available by Google is about 1.75 GB total [20].
Amazon Public datasets
Amazon provides a centralized repository of public data sets, hosted on the Amazon cloud, that are easy to integrate and download.
Difference between Big Data and Big Data Analytics?
Big Data and Big data analytics:
Analytics: a set of techniques for deriving insights from data.
Big Data: huge data sets that are difficult to analyze in traditional ways.