Tech Notes

My notes on Statistics, Big Data, Cloud Computing, Cyber Security

Big Data and Hadoop Basics

1. How big is big data?

There is no single agreed definition. Most people settle on this one: data is "big" when it is too large to be processed on a single machine.

2. What are the 3 Vs?

Volume, Variety, Velocity

3. Core Hadoop


Other software runs on top of Hadoop to make it easier to work with.


How HDFS works

E.g., each big file is split into 64 MB "blocks", and each block is stored on a "node" in the cluster.
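The splitting step can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how HDFS is implemented; the function name `split_into_blocks` is made up for this example, and the 64 MB block size is the figure from the note above.

```python
BLOCK_SIZE_MB = 64  # block size from the notes above


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    if file_size_mb <= 0:
        return []
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover data
    return sizes


# A 150 MB file becomes two full blocks plus one partial block.
print(split_into_blocks(150))  # [64, 64, 22]
```

Note that the last block is smaller than 64 MB when the file size is not an exact multiple of the block size.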


To prevent data loss when a DataNode (DN) fails (due to disk failure, etc.), Hadoop stores three copies of each block on different DNs. The NameNode keeps track of this: it holds the metadata recording which block is on which node. If a node fails and its blocks become under-replicated, the NameNode has them re-replicated on other nodes.
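A toy model of that bookkeeping, assuming a plain dictionary stands in for the NameNode metadata. The node names, the round-robin placement and both function names are invented for illustration; real HDFS uses a more careful placement policy.

```python
REPLICATION = 3  # copies per block, as in the notes above


def place_blocks(block_ids, nodes, replication=REPLICATION):
    """Toy placement: put each block on `replication` distinct nodes, round-robin.

    Assumes len(nodes) >= replication.
    """
    metadata = {}
    for i, block in enumerate(block_ids):
        metadata[block] = {nodes[(i + k) % len(nodes)] for k in range(replication)}
    return metadata


def handle_node_failure(metadata, failed, live_nodes, replication=REPLICATION):
    """Forget the failed node, then copy under-replicated blocks to live nodes."""
    for holders in metadata.values():
        holders.discard(failed)
        for node in live_nodes:
            if len(holders) >= replication:
                break
            holders.add(node)  # no effect if this node already holds the block
    return metadata


nodes = ["dn1", "dn2", "dn3", "dn4"]
metadata = place_blocks(["b1", "b2"], nodes)
handle_node_failure(metadata, failed="dn2", live_nodes=["dn1", "dn3", "dn4"])
print(metadata)  # every block is back to three copies, none of them on dn2
```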


Since the NameNode is a single point of failure, it is also possible to run a standby NameNode.


Mapping and Reducing


E.g.: calculate sales by city.

Mappers read the data and write each record onto an index card, sorting the cards into piles, one pile per city. Reducers then collect their piles of cards and do some operation on them, such as adding up the sales. (Each reducer is told which cities it is responsible for.)
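The sales-by-city example can be imitated in plain Python on one machine. This is a sketch of the idea only, not Hadoop code: the tab-separated `city<TAB>amount` record format and the sample records are assumptions made for this example.

```python
from collections import defaultdict


def mapper(line):
    """Turn one record into an 'index card': a (city, amount) pair."""
    city, amount = line.split("\t")
    return city, float(amount)


def shuffle(pairs):
    """Pile the cards up by city, so each reducer gets all cards for its city."""
    piles = defaultdict(list)
    for city, amount in pairs:
        piles[city].append(amount)
    return piles


def reducer(city, amounts):
    """Add up all the sales on one city's pile of cards."""
    return city, sum(amounts)


# Hypothetical input records, made up for this example.
records = ["Miami\t12.50", "NYC\t99.00", "Miami\t7.50"]

piles = shuffle(mapper(r) for r in records)
totals = dict(reducer(city, amounts) for city, amounts in piles.items())
print(totals)  # {'Miami': 20.0, 'NYC': 99.0}
```

In a real cluster the mappers and reducers run in parallel on different nodes; here they run one after another just to show the data flow.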


When we run an MR job, we submit it to the JobTracker. The JobTracker splits the work into mappers and reducers.

The map and reduce tasks themselves are run by a daemon called the TaskTracker, which runs on each node. Since a task runs on the same node that stores its data wherever possible, there is very little network traffic between nodes.
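Why this keeps network traffic low can be shown with a small scheduling sketch. It is a simplified illustration, not the actual JobTracker algorithm: the function `assign_map_tasks`, the block and node names, and the one-slot-per-node setup are all assumptions for this example.

```python
def assign_map_tasks(block_locations, free_slots):
    """Give each block's map task to a node that already stores the block.

    Returns {block: (node, is_local)}. Falls back to any node with a free
    slot when no node holding the block is available.
    """
    slots = dict(free_slots)  # copy, so the caller's dict is not modified
    assignments = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        # Non-local fallback: the block then has to be read over the network.
        candidates = local or [n for n, s in slots.items() if s > 0]
        if not candidates:
            raise RuntimeError("no free task slots")
        node = candidates[0]
        slots[node] -= 1
        assignments[block] = (node, node in holders)
    return assignments


# Hypothetical cluster: which nodes hold which block, one task slot per node.
block_locations = {"b1": ["dn1", "dn2"], "b2": ["dn1", "dn3"], "b3": ["dn1"]}
free_slots = {"dn1": 1, "dn2": 1, "dn3": 1}

assignments = assign_map_tasks(block_locations, free_slots)
print(assignments)
# {'b1': ('dn1', True), 'b2': ('dn3', True), 'b3': ('dn2', False)}
```

Only the task for b3 has to fetch its block from another node, because dn1 (the only node holding b3) is already busy.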


Disclaimer: These are my study notes, kept online instead of on paper so that others can benefit. In the process I have used some pictures and content from other authors. All sources and original publishers are listed below, and they deserve credit for their work. No copyright violation intended.

References for these notes:

The study material for the MOOC “Introduction to Hadoop and MapReduce” at Udacity.com
