Big Data Hadoop Basic Tutorial For Beginners

Rahul R
13/07/2017

Hadoop Basics for Admin and developers

Hadoop is a framework for storing and processing huge data sets, commonly called Big Data. Big data is data whose volume, velocity or variety makes it impractical to handle with a traditional RDBMS. To store and process big data, the hadoop framework is divided into 2 parts:

1) HDFS: For storing

2) Map Reduce: For processing

Now let’s discuss these one by one in detail. We will cover HDFS first and learn how it actually handles data storage. We will consider both versions of hadoop, i.e. hadoop-1.x and hadoop-2.x, and see how hadoop has evolved, but we will start from scratch and then explore further.

So what is HDFS?

Hadoop Distributed File System (HDFS) is the storage unit of hadoop. You have probably heard of NTFS and FAT32 on Windows; they are file systems used by Windows. Similarly, HDFS is the file system of hadoop. How it differs from the others is what we will discuss now.

HDFS has a default block size of 64 MB in hadoop-1.x (128 MB in hadoop-2.x); it is configurable, as we will see later. Now you must be wondering what a block is. A block is the unit in which a file system allocates storage for a file. In the case of NTFS it is typically 4 KB, which means that a 2 GB file is broken into 4 KB blocks and then stored. In the case of hadoop-1.x it is 64 MB, so a 512 MB file is stored as 8 blocks of 64 MB each (64 x 8 = 512 MB).

Now you may be wondering what happens when the file size is not a multiple of 64, say 516 MB. The file is then stored as 8 blocks of 64 MB each plus one final block that holds the remaining 4 MB. That last block does not occupy a full 64 MB on disk: it takes up only 4 MB, so the remaining 60 MB is not wasted. A block always belongs to exactly one file, so the next file that arrives starts in a new block of its own rather than filling up the 4 MB block. The NameNode records which blocks make up each file and in what order, and this mapping is how the 4 MB block is connected back to the other 8 blocks of its file.
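The block arithmetic above can be checked with a few lines of plain Java. This is only a sketch of the calculation, not Hadoop code: the class name BlockSplit and the method blockSizes are made up for illustration, and sizes are given in whole megabytes.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplit {
    // Default block size of hadoop-1.x as used in the text, in megabytes.
    public static final long BLOCK_MB = 64;

    // Returns the size (in MB) of every block needed for a file of fileMb megabytes.
    public static List<Long> blockSizes(long fileMb, long blockMb) {
        List<Long> sizes = new ArrayList<>();
        long remaining = fileMb;
        while (remaining > 0) {
            // Every block is full except possibly the last one.
            long size = Math.min(blockMb, remaining);
            sizes.add(size);
            remaining -= size;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // prints [64, 64, 64, 64, 64, 64, 64, 64]
        System.out.println(blockSizes(512, BLOCK_MB));
        // prints [64, 64, 64, 64, 64, 64, 64, 64, 4]
        System.out.println(blockSizes(516, BLOCK_MB));
    }
}
```

The 516 MB file ends with a 4 MB block, matching the example in the text.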

Block concept in HDFS.

It should be noted that blocks are a logical view of a file in hadoop. Physically, each block is stored as an ordinary file on the hard disks of the machines in the cluster.

Daemons in Hadoop:

There are five daemons (continuously running background processes) in hadoop. For now we will look at them in the context of hadoop-1.x:

1) NameNode (NN)

2) DataNode (DN)

3) TaskTracker (TT)

4) JobTracker (JT)

5) SecondaryNameNode (SNN)

These hadoop daemons follow a master-slave architecture. In HDFS there is a master called the NameNode and slaves called DataNodes. The actual data resides on the DataNodes, and the information about that data is kept in the NameNode as metadata. Always remember that the NameNode doesn’t contain actual data; it contains metadata (data about data).

As mentioned earlier, data is stored in the form of blocks, and the DataNodes store these blocks with a replication factor of 3 (by default, and configurable). This means that each block is stored in 3 places, on different DataNodes as per availability. If one DataNode is lost, the block can still be read from another, which makes hadoop fault tolerant.

If your hard disk is full and you want to add more data, then with a traditional RDBMS you would typically add more disk, RAM or CPU to the same system/node. This is called Vertical Scaling; it is limited by how far a single machine can be upgraded, and that machine remains a single point of hardware failure. In hadoop this is not the case. When your data grows, you don't need to upgrade an existing system/node; you add another separate system/node to the existing set of DataNodes. This is called Horizontal Scaling.

Replication and horizontal scaling together make hadoop scalable and fault tolerant, which allows us to use commodity (cheap) hardware and keeps the cost low. Hadoop was originally developed by Doug Cutting. It is an open source Apache framework whose development was backed by Yahoo, and it is blooming in the market.
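To make the replication factor concrete, here is a small plain-Java sketch that picks 3 different DataNodes for each block. It is a deliberate simplification: the round-robin choice, the class name ReplicaPlacement and the node names dn1 to dn4 are assumptions for illustration only, and the real HDFS placement policy also takes racks and free space into account.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReplicaPlacement {
    // Picks `replication` distinct DataNodes for one block,
    // walking the node list round-robin starting at blockIndex.
    public static List<String> place(int blockIndex, List<String> dataNodes, int replication) {
        if (replication > dataNodes.size()) {
            throw new IllegalArgumentException("not enough DataNodes for the replication factor");
        }
        List<String> chosen = new ArrayList<>();
        for (int i = 0; i < replication; i++) {
            chosen.add(dataNodes.get((blockIndex + i) % dataNodes.size()));
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("dn1", "dn2", "dn3", "dn4");
        for (int block = 0; block < 3; block++) {
            // block 0 -> [dn1, dn2, dn3]
            // block 1 -> [dn2, dn3, dn4]
            // block 2 -> [dn3, dn4, dn1]
            System.out.println("block " + block + " -> " + place(block, nodes, 3));
        }
    }
}
```

Adding a fifth node to the list is all it takes to spread the blocks wider, which is the idea behind horizontal scaling.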

This was a brief introduction to storage in hadoop. Now let’s come to processing. The processing part is handled by another master-slave architecture in which the JobTracker is the master and the TaskTrackers are the slaves (in hadoop-1.x). Together they carry out processing using MapReduce, a backbone concept of hadoop. We will see MapReduce in detail later.

Now you must be wondering what the SecondaryNameNode (SNN) is. Let me be clear: it is not a backup for the NameNode. To know what the SNN is, we first need to know about the fsimage and the edit log.

As mentioned earlier, MapReduce is the processing/analytical unit of hadoop and HDFS is for storage.

Since data grows continuously and gets stored on the DataNodes, the metadata in the NameNode also grows continuously and needs to be updated. The metadata is stored in the NameNode in the form of an image file called the fsimage. The fsimage is not edited directly; metadata updates are appended to an edit log instead. After a fixed time interval or a fixed number of edit-log records (for example every hour or every million records, whichever comes first) a checkpoint occurs: the edit log and the fsimage are copied to the SecondaryNameNode, where they are merged to form a new fsimage. Meanwhile, metadata updates in the NameNode are written to a new edit log. After the merge is complete on the SNN, the new fsimage is moved to the NameNode, and the old edit log and old fsimage are deleted from the NameNode.

The new edit log now becomes the current edit log, and the whole system is back where it was before, with a larger, more up-to-date fsimage. This whole process is called checkpointing.
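The merge step of checkpointing can be pictured with a toy model in plain Java. Everything here is an assumption made for illustration: the fsimage is reduced to a map from file path to block count, the edit log to a list of ADD and DELETE records, and the names CheckpointSketch, Edit and merge are hypothetical. The real fsimage and edit log are binary files with many more record types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CheckpointSketch {
    // One edit-log record: a file was added (with its block count) or deleted.
    public static final class Edit {
        public final String op;
        public final String path;
        public final int blocks;

        public Edit(String op, String path, int blocks) {
            this.op = op;
            this.path = path;
            this.blocks = blocks;
        }
    }

    // Replays the edit log on a copy of the old fsimage and returns the new fsimage.
    // The old fsimage is left untouched, just as the NameNode keeps it until the merge is done.
    public static Map<String, Integer> merge(Map<String, Integer> fsimage, List<Edit> editLog) {
        Map<String, Integer> merged = new TreeMap<>(fsimage);
        for (Edit e : editLog) {
            if (e.op.equals("ADD")) {
                merged.put(e.path, e.blocks);
            } else if (e.op.equals("DELETE")) {
                merged.remove(e.path);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> fsimage = new TreeMap<>();
        fsimage.put("/logs/a.txt", 8);

        List<Edit> editLog = new ArrayList<>();
        editLog.add(new Edit("ADD", "/logs/b.txt", 9));
        editLog.add(new Edit("DELETE", "/logs/a.txt", 0));

        Map<String, Integer> newFsimage = merge(fsimage, editLog);
        editLog = new ArrayList<>(); // the NameNode carries on with a fresh edit log

        // prints {/logs/b.txt=9}
        System.out.println(newFsimage);
    }
}
```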

Note that if the JobTracker fails, MapReduce (processing) will not occur, but storage will not be affected.

If the NameNode fails, neither storage nor processing can be carried out, because the NameNode holds the information about which data is present on which node.

So it can be concluded that the NameNode is the single point of failure (in hadoop-1.x, of course; later we will discuss how the NameNode is no longer a single point of failure in hadoop-2.x).

Heartbeat mechanism in hadoop:

The NameNode holds the file-to-block mapping in its fsimage and learns which blocks each DataNode stores from the block reports the DataNodes send. Each DataNode also sends a heartbeat to the NameNode every 3 seconds (by default, and configurable) to tell it that it is alive. If a DataNode stops sending heartbeats, then after some time (a little over 10 minutes by default) the NameNode treats it as dead and re-replicates the blocks that were on that DataNode onto other DataNodes. You must be wondering how it can copy data from a dead DataNode; it doesn’t. From the block reports, the NameNode knows which other DataNodes already hold replicas of those blocks, and it restores the replication factor by copying each block from one of those live DataNodes. If the dead DataNode becomes active again and starts sending heartbeats, the NameNode maintains the replication factor by deleting the over-replicated blocks.
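The dead-node detection and re-replication described above can be sketched as a toy model in plain Java. This is not how the NameNode is implemented: the class HeartbeatSketch, its methods, the 10 second timeout and the node names are all assumptions chosen to keep the example small, and the deletion of over-replicated blocks is left out.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class HeartbeatSketch {
    // DataNode name -> time of its last heartbeat, in milliseconds.
    private final Map<String, Long> lastHeartbeat = new TreeMap<>();
    // Block id -> DataNodes that hold a replica of it.
    private final Map<String, Set<String>> blockLocations = new TreeMap<>();
    private final long timeoutMs;
    private final int replication;

    public HeartbeatSketch(long timeoutMs, int replication) {
        this.timeoutMs = timeoutMs;
        this.replication = replication;
    }

    public void heartbeat(String node, long nowMs) {
        lastHeartbeat.put(node, nowMs);
    }

    public void addReplica(String block, String node) {
        blockLocations.computeIfAbsent(block, b -> new TreeSet<>()).add(node);
    }

    // A node is alive if its last heartbeat is not older than the timeout.
    public Set<String> liveNodes(long nowMs) {
        Set<String> live = new TreeSet<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMs - e.getValue() <= timeoutMs) {
                live.add(e.getKey());
            }
        }
        return live;
    }

    // Forgets replicas on dead nodes, then adds replicas on live nodes
    // until each block is back at the replication factor.
    public void restoreReplication(long nowMs) {
        Set<String> live = liveNodes(nowMs);
        for (Set<String> holders : blockLocations.values()) {
            holders.retainAll(live);
            if (holders.isEmpty()) {
                continue; // no live replica left to copy from
            }
            for (String candidate : live) {
                if (holders.size() >= replication) {
                    break;
                }
                holders.add(candidate); // no effect if the node already has the block
            }
        }
    }

    public Set<String> locations(String block) {
        return blockLocations.get(block);
    }

    public static void main(String[] args) {
        HeartbeatSketch nameNode = new HeartbeatSketch(10_000L, 3);
        for (String node : new String[] {"dn1", "dn2", "dn3", "dn4"}) {
            nameNode.heartbeat(node, 0L);
        }
        nameNode.addReplica("blk_1", "dn1");
        nameNode.addReplica("blk_1", "dn2");
        nameNode.addReplica("blk_1", "dn3");

        // dn3 goes silent; the others keep sending heartbeats.
        nameNode.heartbeat("dn1", 9_000L);
        nameNode.heartbeat("dn2", 9_000L);
        nameNode.heartbeat("dn4", 9_000L);

        nameNode.restoreReplication(12_000L);
        // prints [dn1, dn2, dn4]
        System.out.println(nameNode.locations("blk_1"));
    }
}
```

The replica that was on dn3 is recreated on dn4 from the copies on dn1 and dn2; nothing is read from the dead node.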

Let’s switch to the MapReduce concept in hadoop.

What is mapreduce?

MapReduce is a framework for distributing tasks across multiple nodes in a Hadoop cluster. It is divided into two phases: 1) the map phase; 2) the reduce phase. Let’s consider a simple word count example: you have to find the count of each word in a text file.

The job passes through the following phases. First, the input is split; by default the split size follows the block size (64 MB in hadoop1 and 128 MB in hadoop2).

In the map phase each input split is turned into key, value pairs; for word count every word is emitted as the pair <word, 1>.

Then the shuffling and sorting phase occurs: pairs with the same word are grouped together, giving the form <key, list(values)>.

Next is the reduce phase. For each key the total of its values is calculated, and the outputs of the reduce phase are written out as the final result (one output file per reducer).

Phase     INPUT            OUTPUT
MAP       <K1,V1>          LIST(<K2,V2>)
REDUCE    <K2,LIST(V2)>    LIST(<K3,V3>)
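Before looking at the Hadoop version, the three phases can be mimicked in plain Java on a single machine. This is only a sketch of the data flow in the table above: the class WordCountPhases, its method names and the sample sentences are invented for illustration, and nothing here uses the Hadoop API or runs in parallel.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountPhases {
    // Map phase: emit a <word, 1> pair for every token in one input line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle and sort: group the values by key, with keys in sorted order.
    public static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the list of values that belong to one key.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs map, shuffle and reduce one after the other over all input lines.
    public static SortedMap<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("deer bear river", "car car river", "deer car bear");
        // prints {bear=2, car=3, deer=2, river=2}
        System.out.println(wordCount(lines));
    }
}
```

Here map produces LIST(<K2,V2>), shuffle turns it into <K2,LIST(V2)>, and reduce yields one <K3,V3> pair per word.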

Java Code for Mapreduce :

Here is the java code for map reduce. We will see how to run this code in hadoop later while discussing hadoop commands:

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: input <byte offset, line>, output <word, 1> for every word in the line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: input <word, list of 1s>, output <word, total count>.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated new Job(conf, name) constructor in hadoop-2.x.
        Job job = Job.getInstance(conf, "wordcount");
        // Tells hadoop which jar to ship to the cluster nodes.
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // args[0] is the input path, args[1] the output path (which must not exist yet).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Now, before we move on to the architecture of MapReduce1 and MapReduce2, let’s do some practical work.

Hadoop can be installed in:

1) Standalone mode

2) Pseudo-distributed mode (single node cluster)

3) Fully distributed mode (multiple node cluster)

Now we will see the steps for a single node cluster installation of hadoop-2.x on Ubuntu.

All you need is VMware and Ubuntu on your system. You can also use VirtualBox. All of this software is available free of cost for learning purposes.

This manual is recommended only for a POC (Proof of Concept). A separate manual will be provided later for a real-time environment, i.e. for a Cloudera Hadoop multi-node installation on the AWS cloud.

1) As you must be aware, hadoop is written in Java, so our first step is Java installation. First check whether Java is already installed on your system by running the following command.

$ java -version

java version "1.7.0_65"

OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.14.04.1)

OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

2) Adding a dedicated Hadoop user:

$ sudo addgroup hadoop
Adding group `hadoop' (GID 1002) ...
Done.

~$ sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the default
        Full Name []:
        Room Number []:
        Work Phone []:
        Home Phone []:
        Other []:
Is the information correct? [Y/n] Y

3) Installing SSH:

SSH has two main components:

  1. SSH: The command we use to connect to remote machines - the client.
  2. SSHD: The daemon that is running on the server and allows clients to connect to the server.

The SSH client is usually already present on Linux, but in order to run the SSHD daemon we need to install the ssh package first. Use this command to do that:

~$ sudo apt-get install ssh

4) This will install ssh on our machine. If we get something similar to the following, we can assume it is set up properly:

~$ which ssh
/usr/bin/ssh

~$ which sshd
/usr/sbin/sshd

5) Create and set up SSH keys: Hadoop requires SSH access to manage its nodes, i.e. remote machines plus our local machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost.
So, we need to have SSH up and running on our machine and configured to allow SSH public key authentication.
Hadoop uses SSH to access its nodes, which would normally require the user to enter a password. However, this requirement can be eliminated by creating and setting up an SSH key pair using the following commands. If asked for a filename, just leave it blank and press the enter key to continue.
~$ su hduser
Password:
k@laptop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
50:6b:f3:fc:0f:32:bf:30:79:c2:41:71:26:cc:7d:e3 hduser@laptop
The key's randomart image is:
+--[ RSA 2048]----+
|        .oo.o    |
|       . .o=. o  |
|      . + .  o . |
|       o =    E  |
|        S +      |
|         . +     |
