Core Components of HDFS

There are three components of HDFS:
1. NameNode
2. Secondary NameNode
3. DataNode

NameNode:

NameNode is the master node in the Apache Hadoop HDFS architecture that maintains and manages the blocks stored on the DataNodes (slave nodes). It is a highly available server that manages the file system namespace and controls client access to files. I will be discussing this High Availability feature of Apache Hadoop HDFS in my next blog. The HDFS architecture is built in such a way that user data never resides on the NameNode; the data resides only on the DataNodes.

1. NameNode is the centerpiece of HDFS.
2. NameNode is also known as the Master.
3. NameNode stores only the metadata of HDFS – the directory tree of all files in the file system – and tracks the files across the cluster.
4. NameNode does not store the actual data or the data set. The data itself is stored on the DataNodes.
5. NameNode knows the list of blocks and their locations for any given file in HDFS. With this information, the NameNode knows how to construct the file from its blocks (see the sketch after this list).
6. NameNode is so critical to HDFS that when the NameNode is down, the HDFS/Hadoop cluster is inaccessible and considered down.
7. NameNode is a single point of failure in a Hadoop cluster.
8. NameNode is usually configured with a lot of memory (RAM), because the block locations are held in main memory.
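
To make the block-to-location mapping concrete, here is a minimal sketch (not from the original post) that asks the NameNode, through the standard org.apache.hadoop.fs.FileSystem client API, where the blocks of a file live. The path /user/edureka/sample.txt is just a hypothetical example; it assumes a reachable cluster configured via the usual core-site.xml and hdfs-site.xml.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
        public static void main(String[] args) throws Exception {
            // Connects to the cluster described by the configuration on the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical file path used purely for illustration
            Path file = new Path("/user/edureka/sample.txt");
            FileStatus status = fs.getFileStatus(file);

            // The NameNode answers this query from its in-memory block map
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation block : blocks) {
                System.out.println("Offset " + block.getOffset()
                        + ", length " + block.getLength()
                        + ", hosts: " + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }

Each printed line shows one block of the file and the DataNodes that hold its replicas – exactly the mapping the NameNode keeps in RAM.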


Functions of NameNode:

  • It is the master daemon that maintains and manages the DataNodes (slave nodes).
  • It records each change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
  • It regularly receives a Heartbeat and a block report from every DataNode in the cluster to ensure that the DataNodes are live.
  • It keeps a record of all the blocks in HDFS and of the nodes on which these blocks are located.
  • The NameNode is also responsible for maintaining the replication factor of all the blocks, which we will discuss in detail later in this HDFS tutorial blog.
  • In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage and manages the communication traffic to the DataNodes.
  • It records the metadata of all the files stored in the cluster, e.g. the location of the stored blocks, the size of the files, permissions, hierarchy, etc. (a small example after this list shows this metadata being read through the client API).

There are two files associated with the metadata:

  • FsImage: It contains the complete state of the file system namespace since the start of the NameNode.
  • EditLog: It contains all the recent modifications made to the file system with respect to the most recent FsImage.
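
To see this metadata in action, here is a minimal sketch (assuming a reachable cluster and a hypothetical /user/edureka directory) that lists, through the standard org.apache.hadoop.fs.FileSystem API, the per-file metadata the NameNode serves – path, size, replication factor, permissions and modification time – without ever contacting a DataNode.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceMetadataExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical directory; every field printed below is served
            // from the NameNode's metadata, not from the DataNodes.
            for (FileStatus st : fs.listStatus(new Path("/user/edureka"))) {
                System.out.println(st.getPath()
                        + " size=" + st.getLen()
                        + " replication=" + st.getReplication()
                        + " permissions=" + st.getPermission()
                        + " modified=" + st.getModificationTime());
            }
            fs.close();
        }
    }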


Secondary NameNode:

Apart from these two daemons, there is a third daemon or process called the Secondary NameNode. The Secondary NameNode works concurrently with the primary NameNode as a helper daemon. And don't be confused: the Secondary NameNode is not a backup NameNode.


Figure: Secondary NameNode function in the Apache Hadoop HDFS architecture.


Functions of Secondary NameNode:

  • The Secondary NameNode constantly reads the file system state and metadata from the RAM of the NameNode and writes it to the hard disk or the file system.
  • It is responsible for combining the EditLog with the FsImage from the NameNode.
  • It downloads the EditLog from the NameNode at regular intervals and applies it to the FsImage. The new FsImage is copied back to the NameNode and is used the next time the NameNode starts.

Hence, the Secondary NameNode performs regular checkpoints in HDFS. Therefore, it is also called the Checkpoint Node.
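
The checkpoint cycle itself is internal to Hadoop, but the idea can be sketched as code. Everything below (fetchFsImage, fetchEditLog, applyEdits, uploadFsImage and the one-hour interval) is a hypothetical illustration of the steps described above, not the actual Hadoop implementation.

    // Hypothetical sketch of the Secondary NameNode checkpoint loop.
    // The helper methods below do not exist in Hadoop; they only name
    // the steps described in the text above.
    public class CheckpointSketch {
        void runCheckpointLoop() throws InterruptedException {
            while (true) {
                byte[] fsImage  = fetchFsImage();               // current FsImage from the NameNode
                byte[] editLog  = fetchEditLog();               // recent edits from the NameNode
                byte[] newImage = applyEdits(fsImage, editLog); // merge the edits into a new FsImage
                uploadFsImage(newImage);                        // copy the new FsImage back to the NameNode
                Thread.sleep(60 * 60 * 1000L);                  // wait for the next checkpoint (interval is configurable)
            }
        }

        // Placeholders for illustration only
        byte[] fetchFsImage()  { return new byte[0]; }
        byte[] fetchEditLog()  { return new byte[0]; }
        byte[] applyEdits(byte[] image, byte[] edits) { return image; }
        void   uploadFsImage(byte[] image) { }
    }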


DataNode:

DataNode is the slave node in HDFS. Unlike the NameNode, a DataNode runs on commodity hardware, that is, an inexpensive system that is not of high quality or high availability. The DataNode is a block server that stores the data in a local file system such as ext3 or ext4.

1. DataNode is responsible for storing the actual data in HDFS.
2. DataNode is also known as the Slave Node.
3. NameNode and DataNode are in constant communication.
4. When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for.
5. When a DataNode is down, it does not affect the availability of the data or the cluster.
6. The NameNode will arrange for replication of the blocks managed by the DataNode that is not available.
7. A DataNode is usually configured with a lot of hard disk space, because the actual data is stored on the DataNode.

Functions of DataNode:

  • These are slave daemons or processes which run on each slave machine.
  • The actual data is stored on the DataNodes.
  • The DataNodes perform the low-level read and write requests from the file system’s clients.
  • They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds (see the example after this list).
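
For reference, the heartbeat frequency comes from the dfs.heartbeat.interval property, which defaults to 3 seconds. Here is a minimal sketch of reading it from the client-side configuration, assuming the usual hdfs-site.xml is on the classpath:

    import java.util.concurrent.TimeUnit;
    import org.apache.hadoop.conf.Configuration;

    public class HeartbeatIntervalExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // dfs.heartbeat.interval defaults to 3 seconds if not set explicitly
            long seconds = conf.getTimeDuration("dfs.heartbeat.interval", 3, TimeUnit.SECONDS);
            System.out.println("DataNode heartbeat interval: " + seconds + " s");
        }
    }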

Till now, you must have realized that the NameNode is pretty important to us. If it fails, we are doomed. But don’t worry, we will be talking about how Hadoop solved this single point of failure problem in the next Apache Hadoop HDFS Architecture blog. So, just relax for now and let’s take one step at a time.
