Q/A (Hadoop)
Q - What if Namenode fails in Hadoop?
A - The Namenode is the single point of failure in Hadoop 1.x. If the Namenode fails, the whole Hadoop cluster stops working. There is no actual data loss, because the file data lives on the Datanodes and the Namenode's metadata is persisted on disk; but the Namenode is the only point of contact for all Datanodes, so when it fails, all communication stops and cluster work shuts down until it is restored. (Hadoop 2.x addresses this with Namenode High Availability, an active/standby Namenode pair.)
Q - What is Data Locality?
A - If we bring the data from the slaves to the master for processing, it causes network congestion plus input/output channel congestion, and the master node takes a long time to process that huge amount of data. Instead, we send the processing to the data: the logic is shipped to the slave nodes that already hold the data blocks, each slave processes its local data, and only the much smaller results are sent back. This is data locality, and it takes far less time. The sketch below shows the block-location information the scheduler uses to place tasks near the data.
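As a rough illustration (not the scheduler itself), the HDFS client API exposes the block locations that MapReduce consults for locality-aware task placement. A minimal sketch, assuming a running cluster and a hypothetical file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Print which hosts hold each block of a file. The MapReduce
// scheduler consults this same information to run map tasks on
// (or near) the nodes that already store the data.
public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/example.txt"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " -> hosts: " + String.join(", ", block.getHosts()));
        }
    }
}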
Q - In a folder there are 100 files, each 1 MB in size. If the block size is 64 MB, how many blocks will be created in total?
A - 100 blocks will be created. A block never spans multiple files, so each 1 MB file gets its own block; that block occupies only 1 MB of physical storage, not the full 64 MB.
Q - Explain what is heartbeat in HDFS?
A - A heartbeat is a periodic signal sent from a Datanode to the Namenode, and from a TaskTracker to the JobTracker. If the Namenode or JobTracker stops receiving heartbeats from a node, it assumes there is some issue with that Datanode or TaskTracker and marks it as failed.
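The heartbeat frequency is configurable. A minimal sketch of reading it from the cluster configuration, using the standard dfs.heartbeat.interval property (default 3 seconds):

import org.apache.hadoop.conf.Configuration;

// Read the Datanode heartbeat interval from the cluster configuration.
// A Datanode that misses heartbeats for long enough is marked dead,
// and the Namenode re-replicates its blocks elsewhere.
public class HeartbeatIntervalDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        long seconds = conf.getLong("dfs.heartbeat.interval", 3);
        System.out.println("Datanode heartbeat interval: " + seconds + "s");
    }
}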
Q - What happens when a data node fails?
A - When a data node fails...
- The Jobtracker and Namenode detect the failure.
- All tasks that were running on the failed node are re-scheduled on other nodes.
- The Namenode re-replicates the blocks that were stored on the failed node to other Datanodes, restoring the replication factor.
Q - Explain what are the basic parameters of a Mapper?
A - The four basic parameters of a Mapper are the input key, input value, output key, and output value types, typically (see the word-count sketch below):
- LongWritable and Text (input key and value)
- Text and IntWritable (output key and value)
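A minimal word-count Mapper showing where those four types appear; the class and field names here are illustrative:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count Mapper: the four generic parameters are
// <input key, input value, output key, output value>.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = byte offset of the line in the input split (LongWritable)
        // value = the line itself (Text)
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit (Text, IntWritable)
        }
    }
}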
Q - What Are Hadoop Daemons?
A - Daemons are processes that run in the background. There are four primary daemons: Namenode and Datanode for HDFS, plus the Resource Manager (runs on the master node) and Node Manager (runs on the slave nodes) for YARN.
Q - Why divide the file into blocks?
A - Let’s assume we don’t divide the file. It is very difficult to store a 100 TB file on a single machine, and even if we could, every read and write operation on that whole file would have a very high seek time. But if we have multiple blocks of size 128 MB, it becomes easy to perform various read and write operations on them, and to do so in parallel, compared to working on the whole file at once. So we divide the file for faster data access, i.e. to reduce seek time. For example, a 1 GB file with a 128 MB block size is stored as 8 blocks.
Q - Why replicate the blocks in data nodes while storing?
A - Let’s assume we don’t replicate, and a given block is present only on Datanode D1. If D1 crashes, we lose that block, which makes the overall data incomplete and faulty. So we replicate the blocks (three copies by default) to achieve fault-tolerance; the sketch below shows how the replication factor can be inspected and changed per file.
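A minimal sketch using the HDFS FileSystem API, assuming a running cluster and a hypothetical file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Inspect and change the replication factor of a single file.
public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/example.txt"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Current replication: " + status.getReplication());

        // Raise the replication factor to 3; the Namenode schedules
        // the extra copies asynchronously.
        fs.setReplication(file, (short) 3);
    }
}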