Posts

Q/A (Big Data) 2

Q - In which type of database does Hive store its metadata?
A - In a relational database. The default metastore is an embedded Derby database; MySQL is a common choice for shared deployments.

Q - Why analyze data?
A - (i) Exponential growth in machine data over the last few years.
    (ii) A growing number of machines and increasing use of IoT devices.

Q - Where is large data generated?
A - Sensor data, machine data and business data.

Q - What is the communication model in Hadoop?
A - Hadoop uses RPC for communication. The slaves send a heartbeat to the master every 3 seconds (by default).

Q - What are the important configuration files in Hadoop?
A - hdfs-site.xml: block replication
    core-site.xml: I/O settings and core settings of the Hadoop cluster
    hadoop-env.sh: environment setup
    masters, slaves
    mapred-site.xml: MapReduce settings

Q - Big Data as an opportunity?
A - Cost reduction: cost-effective storage for huge data sets.
    Next-generation products: automated cars, health care.
    Faster and better decision making: provides ways to analyze information quickly and make decisions.
    Improved services or products: Evaluation of cu...

Linux Commands

Basic Linux Commands

ls       -  List files
lsr      -  List files recursively
mkdir    -  Make directory
rmdir    -  Remove directory
touchz               ...

Hadoop Start Commands

Hadoop Commands

Hadoop Process Start/Stop:
start-all.sh  =>  starting NameNode, DataNode, Secondary NameNode, Resource Manager, Node Manager
stop-all.sh   =>  stopping NameNode, DataNode, Secondary NameNode, Resource Manager, Node Manager
start-dfs.sh  =>  starting NameNode, DataNode, Secondary NameNode
stop-dfs.sh   =>  stopping NameNode, DataNode, Secondary NameNode

Hadoop 1.x:
start-mapred.sh  =>  starting Job Tracker, Task Tracker
stop-mapred.sh   =>  stopping Job Tracker, Task Tracker

Hadoop 2.x:
start-yarn.sh  =>  starting Resource Manager, Node Manager
stop-yarn.sh   =>  stopping Resource Manager, Node Manager

For starting an individual service:

For HDFS services:
hadoop-daemon.sh start <process_name>
hadoop-daemon.sh stop <process_name>
Ex: process_names are `namenode, datanode, secondarynamenode`

Ex: For the NameNode process:
hadoop-daemon.sh start namenode
hadoop-daemon.sh stop na...

YARN

What is YARN?

YARN stands for Yet Another Resource Negotiator. It separates the resource management layer from the processing layer.

Main components of the YARN architecture:

- Client - Submits the MapReduce job.
- Resource Manager - Allocates cluster resources using the Scheduler and the Application Manager.
    a) Scheduler - Schedules applications based on their resource requirements and the resources available.
    b) Application Manager - Accepts the application and negotiates the first container from the Resource Manager. It also restarts the Application Master container on failure.
- Application Master - It ma...

MapReduce

What is MapReduce?

It is a processing technique and programming model used for distributed parallel processing. It consists of two tasks, Mapper and Reducer: the Mapper splits and maps the data, while the Reducer shuffles and reduces it.

Data goes through the following phases in MapReduce:

1. Input Split - The input data can be a file or a directory and is stored in HDFS. When a job is submitted, Hadoop splits the input data into equal units called chunks (input splits). A RecordReader, provided by TextInputFormat, transforms the input splits into key-value pairs.
2. Mapping - The mapper processes the key-value pairs and produces output in the same format.
3. Shuffling - Shuffling has two steps: sorting and merging. In the sorting step, the key-value pairs are sorted by key. Merging combines the key-value pairs that share a key. The shuffling phase makes it easier to remove dupl...
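The phases described above can be sketched as a single-process word count. This is only an illustration of the data flow, not the Hadoop API: the function names (`input_split`, `mapper`, `shuffle`, `reducer`, `word_count`) are hypothetical, each line of text is assumed to be one record, and the final reduce step is assumed to sum the values for each key.

```python
from collections import defaultdict


def input_split(text):
    """Turn the input into (key, value) records: (line number, line text)."""
    return list(enumerate(text.splitlines()))


def mapper(key, line):
    """Emit a (word, 1) pair for every word in one record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Sort the pairs by key, then merge the values that share a key."""
    grouped = defaultdict(list)
    for word, count in sorted(pairs):
        grouped[word].append(count)
    return dict(grouped)


def reducer(word, counts):
    """Collapse all values for one key into a single result."""
    return word, sum(counts)


def word_count(text):
    """Run the phases in order: split, map, shuffle, reduce."""
    mapped = []
    for key, line in input_split(text):
        mapped.extend(mapper(key, line))
    return dict(reducer(word, counts) for word, counts in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count("deer bear river\ncar car river\ndeer car bear"))
```

In a real cluster each mapper would run on the node that holds its input split, and the shuffle would move data across the network; here everything happens in one process to keep the phases visible.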

NameNode Federation

What is NameNode Federation?

The NameNode keeps track of every file present on its file system, and all of this information is held in main memory. As the number of files in the cluster keeps increasing, main memory would fill up at some point. To address this, Hadoop uses multiple NameNodes, each managing a portion of the file system.

Example: say we have three NameNodes as part of a NameNode federation.
- The first NameNode could be dedicated to Sales data.
- The second NameNode could be dedicated to Account data.
- The third NameNode could be dedicated to HR data.

A unique ID, called the block pool ID, is created for each NameNode. If any NameNode goes down, the other NameNodes are not affected.
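The Sales / Account / HR example can be sketched as a lookup from a path prefix to the NameNode that owns it. Everything in this sketch is an assumption made for illustration: the mount table, the host names `namenode-1` to `namenode-3`, and the `resolve_namenode` helper are hypothetical and are not part of Hadoop.

```python
# Hypothetical mount table: namespace prefix -> NameNode that manages it.
MOUNT_TABLE = {
    "/sales": "namenode-1",
    "/accounts": "namenode-2",
    "/hr": "namenode-3",
}


def resolve_namenode(path, mount_table=MOUNT_TABLE, down=frozenset()):
    """Return the NameNode responsible for `path`.

    `down` is the set of NameNodes that are currently unavailable.
    Only paths owned by a failed NameNode are affected.
    """
    for prefix, namenode in mount_table.items():
        if path == prefix or path.startswith(prefix + "/"):
            if namenode in down:
                raise RuntimeError(f"{namenode} is down; {prefix} is unavailable")
            return namenode
    raise KeyError(f"no NameNode is configured for {path}")


if __name__ == "__main__":
    # namenode-1 has failed, but HR data is still reachable.
    print(resolve_namenode("/hr/staff.csv", down={"namenode-1"}))
```

The point of the sketch is the isolation: a failure of the Sales NameNode makes only `/sales` unavailable, while lookups for `/accounts` and `/hr` still succeed.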

Core Components of HDFS

There are three components of HDFS:
1. NameNode
2. Secondary NameNode
3. DataNode

NameNode:
NameNode is the master node in the Apache Hadoop HDFS architecture. It maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a highly available server that manages the file system namespace and controls access to files by clients. I will be discussing this High Availability feature of Apache Hadoop HDFS in my next blog. The HDFS architecture is built in such a way that user data never resides on the NameNode. The data resides on the DataNodes only.

1. NameNode is the centerpiece of HDFS.
2. NameNode is also known as the Master.
3. NameNode stores only the metadata of HDFS, that is, the directory tree of all files in the file system, and tracks the files across the cluster.
4. NameNode does not store the actual data or the data set. The data itself is stored in the DataNodes.
5. NameNode knows the list of blocks and their locations for any given file...

Rack Awareness Advantages

Advantages of Rack Awareness

So, you may be wondering why we need a Rack Awareness algorithm. The reasons are:

To improve network performance: Communication between nodes on different racks goes through a switch. In general, there is greater network bandwidth between machines in the same rack than between machines in different racks. Rack Awareness therefore reduces write traffic between racks and gives better write performance. Read performance also increases, because reads can use the bandwidth of multiple racks.

To prevent loss of data: We do not have to worry about the data even if an entire rack fails because of a switch failure or a power failure. It makes sense if you think about it; as the saying goes, never put all your eggs in the same basket.

#RackAwarenessAlgorithm
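Both advantages can be seen in a small placement sketch. It assumes a replication factor of 3 and the commonly described default HDFS policy (first replica on the writer's node, second and third on two different nodes of one other rack); the post itself does not spell the policy out, so treat that as an assumption. The `place_replicas` helper and the topology layout are hypothetical.

```python
import random


def place_replicas(writer, topology, rng=random):
    """Pick three DataNodes for one block.

    topology maps a rack name to the list of nodes in that rack.
    Assumed policy: replica 1 on the writer's node, replicas 2 and 3
    on two different nodes of a single remote rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]

    # Only remote racks with at least two nodes can hold replicas 2 and 3.
    candidates = [
        rack for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    ]
    if not candidates:
        raise ValueError("need a remote rack with at least two nodes")

    remote_rack = rng.choice(candidates)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]


if __name__ == "__main__":
    racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
    print(place_replicas("n1", racks))
```

Only one copy of the block crosses the inter-rack switch during a write (the third replica is forwarded inside the remote rack), and the block survives the loss of either rack.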

Q/A (Hadoop)

Q - What if the NameNode fails in Hadoop?
A - The NameNode is the single point of failure in Hadoop 1.x. If the NameNode fails, the whole Hadoop cluster stops working. There is no data loss; only the cluster's work is shut down, because the NameNode is the single point of contact for all DataNodes, and if it fails, all communication stops.

Q - What is Data Locality?
A - Bringing the data from the slaves to the master causes network congestion and I/O channel congestion, and the master node would take a long time to process such a huge amount of data. Instead, we can send the processing to the data: the logic is sent to every slave that holds data, the processing happens on the slave itself, and only the result is sent back to the master, which takes less time.

Q - A folder contains 100 files. Each file is 1 MB. If the block size is 64 MB, how many blocks will be created in total?
A - 100 blocks will be crea...
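The block count in the last answer follows from the fact that a block never holds data from more than one file, so every non-empty file needs at least one block of its own. A minimal sketch of the arithmetic, with hypothetical helper names `blocks_for_file` and `total_blocks`:

```python
import math


def blocks_for_file(file_size_mb, block_size_mb):
    """Number of HDFS blocks needed for one file (blocks are not shared between files)."""
    return math.ceil(file_size_mb / block_size_mb)


def total_blocks(file_sizes_mb, block_size_mb):
    """Total number of blocks for a collection of files."""
    return sum(blocks_for_file(size, block_size_mb) for size in file_sizes_mb)


if __name__ == "__main__":
    # 100 files of 1 MB each with a 64 MB block size
    print(total_blocks([1] * 100, 64))
```

A 1 MB file in a 64 MB block still occupies only 1 MB on disk; the cost of many small files is the per-block metadata that the NameNode has to keep in memory.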

Hive

What is Hive?

Hive is a data warehouse tool for analyzing structured data in Hadoop. It was developed by Facebook. It sits on top of Hadoop, abstracts the data, and makes querying and analysis easy. It is a platform for writing SQL-like scripts that run as MapReduce operations.

Note - It helps in reading, writing and processing data in Hadoop without writing complex Java programs.

Features:
- It is OLAP (Online Analytical Processing).
- It is fast, scalable and familiar.
- Its query language, HQL (Hive Query Language), is similar to SQL.
- It supports Data Manipulation Language and Data Definition Language.
- It works on the server side of the HDFS cluster.

There are two types of tables in Hive: Internal Table and External Table.
Note: The default location is /user/hive/warehouse.

Internal Table (Managed Table): In an internal table, both the table schema and the table data are managed by Hive. The data will be located in a folder named after the table ...

Rack Awareness Algorithm

Block - A block is a chunk of data. It is the minimum amount of data that can be read or written. HDFS stores each file as a sequence of blocks. The block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x. Files are split into 64 MB or 128 MB blocks, depending on the Hadoop version, and then stored in the Hadoop file system.

Why are HDFS blocks so large?
HDFS blocks are large in order to reduce the cost of seek time. In general, the seek time is about 10 ms and the disk transfer rate is about 100 MB/s. To keep the seek time at 1% of the transfer time, the block size should be about 100 MB. The default HDFS block sizes (64 MB in Hadoop 1.x, 128 MB in Hadoop 2.x) are of that order.

Rack - A rack is a collection of machines connected through the same network switch. If the switch goes down, every machine in the rack becomes unreachable. The Rack Awareness algorithm came into the picture to overcome this problem. In Rack Awareness, the NameNode chooses a DataNode on the same rack or a nearby rack. NameNode maintains...
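The block size argument above can be worked through numerically. This is a sketch under the figures quoted in the text (10 ms seek, 100 MB/s transfer, seek limited to 1% of transfer time); the function name `block_size_for_seek_overhead` is hypothetical.

```python
def block_size_for_seek_overhead(seek_time_s, transfer_rate_mb_s, overhead):
    """Block size (MB) at which seek time equals `overhead` times the transfer time.

    seek_time = overhead * transfer_time
    transfer_time = block_size / transfer_rate
    => block_size = (seek_time / overhead) * transfer_rate
    """
    transfer_time_s = seek_time_s / overhead
    return transfer_time_s * transfer_rate_mb_s


if __name__ == "__main__":
    # 10 ms seek, 100 MB/s transfer, seek limited to 1% of transfer time
    print(block_size_for_seek_overhead(0.010, 100, 0.01))
```

With these inputs the transfer has to last about one second for the seek to be 1% of it, and one second at 100 MB/s is about 100 MB, which is why the block sizes are on the order of 64 to 128 MB rather than a few kilobytes.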

HDFS

What is HDFS?

Before discussing HDFS, let us first discuss scalability.

The primary benefit of Hadoop is its scalability. One can easily scale the cluster by adding more nodes. There are two types of scalability in Hadoop: vertical and horizontal.

Vertical Scalability - Also referred to as "scale up". In vertical scaling, you increase the hardware capacity of an individual machine. In other words, you add more RAM or CPU to your existing system to make it more robust and powerful.

Horizontal Scalability - Also referred to as "scale out". It is the addition of more machines to the cluster. In horizontal scaling, instead of increasing the hardware capacity of an individual machine, you add more nodes to the existing cluster and, most importantly, you can add machines without stopping the system.

HDFS
HDFS stands for Hadoop Distributed File System. HDFS provides better data throughput than traditional file systems. It provides a way to man...