Hadoop has become a central platform to store big data through its Hadoop Distributed File System (HDFS) as well as to run analytics on this stored big data using its MapReduce component. This article explores the basics of Hadoop.
Many of us would have certainly heard about Big Data, Hadoop and analytics. The industry is now focused primarily on them and Gartner identifies strategic big data and actionable analytics as being among the Top 10 strategic technology trends of 2013.
According to the Gartner website: ‘Big Data is moving from a focus on individual projects to an influence on enterprises strategic information architecture. Dealing with data volume, variety, velocity and complexity is forcing changes to many traditional approaches. This realisation is leading organisations to abandon the concept of a single enterprise data warehouse containing all information needed for decisions. Instead, they are moving towards multiple systems, including content management, data warehouses, data marts and specialised file systems tied together with data services and metadata, which will become the logical enterprise data warehouse.
There are various systems available for big data processing and analytics, alternatives to Hadoop such as HPCC or the newly launched Red Shift by Amazon. However, the success of Hadoop can be gauged by the number of Hadoop distributions available from different technological companies such as IBM InfoSphere BigInsights, Microsoft HDInsight Service on Azure, Cloudera Hadoop, Yahoo’s distribution of Hadoop, and many more. There are basically four reasons behind its success:
- It’s an open source project.
- It can be used in numerous domains.
- It has a lot of scope for improvement with respect to fault tolerance, availability and file systems.
One can write Hadoop jobs in SQL like Hive, Pig, Jaql, etc, instead of using the complex MapReduce.
This enables companies to modify the Hadoop core or any of its distributions to adapt to the company’s own requirements and the project’s requirements. In this article, we will only focus on the basics of Hadoop. However, in forthcoming articles in this series, we will primarily focus on fault tolerance and the availability features of Hadoop.
Formally, Hadoop is an open source, large scale, batch data processing, distributed computing framework for big data storage and analytics. It facilitates scalability and takes care of detecting and handling failures. Hadoop ensures high availability of data by creating multiple copies of the data in different nodes throughout the cluster. By default, the replication factor is set to 3. In Hadoop, the code is moved to the location of the data instead of moving the data towards the code. In the rest of this article, “whenever I mention Hadoop, I refer to the Hadoop Core package available from http://hadoop.apache.org”.
There are five major components of Hadoop:
- MapReduce (a job tracker and task tracker)
- NameNode and Secondary NameNode
- DataNode (that runs on a slave)
- Job Tracker (runs on a master)
- Task Tracker (runs on a slave)
The MapReduce framework has been introduced by Google. According to a definition in a Google paper on MapReduce, MapReduce is, A simple and powerful interface that enables the automatic parallelisation and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.
It has basically two components: Map and Reduce. The MapReduce component is used for data analytics programming. It completely hides the details of the system from the user.
Hadoop has its own implementation of distributed file systems called Hadoop Distributed File System. It provides a set of commands just like the UNIX file and directory manipulation. One can also mount HDFS as fuse-dfs and use all the UNIX commands. The data block is generally 128 MB; hence, a 300 MB file will be split into 2 x 128 MB and 1 x 44 MB. All these split blocks will be copied ‘N’ times over clusters. N is the replication factor and is generally set to 3.
NameNode contains information regarding the block’s location as well as the information of the entire directory structure and files. It is a single point of failure in the cluster, i.e., if NameNode goes down, the whole file system goes down. Hadoop therefore also contains a secondary NameNode which contains an edit log, which in case of the failure of NameNode, can be used to replay all the actions of the file system and thus restore the state of the file system. A secondary NameNode regularly creates checkpoint images in the form of the edit log of NameNode.
DataNode runs on all the slave machines and actually stores all the data of the cluster. DataNode periodically reports to NameNode with the list of blocks stored.
Job Tracker and Task Tracker
Job Tracker runs on the master node and Task Tracker runs on slave nodes. Each Task Tracker has multiple task-instances running, and every Task Tracker reports to Job Tracker in the form of a heart beat at regular intervals, which also carries details of the current job it is executing and is idle if it has finished executing. Job Tracker schedules jobs and takes care of failed ones by re-executing them on some other nodes. Job Tracker is currently a single point of failure in the Hadoop Cluster.
The overview of the system can be seen in Figure 2.
Hadoop in action
Lets try a simple Hadoop word count example. First of all, you need to set up a Hadoop environment either in your own Linux environment or get a pre-configured virtual machine from https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads. If you are willing to configure Hadoop by yourself, refer to the famous tutorial by Michael Noll, titled ‘Running Hadoop on Ubuntu Linux (Multi-Node Cluster)’. Once you are done with the configuration of Hadoop, you can try out the example given below:
$HADOOP_HOME/bin/hadoop start-all.sh jps mkdir input nano wordsIn \\fill this file with random words and exit or copy a file into input dir in text format $HADOOP_HOME/bin/hadoop dfs -copyFromLocal ./input /user/hduser/input hadoop dfs -cat /user/hduser/input1/* \\ lists the content of file cd $HADOOP_HOME ./bin/hadoop jar hadoop-examples-1.0.4.jar wordcount /user/hduser/input /user/hduser/output ./bin/hadoop dfs -cat /user/hduser/output/* ./bin/stop-all.sh <a href="http://www.opensourceforu.efytimes.com/wp-content/uploads/2013/12/Figure-110.jpg"><img class="alignleft size-large wp-image-13319" alt="Figure-1" src="http://www.opensourceforu.efytimes.com/wp-content/uploads/2013/12/Figure-110-590x444.jpg" width="590" height="444" /></a><a href="http://www.opensourceforu.efytimes.com/wp-content/uploads/2013/12/Figure-28.jpg"><img class="alignleft size-large wp-image-13320" alt="Figure-2" src="http://www.opensourceforu.efytimes.com/wp-content/uploads/2013/12/Figure-28-590x401.jpg" width="590" height="401" /></a>
The first line in the code starts the required services for Hadoop. The jps command lets you query all the Java Virtual Machines running on your system. You should see the following services running on your system.
- Secondary NameNode
- Job Tracker
- Task Tracker
If any of the services listed above is not running, it means that your Hadoop could not start properly. In Line 3, create a local folder to be copied on to HDFS.
In Line 4, make a text file and fill it with some random text. In Line 6, list the contents of the copied file. In Line 8, run the Hadoop word count example that comes with the distribution. In Line 9, list the output of the generated result. Finally, in Line 10, stop all the Hadoop services.
This article has covered various aspects of big data, analytics and Hadoop. I have primarily focused on the Hadoop architecture and pointed out the loopholes of Hadoop in terms of fault tolerance and recovery. We have also specifically seen how NameNode and Job Tracker are bottlenecks in the system. They act as a single point of failure for the whole system. Many of the Hadoop distributions try to solve this problem of fault tolerance and recovery that is found in Hadoop Core. However, implementation of the fault tolerance and recovery algorithm creates performance issues. The thrust of the research in this field is to mitigate the performance and reliability issues to get the best of both worlds.