Hadoop, MapReduce and HDFS

The big data boom has brought many new terms and concepts to the forefront that may be well understood by IT developers but can be confusing to business analysts and other end users of big data applications. MapReduce and Hadoop are two terms that come up frequently when looking at the inner workings of big data analytics; HDFS, which stands for Hadoop Distributed File System, is a third, closely related one. Most non-IT staff will find it difficult to fully appreciate the complexity, efficiency and usability of MapReduce, Hadoop and HDFS, but a basic understanding of these techniques and technologies helps in conceptualizing the sheer magnitude of big data applications.

MapReduce

MapReduce is a technique for processing very large volumes of data. With data sets now running into terabytes and petabytes, organizations often do not have a single machine powerful enough to process such volumes in a reasonable time. And because much of today's data is unstructured, any solution has to be able to handle data from varied sources and in diverse formats. MapReduce is the distributed, parallel-processing technique originally created by Google to handle the rapid processing of large volumes of structured and unstructured data.

MapReduce removes the need for a single high-powered machine to work through a task in a linear, sequential manner. It takes a 'divide-and-conquer' approach to data processing: the input is split into smaller, manageable chunks that are distributed to different nodes, where many tasks run at the same time for faster processing (the Map component). The intermediate results from each node are then grouped, combined and consolidated into the final result (the Reduce component). The effect is a simple way of processing massive amounts of information quickly.
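
To make the Map and Reduce components concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. The class names (TokenizingMapper, SummingReducer) are illustrative choices of our own, and a runnable job would also need a small driver program like the one sketched in the Hadoop section below.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map step: each node tokenizes its chunk of the input and emits (word, 1) pairs.
    public class TokenizingMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // intermediate (word, 1) result
            }
        }
    }

    // Reduce step: the framework groups the intermediate pairs by word, and each
    // reducer consolidates the counts for the words it receives.
    class SummingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            total.set(sum);
            context.write(key, total);       // final count for this word
        }
    }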

Hadoop

Hadoop is open-source software that applies the MapReduce technique on a broader scale. It gives enterprises a more cost-effective and manageable way to put MapReduce into practice. The simplest way to understand Hadoop is as a management system that also takes a 'divide-and-conquer' approach to massive volumes of data, breaking the work up into independent tasks. It turns the concept of parallel processing into a cost-effective, scalable and flexible solution that reduces the risk of a single point of failure.

The Hadoop environment separates data processing into three layers. One layer provides a framework for applications and end-user access in the form of custom-written software, commercial software or third-party tools. The next layer manages the MapReduce workload, scheduling, launching and balancing jobs in a way that is both efficient and reliable. The last layer handles the storage of information in a specialized, efficient distributed file system, typically the Hadoop Distributed File System (HDFS).
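
As an illustration of how these layers fit together, the following driver sketch sits at the application layer: it describes a job (reusing the hypothetical mapper and reducer classes from the previous section), points at input and output paths in HDFS, and hands the work to the layer that schedules and balances the map and reduce tasks. The class name and the assumption that paths arrive as command-line arguments are ours.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            // Application layer: describe the job the end user wants to run.
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(TokenizingMapper.class);   // map step from the earlier sketch
            job.setReducerClass(SummingReducer.class);    // reduce step from the earlier sketch
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Storage layer: input and output paths live in HDFS.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Workload-management layer: Hadoop schedules, launches and balances
            // the map and reduce tasks across the cluster and reports completion.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

In practice a driver like this is packaged into a JAR file and submitted to the cluster, where the workload-management layer takes over.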

HDFS

To support large amounts of structured and unstructured data, new data storage techniques had to be developed as part of the big data challenge. The Hadoop Distributed File System (HDFS) is one such example, providing the storage capabilities needed in the Hadoop environment. It extends across all nodes in a Hadoop cluster and ensures reliability by replicating data across multiple nodes. That replication also means higher availability of data: because copies of the data sit on the nodes that process it, work can be done locally on each machine while the Hadoop cluster breaks up the overall workload.
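
The short sketch below, using Hadoop's FileSystem Java API, shows this replication at work: it writes a small file into HDFS and then asks the cluster which nodes hold copies of its blocks. The file path, file contents and cluster configuration are assumptions made purely for illustration.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReplicationDemo {
        public static void main(String[] args) throws Exception {
            // Connects to the file system named in the cluster's configuration
            // (an HDFS URI such as hdfs://namenode:8020 in core-site.xml).
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS splits files into blocks and replicates
            // each block across several nodes for reliability and availability.
            Path file = new Path("/user/demo/sample.txt");   // hypothetical path
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Report the replication factor and which hosts hold each block.
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Replication factor: " + status.getReplication());
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("Block hosts: " + String.join(", ", block.getHosts()));
            }
        }
    }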