Hadoop and MapReduce – Simplified explanation
The most common use case for Hadoop is processing large volumes of data across many machines. For example, let's say we end up with 500 million transactions. Each transaction contains a bunch of information such as date and time, item purchased, location, customer information, payment method, weather, humidity, the top headlines that day, and so on.
The first obvious thought you may have is: why not just use a relational database (RDB)? Conventionally, relational databases have been used to store data. However, you are going to run into many problems. The most obvious one is cost: as the volume of data grows, the cost of scaling a relational database rises steeply.
The second problem you will need to address is dealing with unstructured data. How do we store images, videos, and the like in an RDB?
This is where technologies such as Hadoop and NoSQL come in.
What is Hadoop all about?
At its core, Hadoop is about processing large volumes of data with many machines. You start by splitting your data so that every machine gets a subset. Imagine we have a data set of 500 million transactions and 5 machines. We would split the data into 5 subsets of 100 million records each and pass one subset to each machine.
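The splitting step can be sketched in plain JavaScript. This is only an illustration of the idea, not how Hadoop itself partitions data; the chunk size and the toy transaction records are assumptions made up for the example.

```javascript
// Illustrative sketch: partition an array of records into roughly
// equal chunks, one chunk per worker machine.
function splitIntoChunks(records, numWorkers) {
  const chunkSize = Math.ceil(records.length / numWorkers);
  const chunks = [];
  for (let i = 0; i < records.length; i += chunkSize) {
    chunks.push(records.slice(i, i + chunkSize));
  }
  return chunks;
}

// 10 toy transactions split across 5 "machines" gives 5 chunks of 2
const transactions = Array.from({ length: 10 }, (_, i) => ({ id: i }));
const chunks = splitIntoChunks(transactions, 5);
```

In real Hadoop this partitioning is done by HDFS, which stores the file in blocks spread across the cluster.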
Question: from the 500 million transactions, what is the average sales per day?
To answer this question, the data is split into subsets and sent to individual machines. We now need to take the records available on each machine, identify the specific fields to evaluate, and do some computation. In technical jargon, this is called Extract, Transform and Load (ETL).
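Here is a minimal single-machine sketch of the extract-and-compute step for the average-sales-per-day question. The record shape and field names (date, amount) are assumptions for illustration, not part of any Hadoop API.

```javascript
// Hypothetical transaction records; only date and amount matter here
const records = [
  { date: "2023-01-01", amount: 10 },
  { date: "2023-01-01", amount: 30 },
  { date: "2023-01-02", amount: 20 },
];

// Extract: pull out just the sales amounts, grouped by day
const byDay = {};
for (const { date, amount } of records) {
  (byDay[date] = byDay[date] || []).push(amount);
}

// Transform: compute the average sales for each day
const averages = Object.fromEntries(
  Object.entries(byDay).map(([day, amounts]) => [
    day,
    amounts.reduce((sum, a) => sum + a, 0) / amounts.length,
  ])
);
```

In a real cluster, each machine would do this over its own 100 million records, producing per-day partial results that are combined later.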
Sometimes it becomes necessary to shuffle, or move, data across different machines so that related records end up together, for example so that all records for the same day land on the same machine. This is called Shuffling: intermediate results are moved between machines.
The next step is to aggregate the intermediate results. Results from the different machines are aggregated and stored.
This overall approach allows us to process more data than can be handled on a single machine. In a way, this is similar to the divide-and-conquer approach to processing data.
At the heart of Hadoop technology is the MapReduce algorithm. It is a combination of two commonly used functions, map and reduce, which are common to most modern high-level languages.
The reduce function in ES6 is used when you want to compute a cumulative or concatenated value from the elements of an array. Like map(), it traverses the array from left to right, invoking a callback function on each element. The value returned is the cumulative value passed from callback to callback. After all elements have been traversed, reduce() returns the cumulative value.
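The description above can be made concrete with a short ES6 example; the arrays here are made up for illustration.

```javascript
// reduce() folds an array into one value.
// The callback receives the running total and the current element.
const prices = [5, 10, 15];
const total = prices.reduce((sum, price) => sum + price, 0);

// It works for concatenation too: with no initial value, the first
// element becomes the starting accumulator.
const joined = ["map", "shuffle", "reduce"].reduce((acc, w) => acc + "-" + w);
```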
In our example of processing 500 million transactions, the extract layer corresponds to the "Map" portion of the algorithm, and the aggregate layer corresponds to the "Reduce" portion.
Illustration of how MapReduce on Hadoop works
Imagine we have a very large file that contains records of items sold in a super market. Each line in the text file represents a given day.
What we would like to find out is how many GE bulbs, how many Hershey's bars, how many perfumes, etc., were sold overall.
The first thing to do is break the file into lines. Then we extract each of the items and generate key-value pairs, the value being the number of times the item appears in a line. This is the "Map" element of the algorithm, wherein we step through each item.
The next stage of processing is called the Shuffle. We move the entries so that all similar items end up in one place: all the Hershey's bars in one place, all of Steph's perfume in another, and so on.
The final step is to aggregate each set. This is called Reduce. We find that we have 3 Hershey's bars, 1 Colgate brush, 3 of Cleo's food, 3 of Steph's perfume and 2 Samsung TVs.
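The three phases above can be sketched on a single machine in JavaScript. The sales lines are invented to match the tallies just described; a real Hadoop job would run each phase across many machines.

```javascript
// Toy input: each string is one day's sales, items separated by "|"
const lines = [
  "Hershey's bar|Colgate brush|Cleo's food",
  "Hershey's bar|Steph's perfume|Cleo's food|Samsung TV",
  "Hershey's bar|Steph's perfume|Cleo's food|Steph's perfume|Samsung TV",
];

// Map: emit an (item, 1) pair for every item on every line
const pairs = lines.flatMap((line) =>
  line.split("|").map((item) => [item, 1])
);

// Shuffle: group pairs by key so each item's 1s sit together
const groups = {};
for (const [item, one] of pairs) {
  (groups[item] = groups[item] || []).push(one);
}

// Reduce: sum each group to get the final tally per item
const counts = Object.fromEntries(
  Object.entries(groups).map(([item, ones]) => [
    item,
    ones.reduce((a, b) => a + b, 0),
  ])
);
```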
Generally speaking, the Map and Reduce parts need to be programmed by us. The Shuffle part is handled by the Hadoop system, which associates a key with each piece of data and moves all items with the same key to one place.
In the Hadoop ecosystem, this translates to all the Hershey's bars being processed on one computer, all of Cleo's food on another, and so on.
In the Reduce step, the results from the different machines are collected and stored in a separate file. MapReduce is often described as a brute-force approach spread across multiple machines.
Let's summarize the key takeaways of the MapReduce algorithm:
Data is split into pieces
Individual machines (worker nodes) process the pieces in parallel
Each individual machine stores the result locally
Data is aggregated by individual machines
Aggregation is parallelized
With Hadoop, programmers can focus on just the Map and Reduce parts through an API or a high-level programming language. The Hadoop ecosystem takes care of fault tolerance, assigning machines to map and reduce tasks, moving processes, shuffling, and dealing with errors.