The most common use case for Hadoop is processing large volumes of data across many machines. For example, let's say we end up with 500 million transactions. Each transaction contains a bunch of information such as the date and time, item purchased, location, customer information, payment method, weather, humidity, the top headlines that day, and so on.
The first obvious thought you may have is: why not just use a relational database (RDB)? Conventionally, relational databases have been used to store this kind of data. However, you are going to run into many problems. The most obvious one is cost: as the volume of data grows, the cost of scaling a relational database rises steeply.
The second problem you will need to address is unstructured data. How do we store images, videos, and the like in an RDB?
This is where technologies such as Hadoop and NoSQL come in.
What is Hadoop all about?
At its core, Hadoop is about processing large volumes of data with many machines. You start by splitting your data so that every machine gets a subset of it. Imagine we have a data set of 500 million transactions and five machines: we would split the data into five subsets of 100 million records each and hand one subset to each machine.
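As a rough illustration, here is a minimal Python sketch (not Hadoop itself) of splitting a list of transactions into equal-sized subsets, one per machine. The transaction fields and the machine count are assumptions made for the example:

```python
# Minimal sketch: split a list of transactions into one subset per machine.
# The field names and the number of machines are illustrative assumptions.

def split_into_subsets(transactions, num_machines=5):
    """Divide the transactions list into num_machines roughly equal chunks."""
    chunk_size = (len(transactions) + num_machines - 1) // num_machines
    return [transactions[i:i + chunk_size]
            for i in range(0, len(transactions), chunk_size)]

# Example: 500 records split across 5 "machines" of 100 records each.
transactions = [{"date": "2021-01-01", "sales": 19.99} for _ in range(500)]
subsets = split_into_subsets(transactions, num_machines=5)
print([len(s) for s in subsets])  # [100, 100, 100, 100, 100]
```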
Question: from the 500 million transactions, what are the average sales per day?
To answer this question, the data is split into subsets and sent to individual machines. Each machine then takes the records it holds, picks out the specific fields it needs to evaluate, and performs some computation on them. In technical jargon, this is called Extract, Transform and Load (ETL).
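To make this concrete, here is a minimal Python sketch (not actual Hadoop code) of the idea: each "machine" extracts the date and sales amount from its subset and computes partial totals, and a final step merges those partials to get the average sales per day. The field names are assumptions for the example.

```python
from collections import defaultdict

def partial_totals(subset):
    """Per-machine step: sum sales and count transactions for each day."""
    totals = defaultdict(lambda: [0.0, 0])   # date -> [sum_of_sales, count]
    for txn in subset:
        totals[txn["date"]][0] += txn["sales"]
        totals[txn["date"]][1] += 1
    return totals

def average_sales_per_day(partials):
    """Final step: merge partial totals from all machines, then divide."""
    merged = defaultdict(lambda: [0.0, 0])
    for totals in partials:
        for day, (sales_sum, count) in totals.items():
            merged[day][0] += sales_sum
            merged[day][1] += count
    return {day: sales_sum / count
            for day, (sales_sum, count) in merged.items()}

# Usage: two "machines", each holding a small subset of transactions.
subset_a = [{"date": "2021-01-01", "sales": 10.0},
            {"date": "2021-01-02", "sales": 30.0}]
subset_b = [{"date": "2021-01-01", "sales": 20.0}]
partials = [partial_totals(subset_a), partial_totals(subset_b)]
print(average_sales_per_day(partials))  # {'2021-01-01': 15.0, '2021-01-02': 30.0}
```

Note how each machine only ever sees its own subset; the merge step is the only place where results from different machines meet.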
Sometimes it becomes necessary to shuffle, or move, data across different machines to arrange it in some order, for instance moving the sales column before the tax column. This is called shuffling, where you move the intermediate results between machines, as sketched below.
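One common form of shuffling is routing intermediate results by key, so that all records with the same key end up on the same machine before the final computation. The key-based interpretation, the hash routing, and the machine count below are assumptions made for this sketch:

```python
# Minimal sketch of shuffling: route each intermediate (key, value) pair
# to a machine chosen from the key, so matching keys land together.
# Hash-based routing and the machine count are illustrative assumptions.

def shuffle(intermediate_results, num_machines=5):
    """Assign each (key, value) pair to one of num_machines buckets by key."""
    buckets = [[] for _ in range(num_machines)]
    for key, value in intermediate_results:
        buckets[hash(key) % num_machines].append((key, value))
    return buckets

# Example: daily sales pairs produced by different machines.
pairs = [("2021-01-01", 10.0), ("2021-01-02", 30.0), ("2021-01-01", 20.0)]
for machine_id, bucket in enumerate(shuffle(pairs)):
    if bucket:
        print(machine_id, bucket)
```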