MapReduce (EXPLAINED)

  • The model described in this paper was originally designed at Google. The problem arose because an enormous amount of computation needed to be done on the huge volumes of data available on the internet.

  • Google was looking for a solution / framework such that even a non-specialist would be able to run and execute distributed computations. Let's consider an input that is split into smaller inputs, where each smaller input is computed separately.

  • Each smaller input can be thought of as a chunk of a file, or as a set of webpages from the web. Each smaller input is submitted to the map function, and these smaller inputs are computed in parallel (see the sketch below). The whole process of computation is referred to as a job.
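As a rough illustration (a local Python simulation, not Google's actual framework; all names here are made up for the sketch), a job can be mimicked by splitting one large input into chunks and applying the same map function to each chunk in parallel, with multiprocessing standing in for the cluster's worker machines:

    from multiprocessing import Pool

    def split_into_chunks(lines, n_chunks):
        # Split the input into roughly equal chunks, one per map task.
        size = max(1, len(lines) // n_chunks)
        return [lines[i:i + size] for i in range(0, len(lines), size)]

    def map_chunk(chunk):
        # Placeholder map function: count the words in one chunk.
        return sum(len(line.split()) for line in chunk)

    if __name__ == "__main__":
        lines = ["the quick brown fox", "jumps over the lazy dog"] * 1000
        chunks = split_into_chunks(lines, n_chunks=4)
        with Pool(processes=4) as pool:
            partials = pool.map(map_chunk, chunks)  # map tasks run in parallel
        print(sum(partials))  # the whole run of map tasks is one "job"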

Large-scale data can be thought of as large, dynamic datasets of information spanning the nodes of a distributed system.

Examples

  1. Google might want to know the popular searches among its users. For that, it would have to explore the search logs of user queries to identify the most searched topics, thereby making search more efficient and surfacing pertinent topics for each particular user. A log is also maintained for small-scale actions a user takes, like clicking a button on a site.

  2. Social media giants like Instagram may want to access a user's connections, and their connections in turn, to build a social graph modelling how users are linked. Netflix requires a lot of computational power (more than a single computer can provide) to transcode its video and audio into the different formats users watch.

  3. Amazon would want to track your purchase history, based on which it recommends further products. In all of these cases, the computation is broken down into several steps.


Flow of Execution

Let's say a user searches for something on google.com. The first step is crawling the web and creating the inputs for the map tasks.

Google will crawl the pertinent web pages and download their content (which may include text, images, and links).

Each piece of stored content from the relevant webpages is broken down into smaller chunks of data and provided as input to the map function. If, say, the task is to find word occurrences, the map function makes each word a key and emits a count (a 1 per occurrence) for that word in its chunk of data, keeping account of it.
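A sketch of what such a map function could look like in Python (the input is a (chunk name, chunk contents) pair, as in the examples below; the names are illustrative, not a real API):

    def word_count_map(chunk_name, chunk_text):
        # Map function for word counting: for every word in this chunk,
        # emit an intermediate (word, 1) pair.
        pairs = []
        for word in chunk_text.split():
            pairs.append((word, 1))  # key = the word, value = one occurrence
        return pairs

    # word_count_map("chunk_1", "sir yes sir")
    # -> [("sir", 1), ("yes", 1), ("sir", 1)]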

Secondly, different map functions compute the same task in parallel, and their results must then be merged. Merging collects the values of the same key from the different data chunks and combines them. For example, if the word is 'balloon' and it occurred 10 times in chunk 1 and 3 times in chunk 2, the merged result is {balloon : 13}. The step that collects together all the values of the same key is called grouping (the shuffle phase). The step where the values of the same key are merged, i.e. added up, is handled by the reduce function and is referred to as reducing: it takes all the instances of the same key from the different chunks of data. A runnable sketch of grouping and reducing follows below.

Finding word occurrences is just one of many operations that MapReduce can perform. Another job could be filtering server data. For example, to find the log files on a server that contain an error, we would take all the logs as input; a map input pair like {'log_file_input_1.txt', "INFO: NO ERROR"} has no error in its value, so no intermediate output is generated, but for a pair like {'log_file_input_12.txt', "ERROR: heap issue"} an intermediate output is emitted. Similarly, to create a graph network of backlinks, we can use webpages containing links as input: if page 1 links to pages 2, 3 and 4, the emitted pairs could look like {2,1}, {3,1}, {4,1}.

Each intermediate pair in a job is produced through a function called EMIT(), which takes two arguments: the key and the value. The reduce phase has its own emit function, which takes the value of the resulting key-value pair for output.
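Continuing the sketch from above (reusing the illustrative word_count_map), grouping and reducing for the word-count job could look like this:

    from collections import defaultdict

    def group_by_key(all_pairs):
        # Shuffle/grouping step: collect every value emitted for the same key.
        grouped = defaultdict(list)
        for key, value in all_pairs:
            grouped[key].append(value)
        return grouped  # e.g. {"balloon": [10, 3]}

    def word_count_reduce(word, counts):
        # Reduce function: merge all counts for one word into a total.
        return word, sum(counts)  # {"balloon": [10, 3]} -> ("balloon", 13)

    # Two chunks feeding one job:
    pairs = word_count_map("chunk_1", "balloon up balloon") + \
            word_count_map("chunk_2", "balloon down")
    grouped = group_by_key(pairs)
    print(dict(word_count_reduce(w, c) for w, c in grouped.items()))
    # -> {'balloon': 3, 'up': 1, 'down': 1}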

Later on, an improvement was made to this traditional approach, since it was observed that a lot of computational work was going into transferring the intermediate data from the map phase to the reduce phase. To cut this cost, a combiner function was introduced, which combines results not across all the different chunks but within its own chunk (locally). For example, for the string "sir , yes sir", the map output would be {sir:1} {yes:1} {sir:1}; rather than passing this along as the intermediate result, we can first condense it to {sir:2}, {yes:1} and then hand that to the reduce function. This local computation is done by the combiner function. A partition function is additionally used to segregate the intermediate key-value pairs among the reducers, further reducing overhead.
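A sketch of what the combiner and partition steps could look like (again illustrative Python; real systems use a deterministic hash for partitioning):

    from collections import Counter

    def combine_locally(pairs):
        # Combiner: pre-aggregate the pairs produced by ONE map task, so
        # fewer intermediate pairs travel over the network to the reducers.
        counts = Counter()
        for word, count in pairs:
            counts[word] += count
        return list(counts.items())

    def partition(key, num_reducers):
        # Partition function: pick which reducer receives this key. Hashing
        # keeps all values for the same key on the same reducer.
        return hash(key) % num_reducers

    pairs = [("sir", 1), ("yes", 1), ("sir", 1)]  # map output for "sir , yes sir"
    print(combine_locally(pairs))   # [('sir', 2), ('yes', 1)] - fewer pairs
    print(partition("sir", 4))      # a reducer index between 0 and 3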

Strengths of this approach

  • Ease of use: programmers without experience in parallel and distributed systems can use it, since it hides the details of parallelisation, fault tolerance, locality optimisation, and load balancing.

  • Scalability: vast numbers of webpages (millions) are processed with the help of thousands of machines.

  • Speed: since parallel processing occurs across multiple machines, where each machine's work is just a fraction of the total, the computation is faster.

  • Fault tolerance: a task is rescheduled on another machine if the machine running it fails.

Shortcomings

  • Not every possible operation conforms to MapReduce's flow of computation, i.e. from map to shuffling to reducing.

  • Some processes require the results of previous computations, but due to parallel processing this is not possible. Moreover, if datasets have to be joined, the workload increases. Secondly, the infrastructure of MapReduce is built for batch processing; when it comes to real-time data processing, it fails.

  • A service like Apache Kafka, on the other hand, comes to the rescue for this issue: while MapReduce waits for large chunks of data to accumulate before working on them, Kafka processes data as soon as it arrives. The master node is also a single point of failure: failure of the master fails the whole MapReduce job.

Thanks for reading :)