MapReduce in Big Data Analytics: Introduction and Origin

Big Data Analytics | MapReduce: In this tutorial, we will learn what is MapReduce in Big Data Analytics, its introduction, and its origin. By IncludeHelp Last updated : June 28, 2023

What is MapReduce in Big Data Analytics?

MapReduce is a programming model and a framework used to process large amounts of data. MapReduce is a combination of two different terms Map and Reduce; Map means splitting and mapping of data while Reduce shuffles and reduces the data. MapReduce is a key component of the Hadoop ecosystem which is a platform for improving big data processing. Other key components of the Hadoop ecosystem are HDFS, Yarn, and Apache Pig are other components of Apache Hadoop.

In the Hadoop ecosystem, the MapReduce component improves large data processing by using scattered and parallel computing methods. This approach is used to analyze a large amount of data generated from internet users in social platforms and e-commerce.

Origin of MapReduce

Doug Cutting[1] and Mike Cafarella[2] started Hadoop in 2002 when they were both working on the Apache Nutch project. During this time, a search engine was built that can gather 1 billion pages. They concluded that the hardware for such a system would cost about $500,000 and that it would cost about $30,000 per month to run. They were looking for a feasible solution that could cut down the cost of execution and make it easier to store and process big datasets.

In 2003, they found a paper that explained the architecture of Google's GFS (Google File System)[3], which was used to store large data sets. The paper was released by Google. This paper had given the idea to solve the problem of where to store the large files that were being made by web crawling and searching. This paper only helped them to solve half of their problem.

In 2004, Google published another study about the method MapReduce, which was a way to manage big data sets. These two methods were not used by Google. From his work on Apache Lucene, Doug Cutting knew that open-source is a great way to get technology to connect more people. He started Google's methods (GFS and MapReduce) into the Apache Nutch project as open-source code with Mike Cafarella.

In 2005, Cutting found out that Nutch can only have 20–40 nodes in a cluster. Soon, he saw that there were two problems: Nutch wouldn't be able to reach its full potential until it worked consistently on larger groups, and it was impossible to do this work with just Doug Cutting and Mike Cafarella.

To implement their idea; Doug Cutting and the Nutch project both joined Yahoo in 2006. With the help of Yahoo[4], they wanted to share an idea with the whole world as an open-source, reliable, and scalable computing platform. So, they first took the distributed computer parts of Nutch and put them into a new project called Hadoop at Yahoo.

Hadoop was named after a yellow toy elephant that belonged to the son of Doug Cutting. It was a unique word that was easy to say. He wanted Hadoop to work well on thousands of computers, or "nodes." So, they began to work on Hadoop with GFS and MapReduce.

In 2007, Yahoo tried Hadoop on a collection of 1,000 computers and found that it worked well.

In January 2008, Yahoo gave Hadoop to the ASF (Apache Software Foundation) as an open-source project. And in July 2008, the Apache Software Foundation tried Hadoop with a 4000-node cluster and found it to work well.

Hadoop was successfully tested in 2009 to sort a PB (PetaByte) of data in less than 17 hours. This was enough time to manage billions of searches and index millions of websites.

After this; Doug Cutting left Yahoo and joined Cloudera to take the initiative to make Hadoop public and popular with other businesses and industries.

In December 2011, the Apache Software Foundation put out version 1.0 of Apache Hadoop.

In August 2013, Version 2.0.6 became available.

As time goes on, new versions with updated features of Apache Hadoop are launching in the market.

Notes and References

  1. Doug Cutting is a software designer, advocate, and creator of open-source search technology. He founded two technology projects, Lucene, and Nutch, with Mike Cafarella. Both projects are now managed through the Apache Software Foundation. Cutting and Cafarella are also the co-founders of Apache Hadoop.
  2. Mike Cafarella is a computer scientist specializing in database management systems. He is a principal research scientist of computer science at MIT Computer Science and Artificial Intelligence Laboratory. Along with Doug Cutting, he is one of the original co-founders of the Hadoop and Nutch open-source projects.
  3. [pdf] Google's GFS (Google File System) - The Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
  4. Yahoo! Inc. (1995–2017) The original incarnation of Yahoo! Inc. was an American multinational technology company headquartered in Sunnyvale, California. Yahoo was founded by Jerry Yang and David Filo in January 1994 and was incorporated on March 2, 1995. Yahoo was one of the pioneers of the early internet era in the 1990s.

Comments and Discussions!

Load comments ↻

Copyright © 2024 All rights reserved.