
Research Article

Big Data Processing: Advanced MapReduce Algorithm


Abstract
Processing large volumes of data is a well-known challenge in the realm of Big Data. One prevalent strategy for addressing it has been the MapReduce algorithm [4], which facilitates the parallel processing of partitioned datasets across distributed systems. However, data processing requirements have become increasingly complex in today's world: many include conditional aspects that add layers of complexity to handling extensive datasets. Consequently, while the MapReduce algorithm's ability to process data in parallel remains valuable, its effectiveness can be constrained when complex conditional processing is involved.

The Advanced MapReduce Algorithm presents a promising avenue for enhancing computing efficiency. This approach allows for handling complex data operations and effectively addresses some limitations of traditional MapReduce algorithms. By minimizing processing time, the algorithm builds upon its predecessor's strengths and aligns more closely with the sophisticated needs of conditional data processing challenges.

Keywords: Big data, Cloud computing, MapReduce, Data processing

1. Introduction
The MapReduce algorithm uses a simple and effective methodology within a computing cluster [3]. It harnesses the power of distributed computing by breaking enormous datasets into smaller, manageable chunks, which are then processed across multiple computers (nodes) in the cluster. A designated node later consolidates the individual results from these nodes into a single final result. However, the algorithm's efficiency may decline when processing involves specific conditions, for example, when one large dataset must be selected from a pool of several massive datasets based on the results of processing another large dataset. In such cases, the MapReduce algorithm must complete the first computation before it can proceed to the second part, which selects the dataset based on the results of the initial computation.

The Advanced MapReduce Algorithm is highly beneficial in this context. Rather than utilizing all available nodes, it assigns the initial computation to a specific set of nodes and leverages the remaining nodes to begin processing the second part of the computation in parallel. The algorithm selects a few datasets that are likely to be chosen based on the result of the first computation, or it may use all available datasets, depending on node availability within the cluster [3] and the size of the datasets. This approach fully utilizes the computing cluster's power to solve the computational problem more quickly than the standard MapReduce algorithm alone.

2. Map Reduce Algorithm
The MapReduce algorithm is an excellent technique in cluster computing. It divides large datasets into smaller subsets for processing, allowing different nodes in the cluster to compute these subsets simultaneously. This approach enhances efficiency and speeds up data processing. Figure 1 demonstrates how the MapReduce algorithm works.

Figure 1: MapReduce Algorithm.

Figure 1 shows that the Master Node divides the computation into smaller datasets and distributes them to four selected nodes for processing. This process is known as mapping. After the nodes complete the mapping function, another node gathers and processes the results through the reduce function. Let's represent this in an algorithm.

Consider the example of counting the occurrences of a single keyword in a large collection of documents [1]. The algorithm might look like this:

map(String key, String val)
    int count = 0
    for each word w in val
        if w == key then count = count + 1
    emit(count)
end

reduce(Array counts)
    int sum = 0
    for each val in counts
        sum = sum + val
    emit(sum)
end
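As a concrete illustration, the pseudocode above can be sketched in Python. This is a minimal sketch, not a distributed implementation: a `multiprocessing` pool stands in for the worker nodes, and the keyword, pool size, and sample documents are illustrative choices, not part of the original algorithm.

```python
from multiprocessing import Pool

# Map step: count occurrences of the keyword in one document chunk.
def map_count(args):
    keyword, text = args
    return sum(1 for word in text.split() if word == keyword)

# Reduce step: sum the per-chunk counts into a single total.
def reduce_counts(counts):
    return sum(counts)

if __name__ == "__main__":
    keyword = "data"
    documents = [
        "big data needs parallel processing",
        "data is split into chunks and each chunk maps data",
        "reduce combines partial results",
    ]
    # Each document plays the role of one chunk handed to a worker node.
    with Pool(processes=3) as pool:
        partial = pool.map(map_count, [(keyword, d) for d in documents])
    print(reduce_counts(partial))  # prints 3
```

Here the pool's `map` call mirrors the map phase, and `reduce_counts` mirrors the reduce phase performed by the consolidating node.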

This approach works well with large datasets that follow a uniform processing pattern. However, when dealing with large datasets that require conditional processing, the computing speed of the MapReduce algorithm may decrease. This slowdown occurs because the processing of specific datasets may depend on the completion of other datasets. In such cases, the Advanced MapReduce Algorithm can provide significant assistance.

3. Advanced MapReduce Algorithm
Figure 2 illustrates how the Advanced MapReduce Algorithm [4] operates. In addition to the Master node and worker nodes, this algorithm includes an intelligent node that employs a data-specific algorithm or machine learning model to determine which dataset to select for conditional operations.

Figure 2: Advanced MapReduce Algorithm.

As shown in Figure 2, two major parallel processing tasks are represented by black and orange lines.
· The first task (marked in black) involves the initial data processing. It divides the large dataset into smaller datasets and supplies them to the four worker nodes for computation (represented in green). After the computations are completed, the worker node executing the reduce function combines the results into an intermediate result.
· The second major task (highlighted in orange) is initiated simultaneously with the first by the Smart Node. The Smart Node selects one or more datasets that are potential candidates for processing under the complex conditions. It then divides the chosen datasets and supplies them to another set of computation nodes for processing. The results from these nodes, together with the intermediate result from Task 1, are used by a worker node to derive the final outcome through a final reduce function.

Extending the previous example [1], the Advanced MapReduce Algorithm looks like this:

T1map(String key, String val)
    int T1count = 0
    for each word w in val
        if w == key then T1count = T1count + 1
    emit(T1count)
end

T1reduce(Array T1counts)
    int T1sum = 0
    for each T1val in T1counts
        T1sum = T1sum + T1val
    emit(T1sum)
end

T2map(String key1, String val)
    int T2count = 0
    for each word w in val
        if w == key1 then T2count = T2count + 1
    emit(T2count)
end

reduce(Array T2counts, int T1sum)
    int T2sum = 0
    for each T2val in T2counts
        T2sum = T2sum + T2val
    finalResult = MLProcess(T1sum, T2sum)
    emit(finalResult)
end
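A minimal Python sketch of this two-task scheme follows. It is illustrative only: a thread pool stands in for the two groups of worker nodes, the candidate dataset is assumed to have been chosen by the Smart Node beforehand, and `ml_process` is a hypothetical placeholder for the MLProcess step, since the original does not specify how the two results are combined.

```python
from concurrent.futures import ThreadPoolExecutor

# Task 1: count `key` across the primary dataset's chunks (map + reduce).
def task1(key, chunks):
    return sum(chunk.split().count(key) for chunk in chunks)

# Task 2: speculatively count `key1` in a candidate dataset chosen up front.
def task2(key1, chunks):
    return sum(chunk.split().count(key1) for chunk in chunks)

# Stand-in for MLProcess: combine the intermediate and speculative results.
def ml_process(t1_sum, t2_sum):
    return t1_sum + t2_sum  # placeholder combination rule

def advanced_map_reduce(key, primary, key1, candidate):
    # Both tasks are submitted at once, so Task 2 does not wait for Task 1.
    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(task1, key, primary)
        f2 = pool.submit(task2, key1, candidate)
        return ml_process(f1.result(), f2.result())
```

The key point the sketch captures is scheduling: both `submit` calls happen before either `result()` is awaited, mirroring how the Smart Node lets Task 2 start concurrently with Task 1.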

4. Introducing Smart Node in Advanced MapReduce Algorithm
The Smart Node plays a crucial role in the Advanced MapReduce algorithm. It is designed to use algorithms that identify datasets based on application-specific criteria and can be implemented more efficiently using AI or machine learning models. These models predict which datasets will be needed for Task 2, allowing Task 2 to begin concurrently with Task 1 and facilitating effective parallel processing of specific datasets under complex conditions.
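One possible Smart Node heuristic can be sketched in Python: score each candidate dataset by how often the conditional keyword appears in a small sample of it, then pick the top candidates for speculative Task 2 processing. The function name, the prefix-sampling scheme, and the frequency-based scoring rule are all assumptions for illustration; a real Smart Node could instead use a trained machine learning model, as the text describes.

```python
# Hypothetical Smart Node heuristic: rank candidate datasets by keyword
# frequency in a small sample of each, and return the top-k names as the
# datasets to load speculatively for Task 2.
def select_candidates(keyword, datasets, sample_words=50, k=1):
    def score(item):
        _, text = item
        sample = text.split()[:sample_words]  # inspect only a short prefix
        return sample.count(keyword)
    ranked = sorted(datasets.items(), key=score, reverse=True)
    return [name for name, _ in ranked[:k]]
```

Sampling only a prefix keeps the selection step cheap relative to the full computation, which matters because this prediction runs before Task 1 has produced any result.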

5. Challenges with the Advanced MapReduce Algorithm
The Advanced MapReduce Algorithm is undoubtedly an intelligent and effective technique for processing datasets under complex conditions. However, it also presents some challenges, which are stated below:
· Complex conditions and large datasets may necessitate a large number of nodes, as the cluster [3] creates an execution plan to process Task 1 and Task 2 in parallel. The increased demand for nodes in the Advanced MapReduce Algorithm is therefore unsurprising.
· Sometimes the model or algorithms may identify multiple datasets that qualify for Task 2 processing. When this occurs, the Smart Node loads all eligible datasets onto the worker nodes to begin Task 2. This approach can be costly, as the results from some of these nodes may not contribute to the final result.

6. Conclusion
The Advanced MapReduce Algorithm significantly enhances the conventional MapReduce framework, particularly addressing the intricacies associated with complex conditional data processing. Such complexities can considerably diminish the algorithm's operational efficiency, resulting in increased latency. 

At the core of this advanced algorithm is the Smart Node, an intelligent component that employs predictive algorithms or trained artificial intelligence or machine learning models. The Smart Node plays a crucial role by facilitating predictive decision-making for conditional operations within the data processing sequence. By anticipating the outcomes of various conditions, the Smart Node allows for simultaneous processing alongside initial tasks, optimizing resource utilization and accelerating overall execution timelines. This ability for parallel processing enhances performance and significantly increases the algorithm's capacity to manage large and complex datasets, ultimately leading to improved effectiveness in challenging data environments.

7. References

1. Dean J and Ghemawat S. "MapReduce: Simplified Data Processing on Large Clusters." https://dl.acm.org/doi/10.1145/1327452.1327492
2. https://www.sciencedirect.com/science/article/abs/pii/0167819188901093?via%3Dihub
3. Baker M and Buyya R. "Cluster Computing at a Glance." High Performance Cluster Computing: Architectures and Systems.
4. Alam A and Ahmed J. "Hadoop Architecture and Its Issues." 2014 International Conference on Computational Science and Computational Intelligence, Las Vegas, NV, USA, 2014: 288-291.