Now, the computing environment for big data has expanded to include many systems and networks. Rather than scaling up to ever more powerful individual servers, computing resources are provided by clusters of massive numbers of computing units: warehouse-scale datacenters containing thousands of servers, if not more. In this article, the MapReduce and Apache Spark frameworks will be explained. Additionally, the paper "Scaling Big Data Mining Infrastructure: The Twitter Experience" will be discussed, as it exemplifies big data analytics in practice.
MapReduce: Simplified Data Processing on Large Clusters
MapReduce is a framework for processing large data sets through two tasks: Map and Reduce. A good analogy for the process is making sandwiches. Suppose we want to make three sandwiches: ham, turkey, and Italian. In the Map phase, a list of the ingredients needed for each sandwich is gathered; during the Shuffle and Reduce phases, the total amount of each ingredient is determined. Typically, the input to the Map phase is stored in the Hadoop Distributed File System (HDFS), and the output of the Reduce phase is written back to HDFS. MapReduce routinely lets thousands of machines process terabytes of data, and it manages the data-processing details itself: issuing tasks, verifying task completion, and copying data between nodes and across clusters. The parallelization of Map tasks over the distributed input ensures that every record (key-value pair) is processed exactly once, and the partitioner similarly ensures that every intermediate record is passed to exactly one Reducer. The Shuffle collects the intermediate keys and values emitted by the Map tasks and routes them to their target Reduce tasks, each of which outputs zero or more key-value pairs. In a multistage computational workflow, Reduce output may serve as input to another Map phase, but within an individual MapReduce application the Reduce output is final. Because MapReduce processes very large data sets on thousands of machines, it must tolerate machine failures. In Hadoop, the master process reschedules a failed Map task, preferably to a machine that holds a replica of the input data; a task can be rescheduled up to four times before it is deemed to have failed. Debugging problems in such a distributed system is correspondingly troublesome. A combiner is an advantageous optional function that runs after Map but before the Shuffle phase, since it reduces the amount of data moved in the Shuffle and the computational load in the Reduce phase. It can be used for commutative and associative reduce operations such as sum or count (e.g. summing a list of sums), but averaging a list of averages would give an erroneous result. In Hadoop, MapReduce is the core processing component.
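The map-shuffle-reduce flow described above can be sketched as a toy single-process word count in plain Python. This is a conceptual simulation, not the Hadoop API; the function names and data are illustrative only.

```python
from collections import defaultdict

# Toy single-process simulation of MapReduce (not the Hadoop API):
# map -> shuffle (group by key) -> reduce, counting words.

def map_phase(record):
    # Emit a (word, 1) key-value pair for every word in one input record.
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    # Group intermediate values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Sum is commutative and associative, so a combiner could safely pre-sum
    # each mapper's output; averaging averages could NOT be combined this way.
    return key, sum(values)

records = ["ham turkey ham", "italian turkey"]
intermediate = [pair for r in records for pair in map_phase(r)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
# counts is {"ham": 2, "turkey": 2, "italian": 1}
```

A combiner would simply apply `reduce_phase` to each mapper's local output before the shuffle, which is safe here precisely because summing sums equals summing the originals.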
Apache Spark is an open-source cluster computing framework for big data workloads such as batch applications, streaming, iterative algorithms, and interactive queries, and it is well suited to processing real-time data and large data sets across the network. Compared to MapReduce, Spark performs up to 100 times faster for certain applications thanks to its in-memory primitives, and up to 10 times faster when accessing data on disk. The components of a Spark project are as follows. Spark Core and Resilient Distributed Datasets (RDDs) are the foundations that provide basic input/output and distributed task dispatching. RDDs are collections of data logically partitioned across machines, and they can contain any type of Python, Java, or Scala objects. They can be created by parallelizing an existing collection or by referencing an external dataset, and then derived from one another through operations such as map, filter, and join (transformations) or reduce (an action). RDD transformations are lazy; the DAGScheduler optimizes them for performance by organizing them into stages of execution. Spark SQL resides on top of Spark Core and introduces SchemaRDD, a new data abstraction that supports semi-structured and structured data. Spark Streaming utilizes Spark Core's fast scheduling, ingesting data in small batches and performing RDD transformations on them. Another component is the Machine Learning Library (MLlib), a distributed machine-learning framework that runs up to nine times faster than the disk-based Hadoop version of Apache Mahout. Last, but not least, is GraphX, a distributed graph-processing framework that provides an API with an optimized runtime for the Pregel abstraction; Pregel is a system for large-scale graph processing. Like RDDs, DataFrames are immutable distributed collections of data, organized into named columns, that can be filtered, grouped, or aggregated, and they are designed to make processing large datasets easier. The DataFrame API is accessible in Python, Java, Scala, and R. Spark SQL is a structured data processing module that runs both SQL queries and the DataFrame API.
The Catalyst optimizer is the core of Spark SQL: it makes optimization techniques easy to apply and the optimizer easy to extend, which helps queries run faster than their RDD counterparts. A Dataset is a distributed collection of data; a DataFrame is a Dataset organized into named columns, and it is faster to work with. In conclusion, when comparing the Hadoop ecosystem and Apache Spark, every type of data can be processed using Spark. The supported processing types are: batch processing, structured data analysis, machine learning analysis, interactive SQL analysis, and real-time analysis. Its advantages are: speed, as it supports stream processing and interactive queries; combination, as it covers and combines different workloads and processing types so that tools are easier to manage; and Hadoop support, as it can create distributed datasets from any file in the Hadoop Distributed File System (HDFS) or other supported storage systems.
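The filter-group-aggregate pattern that the DataFrame API exposes can be sketched in plain Python. This is a conceptual model only (the column names and data are invented); in real PySpark the same pipeline would be expressed as chained `filter`, `groupBy`, and `agg` calls on a DataFrame.

```python
# Plain-Python sketch of the DataFrame operations named in the text:
# filter rows, group by a column, then compute an aggregate per group.
# Hypothetical data: click counts per department.

rows = [
    {"dept": "ads",    "clicks": 10},
    {"dept": "ads",    "clicks": 30},
    {"dept": "search", "clicks": 5},
    {"dept": "search", "clicks": 2},
]

# filter: keep rows with at least 5 clicks
filtered = [r for r in rows if r["clicks"] >= 5]

# group by "dept"
grouped = {}
for r in filtered:
    grouped.setdefault(r["dept"], []).append(r["clicks"])

# aggregate: sum of clicks per department
totals = {dept: sum(clicks) for dept, clicks in grouped.items()}
# totals is {"ads": 40, "search": 5}
```

In Spark, the equivalent DataFrame pipeline would be handed to Catalyst, which rewrites it into an optimized physical plan before execution.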
A Unified Engine for Big Data Processing
Historically, no single engine covering the different processing types was found. In 2009, the Apache Spark project set out to design a unified system for distributed data processing. It can be seen as an enhanced version of MapReduce that adds a data-sharing abstraction called the Resilient Distributed Dataset (RDD). Previously, separate engines were needed to process the range of distributed workloads; now they can be run as libraries over a common engine. Using a unified API for different processing tasks is also more efficient, since data need not be written to storage and passed to another engine. A good analogy is the smartphone: it combines the functions of a camera, a cellphone, and a GPS unit, so one device serves all three purposes. RDDs are created by applying "transformations", such as map, filter, and groupBy, to the data, and Spark builds an efficient execution plan from those transformations. Another useful trait is fault tolerance: rather than relying on data replication or checkpoints, RDDs recover from failures automatically. Spark logs the dependencies between RDDs in a graph called the "lineage", and uses this lineage graph to reconstruct missing or damaged partitions, hence the name Resilient. This saves both time and storage compared with replicating data-intensive workloads. The four main libraries included are SQL and DataFrames, Spark Streaming, GraphX, and MLlib. Spark SQL provides the relational queries of analytical databases, and DataFrames provide an abstraction for tabular data; both offer programmatic methods for filtering, manipulating columns, and aggregation. The Discretized Stream (DStream) is the fundamental model of Spark Streaming: a stream of RDDs formed by splitting the input data into small batches. GraphX, as mentioned earlier, is an API for graphs and graph-parallel computation, implementing a multigraph with properties attached to vertices and edges.
MLlib is Spark's library for making machine learning more scalable and efficient, with more than 50 common algorithms for distributed data. Batch processing is commonly used on large datasets for converting raw data to a structured format and for offline ML training. Because the libraries share the same data abstractions, their processing tasks can be combined within one application. Owing to Spark's wide range of applications, companies from many different industries use it, for interactive queries, real-time stream processing, and scientific applications alike. It is a unified data-processing engine deployed in diverse environments, and the Apache Spark libraries are open source.
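The lineage-based recovery described above can be illustrated with a toy class. This is a deliberately simplified sketch (the `ToyRDD` class and its methods are hypothetical, not Spark's API): each dataset records the parent and function that produced it, so a lost result can be recomputed from lineage instead of being replicated.

```python
# Toy illustration of lineage-based fault recovery (hypothetical class,
# not Spark's API): a lost dataset is rebuilt by replaying its lineage.

class ToyRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self.data = data      # materialized contents, or None if "lost"
        self.parent = parent  # lineage: which dataset this was derived from
        self.fn = fn          # lineage: how it was derived

    def map(self, fn):
        # Derive a new dataset and record its lineage.
        return ToyRDD(data=[fn(x) for x in self.data], parent=self, fn=fn)

    def recover(self):
        # Reconstruct this dataset's contents by re-applying the recorded
        # transformation to the parent, instead of reading a replica.
        self.data = [self.fn(x) for x in self.parent.data]

base = ToyRDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
doubled.data = None   # simulate losing the computed partitions
doubled.recover()     # rebuild them from lineage: [2, 4, 6]
```

Real Spark tracks lineage at partition granularity across a whole DAG of transformations, so only the lost partitions are recomputed, but the principle is the same.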
Scaling Big Data Mining Infrastructure: The Twitter Experience¹
This paper shares the knowledge of big data mining infrastructure that Jimmy Lin, an assistant professor who spent an extended sabbatical at Twitter from 2010 to 2012, and Dmitriy Ryaboy, first a tech lead and then an engineering manager of infrastructure at Twitter, gained from their Twitter experience. The two topics described are schemas and heterogeneity. First, schemas alone are insufficient for data scientists to get an overall understanding of the data. Scribe is the log-transport mechanism Twitter uses for "aggregating high volumes of streaming log data in a robust, fault-tolerant, distributed manner". The other major challenge is the heterogeneity of the various components that must be integrated into a production workflow. As Jimmy Lin mentions in a talk, data cleaning and data munging take 80% of the time, and since this consumes the majority of data scientists' time, he shares his experience for future data scientists. He says, "Schemas aren't enough! We need a data discovery service!" The Data Access Layer (DAL) makes data easier to find, rather than reachable only through a complex path: it is a loader that knows, via metadata, how to access the data. The heterogeneous components are then synchronized by "plumbing" so the data flows smoothly; the components are wired together via different channels, and the quality of each channel must be good to prevent clogging. A successful big data analytics platform is achieved by balancing speed, efficiency, flexibility, scalability, robustness, and more.
¹ Scaling Big Data Mining Infrastructure: The Twitter Experience — Jimmy Lin and Dmitriy Ryaboy; December 2012
 Jimmy Lin’s presentation of this paper: https://www.youtube.com/watch?v=T5ZjSFnOxys