Parallel Databases and Map Reduce

n 2007 Google handled 400 petabytes of data per month.

It was Map Reduce. And the competitors were Dyrad and Hadoop

Architects were reinventing parallel databases.

Data is partitioned across multiple disks. The parallel I/O enables relational operations like join executed in parallel. Teradata, is one of the highly parallel database machines uses shared-nothing architecture Google called it sharding, infinitely scalable almost simply by adding nodes.

In parallel databases the horizontal partitioning is done so that the tuples of a relation are divided among many disks such that each tuple resides on one disk.

The strategies can be :

a. round robin, the ith tuple inserted in the relation to disk i mod n.
b. hashing - send to the result (ie i [1.. n-1]th disk) of hash function h applied to the partitioning attribute value of a tuple.
c. Range partitioning - for a partitioning vector [range] say [3,9], a tuple with partitioning attribute value of 1 will go to disk 0, a tuple with value 6 will go to disk 1, while a tuple with value 12 will go to disk2 etc.

In these databases, the queries/transactions execute in parallel with one another.Relational queries by SQL is good for parallel execution.

In a normal relational db, the execution plan for a query like this,
SELECT YEAR(date) AS year, AVG(fund)
FROM student

The execution plan is:
projection(year,fund) -> hash aggregation(year,fund).

In parallel dbs it will be,

projection(year,fund) -> partial hash aggregation(year,fund)
-> partitioning(year) -> final aggregation(year,fund).

Each operator produces a new relation, so the operators can be composed into highly parallel dataflow graphs. By streaming the output of one operator into the input of another operator, the two operators can work in series giving pipelined parallelism. By partitioning the input data among multiple processors and memories, an operator can often be split into many independent operators each working on a part of the data. This partitioned data and execution gives partitioned parallelism

image source and reference: link

Parallel databases uses the parallel data flow model makes programming easy as data is not shared.

MapReduce can be programmed in languages like C, Java, Python and Perl and process flat files in a filesystem, ie no need to have database schema definitions.This is an advantage when documents are processed.It uses a set of input key/value pairs, and produces a set of output key/value pairs. Programmer two functions: map and reduce. Map an input pair and produces a set of intermediate key/value pairs. Then intermediate values are grouped together based on the same intermediate key and passes them to the reduce function. The reduce function accepts the intermediate key and a set of values for that key and merges these values together to result smaller set of values, zero or one output value is produced in the process, thus helps in reducing memory and handles a lot of values making the job scalable. link

HadoopDB is a hybrid of DBMS and MapReduce technologies that targets analytical workload designed to run on a shared-nothing cluster.It uses Hadoop to push the data to existing databases engine to process the SQL queries.In order to push more query logic into databases (e.g. joins), hash-partitioning of data needs to be performed. The data is loaded into HDFS. for processing. HadoopDB's API provide the implementation of Map and Reduce functions.This makes hadoopdb to process huge analytical database with scalability using map reduce framework and performance with exisiting db engines.


image source link

No comments:

Post a Comment