
Dissecting Protocol Buffers

There are many ways to serialize data to formats like XML, JSON, etc. Google's Protocol Buffers (ProtoBuf) is a way of encoding structured data, and it is the de facto standard at Google for server-to-server calls.

The choice of object-serialization method depends on speed, size, compatibility, metadata integrity, platform independence, etc., which makes Protocol Buffers a good candidate, especially for Java (code can also be generated for languages like C++, Python, etc.). XML has been the standard data interchange and serialization format for most applications, and Java does have a native serialization mechanism for objects. A Protocol Buffer definition is like an IDL describing the entity or data structure: it can be considered a high-level language describing the input and output types, and the compiler-generated code hides the details of encoding/decoding from application code. Isn't this similar to CORBA or EJB? Hey, it's from Google. They have been using this binary-encoded structured data as the input, output, and writable formats for MapReduce.



Protocol buffers are advantageous because they support multiple languages (Python, Java, C++ and others), are cross-platform, flexible, and extensible, and are forward and backward compatible. They use descriptive message (i.e. entity or object) definition files (.proto files). The proto files are parsed by the provided compiler (e.g. protoc) to generate Java files based on the message definitions; this Java code contains the "builders" for creating the objects. The good thing is that the message definitions can be modified without breaking parsers built from an older, legacy version of the .proto definition. If we consider XML for persistence, it needs metadata attached to every entity, whereas with protocol buffers the description lives in the .proto file rather than in each encoded message, so the serialized data is free of such unnecessary details and smaller in size.

The entity is defined as a message. A message can refer to other messages, but they need to be defined or imported. Several data types, as well as repeated fields, are supported.
The available wire types are as follows:

Wire Type   Meaning             Used For
0           Varint              int32, int64, uint32, uint64, sint32, sint64, bool, enum
1           64-bit              fixed64, sfixed64, double
2           Length-delimited    string, bytes, embedded messages, packed repeated fields
5           32-bit              fixed32, sfixed32, float

(Wire types 3 and 4 are the deprecated start-group/end-group markers.)



It includes the concept of optional elements: fields that aren't currently needed are not included in the binary representation. This is similar to an XML schema element with minOccurs=0. XSD/DTDs provide data integrity, but I think a protocol buffer can also provide data type definitions and values within itself without compromising integrity.

A key (tag) is associated with each data value. For the first 15 fields in the structure this key is stored in 1 byte; 2 bytes are required for each key representing the 16th through the 2047th field.

Take an example. Define a proto file:

message student {
  optional int32 id = 1;
  optional string name = 2;
}


Generate the Java file with protoc, then save some data:

FileOutputStream output = new FileOutputStream("student.bin"); // example output file
Student.student.Builder builder = Student.student.newBuilder();
builder.setId(100);
builder.setName("Hary");
builder.build().writeTo(output);
output.close();
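
To read the data back, the generated class provides a parseFrom method; a minimal read-back sketch (using the same example student.bin written above):

Student.student s = Student.student.parseFrom(new FileInputStream("student.bin"));
System.out.println(s.getId() + ", " + s.getName()); // prints 100, Hary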


Then view the binary file. With a hex editor you can see the hex dump; Vim ships with xxd (xxd.exe on Windows), and running it with the -b option produces the bit-level dump walked through below.


Byte 1: The key 00001000 gives information like:
bit 1 is the varint continuation bit (0, since this is the last byte of the key),
bits 2-5, i.e. 0001, give the field number, i.e. 1,
bits 6-8, i.e. 000, give the wire type, i.e. 0, a varint such as int32 (see the table).
Byte 2:
The value 01100100 is 100.

Byte 3: The key 00010010:
bits 2-5, i.e. 0010, give the field number, i.e. 2,
bits 6-8, i.e. 010, give the wire type, i.e. 2, length-delimited, here a string (see the table).

Byte 4: The length 00000100 says the next 4 bytes are the UTF-8 string "Hary".

....
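
The key bytes in the dump can be reproduced without any protobuf library, since a key is simply (field_number << 3) | wire_type encoded as a varint; a minimal plain-Java sketch (the class name TagBytes is just for illustration):

public class TagBytes {
    // A key is (field_number << 3) | wire_type, encoded as a varint,
    // so field numbers 1-15 together with their wire type fit in one byte.
    static int tag(int fieldNumber, int wireType) {
        return (fieldNumber << 3) | wireType;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(tag(1, 0)));  // 1000     -> byte 1 above (0x08)
        System.out.println(Integer.toBinaryString(tag(2, 2)));  // 10010    -> byte 3 above (0x12)
        System.out.println(Integer.toBinaryString(100));        // 1100100  -> byte 2 above, the value 100
        System.out.println(Integer.toBinaryString(tag(16, 0))); // 10000000 = 128, so field 16 needs a 2-byte key
    }
}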


This is how the binary file is structured. Since protocol buffers come with a data-binding library, encoding/decoding is easy. Serialization is reported to be significantly faster (around 7x in one benchmark) than JSON serialization. Even unknown fields are preserved in the object.

References

ProtoBuf HomePage
http://code.google.com/apis/protocolbuffers/docs/overview.html

NetBeans IDE Plugin for code generation
http://code.google.com/p/protobuf-netbeans-plugin/

Performance Using Internet data in Android applications
http://www.ibm.com/developerworks/opensource/library/x-dataAndroid/index.html?ca=dgr-twtrAndroidPArcedth-OS

Google Protocol Buffers - the Good, the Bad and the Ugly
http://www.freshblurbs.com/google-protocol-buffers-good-bad-and-ugly

MapReduce: A Flexible Data Processing Tool
http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext

Encoding
http://code.google.com/apis/protocolbuffers/docs/encoding.html

Parallel Databases and Map Reduce

In 2007 Google handled 400 petabytes of data per month.

The engine behind it was MapReduce, and the competitors were Dryad and Hadoop.

Architects were reinventing parallel databases.

Data is partitioned across multiple disks. The parallel I/O lets relational operations like joins execute in parallel. Teradata, one of the highly parallel database machines, uses a shared-nothing architecture; Google calls this approach sharding, and it is almost infinitely scalable simply by adding nodes.

In parallel databases, horizontal partitioning divides the tuples of a relation among many disks so that each tuple resides on exactly one disk.

The strategies can be (a small sketch of all three follows the list):

a. Round robin - the i-th tuple inserted into the relation goes to disk i mod n.
b. Hashing - the tuple goes to the disk (in 0..n-1) given by a hash function h applied to the partitioning attribute value of the tuple.
c. Range partitioning - for a partitioning vector, say [3, 9], a tuple with partitioning attribute value 1 goes to disk 0, a tuple with value 6 goes to disk 1, and a tuple with value 12 goes to disk 2, and so on.
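
A minimal plain-Java sketch of the three placement strategies (class and method names are made up for illustration):

public class Partitioner {
    private final int n;              // number of disks, numbered 0..n-1
    private final int[] rangeVector;  // partitioning vector, e.g. {3, 9}
    private int insertCount = 0;

    Partitioner(int n, int[] rangeVector) {
        this.n = n;
        this.rangeVector = rangeVector;
    }

    // a. Round robin: the i-th inserted tuple goes to disk i mod n.
    int roundRobin() {
        return insertCount++ % n;
    }

    // b. Hashing: map the partitioning attribute to a disk in 0..n-1.
    int hash(Object partitioningAttribute) {
        return Math.floorMod(partitioningAttribute.hashCode(), n);
    }

    // c. Range partitioning: with vector {3, 9}, value 1 -> disk 0, 6 -> disk 1, 12 -> disk 2.
    int range(int partitioningAttribute) {
        for (int d = 0; d < rangeVector.length; d++) {
            if (partitioningAttribute < rangeVector[d]) return d;
        }
        return rangeVector.length;  // values beyond the last boundary go to the last disk
    }

    public static void main(String[] args) {
        Partitioner p = new Partitioner(3, new int[]{3, 9});
        System.out.println(p.roundRobin()); // 0
        System.out.println(p.hash("abc"));  // some disk in 0..2
        System.out.println(p.range(6));     // 1
    }
}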

In these databases, queries/transactions execute in parallel with one another. Relational queries expressed in SQL lend themselves well to parallel execution.

In a normal relational database, consider a query like this:
SELECT YEAR(date) AS year, AVG(fund)
FROM student
GROUP BY year

The execution plan is:
projection(year,fund) -> hash aggregation(year,fund).

In parallel dbs it will be,

projection(year,fund) -> partial hash aggregation(year,fund)
-> partitioning(year) -> final aggregation(year,fund).

Each operator produces a new relation, so the operators can be composed into highly parallel dataflow graphs. By streaming the output of one operator into the input of another operator, the two operators can work in series, giving pipelined parallelism. By partitioning the input data among multiple processors and memories, an operator can often be split into many independent operators, each working on a part of the data. This partitioned data and execution gives partitioned parallelism.
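
A small plain-Java sketch of the partitioned plan for the query above: each partition computes a partial (count, sum) per year, the partials are then re-partitioned (grouped) by year, and a final step merges them into AVG(fund). Class names and sample rows are made up for illustration.

import java.util.*;

public class ParallelAvg {
    // Partial aggregate state per year: running count and sum of fund.
    static class Partial { long count; double sum; }

    // Partial hash aggregation over one data partition; each row is {year, fund}.
    static Map<Integer, Partial> partialAggregate(List<int[]> partition) {
        Map<Integer, Partial> acc = new HashMap<>();
        for (int[] row : partition) {
            Partial p = acc.computeIfAbsent(row[0], y -> new Partial());
            p.count++;
            p.sum += row[1];
        }
        return acc;
    }

    public static void main(String[] args) {
        // Two data partitions, as if they lived on two disks/nodes.
        List<int[]> part1 = Arrays.asList(new int[]{2009, 100}, new int[]{2010, 200});
        List<int[]> part2 = Arrays.asList(new int[]{2009, 300}, new int[]{2010, 400});

        // Partial aggregation runs independently (in parallel) on each partition.
        List<Map<Integer, Partial>> partials =
                Arrays.asList(partialAggregate(part1), partialAggregate(part2));

        // Partitioning(year) + final aggregation: merge the partials by year.
        Map<Integer, Partial> merged = new TreeMap<>();
        for (Map<Integer, Partial> m : partials) {
            for (Map.Entry<Integer, Partial> e : m.entrySet()) {
                Partial p = merged.computeIfAbsent(e.getKey(), y -> new Partial());
                p.count += e.getValue().count;
                p.sum += e.getValue().sum;
            }
        }
        for (Map.Entry<Integer, Partial> e : merged.entrySet()) {
            System.out.println(e.getKey() + " AVG(fund) = " + e.getValue().sum / e.getValue().count);
        }
    }
}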



image source and reference: link

Parallel databases use this parallel dataflow model, which makes programming easy because data is not shared.

MapReduce can be programmed in languages like C, Java, Python and Perl and processes flat files in a filesystem, i.e. no database schema definitions are needed. This is an advantage when documents are processed. It uses a set of input key/value pairs and produces a set of output key/value pairs. The programmer writes two functions: map and reduce. Map takes an input pair and produces a set of intermediate key/value pairs. The intermediate values are then grouped together by intermediate key and passed to the reduce function. The reduce function accepts an intermediate key and a set of values for that key and merges these values into a smaller set of values; typically zero or one output value is produced per key, which keeps memory use small while handling a lot of values and makes the job scalable. link
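
A minimal plain-Java sketch of this flow (no Hadoop dependency; a toy word count over made-up input) showing map, the grouping of intermediate pairs by key, and reduce:

import java.util.*;

public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> input = Arrays.asList("a b a", "b c");

        // Map: each input record emits intermediate (word, 1) pairs.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : input) {
            for (String word : line.split("\\s+")) {
                intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Group: collect intermediate values that share the same key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce: merge each key's values into a single, smaller output value.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) sum += v;
            System.out.println(entry.getKey() + " -> " + sum); // a -> 2, b -> 2, c -> 1
        }
    }
}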

HadoopDB is a hybrid of DBMS and MapReduce technologies that targets analytical workloads and is designed to run on a shared-nothing cluster. It uses Hadoop to push data to existing database engines, which process the SQL queries. In order to push more query logic into the databases (e.g. joins), the data needs to be hash-partitioned and loaded into HDFS for processing. HadoopDB's API provides the implementations of the Map and Reduce functions. This lets HadoopDB process huge analytical databases with the scalability of the MapReduce framework and the performance of existing database engines.

refer: link

image source link