Dissecting Protocol Buffers

There are many ways to serialize data, including formats such as XML and JSON. Google's Protocol Buffers (ProtoBuf, or PB) is a way of encoding structured data. It is the de facto standard at Google for server-to-server calls.

The choice of an object serialization method depends on speed, size, compatibility, metadata integrity, platform independence and so on. On these criteria Protocol Buffers is a good candidate, especially for Java (code can also be generated for languages such as C++ and Python). XML has been the standard data interchange and serialization format for most applications, and Java has its own native serialization mechanism for objects.

PB works like an IDL: it describes the entity or data structure. It can be thought of as a high-level language for describing input and output types. The compiler-generated code then hides the details of encoding and decoding from application code. Isn't this similar to CORBA or EJB? Well, this one is from Google, which has been using this binary-encoded structured data as the input, output and writable formats for MapReduce.

Protocol buffers are advantageous because they support multiple languages (e.g. Python, Java, C++ and others) and are cross-platform, flexible and extensible. They are forward and backward compatible. They use descriptive message (i.e. entity or object) definition files (.proto files). The compiler provided (e.g. protoc.exe) parses the .proto files and generates Java files based on the message definitions. The generated Java code contains the "builders" for creating objects. A good thing is that message definitions can be modified without breaking parsers generated from an older version of the .proto definition.

If we use XML for persistence, metadata such as element names has to travel with every entity. An encoded protocol buffer message, by contrast, carries only field numbers, wire types and values; the names and structure stay in the .proto file. Stripped of those unnecessary details, the data is smaller.

An entity is defined as a message. A message can refer to other messages, but they need to be defined in the same file or imported. Several data types and repeated fields are supported.
The available wire types are as follows:

Wire type   Meaning            Used for
0           Varint             int32, int64, uint32, uint64, sint32, sint64, bool, enum
1           64-bit             fixed64, sfixed64, double
2           Length-delimited   string, bytes, embedded messages, packed repeated fields
5           32-bit             fixed32, sfixed32, float

(Wire types 3 and 4 mark the start and end of groups, which are deprecated.)

It includes the concept of optional elements: fields that are not set are not included in the binary representation. This is similar to an XML schema element with minOccurs=0. XSDs and DTDs provide data integrity, but I think a protocol buffer definition can also supply data types and values by itself without compromising integrity.

A key is associated with each data value. It combines the field number with the wire type. For field numbers 1 through 15 the key is stored in 1 byte; 2 bytes are required for field numbers 16 through 2047.
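The one-byte and two-byte boundaries follow from how the key is built: the field number is shifted left by three bits, combined with the wire type, and the result is written as a varint (7 payload bits per byte). Below is a minimal Java sketch of that arithmetic. It is not the library's own code, and the names FieldKey, encodeVarint and encodeKey are made up for illustration.

```java
import java.io.ByteArrayOutputStream;

final class FieldKey {
    // Encode a value as a base-128 varint: 7 payload bits per byte,
    // least-significant group first, high bit set on every byte but the last.
    static byte[] encodeVarint(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    // The key is (field number << 3) | wire type, itself written as a varint.
    static byte[] encodeKey(int fieldNumber, int wireType) {
        return encodeVarint(((long) fieldNumber << 3) | wireType);
    }

    public static void main(String[] args) {
        int[] fieldNumbers = {1, 15, 16, 2047, 2048};
        for (int n : fieldNumbers) {
            // Prints 1, 1, 2, 2 and 3 byte(s) respectively.
            System.out.println("field " + n + " -> " + encodeKey(n, 0).length + " byte(s)");
        }
    }
}
```

With three bits taken by the wire type, a single byte leaves four bits for the field number (1 to 15), and two bytes leave eleven bits (up to 2047), so frequently used fields are best given the low numbers.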

Take an example. Define a proto file:
message student {
optional int32 id = 1;
optional string name = 2;
}

Generate the Java file with the compiler, then build an object and save the data.

Student.student.Builder builder = Student.student.newBuilder();
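What the generated builder eventually writes can be imitated by hand. The sketch below is not the protoc-generated code; it hand-encodes the two fields of the student message in plain Java, assuming id = 100 and a made-up name "Anna". The name and the identifiers StudentBytes, encode and hex are assumptions for illustration only. Unset fields are passed as null and leave no bytes behind.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class StudentBytes {
    // Base-128 varint: 7 bits per byte, high bit marks "more bytes follow".
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Hand-rolled encoding of: optional int32 id = 1; optional string name = 2;
    static byte[] encode(Integer id, String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (id != null) {
            writeVarint(out, (1 << 3) | 0);      // key: field 1, wire type 0 (varint)
            writeVarint(out, id.longValue());    // value
        }
        if (name != null) {
            byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
            writeVarint(out, (2 << 3) | 2);      // key: field 2, wire type 2 (length-delimited)
            writeVarint(out, utf8.length);       // length prefix
            out.write(utf8, 0, utf8.length);     // payload
        }
        return out.toByteArray();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(100, "Anna")));  // 08 64 12 04 41 6e 6e 61
        System.out.println(hex(encode(100, null)));    // 08 64
    }
}
```

In real code the generated builder and its writeTo method do this work; the sketch only makes the resulting bytes visible.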

Then view the binary file. With a hex editor you can see the hex dump. Vim ships with xxd (xxd.exe on Windows); running xxd -b on the file prints a binary dump.

Byte 1: the key 00001000 breaks down as:
bit 1, i.e. 0, is the continuation bit (0 means the key ends in this byte)
bits 2-5, i.e. 0001, give the field number, i.e. 1
bits 6-8, i.e. 000, give the wire type, i.e. 0 (varint, used here for int32; see the table)

Byte 2: the value 01100100 is 100

Byte 3: the key 00010010 breaks down as:
bits 2-5, i.e. 0010, give the field number, i.e. 2
bits 6-8, i.e. 010, give the wire type, i.e. 2 (length-delimited, used here for string; see the table)

Byte 4: the length 00000100 says that the next 4 bytes hold the UTF-8 character string
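The same walk over the bytes can be written as a short Java sketch. It assumes, as in this example, that every key and every string length fits in one byte (field numbers 1 to 15, strings shorter than 128 bytes). The four name bytes "Anna" are a made-up stand-in for whatever the file contains, and DumpWalker and describe are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class DumpWalker {
    // Walks the wire format by hand; this is not the protoc-generated parser.
    static String describe(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        while (pos < data.length) {
            int key = data[pos++] & 0xFF;          // assumption: one-byte key
            int fieldNumber = (key >>> 3) & 0x0F;  // bits 2-5 of the key
            int wireType = key & 0x07;             // bits 6-8 of the key
            sb.append("field ").append(fieldNumber)
              .append(", wire type ").append(wireType).append(": ");
            if (wireType == 0) {
                // Varint value: collect 7 bits per byte until the high bit is clear.
                long value = 0;
                int shift = 0;
                int b;
                do {
                    b = data[pos++] & 0xFF;
                    value |= (long) (b & 0x7F) << shift;
                    shift += 7;
                } while ((b & 0x80) != 0);
                sb.append(value);
            } else if (wireType == 2) {
                int length = data[pos++] & 0xFF;   // assumption: length below 128
                sb.append(new String(data, pos, length, StandardCharsets.UTF_8));
                pos += length;
            } else {
                throw new IllegalArgumentException("wire type not handled in this sketch: " + wireType);
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = {0x08, 0x64, 0x12, 0x04, 'A', 'n', 'n', 'a'};
        System.out.print(describe(data));
        // field 1, wire type 0: 100
        // field 2, wire type 2: Anna
    }
}
```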


This is how the binary file is structured. Because Protocol Buffers comes with a data-binding library, encoding and decoding are easy. It is also faster (about 7x) than JSON serialization. Even fields unknown to the parser can be carried along on the object.
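A reader built from an older .proto can step over fields it does not know because the wire type in each key says how many bytes the value occupies. The sketch below shows only that skipping logic, under stated assumptions. The real library goes further and retains unknown fields. The class SkippingReader, the extra field 3 and the name "Anna" are hypothetical.

```java
final class SkippingReader {
    private final byte[] data;
    private int pos;

    SkippingReader(byte[] data) {
        this.data = data;
    }

    // Reads one base-128 varint starting at the current position.
    private long readVarint() {
        long value = 0;
        int shift = 0;
        int b;
        do {
            b = data[pos++] & 0xFF;
            value |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    // Returns field 1 (id) if present, or null; every other field is skipped.
    Integer readId() {
        Integer id = null;
        while (pos < data.length) {
            long key = readVarint();
            int fieldNumber = (int) (key >>> 3);
            int wireType = (int) (key & 0x07);
            if (fieldNumber == 1 && wireType == 0) {
                id = (int) readVarint();
                continue;
            }
            switch (wireType) {
                case 0: readVarint(); break;             // varint: read and discard
                case 1: pos += 8; break;                 // 64-bit: fixed 8 bytes
                case 2: pos += (int) readVarint(); break; // length-delimited: skip payload
                case 5: pos += 4; break;                 // 32-bit: fixed 4 bytes
                default:
                    throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
        }
        return id;
    }

    public static void main(String[] args) {
        // id = 100, name = "Anna", plus a field 3 (varint 1) this reader has never heard of.
        byte[] data = {0x08, 0x64, 0x12, 0x04, 'A', 'n', 'n', 'a', 0x18, 0x01};
        System.out.println(new SkippingReader(data).readId());  // 100
    }
}
```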


ProtoBuf HomePage

NetBeans IDE Plugin for code generation

Performance Using Internet data in Android applications

Google Protocol Buffers - the Good, the Bad and the Ugly
MapReduce: A Flexible Data Processing Tool

