Using Avro to serialize logs in log4j


I have previously written about the serialization mechanism of Protocol Buffers. Apache Avro is a similar serialization framework, with some notable advantages.

It provides features such as:

 - Independent schemas - different schemas can be used for serialization and deserialization
 - Binary serialization - compact data encoding and faster data processing
 - Dynamic typing - serialization and deserialization without code generation

 We can encode data with Avro in one of two formats: binary or JSON. In the binary format, the schema is included at the beginning of the file; in the JSON format, the type is defined along with the data. Switching from the JSON protocol to the binary format to achieve better performance is straightforward with Avro. Less type information needs to be sent with the data, and because the data is stored together with its schema, any program can deserialize the encoded data. This makes Avro a good candidate for RPC.

 In Avro 1.5 we have to obtain encoders from a factory (this differs from previous versions, which constructed encoders directly, without a factory):
 - org.apache.avro.io.EncoderFactory.binaryEncoder(OutputStream out, BinaryEncoder reuse) for binary
 - org.apache.avro.io.EncoderFactory.jsonEncoder(Schema schema, OutputStream out) for JSON

 The values (of Avro-supported types) are put against the schema field names, as a set of name-value pairs, in a GenericData.Record.

 The Avro-supported value types are:
  - Primitive types - null, boolean, int, long, float, double, bytes, string
  - Complex types - records, enums, arrays, maps, unions, fixed

  You can read more about them here.

  A schema definition must be provided for the record instance. To read and write data, just use the put/get methods.
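Putting the pieces together, here is a minimal sketch of building a record and serializing it with the Avro 1.5 encoder factory. The schema and field names are illustrative, not taken from any particular project:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroEncodeSketch {
    public static void main(String[] args) throws IOException {
        // Parse an inline schema definition (illustrative record/field names).
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"LogEntry\",\"fields\":["
            + "{\"name\":\"level\",\"type\":\"string\"},"
            + "{\"name\":\"message\",\"type\":\"string\"}]}");

        // Values are put against the schema field names.
        GenericRecord record = new GenericData.Record(schema);
        record.put("level", "INFO");
        record.put("message", "application started");

        // Binary encoding via the Avro 1.5 factory; passing null for the
        // reuse parameter creates a fresh encoder.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();

        System.out.println(out.size() + " bytes written");
    }
}
```

Swapping `binaryEncoder(out, null)` for `EncoderFactory.get().jsonEncoder(schema, out)` is all it takes to switch to the JSON encoding.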
 
   I have used this serialization mechanism to implement a layout for log4j, so that log entries are serialized with Avro.

The GitHub project is here - https://github.com/harisgx/avro-log4j
 
   Add the libraries to your project and add the new properties to log4j.properties:

   log4j.appender.logger_name.layout=com.avrolog.log4j.layout.AvroLogLayout
   log4j.appender.logger_name.layout.Type=json
   log4j.appender.logger_name.layout.MDCKeys=mdcKey
 
 Provide the MDC keys as comma-separated values.
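In context, a complete appender configuration might look like the following sketch; the appender name, file name, and MDC keys here are illustrative:

```properties
# Illustrative appender wiring; only the three layout.* lines are
# specific to avro-log4j.
log4j.rootLogger=INFO, avrofile
log4j.appender.avrofile=org.apache.log4j.FileAppender
log4j.appender.avrofile.File=application.avro.log
log4j.appender.avrofile.layout=com.avrolog.log4j.layout.AvroLogLayout
log4j.appender.avrofile.layout.Type=json
log4j.appender.avrofile.layout.MDCKeys=user,requestId
```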
 
 
   This is the schema:
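The exact schema ships with the avro-log4j project; as a rough sketch of what a log-event schema of this kind can look like (the record and field names below are illustrative assumptions, not necessarily the project's actual definition):

```json
{
  "type": "record",
  "name": "LogEvent",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "level", "type": "string"},
    {"name": "logger", "type": "string"},
    {"name": "message", "type": "string"},
    {"name": "mdc", "type": {"type": "map", "values": "string"}}
  ]
}
```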


 
 

Bloom Filters

A Bloom filter is a probabilistic data structure that stores a set of elements in a space-efficient manner. For example, Squid's Cache Digests are Bloom-filter summaries of each node's cache contents, shared between the nodes so that each has a global image of what the others hold.

The data structure supports membership queries, i.e. checkIfDataPresentInStore(). If an element has already been inserted into the filter, the check will return true; there are no false negatives. However, an element that was never inserted may also return true, a false positive. That element can then be checked against the original store, so the overhead is proportional to the false-positive rate. This differs from a dictionary, in which a hit or miss is deterministic.

For a set of n elements, a Bloom filter is a bit vector of size m, with all bits initially set to 0. To insert an element e, each of k hash functions returns an index into the vector, and the k corresponding bits are set to 1. When a membership query is executed, the same k positions are computed and checked: if all of them are set, the queried element is possibly present in the store; if any bit is 0, it is definitely not present. The whole filter occupies only m bits no matter how many elements are inserted, which is where the space efficiency comes from; the price is a false-positive probability of roughly (1 - e^(-kn/m))^k. Using different, independent hash functions results in fewer collisions.
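The mechanics above can be sketched in Java. The class and method names are illustrative, and instead of k truly independent hash functions the k indices are derived from the element's hashCode via double hashing, a common practical shortcut:

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: an m-bit vector and k derived hash functions.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;   // size of the bit vector
    private final int k;   // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive the i-th index from two base hashes (double hashing):
    // index_i = (h1 + i * h2) mod m.
    private int index(Object element, int i) {
        int h1 = element.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, m);
    }

    // Set the k bits corresponding to the element.
    public void add(Object element) {
        for (int i = 0; i < k; i++) {
            bits.set(index(element, i));
        }
    }

    // true  -> possibly present (false positives are possible)
    // false -> definitely not present (no false negatives)
    public boolean mightContain(Object element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(element, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        filter.add("hello");
        System.out.println(filter.mightContain("hello")); // inserted: true
        System.out.println(filter.mightContain("world")); // not inserted
    }
}
```

Note that membership queries never touch the stored elements themselves; only the bit vector is consulted, which is why deletion is not supported in this basic form.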


Uses
  • Design a spell checker
  • Database join implementation (Oracle)  
  • Peer to peer (P2P) communication and routing  
  • In HBase, the Bloom filter is stored as a meta block in the HFile. When an HFile is opened, the Bloom filter is loaded into memory and used to determine whether a given key is in that store file, which can avoid scanning the region for the key. 
  • and more

I found a Java implementation here.
Cassandra's Java implementation is here.

Reference

http://en.wikipedia.org/wiki/Bloom_filter
https://issues.apache.org/jira/browse/HBASE-1200
http://wiki.squid-cache.org/SquidFaq/CacheDigests
http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf

Labs

avro-log4j   -  a layout for log4j that serializes log entries with Avro

firetester   -  a simple RESTful services testing tool written with the Griffon (Groovy) framework

gitter - Publishes github activities to Twitter

jfilemagic (jfm) - a utility for identifying files using magic numbers or signatures

cometd-chat - a Comet-based chat application, for fun