
Boyer and Moore's Linear Time Voting Algorithm

This is a simple linear-time voting algorithm devised by Robert S. Boyer and J Strother Moore in 1980 and discussed in their paper MJRTY - A Fast Majority Vote Algorithm.

This algorithm decides which element of a sequence is in the majority, provided there is such an element. 

Suppose there are n elements (objects/candidates). When the ith element is visited, the elements seen so far can be divided into two groups: a group of k elements in favor of the currently selected candidate, and a group of elements that disagree with it. After processing all the elements, the selected candidate is the majority element, if a majority exists at all.


When the pointer moves forward over an element e:

If the counter is 0, we set the current candidate to e and we set the counter to 1.
If the counter is not 0, we increment or decrement the counter according to whether e is the current candidate.
When we are done, the current candidate is the majority element, if there is a majority.

I have written a simple java implementation.
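A minimal sketch of the idea in Java, with the confirming second pass included (class and method names are illustrative):

public class MajorityVote {

    // Returns the majority element of a[], or null if no element
    // occurs more than a.length / 2 times.
    public static Integer majority(int[] a) {
        // Pass 1: Boyer-Moore voting
        int candidate = 0, count = 0;
        for (int e : a) {
            if (count == 0) {
                candidate = e;
                count = 1;
            } else if (e == candidate) {
                count++;
            } else {
                count--;
            }
        }
        // Pass 2: confirm that the candidate really is a majority
        int occurrences = 0;
        for (int e : a) {
            if (e == candidate) occurrences++;
        }
        return occurrences > a.length / 2 ? (Integer) candidate : null;
    }

    public static void main(String[] args) {
        int[] votes = {3, 3, 4, 2, 4, 4, 2, 4, 4};
        System.out.println(majority(votes)); // prints 4
    }
}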




Ties may occur, and the voting pass alone cannot tell whether the candidate it reports is genuinely a majority. For assurance, a second counting pass is needed: only if the candidate's count is greater than n/2 is it announced as the majority. This counting phase can be done while confirming the candidate. The algorithm is especially effective when the data is read sequentially, for example from a tape, and it only identifies an element that occurs in more than half of the positions, i.e. a true majority.

Reference
http://www.cs.utexas.edu/users/moore/best-ideas/mjrty/example.html

Descending Iterator and Adapter pattern

There is a descending iterator in the LinkedList implementation in the Java SDK: a humble private class inside LinkedList, and a good example of the Adapter pattern.
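The private class looks roughly like this (a simplified sketch; the real JDK source differs in details): it adapts the existing ListIterator, which already knows how to walk backwards, to the plain Iterator interface.

// inside java.util.LinkedList (simplified sketch)
private class DescendingIterator implements Iterator<E> {
    // the adaptee: a list iterator positioned at the end of the list
    final ListIterator<E> itr = listIterator(size());

    public boolean hasNext() { return itr.hasPrevious(); }
    public E next()          { return itr.previous(); }
    public void remove()     { itr.remove(); }
}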




The public method calls it up:


public Iterator<E> descendingIterator() {
    return new DescendingIterator();
}


Using Avro to serialize logs in log4j


I have written about the serialization mechanism of Protocol Buffers previously. Apache Avro provides a similar, and in some respects better, serialization framework.

It provides features like:

 - Independent Schema -  use different schemas for serialization and de-serialization
 - Binary serialization - compact data encoding, and faster data processing
 - Dynamic typing - serialization and deserialization without code generation

Avro can encode data in two ways when serializing: binary or JSON. In a binary file, the schema is included at the beginning of the file; in JSON, the type is defined along with the data. Switching from the JSON protocol to the binary format to achieve better performance is straightforward with Avro. Less type information needs to be sent with the data, and because the data is stored along with its schema, any program can deserialize the encoded data, which makes Avro a good candidate for RPC.

 In Avro 1.5 we have to use the following (this is different from previous versions, which had no factory for encoders):
 - org.apache.avro.io.EncoderFactory.binaryEncoder(OutputStream out, BinaryEncoder reuse) for binary
 - org.apache.avro.io.EncoderFactory.jsonEncoder(Schema schema, OutputStream out) for JSON

 The values (of Avro-supported types) are put against the schema field names as keys, in a set of name-value pairs called a GenericData.Record.

 Avro supported value types are
  Primitive Types - null, boolean, int, long, float, double, bytes, string
  Complex Types - Records, Enums, Arrays, Maps, Unions, Fixed
 
  you can read more about them  here

  A schema definition must be provided for the record instance. To read and write data, just use the put/get methods.
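As an illustration, a minimal sketch of writing a GenericData.Record with the Avro 1.5 generic API and the binary encoder (the schema and field names here are made up):

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroEncodeExample {
    public static void main(String[] args) throws Exception {
        // an illustrative schema with two string fields
        String schemaJson = "{\"type\":\"record\",\"name\":\"LogEvent\","
                + "\"fields\":[{\"name\":\"level\",\"type\":\"string\"},"
                + "{\"name\":\"message\",\"type\":\"string\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // values are put against the schema field names
        GenericRecord record = new GenericData.Record(schema);
        record.put("level", "INFO");
        record.put("message", "hello avro");

        // binary encoding via the EncoderFactory
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        System.out.println("encoded " + out.size() + " bytes");
    }
}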
 
   I have used this serialization mechanism to provide a layout for log4j. The logs will be serialized using Avro.

The GitHub project is here - https://github.com/harisgx/avro-log4j
 
   Add the libraries to your project and add the new properties to log4j.properties:

   log4j.appender.logger_name.layout=com.avrolog.log4j.layout.AvroLogLayout
   log4j.appender.logger_name.layout.Type=json
   log4j.appender.logger_name.layout.MDCKeys=mdcKey
 
 Provide the MDC keys as comma-separated values.
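For illustration, logging with an MDC value that the layout can pick up might look like this (the key must match one of the configured MDCKeys; the logger and key names are just examples):

import org.apache.log4j.Logger;
import org.apache.log4j.MDC;

public class AvroLogExample {
    private static final Logger LOG = Logger.getLogger(AvroLogExample.class);

    public static void main(String[] args) {
        // the MDC key configured in log4j.properties (mdcKey above)
        MDC.put("mdcKey", "request-42");
        LOG.info("user logged in");
        MDC.remove("mdcKey");
    }
}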
 
 
   This is the schema


 
 

Bloom Filters

A Bloom filter is a probabilistic data structure that can be used to represent a set of data in a space-efficient manner. For example, Squid's Cache Digests are Bloom-filter summaries shared between cache nodes so that each node has a global picture of what the others hold.

The data structure supports membership queries, i.e. something like checkIfDataPresentInStore(). If the element being checked was actually inserted into the filter, the query will return true; there are no false negatives. However, an element that was never inserted may also return true. In that case the element can still be checked against the original store, i.e. the overhead is associated with the rate of false positives. This is different from a dictionary, in which a hit/miss is deterministic.

For a set of n elements, a Bloom filter is a bit vector of size m. Initially, all bits are set to 0. For each element e, k hash functions each return an index into the vector, and those k bits are set to 1. When a membership query is executed, the same k positions are checked: if all of them are set, the queried element is possibly present in the store; otherwise, it is definitely not present. The whole set shares the same m bits, so the total space required is m bits (about m/n bits per element), rather than a separate structure per key. Using different, independent hash functions results in fewer collisions.
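A minimal sketch in Java, deriving the k bit positions from two base hashes (class and method names are illustrative):

import java.util.BitSet;

// A tiny illustrative Bloom filter: one shared m-bit vector and
// k hash functions derived from two base hashes of the element.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m;   // size of the bit vector
    private final int k;   // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    private int index(String e, int i) {
        int h1 = e.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16);   // a second, cheap hash
        int idx = (h1 + i * h2) % m;
        return idx < 0 ? idx + m : idx;
    }

    public void add(String e) {
        for (int i = 0; i < k; i++) bits.set(index(e, i));
    }

    // false -> definitely not present; true -> possibly present
    public boolean mightContain(String e) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(e, i))) return false;
        }
        return true;
    }
}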


Uses
  • Design a spell checker
  • Database join implementation (Oracle)  
  • Peer to peer (P2P) communication and routing  
  • In HBase, the Bloom filter is stored as a meta block in the HFile. When an HFile is opened, the Bloom filter is loaded into memory and used to determine whether a given key is in that store file. This can avoid scanning that store file for the key.
  • and more

I found a java implementation here
Cassandra's java implementation here

Reference

http://en.wikipedia.org/wiki/Bloom_filter
https://issues.apache.org/jira/browse/HBASE-1200
http://wiki.squid-cache.org/SquidFaq/CacheDigests
http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf

Labs

avro-log4j - an Avro serialization layout for log4j

firetester - a simple RESTful services testing tool written in Groovy with the Griffon framework

gitter - publishes GitHub activities to Twitter

jfilemagic (jfm) - a utility for identifying files using magic numbers or signatures

cometd-chat - a comet-based chatter for fun



Interesting uses of sun.misc.Unsafe

Inspired by a question I found on Stack Overflow, I started looking up its uses. I found some pretty interesting ones...

VM "intrinsification." ie CAS (Compare-And-Swap) used in Lock-Free Hash Tables eg:sun.misc.Unsafe.compareAndSwapInt it can make real JNI calls into native code that contains special instructions for CAS

What is intrinsification?
These are optimizations such as the compiler generating code directly for a called method, or native optimizations inside the JVM. We know that there are VM downcalls from the JDK, like the wait method; it is low-level programming. For example, the Atomic classes for numbers: they are plain numbers represented by objects, but modified atomically, with the operations handled natively.
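As an illustration, here is a sketch of an atomic counter built directly on compareAndSwapInt; the Unsafe instance is obtained reflectively purely for demonstration:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Illustrative only: a counter whose increment is a CAS retry loop
// on sun.misc.Unsafe, the same primitive the Atomic* classes use.
public class UnsafeCounter {
    private static final Unsafe UNSAFE;
    private static final long VALUE_OFFSET;

    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            VALUE_OFFSET = UNSAFE.objectFieldOffset(
                    UnsafeCounter.class.getDeclaredField("value"));
        } catch (Exception e) {
            throw new Error(e);
        }
    }

    private volatile int value;

    public int incrementAndGet() {
        int current;
        do {
            current = value;
            // retry until no other thread changed the value in between
        } while (!UNSAFE.compareAndSwapInt(this, VALUE_OFFSET, current, current + 1));
        return current + 1;
    }
}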


The sun.misc.Unsafe functionality of the host VM can be used to allocate uninitialized objects and then interpret the constructor invocation as any other method call.

One can work with data at its native address. It is possible to retrieve an object's memory address using the sun.misc.Unsafe class and operate on its fields directly via the unsafe get/put methods!

Compile-time optimizations for the JVM. High-performance VMs use such "magic", requiring low-level operations, e.g. http://en.wikipedia.org/wiki/Jikes_RVM

Allocating memory: sun.misc.Unsafe.allocateMemory, e.g. the DirectByteBuffer constructor calls it internally when ByteBuffer.allocateDirect is invoked.
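A small sketch of raw off-heap allocation with Unsafe (the instance is again obtained reflectively, purely for illustration):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffHeapExample {
    public static void main(String[] args) throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        // allocate 8 raw bytes off-heap, write and read a long, then free it
        long address = unsafe.allocateMemory(8);
        unsafe.putLong(address, 42L);
        System.out.println(unsafe.getLong(address)); // 42
        unsafe.freeMemory(address);
    }
}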

Tracing the call stack and replaying it with values instantiated by sun.misc.Unsafe, useful for instrumentation.

sun.misc.Unsafe.arrayBaseOffset and arrayIndexScale can be used to develop arraylets, a technique for efficiently breaking up large arrays into smaller objects to limit the real-time cost of scan, update or move operations on large objects


References


Accessing data from storage system using XAM API



You may be familiar with POSIX, a collection of standards that enables portability of applications across OS platforms. It provides a standardized API for an OS, e.g. functions like 'open' to open a file, 'fdopen' to open a stream for a file descriptor, 'fork' to create a process, etc. It defines a file descriptor as a per-process, unique, non-negative integer used to identify an open file for the purpose of file access. The Unix systems programming API is standardized by a large number of POSIX standards, and most operating system APIs share common functionality like listing directories, renaming files, etc.

XAM stands for "eXtensible Access Method". The XAM API allows application developers to store content on a class of storage systems known as "fixed-content" storage systems.

Data centers have evolved into a complex ecosystem of SANs, Fibre Channel, iSCSI, caching technologies etc. Certain storage systems are designed to efficiently store documents that do not change, which is useful for preservation or archival. Digital imaging storage systems used to archive X-ray DICOM images, i.e. documents that are not edited and must be kept for a long time, can use fixed-content storage. The applications accessing these solutions have to be written against vendor-specific interfaces. What if one changes the infrastructure? The application has to be rewritten. This scenario is analogous to the need for the JDBC/ODBC driver specifications as a standard API for interacting with different vendors. The XAM API, specified by SNIA, is the storage industry's proposed solution to this multi-vendor interoperability problem.

Consider EMC Centera: the data written to it is fixed in nature. Most common systems use a file-location (address) based approach to store and retrieve content; Centera instead has a flat address scheme called the content address. When a BLOB object is stored, it is given a "claim check" derived from its content, a 128-bit hash value known as the Content Address (CA). The application need not know the physical location of the stored data. The associated metadata, like the filename, is added to an XML document called a C-Clip Descriptor File (CDF) along with the file's content address. This makes the system capable of WORM behavior. When one tries to modify or update a file, the new version is kept separately, enabling versioning and unique access through its fingerprint checksum. There is also an essential retention attribute that tracks the expiry of the file: the file cannot be deleted until the retention period has passed.

This is a basic vendor-specific architecture: each vendor provides its own SDK/API to access the data. So the industry came up with a standard in which the XAM API provides the capability to access these systems independently of the vendor.

XAM has concepts like SYSTEMS, XSETS, properties and streams ("blobs"), and XUIDs.

SYSTEM/XSYSTEM - is the repository
XSET - the content
XUID - identifier for the XSET

All these elements include metadata as key-value pairs. There are participating and non-participating fields. Participating fields contribute to the naming of the XSET, i.e. to the XUID. If an XSET is retrieved from a SYSTEM and a participating field is modified, it will receive a different XUID. The XUID must be formed from the content of the property or stream (e.g. by running an MD5 hash over the value).

So a vendor-specific library implements the XAM Application Programming Interface (API), known as the XAM Library, which connects to the Vendor Interface Module (VIM), which in turn communicates with the storage system. Usually, the VIMs are the native interfaces. Once the application loads the XAM Library and connects to the source, it needs to open the XSET by providing the XUID, which opens up the content as a stream called an XStream. The XStream object is used to read the transported data, which can be a query result or a file.




Sample code to read the data:

// connect to the XSystem using the XAM resource identifier (XRI)
String xri = "snia-xam://";
XSystem sys = s_xam.connect(xri);
// open the XSet in read-only mode
XSet xset = sys.openXSet(xuid, XSet.MODE_READ_ONLY);
// get the metadata value for a key
String value = xset.getString("com.example.key");
// open the content as an XStream
XStream stream = xset.openStream("com.example.data", XStream.MODE_READ_ONLY);
// ... read from the stream ...
stream.close();
xset.close();
sys.close();

XAM fields have type information described using MIME types, e.g. "application/vnd.snia.xam.xuid" for the 80-element byte array xam_xuid.

The XAM URI (XRI) is the XAM resource identifier specification, associated with a vendor, that identifies the storage system.

As XAM generates a globally unique name (address), objects may move around freely over time, changing their physical or technological location, which enables transparent information lifecycle management (ILM). The metadata and retention information can thus be externalized, i.e. even if the application is gone (retired), the information can still be managed through the API. This makes the data center more adaptive to organizational information management.

And, unlike file systems, XAM tags the information with metadata and provides a technology-independent namespace, allowing software to interpret the content independently of the application.


Architecture specification


Enterprise Information Integration


The relational model given by Codd enabled a simple logical model with a mathematical foundation based on set-oriented queries. The relational database evolved over decades, and the use cases for accessing information changed: data marts, reporting, dashboards, security, governance and so on… RDBMSs provided efficient transaction semantics and support. But the world has changed: global economics became complex and data semantics became unmanageable. Corporations grew bigger, swallowing the tiny prawns around the globe. There are many database vendors (Oracle, DB2, SQL Server), many packaged apps (SAP, Siebel, homegrown), and structured, unstructured, semi-structured and other exotic data sources… Federated databases provided a centralized schema, logically mapping the schemas from different databases, but they faced challenges of schema integration, query translation etc. Virtual databases, some called them.

The use of XML as a standard for data exchange became a boon for all. It provided standards for schemas (XSD), querying (XQuery) etc. It is not a holy grail though. The SOA paradigm integrates systems, applications, processes and businesses. As the enterprise information ecosystem has grown like a behemoth, multiple producers/consumers are forced to share data across its domain, and the proliferation of data models results in data chaos.

EII decouples data sources from data consumers without moving data. It provides universal data access, facilitating dynamic dashboards or reports. By decoupling application logic from integration, one can easily add or change sources while customizing content delivery. It acts like a mediation engine between source and client. Standardized data access APIs like ODBC and JDBC offer a specific set of commands to retrieve and modify data from a generic data source, or one can use Web Services.

Then there are SDOs (Service Data Objects). SDO is an object-based programming specification for uniform access to various forms of data. It is designed to complement existing J2EE data access mechanisms, including EJBs, JDBC, and a number of popular frameworks for persistence. It is also intended to support service-oriented architectures (SOA), including Web services. SDO programmers work with data graphs (collections of tree-structured data) that consist of one or more data objects. The data graph structure is well suited to certain types of Web applications, such as those that display master-detail information, and is easily understood by most object-oriented application programmers. It is based on Model Driven Architecture.

An EII system provides a single interface and extensible metadata that can encompass all data sources, speeding the development of applications such as enterprise portals. Development is faster because programmers code against a single virtual data source regardless of how the data is stored. It is like a metadata-based view over the sources, and virtual views optimize the execution of distributed queries. Modeled with metadata, many data sources look like one single entity. Information models like XBRL can be used to publish data for reporting purposes. The EII server creates the query for each data source; it is intelligent enough to understand multiple data sources. The data sources are mapped to create the view, i.e. a data-to-XML binding to metadata, and the metadata is stored in a repository. These reusable, integrated logical entities or data objects can be exposed as a data service or via JDBC/ODBC.

One of the key issues faced in data integration projects is locating and understanding the data to be integrated. Often one finds that the data needed for a particular integration application is not even captured in any source in the enterprise, i.e. the developer has to once again model the domain from different sources. Mapping from multiple systems and schemas is a cumbersome process, and human intervention becomes essential, with more headaches. If an EII system also allows writing data, transactions have to be managed across systems, and information retrieval becomes complex when mapping data from structured and unstructured sources. EII is not a perfect solution, as it needs performance and scalability similar to what is expected from an RDBMS. Even when it matures and succeeds, EII will not replace the data warehouse. Depending on the problem to be solved, integrated data will either be persisted into a warehouse or virtualized through EII.

References

http://www.ibm.com/developerworks/data/library/techarticle/dm-0407saracco/

http://www.cs.washington.edu/homes/alon/files/eiisigmod05.pdf

http://www.jboss.com/products/platforms/dataservices/