
Eventual Consistency

#distributed #nosql

Consistent Hashing

What is a consistent hash function?

A consistent hash function is one whose output changes minimally as the range of the function (the set of available buckets) changes.

What's the advantage of such functions?

This is ideal when the set of buckets changes over time. Two users with inconsistent but overlapping sets of buckets will map an item to the same bucket with high probability, which eliminates the need to "maintain" a consistent "state" among all nodes in a network. The algorithm can be used to make consistent assignments or relationships between different sets of data, in such a way that if items are added or removed, the mapping can be recalculated on any machine and produce the same results.

Theory

A view V is a set of buckets that the user is aware of. A client uses a consistent hash function f(V, i) to map an object i to one of the buckets in the view. For example, assign each hash bucket to a random point on a circle of size 2^n (mod 2^n), where n is the hash key size; an object is then assigned to the closest bucket in the clockwise direction. Each bucket ends up with roughly the same number of items, and when the kth bucket is added only a 1/k fraction of the items move. This means that adding a new node requires only a minimal reshuffle, which is the advantage of having a view. A hash structure such as a balanced tree can be used for the key lookup; it stores the hashes of all nodes in the view, and when a new node is added its hash value is inserted into that structure.
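As a minimal sketch of this idea in Java (illustrative only; the class name, the use of MD5, and the use of a TreeMap as the balanced-tree lookup are assumptions, not part of any particular library):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative consistent-hash ring: nodes (buckets) and objects are hashed
// onto the same circular space; an object belongs to the closest node found
// walking clockwise from the object's position.
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Map a string onto the ring using the first 8 bytes of its MD5 digest.
    private long hash(String key) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public void addNode(String node)    { ring.put(hash(node), node); }
    public void removeNode(String node) { ring.remove(hash(node)); }

    // Closest node in the clockwise direction; wrap around to the first node.
    public String getNode(String objectKey) {
        if (ring.isEmpty()) {
            return null;
        }
        SortedMap<Long, String> tail = ring.tailMap(hash(objectKey));
        Long nodeHash = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(nodeHash);
    }
}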

Suppose there are two nodes A and B and three objects 1–3, all mapped onto the hash function's result range. Objects 3 and 1 are mapped to node A, and object 2 to node B. When a node leaves the system, its data is remapped to the adjacent node (in the clockwise direction); when a node enters the system, it is hashed onto the ring and takes over the objects that now fall to it.



As an example (refer to link1, link2), the circle denotes a range of key values; say the points on the circle represent 64-bit numbers. Hash the data to get a 64-bit number, which is a point on the circle. Take the IP addresses of the nodes and hash them into 64-bit numbers, which are points on the same circle. Associate each piece of data with the node in the clockwise direction (i.e. the closest one, which can be retrieved from the hash structure). When a new node is inserted into the hash structure, data is still assigned to the closest node only: everything between the new node's number and the next number on the ring previously picked by a different node now belongs to the new node.

The basic idea of a consistent hash function is to hash both objects and buckets using the same function. It is one of the best ways to implement stores that can dynamically scale out and rebalance: client applications can calculate which node to contact in order to read or write data, with no metadata server required.
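A short usage sketch of the hypothetical ring class above, showing that adding a node only remaps the keys that now fall between the new node and its predecessor:

ConsistentHashRing ring = new ConsistentHashRing();
ring.addNode("10.0.0.1");
ring.addNode("10.0.0.2");

String before = ring.getNode("object-42");  // some node, e.g. "10.0.0.1"

ring.addNode("10.0.0.3");                   // a new node joins the ring
String after = ring.getNode("object-42");   // changes only if the new node
                                            // falls between this key and the
                                            // node it previously mapped to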


Used by

memcached cluster.
Typically, multiple memcached daemons are started, on different hosts. The clients are passed a list of memcached addresses (IP address and port) and pick one daemon for a given key. This is done via consistent hashing, which always maps the same key K to the same memcached server S. When a server crashes, or a new server is added, consistent hashing makes sure that the ensuing rehashing is minimal. Which means that most keys still map to the same servers, but keys hashing to a removed server are rehashed to a new server. - from A memcached implementation in JGroups
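A rough sketch of that client-side selection, reusing the hypothetical ring class from above (not the actual JGroups or memcached client code):

// Every client configured with the same address list picks the same
// daemon for a given key K.
ConsistentHashRing servers = new ConsistentHashRing();
servers.addNode("10.0.0.1:11211");
servers.addNode("10.0.0.2:11211");
servers.addNode("10.0.0.3:11211");
String daemonForKey = servers.getNode("session:12345");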


Amazon's Dynamo uses consistent hashing along with replication as a partitioning scheme.
Data is partitioned and replicated using consistent hashing [10], and consistency is facilitated by object versioning [12]. The consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol. - from Dynamo: Amazon's Highly Available Key-value Store

Cassandra partitions table data and distributes it among the nodes using a consistent hash function.
Cassandra partitions data across the cluster using consistent hashing [11] but uses an order preserving hash function to do so. In consistent hashing the output range of a hash function is treated as a circular space or "ring" (i.e. the largest hash value wraps around to the smallest hash value). Each node in the system is assigned a random value within this space which represents its position on the ring. Each data item identified by a key is assigned to a node by hashing the data item's key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item's position. This node is deemed the coordinator for this key. The application specifies this key and the Cassandra uses it to route requests. Thus, each node becomes responsible for the region in the ring between it and its predecessor node on the ring. The principal advantage of consistent hashing is that departure or arrival of a node only affects its immediate neighbors and other nodes remain unaffected. - from Cassandra - A Decentralized Structured Storage System
Voldemort uses automatic sharding of data. Nodes can be added or removed from a database cluster, and the system adapts automatically. Voldemort automatically detects and recovers failed nodes. [refer]

References:
http://www.akamai.com/dl/technical_publications/ConsistenHashingandRandomTreesDistributedCachingprotocolsforrelievingHotSpotsontheworldwideweb.pdf
http://sharplearningcurve.com/blog/2010/09/27/consistent-hashing/
http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html




Bloom Filters

A Bloom filter is a probabilistic data structure that can represent a set of data in a space-efficient manner. For example, Squid's Cache Digests are Bloom-filter summaries of cache contents, shared between nodes so that each node has a global picture of what its peers hold.

The data structure supports membership queries, i.e. checkIfDataPresentInStore(). If an element has already been inserted into the filter, the query is guaranteed to return true: there are no false negatives. However, an element that was never inserted may also return true, so a positive answer has to be verified against the original store; the overhead is therefore tied to the false-positive rate. This differs from a dictionary, where a hit or miss is deterministic.

For a set of n elements, a Bloom filter is a bit vector of size m, with all bits initially set to 0. For each inserted element e, k hash functions each return an index, and the corresponding k bits in the vector are set to 1. A membership query checks whether all k bit positions for the queried element are set: if they are, the element is possibly present in the store; otherwise, it is definitely not present. The entire set is represented by these m bits, i.e. roughly m/n bits per element. Using several independent hash functions reduces collisions.
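A minimal Bloom filter sketch in Java, assuming a java.util.BitSet of m bits and k index positions derived by double hashing from String.hashCode() (illustrative only; real implementations use stronger hash functions):

import java.util.BitSet;

// Illustrative Bloom filter: add() sets k bit positions for an element,
// mightContain() reports true only if all k positions are set.
public class BloomFilter {
    private final BitSet bits;
    private final int m;  // total number of bits in the filter
    private final int k;  // number of hash functions

    public BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // i-th index via double hashing: (h1 + i * h2) mod m
    private int position(String element, int i) {
        int h1 = element.hashCode();
        int h2 = (h1 >>> 16) | 1;  // cheap second hash, forced odd
        return Math.floorMod(h1 + i * h2, m);
    }

    public void add(String element) {
        for (int i = 0; i < k; i++) {
            bits.set(position(element, i));
        }
    }

    // true  -> possibly present (false positives possible)
    // false -> definitely not present (no false negatives)
    public boolean mightContain(String element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(element, i))) {
                return false;
            }
        }
        return true;
    }
}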


Uses
  • Design a spell checker
  • Database join implementation (Oracle)  
  • Peer to peer (P2P) communication and routing  
  • In HBase, the Bloom filter is stored as a meta block in the HFile. When an HFile is opened, the Bloom filter is loaded into memory and used to determine whether a given key is in that store file; this avoids scanning the file for keys it cannot contain.
  • and more

I found a Java implementation here.
Cassandra's Java implementation is here.

References

http://en.wikipedia.org/wiki/Bloom_filter
https://issues.apache.org/jira/browse/HBASE-1200
http://wiki.squid-cache.org/SquidFaq/CacheDigests
http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf

Accessing data from storage system using XAM API



POSIX is a collection of standards that enables portability of applications across OS platforms. It provides a standardized API for an operating system, e.g. functions like 'open' to open a file, 'fdopen' to open a stream for a file descriptor, 'fork' to create a process, etc. It defines a file descriptor as a per-process, unique, non-negative integer used to identify an open file for the purpose of file access. The Unix systems programming API is standardized by a large number of POSIX standards, and most operating system APIs share common functionality such as listing directories, renaming files, etc.

XAM stands for "eXtensible Access Method". The XAM API allows application developers to store content on a class of storage systems known as "fixed-content" storage systems.

Data centers have evolved into a complex ecosystem of SANs, Fibre Channel, iSCSI, caching technologies, etc. Certain storage systems are designed specifically to store documents that do not change, which is useful for preservation or archival. Digital imaging systems that archive X-ray DICOM images, i.e. documents that are never edited and must be kept for a long time, can use such fixed-content storage. Applications accessing these solutions have to be written against vendor-specific interfaces, so what happens if the infrastructure changes? The application has to be rewritten. The scenario is analogous to the need for the JDBC/ODBC driver specifications as a standard API for interacting with different database vendors. The XAM API, specified by SNIA, is the storage industry's proposed solution to this multi-vendor interoperability problem.

Consider EMC Centera: the data written to it is fixed in nature. Most common systems use a file-location (address) based approach to store and retrieve content; Centera instead has a flat address scheme called content addressing. When a BLOB object is stored, it is given a "claim check" derived from its content: a 128-bit hash value known as the Content Address (CA). The application need not know the physical location of the stored data. The associated metadata, such as the filename, is added along with the file's content address to an XML document called the C-Clip Descriptor File (CDF). This makes the system capable of WORM behavior: when one tries to modify or update a file, the new version is stored separately, enabling versioning and unique access through its fingerprint checksum. There is also an attribute, essential for retention, that tracks the expiry of the file; the file cannot be deleted until that time has passed.

This is one vendor-specific architecture, and each vendor provides its own SDK/API to access the data. So the industry came up with a standard: the XAM API provides the capability to access these systems in a vendor-independent way.

XAM has concepts like SYSTEMS, XSETS, properties and streams ("blobs"), and XUIDs.

SYSTEM/XSYSTEM - is the repository
XSET - the content
XUID - identifier for the XSET

All these elements carry metadata as key-value pairs. Fields are either participating or non-participating: participating fields contribute to the naming of the XSET, known as the XUID. If an XSET is retrieved from a SYSTEM and a participating field is modified, it will receive a different XUID. The XUID must be formed from the content of the property or stream (e.g. by running an MD5 hash over the value).
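For illustration only (the real XUID format is defined by the XAM specification; the method below is a hypothetical sketch), an identifier derived from field content could look like this:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

// Hypothetical sketch: hash the bytes of a participating field so that
// changing the content changes the resulting identifier.
static String contentDerivedId(byte[] fieldValue) throws Exception {
    byte[] digest = MessageDigest.getInstance("MD5").digest(fieldValue);
    return Base64.getEncoder().encodeToString(digest);
}

// Same participating content -> same identifier; modified content -> a new one.
String id = contentDerivedId("patient-scan-001".getBytes(StandardCharsets.UTF_8));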

A vendor-specific library implements the XAM Application Programming Interface (API); this library, known as the XAMLibrary, connects to a Vendor Interface Module (VIM), which in turn communicates with the storage system. VIMs are usually the native interfaces. Once the application has loaded the XAMLibrary and connected to the source, it opens an XSET by providing its XUID; the content is then exposed as a stream, called an XStream. The XStream object is used to read the transported data, which can be a query result or a file.




Sample code to read the data:

// connect; s_xam is the loaded XAM library instance, xri the vendor-specific XRI
String xri = "snia-xam://";
XSystem sys = s_xam.connect(xri);
// open the XSet by its XUID
XSet xset = sys.openXSet(xuid, XSet.MODE_READ_ONLY);
// get the metadata value for a key
String value = xset.getString("com.example.key");
// open the content as an XStream
XStream stream = xset.openStream("com.example.data",
        XStream.MODE_READ_ONLY);
// ... read the data from the stream here ...
stream.close();
xset.close();
sys.close();

XAM fields carry type information described using MIME types,
e.g. "application/vnd.snia.xam.xuid" for the 80-element byte array xam_xuid.

The XAM URI (XRI) is the XAM resource identifier; it is associated with a vendor and identifies the storage system to connect to.

Because XAM generates a globally unique name (address), objects may move around freely over time, changing their physical or technological location, which enables transparent information lifecycle management (ILM). Metadata and retention information are thus externalized: even if the application that created the data is retired, the information can still be managed through the API. This makes the data center more adaptive to organizational information management.

And, unlike file systems, XAM tags the information with metadata and provides a technology-independent namespace, allowing software to interpret the content independently of the application.


Architecture specification


An account of Open Source Summit 2008 Hyderabad

During the weekend I attended the Open Source Summit held at IIIT Hyderabad on the 13th and 14th (I was unable to attend the second day :( ). On Saturday morning I took the MMTS from the station near home to Hafeezpet, and from there reached IIIT by auto around 11 am.


The first session I attended was on BeleniX, the OpenSolaris LiveCD project, by Moinak Ghosh. I arrived at the conference room while the presentation was halfway through. The presenter was upgrading OpenSolaris, and while the upgrade was running, other applications kept executing! He explained how OpenSolaris and ZFS are useful in a production-ready environment and demonstrated creating separate snapshots. He covered DTrace, which can be used to dynamically inject debug probes while an application is running (and can be used for debugging the kernel), the difference between zones in OpenSolaris and virtualization, the concept of a RAM disk, etc. The session was good, as practical samples were demonstrated.


The next session, which was even more interesting, was by Mahesh Patil from the National Ubiquitous Computing Research Centre, CDAC, on embedded Linux and real-time operating systems. I really enjoyed it and understood the technology. When I was in college (MCET Trivandrum) we used to conduct a lot of seminars; sensor networks and nanotechnology were the most presented topics in those days. But this session was a great experience because he had something cool to show: a board with an ARM processor, onto which he demonstrated loading Linux. He explained toolchains and how they are used, packaging kernel images, etc. He described how an embedded OS differs from an RTOS and the preemptive nature of an RTOS. An RTOS can use a dual-kernel approach, in which interrupt-handling latency is reduced by having one kernel handle interrupts while the other kernel handles everything else; the core kernel operations get lower priority than the tasks that must be executed with higher priority in the queue. I came to know that most embedded Linux work is based on POSIX compliance, but in Japan it is MicroItron. He also talked about eCos, a configurable OS that can be tuned for embedded or real-time use, and about the Smart Dust project, a cool futuristic technology: tiny devices floating around that communicate within a small range and sleep most of the time. I was wondering how huge the volume of data produced by these devices would be. Think about real-time heat maps of the different boxes holding vaccines that are distributed around the world! (Pharmaceutical companies now keep a device inside the package to record the temperature when it was packed and to check the change in temperature when it is opened.) I also came to know about the 3 KB Linux, TinyOS! Cool and simple, even though I am not from an electronics background...


Next on stage was a geek, Arun Raghavan from Nvidia, a developer in the Gentoo Linux community. I hadn't tried this Linux variant before. It is a Linux for developers! Any application can be customized for performance and maintainability by creating ebuilds, which makes it very flexible for developers. I think it offers a good learning curve, as most of the installation and customization can be done by the user. He demonstrated creating ebuilds for gtwitter, a Twitter client, and showed the ease of using Portage, the package management system used by Gentoo Linux. Visit gentoo.org to know more about this Linux. I really liked the article written by Daniel Robbins (the architect of Gentoo Linux) about its birth; read it here.


I attended another session, on Hadoop, by Venkatesh from the Yahoo! research team. Hadoop is an open-source project for large data centers. I was looking forward to this presentation, as it is about Web 2.0 (cloud computing) and large-scale computing (blogged about before). Hadoop is a framework written in Java that supports data-intensive distributed applications; to know more about large-scale data processing using Hadoop, you can read this paper. It has a file system called HDFS (a pure-Java file system!) that stores replicated data as chunks across unRAIDed SATA disks. There is a name node and a cluster of data nodes, like a master-slave system; the name node keeps the file-system metadata, an index of which data nodes hold which chunks. The concept is similar to the Google File System and its cluster features; more about the concept here (Nutch) and here. This framework can be used for processing high-volume data and, integrated with Lucene, will help to create a quality search engine of our own. It is used by Facebook; one of the Engineering @ Facebook Notes explains why Hadoop was integrated, read here. It is also used by IBM (IBM MapReduce Tools for Eclipse), Amazon, Powerset (which was recently acquired by Microsoft), Last.fm... Read more about Hadoop in data-intensive scalable computing. Related projects: Mahout (machine learning libraries) and Tashi (a cluster management system).


So it was worth it, as I was able to attend these sessions... Thanks to twincling.org.