
Some IdentityHashMap gotchas

Most Java developers are familiar with IdentityHashMap, which is used when keys must be maintained based on reference equality. A lot has been written about collections, hashing and so on, but I thought: what's so interesting about this map? Even the algorithm behind it is about 45 years old!

Every object has an "identity", i.e. an internal "address" that is unique for the lifetime of that object, so the map contains unique object references as keys. Consider HashMap, HashSet, TreeMap, TreeSet, PriorityQueue etc.: they are "equality-dependent". One can override the equals() method in the stored object to control how it behaves as a key in these collections. IdentityHashMap, by contrast, is "equality-independent": even if the objects residing in it change their comparability or equality, the map keeps working fine. But it violates the symmetry property of the equals contract. Symmetry means that for any two references a and b, a.equals(b) must return true if and only if b.equals(a) returns true. That sets the contract for IdentityHashMap at odds with the contract for Map, the interface it implements, which specifies that equality should be used for key comparison.

The Javadoc even says so, in bold: "This class is not a general-purpose Map implementation! While this class implements the Map interface, it intentionally violates Map's general contract, which mandates the use of the equals() method when comparing objects. This class is designed for use only in the rare cases wherein reference-equality semantics are required."

IdentityHashMap's data structure is based on open addressing, a method in which, when a collision occurs, alternative locations are tried until an empty cell is found. All items are stored in an array. This map uses linear probing, where the next position is tried sequentially.

i.e. newLocation = (startingValue + stepSize) % arraySize (here the stepSize is 1)

This basically stores all keys in a single array. Occupied slots form clusters (primary clusters) scattered across the table, with the keys in each cluster localized to one area; a collision occurs when the hash function returns an index that is already occupied, which costs extra time on insertion. (Open addressing is essentially the same technique used in some file/memory management schemes.) The map puts keys and values in alternating slots of a single array (supposedly good for large data sets) and doesn't need to create or reuse entry objects. Even as the map grows large, insertions are not affected, while map implementations that use chaining get slower. Lookup matters even more than insertion, since we look items up much more often than we insert them, and following a probe sequence is somewhat faster than following a linked list when the value reference is placed directly in the array. Other hash-based collections also store the hash code, because a get operation must check whether it has found the right key and full equality can be an expensive operation; here a simple reference comparison suffices. So this map takes less memory and is faster.

I read that linear probing was first analyzed by Knuth in a 1963 memorandum (written at the age of 24!) that is now considered the birth of the field of analysis of algorithms. Knuth's analysis, as well as most of the work that has since gone into understanding the properties of linear probing, is based on the assumption that the hash function is a truly random function, mapping all keys independently.

The keys are stored in an array tab, and the hash function h maps keys to indices. When a key x is inserted, we compute index = h(x). If tab[index] is occupied, we scan sequentially until either x is found or an empty slot turns up, in which case x is inserted there; the insertion location is the nearest empty slot after h(x). In IdentityHashMap, h(x) is basically System.identityHashCode (per the source, with some extra bit-mixing). As the algorithm uses modulo arithmetic, the scan wraps around the table. In the implementation, the table size is doubled as it fills up, until the maximum allowed capacity is reached.
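To make the mechanics concrete, here is a minimal sketch (not the actual JDK source) of an identity-based table that keeps keys and values in alternating slots of one array and resolves collisions by linear probing. The bit-mixing is simplified, resizing is omitted, and null keys are not supported; the class name is just for illustration.

public class TinyIdentityMap {
    private Object[] tab = new Object[64]; // even slots hold keys, odd slots hold values

    // Map the identity hash code to an even index into the table (simplified mixing).
    private int hash(Object key) {
        int h = System.identityHashCode(key);
        return (h & (tab.length / 2 - 1)) * 2;
    }

    public void put(Object key, Object value) {
        int i = hash(key);
        while (tab[i] != null && tab[i] != key) {
            i = (i + 2) % tab.length;       // linear probing, one entry at a time
        }
        tab[i] = key;                       // empty slot, or the same reference found
        tab[i + 1] = value;                 // the value sits right next to its key
    }

    public Object get(Object key) {
        int i = hash(key);
        while (tab[i] != null) {
            if (tab[i] == key) {            // reference comparison, no equals() call
                return tab[i + 1];
            }
            i = (i + 2) % tab.length;
        }
        return null;
    }
}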

According to Knuth, the average number of probes for a successful search can be estimated as 1/2 (1 + 1/(1 - l)), where l is the load factor.
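For example, at a load factor of 0.5 this works out to 1/2 (1 + 1/0.5) = 1.5 probes on average for a successful search, but at 0.9 it is already 1/2 (1 + 1/0.1) = 5.5 probes, which is why open-addressing tables are resized well before they fill up.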

The put(k,v) method looks for the position to insert at. The index is computed from System.identityHashCode (which returns a 32-bit integer); identity hash codes are well distributed, almost like random numbers. Because System.identityHashCode is used, objects that have overridden hashCode() and equals() still land in different hash buckets even when they are equal(): the map treats them as two distinct keys.
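A quick demonstration of that distinction, using two equal but distinct String instances (new String deliberately bypasses the string pool); the class name is just for illustration:

import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

public class EqualButDistinct {
    public static void main(String[] args) {
        String a = new String("key");
        String b = new String("key"); // a.equals(b) is true, but a != b

        Map<String, Integer> hashMap = new HashMap<String, Integer>();
        hashMap.put(a, 1);
        hashMap.put(b, 2);
        System.out.println(hashMap.size()); // 1: equals()-based, b overwrote a

        Map<String, Integer> identityMap = new IdentityHashMap<String, Integer>();
        identityMap.put(a, 1);
        identityMap.put(b, 2);
        System.out.println(identityMap.size()); // 2: two distinct references
    }
}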

Some popular anomalies:


IdentityHashMap<Long, Order> orderMap1 = new IdentityHashMap<Long, Order>();
orderMap1.put(1L, null);
orderMap1.put(1L, null);

System.out.println(orderMap1.keySet().size());

This prints 1, because 1L is autoboxed (since Java 1.5), and boxed values between -128 and 127 point to the same cached object, so there is a single reference. Use 1000L for the keys and the size becomes 2, as autoboxing then creates distinct objects. Long has a private class that caches these values for performance, since they are so commonly used. The same applies to Integer, Character, etc.


private static class LongCache {
    private LongCache() {}

    static final Long cache[] = new Long[-(-128) + 127 + 1];

    static {
        for (int i = 0; i < cache.length; i++)
            cache[i] = new Long(i - 128);
    }
}
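The cache boundary is easy to observe directly; autoboxing goes through Long.valueOf, which consults this cache:

public class LongCacheDemo {
    public static void main(String[] args) {
        System.out.println(Long.valueOf(127L) == Long.valueOf(127L));   // true: cached
        System.out.println(Long.valueOf(1000L) == Long.valueOf(1000L)); // false: distinct objects
    }
}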

Take another example:

IdentityHashMap<String, OrderI> orderMap6 = new IdentityHashMap<String, OrderI>();
OrderI oI2 = new OrderI(1);
OrderI oI3 = new OrderI(2);

orderMap6.put("order1", oI3);
orderMap6.put("order1", oI2);

System.out.println(orderMap6.keySet().size()); // prints 1



Here the key set size will be 1. String literals are interned in the string constant pool, so both "order1" literals refer to the same object, and the second insertion overwrites the previous value.

Where can this map be used? Some say caches, as it performs very well there, though you can only store references. Another use is cyclic graphs, where we must use identity rather than equality to check whether nodes are the same: calculating equality between two graph node objects requires calculating the equality of their fields, which in turn means visiting all their successors, and we are back to the original problem. An IdentityHashMap, by contrast, will report a node as present only if that same node has previously been put into the map, saving both space and node traversals, as the sketch below shows.
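A minimal sketch of that idiom, using a hypothetical Node class; the identity-based visited set lets the traversal terminate even when the graph contains cycles:

import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class GraphWalk {
    static class Node {                        // hypothetical node type
        final List<Node> successors = new ArrayList<Node>();
    }

    static int countNodes(Node root) {
        // The Boolean values are irrelevant; the map is used as an identity set.
        Map<Node, Boolean> visited = new IdentityHashMap<Node, Boolean>();
        visit(root, visited);
        return visited.size();
    }

    private static void visit(Node node, Map<Node, Boolean> visited) {
        if (visited.put(node, Boolean.TRUE) != null) {
            return;                            // this exact node was already seen
        }
        for (Node next : node.successors) {
            visit(next, visited);
        }
    }

    public static void main(String[] args) {
        Node a = new Node(), b = new Node();
        a.successors.add(b);
        b.successors.add(a);                   // cycle: a -> b -> a
        System.out.println(countNodes(a));     // 2, and the traversal terminates
    }
}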

I just wanted to know about this map, that's it...

References

http://www.ece.uwaterloo.ca/~ece250/Lectures/Slides/6.07.LinearProbing.ppt
http://www.owasp.org/index.php/Java_gotchas#Immutable_Objects_.2F_Wrapper_Class_Caching
http://www.siam.org/proceedings/soda/2009/SODA09_072_thorupm.pdf
http://www.cs.unm.edu/~moret/graz01.pdf
http://www.it-c.dk/people/pagh/papers/linear-jour.pdf
http://reference.kfupm.edu.sa/content/l/i/linear_probing_and_graphs_122432.pdf

ETags - Roles in Web Application to Cloud Computing

A web server returns a value in a response header known as the ETag (entity tag), which helps the client know whether the content at a given URL has changed. When a page is loaded, the browser caches it and remembers the ETag of that page. On subsequent requests the browser sends the ETag value in the "If-None-Match" request header. The server reads this header value and compares it with the ETag of the current page. If the values are the same, i.e. the content has not changed, a status code 304 (Not Modified) is returned. This HTTP metadata can be put to good use in predicting page downloads, thereby optimizing the bandwidth used. A combination of a checksum (MD5) of the data as the ETag value and a correct modification timestamp could give quality results in predicting re-downloads. An analysis of the effectiveness of choosing the value of the ETag is described in this paper.
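As a quick illustration, here is a hedged sketch of the conditional-GET handshake from the client side using plain HttpURLConnection; the URL is a placeholder:

import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalGet {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/page.html"); // placeholder URL

        // First request: the server answers 200 and includes an ETag header.
        HttpURLConnection first = (HttpURLConnection) url.openConnection();
        String etag = first.getHeaderField("ETag");
        System.out.println(first.getResponseCode() + " ETag: " + etag);

        // Second request: echo the ETag back in If-None-Match.
        HttpURLConnection second = (HttpURLConnection) url.openConnection();
        if (etag != null) {
            second.setRequestProperty("If-None-Match", etag);
        }
        // 304 if the representation is unchanged, 200 with a fresh body otherwise.
        System.out.println(second.getResponseCode());
    }
}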

According to http://www.mnot.net/cache_docs/

A resource is eligible for caching if:

  • There is caching info in the HTTP response headers
  • The response is non-secure (HTTPS won't be cached)
  • An ETag or Last-Modified header is present
  • The cached representation is fresh

Entity tags can be strong or weak validators. A strong validator guarantees a unique representation: if we use MD5 or SHA1, the entity value changes when even one bit of the data changes. A weak value (conventionally prefixed with W/, e.g. W/"v1") changes only when the meaning of an entity (which can be a set of semantically related representations) changes.

More info on conditional requests, explaining strong and weak ETags, is here.

In Spring MVC, support for ETags is provided by the servlet filter ShallowEtagHeaderFilter. If you look at the source here:

String responseETag = generateETagHeaderValue(body);
// ...

protected String generateETagHeaderValue(byte[] bytes) {
    StringBuilder builder = new StringBuilder("\"0");
    Md5HashUtils.appendHashString(bytes, builder);
    builder.append('"');
    return builder.toString();
}


The default implementation generates an MD5 hash of the response body it rendered (e.g. a JSP). So whenever the same page is requested, the filter checks If-None-Match and, on a match, a 304 is sent back:


String requestETag = request.getHeader(HEADER_IF_NONE_MATCH);
if (responseETag.equals(requestETag)) {
    if (logger.isTraceEnabled()) {
        logger.trace("ETag [" + responseETag + "] equal to If-None-Match, sending 304");
    }
    response.setStatus(HttpServletResponse.SC_NOT_MODIFIED);
}



This reduces processing and bandwidth usage. Since it is a plain servlet filter, it can be used in combination with any web framework. An MD5 hash keeps the actual ETag only 32 characters long while ensuring that collisions are highly unlikely. A deeper ETag implementation, penetrating to the model layer for uniqueness, is also possible. It could be related to the revisions of row data, and matching on those for higher predictability of fewer data downloads would be an effective solution.
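A hedged sketch of that idea, assuming a Spring MVC controller and an entity carrying a revision (version) column; the Order and OrderService types, the getRevision accessor and the URL are all illustrative, not from the filter's source:

import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestHeader;
import org.springframework.web.bind.annotation.RequestMapping;

@Controller
public class OrderController {

    static class Order {                      // illustrative entity with a revision column
        private final long revision;
        Order(long revision) { this.revision = revision; }
        long getRevision() { return revision; }
    }

    interface OrderService {                  // illustrative lookup service
        Order find(long id);
    }

    private final OrderService orderService;

    public OrderController(OrderService orderService) {
        this.orderService = orderService;
    }

    @RequestMapping("/orders/{id}")
    public ResponseEntity<Order> getOrder(@PathVariable long id,
            @RequestHeader(value = "If-None-Match", required = false) String ifNoneMatch) {
        Order order = orderService.find(id);
        // Derive the validator from the row revision instead of hashing the body.
        String etag = "\"" + order.getRevision() + "\"";
        if (etag.equals(ifNoneMatch)) {
            return new ResponseEntity<Order>(HttpStatus.NOT_MODIFIED); // 304, body skipped
        }
        HttpHeaders headers = new HttpHeaders();
        headers.setETag(etag);
        return new ResponseEntity<Order>(order, headers, HttpStatus.OK);
    }
}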

As per the JSR 286 portlet specification, a portlet should set the ETag property (a validation token) and an expiration time when rendering. New render/resource requests will only be made after the expiration time is reached, and the new request is sent the ETag; the portlet should examine it and determine whether the cache is still good, and if so, set a new expiration time and not render. This part of the specification is implemented in Spring MVC (see JIRA).

A hypothetical model for REST responses using deeper ETags could be effective when an API is exposed or two applications are integrated. I have seen such an implementation using Python here.

When cloud computing is considered: when Amazon S3 receives a PUT request with a Content-MD5 header, it computes the MD5 of the object received and returns a 400 error if it doesn't match the MD5 sent in the header. Both Amazon and Azure use Content-MD5, which carries the base64-encoded 128-bit MD5 digest of the payload.
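Computing that header value is straightforward; a small sketch using only the JDK (java.util.Base64 requires Java 8+, and the payload here is a stand-in):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

public class ContentMd5 {
    public static void main(String[] args) throws Exception {
        byte[] body = "object payload".getBytes(StandardCharsets.UTF_8);

        // Content-MD5 is the base64 encoding of the raw 128-bit MD5 digest.
        byte[] digest = MessageDigest.getInstance("MD5").digest(body);
        String contentMd5 = Base64.getEncoder().encodeToString(digest);

        System.out.println("Content-MD5: " + contentMd5); // 24-character value
    }
}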

According to the article here, if for some reason an S3 entity is updated with exactly the same bits it previously had, the ETag will not change, but that's probably OK anyway.

According to S3 REST API,

Amazon S3 returns the first ten megabytes of the file, the ETag of the file, and the total size of the file (20232760 bytes) in the Content-Length field.

To ensure the file did not change since the previous portion was downloaded, specify the If-Match request header. Although the If-Match request header is not required, it is recommended for content that is likely to change.


The ETag directive in the HTTP specification gives developers a way to implement caching, which can be very effective at the transport level for REST services as well as web applications. The trade-off is that there may be security implications to having data reside at the transport level.

But in the case of static files that have a far-future "Expires" value and are served from a cluster, ETags will not be effective, because the checksum is typically unique per server, so a different validator is transported to the client for each GET request. By removing the ETag header, you prevent caches and browsers from validating files, forcing them to rely on your Cache-Control and Expires headers, and you also reduce the header size by dropping the checksum value.