
Data and Brain

#bigdata

Came across an interesting presentation on Using Data to Understand Brain.

 


Is it possible to read your brain? hmmm

I am of two minds about these riddles....


Nodeable - Realtime Insights

#Nodeable is a good example of generating #insights from #bigdata, or real-time trickle feeds. It uses Twitter's Storm as the processing engine behind its StreamReduce offering. I signed up for a trial account to play around with it.



Insights such as the "Most Active" metrics are generated for Amazon Web Services status. The reports are generated and tagged in real time, and Twitter follower counts are displayed.


It has only a basic set of connectors, but one can create custom connectors using its JSON schema. The outbound data can be pushed to your own Amazon S3 bucket or to Hadoop WebHDFS, which is good for private companies.
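Just to make the WebHDFS part concrete, here is a rough sketch of how an outbound push could look (this is my own illustration, not Nodeable's connector; the name node host, port, and file path are placeholders). WebHDFS answers a CREATE with a redirect to a data node, and the bytes are then PUT to that location:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsPush {
    public static void main(String[] args) throws Exception {
        // Placeholder name node host/port and target file path
        String createUrl = "http://namenode:50070/webhdfs/v1/streams/events.json?op=CREATE&overwrite=true";

        // Step 1: the name node answers the CREATE with a redirect to a data node
        HttpURLConnection nn = (HttpURLConnection) new URL(createUrl).openConnection();
        nn.setRequestMethod("PUT");
        nn.setInstanceFollowRedirects(false);
        nn.getResponseCode();                                  // expect 307 Temporary Redirect
        String dataNodeUrl = nn.getHeaderField("Location");
        nn.disconnect();

        // Step 2: PUT the actual bytes to the returned data node location
        HttpURLConnection dn = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
        dn.setRequestMethod("PUT");
        dn.setDoOutput(true);
        OutputStream out = dn.getOutputStream();
        out.write("{\"event\":\"most-active\",\"source\":\"aws-status\"}".getBytes("UTF-8"));
        out.close();
        System.out.println("WebHDFS create returned " + dn.getResponseCode()); // 201 Created on success
    }
}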




The GitHub/RSS stream is shown as an activity stream.



Sharing an interesting presentation on Storm real-time computation:
ETE 2012 - Nathan Marz on Storm from Chariot Solutions on Vimeo.

ETags - Roles in Web Applications and Cloud Computing

A web server returns a value in the response header known as an ETag (entity tag), which helps the client know whether the content at a requested URL has changed. When a page is loaded in the browser, it is cached along with its ETag. On the next request, the browser sends that ETag value in the "If-None-Match" request header. The server reads this header value and compares it with the ETag of the current page. If the values are the same, i.e. the content has not changed, a status code 304 (Not Modified) is returned. This HTTP metadata can be put to good use for predicting page downloads and thereby optimizing the bandwidth used. A combination of a checksum (MD5) of the data as the ETag value and a correct modification timestamp could give a better result in predicting re-downloads. An analysis of the effectiveness of choosing the value of the ETag is described in this paper.
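To see the handshake in action, here is a small sketch of a conditional GET using plain java.net (the URL is just a placeholder):

import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalGet {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/page.html"); // placeholder URL

        // First request: the server returns the body plus an ETag header; a browser would cache both
        HttpURLConnection first = (HttpURLConnection) url.openConnection();
        String etag = first.getHeaderField("ETag");
        first.getInputStream().close();

        // Second request: send the cached ETag back in If-None-Match
        HttpURLConnection second = (HttpURLConnection) url.openConnection();
        if (etag != null) {
            second.setRequestProperty("If-None-Match", etag);
        }
        int status = second.getResponseCode(); // 304 if the content has not changed
        System.out.println("ETag=" + etag + ", status=" + status);
    }
}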

According to http://www.mnot.net/cache_docs/

A resource is eligible for caching if:

  • There is caching information in the HTTP response headers
  • The response is not secure (HTTPS responses won't be cached)
  • An ETag or Last-Modified header is present
  • The cached representation is still fresh

Entity tags can be strong or weak validators. A strong validator guarantees uniqueness of the representation: if we use MD5 or SHA-1, the entity value changes when even one bit of the data changes, while a weak value changes only when the meaning of an entity (which can be a set of semantically related representations) changes.

More info on conditional requests, explaining strong and weak ETags, is here.

In Spring MVC, support for ETags is provided by the servlet filter ShallowEtagHeaderFilter. If you look at the source here:

String responseETag = generateETagHeaderValue(body);
// ... ...

protected String generateETagHeaderValue(byte[] bytes) {
    // Builds a value like "0<md5-hex-of-the-response-body>"
    StringBuilder builder = new StringBuilder("\"0");
    Md5HashUtils.appendHashString(bytes, builder);
    builder.append('"');
    return builder.toString();
}


The default implementation generates an MD5 hash of the rendered response body (e.g. the JSP output). So whenever the same page is requested, the filter checks this hash against the If-None-Match header and, if they match, a 304 is sent back:


String requestETag = request.getHeader(HEADER_IF_NONE_MATCH);
if (responseETag.equals(requestETag)) {
    if (logger.isTraceEnabled()) {
        logger.trace("ETag [" + responseETag + "] equal to If-None-Match, sending 304");
    }
    response.setStatus(HttpServletResponse.SC_NOT_MODIFIED);
}



This reduces processing and bandwidth usage. Since it is a plain servlet filter, it can be used in combination with any web framework. An MD5 hash keeps the actual ETag only 32 characters long while making collisions highly unlikely. A deeper ETag implementation that reaches down to the model layer for uniqueness is also possible; it could be related to the revisions of the row data. Matching those revisions, for a higher chance of avoiding re-downloads of data, would be an effective solution.
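For example, a minimal sketch of what such a deeper ETag could look like in a Spring MVC controller, assuming a hypothetical Book entity that carries a revision column (the entity and service below are stand-ins, only there to keep the sketch self-contained):

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;

@Controller
public class BookController {

    // Minimal stand-ins for a real entity/repository
    static class Book { long revision = 7; }
    static class BookService { Book find(long id) { return new Book(); } }

    private final BookService bookService = new BookService();

    @RequestMapping(value = "/books/{id}", method = RequestMethod.GET)
    public String show(@PathVariable long id, HttpServletRequest request,
                       HttpServletResponse response, Model model) {
        Book book = bookService.find(id);
        // The ETag comes from the row revision, not from an MD5 of the rendered page
        String etag = "\"" + book.revision + "\"";
        if (etag.equals(request.getHeader("If-None-Match"))) {
            response.setStatus(HttpServletResponse.SC_NOT_MODIFIED);
            return null; // nothing changed, skip view rendering entirely
        }
        response.setHeader("ETag", etag);
        model.addAttribute("book", book);
        return "books/show";
    }
}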

As per the JSR 286 portlet specification, a portlet should set the ETag property (validation token) and an expiration time when rendering. New render/resource requests will only be made after the expiration time is reached, and the new request will carry the ETag. The portlet should examine it and determine whether the cache is still good; if so, it sets a new expiration time and does not render. This part of the specification is implemented in Spring MVC (see JIRA).

A hypothetical model for REST responses using deeper ETags could be effective when an API is exposed or two applications are integrated. I have seen such an implementation using Python here.

Coming to cloud computing: when Amazon S3 receives a PUT request with the Content-MD5 header, S3 computes the MD5 of the object it received and returns a 400 error if it doesn't match the MD5 sent in the header. Both Amazon and Azure use Content-MD5, which carries the base64-encoded 128-bit MD5 digest of the body.
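For reference, a small sketch of computing that header value in Java (the payload is made up; java.util.Base64 needs Java 8, older code would typically use commons-codec instead):

import java.security.MessageDigest;
import java.util.Base64;

public class ContentMd5 {
    public static void main(String[] args) throws Exception {
        byte[] body = "hello s3".getBytes("UTF-8");
        byte[] digest = MessageDigest.getInstance("MD5").digest(body); // 128-bit digest
        // S3 expects the base64-encoded digest in the Content-MD5 request header
        String contentMd5 = Base64.getEncoder().encodeToString(digest);
        System.out.println("Content-MD5: " + contentMd5);
    }
}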

According to the article here, if for some reason the entity in S3 was updated with the exact same bits that it previously had, the ETag will not have changed, but then, that's probably OK anyway.

According to S3 REST API,

Amazon S3 returns the first ten megabytes of the file, the ETag of the file, and the total size of the file (20232760 bytes) in the Content-Length field.

To ensure the file did not change since the previous portion was downloaded, specify the If-Match request header. Although the If-Match request header is not required, it is recommended for content that is likely to change.
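A rough sketch of such a ranged download (the object URL, byte range, and the ETag recorded from the first chunk are placeholders; a private bucket would also need request signing):

import java.net.HttpURLConnection;
import java.net.URL;

public class RangedDownload {
    public static void main(String[] args) throws Exception {
        // Placeholder object URL and the ETag remembered from the first chunk
        URL url = new URL("https://mybucket.s3.amazonaws.com/bigfile.bin");
        String etagFromFirstChunk = "\"d41d8cd98f00b204e9800998ecf8427e\"";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Ask for the second 10 MB chunk only
        conn.setRequestProperty("Range", "bytes=10485760-20971519");
        // Fail fast (412 Precondition Failed) if the object changed since the first chunk
        conn.setRequestProperty("If-Match", etagFromFirstChunk);

        int status = conn.getResponseCode(); // 206 Partial Content while the ETag still matches
        System.out.println("Status: " + status);
    }
}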


The ETag directive in the HTTP specification gives developers a way to implement caching, which can be very effective at the transport level for REST services as well as web applications. The trade-off is that there may be security implications to having data reside at the transport level.

But in the case of static files that have a far-future "Expires" value and are served from a cluster, ETags will not be effective, because the checksum generated for a file can differ from server to server, so a new value is sent to the client on each GET request. By removing the ETag header, you prevent caches and browsers from validating files, so they are forced to rely on your Cache-Control and Expires headers, and you also reduce the header size by dropping the checksum value.

An account of Open Source Summit 2008 Hyderabad

This weekend I attended the Open Source Summit held at IIIT Hyderabad on the 13th and 14th (I was unable to attend the event on the second day :( ). On Saturday morning I took the MMTS from the station near home to HafeezPet and from there reached IIIT by auto around 11 am.


The first session I attended was on BeleniX, the OpenSolaris LiveCD project, by Moinak Ghosh. I arrived at the conference room while the presentation was halfway through. The presenter was upgrading OpenSolaris, and while the upgrade was running, other applications were being executed!! He was explaining how OpenSolaris and ZFS are useful in a production-ready environment. He demonstrated creating separate snapshots. He explained DTrace, which can be used to dynamically inject debug code while an application is running (it can also be used for debugging the kernel). He explained the difference between zones in OpenSolaris and virtualization, the concept of a RAMDisk, etc. The session was good, as practical samples were demonstrated.


The next session, which was even more interesting, was by Mahesh Patil from the National Ubiquitous Computing Research Centre, CDAC, on Embedded Linux and Real-Time Operating Systems. I really enjoyed it and understood the technology. When I was in college (MCET Trivandrum) we used to conduct a lot of seminars; sensor networks and nanotechnology were the most presented topics in those days. But this session was a great experience as he had something cool to show. He had a board with an ARM processor and demonstrated loading Linux onto it. He explained toolchains and how they are used, packaging the kernel images, etc. He described how an embedded OS is different from an RTOS and the preemptive nature of an RTOS. An RTOS can use a dual-kernel approach in which interrupt-handling latency is reduced by one kernel handling interrupts and the other handling everything else; the core kernel operations are given a lower priority than the tasks that must be executed with higher priority in the queue. I came to know that most embedded Linux systems follow POSIX compliance, but in Japan it is MicroItron. He talked about eCos, a configurable OS that can be configured for embedded or real-time use. Then about the Smart Dust project, a cool futuristic technology: tiny devices floating around that communicate within a small range and sleep most of the time. I was wondering how huge the data produced by these devices will be. Think about real-time heat maps of the different boxes holding vaccines that are distributed around the world! (Pharmaceutical companies now keep a device inside the package to record the temperature when it was packed and check the change in temperature when it is opened.) I also came to know about the 3KB Linux, TinyOS! Cool and simple... even though I am not from an electronics background...


Then on to the stage came a geek, Arun Raghavan from Nvidia. He is a developer in the Gentoo Linux community. I hadn't tried this Linux variant before. It's a Linux for developers!! Any application can be customized for performance and maintainability by creating ebuilds, which makes it very flexible for developers. I think it will have quite a learning curve, as most of the installation and customization is done by the user. He demonstrated creating ebuilds for gtwitter, a Twitter client, and showed the ease of using Portage, the package management system used by Gentoo Linux. Visit Gentoo.org to know more about this Linux. I really liked the article written by Daniel Robbins (architect of Gentoo Linux) about its birth; read it here.


I attended another session, on Hadoop, by Venkatesh from the Yahoo research team. Hadoop is an open-source project for large data centers. I was looking forward to this presentation as it is about Web 2.0 (cloud computing) and large-scale computing (blogged before). It is a framework written in Java that supports data-intensive distributed applications. To know more about large-scale data processing using Hadoop you can read this paper. It has a file system called HDFS (a pure-Java file system!!) that stores replicated data as chunks across unRAIDed SATA disks. There is a name node and a cluster of data nodes, like a master-slave system; the name node stores the metadata that maps files to the blocks spread across the data nodes. The concept is similar to the Google File System and its cluster features; more about the concept here (Nutch) and here. This framework can be used for processing high-volume data and, integrated with Lucene, will help to create a quality search engine of our own. It is used by Facebook; one of the Engineering @ Facebook Notes explains why Hadoop was integrated, read here. It is also used by IBM (IBM MapReduce Tools for Eclipse), Amazon, Powerset (which was acquired by Microsoft recently), Last.fm... Read more about Hadoop in data-intensive scalable computing. Related projects: Mahout (machine learning libraries) and Tashi (a cluster management system).
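As a side note, the classic word count is a compact illustration of the map/reduce style Hadoop encourages; this is just a generic sketch against the org.apache.hadoop.mapreduce API, not anything shown at the talk:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in its input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}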


So it was worth it, as I was able to attend these sessions.... Thanks to twincling.org