
Data and Brain

#bigdata

Came across an interesting presentation on Using Data to Understand Brain.

 


Is it possible to read your brain? hmmm

I am a little of two minds about these riddles....


Eventual Consistency

#distributed #nosql

Unicode features in various languages

Here’s what each language natively supports in its standard distribution.

Unicode        JavaScript        PHP       Go       Ruby     Python            Java      Perl
Internally     UCS-2 or UTF-16   UTF-8⁻    UTF-8    varies   UCS-2 or UCS-4    UTF-16    UTF-8⁺
Casefolding    none              simple    simple   full     none              simple    full
Casemapping    simple            simple    simple   full     simple            full      full

(The remaining rows of the original table - Identifiers, Graphemes, Normalization, UCA Collation, Named Characters, Properties - were mostly check/cross marks that did not survive the copy; see the slides referenced below for those.)


from Tom Christiansen's Unicode Support Shootout: The Good, the Bad, and the Mostly Ugly

Grapheme -  A grapheme is the smallest semantically distinguishing unit in a written language, analogous to the phonemes of spoken languages.



Casefolding - Unicode defines case folding through the three case-mapping properties of each character: uppercase, lowercase and titlecase. These properties relate all characters in scripts with differing cases to the other case variants of the character.



Case mapping - used to handle the mapping of uppercase, lowercase, and titlecase characters for a given language.

What is the difference between case mapping and case folding? Case mapping or case conversion is a process whereby strings are converted to a particular form—uppercase, lowercase, or titlecase—possibly for display to the user. Case folding is primarily used for caseless comparison of text, such as identifiers in a computer program, rather than actual text transformation. Case folding in Unicode is based on the lowercase mapping, but includes additional changes to the source text to help make it language-insensitive and consistent. As a result, case-folded text should be used solely for internal processing and generally should not be stored or displayed to the end user.



Normalization - Unicode has encoded many entities that are really variants of existing nominal characters. The visual representations of these characters are typically a subset of the possible visual representations of the nominal character.

UCA Collation - Collation is the general term for the process and function of determining the sorting order of strings of characters. It is a key function in computer systems; whenever a list of strings is presented to users, they are likely to want it in a sorted order so that they can easily and reliably find individual strings. Thus it is widely used in user interfaces. It is also crucial for databases, both in sorting records and in selecting sets of records with fields within given bounds. The Unicode Collation Algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which defines a customizable method to compare two strings. These comparisons can then be used to collate or sort text in any writing system and language that can be represented with Unicode.

Named Characters - Unicode characters are assigned a unique Name (na). The name, in English, is composed of A-Z capitals, 0-9 digits, - (hyphen-minus) and space. The Unicode Standard also specifies notational conventions for referring to sequences of characters (or code points) treated as a unit, using angle brackets surrounding a comma-delimited list of code points, code points plus character names, and so on. For example, a combining character sequence consisting of the letter "a" with a circumflex and an acute accent applied to it can be designated either way.


Properties - Each Unicode character belongs to a certain category. Unicode assigns character properties to each code point. These properties can be used to handle "characters" (code points) in processes such as line breaking, determining right-to-left script direction, or applying controls.
Perl looks cool!
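
For a quick feel of a few of these rows from the Java side, here is a small sketch using only standard JDK classes (java.text.Normalizer and java.text.Collator; the sample strings are arbitrary): full case mapping on uppercasing, NFC normalization, and locale-aware collation.

import java.text.Collator;
import java.text.Normalizer;
import java.util.Locale;

public class UnicodeDemo {
    public static void main(String[] args) {
        // Case mapping: uppercasing the German sharp s is a full (1-to-many) mapping.
        System.out.println("straße".toUpperCase(Locale.GERMAN));      // STRASSE

        // Normalization: precomposed U+00E9 vs "e" + combining acute accent (U+0301).
        String precomposed = "\u00e9";
        String decomposed = "e\u0301";
        System.out.println(precomposed.equals(decomposed));           // false
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                .equals(precomposed));                                // true

        // Collation: locale-aware ordering instead of raw code-point order.
        Collator en = Collator.getInstance(Locale.ENGLISH);
        System.out.println("Zebra".compareTo("apple") < 0);           // true (code-point order)
        System.out.println(en.compare("apple", "Zebra") < 0);         // true (collation order)
    }
}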






Machine generated data

At first, the term "machine-generated data" can be confusing. One would think all data, generated from one device or another, is ultimately provided by some innocent mortal in this so-called era of social media and big data. So there should be a clear distinction in the definitions. If a user enters some data in a form, it is not considered machine generated. But when the same application tracks the user's location and logs it to a remote server, that becomes machine-generated data.

Wikipedia says,

Machine-generated data (MGD) is the generic term for information which was automatically created from a computer process, application, or other machine without the intervention of a human. 



According to Monash Research,

In classical human-generated data, what’s recorded is the direct result of human choices. Somebody buys something, makes an inquiry about it, fills an order from inventory, makes a payment in return for the object, makes a bank deposit to have funds for the next purchase, or promotes a manager who’s been particularly successful at selling stuff. Database updates ensue. Computers memorialize these human actions more quickly and cheaply than humans carry them out. Plenty of difficulties can occur with that kind of automation — applications are commonly too inflexible or confusing — but keeping up with data volumes is generally the least of the problems.


So what are they? Are they streams of logs flowing through the information super waterway?


Maybe, until they get churned into some books or toilet rolls!

Application Logs - Logs generated by web or desktop applications. The server-side logs used for debugging and support tickets!

Call Detail Records - The ones recorded by your telecom company. They contain useful details of the call or service that passed through the switch, like the phone numbers involved, the call duration, etc. Needed for billing.

Web logs - used to count visitors; similar web analytics are done on these data.

Database Audit Logs - Auditing is enabled to watch for suspicious database activity; it is common that not much information is available to target specific users or schema objects.

OS logs - track crashes or errors.

There are many similar kinds of data generated by different applications and systems, like RFIDs, sensors, etc. These messages can then be mashed up. Machine data usually has a structure or format, and semantics based on the domain it comes from.

The growth of such data is fast and continuous. Since it is a stream of data, it is, like history, never changed; it is a record of events.

courtesy- link


Anyone tried iPhoneTracker?

courtesy- link

Geolocation and LBS do push a load of data. HTML5 does have geolocation functionality (even though you have the choice not to be tracked). A small page using the navigator.geolocation API can be used to test it.



Nodeable - Realtime Insights

#Nodeable is a good example of generating #insights from #bigdata or real-time trickle feeds. It uses Twitter's Storm as the engine for its StreamReduce processing. I signed up for a trial account to play around.



Insights like the "Most Active" metrics are generated for the Amazon Web Services status feed. The reports are generated and tagged in real time, and the Twitter follower counts are displayed.


It has only a basic set of connectors, but one can create custom connectors using its JSON schema. The outbound data can be pushed to your own Amazon S3 or Hadoop WebHDFS, which is good for private companies.




The GitHub/RSS stream is shown as an activity stream.



Sharing an interesting presentation on Storm real-time computation.
ETE 2012 - Nathan Marz on Storm from Chariot Solutions on Vimeo.

Hadoop meetup @inmobi Bangalore

Had a chance to attend the #hadoop #meetup today at #Inmobi Bangalore.

Arun Murthy and Suresh Srinivasan from Hortonworks made presentations on next-gen Hadoop and HDFS NameNode High Availability, respectively.

From Inmobi, there were presentations on real-time analytics done on HBase and on Ivory, an open-source feed processing platform, by Srikanth.









Dream On!

Creating index in Hive


Simple:
CREATE INDEX idx ON TABLE tbl(col_name) AS 'Index_Handler_QClass_Name' IN TABLE tbl_idx;
To make indexing algorithms pluggable, one has to mention the associated class name that handles the indexing, e.g. org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler.
The index handler classes implement HiveIndexHandler.
Full Syntax:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
   [ ROW FORMAT ...] STORED AS ...
   | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"]
  • WITH DEFERRED REBUILD - the newly created index is initially empty; REBUILD can be used later to bring the index up to date (see the sketch after this list).
  • IDXPROPERTIES/TBLPROPERTIES - for declaring key/value properties of the index and the index table.
  • PARTITIONED BY - the table columns by which the index gets partitioned; if not specified, the index spans all table partitions.
  • ROW FORMAT - a custom SerDe or the native SerDe (Serializer/Deserializer for Hive read/write). The native SerDe is used if ROW FORMAT is not specified.
  • STORED AS - the index table storage format, like RCFILE or SEQUENCEFILE. A unique tbl_idx name (IN TABLE) is required to qualify the index table across tables; otherwise one is named automatically. STORED BY - can be HBase (I haven't tried it).
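
To put the syntax together, here is a rough sketch that creates and then rebuilds a compact index over Hive JDBC. It is untested; the table, column and index names are made up, and the URL assumes a local HiveServer2 with the standard Hive JDBC driver on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateHiveIndex {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {
            // WITH DEFERRED REBUILD: the index table sales_dt_idx starts out empty.
            stmt.execute("CREATE INDEX ix_sales_dt ON TABLE sales (sale_date) "
                    + "AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' "
                    + "WITH DEFERRED REBUILD "
                    + "IN TABLE sales_dt_idx");
            // REBUILD populates the index from the current contents of the base table.
            stmt.execute("ALTER INDEX ix_sales_dt ON sales REBUILD");
        }
    }
}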

The index can be stored in a Hive table or as an RCFILE in an HDFS path, etc. In the latter case, the index handler class's usesIndexTable() method will return false. When an index is created, the generateIndexBuildTaskList(...) method in the index handler class generates a plan for building the index.

Consider the CompactIndexHandler from the Hive distribution:

It only stores the addresses of the HDFS blocks containing the indexed value. The index is stored in the Hive metastore FieldSchema as _bucketname and _offsets columns in the index table.

That is, the index table contains three columns: the indexed columns (the unparsed column names from the field schema), _bucketname (the table partition HDFS file containing the value), and _offsets (the block offsets).



See the code of CompactIndexHandler in the Hive source for the details.

What's it about Cascading?




Cascading helps with manipulating data in Hadoop. It is a framework written in Java which abstracts MapReduce and lets you write scripts that read and modify data inside Hadoop. It provides a programming API for defining and executing fault-tolerant data processing workflows, and a query processing API with which developers can avoid writing MapReduce directly. There are quite a number of DSLs built on top of Cascading, most notably Cascalog (written in Clojure) and Scalding (written in Scala). Pig offers a similar data processing API, but it is more SQL-like.








Terminology

Taps - source (input) and sink (output) streams
Tuple - can be considered as a row of a result set: a single row with named columns of the data being processed. A series of tuples makes a stream. All tuples in a stream have exactly the same fields.
Pipes - tie operations together and are executed upon a Tap. A pipe assembly is created when pipes are successively chained together. Pipe assemblies are directed acyclic graphs.
Flows - reusable combinations of sources, sinks and pipe assemblies.
Cascade - a series of flows

What operations are possible?

Relational - Join, Filter, Aggregate etc
Each - for each row result (tuple)
Group - Groupby
CoGroup - joins for tuples
Every - applied to every key in a Group or CoGroup, like an aggregate function applied to all tuples in a group at once
SubAssembly - nesting reusable pipe assemblies into a Pipe

Internally, Cascading employs an intelligent planner to convert the pipe assembly into a graph of dependent MapReduce jobs that can be executed on a Hadoop cluster.
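
For a feel of the API, here is a rough, untested word-count sketch against the Cascading 2.x classes (the input/output paths and field names are made up): a source Tap and a sink Tap are tied together through Each, GroupBy and Every, and the planner turns the assembly into MapReduce jobs.

import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCount {
    public static void main(String[] args) {
        String inPath = args[0], outPath = args[1];

        // Taps: where the tuple stream comes from and where it goes.
        Tap source = new Hfs(new TextLine(new Fields("line")), inPath);
        Tap sink = new Hfs(new TextLine(), outPath, SinkMode.REPLACE);

        // Pipe assembly: split each line into words, group by word, count each group.
        Pipe assembly = new Pipe("wordcount");
        assembly = new Each(assembly, new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"));
        assembly = new GroupBy(assembly, new Fields("word"));
        assembly = new Every(assembly, new Count(new Fields("count")));

        // The planner decides how many MapReduce jobs this assembly needs.
        Flow flow = new HadoopFlowConnector().connect("wordcount", source, sink, assembly);
        flow.complete();
    }
}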
 
What advantages does Cascading have over a normal MapReduce workflow? (Need to investigate!)

O Blimey! TED Talk 2023

The Prometheus film, going viral... like a fire that danced at the end of the match.

Aha! cybernetic life-forms...

The only "purpose" (in the biological sense) of this identity is to preserve its own existence in time, that is to survive in current, specific environmental conditions, as well as to produce as many copies of itself as possible. The entire network of negative feedback mechanisms is ultimately directed at the latter task. Within the cybernetic paradigm, however, reproduction is nothing but a positive feedback.

-from Cybernetic Formulation of the Definition of Life

 

Tinker, Tailor, Soldier, Spy and The Perspicacious "Collusion"

Collusion!

A secret agreement between two or more parties for a fraudulent, illegal, or deceitful purpose.

In this battleground of privacy wars and illusionary consumer willpower, there comes another wizard to show you the goblins who steal your data... Collusion from Mozilla.

Collusion

Collusion is an experimental add-on for Firefox and allows you to see all the third parties that are tracking your movements across the Web. It will show, in real time, how that data creates a spider-web of interaction between companies and other trackers.

Oh yeah, thanks Mozilla, for helping us find the hooligans stealing our cookies! Yeah, we can now haplessly stare at the red devils and haloed thieves.

What the heck! We don't have time for tracking everything in our life. Anyway, the stuff looks cool... collusion, interesting word.

The mythical unstructured data!

As the semantic web and big data integration gain their fus-ro-dah, enterprises are finding ways to harness any available form of information swarming the web and the world.

I came across some interesting articles which give a concise idea of harnessing metadata from unstructured data....

Lee Dallas says

In some respects it is analogous to hieroglyphics where pictographs carry abstract meaning.  The data may not be easily interpretable by machines but document recognition and capture technologies improve daily. The fact that an error rate still exists in recognition does not mean that the content lacks structure.  Simply that the form it takes is too complex for simple processes to understand.

more here : http://bigmenoncontent.com/2010/09/21/the-myth-of-unstructured-data/

A lot of data growth is happening around these so-called unstructured data types. Enterprises which manage to automate the collection, organization and analysis of these data types, will derive competitive advantage.
Every data element does mean something, though what it means may not always be relevant for you.

more here : http://bigdataintegration.blogspot.in/2012/02/unstructured-data-is-myth.html

 

 

Consistent Hashing

What is a consistent hash function?

A consistent hash function is one which changes minimally as the range of the function changes.

What's the advantage of such functions?

This is ideal when the set of buckets changes over time. Two users with inconsistent but overlapping sets of buckets will map items to the same bucket with high probability. So it eliminates the need to "maintain" a consistent "state" among all nodes in a network. The algorithm can be used for making consistent assignments or relationships between different sets of data in such a way that if we add or remove items, the algorithm can be recalculated on any machine and produce the same results.

Theory

A view V is the set of buckets a user is aware of. A client uses a consistent hash function, f(V, i), to map an object to one of the buckets in the view. Say we assign each hash bucket to a random point on a mod 2^n circle (virtually!), where n is the hash key size. The hash of an object is the closest clockwise bucket, so only a small set of buckets lies near each object. In this scheme, all the buckets get roughly the same number of items, and when the kth bucket is added only a 1/k fraction of the items move. This means that when a new node is added only a minimal reshuffle is needed, which is the advantage of having a view. There can be a hash structure for the key lookup (a balanced tree) which stores the hashes of all nodes (in the view). When a new node is added, its hash value is added to that structure.

Suppose there are two nodes, A and B, and three objects, 1–3 (mapped to a hash function's result range). Objects 3 and 1 are mapped to node A, object 2 to node B. When a node leaves the system, its data gets mapped to the adjacent node (in the clockwise direction), and when a node enters the system, it gets hashed onto the ring and takes over objects.



As an example (refer link1, link2), the circle denotes a range of key values. Say the points on the circle represent 64-bit numbers. Hash the data to get a 64-bit number, which is a point on the circle. Take the IPs of the nodes and hash them into 64-bit numbers, which are also points on the circle. Associate the data with the node in the clockwise direction (i.e. the closest one, which can be retrieved from the hash structure). When a new node is inserted into the ring, the data is still always assigned to the closest node. Everything between this node's number and the next number on the ring that was previously picked by a different node now belongs to this node.

The basic idea of a consistent hash function is to hash both objects and buckets using the same function. It's one of the best ways to implement APIs that can dynamically scale out and rebalance. Client applications can calculate which node to contact in order to read or write the data, with no metadata server required.
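
As a minimal sketch of such a ring in Java (in the spirit of the references at the end of this post; the MD5-based hash, the virtual-node count and any node names are illustrative choices, not from any particular system):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // Each node is hashed onto the ring at several virtual points for better balance.
    public void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++)
            ring.put(hash(node + "#" + i), node);
    }

    public void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++)
            ring.remove(hash(node + "#" + i));
    }

    // Walk clockwise: the first ring position >= hash(key) owns the key,
    // wrapping around to the smallest position if necessary.
    public String nodeFor(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no nodes");
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            // Use the first 8 bytes of the digest as a 64-bit ring position.
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

Typical usage would be addNode(...) for each server address and nodeFor(key) on every request; adding or removing a node then remaps only the keys that hashed near that node's points, which is exactly the property the systems below rely on.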


Used by

memcached cluster.
Typically, multiple memcached daemons are started, on different hosts. The clients are passed a list of memcached addresses (IP address and port) and pick one daemon for a given key. This is done via consistent hashing, which always maps the same key K to the same memcached server S. When a server crashes, or a new server is added, consistent hashing makes sure that the ensuing rehashing is minimal. Which means that most keys still map to the same servers, but keys hashing to a removed server are rehashed to a new server. - from A memcached implementation in JGroups


Amazon's Dynamo uses consistent hashing along with replication as a partitioning scheme.
Data is partitioned and replicated using consistent hashing [10], and consistency is facilitated by object versioning [12]. The consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol. - from Dynamo: Amazon's Highly Available Key-value Store

Data of a Cassandra table gets partitioned and distributed among the nodes by a consistent hashing function.
Cassandra partitions data across the cluster using consistent hashing [11] but uses an order preserving hash function to do so. In consistent hashing the output range of a hash function is treated as a circular space or "ring" (i.e. the largest hash value wraps around to the smallest hash value). Each node in the system is assigned a random value within this space which represents its position on the ring. Each data item identified by a key is assigned to a node by hashing the data item's key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item's position. This node is deemed the coordinator for this key. The application specifies this key and the Cassandra uses it to route requests. Thus, each node becomes responsible for the region in the ring between it and its predecessor node on the ring. The principal advantage of consistent hashing is that departure or arrival of a node only affects its immediate neighbors and other nodes remain unaffected. - from Cassandra - A Decentralized Structured Storage System
Voldemort uses automatic sharding of data. Nodes can be added or removed from a database cluster, and the system adapts automatically. Voldemort automatically detects and recovers failed nodes. [refer]

References:
http://www.akamai.com/dl/technical_publications/ConsistenHashingandRandomTreesDistributedCachingprotocolsforrelievingHotSpotsontheworldwideweb.pdf
http://sharplearningcurve.com/blog/2010/09/27/consistent-hashing/
http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html




About the Bulk Synchronous Parallel (BSP) model

As an alternative to the MapReduce paradigm, there is another parallel computing model called Bulk Synchronous Parallel (BSP). A BSP computer is defined as a set of processors with local memory, interconnected by a communication mechanism (e.g., a network or shared memory) capable of point-to-point communication, and a barrier synchronization mechanism. It differentiates/decouples the use of local memory from that of remote memory. A BSP program consists of a set of BSP processes and a sequence of super-steps: time intervals bounded by barrier synchronization. Each processor has its own local memory module, and all other memories are non-local and accessed over the network. Communication between processors is non-blocking.

The essence of the BSP model is the super-step. At the start of a super-step, computations are done locally. Then, using the messaging system of the network, the other processes can handle requests for further computation. Communication and synchronization are decoupled: there is a barrier synchronization at which the processors wait until all communications are completed. When all processes have invoked the sync method and all messages are delivered, the next super-step begins, and the messages sent during the previous super-step can be accessed by their recipients.

Data locality is an inherent part of this model, in which communication happens only when the peer's data is necessary. This is different from MapReduce frameworks, which do not preserve data locality across consecutive operations. MapReduce processing generally passes input data through many passes or iterations of MapReduce in order to derive the final result, which adds communication cost on top of the processing cost. So BSP is useful for programs requiring iteration and recursion.
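
To make the super-step loop concrete, here is a toy sketch in plain Java threads (an illustration of the model, not the Hama or Pregel API; the worker count, message contents and double-buffered inboxes are all made up for the example):

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;

public class ToyBsp {
    static final int WORKERS = 3;
    static final int SUPERSTEPS = 4;
    // inbox[worker][step % 2]: double-buffered so messages sent in step S are read in step S+1.
    static final ConcurrentLinkedQueue<String>[][] inbox = newInboxes();
    static final CyclicBarrier barrier = new CyclicBarrier(WORKERS);

    public static void main(String[] args) {
        for (int id = 0; id < WORKERS; id++) {
            final int worker = id;
            new Thread(() -> runWorker(worker)).start();
        }
    }

    static void runWorker(int id) {
        try {
            for (int step = 0; step < SUPERSTEPS; step++) {
                // 1. Local computation: consume the messages delivered at the last barrier.
                ConcurrentLinkedQueue<String> current = inbox[id][step % 2];
                for (String msg; (msg = current.poll()) != null; )
                    System.out.println("worker " + id + ", superstep " + step + ", got: " + msg);

                // 2. Non-blocking communication: send into the next worker's *next* buffer.
                int next = (id + 1) % WORKERS;
                inbox[next][(step + 1) % 2].add("hello from " + id + " in superstep " + step);

                // 3. Barrier synchronization: the super-step ends when every worker arrives.
                barrier.await();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @SuppressWarnings("unchecked")
    static ConcurrentLinkedQueue<String>[][] newInboxes() {
        ConcurrentLinkedQueue<String>[][] q = new ConcurrentLinkedQueue[WORKERS][2];
        for (int i = 0; i < WORKERS; i++)
            for (int j = 0; j < 2; j++)
                q[i][j] = new ConcurrentLinkedQueue<>();
        return q;
    }
}

Each worker reads the messages delivered at the previous barrier, does its local work, sends without blocking, and then waits at the barrier; the double buffering is what makes a message sent in super-step S visible only in super-step S+1.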



Apache Hama is one such project, enabling Hadoop to leverage BSP. Google Pregel uses BSP for large-scale mining of graphs.

reference:

http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
http://incubator.apache.org/hama/