
Creating index in Hive


Simple:
CREATE INDEX idx ON TABLE tbl(col_name) AS 'Index_Handler_QClass_Name' IN TABLE tbl_idx;
To allow pluggable indexing algorithms, you specify the fully qualified class name of the handler that implements the index, for example org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler.
Index handler classes implement the HiveIndexHandler interface.
Full Syntax:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
   [ ROW FORMAT ...] STORED AS ...
   | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"]
  • WITH DEFERRED REBUILD - the newly created index starts out empty; ALTER INDEX ... REBUILD brings it up to date with the base table.
  • IDXPROPERTIES/TBLPROPERTIES - key/value properties for the index and for the index table.
  • PARTITIONED BY - the table partition columns over which the index is partitioned; if not specified, the index spans all table partitions.
  • ROW FORMAT - a custom SerDe, or the native SerDe (Serializer/Deserializer, used by Hive to read and write rows). The native SerDe is used if ROW FORMAT is not specified.
  • STORED AS - the index table's storage format, such as RCFILE or SEQUENCEFILE. IN TABLE lets you give the index table a qualified name of your own; otherwise it is named automatically.
  • STORED BY - store the index in a non-native table such as HBase (I haven't tried it).
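Putting the clauses together, a full example might look like this (a sketch: the table and column names are hypothetical; the handler class is the CompactIndexHandler from the Hive distribution):

```sql
-- Hypothetical base table
CREATE TABLE sales (id INT, region STRING, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Compact index on the region column, stored in its own index table
CREATE INDEX sales_region_idx
ON TABLE sales (region)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IN TABLE sales_region_idx_tbl
STORED AS RCFILE
COMMENT "Compact index on sales.region";

-- The index starts empty because of WITH DEFERRED REBUILD;
-- rebuild it to make it reflect the current table data
ALTER INDEX sales_region_idx ON sales REBUILD;
```

Note that the REBUILD has to be rerun whenever the base table's data changes, since the index does not update itself.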

The index can be stored in a Hive table, or externally, for example as an RCFILE at an HDFS path. In the latter case, the index handler's usesIndexTable() method returns false. When the index is created, generateIndexBuildTaskList(...) in the handler class generates a plan for building the index.

Consider the CompactIndexHandler from the Hive distribution.

For each value of the indexed columns, it stores only the addresses of the HDFS blocks containing that value. The index table's schema is registered in the Hive metastore as FieldSchema entries, with the columns _bucketname and _offsets.

That is, the index table contains the indexed columns themselves, plus _bucketname (the HDFS file of the table partition containing the value) and _offsets (the block offsets within that file).
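You can inspect that layout directly on the index table (a sketch; the index table name here is hypothetical):

```sql
-- Shows the indexed column(s) followed by _bucketname and _offsets
DESCRIBE sales_region_idx_tbl;

-- Each row maps an indexed value to the HDFS file and the block
-- offsets where that value occurs
SELECT * FROM sales_region_idx_tbl LIMIT 10;
```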



See the code of CompactIndexHandler in the Hive source for the details.
   

What I think about event processing...

An amateur thought.

I think the most interesting area of information processing is event processing. Most large-scale enterprise applications are based on event-driven architecture. Event-based information processing is an advanced area I haven't gone through yet, but reading about it I found it really interesting. State models, lexical analysis, reactor patterns, callback event models and the like are used behind it.

Event-driven design is an approach to program design that focuses on events to which a program reacts; the event handlers registered for those events respond to them. This is the fundamental of any GUI-based application: an event listener is attached to a button, and a handler responds to its events. I think it is the basic underlying architecture of any responsive application. If you have worked on a 3D application, events on the 3D positions of polygons have to be registered; every movement in space can trigger an event... good gaming.

To think bigger, consider an online stock viewer: stock movements are reflected in real time. Most of us know about the Ajax-based technology that is popular behind the dynamic graphs, but what about the complex business logic? A rule engine defines a set of rules to act according to changes in input. I can compare such a system to the stimulus response of an organism. In the human brain, predefined genetic rules let us adapt to an ever-changing environment; a stimulus can be sudden or gradual, depending on the inputs. And what about pattern recognition? The human brain is highly sophisticated... mmm, I am boring you now.

For real-time processing, I like to refer to Complex Event Processing (CEP), a technology for low-latency filtering, correlating, aggregating, and computing on real-world event data. What if this complex event processing were enabled across a network? Toward a collective intelligence? I read that context-based switches are now implemented in CDNs. Whatever...
It's really complex and interesting. No wonder the huge amount of data on the web can be used for social "business" intelligence. CEP actually builds on what business intelligence (BI), service-oriented architecture (SOA), cloud computing, and business process modeling (BPM) provide. Mashup technologies along with the semantic web can provide more granular data, which is where most technology-based products are moving. Some people say SOA, some WOA, SaaS, cloud, and so on.

Consider NASA satellite data: huge amounts of data gushing from satellites through the channels are processed with various image-processing and signal-processing algorithms. What about all the RFID-based data affecting supply chain tracking? What if we tracked every consumption of fuel in the world in real time using GPS trackers and sensors? What about the streams of data processed by supercomputers for weather forecasting based on certain models? They are crucial and brain-forging. That's how information technology becomes the backbone and most sophisticated part of human civilization.

It's all about data, and the network is the computer!

Maybe we are trying to build a system as efficient and fast as our brain; at the very least, the model for all these logical applications is the expert system. Why should I write about things that are very complex to me... I am no expert in all this, I just blogged out of curiosity. There are basics to learn...

There is a good article on Wikipedia about CEP:

http://en.wikipedia.org/wiki/Complex_Event_Processing

Another article on InfoQ:

others... Link Link .

An article on NASA funded CEP project Link