
Creating an index in Hive


Simple:
CREATE INDEX idx ON TABLE tbl(col_name) AS 'Index_Handler_QClass_Name' IN TABLE tbl_idx;
To keep indexing algorithms pluggable, you name the class that handles the indexing, for example org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler.
Index handler classes implement the HiveIndexHandler interface.
Full Syntax:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
   [ ROW FORMAT ...] STORED AS ...
   | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"]
  • WITH DEFERRED REBUILD - the newly created index starts out empty; ALTER INDEX ... REBUILD brings it up to date (a worked example follows this list).
  • IDXPROPERTIES/TBLPROPERTIES - declare key=value properties on the index and on the index table respectively.
  • PARTITIONED BY - the table columns on which the index is partitioned; if not specified, the index spans all partitions of the base table.
  • ROW FORMAT - a custom SerDe or the native SerDe (Serializer/Deserializer used by Hive for reads/writes). The native SerDe is used if ROW FORMAT is not specified.
  • STORED AS - the storage format of the index table, such as RCFILE or SEQUENCEFILE. IN TABLE tbl_idx gives the index table an explicit, unique name; otherwise one is generated automatically. STORED BY - a non-native storage handler such as HBase (I haven't tried it).
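
For example, a compact index on column col_name of table tbl (names reused from the syntax above) can be created empty and then populated in a separate step. This is a sketch, assuming the base table already exists; the index name tbl_col_idx is made up for illustration:

CREATE INDEX tbl_col_idx
ON TABLE tbl (col_name)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IN TABLE tbl_idx;

-- populate (or later refresh) the index from the current table data
ALTER INDEX tbl_col_idx ON tbl REBUILD;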

The index data can live in a Hive table, or outside of one (for example as an RCFILE at an HDFS path); in the latter case the handler's usesIndexTable() method returns false. When the index is built, generateIndexBuildTaskList(...) in the handler class generates the plan for building the index.

Consider the CompactIndexHandler shipped with the Hive distribution.

For each indexed value it stores only the addresses of the HDFS blocks containing that value. In the index table's schema (its FieldSchema entries in the Hive metastore) these show up as the _bucketname and _offsets columns.

That is, for a single indexed column the index table contains three columns: the indexed column itself, _bucketname (the HDFS file of the base table/partition that holds the value) and _offsets (the block offsets inside that file).
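
Conceptually, rebuilding such an index boils down to a GROUP BY over Hive's virtual columns INPUT__FILE__NAME and BLOCK__OFFSET__INSIDE__FILE. The sketch below only approximates the idea and is not the exact query the handler generates; tbl, tbl_idx and col_name are the example names used earlier:

INSERT OVERWRITE TABLE tbl_idx
SELECT
  col_name,
  INPUT__FILE__NAME,                        -- becomes _bucketname
  collect_set(BLOCK__OFFSET__INSIDE__FILE)  -- becomes _offsets
FROM tbl
GROUP BY col_name, INPUT__FILE__NAME;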



See the code of CompactIndexHandler in the Hive source for the details.
   

What's Cascading about?




Cascading helps with manipulating data in Hadoop. It is a Java framework that abstracts MapReduce, letting you write programs that read and modify data inside Hadoop. It provides a programming API for defining and executing fault-tolerant data processing workflows, and a query-processing API with which developers can avoid writing raw MapReduce. Quite a number of DSLs are built on top of Cascading, most notably Cascalog (written in Clojure) and Scalding (written in Scala). Pig offers a similar data processing API, but in a more SQL-like style.








Terminology

Taps - source (input) and sink (output) data streams.
Tuple - a single row of data being processed, with named fields; comparable to a record in a result set. A series of tuples makes a stream, and all tuples in a stream have exactly the same fields.
Pipes - tie operations together and are applied to the streams coming from Taps. Chaining pipes together creates a pipe assembly; pipe assemblies are directed acyclic graphs.
Flows - reusable combinations of sources, sinks and pipe assemblies.
Cascade - a series of flows.

What operations are possible?

Relational - Join, Filter, Aggregate, etc.
Each - applies a function or filter to each individual tuple (row).
GroupBy - groups tuples on selected fields.
CoGroup - joins two or more tuple streams on common fields.
Every - applies an aggregator to every group produced by a GroupBy or CoGroup, i.e. to all tuples in a group at once.
SubAssembly - nests a reusable pipe assembly so it can be used as a single Pipe.

Internally, Cascading employs an intelligent planner to convert the pipe assembly into a graph of dependent MapReduce jobs that can be executed on a Hadoop cluster.
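
To make the terminology concrete, here is a word-count sketch against the classic Cascading 1.x API (Hfs, TextLine, RegexGenerator, GroupBy, Every, Count and FlowConnector are Cascading classes; the input/output paths and the Main class name are placeholders):

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class Main {
  public static void main(String[] args) {
    // Taps: where tuples come from (source) and where they go (sink)
    Tap source = new Hfs(new TextLine(new Fields("line")), "hdfs:/path/to/input");
    Tap sink = new Hfs(new TextLine(new Fields("word", "count")),
        "hdfs:/path/to/output", SinkMode.REPLACE);

    // Pipe assembly: Each splits every line into word tuples,
    // GroupBy groups them by word, Every counts each group
    Pipe assembly = new Pipe("wordcount");
    assembly = new Each(assembly, new Fields("line"),
        new RegexGenerator(new Fields("word"), "\\S+"));
    assembly = new GroupBy(assembly, new Fields("word"));
    assembly = new Every(assembly, new Count(new Fields("count")));

    // Flow: source + sink + assembly; the planner turns it into MapReduce jobs
    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, Main.class);
    Flow flow = new FlowConnector(properties).connect("word-count", source, sink, assembly);
    flow.complete();
  }
}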
 
What advantages does Cascading have over a plain MapReduce workflow? (Need to investigate!)

O Blimey! TED Talk 2023

The Prometheus film is going viral... like a fire that danced at the end of the match.

Aha! cybernetic life-forms...

The only "purpose" (in the biological sense) of this identity is to preserve its own existence in time, that is to survive in current, specific environmental conditions, as well as to produce as many copies of itself as possible. The entire network of negative feedback mechanisms is ultimately directed at the latter task. Within the cybernetic paradigm, however, reproduction is nothing but a positive feedback.

 - from Cybernetic Formulation of the Definition of Life