
Showing posts with label indexing. Show all posts

Creating index in Hive


Simple:
CREATE INDEX idx ON TABLE tbl(col_name) AS 'Index_Handler_QClass_Name' IN TABLE tbl_idx;
To keep indexing algorithms pluggable, the statement names the class that handles indexing, for example org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler.
Index handler classes implement the HiveIndexHandler interface.
Full Syntax:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
   [ ROW FORMAT ...] STORED AS ...
   | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"]
  • WITH DEFERRED REBUILD - the newly created index is initially empty. REBUILD (ALTER INDEX ... REBUILD) can then be used to bring the index up to date.
  • IDXPROPERTIES/TBLPROPERTIES - key/value properties declared for the index and for the index table, respectively
  • PARTITIONED BY - the table columns by which the index is partitioned; if not specified, the index spans all table partitions
  • ROW FORMAT - a custom SerDe or the native SerDe (the Serializer/Deserializer Hive uses for reads and writes). The native SerDe is used if ROW FORMAT is not specified
  • STORED AS - the index table storage format, such as RCFILE or SEQUENCEFILE
  • IN TABLE - the index table name (tbl_idx in the simple example). Specify it when the index table needs a known, unique name across tables; otherwise index tables are named automatically
  • STORED BY - a storage handler; this can be HBase (I haven't tried it)

The index can be stored in a Hive table or elsewhere, for example as an RCFILE in an HDFS path. When it is not stored in a Hive table, the index handler's usesIndexTable() method returns false. When an index is created, generateIndexBuildTaskList(...) in the index handler class generates a plan for building the index.

Consider CompactIndexHandler from the Hive distribution.

For each indexed value, it stores only the addresses of the HDFS blocks containing that value. The index table's schema is declared through Hive metastore FieldSchema entries, with _bucketname and _offsets added to the index table.

That is, the index table contains three kinds of columns: the indexed columns (taken from the base table's field schema), _bucketname (the HDFS file of the table partition that holds those values), and _offsets (the list of block offsets within that file).
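To make that layout concrete, here is a minimal Python sketch of the idea (it is not Hive code). It groups block offsets by indexed value and file, which is roughly what a compact index records. The function name `build_compact_index`, the sample rows and the file paths are all hypothetical.

```python
from collections import defaultdict

def build_compact_index(rows):
    """rows: iterable of (indexed_value, bucketname, block_offset) tuples.

    Returns index-table-like rows: (indexed_value, bucketname, sorted offsets).
    """
    grouped = defaultdict(set)
    for value, bucketname, offset in rows:
        # A set keeps each block offset once, however many rows share it.
        grouped[(value, bucketname)].add(offset)
    return [(value, bucket, sorted(offsets))
            for (value, bucket), offsets in sorted(grouped.items())]

# Hypothetical scan of a base table: (column value, HDFS file, block offset)
rows = [
    ("alice", "/warehouse/tbl/part-00000", 0),
    ("bob", "/warehouse/tbl/part-00000", 0),
    ("alice", "/warehouse/tbl/part-00000", 4096),
    ("alice", "/warehouse/tbl/part-00001", 0),
    ("alice", "/warehouse/tbl/part-00000", 0),
]
index_rows = build_compact_index(rows)
for row in index_rows:
    print(row)
```

A query filtering on the indexed column would then only need to read the listed blocks of the listed files instead of scanning the whole table.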



See the code from CompactIndexHandler,
   

Apache Lucene - Indexing - Part 1

"Information retrieval (IR) is the science of searching for documents, for information within documents and for metadata about documents, as well as that of searching relational databases and the World Wide Web."

Most applications offer some kind of search. If you want to add a powerful text search engine to your application, use Lucene, which can add advanced search engine capabilities to an application. It is a really powerful Java API that gave rise to tools such as Nutch, Hadoop, Hibernate Search and so on. Lucene was started in 1997 and adopted by Apache in 2001. Its main function is full-text indexing of data.
Indexing with Lucene breaks down into three main operations: converting data to text, analyzing it, and saving it to the index. Lucene works on text only, so documents have to be parsed into text before they are indexed.
To search large amounts of text quickly, you must first index that text and convert it into a format that will let you search it rapidly, eliminating the slow sequential scanning process. This conversion process is called indexing, and its output is called an index. Searching is then done on this index, at the cost of the extra space needed to store it.
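The three operations can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Lucene's API: the helper names `to_text`, `analyze` and `add_to_index`, the stop-word list and the sample documents are all assumptions made for the example.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "is", "of", "to"}  # illustrative list only

def to_text(raw_html):
    """Step 1: convert source data to plain text (here: crudely strip tags)."""
    return re.sub(r"<[^>]+>", " ", raw_html)

def analyze(text):
    """Step 2: split into lowercase terms and drop stop words."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in terms if t not in STOP_WORDS]

def add_to_index(index, doc_id, terms):
    """Step 3: record which documents contain each term."""
    for term in terms:
        index[term].add(doc_id)

index = defaultdict(set)
docs = {
    1: "<p>Lucene is a search library</p>",
    2: "<p>Hive is a data warehouse</p>",
    3: "<p>Search the warehouse index</p>",
}
for doc_id, raw in docs.items():
    add_to_index(index, doc_id, analyze(to_text(raw)))

print(sorted(index["search"]))     # [1, 3]
print(sorted(index["warehouse"]))  # [2, 3]
```

A lookup is now a single dictionary access per term rather than a scan over every document, and the price is the memory held by `index`.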
These index files are stored in a directory. A Lucene index is divided into segments, each made up of several index files holding Lucene documents. An index can cover multiple documents. When new documents are indexed, they are added to new segments rather than modifying the existing index files. This is what Lucene calls incremental indexing: the existing index stays in place and only the newly added documents are indexed, so that they become searchable. Structurally, a Lucene index is an inverted index. While searching, Lucene loads the index into memory. Indexing is high-performance, and the index size is roughly 20-30% of the size of the text indexed, which keeps memory use low.

A document in an index is a collection of fields, and each field is a named collection of terms, i.e. <field,term> pairs. Fields are independent search spaces defined at run time. The segments, or sub-indexes, are independently searchable, and the results from the segments are merged. Suppose a wiki article is indexed: we can set the field properties so that a field holds the article data as indexed content, as stored content, or both.
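The segment and field ideas can be sketched in plain Python as follows. This is a toy model under simplifying assumptions, not how Lucene is implemented: the classes `Segment` and `SegmentedIndex` are hypothetical, terms are split on whitespace only, and nothing is written to disk.

```python
from collections import defaultdict

class Segment:
    """A small inverted index built once: (field, term) -> set of doc ids."""
    def __init__(self, docs):
        postings = defaultdict(set)
        for doc_id, fields in docs.items():
            for field, text in fields.items():
                for term in text.lower().split():
                    postings[(field, term)].add(doc_id)
        self.postings = dict(postings)

    def search(self, field, term):
        return self.postings.get((field, term.lower()), set())

class SegmentedIndex:
    def __init__(self):
        self.segments = []

    def add_documents(self, docs):
        # Incremental indexing: new documents go into a new segment;
        # existing segments are left untouched.
        self.segments.append(Segment(docs))

    def search(self, field, term):
        # Each segment is searched independently and the results are merged.
        hits = set()
        for segment in self.segments:
            hits |= segment.search(field, term)
        return sorted(hits)

index = SegmentedIndex()
index.add_documents({1: {"title": "Apache Hive", "body": "data warehouse on hadoop"}})
index.add_documents({2: {"title": "Apache Lucene", "body": "search library"},
                     3: {"title": "Nutch", "body": "web search on hadoop"}})

print(index.search("body", "hadoop"))   # [1, 3]
print(index.search("title", "hadoop"))  # []
print(len(index.segments))              # 2
```

Because postings are keyed by (field, term), the same word in `title` and `body` lives in separate search spaces, which is why the second query finds nothing.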



More about lucene index file formats - here