Pages

Showing posts with label hive. Show all posts
Showing posts with label hive. Show all posts

Simple metastore creation for Hive in MySQL


For Hive, the meta-store is like the system catalog which contains metadata about the tables stored in Hive. This metadata is specified during table creation and reused every time the table is referenced in HiveQL. The database is a namespace for tables, where ‘default’ is used for tables with no user supplied database name. The metadata for table contains list of columns and their types, owner, storage and SerDe information (which I can detail in future posts). It can also contain any user supplied key and value data; which can be used for table statistics. Storage information includes location of the table’s data in the underlying file system, data formats and bucketing information. SerDe (which controls how Hive serializes/deserializes the data in a row) metadata includes the implementation class of serializer and deserializer methods and any supporting information required by that implementation. The partitions can have its own columns and SerDe and storage information which can be used in the future to evolve Hive schema.The metastore uses either a traditional relational database (like MySQL, Oracle) or file system and not HDFS since it is optimized for sequential scans only),thus the fired HiveQL statements are executed slow which only access metadata objects.


its simple to install the metastore.

-install mysql-conector
$ sudo yum install mysql-connector-java
-create a symbolic link in the Hive directory
$ ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysqlconnector-java.jar

-create the database for the Hive metastore.cdh4 ships with scripts for derby,mysql,oracle and postgre
$ mysql -u root -p
mysql> CREATE DATABASE hivemetastoredb;
mysql> USE hivemetastoredb;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema- 0.9.0.mysql.sql;

-create a user for the metastore
mysql>CREATE USER 'hive'@'%' IDENTIFIED BY 'hive';

-grant access for all hosts in the network
mysql> GRANT ALL PRIVILEGES ON hivemetastoredb.* TO hive@'%' WITH GRANT OPTION;
mysql> FLUSH PRIVILEGES;

following entries in the file /etc/hive/conf/hive-sites.xml, if you are trying a jdbc connection
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/hivemetastoredb</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>


Creating index in Hive


Simple:
CREATE INDEX idx ON TABLE tbl(col_name) AS 'Index_Handler_QClass_Name' IN TABLE tbl_idx;
As to make pluggable indexing algorithms, one has to mention the associated class name that handles indexing say for eg:-org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler
The index handler classes implement HiveIndexHandler
Full Syntax:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
   [ ROW FORMAT ...] STORED AS ...
   | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"]
  • WITH DEFERRED REBUILD - for newly created index is initially empty. REBUILD can be used to make the index up to date.
  • IDXPROPERTIES/TBLPROPERTIES - declaring keyspace properties
  • PARTITIONED BY - table columns where in the index get partitioned, if not specified index spans all table partitions
  • ROW FORMAT  - custom SerDe or using native SerDe(Serializer/Deserializer for Hive read/write). A native SerDe is used if ROW FORMAT is not specified
  • STORED AS  - index table storage format like RCFILE or SEQUENCFILE.The user has to uniquely specify tabl_idx name is required for a qualified index name across tables, otherwise they are named automatically. STORED BY - can be HBase (I haven't tried it)

The index can be stored in hive table or as RCFILE in an hdfs path etc. In this case, the implemented  index handler class usesIndexTable() method will return false.When index is created, the generateIndexBuildTaskList(...) in index handler class will generate a plan for building the index.

Consider CompactIndexHandler from Hive distribution,

It  only stores the addresses of HDFS blocks containing that value. The index is stored in hive metastore FieldSchema as _bucketname and _offsets in the index table.

ie the index table contains 3 columns, with _unparsed_column_names_from_field schema (indexed columns), _bucketname(table partition hdfs file having columns),[" _blockoffsets",..."]



See the code from CompactIndexHandler,