
Orient DB - A simple ETL use case note

#orientdb #graph #etl #java #database

Anyone familiar with graph data structures will, at some point, want to map real-world models to a graph and process them. If you try to build these graphs programmatically and work on them with hand-written traversal algorithms, you are going to have a hard time. If your application uses a relational database to store data mapped to these models, things get complex as you link them with more relationships. How do you design the relationships between domains in a more semantic way? How do you query them with a SQL-like or DSL language? Graph databases are the right candidate. Here I am trying out OrientDB.


In relational databases, primary and foreign key columns support joins that are computed at query time, which is memory and compute intensive. For many-to-many relationships we add junction tables on top of highly normalized tables, which increases query complexity and execution time. Graph databases are like relational databases, but with first-class support for "relationships": edges (stored as lists) connect nodes (vertices/entities). Whenever you run the equivalent of a join, the database just walks this materialized list and has direct access to the connected nodes, eliminating the need for an expensive search/match computation.
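As a rough contrast, using the sample tables introduced below (the first query is plain SQL, the second is OrientDB's SQL dialect; the table and class names are just those of this example):

-- relational: the buyer/book match is searched and joined at query time
SELECT b.title
FROM buyer u JOIN book b ON b.id = u.book_id
WHERE u.name = 'Hary';

-- graph: just follow the materialized 'Bought' edges of the vertex
SELECT expand(out('Bought')) FROM Buyer WHERE name = 'Hary';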

Consider the following tables:

Author Table
id | name
1  | Stephen King
2  | George R. R. Martin
3  | John Grisham

Book Table
id | author_id | title
1  | 1         | Carrie
2  | 1         | The Shining

Buyer Table
id | name | knows_id | book_id
1  | Hary | 2        | 2
2  | Mary | 1        | 2

In a graph database like OrientDB, we can define the relationships in a more semantic way. Graph databases operate on three structures: Vertex (sometimes called Node), Edge (or Arc) and Property (sometimes called Attribute).
  • Vertex - it's the data: Author, Book etc.
  • Edge - a physical relation between vertices. Each edge connects exactly two vertices, no more, no less. Additionally, an edge has a label and a direction, so if you label your edge Bought you know that Hary bought the book The Shining. The direction of a relationship can be either Out or In.
  • Property - a value attached to a vertex or edge.
OrientDB comes with an ETL tool to import data. Alternatively, you can use its libraries and write your own code to create nodes and edges in the database. A generic framework for graph databases, Apache TinkerPop, is also available; more on that later.
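To give a flavour of the programmatic route, here is a minimal sketch using the TinkerPop 2 Blueprints API bundled with OrientDB; the database path and the class/edge names are illustrative, not taken from the repo:

import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;

public class LoadSample {
    public static void main(String[] args) {
        // open (or create) a local graph database
        OrientGraph graph = new OrientGraph("plocal:/tmp/databases/bookstore");
        try {
            // vertices carry the data as properties
            Vertex buyer = graph.addVertex("class:Buyer");
            buyer.setProperty("name", "Hary");

            Vertex book = graph.addVertex("class:Book");
            book.setProperty("title", "The Shining");

            // a directed, labelled edge materializes the relationship
            graph.addEdge(null, buyer, book, "Bought");
            graph.commit();
        } finally {
            graph.shutdown();
        }
    }
}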
You have to define a configuration file for each kind of data you load into the graph store. In such a configuration (a sample is sketched after the list below), you are defining,
  • "source": { "file": { "path": "csv file location" } } - the file input source for a model/entity
  • in the transformers:
    • vertex maps each row to a model/table vertex
    • edge defines the edges into and out of it
  • in the loader definition we define all the entities (classes) and constraints
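A sketch of such a configuration for loading books, assuming the CSV lives at /tmp/csv/book.csv and a Wrote edge should point from Author to Book (the paths, class and field names are illustrative; adapt them to your environment):

{
  "source": { "file": { "path": "/tmp/csv/book.csv" } },
  "extractor": { "csv": {} },
  "transformers": [
    { "vertex": { "class": "Book" } },
    { "edge": { "class": "Wrote",
                "joinFieldName": "author_id",
                "lookup": "Author.id",
                "direction": "in" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/tmp/databases/bookstore",
      "dbType": "graph",
      "classes": [
        { "name": "Book", "extends": "V" },
        { "name": "Wrote", "extends": "E" }
      ],
      "indexes": [
        { "class": "Book", "fields": ["id:integer"], "type": "UNIQUE" }
      ]
    }
  }
}

The edge transformer looks up the Author vertex whose id matches the row's author_id and links it to the Book vertex being created; "direction": "in" flips the edge so it runs Author -> Book.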
More about the transformer definitions can be read in the OrientDB ETL documentation.
Import the CSV files and configurations from the GitHub repo. Please change the locations of the files and the conf with respect to your environment.

Simply execute the oetl.sh tool from $ORIENTDB_HOME as sh oetl.sh 'location of conf file'

(screenshot: loading the author data)
You have to execute all the configurations to load all the data.
After loading all the data, you can query it and visualize the graph in OrientDB's web-based console.

Here you can see the links between the entities.

How do you find the books bought by your friends?

select expand( both('Knows').out('Bought')) from Buyer where name = 'Hary'
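Reading it inside out: both('Knows') walks the Knows edges in either direction to reach Hary's friends, out('Bought') then follows their outgoing Bought edges to the books they bought, and expand() unrolls the resulting collection into individual records.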

Creating index in Hive


Simple:
CREATE INDEX idx ON TABLE tbl(col_name) AS 'Index_Handler_QClass_Name' IN TABLE tbl_idx;
To make the indexing algorithms pluggable, one has to mention the class name that handles the indexing, e.g. org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler.
The index handler classes implement HiveIndexHandler.
Full Syntax:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
   [ ROW FORMAT ...] STORED AS ...
   | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"]
  • WITH DEFERRED REBUILD - a newly created index is initially empty; ALTER INDEX ... REBUILD can be used to bring the index up to date.
  • IDXPROPERTIES/TBLPROPERTIES - declare key/value properties for the index or its table.
  • PARTITIONED BY - the table columns on which the index is partitioned; if not specified, the index spans all table partitions.
  • ROW FORMAT - a custom SerDe or the native SerDe (Serializer/Deserializer, used for Hive reads/writes). The native SerDe is used if ROW FORMAT is not specified.
  • STORED AS - the storage format of the index table, like RCFILE or SEQUENCEFILE. The user has to specify a unique tbl_idx name (via IN TABLE) if a qualified index table name is required across tables; otherwise one is named automatically. STORED BY - can be HBase (I haven't tried it).
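Putting the syntax together, a small end-to-end example with the compact handler (the table and column names are made up for illustration):

CREATE INDEX authors_name_idx
ON TABLE authors (name)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IN TABLE authors_name_idx_tbl;

-- the deferred index is empty until rebuilt
ALTER INDEX authors_name_idx ON authors REBUILD;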

The index can be stored in a Hive table, or as an RCFILE in an HDFS path, etc. In the latter case, the implemented index handler's usesIndexTable() method will return false. When an index is created, the generateIndexBuildTaskList(...) method in the index handler class generates a plan for building the index.

Consider CompactIndexHandler from the Hive distribution.

It only stores the addresses of the HDFS blocks containing the indexed value. The layout is recorded in the Hive metastore as FieldSchema entries, with _bucketname and _offsets columns in the index table.

That is, the index table contains the indexed columns (their names taken from the base table's field schema), plus _bucketname (the HDFS file of the table partition holding those values) and _offsets (the block offsets within that file).
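For example, for an index on authors(name), the generated index table would look roughly like this (a sketch of the schema, not actual DESCRIBE output):

name          string
_bucketname   string          -- HDFS file of the partition holding the rows
_offsets      array<bigint>   -- block offsets within that file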



See the code of CompactIndexHandler in the Hive source for the details.

Parallel Databases and Map Reduce

In 2007, Google handled 400 petabytes of data per month.

It was MapReduce. And the competitors were Dryad and Hadoop.

Architects were reinventing parallel databases.

Data is partitioned across multiple disks. The parallel I/O enables relational operations like joins to execute in parallel. Teradata, one of the early highly parallel database machines, uses a shared-nothing architecture; Google called the idea sharding, which scales almost infinitely simply by adding nodes.

In parallel databases, horizontal partitioning divides the tuples of a relation among many disks such that each tuple resides on one disk.

The strategies can be:

a. Round robin - the ith tuple inserted in the relation goes to disk i mod n.
b. Hashing - a hash function h applied to the partitioning attribute value of a tuple sends it to one of the disks 0..n-1.
c. Range partitioning - for a partitioning vector, say [3,9], a tuple with partitioning attribute value 1 goes to disk 0, a tuple with value 6 goes to disk 1, while a tuple with value 12 goes to disk 2, etc.
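A toy sketch of the three placement functions, just to pin down the arithmetic (the class and method names are mine):

public class Partitioners {

    // round robin: the ith inserted tuple goes to disk i mod n
    static int roundRobin(long i, int n) {
        return (int) (i % n);
    }

    // hashing: the partitioning attribute is hashed to one of disks 0..n-1
    static int hash(Object partitioningAttr, int n) {
        return Math.floorMod(partitioningAttr.hashCode(), n);
    }

    // range: vector {3, 9} -> values below 3 to disk 0, below 9 to disk 1, rest to disk 2
    static int range(int value, int[] vector) {
        for (int d = 0; d < vector.length; d++) {
            if (value < vector[d]) return d;
        }
        return vector.length;
    }

    public static void main(String[] args) {
        int[] vector = {3, 9};
        System.out.println(range(1, vector));   // disk 0
        System.out.println(range(6, vector));   // disk 1
        System.out.println(range(12, vector));  // disk 2
    }
}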

In these databases, queries/transactions execute in parallel with one another. Relational queries expressed in SQL lend themselves well to parallel execution.

In a normal relational db, the execution plan for a query like this,
SELECT YEAR(date) AS year, AVG(fund)
FROM student
GROUP BY year

The execution plan is:
projection(year,fund) -> hash aggregation(year,fund).

In parallel dbs it will be,

projection(year,fund) -> partial hash aggregation(year,fund)
-> partitioning(year) -> final aggregation(year,fund).

Each operator produces a new relation, so the operators can be composed into highly parallel dataflow graphs. By streaming the output of one operator into the input of another, the two operators can work in series, giving pipelined parallelism. By partitioning the input data among multiple processors and memories, an operator can often be split into many independent operators, each working on a part of the data. This partitioned data and execution gives partitioned parallelism.



image source and reference: link

Parallel databases use this parallel dataflow model; it makes programming easy since data is not shared.

MapReduce can be programmed in languages like C, Java, Python and Perl, and it processes flat files in a filesystem, i.e. no database schema definitions are needed. This is an advantage when documents are processed. It takes a set of input key/value pairs and produces a set of output key/value pairs. The programmer writes two functions: map and reduce. Map takes an input pair and produces a set of intermediate key/value pairs. The intermediate values are then grouped together by intermediate key and passed to the reduce function. The reduce function accepts an intermediate key and the set of values for that key and merges these values into a smaller set; typically zero or one output value is produced per key, which keeps memory in check when a key has a lot of values and makes the job scalable. link
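The canonical word-count example, sketched against the Hadoop MapReduce Java API (the class names are mine):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: (offset, line) -> (word, 1) for every word in the line
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }
}

// reduce: (word, [1, 1, ...]) -> (word, total); values arrive grouped by key
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        ctx.write(word, new IntWritable(sum));
    }
}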

HadoopDB is a hybrid of DBMS and MapReduce technologies that targets analytical workloads and is designed to run on a shared-nothing cluster. It uses Hadoop to push queries down to existing database engines, which process the SQL. In order to push more query logic into the databases (e.g. joins), the data needs to be hash-partitioned. The data is loaded into HDFS for processing, and HadoopDB's API provides the implementations of the Map and Reduce functions. This lets HadoopDB process huge analytical databases with the scalability of the MapReduce framework and the performance of existing DB engines.

refer: link

image source link

An interesting talk by the creator of CouchDB

I watched this really inspiring presentation by Damien Katz, who developed a database using Erlang (a functional language). He talks about the circumstances and hardships of developing a new DB even when there are others in the industry that have become integral parts of infrastructure. Cool people make cool stuff!

http://www.infoq.com/presentations/katz-couchdb-and-me

What's so interesting about CouchDB?
In CouchDB, the data is a collection of JSON documents. It is more of an object database than a relational DB. It shows how powerful JavaScript is on the server side: views are created in JavaScript, like a map reduce. It is a very good choice for scalable RESTful applications. Currently the project is in the alpha stage.

More about CouchDB http://couchdb.apache.org/docs/intro.html

Applications using CouchDB http://wiki.apache.org/couchdb/CouchDB_in_the_wild

Object Relational Mapping Tools – An overview

Data is a critical part of any application, and engineers have to get involved in the tedious work of dealing with it. We all know that the most flexible programming paradigm is object oriented. To simplify, we create data access layers, automate data access, or generate data access code, and various design patterns are applied in developing these. There are many OOP languages and an array of relational databases; any application will need enhancements and refactoring, and the databases may change due to requirements, so a technique is needed for a flexible architecture.

ORM (Object Relational Mapping) is therefore a programming technique for converting data between the incompatible type systems of relational databases and object-oriented programming languages. It acts like a virtual database used within the programming language. Most data manipulation is done through value objects, and these objects can be persisted in database tables. The data types in the database and in the programming language can differ; the ORM layer does the necessary type mapping and generates the mapped queries for manipulating data in the database. The data is represented as objects in the application and stored in the RDBMS in accordance with relational theory. The mapping determines how objects and their relationships are persisted in permanent data storage.

In object-relational mapping products, the ability to directly manipulate data stored in a relational database using an object programming language is called transparent persistence. With transparent persistence, the manipulation and traversal of persistent objects is performed directly by the object programming language, in the same manner as for in-memory, non-persistent objects. This is achieved through the use of intelligent caching mechanisms. Thus a one-to-one object relationship maps to a one-to-one data relationship, a one-to-many maps to a one-to-many, and a many-to-many maps to a many-to-many.
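As a concrete flavour of such a mapping, a minimal sketch in the JPA/Hibernate style (Hibernate appears in the list below; the Book/Author entities and columns are invented for illustration):

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.ManyToOne;
import javax.persistence.Table;

// one class maps to one table, one field to one column
@Entity
@Table(name = "book")
public class Book {

    @Id
    @GeneratedValue
    private Long id;          // primary key column

    private String title;     // 'title' column, mapped by convention

    // the object reference maps to the author_id foreign key;
    // traversing book.getAuthor() is transparent persistence at work
    @ManyToOne
    @JoinColumn(name = "author_id")
    private Author author;    // Author is assumed to be another @Entity

    public Author getAuthor() { return author; }
    public String getTitle()  { return title; }
}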

A list of ORM vendors and products
ActiveRecord
Apache DB Project (ObJectRelationalBridge)
Applied Information Technologies, Inc. (SQLTags)
AlachiSoft (TierDeveloper)
Art Technology Group (ATG Repository)
ARTech (DeKlarit)
The BBA Data Objects Project (BBA Data Objects)
The Castor Project (Castor)
CodeFutures (FireStorm/DAO)
DADO Solution (DADO Database Mapper)
D2O Pty Ltd. (GURA)
Developer Express, Inc. (eXpress Persistent Objects for .NET)
EntitySpaces, LLC (EntitySpaces Architecture for .NET 2.0)
elephant
SQLAlchemy, open source
SQLObject, open source
Evaluant RC, S.A. (Data Tier Modeler for .Net)
FireStar Software, Inc. (ObjectSpark)

JPOX
CoughPHP
Open Source ORM for PHP 5 featuring code generation (FreeBSD license)
Doctrine
Open Source ORM for PHP 5.2, free software (GNU LGPL)
ezPDO A Simple Solution for PHP Object Relational Mapping and Data Persistence
Junction
Open-source, object persistence layer (MIT)
ORMer
Open Source, foreign key based ORM for PHP 5, free software (MIT License)
Propel
ORM and Query-Toolkit for PHP 5, inspired by Apache Torque, free software (GNU LGPL)
SilverStripe

The GLORP Project (Object-relational persistence for Smalltalk)
The Hibernate Project (Hibernate)
The Hydrate Project (Hydrate)
The iBATIS Project (iBATIS Data Mapper for Java and .NET)
IdeaBlade (IdeaBlade)
The JC Framework Project (JC Framework)
The Last Component (Persistent Datasets for .NET)
Kodo
Linq
Lattice Business Software International, Inc. (Lattice.DataMapper for .NET)
LightSpeed
MetaMatrix, Inc. (MetaBase and MetaMatrix Server)
Metastorage (Metastorage - PHP object-relational mapping code generator)
Michael Campbell Associates, Ltd. (JCodeBox)
MicroDoc Computersysteme, GmbH (MicroDoc Persistence Frameworks for Java)
MyGeneration Software (MyGeneration)
.NET Data Objects (NDO)
The .NET Entity Objects Project (.NET Entity Objects - NEO)
NHibernate
NConstruct
Neo
ObjectMapper .NET Project (ObjectMapper .NET)
Object Persistent Framework for .Net 2.0 (Opf3)
ObjectFrontier, Inc. (FrontierSuite and FrontierSuite for JDO)
Objectmatter, Inc. (Visual BSF)
ObjectStyle Group (Cayenne Project)
Olero Software, Inc. (ORM.NET)
OmniSphere Information Systems Corp. (OmniBuilder)
OpenJPA
Oracle Corporation (OracleAS TopLink)
OOmega.net (OOmega JAVA Connector, XML Connector, and DBMS Connector)
The PerlORM Project (PerlORM - Object relational mapping for Perl)
PHP Object Generator (POG)
Premisa d.o.o. (PADO)
postmodern
Persistor.NET
POTIS Software Development Tools (POTIS Object-Relational Toolkit - PORT)
The Propel Project (Propel: PHP Object Persistence)
PrismTech (OpenFusion Java Data Objects)
Progress Software Corporation (DataXtend CE)
RogueWave Software, Inc. (SourcePro DB)
SharpPower Corporation (RapTier)
SimpleORM (SimpleORM)
Software Tree, Inc. (JDX for Java and NJDX for .NET)
SolarMetric, Inc. (Kodo JDO)
Solutions Design (LLBLGen Pro)
Sooda
submarine
Sun Microsystems, Inc. (Java Studio Standard 5)
Sybase, Inc. (PowerDesigner)
TechTalk, Ltd. (Genome)
Thought, Inc. (CocoBase)
Vanatec, GmbH (Vanatec OpenAccess for .NET)
Visual Paradigm International Ltd. (DB Visual Architect)
WebObjects
WilsonDotNet.com (OR Mapper)
X-tensive.com Software Development Company (DataObjects.NET)
Xcalia (LiDO)
YTZ Systems, Inc. (SavePoint for Java)