Apr 15, 2016

A simple recommender system for your e-commerce store using a graph database

#graph #recommendation #orientdb #e-commerce #etl

In the last post, I introduced a simple ETL use case for a graph database like OrientDB. If you haven’t read it, I suggest you start there - Orient DB - A simple ETL use case note.

After loading the data, you might want to play around with the graph structure and its possible traversal logic. Since the semantic relationships between entities are easy to represent, the queries we write can be designed directly around that traversal logic. In the last post, I provided a query to find the books bought by the people a buyer knows or has befriended. In this post, I will provide some more simple examples for querying such a graph in OrientDB, using the native SQL-like query language supported by the database.

How do we find out the books bought by a buyer named ‘Hary’?

select @rid, title from (select expand(out('Bought')) from Buyer where name='Hary')

This query returns the Record ID in the #<cluster-id>:<cluster-position> format. In OrientDB each record has its own self-assigned unique ID within the database, called the Record ID or RID: cluster-id is the id of the cluster and cluster-position is the position of the record inside the cluster. You can think of a cluster as a table where the records of a class (say, Buyer) are stored. The subquery uses the expand() function to expand the collection in the field and use it as the result; here it fetches the records linked through the ‘Bought’ edge.
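For illustration only (the actual cluster id and position depend on your own database), the result could look like:

@rid    title
#13:1   The Shining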

How do we find out the people ‘Hary’ knows?

select expand(out('Knows')) from  Buyer where name='Hary'

Find out the books bought by friends of Hary?

select title from (
select expand(out('Bought')) from (select expand(out('Knows')
) from  Buyer where name='Hary'))

Here we combined the two queries above to make the logical decision; since the interlinking of the vertices is clearly identified, composing the traversals is straightforward.

Find out the books bought by Hary but not by his friends, so that we can recommend some to them?

select title from (select expand(out('Bought')) from Buyer where name='Hary')
let $temp = (
  select title from (
    select expand(out('Bought')) from (
      select expand(out('Knows')) from Buyer where name='Hary'
    )
  )
)
where title not in $temp

Here we used LET to assign the result of a subquery to a context variable. In the subquery we find the books bought by Hary’s friends; the outer query then returns the books bought by Hary that are not in that set.

Find out the other books bought by people who also bought a book like The Shining? This is a common use case for “recommended” links, where we may want to list similar products bought by other people to someone who is about to buy the displayed product.

select title from (
  select expand(inE('Bought').outV().outE('Bought').inV())
  from Book where title in ['The Shining']
) where title not in ['The Shining']

Apr 12, 2016

Orient DB - A simple ETL use case note

#orientdb #graph #etl #java #database

Anyone familiar with graph data structures would like to know how to map real-world models to a graph and process them. If you try to build them programmatically and approach them with hand-written traversal algorithms, you are going to have a hard time. If your application uses a relational database to store the data mapped to these models, things become complex as you link them with more and more relationships. How do you design the relationships between domains in a more semantic way? How do you query them with a SQL-like or DSL language? Graph databases are the right candidate. Here I am trying out OrientDB.

In relational databases, we have primary-key and foreign-key column references that drive joins computed at query time, which is memory and compute intensive. We also use junction tables for many-to-many relationships over highly normalized tables, which increases query execution time and complexity. Graph databases are like relational databases, but with first-class support for relationships, defined by edges (stored as lists) connecting nodes (vertices/entities). Whenever you run the equivalent of a join, the database just uses this materialized list and has direct access to the connected nodes, eliminating the need for an expensive search/match computation (the SQL sketch after the sample tables below shows the kind of join a single graph traversal replaces).

Consider the following tables,

Author Table
id  name
1   Stephen King
2   George R. R. Martin
3   John Grisham

Book Table
id  author_id  title
1   1          Carrie
2   1          The Shining

Buyer Table
id  name  knows_id  book_id
1   Hary  2         2
2   Mary  1         2
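As a quick illustration of the point about joins, the relational way to find the books bought by the people Hary knows needs joins across all three hops (a sketch, assuming the toy schema above); the single-traversal graph version appears at the end of this post.

-- books bought by the people Hary knows: every hop is a join computed at query time
SELECT bk.title
FROM buyer b
JOIN buyer friend ON friend.id = b.knows_id
JOIN book  bk     ON bk.id     = friend.book_id
WHERE b.name = 'Hary';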

In a graph database like OrientDB, we can define the relationships in a more semantic way. Graph databases operate on 3 structures: Vertex (sometimes called Node), Edge (or Arc) and Property (sometimes called Attribute); a sketch of how our tables map onto these follows the list.
  • Vertex - it’s the data: Author, Book etc.
  • Edge - a physical relation between vertices. Each edge connects exactly two vertices, no more, no less. Additionally, an edge has a label and a direction, so if you label your edge as Bought you know that Hary bought the book The Shining. The direction of the relationship can be either Out or In.
  • Property - a value attached to a Vertex or Edge.
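To make this concrete, here is a hypothetical sketch of how part of the same model could be created by hand in the OrientDB console (the ETL tool described next automates this):

CREATE CLASS Author EXTENDS V
CREATE CLASS Book   EXTENDS V
CREATE CLASS Buyer  EXTENDS V
CREATE CLASS Bought EXTENDS E
CREATE CLASS Knows  EXTENDS E

CREATE VERTEX Book  SET id = 2, title = 'The Shining'
CREATE VERTEX Buyer SET id = 1, name = 'Hary'
CREATE EDGE Bought FROM (SELECT FROM Buyer WHERE name = 'Hary') TO (SELECT FROM Book WHERE title = 'The Shining')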
OrientDB comes with an ETL tool to import data. Alternatively, you can use the client libraries and write your own code to create nodes in the database; a generic framework for graph databases is also available - more on Apache TinkerPop later.
You have to define a configuration file for each dataset you want to load into the graph store.
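As a sketch, an ETL configuration for loading buyer.csv (assuming hypothetical paths under /tmp/etl and a target database at /tmp/databases/bookstore) could look something like this:

{
  "source": { "file": { "path": "/tmp/etl/buyer.csv" } },
  "extractor": { "csv": {} },
  "transformers": [
    { "vertex": { "class": "Buyer" } },
    { "edge": { "class": "Knows",  "joinFieldName": "knows_id", "lookup": "Buyer.id", "direction": "out" } },
    { "edge": { "class": "Bought", "joinFieldName": "book_id",  "lookup": "Book.id",  "direction": "out" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/tmp/databases/bookstore",
      "dbType": "graph",
      "classes": [
        { "name": "Buyer",  "extends": "V" },
        { "name": "Knows",  "extends": "E" },
        { "name": "Bought", "extends": "E" }
      ],
      "indexes": [
        { "class": "Buyer", "fields": ["id:integer"], "type": "UNIQUE" }
      ]
    }
  }
}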
In the above sample configuration, you are defining,
  • "source": { "file": { "path": "csv file location" } } - the source file input for a model/entity
  • in the transformers
    • vertex - the model or table the row maps to
    • edge - the edges going in and out of that vertex
  • in the loader definition, all the entities (classes) and constraints (indexes)
More about the transformation definition can be read here
Get the CSV files and configurations from the GitHub repo. Please change the location of the files and conf to match your environment.

Simply execute the oetl.sh tool from $ORIENTDB_HOME as sh oetl.sh ‘location of conf file’
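For example, assuming the configuration files are saved as author.json, book.json and buyer.json under /tmp/etl (hypothetical names):

$ sh oetl.sh /tmp/etl/author.json
$ sh oetl.sh /tmp/etl/book.json
$ sh oetl.sh /tmp/etl/buyer.json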

loading author data
You have to execute all the configurations to load all the data.
After loading all the data, you can query it and visualize the graph in OrientDB’s web-based console.

Here you can see the links between the entities.

How do you find the books bought by your friends?

select expand( both('Knows').out('Bought')) from Buyer where name = 'Hary'

Dec 4, 2015

Analytics by SQL and Spark using Apache Zeppelin

#spark #hadoop #analytics #apache #zeppelin #scala

I was looking for a cool dashboard-based query interface for analytics, and I stumbled upon an open source project called Apache Zeppelin,

Zeppelin is a modern web-based tool for data scientists to collaborate on large-scale data exploration and visualization projects. It is a notebook-style interpreter that enables collaborative analysis sessions shared between users. Zeppelin is independent of the execution framework itself: the current version runs on top of Apache Spark, but it has pluggable interpreter APIs to support other data processing systems, and more execution frameworks could be added later, e.g. Apache Flink and Crunch, as well as SQL-like backends such as Hive, Tajo and MRQL.

As their Apache incubation proposal mentions, it has good support for pluggable interpreters (a lot of them), i.e. you can query files, databases, Hadoop etc. through this interface seamlessly. The application is easy to run on your workstation if you want to try it out: download it from the project site and follow the installation guide.

Run the zeppelin server daemon, and access the UI at http://localhost:8088

We can use different interpreters in notebooks and display the results in a dashboard. I was interested in a plain, simple SQL database like PostgreSQL.

Create a table named sales and insert some sample data.

create table sales(category varchar, units integer);
insert into sales values('Men-Shirts', 134344);
insert into sales values('Men-Shoes', 56289);
insert into sales values('Men-Wallets', 19377);
insert into sales values('Men-Watches', 345673);
insert into sales values('Women-Shirts', 87477);
insert into sales values('Women-Skirts', 140533);
insert into sales values('Women-Shoes', 77301);
insert into sales values('Electronics-Mobile', 67457);
insert into sales values('Electronics-Tablets', 21983);
insert into sales values('Electronics-Accessories', 865390);

Create a notebook,

Set up the connection properties in the psql interpreter configuration,

and run it with the %psql interpreter. In the notebook, type in:
%psql  select * from sales

You have the dashboard ready. You can share the graph as a link and schedule the notebook to run periodically.

Then I decided to use Spark code. Since Spark supports JDBC sources, we can use one in the Spark context. In Spark, JdbcRDD can be used to connect to a relational data source. RDDs are the unit of compute and storage in Spark but lack any information about the structure of the data, i.e. the schema; DataFrames combine RDDs with a schema. To use PostgreSQL as a source, you need the driver loaded for executing the queries and building the schema. Copy the driver jar to $ZEPPELIN_HOME/interpreter/spark and restart the daemon. If you don't do this, you will not be able to read from PostgreSQL and may get JDBC connection errors like "No suitable driver found".

Use the notebook to provide the spark code,
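A minimal sketch of what that Spark paragraph could look like, assuming a local PostgreSQL database named mydb with default credentials (Zeppelin's Spark interpreter provides sqlContext out of the box):

%spark
// read the sales table over JDBC into a DataFrame (Spark 1.x API)
val salesDF = sqlContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:postgresql://localhost:5432/mydb",   // assumed connection details
  "driver"   -> "org.postgresql.Driver",
  "dbtable"  -> "sales",
  "user"     -> "postgres",
  "password" -> "postgres"
)).load()

// expose it to the %sql interpreter as a temporary table
salesDF.registerTempTable("sales")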

In the %sql interpreter (note: %sql, not %psql), provide:

%sql select * from sales

You only have to schedule the %sql notebook; the dashboard is updated with the newly inserted data whenever the cron job is triggered.

Sep 10, 2015

Json parsing, Scala way

Most Java developers are familiar with JSON parsing and object mapping using the Jackson library's ObjectMapper, which serializes POJOs to JSON strings and back. In Scala, Play's JSON inception mechanism provides a subtle way to serialize JSON. Using the powerful Scala macros (a macro is a piece of Scala code, executed at compile time, which manipulates and modifies the AST of a Scala program - compile-time metaprogramming), it can introspect code at compile time based on the Scala reflection API, access all imports and implicits in the current compile context, and generate code. This means case classes are serialized to and from JSON automatically.

You can also explicitly provide the path to a JSON key and map its value to an object's field, but for simple case classes that is just more boilerplate code; use it when you need more powerful mapping and logic for the serialized fields. So how does this mapping work? The macro compiler replaces, say, Json.reads[T] by injecting code into the compiled Scala AST (Abstract Syntax Tree) and eventually writes out the code that maps JSON fields to the object's fields. Internally, Play's JSON module uses Jackson's object mapper (ref: play.api.libs.json.jackson.JacksonJson).

You can add the following dependency to build.sbt of a minimal Scala project, which will pull in the JSON APIs from the Play framework -
  "com.typesafe.play" %% "play-ws" % "2.4.2" withSources()

For example, if we have two case classes,

case class Region(name: String, state: Option[String])
case class Sales(count: Int, region: Region)

You have to add implicit values for reading and writing between JSON and objects. Values marked implicit are picked up for you by the compiler, with the type inferred from the context; compilation will fail if no implicit value of the right type is available in scope.

implicit val readRegion = Json.reads[Region]
implicit val readSales = Json.reads[Sales]
implicit val writeRegion = Json.writes[Region]
implicit val writeSales = Json.writes[Sales]
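As mentioned earlier, you can also spell the mapping out by hand with explicit JsPath combinators instead of relying on the macro. A sketch for Region (the val name is arbitrary, and this is an alternative to Json.reads[Region] rather than something to keep in the same implicit scope):

import play.api.libs.json._
import play.api.libs.functional.syntax._

// map each JSON path explicitly to a field of the case class
val regionReadsExplicit: Reads[Region] = (
  (JsPath \ "name").read[String] and
  (JsPath \ "state").readNullable[String]
)(Region.apply _)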

If you interchange the order of readRegion and readSales, you will get a compilation error. Since the compiler creates a Reads[T] by resolving the case class fields and the required implicits at COMPILE time, any missing implicit is discovered there and the compiler breaks with the corresponding error.

 Error:(12, 38) No implicit format for test.Region available.
   implicit val readSales = Json.reads[Sales]

An interesting method to try while converting JSON to an object is validate(), which helps pinpoint the path of any error.

Executing the following program:
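A minimal sketch of such a test program, assuming the case classes and implicits above are in scope (the JSON test strings here are made up for illustration):

import play.api.libs.json._

object JsonTest extends App {
  // validate a JSON string against Sales and print the JsPath of every error found
  def check(label: String, raw: String): Unit = {
    println(label)
    Json.parse(raw).validate[Sales] match {
      case JsSuccess(_, _) => ()   // valid, nothing to report
      case JsError(errors) => errors.foreach { case (path, _) => println(s"Error at JsPath: $path") }
    }
  }

  println("This is testing json..")
  check("Test 1", """{"count": 10, "region": {"name": "south", "state": "KL"}}""")   // valid document
  check("Test 2", """{"count": 10, "region": {"state": "KL"}}""")                    // region.name missing
  check("Test 3", """{"region": {"state": "KL"}}""")                                 // count and region.name missing
  check("Test 4", Json.toJson(Sales(3, Region("north", None))).toString)             // round-trip through the writes
}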


This is testing json..
Test 1
Test 2
Error at JsPath: /region/name
Test 3
Error at JsPath: /count
Error at JsPath: /region/name
Test 4
Process finished with exit code 0


May 5, 2013

Simple metastore creation for Hive in MySQL

For Hive, the metastore is like the system catalog: it contains metadata about the tables stored in Hive. This metadata is specified during table creation and reused every time the table is referenced in HiveQL. The database is a namespace for tables, where ‘default’ is used for tables with no user-supplied database name. The metadata for a table contains the list of columns and their types, the owner, and storage and SerDe information (which I can detail in future posts). It can also contain any user-supplied key/value data, which can be used for table statistics. Storage information includes the location of the table’s data in the underlying file system, data formats and bucketing information. SerDe metadata (the SerDe controls how Hive serializes/deserializes the data in a row) includes the implementation class of the serializer and deserializer methods and any supporting information required by that implementation. Partitions can have their own columns, SerDe and storage information, which can be used in the future to evolve the Hive schema.

The metastore uses either a traditional relational database (like MySQL or Oracle) or a file system, and not HDFS, since HDFS is optimized for sequential scans only; otherwise HiveQL statements that only access metadata objects would execute slowly.

It's simple to install the metastore.

- install the mysql connector
$ sudo yum install mysql-connector-java
- create a symbolic link in the Hive lib directory
$ ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar

- create the database for the Hive metastore. CDH4 ships with schema scripts for Derby, MySQL, Oracle and PostgreSQL
$ mysql -u root -p
mysql> CREATE DATABASE hivemetastoredb;
mysql> USE hivemetastoredb;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.9.0.mysql.sql;

- create a user for the metastore
mysql> CREATE USER 'hive'@'%' IDENTIFIED BY 'hive';

- grant access for all hosts in the network
mysql> GRANT ALL PRIVILEGES ON hivemetastoredb.* TO hive@'%' WITH GRANT OPTION;

Add the following entries to the file /etc/hive/conf/hive-site.xml if you are connecting to the metastore over JDBC.
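A minimal sketch of those entries, assuming the hivemetastoredb database and hive user created above and MySQL running on localhost:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hivemetastoredb</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>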