
What is Cascading about?




Cascading helps you manipulate data in Hadoop. It is a framework written in Java that abstracts MapReduce, so you can write applications that read and modify data inside Hadoop without coding MapReduce jobs by hand. It provides a programming API for defining and executing fault-tolerant data processing workflows, and a query processing API that lets developers work without thinking in MapReduce. A number of DSLs are built on top of Cascading, most notably Cascalog (written in Clojure) and Scalding (written in Scala). Pig is a similar data processing API, but more SQL-like.








Terminology

Taps - the source (input) and sink (output) of a data stream.
Tuple - a single row of the data being processed, with named columns (fields); it can be thought of as a row in a result set. A series of tuples makes up a stream. All tuples in a stream have exactly the same fields.
Pipes - tie operations together and are executed on the data coming from a Tap. A pipe assembly is created when pipes are chained one after another. Pipe assemblies are directed acyclic graphs (DAGs).
Flows - reusable combinations of sources, sinks and pipe assemblies.
Cascade - a series of flows.
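
To make the terms above concrete, here is a minimal plain-Java sketch of how they relate. It deliberately does not use the Cascading library: the Pipe interface, the tuple() helper, runFlow() and the sample data are all hypothetical names invented for illustration, and the in-memory lists only stand in for real source and sink taps.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java model of the terminology; none of these types are Cascading classes.
class TerminologySketch {

    // A tuple: one row of named fields (field name -> value).
    static Map<String, Object> tuple(Object... namesAndValues) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < namesAndValues.length; i += 2) {
            row.put((String) namesAndValues[i], namesAndValues[i + 1]);
        }
        return row;
    }

    // A pipe (hypothetical interface): one operation applied to each tuple.
    interface Pipe extends Function<Map<String, Object>, Map<String, Object>> {}

    // A flow: read from the source, push every tuple through the pipe
    // assembly in order, and write the result to the sink.
    static void runFlow(List<Map<String, Object>> source,
                        List<Pipe> assembly,
                        List<Map<String, Object>> sink) {
        for (Map<String, Object> row : source) {
            Map<String, Object> current = row;
            for (Pipe pipe : assembly) {
                current = pipe.apply(current);
            }
            sink.add(current);
        }
    }

    // Builds a two-pipe assembly and runs it over two sample tuples.
    static List<Map<String, Object>> demo() {
        // The stream: every tuple has the same fields ("name", "age").
        List<Map<String, Object>> source = List.of(
                tuple("name", "ann", "age", 31),
                tuple("name", "bob", "age", 25));

        Pipe upperCaseName = row -> {
            Map<String, Object> out = new LinkedHashMap<>(row);
            out.put("name", ((String) row.get("name")).toUpperCase());
            return out;
        };
        Pipe addOneYear = row -> {
            Map<String, Object> out = new LinkedHashMap<>(row);
            out.put("age", ((Integer) row.get("age")) + 1);
            return out;
        };

        List<Map<String, Object>> sink = new ArrayList<>();
        runFlow(source, List.of(upperCaseName, addOneYear), sink);
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this toy model the assembly is a straight line of pipes; a real pipe assembly can branch and merge, which is why it is described as a DAG.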

What operations are possible?

Relational - Join, Filter, Aggregate, etc.
Each - applies an operation to each tuple (row) in the stream
GroupBy - groups tuples by key, like a SQL GROUP BY
CoGroup - joins tuple streams on a common key
Every - applies an operation to every group produced by a GroupBy or CoGroup, like an aggregate function applied to all tuples in a group at once
SubAssembly - nests reusable pipe assemblies into a single Pipe
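
The semantics of Each, GroupBy, Every and CoGroup can be sketched as a word count in plain Java. This is only an illustration of what each operation does to a stream, not the Cascading API: the class and method names below are hypothetical, and the CoGroup stand-in assumes an inner join on the key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the operation semantics; not Cascading classes.
class OperationsSketch {

    // "Each": run a function on every tuple. Here each line tuple is
    // split into one lower-cased word tuple per word.
    static List<String> eachSplitWords(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word.toLowerCase());
                }
            }
        }
        return words;
    }

    // "GroupBy": collect the tuples that share the same key.
    static Map<String, List<String>> groupByWord(List<String> words) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String word : words) {
            groups.computeIfAbsent(word, key -> new ArrayList<>()).add(word);
        }
        return groups;
    }

    // "Every": run an aggregator once per group. Here the aggregator counts.
    static Map<String, Integer> everyCount(Map<String, List<String>> groups) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : groups.entrySet()) {
            counts.put(group.getKey(), group.getValue().size());
        }
        return counts;
    }

    // "CoGroup": join two streams on a common key (assumed inner join).
    static List<String> coGroup(Map<String, Integer> left, Map<String, String> right) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : left.entrySet()) {
            String match = right.get(entry.getKey());
            if (match != null) {
                joined.add(entry.getKey() + "," + entry.getValue() + "," + match);
            }
        }
        return joined;
    }

    // The three steps chained together, like a small pipe assembly.
    static Map<String, Integer> wordCount(List<String> lines) {
        return everyCount(groupByWord(eachSplitWords(lines)));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordCount(List.of("to be or", "not to be"));
        System.out.println(counts);
        System.out.println(coGroup(counts, Map.of("to", "particle")));
    }
}
```

Chaining the three methods in wordCount() also hints at what a SubAssembly is for: a reusable group of pipes wrapped up so it can be used as if it were one.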

Internally, Cascading employs a planner that converts the pipe assembly into a graph of dependent MapReduce jobs that can be executed on a Hadoop cluster.
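
What "a graph of dependent jobs" means for execution order can be sketched in plain Java. This is not how the Cascading planner is implemented; it only shows, with hypothetical job names and a hypothetical executionOrder() helper, that a job can start once every job it depends on has finished. It assumes every job appears as a key in the map and that dependency lists contain no duplicates.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Orders dependent jobs with a topological sort (Kahn's algorithm).
// An illustration only, not the Cascading planner.
class JobGraphSketch {

    // dependsOn maps each job name to the jobs that must finish before it.
    static List<String> executionOrder(Map<String, List<String>> dependsOn) {
        // Number of unfinished dependencies per job.
        Map<String, Integer> remaining = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> job : dependsOn.entrySet()) {
            remaining.put(job.getKey(), job.getValue().size());
        }

        // Jobs with no dependencies can run immediately.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> job : remaining.entrySet()) {
            if (job.getValue() == 0) {
                ready.add(job.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String finished = ready.poll();
            order.add(finished);
            // Release every job that was waiting on the finished one.
            for (Map.Entry<String, List<String>> job : dependsOn.entrySet()) {
                if (job.getValue().contains(finished)) {
                    int left = remaining.merge(job.getKey(), -1, Integer::sum);
                    if (left == 0) {
                        ready.add(job.getKey());
                    }
                }
            }
        }

        if (order.size() != dependsOn.size()) {
            throw new IllegalStateException("cycle or unknown dependency in job graph");
        }
        return order;
    }

    // Four hypothetical jobs: "join" needs both "count" and "lookup".
    static List<String> demo() {
        Map<String, List<String>> dependsOn = new LinkedHashMap<>();
        dependsOn.put("split", List.of());
        dependsOn.put("count", List.of("split"));
        dependsOn.put("join", List.of("count", "lookup"));
        dependsOn.put("lookup", List.of());
        return executionOrder(dependsOn);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the job graph is acyclic, such an order always exists; a cycle would leave some job waiting forever, which the sketch reports as an error.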
 
What advantages does Cascading have over a plain MapReduce workflow? (Need to investigate!)
