Pages

Showing posts with label big data. Show all posts

Machine generated data

At first, the term "machine-generated data" can be confusing. One might think that all data, generated on one device or another, is ultimately provided by some innocent mortal in this so-called era of social media and big data. So a clear distinction is needed. If a user enters some data into a form, that data is not considered machine-generated. But the same application may track the user's location and log it to a remote server; that log is machine-generated data.

Wikipedia says,

Machine-generated data (MGD) is the generic term for information which was automatically created from a computer process, application, or other machine without the intervention of a human. 



According to Monash Research,

In classical human-generated data, what’s recorded is the direct result of human choices. Somebody buys something, makes an inquiry about it, fills an order from inventory, makes a payment in return for the object, makes a bank deposit to have funds for the next purchase, or promotes a manager who’s been particularly successful at selling stuff. Database updates ensue. Computers memorialize these human actions more quickly and cheaply than humans carry them out. Plenty of difficulties can occur with that kind of automation — applications are commonly too inflexible or confusing — but keeping up with data volumes is generally the least of the problems.


So what are they? Are they streams of logs flowing through the information super-waterway?


Maybe, until they are churned into some books or toilet rolls!

Application logs - logs generated by web or desktop applications; the server-side logs used for debugging and support tickets!

Call Detail Records - the ones recorded by your telecom company. They contain useful details of each call or service that passed through the switch, such as the phone numbers involved and the call duration. Needed for billing.

Web logs - used to count visitors, and similar web analytics are done on these data.

Database audit logs - produced when auditing is enabled to catch suspicious database activity; it is common that not much information is available to target specific users or schema objects.

OS logs - track crashes and errors.

There is a lot of similar data generated by different applications and systems, like RFID readers, sensors, etc. These messages can then be mashed up. Machine data has a structure or format, and semantics based on the domain it comes from.
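Even a "plain" web log line has a well-defined format. As a minimal sketch (the sample line and the parseLogLine helper are made up for illustration), pulling the fields out of an Apache common-log entry might look like this:

```javascript
// A made-up Apache common-log-format line.
var line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326';

// host, ident, user, [timestamp], "request", status, bytes
var pattern = /^(\S+) \S+ (\S+) \[([^\]]+)\] "([^"]+)" (\d{3}) (\d+|-)$/;

// Parse one log line into a structured record, or return null if it doesn't match.
function parseLogLine(l) {
  var m = pattern.exec(l);
  if (!m) return null;
  return {
    host: m[1],
    user: m[2],
    time: m[3],
    request: m[4],
    status: Number(m[5]),
    bytes: m[6] === "-" ? 0 : Number(m[6])
  };
}

console.log(parseLogLine(line));
```

The point is simply that "machine-generated" does not mean "structureless": the structure is in the format, not in a database schema.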

The growth of such data is fast and continuous. Being a stream of events, they, like history, are never changed; they are a record of what happened.

courtesy- link


Has anyone tried iPhonetracker?

courtesy- link

Geolocation and LBS do push a load of data. HTML5 does have geolocation functionality (even though you have the choice not to be tracked). Following is a sample snippet to test it.
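A minimal sketch of the HTML5 Geolocation API (formatPosition is a made-up helper name, not part of the API; in a browser the user is prompted and can refuse):

```javascript
// Turn a Position-like object into a printable string (hypothetical helper).
function formatPosition(pos) {
  return "Lat: " + pos.coords.latitude + ", Lon: " + pos.coords.longitude;
}

// Only ask for the position when the API is actually available (i.e., in a browser).
if (typeof navigator !== "undefined" && navigator.geolocation) {
  navigator.geolocation.getCurrentPosition(
    function (pos) { console.log(formatPosition(pos)); },
    function (err) { console.log("Geolocation denied or failed: " + err.message); }
  );
}
```

Note that even this tiny example is a machine generating location data about a human, which is exactly the kind of stream discussed above.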



Creating index in Hive


Simple:
CREATE INDEX idx ON TABLE tbl(col_name) AS 'Index_Handler_QClass_Name' IN TABLE tbl_idx;
To make indexing algorithms pluggable, one has to mention the class name that handles the indexing, e.g. org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler.
The index handler classes implement HiveIndexHandler.
Full Syntax:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
   [ ROW FORMAT ...] STORED AS ...
   | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"]
  • WITH DEFERRED REBUILD - a newly created index is initially empty; ALTER INDEX ... REBUILD can be used to bring the index up to date.
  • IDXPROPERTIES/TBLPROPERTIES - declare key/value properties for the index and its backing table.
  • PARTITIONED BY - the table columns on which the index is partitioned; if not specified, the index spans all table partitions.
  • ROW FORMAT - a custom SerDe or the native SerDe (Serializer/Deserializer, for Hive reads/writes). The native SerDe is used if ROW FORMAT is not specified.
  • STORED AS - the index table storage format, like RCFILE or SEQUENCEFILE. A unique tbl_idx name is required for a qualified index table name across tables; otherwise one is named automatically. STORED BY - can be HBase (I haven't tried it).
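Putting the pieces of the syntax above together, a concrete sketch (the table, column, and index names here are made up) of creating and then building a compact index:

```sql
-- Create a compact index on a hypothetical 'sales' table; DEFERRED means it starts empty.
CREATE INDEX sales_cust_idx
ON TABLE sales (customer_id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IN TABLE sales_cust_idx_tbl;

-- Populate (or later refresh) the index.
ALTER INDEX sales_cust_idx ON sales REBUILD;
```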

The index can be stored in a Hive table, or as an RCFILE at an HDFS path, etc. In the latter case, the index handler's usesIndexTable() method will return false. When an index is created, the generateIndexBuildTaskList(...) method in the index handler will generate a plan for building the index.

Consider CompactIndexHandler from the Hive distribution:

It only stores the addresses of the HDFS blocks containing each value. The index is stored in the Hive metastore FieldSchema as _bucketname and _offsets in the index table.

i.e., the index table contains three kinds of columns: the indexed columns themselves (taken from the FieldSchema), _bucketname (the HDFS file of the table partition containing those columns), and _offsets (the list of block offsets).



See the code from CompactIndexHandler,
   

The mythical unstructured data!

As the semantic web and big data integration gain their fus-ro-dah, enterprises are finding ways to harness any available form of information swarming the web and the world.

I came across some interesting articles which give a concise idea of harvesting metadata from unstructured data...

Lee Dallas says

In some respects it is analogous to hieroglyphics where pictographs carry abstract meaning.  The data may not be easily interpretable by machines but document recognition and capture technologies improve daily. The fact that an error rate still exists in recognition does not mean that the content lacks structure.  Simply that the form it takes is too complex for simple processes to understand.

more here : http://bigmenoncontent.com/2010/09/21/the-myth-of-unstructured-data/

A lot of data growth is happening around these so-called unstructured data types. Enterprises which manage to automate the collection, organization and analysis of these data types, will derive competitive advantage.
Every data element does mean something, though what it means may not always be relevant for you.

more here : http://bigdataintegration.blogspot.in/2012/02/unstructured-data-is-myth.html