Unicode features in various languages
Here's what each language natively supports in its standard distribution, from Tom Christiansen's Unicode Support Shootout: The Good, the Bad, the Mostly Ugly.
Grapheme - A grapheme is the smallest semantically distinguishing unit in a written language, analogous to the phonemes of spoken languages.
Casefolding - Unicode defines case folding through the three case-mapping properties of each character: uppercase, lowercase and titlecase. These properties relate all characters in scripts with differing cases to the other case variants of the character.
Case mapping - is used to handle the mapping of upper-case, lower-case, and title case characters for a given language.
What is the difference between case mapping and case folding? Case mapping or case conversion is a process whereby strings are converted to a particular form—uppercase, lowercase, or titlecase—possibly for display to the user. Case folding is primarily used for caseless comparison of text, such as identifiers in a computer program, rather than actual text transformation. Case folding in Unicode is based on the lowercase mapping, but includes additional changes to the source text to help make it language-insensitive and consistent. As a result, case-folded text should be used solely for internal processing and generally should not be stored or displayed to the end user.
Normalization - Unicode has encoded many entities that are really variants of existing nominal characters. The visual representations of these characters are typically a subset of the possible visual representations of the nominal character.
UCA Collation - Collation is the general term for the process and function of determining the sorting order of strings of characters. It is a key function in computer systems; whenever a list of strings is presented to users, they are likely to want it in a sorted order so that they can easily and reliably find individual strings. Thus it is widely used in user interfaces. It is also crucial for databases, both in sorting records and in selecting sets of records with fields within given bounds. The Unicode Collation Algorithm (UCA), defined in Unicode Technical Report #10, is a customizable method to compare two strings. These comparisons can then be used to collate or sort text in any writing system and language that can be represented with Unicode.
Named Characters - Unicode characters are assigned a unique name (na). The name, in English, is composed of A-Z capitals, 0-9 digits, - (hyphen-minus) and space. The Unicode Standard also specifies notational conventions for referring to sequences of characters (or code points) treated as a unit, using angle brackets surrounding a comma-delimited list of code points, code points plus character names, and so on. For example, the letter "a" with a circumflex and an acute accent applied to it can be designated either by its code points or by its character names.
Properties - Each Unicode character belongs to a certain category. Unicode assigns character properties to each code point. These properties can be used to handle "characters" (code points) in processes like line-breaking, script direction (right-to-left), or applying controls.
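To make two of these concepts concrete, here is a quick JDK-only sketch of normalization and full case mapping (the strings are just illustrative):

import java.text.Normalizer;

public class UnicodeDemo {
    public static void main(String[] args) {
        String composed = "\u00e9";      // é as a single code point
        String decomposed = "e\u0301";   // e + combining acute accent
        // Same text, different code point sequences:
        System.out.println(composed.equals(decomposed));  // false
        // Equal after normalizing to composed form (NFC):
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC).equals(composed)); // true
        // Full case mapping: ß uppercases to the two-character SS
        System.out.println("stra\u00dfe".toUpperCase());  // STRASSE
    }
}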
Unicode | Javascript | ᴘʜᴘ | Go | Ruby | Python | ☕ Java | Perl |
---|---|---|---|---|---|---|---|
Internally | UCS‐2 or UTF‐16 | UTF‐8⁻ | UTF‐8 | varies | UCS‐2 or UCS‐4 | UTF‐16 | UTF‐8⁺ |
Identifiers | ─ | ✔ | ✔ | ✔ | ✅∓ | ✔ | ✔ |
Casefolding | none | simple | simple | full | none | simple | full |
Casemapping | simple | simple | simple∓ | full | simple | full | full |
Graphemes | ─ | ✅ | ─ | ─ | ─ | ─ | ✔ |
Normalization | ─ | ✔ | ─⁺ | ─ | ✔ | ✔ | ✔ |
UCA Collation | ─ | ─ | ─ | ─ | ─ | ─ | ✔⁺ |
Named Characters | ─ | ─ | ─ | ─ | ✅ | ─ | ✔⁺ |
Properties | ─ | two | (non‐regex)⁻ | three | (non‐regex)⁻ | two⁺ | every⁺ |
Perl looks cool!
Creating index in Hive
Simple:
CREATE INDEX idx ON TABLE tbl(col_name) AS 'Index_Handler_QClass_Name' IN TABLE tbl_idx;
To make indexing algorithms pluggable, you specify the qualified class name that handles the indexing, e.g. org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler. The index handler classes implement HiveIndexHandler.
Full Syntax:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
[ ROW FORMAT ...] STORED AS ...
| STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"]
- WITH DEFERRED REBUILD - the newly created index is initially empty; ALTER INDEX ... REBUILD brings it up to date (see the example below)
- IDXPROPERTIES/TBLPROPERTIES - declare key/value properties for the index and its table
- PARTITIONED BY - the table columns by which the index is partitioned; if not specified, the index spans all table partitions
- ROW FORMAT - a custom SerDe, or the native SerDe (Serializer/Deserializer for Hive read/write). The native SerDe is used if ROW FORMAT is not specified
- STORED AS - the index table storage format, like RCFILE or SEQUENCEFILE. The index table name (IN TABLE tbl_idx) must be unique across tables; if not specified, one is generated automatically. STORED BY - can be HBase (I haven't tried it)
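For example, creating a compact index and then building it (a sketch; the table and column names here are made up):

CREATE INDEX ip_idx ON TABLE weblogs (ip_address)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IN TABLE weblogs_ip_idx;

ALTER INDEX ip_idx ON weblogs REBUILD;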
The index can be stored in a Hive table, or as an RCFILE in an HDFS path, etc. In the latter case, the index handler's usesIndexTable() method will return false. When the index is created, generateIndexBuildTaskList(...) in the index handler class will generate a plan for building the index.
Consider CompactIndexHandler from the Hive distribution:
It stores only the addresses of the HDFS blocks containing each value. The index table schema is recorded in the Hive metastore (as FieldSchema) with _bucketname and _offsets,
i.e. the index table contains the indexed columns themselves, _bucketname (the HDFS file of the table partition containing the value), and _offsets (the block offsets within that file).
See the code from CompactIndexHandler.
Labels:
big data,
database,
hadoop,
hive,
indexing,
java,
programming,
tech,
technology
Using Avro to serialize logs in log4j
I have written about the serialization mechanism of Protocol Buffers previously. Apache Avro provides a similar, and in some respects better, serialization framework.
It provides features like:
- Independent Schema - use different schemas for serialization and de-serialization
- Binary serialization - compact data encoding, and faster data processing
- Dynamic typing - serialization and deserialization without code generation
We can encode data in two ways when serializing with Avro: binary or JSON. In a binary file, the schema is included at the beginning of the file; in JSON, the type is defined along with the data. Switching from the JSON protocol to the binary format for better performance is pretty straightforward with Avro, since less type information needs to be sent with the data. And because the data is stored with its schema, any program can de-serialize the encoded data, which makes Avro a good candidate for RPC.
In Avro 1.5 we have to use (this is different from previous versions, which had no factory for encoders):
- org.apache.avro.io.EncoderFactory.binaryEncoder(OutputStream out, BinaryEncoder reuse) for binary
- org.apache.avro.io.EncoderFactory.jsonEncoder(Schema schema, OutputStream out) for JSON
The values (of Avro-supported value types) are put into a set of name-value pairs called GenericData.Record, with the schema field names as the keys.
Avro supported value types are
Primitive Types - null, boolean, int, long, float, double, bytes, string
Complex Types - Records, Enums, Arrays, Maps, Unions, Fixed
you can read more about them here
A schema definition has to be provided for the record instance. To read/write the data, just use the put/get methods.
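Putting it together, a minimal sketch with the Avro 1.5 API (the LogEntry schema and its field names are made up for illustration):

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroDemo {
    // A minimal record schema with two string fields.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"LogEntry\",\"fields\":["
        + "{\"name\":\"level\",\"type\":\"string\"},"
        + "{\"name\":\"message\",\"type\":\"string\"}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        // Put values against the schema field names.
        GenericRecord record = new GenericData.Record(schema);
        record.put("level", "INFO");
        record.put("message", "hello avro");

        // Binary encoding via the 1.5 EncoderFactory.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();

        System.out.println(out.size() + " bytes written");
    }
}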
I have used this serialization mechanism to provide a layout for log4j. The logs will be serialized with Avro.
github project is here - https://github.com/harisgx/avro-log4j
Add the libraries to your project and add new properties to log4j.properties
log4j.appender.logger_name.layout=com.avrolog.log4j.layout.AvroLogLayout
log4j.appender.logger_name.layout.Type=json
log4j.appender.logger_name.layout.MDCKeys=mdcKey
Provide the MDC keys as comma-separated values.
This is the schema
Interesting uses of sun.misc.Unsafe
Inspired by a question I found on Stack Overflow, I started looking up the uses. I found some pretty interesting ones...
VM "intrinsification." ie CAS (Compare-And-Swap) used in Lock-Free Hash Tables eg:sun.misc.Unsafe.compareAndSwapInt it can make real JNI calls into native code that contains special instructions for CAS
What is intrinsification?
They are optimizations like the compiler generating code directly for a called method, or native JVM optimizations. We know that there are VM downcalls from the JDK, like the wait method etc. It's about low-level programming. For example, the Atomic classes for numbers: they are pure numbers represented by objects, but modified atomically, with the operations managed natively.
read more about CAS here http://en.wikipedia.org/wiki/Compare-and-swap
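Here is a minimal sketch of the kind of lock-free update this enables, in the style of the JDK's Atomic classes (the reflection hack to obtain the Unsafe instance is the usual workaround, not a public API):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class CasCounter {
    private static final Unsafe UNSAFE;
    private static final long VALUE_OFFSET;

    private volatile int value;

    static {
        try {
            // Unsafe is not meant for application code, so grab it via reflection.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            VALUE_OFFSET = UNSAFE.objectFieldOffset(
                CasCounter.class.getDeclaredField("value"));
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Classic lock-free increment: retry until the CAS succeeds.
    int increment() {
        int current;
        do {
            current = value;
        } while (!UNSAFE.compareAndSwapInt(this, VALUE_OFFSET, current, current + 1));
        return current + 1;
    }

    public static void main(String[] args) {
        CasCounter c = new CasCounter();
        c.increment();
        c.increment();
        System.out.println(c.value); // 2
    }
}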
The sun.misc.Unsafe functionality of the host VM can be used to allocate uninitialized objects and then interpret the constructor invocation as any other method call.
One can track the data from a native address. It is possible to retrieve an object's memory address using the sun.misc.Unsafe class, and operate on its fields directly via Unsafe's get/put methods! (See the sketch after this list.)
Compile-time optimizations for the JVM. High-performance VMs use such "magic", requiring low-level operations, e.g. http://en.wikipedia.org/wiki/Jikes_RVM
Allocating memory, sun.misc.Unsafe.allocateMemory eg:- DirectByteBuffer constructor internally calls it when ByteBuffer.allocateDirect is invoked
Tracing the call stack and replaying with values instantiated by sun.misc.Unsafe, useful for instrumentation
sun.misc.Unsafe.arrayBaseOffset and arrayIndexScale can be used to develop arraylets, a technique for efficiently breaking up large arrays into smaller objects to limit the real-time cost of scan, update or move operations on large objects
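A small sketch of a few of the uses above: constructor-free allocation, direct field access through offsets, and manual array element addressing (the Point class is made up; none of this is supported API):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeTricks {
    static class Point {
        int x = 1; // this initializer never runs under allocateInstance
    }

    public static void main(String[] args) throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        // Allocate without calling any constructor: x stays 0, not 1.
        Point p = (Point) unsafe.allocateInstance(Point.class);
        System.out.println(p.x); // 0

        // Poke the field directly through its offset.
        long xOffset = unsafe.objectFieldOffset(Point.class.getDeclaredField("x"));
        unsafe.putInt(p, xOffset, 42);
        System.out.println(unsafe.getInt(p, xOffset)); // 42

        // Address an array element manually: base offset + index * scale.
        int[] a = {10, 20, 30};
        long base = unsafe.arrayBaseOffset(int[].class);
        long scale = unsafe.arrayIndexScale(int[].class);
        System.out.println(unsafe.getInt(a, base + 2 * scale)); // 30
    }
}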
References
Accessing data from storage system using XAM API
You may be familiar with POSIX, a collection of standards that enables portability of applications across OS platforms. It provides a standardized API for an OS, e.g. functions like 'open' to open a file, 'fdopen' to open a stream for a file descriptor, 'fork' to create a process, etc. It defines a file descriptor as a per-process, unique, non-negative integer used to identify an open file for the purpose of file access. The Unix systems programming API is standardized by a large number of POSIX standards, and most operating system APIs share common functionality like listing directories, renaming files, etc.
XAM stands for "eXtensible Access Method". The XAM API will allow application developers to store content on a class of storage systems known as “fixed-content” storage systems.
Data centers have evolved into a complex ecosystem of SANs, Fibre Channel, iSCSI, caching technologies, etc. Certain storage systems are designed specifically to store documents that do not change, which is useful for preservation or archival. For example, digital imaging systems that archive X-ray DICOM images, i.e. documents that are never edited and need to be kept for a long time, can use fixed-content storage. Applications accessing these solutions have to be written against vendor-specific interfaces. What if one changes the infrastructure? The application has to be rewritten. This is analogous to the need for the JDBC/ODBC driver specifications as a standard API for interacting with different database vendors. The XAM API, specified by SNIA, is the storage industry's proposed solution to this multi-vendor interoperability problem.
Consider EMC Centera: the data written to it is fixed in nature. Most common systems use a file-location (address) based approach to store and retrieve content; Centera instead has a flat addressing scheme called content addressing. When a BLOB object is stored, it is given a "claim check" derived from its content: a 128-bit hash value known as the Content Address (CA). The application needn't know the physical location of the stored data. Associated metadata like the filename is added into an XML document called a C-Clip Descriptor File (CDF) along with the file's content address. This makes the system capable of WORM (write once, read many). When one tries to modify/update a file, the new version is kept separately, enabling versioning and unique access through its fingerprint checksum. There is also an essential retention attribute that tracks the expiry of the file: the file can't be deleted until the retention period has passed.
This is a basic vendor-specific architecture, where each vendor provides its own SDK/API to access the data. So the industry came up with a standard: the XAM API provides the capability to access these systems in a vendor-independent way.
XAM has concepts like SYSTEMS, XSETS, properties and streams ("blobs"), and XUIDs.
SYSTEM/XSYSTEM - the repository
XSET - the content
XUID - the identifier for an XSET
All these elements include metadata as key-value pairs. There are participating and non-participating fields; participating fields contribute to the naming of the XSET, i.e. its XUID. If an XSET is retrieved from a SYSTEM and a participating field is modified, it will receive a different XUID. The XUID must be formed from the content of the property or stream (e.g. by running an MD5 hash over the value).
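The idea of a content-derived identifier can be sketched with the JDK alone (this illustrates just the concept; the real XUID format is defined by the XAM spec):

import java.math.BigInteger;
import java.security.MessageDigest;

public class ContentAddress {
    // Derive a stable identifier from a field's value: the same content
    // always yields the same address, different content a different one.
    static String address(byte[] content) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        return new BigInteger(1, md5.digest(content)).toString(16);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(address("participating field value".getBytes("UTF-8")));
        // Modify a participating field and the address changes.
        System.out.println(address("participating field value!".getBytes("UTF-8")));
    }
}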
So a vendor-specific library implements the XAM Application Programming Interface (API); this library, known as the XAMLibrary, connects to the Vendor Interface Module (VIM), which in turn communicates with the storage system. Usually the VIMs are the native interfaces. Once the application loads the XAMLibrary and connects to the source, it opens an XSET by providing its XUID, which opens up the content as a stream called an XStream. The XStream object is used to read the transported data, which can be a query result or a file.
A sample code to read the data:
// connect string
String xri = "snia-xam://";
XSystem sys = s_xam.connect(xri);
// open the XSet
XSet xset = sys.openXSet(xuid, XSet.MODE_READ_ONLY);
// get the metadata value for a key
String value = xset.getString("com.example.key");
XStream stream = xset.openStream("com.example.data", XStream.MODE_READ_ONLY);
stream.close();
xset.close();
sys.close();
XAM fields have type information described using MIME types.
e.g. "application/vnd.snia.xam.xuid" for the 80-element byte array, xam_xuid.
The XAM URI (XRI) is the XAM resource identifier specification, associated with a vendor, that identifies the storage.
As XAM generates a globally unique name (address), objects may move around freely in time, changing their physical or technological location, enabling transparent information lifecycle management (ILM). The metadata and retention information is thus externalized: even if the application is retired, the information can still be managed through the API. This makes the data center more adaptive to organizational information management.
And, unlike file systems, XAM tags the information with metadata and provides a technology-independent namespace, allowing software to interpret the content independently of the application.
Architecture specification
Labels:
API,
architecture,
data access,
distributed systems,
ilm,
information,
programming,
standards,
storage
First law for an unruly software programmer
All I want to do in life is code (!! or ?)... mm, eat, sleep, have relationships (perpetual me-self), play a lot of computer games, read books... read, read... research. These thoughts became the fundamental principle for a "programmer" who didn't know about the laws of programming! Ha, are there any laws to be followed? The thermodynamics of a heated discussion on solving a bug within a team would have made revelations to some: what if we had found this bug early, or why did we care less about that code block, etc. I have a little experience of the spaghetti world of programming languages (being humble :P).
First law of programming says:
Lowering quality lengthens development time.
Yes. It's a good read.
We do a job, but we have to make the work the best we can. To err is human. We are in a rush to finish the task and impress our customer. IMHO, if we are able to solve the bugs step by step along with the development, most of the issues will be solved. Ya: test-driven development, agile, extreme programming...
I think it needs experience; with coding and code reviews we will be able to find bugs during development itself. Bug repositories, tracking etc. could provide a checklist for the desired results.
Some people say that requirements are not fully achieved by the developed product, but some say we needn't give much importance to requirements: more importance to the code and the developed software. Make it smart. But we should know what we are making and what the customer will be expecting from the product. Are these requirements bullshit? Read here
A good programmer will be a good hacker. But if that guy is doing hack-oriented programming, will the estimated schedule be met? Will his job be at stake...? It depends upon a good manager :)
Sometimes the induced bugs can be lucrative (the devilish side). Bug-fixing prices!! I believe that's a bad way of service: delivering the successfully "on-time" product that sucks. Jolly customers!!
It all depends on what we are to develop and where we are. Happy programming.