Unicode | Javascript | ᴘʜᴘ | Go | Ruby | Python | ☕ Java | Perl |
---|---|---|---|---|---|---|---|
Internally | UCS‐2 or UTF‐16 |
UTF‐8⁻ | UTF‐8 | varies | UCS‐2 or UCS‐4 |
UTF‐16 | UTF‐8⁺ |
Identifiers | ─ | ✔ | ✔ | ✔ | ✅∓ | ✔ | ✔ |
Casefolding | none | simple | simple | full | none | simple | full |
Casemapping | simple | simple | simple∓ | full | simple | full | full |
Graphemes | ─ | ✅ | ─ | ─ | ─ | ─ | ✔ |
Normalization | ─ | ✔ | ─⁺ | ─ | ✔ | ✔ | ✔ |
UCA Collation | ─ | ─ | ─ | ─ | ─ | ─ | ✔⁺ |
Named Characters | ─ | ─ | ─ | ─ | ✅ | ─ | ✔⁺ |
Properties | ─ | two | (non‐regex)⁻ | three | (non‐regex)⁻ | two⁺ | every⁺ |
from Tom Christiansen Unicode Support Shootout: The Good, the Bad, the Mostly Ugly
Grapheme - A grapheme is the smallest semantically distinguishing unit in a written language, analogous to the phonemes of spoken languages.
Casefolding - Unicode defines case folding through the three case-mapping properties of each character: uppercase, lowercase and titlecase. These properties relate all characters in scripts with differing cases to the other case variants of the character.
Case mapping - is used to handle the mapping of upper-case, lower-case, and title case characters for a given language.
What is the difference between case mapping and case folding? Case mapping or case conversion is a process whereby strings are converted to a particular form—uppercase, lowercase, or titlecase—possibly for display to the user. Case folding is primarily used for caseless comparison of text, such as identifiers in a computer program, rather than actual text transformation. Case folding in Unicode is based on the lowercase mapping, but includes additional changes to the source text to help make it language-insensitive and consistent. As a result, case-folded text should be used solely for internal processing and generally should not be stored or displayed to the end user.
Normalization - courtesy |
Normalization - Unicode has encoded many entities that are really variants of existing nominal characters. The visual representations of these characters are typically a subset of the possible visual representations of the nominal character. more -
UCA Collation - Collation is the general term for the process and function of determining the sorting order of strings of characters. It is a key function in computer systems; whenever a list of strings is presented to users, they are likely to want it in a sorted order so that they can easily and reliably find individual strings. Thus it is widely used in user interfaces. It is also crucial for databases, both in sorting records and in selecting sets of records with fields within given bounds.The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which defines a customizable method to compare two strings. These comparisons can then be used to collate or sort text in any writing system and language that can be represented with Unicode. more
Named Characters - Unicode characters are assigned a unique Name (na). The name, in English, is composed of A-Z capitals, 0-9 digits, - (hyphen-minus) and
Properties - Each Unicode character belongs to a certain category. Unicode assigns character properties to each code point. These properties can be used to handle "characters" (code points) in processes, like in line-breaking, script direction right-to-left or applying controls.more
Perl looks cool!
No comments:
Post a Comment