Working with CodePoints (UTF - Unicode 32-bit values)

Java provides support for a wide range of character sets, with char values ranging up to 65,535 (0xFFFF). When Unicode was first introduced, it had a 16-bit code space with 65,536 code points in total. This was good enough to meet its original intention, which was to cover all characters from the modern languages of the world. Later on, however, many other characters were introduced and added to the Unicode character set, and the maximum code point value has reached 1,114,111 (0x10FFFF).

When the second version of Unicode was published, its code space was expanded to approximately 21 bits, with 1,114,112 code points, to cover characters that were never intended for Unicode before. This evolution split Unicode characters into 2 major ranges, Basic (U+0000 - U+FFFF) and Supplementary (U+10000 - U+10FFFF), where the Supplementary range is further divided into planes. The only plane in the Basic range, called the Basic Multilingual Plane (BMP), basically maintains the code points of the original code space. The Supplementary range, on the other hand, consists of the Supplementary Multilingual Plane (SMP), the Supplementary Ideographic Plane (SIP), unassigned planes, the Supplementary Special-purpose Plane (SSP) and the Supplementary Private Use Area (SPUA).

As Unicode encoding is still maintained on a 16-bit basis, it introduced the surrogate character mechanism, which allows Unicode-based encoding schemes to encode and decode supplementary code points. Unicode allocated 2 code point areas in the BMP for this purpose: high surrogates (U+D800 - U+DBFF) and low surrogates (U+DC00 - U+DFFF). A high surrogate code point followed by a low surrogate code point is mapped to a supplementary code point. An isolated high surrogate or low surrogate code point has no meaning and does not map to any character.

A char primitive data type in Java is 16 bits, which cannot fit a supplementary code point. Hence, Java also adopted the surrogate character mechanism. This means that Java uses 2 char values for a surrogate pair, one for the high surrogate character and one for the low surrogate character, to make up a character from the Supplementary range. In a char array, the high surrogate character is followed by the low surrogate character. The maximum value of a Unicode code point is U+10FFFF (Character.MAX_CODE_POINT), and the maximum value of a Unicode high-surrogate code unit in the UTF-16 encoding is U+DBFF (Character.MAX_HIGH_SURROGATE). The Java source file, however, must be saved in a Unicode encoding format.

UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact, this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units. An emoji code point can be a low or a high number, so it may take one or two code units. Java and JavaScript are UTF-16 based, so they measure length in code units and not code points. The same handling can be applied when deleting, inserting, or truncating strings that contain surrogate characters.

By the way, BreakIterator can not only handle internationalized characters, it also supports word and sentence boundaries.
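To make the difference between code units and code points concrete, here is a minimal Java sketch. The class name CodePointSketch, its method names and the sample string are assumptions made for illustration and do not come from the original post. The JDK calls it relies on (String.codePointCount, String.offsetByCodePoints, Character.isHighSurrogate, BreakIterator.getCharacterInstance) are standard. It shows how a surrogate pair combines into a supplementary code point, how a string might be truncated without splitting a pair, and how BreakIterator can count character boundaries.

```java
import java.text.BreakIterator;

// Hypothetical helper class for illustration; the names are not from the article.
final class CodePointSketch {

    // Combine a surrogate pair into a supplementary code point by hand.
    // Character.toCodePoint(high, low) performs the same calculation.
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Keep at most maxCodePoints code points (assumed to be >= 0),
    // never cutting a surrogate pair in half.
    static String truncateByCodePoints(String s, int maxCodePoints) {
        int total = s.codePointCount(0, s.length());
        if (total <= maxCodePoints) {
            return s;
        }
        // Translate a code point count into a char index.
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    // Count character boundaries using BreakIterator's character instance.
    static int countCharacterBoundaries(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 written as a surrogate pair, then 'b'
        String s = "a\uD83D\uDE00b";

        System.out.println(s.length());                               // 4 code units
        System.out.println(s.codePointCount(0, s.length()));          // 3 code points
        System.out.println(Integer.toHexString(s.codePointAt(1)));    // 1f600
        System.out.println(Character.isHighSurrogate(s.charAt(1)));   // true
        System.out.println(Character.isLowSurrogate(s.charAt(2)));    // true
        System.out.println(Integer.toHexString(combine(s.charAt(1), s.charAt(2)))); // 1f600
        System.out.println(truncateByCodePoints(s, 2).length());      // 3 chars: 'a' plus the pair
        System.out.println(countCharacterBoundaries(s));
    }
}
```

A plain s.substring(0, 2) on the same string would keep the high surrogate and drop the low one, leaving an isolated surrogate that maps to no character. That is why the sketch converts the code point count into a char index first.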