Java - What are characters, code points and surrogates? What difference is there between them?
I'm trying to find an explanation of the terms "character", "code point", and "surrogate". While these terms aren't limited to Java, if there are any language-specific differences I'd like an explanation as it relates to Java.
I've found some information about the differences between characters and code points, with characters being what is displayed to human users, and code points being a value encoding that specific character, but I have no idea about surrogates. What are surrogates, and how are they different from characters and code points? Do I have the right definitions for characters and code points?
In another thread about stepping through a string as an array of characters, the specific comment that prompted this question was "Note that this technique gives you characters, not code points, meaning you may get surrogates." I didn't understand, and rather than create a long series of comments on a 5-year-old question I thought it best to ask for clarification in a new question.
To represent text in computers, you have to solve two things: first, you have to map symbols to numbers; then, you have to represent a sequence of those numbers with bytes.
A code point is a number that identifies a symbol. Two well-known standards for assigning numbers to symbols are ASCII and Unicode. ASCII defines 128 symbols. Unicode defines 109384 symbols as of this writing, and that's way more than 2^16.
Furthermore, ASCII specifies that number sequences are represented with one byte per number, while Unicode specifies several possibilities, such as UTF-8, UTF-16, and UTF-32.
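To make the two steps concrete, here is a minimal sketch that encodes the same symbol with several encodings and counts the resulting bytes. The class and method names are made up for illustration, and the UTF-32BE charset is looked up by name because it is not one of the charsets every Java platform is required to provide (common JDKs do ship it).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    // Number of bytes needed to store the text in the given encoding.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // EURO SIGN, code point U+20AC

        // ASCII: one byte per number.
        System.out.println(byteLength("A", StandardCharsets.US_ASCII));   // 1

        // The same Unicode code point, three different byte representations.
        System.out.println(byteLength(euro, StandardCharsets.UTF_8));     // 3
        System.out.println(byteLength(euro, StandardCharsets.UTF_16BE));  // 2
        // Assumption: the runtime provides a charset named "UTF-32BE".
        System.out.println(byteLength(euro, Charset.forName("UTF-32BE"))); // 4
    }
}
```

The code point stays U+20AC in every case; only the byte sequence used to store it changes.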
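When you try to use an encoding that uses fewer bits per character than are needed to represent all possible values (such as UTF-16, which uses 16 bits), you need some workaround.
Thus, surrogates are 16-bit values that indicate symbols that do not fit into a single two-byte value. They come in pairs: a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF) together encode one code point above U+FFFF.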
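As a rough sketch of how that workaround works in UTF-16, the code below splits a code point above U+FFFF into its two surrogates by hand and compares the result with what `Character.toChars` returns. The class and the two helper methods are hypothetical names chosen for this example; they assume the input is a valid supplementary code point (U+10000 to U+10FFFF) and do no validation.

```java
public class SurrogateSketch {
    // High (leading) surrogate: 0xD800 plus the top 10 bits of (codePoint - 0x10000).
    static char highSurrogate(int codePoint) {
        int offset = codePoint - 0x10000; // a 20-bit value
        return (char) (0xD800 + (offset >> 10));
    }

    // Low (trailing) surrogate: 0xDC00 plus the bottom 10 bits of the same offset.
    static char lowSurrogate(int codePoint) {
        int offset = codePoint - 0x10000;
        return (char) (0xDC00 + (offset & 0x3FF));
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // GRINNING FACE, does not fit into 16 bits

        char high = highSurrogate(codePoint);
        char low = lowSurrogate(codePoint);
        System.out.printf("U+%X -> %04X %04X%n", codePoint, (int) high, (int) low);
        // prints "U+1F600 -> D83D DE00"

        // The standard library produces the same pair.
        char[] pair = Character.toChars(codePoint);
        System.out.println(pair[0] == high && pair[1] == low); // true
    }
}
```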
Java uses UTF-16.
In particular, a char (character) is an unsigned two-byte value that contains a UTF-16 value.
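This is what the comment quoted in the question is getting at: stepping through a string char by char can hand you one half of a surrogate pair. The sketch below contrasts the two ways of iterating; the class and the `codePoints` helper are illustrative names, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

public class CharsVersusCodePoints {
    // Collects the code points of a string, stepping over surrogate pairs as one unit.
    static List<Integer> codePoints(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);     // reads one or two chars
            result.add(codePoint);
            i += Character.charCount(codePoint);  // advance by 1 or 2 chars
        }
        return result;
    }

    public static void main(String[] args) {
        // "a", then U+1F600 (stored as two chars), then "b".
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";

        System.out.println(s.length());                          // 4 chars
        System.out.println(s.codePointCount(0, s.length()));     // 3 code points
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(2)));  // true

        System.out.println(codePoints(s).size());                // 3
    }
}
```

For text that stays within the first 2^16 code points the two loops give the same result, which is why the char-by-char version often appears to work.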
If you want to learn more about Java and Unicode, I can recommend this newsletter: part 1, part 2