Java - What are characters, code points and surrogates? What difference is there between them?


I'm trying to find an explanation of the terms "character", "code point", and "surrogate". While these terms aren't limited to Java, if there are any language-specific differences I'd like an explanation as it relates to Java.

I've found some information about the differences between characters and code points, characters being what is displayed to human users, and code points being the value encoding that specific character, but I have no idea about surrogates. What are surrogates, and how are they different from characters and code points? Do I have the right definitions for characters and code points?

In another thread about stepping through a string as an array of characters, the specific comment that prompted this question was "Note that this technique gives you characters, not code points, meaning you may get surrogates." I didn't really understand it, and rather than create a long series of comments on a 5-year-old question, I thought it best to ask for clarification in a new question.

To represent text in computers, you have to solve two things: first, you have to map symbols to numbers; then, you have to represent a sequence of those numbers as bytes.

A code point is a number that identifies a symbol. Two well-known standards for assigning numbers to symbols are ASCII and Unicode. ASCII defines 128 symbols. Unicode defined 109,384 symbols at the time of writing, which is far more than 2^16 (65,536).

Furthermore, ASCII specifies that number sequences are represented with one byte per number, while Unicode specifies several possibilities, such as UTF-8, UTF-16, and UTF-32.
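As a rough illustration of the two steps (symbol to number, then number to bytes), the sketch below prints the code point of three sample symbols and how many bytes each needs in UTF-8 and UTF-16. The class name `EncodingSizes` and its helper methods are made up for this example; `UTF_16BE` is used only so that no byte-order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodingSizes {
    // Bytes needed to store the text in UTF-8 (1 to 4 bytes per code point).
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed in UTF-16 (2 or 4 bytes per code point).
    // The big-endian variant is chosen so no byte-order mark is prepended.
    static int utf16Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // 'A' (U+0041), the euro sign (U+20AC), and an emoji (U+1F600).
        String[] samples = { "A", "\u20AC", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                    s.codePointAt(0), utf8Bytes(s), utf16Bytes(s));
        }
    }
}
```

The same code point takes a different number of bytes depending on the encoding: U+20AC needs three bytes in UTF-8 but two in UTF-16, and U+1F600 needs four bytes in both.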

When you try to use an encoding that uses fewer bits per character than are needed to represent all possible values (such as UTF-16, which uses 16 bits), you need some workaround.

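Thus, surrogates are 16-bit values that indicate symbols that do not fit into a single two-byte value. They come in pairs: a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF) together encode one code point above U+FFFF.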
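To make the workaround concrete, here is a minimal sketch that splits a code point above U+FFFF into its two surrogates, once with the standard `Character.toChars` method and once by hand using the UTF-16 arithmetic. The class `SurrogateDemo` and the method names `toUtf16` and `encodeSupplementary` are invented for this example, and the manual version assumes its input is already above U+FFFF.

```java
// Hypothetical demo class, for illustration only.
class SurrogateDemo {
    // Lets the standard library produce the UTF-16 code units for a code point.
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Manual surrogate arithmetic; assumes codePoint is in U+10000..U+10FFFF.
    static char[] encodeSupplementary(int codePoint) {
        int offset = codePoint - 0x10000;               // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // an emoji outside the 16-bit range
        char[] units = toUtf16(codePoint);
        System.out.printf("U+%X needs %d code units%n", codePoint, units.length);
        for (char c : units) {
            System.out.printf("  0x%04X high=%b low=%b%n",
                    (int) c, Character.isHighSurrogate(c), Character.isLowSurrogate(c));
        }
        // Combining the pair again gives back the original code point.
        System.out.printf("recombined: U+%X%n",
                Character.toCodePoint(units[0], units[1]));
    }
}
```

For U+1F600 both approaches yield the pair 0xD83D, 0xDE00. Neither value is a symbol on its own; only the pair read together identifies the code point.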

Java uses UTF-16 to represent text.

In particular, a char (character) is an unsigned two-byte value that contains a UTF-16 code unit, which may be either a whole symbol or one half of a surrogate pair.
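This is what the comment quoted in the question is warning about: stepping through a string char by char can hand you surrogates instead of whole symbols. The sketch below contrasts the two views of the same string; the class `CharVsCodePoint` and its method names are made up for illustration, while `length`, `codePointCount`, and `codePoints` are standard `String` methods (the last one requires Java 8 or later).

```java
// Hypothetical demo class, for illustration only.
class CharVsCodePoint {
    // Number of UTF-16 code units (chars) in the string.
    static int charCount(String text) {
        return text.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String text) {
        return text.codePointCount(0, text.length());
    }

    // The code points themselves, with surrogate pairs already combined.
    static int[] codePoints(String text) {
        return text.codePoints().toArray();
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'

        System.out.println("chars: " + charCount(text));
        for (int i = 0; i < text.length(); i++) {
            // The two middle values are surrogates, not symbols.
            System.out.printf("  char[%d] = 0x%04X%n", i, (int) text.charAt(i));
        }

        System.out.println("code points: " + codePointCount(text));
        for (int cp : codePoints(text)) {
            System.out.printf("  U+%04X%n", cp);
        }
    }
}
```

For this sample string the char view has four entries while the code point view has three, so code that needs whole symbols should iterate over code points.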

If you want to learn more about Java and Unicode, I can recommend this newsletter: Part 1, Part 2

