java - Converting a string to UTF-8 using a buffer


I need to convert a (possibly large) string to UTF-8, but I don't want to create a byte array containing the full encoding. My idea is to use a CharsetEncoder for this, but CharsetEncoder acts on a CharBuffer, which means that supplementary characters (outside the Unicode range 0x0000 to 0xFFFF) have to be considered.

The method I was using is CharBuffer.wrap(String.substring(start, start + BLOCK_SIZE)), with a ByteBuffer created using ByteBuffer.allocate((int) Math.ceil(encoder.maxBytesPerChar() * BLOCK_SIZE)). However, I believe the CharBuffer will then contain BLOCK_SIZE code points, not code units (characters); I think the actual number of characters is at most 2 times BLOCK_SIZE. That would mean my ByteBuffer is 2 times too small as well.
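One thing worth checking before sizing the buffer: String.length() and String.substring() count char values (UTF-16 code units), not code points, and a supplementary character occupies two chars but encodes to four UTF-8 bytes. The sketch below prints those counts for a small sample string; the class name BlockSizeCheck and the helper maxEncodedBytes are made up for illustration.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, only for illustration.
class BlockSizeCheck {

    // Worst-case number of UTF-8 bytes for a block of blockSize chars (code units).
    static int maxEncodedBytes(int blockSize) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        return (int) Math.ceil(encoder.maxBytesPerChar() * blockSize);
    }

    public static void main(String[] args) {
        // U+1F600 is a supplementary character: one code point, two chars.
        String s = "a\uD83D\uDE00b";

        System.out.println(s.length());                      // prints 4 (chars)
        System.out.println(s.codePointCount(0, s.length())); // prints 3 (code points)

        // substring(1, 3) selects exactly the two chars of the surrogate pair.
        String pair = s.substring(1, 3);
        System.out.println(pair.getBytes(StandardCharsets.UTF_8).length); // prints 4

        // The buffer size computed for 2 chars is large enough for those 4 bytes.
        System.out.println(maxEncodedBytes(2));              // prints 6
    }
}
```

If that reading is right, the allocation above is already large enough for BLOCK_SIZE chars. The remaining hazard with substring is a block boundary that falls between the two halves of a surrogate pair, which a strict encoder would report as malformed input.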

How can I calculate the correct number of bytes for the ByteBuffer? I could simply double it in case each and every character is a supplementary character, but that seems a bit much. The only other reasonable option seems to be to iterate over all code units (characters) or code points, which at least looks suboptimal.

Any hints on the most efficient approach to encoding strings piecemeal? Should I use a buffer, iterate with String.codePointAt(location), or is there an encoding routine that directly handles code points?
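If the string really is cut into blocks, one possible approach is to keep counting in chars and only nudge the block end so that it never separates a high surrogate from the low surrogate that follows it. The following is a minimal sketch under that assumption; SafeBlocks and safeEnd are hypothetical names, and the block size is assumed to be at least 2.

```java
// Hypothetical helper class, only for illustration.
class SafeBlocks {

    // Returns the exclusive end index of the block starting at start.
    // The block holds at most blockSize chars and does not end on a
    // high surrogate, so a surrogate pair is never split across blocks.
    // Assumes blockSize >= 2 so that every block makes progress.
    static int safeEnd(String s, int start, int blockSize) {
        int end = Math.min(start + blockSize, s.length());
        boolean endsInsidePair = end < s.length()
                && end - 1 > start
                && Character.isHighSurrogate(s.charAt(end - 1));
        return endsInsidePair ? end - 1 : end;
    }

    public static void main(String[] args) {
        String s = "ab\uD83D\uDE00c";
        int start = 0;
        while (start < s.length()) {
            int end = safeEnd(s, start, 3);
            System.out.println(start + ".." + end);
            start = end;
        }
    }
}
```

Only the last char of each block is inspected, so this avoids walking the whole string by code point.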


Additional requirement: invalid character encodings should result in an exception; default substitution or skipping of invalid characters cannot be allowed.
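To illustrate the difference this requirement makes: String.getBytes silently substitutes a replacement byte for an unpaired surrogate, while a CharsetEncoder configured with CodingErrorAction.REPORT raises a CharacterCodingException. A small sketch follows; StrictEncoding and isEncodable are hypothetical names, and the one-shot encode(CharBuffer) call is used only to show the error behaviour, since it does allocate the full output.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, only for illustration.
class StrictEncoding {

    // Returns true if the input can be encoded to UTF-8 without any
    // substitution, false if the encoder reports an error.
    static boolean isEncodable(String input) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // One-shot convenience method; fine for a demonstration only.
            encoder.encode(CharBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String broken = "\uD800"; // unpaired high surrogate

        // getBytes substitutes instead of failing: a single '?' byte.
        System.out.println(broken.getBytes(StandardCharsets.UTF_8).length);

        System.out.println(isEncodable("abc"));
        System.out.println(isEncodable(broken));
    }
}
```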

It seems easier to wrap the whole string and then just blindly read characters until none are remaining. There is no need to cut the string into parts; the encoder simply stops consuming characters once the output buffer is filled up:

    final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    final CharBuffer buffer = CharBuffer.wrap(input);
    final ByteBuffer encodedBuffer = ByteBuffer.allocate(BUFFER_SIZE);
    CoderResult coderResult;

    while (buffer.hasRemaining()) {
        coderResult = encoder.encode(buffer, encodedBuffer, false);
        if (coderResult.isError()) {
            throw new IllegalArgumentException(
                    "Invalid code point in input string");
        }
        encodedBuffer.flip();
        // do stuff with encodedBuffer
        encodedBuffer.clear();
        if (coderResult.isUnderflow()) {
            // Nothing more can be consumed; a trailing unpaired high surrogate
            // would otherwise keep this loop spinning forever.
            break;
        }
    }

    // required by the encoder: call encode with true to indicate the end
    coderResult = encoder.encode(buffer, encodedBuffer, true);
    if (coderResult.isError()) {
        throw new IllegalArgumentException(
                "Invalid code point in input string");
    }
    // flush completes the encoding operation (no output is expected for UTF-8)
    encoder.flush(encodedBuffer);
    encodedBuffer.flip();
    // do stuff with encodedBuffer
    encodedBuffer.clear(); // if still required
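For a version that can be compiled and run as is, here is a minimal sketch of the same idea that drains each filled buffer to an OutputStream. The class Utf8Streamer, the methods encodeTo and drain, and the choice of an OutputStream as the sink are assumptions made for illustration, not part of the approach above. It passes true for endOfInput from the start, which the encoder permits as long as no later call passes false, so that one loop covers both the overflow handling and the end-of-input check.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, only for illustration.
class Utf8Streamer {

    // Encodes input as UTF-8 and writes it to out in chunks of at most
    // bufferSize bytes. Malformed input (for example an unpaired surrogate)
    // results in a CharacterCodingException, which is an IOException.
    static void encodeTo(String input, OutputStream out, int bufferSize)
            throws IOException {
        if (bufferSize < 4) {
            // A single code point can need up to 4 bytes in UTF-8.
            throw new IllegalArgumentException("bufferSize must be at least 4");
        }
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer chars = CharBuffer.wrap(input);
        ByteBuffer bytes = ByteBuffer.allocate(bufferSize);

        CoderResult result;
        do {
            // The whole input is available, so endOfInput is always true.
            result = encoder.encode(chars, bytes, true);
            if (result.isError()) {
                result.throwException();
            }
            drain(bytes, out);
        } while (result.isOverflow());

        do {
            result = encoder.flush(bytes);
            drain(bytes, out);
        } while (result.isOverflow());
    }

    // Writes the filled part of the buffer to out and resets the buffer.
    private static void drain(ByteBuffer bytes, OutputStream out)
            throws IOException {
        bytes.flip();
        out.write(bytes.array(), bytes.arrayOffset(), bytes.remaining());
        bytes.clear();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        encodeTo("h\u00E9llo \uD83D\uDE00", sink, 8);
        System.out.println(sink.size() + " bytes written");
    }
}
```

The ByteArrayOutputStream in main is only there to make the example observable; in practice the sink would be whatever consumes the encoded chunks, so the full encoding never has to sit in memory at once.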
