Java - Converting a string to UTF-8 using a buffer
I need to convert a (possibly large) string to UTF-8, but I don't want to create a byte array containing the full encoding. My idea is to use a CharsetEncoder for this, but CharsetEncoder acts on a CharBuffer, which means that supplementary characters (outside the Unicode range 0x0000 to 0xFFFF) have to be considered.
The method I was using wraps one block at a time with CharBuffer.wrap(string.substring(start, start + BLOCK_SIZE)), and my ByteBuffer is created using ByteBuffer.allocate((int) Math.ceil(encoder.maxBytesPerChar() * BLOCK_SIZE)). However, the CharBuffer will now contain BLOCK_SIZE code points, not code units (characters); I think the actual number of characters is at most 2 times BLOCK_SIZE. This means that my ByteBuffer is 2 times too small as well.
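A quick check of the premise above may help with the sizing question. String.substring and CharBuffer both index UTF-16 code units (char values), so a block of BLOCK_SIZE chars never holds more than BLOCK_SIZE code units, and maxBytesPerChar() is stated per char. The sketch below only illustrates this; the class name BlockSizing, the helper worstCaseBytes and the sample text are made up for the example.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class BlockSizing {

    /** Upper bound on the encoded size of a block of {@code chars} UTF-16 code units. */
    static int worstCaseBytes(CharsetEncoder encoder, int chars) {
        return (int) Math.ceil(encoder.maxBytesPerChar() * chars);
    }

    public static void main(String[] args) {
        // U+1F600 is a supplementary character: one code point, two chars.
        String text = "a" + new String(Character.toChars(0x1F600)) + "b";

        System.out.println("code units:  " + text.length());
        System.out.println("code points: " + text.codePointCount(0, text.length()));

        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        int actual = text.getBytes(StandardCharsets.UTF_8).length;
        System.out.println("UTF-8 bytes: " + actual);
        System.out.println("bound for this many chars: "
                + worstCaseBytes(encoder, text.length()));
    }
}
```

The sample has 4 code units, 3 code points and 6 UTF-8 bytes, so the bound computed from the number of chars is not exceeded. What can still go wrong with fixed-size blocks is a cut between the two halves of a surrogate pair.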
How can I calculate the correct number of bytes for the ByteBuffer? I could simply double it in case each and every character is a supplementary character, but that seems a bit much. The only other reasonable option seems to be iterating over all code units (characters) or code points, which at least looks suboptimal.
Any hints on what is the most efficient approach to encode strings piecemeal? Should I use a buffer, iteration using String.codePointAt(location), or is there an encoding routine that directly handles code points?
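If the string is cut into blocks anyway, one possible approach is to move each cut so that it never falls between the two halves of a surrogate pair. The following is a minimal sketch under that assumption; SurrogateSafeBlocks, blockEnd and split are hypothetical names, and substring is used only to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

public class SurrogateSafeBlocks {

    /**
     * Returns the end index (exclusive) of a block starting at {@code start}
     * with at most {@code blockSize} chars, shortened by one char if the cut
     * would otherwise separate a high surrogate from its low surrogate.
     */
    static int blockEnd(String s, int start, int blockSize) {
        if (blockSize < 2) {
            // A block must be able to hold a full surrogate pair.
            throw new IllegalArgumentException("blockSize must be at least 2");
        }
        int end = Math.min(s.length(), start + blockSize);
        if (end < s.length()
                && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--;
        }
        return end;
    }

    /** Splits the string into consecutive blocks that keep surrogate pairs together. */
    static List<String> split(String s, int blockSize) {
        List<String> blocks = new ArrayList<>();
        int start = 0;
        while (start < s.length()) {
            int end = blockEnd(s, start, blockSize);
            blocks.add(s.substring(start, end));
            start = end;
        }
        return blocks;
    }
}
```

Because every block then contains only whole code points (for well-formed input), each block can be encoded on its own. Lone surrogates are left where they are, so a strict encoder still gets the chance to report them.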
Additional requirement: invalid character encodings should result in an exception; default substitution or skipping of invalid characters cannot be allowed.
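To illustrate this requirement: a CharsetEncoder can be told explicitly to report malformed input and unmappable characters, while String.getBytes silently substitutes a replacement byte. The sketch below assumes a lone surrogate as the invalid input; StrictEncoding, strictUtf8Encoder and isEncodable are illustrative names. The one-shot encode(CharBuffer) call allocates the whole output and is used here only to show the error behaviour, not as the piecemeal solution.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictEncoding {

    /** Creates a UTF-8 encoder that reports errors instead of replacing or ignoring them. */
    static CharsetEncoder strictUtf8Encoder() {
        return StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
    }

    /** Returns true if the string can be encoded to UTF-8 without any substitution. */
    static boolean isEncodable(String s) {
        try {
            // The convenience method throws CharacterCodingException on invalid input.
            strictUtf8Encoder().encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String loneSurrogate = "\uD83D"; // high surrogate without its low surrogate

        System.out.println(isEncodable("abc"));
        System.out.println(isEncodable(loneSurrogate));

        // For comparison: getBytes does not throw, it substitutes '?' (0x3F).
        System.out.println(loneSurrogate.getBytes(StandardCharsets.UTF_8)[0]);
    }
}
```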
It seems that it is easier to simply wrap the whole string and then blindly read characters until none are remaining. There is no need to cut the string into parts; the encoder will just read characters until the output buffer is filled up:
    // needs java.nio.ByteBuffer, java.nio.CharBuffer and
    // java.nio.charset.{CharsetEncoder, CoderResult, StandardCharsets}
    final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    final CharBuffer buffer = CharBuffer.wrap(input);
    final ByteBuffer encodedBuffer = ByteBuffer.allocate(BUFFER_SIZE);
    CoderResult coderResult;

    while (buffer.hasRemaining()) {
        coderResult = encoder.encode(buffer, encodedBuffer, false);
        if (coderResult.isError()) {
            throw new IllegalArgumentException(
                    "Invalid code point in input string");
        }
        encodedBuffer.flip();
        // do stuff with encodedBuffer
        encodedBuffer.clear();
        if (coderResult.isUnderflow()) {
            // The encoder cannot continue without more input; anything still
            // unread (a trailing lone surrogate) is reported by the final call.
            break;
        }
    }

    // required by the encoder: call encode with true to indicate the end
    coderResult = encoder.encode(buffer, encodedBuffer, true);
    if (coderResult.isError()) {
        throw new IllegalArgumentException(
                "Invalid code point in input string");
    }
    encodedBuffer.flip();
    // do stuff with encodedBuffer
    encodedBuffer.clear(); // if still required
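For reference, here is a self-contained variant of the same idea that writes the encoded bytes to an OutputStream. It is a sketch, not the only way to do it: Utf8Streamer, writeUtf8 and drain are made-up names, and the OutputStream stands in for the "do stuff" step. Since the whole string is wrapped up front, the sketch passes true for endOfInput on every call and loops only while the output buffer overflows, then calls flush as the CharsetEncoder contract asks.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class Utf8Streamer {

    /**
     * Encodes {@code input} to UTF-8 through a fixed-size buffer and writes the
     * bytes to {@code out}. Invalid input (such as a lone surrogate) results in a
     * CharacterCodingException, which is a subclass of IOException.
     */
    static void writeUtf8(String input, OutputStream out, int bufferSize)
            throws IOException {
        if (bufferSize < 4) {
            // The longest UTF-8 sequence is 4 bytes; a smaller buffer could never drain.
            throw new IllegalArgumentException("bufferSize must be at least 4");
        }
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer chars = CharBuffer.wrap(input);
        ByteBuffer bytes = ByteBuffer.allocate(bufferSize);

        CoderResult result;
        do {
            // No further input will follow the wrapped string, so endOfInput is true.
            result = encoder.encode(chars, bytes, true);
            if (result.isError()) {
                result.throwException();
            }
            drain(bytes, out);
        } while (result.isOverflow());

        do {
            // Lets the encoder emit any trailing bytes it still holds.
            result = encoder.flush(bytes);
            drain(bytes, out);
        } while (result.isOverflow());
    }

    /** Writes the filled part of the buffer to the stream and resets the buffer. */
    private static void drain(ByteBuffer bytes, OutputStream out) throws IOException {
        bytes.flip();
        out.write(bytes.array(), bytes.arrayOffset() + bytes.position(), bytes.remaining());
        bytes.clear();
    }
}
```

A call such as Utf8Streamer.writeUtf8(text, new ByteArrayOutputStream(), 8192) should produce the same bytes as text.getBytes(StandardCharsets.UTF_8) for well-formed input, without ever holding more than one buffer of encoded data on the encoder side.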