unicode - Trouble comparing Java strings (of different encoding)


I'm writing EXIF metadata to a JPEG using Apache Commons Imaging (Sanselan), and, at least in the 0.97 release of Sanselan, there are bugs related to charset/encoding. The EXIF 2.2 standard requires that the content of fields of type UNDEFINED be prefixed with an 8-byte ASCII "signature" describing the encoding of the content that follows. The field/tag I'm writing to is the UserComment EXIF tag.

Windows expects the content to be encoded in UTF-16, so the bytes written to the JPEG must contain a combination of (single-byte) ASCII characters followed by (double-byte) Unicode characters. Furthermore, although UserComment doesn't seem to require it, I notice that the content is "null-padded" to an even length.

Here's the code I'm using to create and write the tag:

    String textToSet = "test";
    byte[] asciiMarker = new byte[]{ 0x55, 0x4E, 0x49, 0x43, 0x4F, 0x44, 0x45, 0x00 }; // spells out "UNICODE"
    byte[] comment = textToSet.getBytes("UnicodeLittle");

    // pad with \0 if the (total) length is odd (or is the \0 byte automatically added by arraycopy?)
    int pad = (asciiMarker.length + comment.length) % 2;

    byte[] bytesComment = new byte[asciiMarker.length + comment.length + pad];
    System.arraycopy(asciiMarker, 0, bytesComment, 0, asciiMarker.length);
    System.arraycopy(comment, 0, bytesComment, asciiMarker.length, comment.length);
    if (pad > 0) bytesComment[bytesComment.length - 1] = 0x00;

    TiffOutputField exif_comment = new TiffOutputField(TiffConstants.EXIF_TAG_USER_COMMENT,
            TiffFieldTypeConstants.FIELD_TYPE_UNDEFINED, bytesComment.length - pad, bytesComment);

Then, when I read the tag back from the JPEG, I do the following:

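    String textRead = null; // initialised so the null check further down compiles
    TiffField field = jpegMetadata.findEXIFValue(TiffConstants.EXIF_TAG_USER_COMMENT);
    if (field != null) {
        // decodes the whole field, including the 8-byte signature
        textRead = new String(field.getByteArrayValue(), "UnicodeLittle");
    }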
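For comparison, the signature can also be dropped at byte level before decoding, so that the decoded string holds only the comment text. The following is a minimal sketch using only the JDK, not Sanselan. `UserCommentCodec`, `encode` and `decode` are hypothetical names, padding is not handled, and it assumes the text is encoded as UTF-16LE without a byte-order mark via `StandardCharsets.UTF_16LE` rather than the "UnicodeLittle" charset name used in the snippets here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Sanselan / Commons Imaging.
class UserCommentCodec {

    // The 8-byte character-code signature "UNICODE" followed by a NUL byte.
    static final byte[] UNICODE_SIGNATURE = { 'U', 'N', 'I', 'C', 'O', 'D', 'E', 0 };

    /** Returns signature + text encoded as UTF-16LE (assumption: no byte-order mark). */
    static byte[] encode(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] out = Arrays.copyOf(UNICODE_SIGNATURE, UNICODE_SIGNATURE.length + body.length);
        System.arraycopy(body, 0, out, UNICODE_SIGNATURE.length, body.length);
        return out;
    }

    /** Checks the signature, skips it, and decodes only the remaining bytes. */
    static String decode(byte[] raw) {
        int n = UNICODE_SIGNATURE.length;
        if (raw.length < n || !Arrays.equals(Arrays.copyOf(raw, n), UNICODE_SIGNATURE)) {
            throw new IllegalArgumentException("missing UNICODE signature");
        }
        return new String(raw, n, raw.length - n, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] raw = encode("test");
        System.out.println(raw.length + " bytes, decoded: " + decode(raw));
    }
}
```

With this approach the comparison against the original text needs no substring call, because the signature never reaches the decoder.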

What confuses me is this: the bytes written to the JPEG are prefixed with 8 ASCII bytes, which need to be "stripped off" in order to compare what was written with what was read:

    if (textRead != null) {
        if (textToSet.equals(textRead)) { // expecting this to fail
            System.out.println("equal");
        } else {
            System.out.println("not equal");
            if (textToSet.equals(textRead.substring(5))) { // this works
                System.out.println("equal after all...");
            }
        }
    }

But why substring(5), as opposed to... substring(8)? If it were 4, I might think that 4 double-byte (UTF-16) symbols total 8 bytes, but it only works if I strip off 5 characters. Is this an indication that I'm not creating the payload (the byte array bytesComment) properly?
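One way to check whether the payload is the cause is to rebuild the same bytes without Sanselan and count what precedes the text once the whole array is decoded. The sketch below is an experiment under stated assumptions, not a confirmed diagnosis: `UserCommentPrefixDemo`, `buildPayload` and `decodeWhole` are hypothetical names, padding is left out, and it assumes the JDK resolves the "UnicodeLittle" charset name and that encoding with it emits a byte-order mark (FF FE) ahead of the text.

```java
import java.nio.charset.Charset;

// Hypothetical experiment class; it does not touch Sanselan at all.
class UserCommentPrefixDemo {

    // Assumption: this charset name is available and writes a BOM when encoding.
    static final Charset UNICODE_LITTLE = Charset.forName("UnicodeLittle");

    // The 8-byte signature "UNICODE" followed by a NUL byte.
    static final byte[] ASCII_MARKER = { 0x55, 0x4E, 0x49, 0x43, 0x4F, 0x44, 0x45, 0x00 };

    /** Concatenates signature and encoded text, as the writing code does (minus padding). */
    static byte[] buildPayload(String text) {
        byte[] comment = text.getBytes(UNICODE_LITTLE);
        byte[] payload = new byte[ASCII_MARKER.length + comment.length];
        System.arraycopy(ASCII_MARKER, 0, payload, 0, ASCII_MARKER.length);
        System.arraycopy(comment, 0, payload, ASCII_MARKER.length, comment.length);
        return payload;
    }

    /** Decodes the whole payload, signature included, as the reading code does. */
    static String decodeWhole(byte[] payload) {
        return new String(payload, UNICODE_LITTLE);
    }

    public static void main(String[] args) {
        String decoded = decodeWhole(buildPayload("test"));
        int prefixChars = decoded.indexOf("test");
        System.out.println("chars before the text: " + prefixChars);
        // Print each leading char as a code unit to see where it comes from.
        for (int i = 0; i < prefixChars; i++) {
            System.out.printf("  [%d] U+%04X%n", i, (int) decoded.charAt(i));
        }
    }
}
```

If those assumptions hold, the 8 signature bytes decode to 4 UTF-16 code units and the byte-order mark to a fifth (U+FEFF), which would match substring(5).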

PS! I know an update to Apache Commons Imaging 1.0 RC, which came out in 2016, has fixed these bugs, but I'd still like to understand why this works now that I've gotten this far with 0.97 :-)

