python - Decoding of bytes object results in unexpected + invalid UTF-8 - how can I avoid this?


The code below (Python 3.6) takes a bytes object that represents the multiplication sign in UTF-8 (b'\xc3\x97'), decodes it to a string, and writes the string to a file:

    # byte sequence that corresponds to the multiplication sign in UTF-8
    mybytes = b'\xc3\x97'

    # decode to string
    mystring = mybytes.decode('utf-8')

    # write mystring to file
    with open("mystring.txt", "w") as ms_file:
        ms_file.write(mystring)

This gives me the following result:

Bytes written to mystring.txt (checked by opening the file in a hex editor): d7

The result I expected here is the 2-byte sequence c3 97, which is the UTF-8 representation of the multiplication sign. Moreover, d7 is not a valid (one-byte) UTF-8 sequence (see the UTF-8 codepage layout). It is the byte value that matches the ISO/IEC 8859-1 (Latin) encoding though.
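The byte values mentioned above can be checked directly in Python. This is a minimal sketch using only the built-in codecs; the variable names are illustrative and not part of the original code, and cp1252 is included only on the assumption that it is a common Windows default:

```python
# The multiplication sign is code point U+00D7.
times = b'\xc3\x97'.decode('utf-8')

utf8_bytes = times.encode('utf-8')      # two bytes: c3 97
latin1_bytes = times.encode('latin-1')  # one byte: d7
cp1252_bytes = times.encode('cp1252')   # one byte: d7 (assumed Windows default)

print(utf8_bytes.hex(), latin1_bytes.hex(), cp1252_bytes.hex())

# A lone d7 byte is a UTF-8 lead byte with no continuation byte,
# so decoding it as UTF-8 fails.
try:
    b'\xd7'.decode('utf-8')
    lone_d7_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_d7_is_valid_utf8 = False

print(lone_d7_is_valid_utf8)
```

So a file containing only d7 is what one would get if the string were encoded with a single-byte Latin-style codec rather than UTF-8.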

So my question is how I can ensure that I end up with valid UTF-8 here. Am I overlooking something obvious, or is this a bug in Python?
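One possible explanation, offered as an assumption rather than a confirmed diagnosis: when open() is called in text mode without an encoding argument, Python uses the value of locale.getpreferredencoding(False), which on Windows is often a single-byte code page such as cp1252. The sketch below passes encoding="utf-8" explicitly and reads the file back in binary mode to inspect the bytes. The temporary directory and the name written are illustrative only:

```python
import locale
import os
import tempfile

mystring = b'\xc3\x97'.decode('utf-8')

# Encoding that open() falls back to when none is given (platform dependent).
print(locale.getpreferredencoding(False))

# Write into a throwaway directory so the example leaves no stray files around.
path = os.path.join(tempfile.mkdtemp(), "mystring.txt")

# Passing the encoding explicitly makes the output independent of the locale.
with open(path, "w", encoding="utf-8") as ms_file:
    ms_file.write(mystring)

# Read the file back in binary mode to inspect the raw bytes.
with open(path, "rb") as ms_file:
    written = ms_file.read()

print(written.hex())
```

If the printed preferred encoding is not UTF-8, that would be consistent with the d7 byte seen in the hex editor.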

Some context: I ran into this issue while writing some code that processes XML files (that use UTF-8), parses the XML to an element object with lxml, and extracts the text values of some elements, which are subsequently written to another XML file (which also uses UTF-8). Due to this issue I can end up with XML files that are not well-formed.

I'm using Python 3.6 under Windows 7.

Edit: the original question/code contained a function that was supposed to print a hex representation of mystring to the screen, which turns out not to behave as expected. Since this made things unnecessarily confusing (and the function is not essential to the question), I removed it from the code.

