python - Decoding of bytes object results in unexpected + invalid UTF-8 - how can I avoid this?
The code below (Python 3.6) takes a bytes object that represents the multiplication sign in UTF-8 (b'\xc3\x97'), decodes it to a string, and writes the string to a file:
    # byte sequence that corresponds to the multiplication sign in UTF-8
    mybytes = b'\xc3\x97'

    # decode to a string
    mystring = mybytes.decode('utf-8')

    # write mystring to a file
    with open("mystring.txt", "w") as ms_file:
        ms_file.write(mystring)

This gives me the following result:
Bytes written to mystring.txt (checked by opening the file in a hex editor): d7
The result I expected here is the 2-byte sequence c3 97, which is the UTF-8 representation of the multiplication sign. Moreover, d7 is not a valid (one-byte) UTF-8 sequence (see the UTF-8 codepage layout). It is the byte value that matches the ISO/IEC 8859-1 (Latin) encoding, though.
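Both observations can be checked directly in Python. This is a minimal sketch; the helper name is_valid_utf8 is hypothetical and not part of the code above.

```python
# The multiplication sign is code point U+00D7.
mult_sign = "\u00d7"

utf8_bytes = mult_sign.encode("utf-8")      # expected two-byte sequence
latin1_bytes = mult_sign.encode("latin-1")  # ISO/IEC 8859-1 representation

print(utf8_bytes.hex(), latin1_bytes.hex())


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b"\xc3\x97"))  # the sequence I expected in the file
print(is_valid_utf8(b"\xd7"))      # the byte actually found in the file
```

On its own, d7 is the lead byte of a two-byte UTF-8 sequence with its continuation byte missing, which is why it fails to decode.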
So my question is how I can ensure that I end up with valid UTF-8 here. Am I overlooking something obvious, or is this a bug in Python?
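One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: in text mode, open() falls back to the locale's preferred encoding when no encoding argument is given, and on a Western European Windows setup that is often a single-byte code page rather than UTF-8. The sketch below prints that fallback and writes the string with an explicit encoding="utf-8". It writes to a temporary directory rather than the original mystring.txt, which is a choice made here only to keep the example self-contained.

```python
import locale
import os
import tempfile

mystring = b"\xc3\x97".decode("utf-8")

# Encoding that open() uses in text mode when none is specified.
print(locale.getpreferredencoding(False))

# Temporary location (an assumption for this sketch, not the original path).
path = os.path.join(tempfile.mkdtemp(), "mystring.txt")

# Pass the encoding explicitly instead of relying on the locale default.
with open(path, "w", encoding="utf-8") as ms_file:
    ms_file.write(mystring)

# Read the file back in binary mode to inspect the raw bytes.
with open(path, "rb") as raw_file:
    written = raw_file.read()

print(written.hex())
```

If the printed fallback is not UTF-8, that would be consistent with the single d7 byte seen in the hex editor, since the decoding step itself only produces a str and does not decide which bytes end up on disk.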
Some context: I ran into this issue while writing some code that processes XML files (that use UTF-8), parses the XML to an element object with lxml, and then extracts text values of elements that are subsequently written to an XML file (which uses UTF-8). Due to this issue I can end up with XML files that are not well-formed.
I'm using Python 3.6 under Windows 7.
Edit: the original question/code contained a function that was supposed to print a hex representation of mystring to the screen, which turned out not to behave as expected. Since this made things unnecessarily confusing (and the function is not essential to the question), I removed that code.