Converting character offsets into byte offsets (in Python)
Suppose I have a bunch of files in UTF-8 that I send to an external API as Unicode. The API operates on each Unicode string and returns a list of (character_offset, substring) tuples.

The output I need is the begin and end byte offset of each found substring. If I'm lucky the input text contains only ASCII characters (making character offsets and byte offsets identical), but that is not always the case. How can I find the begin and end byte offsets given a known begin character offset and substring?
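To make the mismatch concrete, here is a minimal illustration (the string is a made-up example): any non-ASCII character before the substring shifts the byte offset past the character offset.

```python
text = "caf\u00e9 au lait"  # 'é' is one character but two bytes in UTF-8

substring = "au"
char_offset = text.index(substring)                    # character offset: 5
byte_offset = len(text[:char_offset].encode("utf-8"))  # byte offset: 6

print(char_offset, byte_offset)  # the two offsets differ because of 'é'
```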
I've answered this question myself, but I look forward to other solutions to the problem that are more robust, more efficient, and/or more readable.
I'd solve this using a dictionary that maps character offsets to byte offsets, and then looking the offsets up in that:
```python
def get_char_to_byte_map(unicode_string):
    """
    Generates a dictionary mapping character offsets to byte offsets
    for unicode_string.
    """
    response = {}
    byte_offset = 0
    for char_offset, character in enumerate(unicode_string):
        response[char_offset] = byte_offset
        byte_offset += len(character.encode('utf-8'))
    # Map the end-of-string position too, so that the end offset of a
    # substring reaching the end of the text can still be looked up.
    response[len(unicode_string)] = byte_offset
    return response

char_to_byte_map = get_char_to_byte_map(text)

for character_offset, substring in api_response:
    begin_offset = char_to_byte_map[character_offset]
    end_offset = char_to_byte_map[character_offset + len(substring)]
```
The performance of this solution compared to yours depends a lot on the size of the input and the number of substrings involved. Local micro-benchmarking suggests that encoding each individual character of a text takes about 1000 times as long as encoding the entire text at once.
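For comparison, a sketch of the alternative that the benchmark alludes to (function name and sample data are illustrative): when only a few substrings are involved, encoding just the relevant prefix in a single call may beat building the full character-to-byte map.

```python
def char_to_byte_offset(text, char_offset):
    """Byte offset corresponding to char_offset in text's UTF-8
    encoding, computed by encoding the prefix in one call."""
    return len(text[:char_offset].encode("utf-8"))

text = "na\u00efve caf\u00e9"
api_response = [(6, "caf\u00e9")]  # hypothetical (character_offset, substring) result

for character_offset, substring in api_response:
    begin_offset = char_to_byte_offset(text, character_offset)
    end_offset = begin_offset + len(substring.encode("utf-8"))
```

The trade-off: this does O(n) work per lookup, so with many substrings the precomputed map wins; with one or two, the single `encode` call per lookup is typically faster.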