Converting character offsets into byte offsets (in Python)


Suppose I have a bunch of files in UTF-8 that I send to an external API as Unicode. The API operates on each Unicode string and returns a list of (character_offset, substring) tuples.

For the output I need the begin and end byte offsets of each found substring. If I'm lucky, the input text contains only ASCII characters (making character offsets and byte offsets identical), but that is not always the case. How can I find the begin and end byte offsets given a known begin character offset and the substring?
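A quick illustration of the mismatch (the example string here is my own, not from the question): any character outside ASCII encodes to more than one byte in UTF-8, so every such character before a substring pushes its byte offset past its character offset.

text = u"café latte"
char_offset = text.index(u"latte")                     # 5
byte_offset = len(text[:char_offset].encode('utf-8'))  # 6, because "é" is 2 bytes
print(char_offset, byte_offset)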

I've answered this question myself below, but I look forward to other solutions to the problem that are more robust, more efficient, and/or more readable.

I'd solve this using a dictionary that maps character offsets to byte offsets, and then looking the offsets up in that.

def get_char_to_byte_map(unicode_string):
    """
    Generates a dictionary mapping character offsets to byte offsets
    for unicode_string.
    """
    response = {}
    byte_offset = 0
    for char_offset, character in enumerate(unicode_string):
        response[char_offset] = byte_offset
        byte_offset += len(character.encode('utf-8'))
    # Also map the one-past-the-end offset, so substrings that run to
    # the end of the string can be looked up.
    response[len(unicode_string)] = byte_offset
    return response

char_to_byte_map = get_char_to_byte_map(text)

for begin_offset, substring in api_response:
    begin_byte_offset = char_to_byte_map[begin_offset]
    end_byte_offset = char_to_byte_map[begin_offset + len(substring)]
    # ... use begin_byte_offset and end_byte_offset ...
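To make the mapping concrete, here is a small worked example (the string is my own illustration):

text = u"héllo wörld"
char_to_byte_map = get_char_to_byte_map(text)
# "é" and "ö" each encode to two bytes, so byte offsets drift ahead
# of character offsets after each of them.
print(char_to_byte_map[0])   # 0: 'h'
print(char_to_byte_map[2])   # 3: first 'l', after the 2-byte 'é'
print(char_to_byte_map[7])   # 8: 'ö'
print(char_to_byte_map[8])   # 10: 'r', after the 2-byte 'ö'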

The performance of this solution compared to yours depends a lot on the size of the input and the number of substrings involved. Local micro-benchmarking suggests that encoding each character of a text individually takes about 1000 times as long as encoding the entire text at once.
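A rough sketch of how such a micro-benchmark could look (this timeit setup is my own, not the original benchmark):

import timeit

setup = "text = u'héllo wörld ' * 1000"
per_char = timeit.timeit(
    "sum(len(c.encode('utf-8')) for c in text)", setup=setup, number=100)
whole_text = timeit.timeit(
    "len(text.encode('utf-8'))", setup=setup, number=100)
print("per-character encoding is %.0fx slower" % (per_char / whole_text))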

