vim - Same visible character but different bytes


I have two files, each containing the same (Hindi) word, copied into each file from a different source. While the words from both sources look alike visually, their bytes are different. The files are here , here. I am not sure of the original encoding, but in both cases opening the file as UTF-8 displays the characters correctly.

Interestingly, when I deduplicate them with the uniq utility, only one entry is returned, but when I place them in one file and run :sort u in Vim, I get both entries.

Please explain what's going on.

Update:

If you do not want to open the links, here are the Python literals: '\u091c\u0941\u095c\n' , '\u091c\u0941\u0921\u093c\n' . The word looks like this:

जुड़

  • U+095C DEVANAGARI LETTER DDDHA: ड़
  • U+0921 DEVANAGARI LETTER DDA: ड
  • U+093C DEVANAGARI SIGN NUKTA (dot below the character): ़
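To make the difference concrete, here is a minimal sketch (Python 3) that prints the UTF-8 bytes and code point names of both spellings from the question. The variable names `precomposed` and `decomposed` are only illustrative labels, not anything from the original files.

```python
import unicodedata

# The two literals from the question, without the trailing newline.
precomposed = '\u091c\u0941\u095c'        # ends in U+095C
decomposed = '\u091c\u0941\u0921\u093c'   # ends in U+0921 followed by U+093C

for label, word in (('precomposed', precomposed), ('decomposed', decomposed)):
    raw = word.encode('utf-8')
    # Show the byte count and the bytes themselves in hex.
    print(label, len(raw), 'bytes:', ' '.join('%02x' % b for b in raw))
    # Show each code point with its official Unicode name.
    for ch in word:
        print('    U+%04X %s' % (ord(ch), unicodedata.name(ch)))

# U+095C carries a canonical decomposition to the two-code-point sequence.
print(unicodedata.decomposition('\u095c'))
```

The two words encode to byte strings of different lengths, which is why a byte-wise comparison treats them as distinct even though they render the same.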

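U+095C is canonically equivalent to the sequence U+0921 U+093C, so the two spellings render identically while being stored as different bytes. You can see in Python that they are equivalent (Python 3 syntax here):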

import unicodedata
unicodedata.normalize('NFC', '\u0921\u093c') == unicodedata.normalize('NFC', '\u095c')  # => True

You should be able to use :%!uconv -x any-nfc (with ICU installed), or :%!ruby -ne 'puts $_.unicode_normalize(:nfc)' (with Ruby installed) to normalise the file.
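If neither tool is at hand, the same idea can be sketched in Python with the standard `unicodedata` module. The helper name `normalize_lines` is hypothetical, and the sample lines are the two literals from the question; this is an illustration of the approach, not a replacement for the commands above.

```python
import unicodedata

def normalize_lines(lines, form='NFC'):
    """Return the lines with each one converted to the given normalization form."""
    return [unicodedata.normalize(form, line) for line in lines]

lines = ['\u091c\u0941\u095c\n', '\u091c\u0941\u0921\u093c\n']

# Before normalization the two lines are different code point sequences.
print(len(set(lines)))

# After normalization both lines become the same string, so a byte-wise
# unique sort would keep only one of them.
print(len(set(normalize_lines(lines))))
```

Once every line is in one normalization form, byte-wise comparisons such as Vim's :sort u see the two entries as identical.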

