Python - Invalid Unicode character in XML attribute / tag


What is the list of invalid Unicode characters in XML attribute and tag names?

The following Python 3 code illustrates the problem:

    import xml.etree.ElementTree as ET
    from io import StringIO as sio

    xml_dec = '<?xml version="1.1" encoding="utf-8"?>'
    unicode_text = '<root>textº</root>'
    valid_unicode = '<标签 属性="值">文字</标签>'
    invalid_unicode_attribute = '<tag attributeº="value">text</tag>'
    invalid_unicode_tag = '<tagº>text</tagº>'

    # º in element text is accepted
    ET.parse(sio(xml_dec + unicode_text))  # works

    # Chinese characters in tag and attribute names are accepted
    ET.parse(sio(xml_dec + valid_unicode))  # works

    # º in an attribute name is rejected
    ET.parse(sio(xml_dec + invalid_unicode_attribute))  # ParseError

    # º in a tag name is rejected
    ET.parse(sio(xml_dec + invalid_unicode_tag))  # ParseError

The Unicode character º, i.e. U+00BA, can be parsed if it is in element text, but not if it is in an attribute name or a tag name. On the other hand, other Unicode characters, such as Chinese characters, can be parsed in attribute names and tag names.

I checked the XML <?xml version="1.1" encoding="utf-8"?><tagº>text</tagº> at https://validator.w3.org/check, and it gives this error:

line 1, column 43: character "º" not allowed in attribute specification list

However, the XML 1.1 Recommendation, §2.2 Characters, says it is allowed:

Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
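As a rough illustration, the Char production can be transcribed into a small Python check. The function name `is_xml11_char` is made up for this sketch; it only tests the three code point ranges quoted above and says nothing about where in a document a character may appear.

```python
def is_xml11_char(ch: str) -> bool:
    """Return True if the single character ch matches the XML 1.1 Char production."""
    cp = ord(ch)
    return (
        0x1 <= cp <= 0xD7FF          # everything below the surrogate blocks, except NUL
        or 0xE000 <= cp <= 0xFFFD    # above the surrogates, excluding FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF # supplementary planes
    )


if __name__ == "__main__":
    for ch in ("º", "标", "\uFFFE", "\x00"):
        print(f"U+{ord(ch):04X}", is_xml11_char(ch))
```

Under this check º (U+00BA) is a legal Char, which is why it is accepted in element text.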

My question is: where can I find the list of invalid Unicode characters in XML attribute / tag names?

For the characters allowed in tag and attribute names, the W3C Recommendation (the one you linked to; you were looking at the definition of what can be used in a text node) states the following:

Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters.

and

Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name.

This is followed by the formal definition, which lists a lot of Unicode ranges:

NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] |
                  [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] |
                  [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] |
                  [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
                  [#x10000-#xEFFFF]
NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7 |
                  [#x0300-#x036F] | [#x203F-#x2040]
Name          ::= NameStartChar (NameChar)*

The masculine ordinal indicator º (#xBA) is not among them, for whatever reason (at least, some languages use it in abbreviations of common words, so it doesn't look like a “delimiter” to me).
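If you want to test names yourself, the productions can be transcribed into a regular expression. This is only a sketch: the helper names `is_xml_name` and `invalid_name_chars` are invented for the example, and the ranges are copied by hand from the grammar, so check them against the Recommendation before relying on them.

```python
import re

# Ranges transcribed from the NameStartChar production.
_NAME_START = (
    ":A-Z_a-z"
    "\xC0-\xD6\xD8-\xF6\xF8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF"
    "\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    "\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)
# NameChar adds "-", ".", digits, MIDDLE DOT and two more ranges.
_NAME_CHAR = _NAME_START + r"\-.0-9" + "\xB7\u0300-\u036F\u203F-\u2040"

_START_RE = re.compile(f"[{_NAME_START}]")
_CHAR_RE = re.compile(f"[{_NAME_CHAR}]")
_NAME_RE = re.compile(f"[{_NAME_START}][{_NAME_CHAR}]*")


def is_xml_name(name: str) -> bool:
    """Return True if name matches the Name production."""
    return _NAME_RE.fullmatch(name) is not None


def invalid_name_chars(name: str) -> list:
    """List the characters of name that are not allowed at their position."""
    bad = []
    for index, ch in enumerate(name):
        # The first character must be a NameStartChar, the rest a NameChar.
        pattern = _START_RE if index == 0 else _CHAR_RE
        if not pattern.fullmatch(ch):
            bad.append(ch)
    return bad


if __name__ == "__main__":
    for candidate in ("tag", "tagº", "标签", "属性", "1tag"):
        print(candidate, is_xml_name(candidate), invalid_name_chars(candidate))
```

With these ranges, `tagº` is rejected because of the º, while `标签` and `属性` are accepted because their characters fall inside [#x3001-#xD7FF].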

It's interesting to see that you can use digits, hyphens and periods in tag names, but not as the first character.
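A quick way to see the first-character rule in practice is to feed a few one-element documents to the parser. This sketch assumes the standard library's `xml.etree.ElementTree`; the helper `parses` is just a name chosen for the example.

```python
import xml.etree.ElementTree as ET


def parses(doc: str) -> bool:
    """Return True if doc is accepted by ElementTree's parser."""
    try:
        ET.fromstring(doc)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # Digit, hyphen and period after the first character, then as the first character.
    for doc in ("<a1/>", "<a-b/>", "<a.b/>", "<1a/>", "<-a/>", "<.a/>"):
        print(doc, "parses" if parses(doc) else "ParseError")
```

The first three documents should parse and the last three should raise ParseError, since a digit, hyphen or period is a NameChar but not a NameStartChar.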

