Python - Invalid Unicode character in XML attribute / tag
What is the list of invalid Unicode characters in XML attributes (tags)?
As the following Python 3 code illustrates:
    import xml.etree.ElementTree as ET
    from io import StringIO as sio

    xml_dec = '<?xml version="1.1" encoding="utf-8"?>'
    unicode_text = '<root>textº</root>'
    valid_unicode = '<标签 属性="值">文字</标签>'
    invalid_unicode_attribute = '<tag attributeº="value">text</tag>'
    invalid_unicode_tag = '<tagº>text</tagº>'

    ET.parse(sio(xml_dec + unicode_text))               # works
    ET.parse(sio(xml_dec + valid_unicode))              # works
    ET.parse(sio(xml_dec + invalid_unicode_attribute))  # ParseError
    ET.parse(sio(xml_dec + invalid_unicode_tag))        # ParseError

The Unicode character º, i.e. U+00BA, can be parsed if it is in element text, but not if it is in an attribute name or tag name. On the other hand, other Unicode characters, such as Chinese characters, can be parsed in attribute names and tag names.
I checked the XML <?xml version="1.1" encoding="utf-8"?><tagº>text</tagº> at https://validator.w3.org/check, and it gives the error:
Line 1, column 43: character "º" not allowed in attribute specification list
However, in the XML 1.1 Recommendation, §2.2 Characters, it says the character is allowed:
Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
My question is, where can I find the list of invalid Unicode characters in XML attributes / tags?
For the characters allowed in tag and attribute names, the W3C Recommendation (the one you linked to; you were looking at the definition of what can be used in a text node) states the following:
Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters.
and
Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.
The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name.
This is followed by the formal definition, which lists a lot of Unicode ranges:
    NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
    NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
    Name          ::= NameStartChar (NameChar)*

The masculine ordinal indicator º (#xBA) is not among them, for whatever reason (at least, some languages use it in abbreviations of common words, so it doesn't look like a “delimiter” to me).
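As a rough illustration, the Name production can be transcribed into a Python regular expression to test candidate tag or attribute names. This is only a sketch: the names NAME_START, NAME_EXTRA and is_xml_name are made up for this example, and the ranges are copied from the grammar quoted above rather than taken from any library.

```python
import re

# Ranges transcribed from the NameStartChar production (assumed to be
# copied correctly from the XML 1.1 grammar quoted above).
NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF"
    "\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    "\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)
# Extra characters that NameChar allows after the first character.
NAME_EXTRA = "\\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

# Name ::= NameStartChar (NameChar)*
NAME_RE = re.compile(f"[{NAME_START}][{NAME_START}{NAME_EXTRA}]*")


def is_xml_name(candidate):
    """Return True if `candidate` matches the Name production (hypothetical helper)."""
    return NAME_RE.fullmatch(candidate) is not None


for name in ["tag", "tagº", "attributeº", "标签", "属性"]:
    print(repr(name), is_xml_name(name))
```

With this check, "tagº" and "attributeº" are rejected because U+00BA falls in none of the listed ranges, while the Chinese names are accepted because their characters lie in [#x3001-#xD7FF].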
It's interesting to see that you can use digits, hyphens and periods in tag names, but not as the first character.
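The first-character rule can be checked directly with the parser used in the question. The following is a minimal sketch; parses is a hypothetical helper, and the results reflect the expat-based parser behind xml.etree.ElementTree rather than the grammar itself.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if ElementTree accepts `xml_text` (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(parses("<a-1.b/>"))  # digit, hyphen and period after the first character
print(parses("<1a/>"))     # digit as the first character
print(parses("<-a/>"))     # hyphen as the first character
print(parses("<.a/>"))     # period as the first character
```

Only the first document should be accepted; the other three start the tag name with a character that is a NameChar but not a NameStartChar.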