Opened 15 years ago
Last modified 8 years ago
#384 new defect
HTMLParser does not work with comments that include non-ascii characters
Reported by: | robert.hoelzl@… | Owned by: | cmlenz |
---|---|---|---|
Priority: | major | Milestone: | 0.9 |
Component: | Parsing | Version: | 0.5.1 |
Keywords: | Cc: |
Description
Hello,
When parsing an HTML file that contains a comment with a non-ASCII character (like "<!-- \xF6 -->"), the HTMLParser() object raises a UnicodeDecodeError.
The cause of this bug is in module genshi.input (input.py), class HTMLParser, method handle_comment, which enqueues the raw byte string without decoding it first:
Current implementation:

    def handle_comment(self, text):
        self._enqueue(COMMENT, text)

Corrected implementation:

    def handle_comment(self, text):
        if not isinstance(text, unicode):
            text = text.decode(self.encoding, 'replace')
        self._enqueue(COMMENT, text)
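
The decoding step in the corrected implementation can be tried in isolation. The following is only a sketch under stated assumptions, not Genshi code: `decode_comment` is a hypothetical standalone helper, and it is written for Python 3, where `str` plays the role that `unicode` plays in the Python 2 snippet above. It shows that a strict ASCII decode of the comment bytes fails, while decoding with the 'replace' error handler always yields text.

```python
def decode_comment(text, encoding="utf-8"):
    """Hypothetical helper mirroring the proposed fix (not part of Genshi).

    Byte strings are decoded with the 'replace' error handler, so bytes
    that are invalid in `encoding` become U+FFFD instead of raising.
    Text that is already str is returned unchanged.
    """
    if not isinstance(text, str):
        text = text.decode(encoding, "replace")
    return text


# Comment body from the report: "<!-- \xF6 -->" without the delimiters.
raw = b" \xf6 "

# A strict ASCII decode is what fails with UnicodeDecodeError.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With the matching encoding the character survives intact.
latin1_text = decode_comment(raw, "iso-8859-1")

# With a non-matching encoding the byte is replaced, but nothing raises.
utf8_text = decode_comment(raw, "utf-8")
```

The 'replace' handler trades fidelity for robustness: a comment in the wrong encoding loses the offending character, but parsing continues.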
Moved to milestone 0.9.