Opened 16 years ago
Last modified 9 years ago
#384 new defect
HTMLParser does not work with comments that include non-ascii characters
| Reported by: | robert.hoelzl@… | Owned by: | cmlenz |
|---|---|---|---|
| Priority: | major | Milestone: | 0.9 |
| Component: | Parsing | Version: | 0.5.1 |
| Keywords: | | Cc: | |
Description
Hello,
When parsing an HTML file that contains a comment with a non-ASCII character (like "<!-- \xF6 -->"), the HTMLParser() object throws a UnicodeDecodeError.
The cause of this bug is in module genshi.input (genshi/input.py), class HTMLParser, method handle_comment:
current implementation:
    def handle_comment(self, text):
        self._enqueue(COMMENT, text)
correct implementation:
    def handle_comment(self, text):
        if not isinstance(text, unicode):
            text = text.decode(self.encoding, 'replace')
        self._enqueue(COMMENT, text)
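
For anyone who wants to see the decoding behaviour in isolation, here is a minimal standalone sketch. It is written for Python 3, so it checks for `bytes` rather than `unicode`, and it does not use Genshi at all. The helper name `to_text` and the sample encodings are assumptions made for illustration, not part of the Genshi API. It shows that a strict ASCII decode of the byte from the example comment fails (Python 2 uses the ASCII codec when it implicitly converts a byte string to unicode, which is presumably where the reported error comes from), while decoding with the 'replace' error handler never raises.

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper (not part of Genshi): return `value` as text."""
    if isinstance(value, bytes):
        # 'replace' substitutes U+FFFD for undecodable bytes instead of raising.
        return value.decode(encoding, "replace")
    # Already text, so pass it through unchanged.
    return value


# Body of the comment from the report, as raw bytes.
raw_comment = b" \xf6 "

# A strict ASCII decode fails on the 0xF6 byte.
try:
    raw_comment.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With the right encoding the character survives.
as_latin1 = to_text(raw_comment, "iso-8859-1")

# With a mismatched encoding the byte is replaced rather than raising.
as_utf8 = to_text(raw_comment, "utf-8")
```

If the parser's encoding matches the file, the character is preserved; if it does not, the comment still parses and the offending byte becomes a replacement character, which is the trade-off the 'replace' handler in the proposed fix accepts.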

Moved to milestone 0.9.