Edgewall Software

Opened 15 years ago

Last modified 8 years ago

#384 new defect

HTMLParser does not work with comments that include non-ascii characters

Reported by: robert.hoelzl@… Owned by: cmlenz
Priority: major Milestone: 0.9
Component: Parsing Version: 0.5.1
Keywords: Cc:

Description

Hello,

When parsing an HTML file that contains a comment with a non-ASCII character (like "<!-- \xF6 -->"), the HTMLParser() object throws a UnicodeDecodeError.

The cause of this bug is in module genshi.input (input.py), class HTMLParser, method handle_comment:

current implementation:

def handle_comment(self, text):
    self._enqueue(COMMENT, text)

correct implementation:

def handle_comment(self, text):
    if not isinstance(text, unicode):
        text = text.decode(self.encoding, 'replace')
    self._enqueue(COMMENT, text)
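The guard in the proposed fix can be tried outside Genshi. The following is only a minimal sketch, written for Python 3, where bytes plays the role of Python 2's str and str plays the role of unicode. The helper name to_text and its encoding parameter are hypothetical and not part of Genshi. They stand in for the isinstance check and for self.encoding in the method above.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings leniently.

    Hypothetical stand-in for the guard proposed for handle_comment;
    `encoding` corresponds to the parser's self.encoding.
    """
    if isinstance(value, bytes):
        # 'replace' substitutes U+FFFD for undecodable bytes instead of raising.
        return value.decode(encoding, "replace")
    # Already text: pass through unchanged.
    return value


# The comment body from the report: a single 0xF6 byte between spaces.
comment = b" \xf6 "

# With the right encoding the byte decodes to the intended character.
decoded_latin1 = to_text(comment, "iso-8859-1")

# With a mismatched encoding the byte is replaced rather than raising.
decoded_utf8 = to_text(comment, "utf-8")
```

With the 'replace' error handler, a wrong or unknown encoding degrades the comment text instead of aborting the parse. Whether that trade-off is acceptable for comments is a design decision for the maintainers.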

Change History (1)

comment:1 Changed 8 years ago by hodgestar

  • Milestone changed from 0.7 to 0.9

Moved to milestone 0.9.
