Edgewall Software

Opened 14 years ago

Closed 11 years ago

Last modified 11 years ago

#375 closed defect (wontfix)

genshi.input.ParseError: malformed start tag: line ...

Reported by: anatoly techtonik <techtonik@…> Owned by: cmlenz
Priority: major Milestone: 0.7
Component: Parsing Version: 0.5.1
Keywords: Cc:

Description

Genshi fails on malformed documents, but I need it to parse them even when they are malformed. Is there any way to continue processing in this case and still get the page title? See the attached file.

Attachments (1)

test_view.py (438 bytes) - added by anatoly techtonik <techtonik@…> 14 years ago.

Change History (12)

Changed 14 years ago by anatoly techtonik <techtonik@…>

comment:1 Changed 14 years ago by anatoly techtonik <techtonik@…>

Akismet says new attachment is spam, so placing it at http://pastebin.com/FdZPXRgt

comment:2 Changed 14 years ago by anatoly techtonik <techtonik@…>

Or better, use this script to test the error:

Traceback (most recent call last):
  File "test_view.py", line 16, in <module>
    print genshi_parse("http://bugs.farmanager.com/view.php?id=1288")
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\core.py", line 243, in __str__
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\core.py", line 179, in render
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\output.py", line 60, in encode
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\output.py", line 210, in __call__
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\output.py", line 592, in __call__
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\output.py", line 698, in __call__
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\output.py", line 532, in __call__
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\core.py", line 283, in _ensure
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\path.py", line 134, in _generate
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\core.py", line 283, in _ensure
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\input.py", line 438, in _coalesce
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\input.py", line 346, in _generate
genshi.input.ParseError: malformed start tag: line 33, column 735

import genshi
import urllib2

def genshi_parse(url):
    """parse url to get bug title"""
    title_path = "head/title/text()"
    mbt_file = urllib2.urlopen(url)
    #mbt_genshi = genshi.input.HTML(mbt_file)
    mbt_genshi = genshi.input.HTMLParser(mbt_file)
    title = mbt_genshi.parse().select(title_path)#.render().decode("utf-8")
    #except genshi.ParseError:
    #    pass
    return title

print genshi_parse("http://bugs.farmanager.com/view.php?id=1288")

comment:3 Changed 14 years ago by anatoly techtonik <techtonik@…>

  • Component changed from General to Parsing

comment:4 follow-up: Changed 14 years ago by Carsten Klein <carsten.klein@…>

Have you ever looked at the offending part in your HTML input file?

<span class=\"italic\">

I would like to see the document where this is considered HTML markup at all.

comment:5 in reply to: ↑ 4 ; follow-up: Changed 14 years ago by Carsten Klein <carsten.klein@…>

Replying to Carsten Klein <carsten.klein@…>:

Have you ever looked at the offending part in your HTML input file?

<span class=\"italic\">

I would like to see the document where this is considered HTML markup at all.

Besides that, I do not know of any correcting parser out there that would turn this into well-formed markup. Using the '\' escape character on the quotation marks surrounding an attribute's value seems rather odd and is not a general use case for correcting parsers.

comment:6 in reply to: ↑ 5 ; follow-up: Changed 14 years ago by anatoly techtonik <techtonik@…>

Replying to Carsten Klein <carsten.klein@…>:

Have you ever looked at the offending part in your HTML input file?

<span class=\"italic\">

I would like to see the document where this is considered HTML markup at all.

http://bugs.farmanager.com/view.php?id=1288 This link is actually present at the bottom of comment:2.

Replying to Carsten Klein <carsten.klein@…>:

Besides that, I do not know of any correcting parser out there that would turn this into well-formed markup. Using the '\' escape character on the quotation marks surrounding an attribute's value seems rather odd and is not a general use case for correcting parsers.

What correcting parsers have you tried? I am pretty sure tidy can turn this into well-formed markup; I don't know about Beautiful Soup. In either case, the failing line is of no importance to me. All I need is to extract some info from the header of this document, but I can't because of the Genshi exception. The scraping script was deployed on an automated public service, and it took some months until the error was discovered and reported.

comment:7 in reply to: ↑ 6 Changed 14 years ago by Carsten Klein <carsten.klein@…>

Replying to anatoly techtonik <techtonik@…>:

Hm, the use case of Genshi parsing an existing HTML page generated by some site's PHP is definitely new to me.

You know, Genshi is a templating engine, not a correcting parser that would take your PHP-generated page and fix the errors introduced when that page was rendered.

In fact, Genshi requires input to be well-formed, since that input is either generated by some application or has already been made well-formed by an external tool.

And even if you provided a template that used \ to escape the quotation marks enclosing attribute values, Genshi would still fail, since it requires all input to be well-formed.

This is a fundamental design decision; see, for example, how existing XML parsers treat well-formedness, or the XML standard itself.
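
To illustrate the well-formedness point, here is a minimal sketch that uses Python 3's standard-library xml.etree.ElementTree rather than Genshi itself. That substitution is an assumption made to keep the example self-contained; Genshi's own parser is not exercised here. The helper name is_well_formed and both sample fragments are hypothetical; the second fragment reproduces the backslash-escaped quotes discussed in this ticket.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: one with normal quotes, one with the
# backslash-escaped quotes reported in this ticket.
GOOD = '<p>see <span class="italic">this</span></p>'
BAD = '<p>see <span class=\\"italic\\">this</span></p>'


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # A strict parser stops at the first violation instead of guessing.
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed(GOOD))
    print(is_well_formed(BAD))
```

A strict parser of this kind rejects the second fragment outright, which is the same class of behaviour the reporter ran into.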

comment:8 Changed 11 years ago by anonymous

http://trac-hacks.org/ticket/10651 also reports such an error. Though I convert to unicode before calling HTML(), look here.

comment:9 Changed 11 years ago by anonymous

I wish there were more robustness: not giving up with a ParseError, but skipping the bad tags instead and inserting an error message at the broken part of the page. This would give the end user a chance to recognize what the problem is. Tracebacks are always sent to the developers, although the cause is on the user's side. See the example of TH:SimpleMultiProjectPlugin.

comment:10 Changed 11 years ago by hodgestar

  • Resolution set to wontfix
  • Status changed from new to closed

I agree with Carsten's comments from comment:7:

You know, Genshi is a templating engine, not a correcting parser that would take your PHP-generated page and fix the errors introduced when that page was rendered.

and

This is a fundamental design decision; see, for example, how existing XML parsers treat well-formedness, or the XML standard itself.

If you need to clean up your HTML before passing it to Genshi, use some other library for that (quite a few have been mentioned in this ticket already).

Closing as wontfix.
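
As one possible reading of the "use some other library" advice, applied to the original goal of getting the page title, the following sketch uses the lenient html.parser module from the Python 3 standard library instead of Genshi. It is an illustration under assumptions: TitleExtractor, extract_title and the SAMPLE document are hypothetical names and data, not part of Genshi or of the reporter's script.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text of the first <title> element, ignoring the rest."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        return "".join(self._chunks).strip()


def extract_title(html_text):
    """Return the page title, or an empty string if none is found."""
    parser = TitleExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.title


# Hypothetical page with the backslash-escaped attribute quotes in the body.
SAMPLE = (
    "<html><head><title>Sample page title</title></head>"
    '<body><p>see <span class=\\"italic\\">this</span></p></body></html>'
)

if __name__ == "__main__":
    print(extract_title(SAMPLE))
```

Because this parser does not insist on well-formed input, the odd attribute in the body does not prevent the title in the head from being read. Whether that trade-off suits a given scraper is a separate decision.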

comment:11 Changed 11 years ago by anatoly techtonik <techtonik@…>

Then I believe Genshi should not pretend that it has an HTML parser, because by definition HTML is not well-formed. At least HTML5, for sure: http://stackoverflow.com/questions/3583332/html5-and-well-formedness

An alternative is to document this and point to the packages that can sanitize HTML for Genshi.
