Edgewall Software

Ticket #375 (new defect)

Opened 23 months ago

Last modified 21 months ago

genshi.input.ParseError: malformed start tag: line ...

Reported by: anatoly techtonik <techtonik@…>
Owned by: cmlenz
Priority: major
Milestone: 0.7
Component: Parsing
Version: 0.5.1
Keywords:
Cc:

Description

Genshi fails on malformed documents, but I need it to parse a page even if it is malformed. Is there any way to continue processing in this case and still get the page title? See attached files.

Attachments

test_view.py Download (438 bytes) - added by anatoly techtonik <techtonik@…> 23 months ago.

Change History

  Changed 23 months ago by anatoly techtonik <techtonik@…>

Akismet says the new attachment is spam, so I am placing it at http://pastebin.com/FdZPXRgt

  Changed 23 months ago by anatoly techtonik <techtonik@…>

Or better, use this script to test the error:

Traceback (most recent call last):
  File "test_view.py", line 16, in <module>
    print genshi_parse("http://bugs.farmanager.com/view.php?id=1288")
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\core.py", line 243, in __str__
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\core.py", line 179, in render
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\output.py", line 60, in encode
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\output.py", line 210, in __call__
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\output.py", line 592, in __call__
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\output.py", line 698, in __call__
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\output.py", line 532, in __call__
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\core.py", line 283, in _ensure
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\path.py", line 134, in _generate
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\core.py", line 283, in _ensure
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\input.py", line 438, in _coalesce
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\input.py", line 346, in _generate
genshi.input.ParseError: malformed start tag: line 33, column 735

import genshi
import urllib2

def genshi_parse(url):
    """parse url to get bug title"""
    title_path = "head/title/text()"
    mbt_file = urllib2.urlopen(url)
    #mbt_genshi = genshi.input.HTML(mbt_file)
    mbt_genshi = genshi.input.HTMLParser(mbt_file)
    title = mbt_genshi.parse().select(title_path)#.render().decode("utf-8")
    #except genshi.ParseError:
    #    pass
    return title

print genshi_parse("http://bugs.farmanager.com/view.php?id=1288")

  Changed 23 months ago by anatoly techtonik <techtonik@…>

  • component changed from General to Parsing

follow-up: ↓ 5   Changed 21 months ago by Carsten Klein <carsten.klein@…>

Have you ever looked at the offending part in your HTML input file?

<span class=\"italic\">

I would like to see the document where this is considered HTML markup at all.

in reply to: ↑ 4 ; follow-up: ↓ 6   Changed 21 months ago by Carsten Klein <carsten.klein@…>

Replying to Carsten Klein <carsten.klein@…>:

Have you ever looked at the offending part in your HTML input file? {{{ <span class=\"italic\"> }}} I would like to see the document where this is considered HTML markup at all.

Besides that, I do not know of any correcting parser out there that would make this well-formed markup. Using the '\' escape symbol on the quotation marks surrounding the attribute's value seems rather odd and is not a general use case for correcting parsers.

in reply to: ↑ 5 ; follow-up: ↓ 7   Changed 21 months ago by anatoly techtonik <techtonik@…>

Replying to Carsten Klein <carsten.klein@…>:

Have you ever looked at the offending part in your HTML input file? {{{ <span class=\"italic\"> }}} I would like to see the document where this is considered HTML markup at all.

 http://bugs.farmanager.com/view.php?id=1288 This link is actually present at the bottom of comment:2.

Replying to Carsten Klein <carsten.klein@…>:

Besides that, I do not know of any correcting parser out there that would make this well-formed markup. Using the '\' escape symbol on the quotation marks surrounding the attribute's value seems rather odd and is not a general use case for correcting parsers.

What correcting parsers have you tried? I am pretty sure tidy can turn this into well-formed markup. I do not know about Beautiful Soup. In either case, the failing line has no importance to me. All I need is to extract some info from the header of this document, but I can't because of the Genshi exception. The scraping script was deployed on an automated public service, and it took some months until the error was discovered and reported.
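For illustration only, here is a minimal sketch of pulling just the title out of a page with the tolerant `html.parser` module from the Python 3 standard library instead of Genshi. It is not a Genshi feature and not a fix for this ticket; the class name `TitleExtractor`, the helper `extract_title`, and the sample markup are all hypothetical, and the sample only imitates the backslash-escaped attribute quoted above.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._parts = []
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect the pieces.
        if self._in_title:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._parts).strip()


def extract_title(html_text):
    """Return the page title, or None if no <title> element was seen."""
    parser = TitleExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.title


# Made-up sample that imitates the backslash-escaped attribute from this ticket.
SAMPLE = r'<html><head><title>Example bug title</title></head><body><span class=\"italic\">text</span></body></html>'

if __name__ == "__main__":
    print(extract_title(SAMPLE))
```

The sketch assumes the response body has already been decoded to text. It trades Genshi's XPath-style `select()` for a small event handler, which is enough when only the header matters.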

in reply to: ↑ 6   Changed 21 months ago by Carsten Klein <carsten.klein@…>

Replying to anatoly techtonik <techtonik@…>:

Hm, the use case of Genshi parsing an existing HTML page output by some site's PHP is definitely new to me.

You know, Genshi is a templating engine and not some kind of correcting parser that would take your PHP-generated page and then correct the errors you introduced when rendering that page.

In fact, Genshi requires input to be well-formed, since it is either auto-generated by some application or previously made well-formed by some external application.

And even if you provided a template that had \ escaping the enclosing quotation marks when assigning attribute values, Genshi would still fail, since it requires all input to be well-formed.

This is a fundamental design decision; see, for example, existing XML parsers and what they say about well-formedness, or the documents on the XML standard.
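As a small illustration of the point about XML parsers, the sketch below feeds a fragment to the standard library's `xml.etree.ElementTree` under Python 3 and reports whether it parses. The helper name `is_well_formed` and the sample fragments are made up for this example; it shows how a strict XML parser treats a backslash-escaped attribute, not how Genshi is implemented.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise (hypothetical helper)."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# A normally quoted attribute value parses.
GOOD = '<span class="italic">text</span>'

# The backslash before the quote means the value does not start with a quote
# character, so a strict XML parser rejects the start tag.
BAD = r'<span class=\"italic\">text</span>'

if __name__ == "__main__":
    print(is_well_formed(GOOD), is_well_formed(BAD))
```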
