#375 closed defect (wontfix)
genshi.input.ParseError: malformed start tag: line ...
Reported by: | anatoly techtonik <techtonik@…> | Owned by: | cmlenz |
---|---|---|---|
Priority: | major | Milestone: | 0.7 |
Component: | Parsing | Version: | 0.5.1 |
Keywords: | Cc: |
Description
Genshi fails on malformed documents, but I need it to parse them even if they are malformed. Is there any way to continue processing in this case and still get the page title? See attached files.
Attachments (1)
Change History (12)
Changed 15 years ago by anatoly techtonik <techtonik@…>
comment:1 Changed 15 years ago by anatoly techtonik <techtonik@…>
comment:2 Changed 15 years ago by anatoly techtonik <techtonik@…>
Or better, use this script to test the error:
Traceback (most recent call last):
  File "test_view.py", line 16, in <module>
    print genshi_parse("http://bugs.farmanager.com/view.php?id=1288")
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\core.py", line 243, in __str__
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\core.py", line 179, in render
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\output.py", line 60, in encode
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\output.py", line 210, in __call__
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\output.py", line 592, in __call__
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\output.py", line 698, in __call__
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\output.py", line 532, in __call__
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\core.py", line 283, in _ensure
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\path.py", line 134, in _generate
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\core.py", line 283, in _ensure
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\input.py", line 438, in _coalesce
  File "C:\~env\Python26\lib\site-packages\genshi-0.5.1-py2.6-win32.egg\genshi\input.py", line 346, in _generate
genshi.input.ParseError: malformed start tag: line 33, column 735
import genshi
import urllib2

def genshi_parse(url):
    """parse url to get bug title"""
    title_path = "head/title/text()"
    mbt_file = urllib2.urlopen(url)
    #mbt_genshi = genshi.input.HTML(mbt_file)
    mbt_genshi = genshi.input.HTMLParser(mbt_file)
    title = mbt_genshi.parse().select(title_path)#.render().decode("utf-8")
    #except genshi.ParseError:
    #  pass
    return title

print genshi_parse("http://bugs.farmanager.com/view.php?id=1288")
comment:3 Changed 15 years ago by anatoly techtonik <techtonik@…>
- Component changed from General to Parsing
comment:4 follow-up: ↓ 5 Changed 14 years ago by Carsten Klein <carsten.klein@…>
Have you ever looked at the offending part in your HTML input file?
<span class=\"italic\">
I would like to see the document where this is considered HTML markup at all.
comment:5 in reply to: ↑ 4 ; follow-up: ↓ 6 Changed 14 years ago by Carsten Klein <carsten.klein@…>
Replying to Carsten Klein <carsten.klein@…>:
Have you ever looked at the offending part in your HTML input file?
<span class=\"italic\">
I would like to see the document where this is considered HTML markup at all.
Besides that, I do not know of any correcting parser out there that would make this well-formed markup. Using the '\' escape symbol on the surrounding quotation marks of the attribute's value seems rather odd and is not a general use case for correcting parsers.
comment:6 in reply to: ↑ 5 ; follow-up: ↓ 7 Changed 14 years ago by anatoly techtonik <techtonik@…>
Replying to Carsten Klein <carsten.klein@…>:
Have you ever looked at the offending part in your HTML input file?
<span class=\"italic\">
I would like to see the document where this is considered HTML markup at all.
http://bugs.farmanager.com/view.php?id=1288 This link is actually present at the bottom of comment:2.
Replying to Carsten Klein <carsten.klein@…>:
Besides that, I do not know of any correcting parser out there that would make this well-formed markup. Using the '\' escape symbol on the surrounding quotation marks of the attribute's value seems rather odd and is not a general use case for correcting parsers.
What correcting parsers have you tried? I am pretty sure tidy can turn this into well-formed markup. I do not know about Beautiful Soup. In either case, the failing line has no importance to me. All I need is to extract some info from the header of this document, but I can't because of the Genshi exception. The scraping script was deployed on an automated public service, and it took some months until the error was discovered and reported.
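For the narrow goal described here (reading the title from a page whose body contains a malformed tag), a tolerant parser may be enough. The following is a minimal sketch and not something taken from the ticket: it assumes Python 3 and swaps Genshi for the standard library's `html.parser`, which does not raise on bad markup. The names `TitleExtractor` and `extract_title`, and the `SAMPLE` markup with its title text, are hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def extract_title(html_text):
    """Return the page title, tolerating malformed markup elsewhere."""
    parser = TitleExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.title


# Made-up sample reproducing the backslash-escaped attribute quotes
# discussed in this ticket; the title text is invented.
SAMPLE = (
    "<html><head><title>Example bug title</title></head>"
    '<body><span class=\\"italic\\">note</span></body></html>'
)

if __name__ == "__main__":
    print(extract_title(SAMPLE))
```

Because the title is collected through callbacks as the input is scanned, the odd attribute later in the body does not prevent it from being read. Whether the cleaned-up markup would then be acceptable to Genshi is a separate question that this sketch does not address.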
comment:7 in reply to: ↑ 6 Changed 14 years ago by Carsten Klein <carsten.klein@…>
Replying to anatoly techtonik <techtonik@…>:
Hm, the use case of Genshi parsing an existing HTML page output by some site's PHP is definitely new to me.
You know, genshi is a templating engine and not some kind of correcting parser engine that would take your php generated page and then would correct the errors you introduced when rendering that page.
In fact, genshi requires input to be well formed since it is either auto generated by some application, or previously made to be well formed by some external application.
And even if you would provide a template that had \ escaping the enclosing quotation marks when assigning attribute values, then genshi would still fail since it requires all input to be well formed.
This is a fundamental design decision, see for example existing XML parsers and what they might say about well formedness, or even the documents on the XML standard.
comment:8 Changed 12 years ago by anonymous
http://trac-hacks.org/ticket/10651 also reports such an error, even though I convert to unicode before calling HTML(); look here.
comment:9 Changed 12 years ago by anonymous
I wish there were more robustness: instead of giving up with a ParseError, Genshi could skip the bad tags and insert an error message at the broken part of the page. This would give the end user a chance to recognize what the problem is. Tracebacks are always sent to the developers, although the cause is on the user side. See the example of TH:SimpleMultiProjectPlugin.
comment:10 Changed 12 years ago by hodgestar
- Resolution set to wontfix
- Status changed from new to closed
I agree with Carsten's comments from comment:7
You know, genshi is a templating engine and not some kind of correcting parser engine that would take your php generated page and then would correct the errors you introduced when rendering that page.
and
This is a fundamental design decision, see for example existing XML parsers and what they might say about well formedness, or even the documents on the XML standard.
If you need to clean up your HTML before passing it to Genshi, use some other library for that (quite a few have been mentioned in this ticket already).
Closing as wontfix.
comment:11 Changed 12 years ago by anatoly techtonik <techtonik@…>
Then I believe Genshi should not pretend that it has an HTML parser, because by definition HTML is not well-formed. At least HTML5 for sure: http://stackoverflow.com/questions/3583332/html5-and-well-formedness
An alternative is to document this and point to the packages that can sanitize HTML for Genshi.
Akismet says the new attachment is spam, so I am placing it at http://pastebin.com/FdZPXRgt