Edgewall Software

GenshiRecipes/FilterHTMLUsingRegex

Version 1 (modified by anatoly techtonik <techtonik@…>, 4 years ago)

How to filter HTML with regular expressions

Genshi's XPath support is very limited and cannot express things such as selecting an empty row in a table. This recipe shows how to select and remove HTML elements using regular expressions inside a Transformer.

from genshi.input import HTML
from genshi.filters.transform import Transformer, StreamBuffer
import re

html2 = HTML('''
<table> 
 <tr><th></th><td></td><td></td></tr> 
 <tr><th>not empty</th><td></td><td></td></tr> 
</table>
''')

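# Buffer that receives a copy of the selected events
buffer = StreamBuffer()

def rowfilter():
    # Closure over `buffer`: replace() calls this lazily, after copy() has filled it
    text = buffer.render()
    # Drop every <tr> whose <th>/<td> cells are all empty (<td/> or <td></td>)
    text = re.sub(r'(?s)<tr>(\s*<t[hd](/>|>\s*</t[hd]>))+\s*</tr>', '', text)
    return HTML(text)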

# Copy the whole document into the buffer, then replace it with the filtered markup
transtream = html2 | Transformer().select('.')\
          .copy(buffer).replace(rowfilter)
print(transtream)
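
To see what the regular expression matches without Genshi in the way, here is a minimal sketch that applies the same pattern to a plain string. The helper name `strip_empty_rows` and the sample markup are made up for illustration. Note that the pattern only matches bare `<tr>`, `<th>` and `<td>` tags, so rows or cells that carry attributes are left untouched.

```python
import re

# Same pattern as in the recipe: a <tr> whose cells are all empty,
# written either as <th/> / <td/> or as <th></th> / <td></td>.
EMPTY_ROW = re.compile(r'(?s)<tr>(\s*<t[hd](/>|>\s*</t[hd]>))+\s*</tr>')

def strip_empty_rows(markup):
    """Hypothetical helper: return markup with all-empty table rows removed."""
    return EMPTY_ROW.sub('', markup)

sample = (
    '<table>'
    '<tr><th/><td/><td/></tr>'               # empty, self-closing cells
    '<tr><th></th><td> </td></tr>'           # empty, whitespace-only cells
    '<tr><th>not empty</th><td/><td/></tr>'  # has content, must survive
    '<tr class="x"><td/></tr>'               # has an attribute, not matched
    '</table>'
)

result = strip_empty_rows(sample)
print(result)
```

Only the first two rows are removed. If rows with attributes should be filtered as well, the pattern would have to be extended accordingly.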