Parsing HTML tables with regular expressions

by timvasil 11/12/2007 4:04:00 PM

Grabbing the content of rows in an HTML table seemed like a good job for regular expressions.

My first stab was a regex that essentially looked like <tr>(.*)</tr>. Since * is greedy, this didn't work when there were multiple rows in the table: the single match ran from the first <tr> to the last </tr>, so the capture contained embedded </tr><tr> tags.
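To see the problem concretely, here is a minimal sketch of that first attempt. The class and method names (GreedyDemo, NaiveRows) are made up for illustration, and RegexOptions.Singleline is an assumption carried over from the final pattern later in the post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Hypothetical helper: returns capture group 1 for every match of the naive pattern.
    public static string[] NaiveRows(string html)
    {
        // Assumption: Singleline, so '.' also matches newlines, as in the final pattern.
        var naive = new Regex("<tr>(.*)</tr>", RegexOptions.Singleline);
        MatchCollection matches = naive.Matches(html);
        var rows = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            rows[i] = matches[i].Groups[1].Value;
        }
        return rows;
    }

    public static void Main()
    {
        // The greedy .* runs to the last </tr>, so there is only one match.
        foreach (string row in NaiveRows("<tr>row1</tr><tr>row2</tr>"))
        {
            Console.WriteLine(row); // prints "row1</tr><tr>row2"
        }
    }
}
```

With two rows in the input, the method returns a single string that still contains the inner </tr><tr> pair, which is exactly the failure described above.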

I wanted a way to say .*, but for anything except a </tr> tag--basically something like [^(</tr>)]. Of course, a character class such as [^...] excludes a set of single characters, not a whole string.  How could I say "not </tr>"?

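Enter:  the zero-width negative lookahead assertion, (?!subexpression).  Placed in front of the ., it lets each character match only if </tr> does not start at that position, so the repetition stops at the first closing tag instead of running on to the last one.  Combining this with a named capture of "row", it became pretty easy to process data one HTML row at a time: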

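// Requires: using System.Text.RegularExpressions;
// (?!</tr>). matches one character only if "</tr>" does not start at that position.
Regex ex = new Regex("<tr>(?<row>((?!</tr>).)*)</tr>", RegexOptions.Compiled | RegexOptions.Singleline);
int num = ex.Matches("<tr>row1</tr><tr>row2</tr>").Count;  // 2 matches, one per row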
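The snippet above only counts the matches. As a rough sketch of how the "row" capture might be read back out, the following wraps the same pattern in a small helper. RowExtractor and ExtractRows are hypothetical names, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RowExtractor
{
    // Same pattern as in the post: each '.' is consumed only if "</tr>" does not start there.
    private static readonly Regex RowPattern = new Regex(
        "<tr>(?<row>((?!</tr>).)*)</tr>",
        RegexOptions.Compiled | RegexOptions.Singleline);

    // Hypothetical helper: returns the text between each <tr> and its closing </tr>.
    public static List<string> ExtractRows(string html)
    {
        var rows = new List<string>();
        foreach (Match m in RowPattern.Matches(html))
        {
            // Look the capture up by the name given in (?<row>...).
            rows.Add(m.Groups["row"].Value);
        }
        return rows;
    }

    public static void Main()
    {
        foreach (string row in ExtractRows("<tr>row1</tr><tr>row2</tr>"))
        {
            Console.WriteLine(row); // prints "row1", then "row2"
        }
    }
}
```

Note that the pattern matches only a bare <tr> with no attributes, as in the original. For simple input like this, a lazy quantifier, <tr>(?<row>.*?)</tr>, should capture the same rows. The lookahead form states the "anything but </tr>" intent more directly.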

Tags:

.NET Framework | Regex

 

About the author

Tim Vasil
I'm a software engineer living in Cambridge, MA.
