Parsing HTML tables with regular expressions

by timvasil 11/12/2007 4:04:00 PM

Grabbing the content of rows in an HTML table seemed like a good job for regular expressions.

My first stab was a regex that essentially looked like <tr>(.*)</tr>.  Since * is greedy, this didn't work when the table had multiple rows: the capture ran from the first <tr> all the way to the last </tr>, so it contained the embedded </tr><tr> tags.
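To see the problem concretely, here is a minimal sketch of that first attempt. The class and method names (GreedyDemo, CaptureGreedy) are made up for illustration, and the use of RegexOptions.Singleline here is an assumption carried over from the final version of the regex.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Hypothetical helper: returns whatever the greedy pattern captures
    // in group 1, or null if the pattern does not match at all.
    public static string CaptureGreedy(string html)
    {
        // .* grabs as much as it can, then backtracks only far enough
        // to find the LAST </tr> in the input.
        Match m = Regex.Match(html, "<tr>(.*)</tr>", RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Calling GreedyDemo.CaptureGreedy("<tr>row1</tr><tr>row2</tr>") yields a single capture spanning both rows, row1</tr><tr>row2, rather than row1 on its own.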

I wanted a way to say .*, but for anything except a </tr> tag, basically a way to say [^(</tr>)].  Of course, [^] negates a set of single characters, not a whole string.  How could I say "not </tr>"?

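Enter:  the zero-width negative lookahead assertion, (?!subexpression).  Placed in front of the ., it lets each character match only if a </tr> does not begin at that position, so the repetition stops at the first closing tag instead of running on to the last one.  Combining this with a named capture of "row", it became pretty easy to process data one HTML row at a time: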

// Requires: using System.Text.RegularExpressions;
Regex ex = new Regex("<tr>(?<row>((?!</tr>).)*)</tr>", RegexOptions.Compiled | RegexOptions.Singleline);
int num = ex.Matches("<tr>row1</tr><tr>row2</tr>").Count;  // 2: one match per row
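The snippet above only counts the matches. As a rough sketch of the "one row at a time" processing, the named group can be read from each match like this. RowExtractor and GetRows are hypothetical names chosen for the example; the pattern is the same one as above, so it assumes plain, lowercase <tr> tags with no attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RowExtractor
{
    // Same pattern as in the post: each character is consumed only if
    // a </tr> does not start at that position.
    private static readonly Regex RowRegex = new Regex(
        "<tr>(?<row>((?!</tr>).)*)</tr>",
        RegexOptions.Compiled | RegexOptions.Singleline);

    // Hypothetical helper: returns the inner content of every row, in order.
    public static List<string> GetRows(string html)
    {
        var rows = new List<string>();
        foreach (Match m in RowRegex.Matches(html))
        {
            // Read the capture by name rather than by index.
            rows.Add(m.Groups["row"].Value);
        }
        return rows;
    }
}
```

With the input "<tr>row1</tr><tr>row2</tr>", GetRows returns two entries, row1 and row2. A row written as <tr class="x"> or <TR> would be skipped by this pattern as it stands.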

Tags:

.NET Framework | Regex
