Screen scraping authenticated HTTPS ASP.NET web pages with view state and ScriptManager

by timvasil 12/20/2010 1:45:00 AM

To screen scrape web pages generated by ASP.NET and delivered via HTTPS, you can still use HttpWebRequest, but there are several "gotchas" to keep in mind:

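  1. For basic authentication to work, I had to add the "Authorization" HTTP request header manually. Setting the HttpWebRequest.Credentials property alone did not do it for me: by default HttpWebRequest sends credentials only after the server answers with a 401 challenge, so the first request goes out without them.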
  2. For authenticated sessions to remain alive, you need to associate a common CookieContainer object with each request.
  3. To handle view state parameters, you can build a regex to grab all hidden field values and then submit them in URL-encoded form along with the next request.
  4. To interpret responses from the ScriptManager (i.e. when used with UpdatePanels), you have to parse the pipe-delimited response, which contains both HTML and updated form field values (including view state).
  5. For ASP.NET to honor the AJAX request, you must specify a User-Agent header it thinks will support AJAX functionality (e.g. Firefox), and an X-MicrosoftAjax header it thinks was set on the client side by JavaScript.
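
To make gotcha 4 a bit more concrete, here is a minimal sketch of the pipe-delimited layout as I understand it: each segment looks like length|type|id|content|, and the length prefix is what lets the content itself contain pipes. The class name DeltaFormatDemo, the Segment helper, and the field values are made up for illustration, and the parser assumes a well-formed response.

```csharp
using System;
using System.Collections.Generic;

public static class DeltaFormatDemo
{
    // Builds one segment in the length|type|id|content| layout (illustration only).
    public static string Segment(string type, string id, string content)
    {
        return content.Length + "|" + type + "|" + id + "|" + content + "|";
    }

    // Walks the response segment by segment and returns { type, id, content } triples.
    // Assumes well-formed input; a real response may need error handling.
    public static List<string[]> Parse(string delta)
    {
        var segments = new List<string[]>();
        int pos = 0;
        while (pos < delta.Length)
        {
            int p1 = delta.IndexOf('|', pos);
            int p2 = delta.IndexOf('|', p1 + 1);
            int p3 = delta.IndexOf('|', p2 + 1);
            int length = int.Parse(delta.Substring(pos, p1 - pos));
            string type = delta.Substring(p1 + 1, p2 - p1 - 1);
            string id = delta.Substring(p2 + 1, p3 - p2 - 1);
            string content = delta.Substring(p3 + 1, length);
            segments.Add(new[] { type, id, content });
            pos = p3 + 1 + length + 1; // skip the content and its trailing pipe
        }
        return segments;
    }

    public static void Main()
    {
        // Made-up sample values; note the pipe inside the first segment's content.
        string delta =
            Segment("updatePanel", "ctl00_ContentPlaceHolder2_UpdatePanel1", "<span>a|b</span>") +
            Segment("hiddenField", "__VIEWSTATE", "/wEPDwUKMTIz") +
            Segment("hiddenField", "__EVENTVALIDATION", "/wEWAgKx");

        foreach (string[] s in Parse(delta))
        {
            Console.WriteLine("{0} {1} ({2} chars)", s[0], s[1], s[2].Length);
        }
    }
}
```

Only the hiddenField segments need to be copied back into the form for the next POST; the updatePanel segment carries the HTML you would hand to your own parser.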

Putting it all together, here's how I got it to work:

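// HttpUtility lives in System.Web, so the project needs a reference to System.Web.dll.
using System;
using System.Collections.Specialized;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;

// Place the members below inside your program's class.

// Shared across all requests so that session cookies persist.
private static readonly CookieContainer s_cc = new CookieContainer();

private static void Main()
{
    HttpWebRequest request = CreateRequest("https://url-here");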
    request.Headers.Add("Authorization", "Basic " + Convert.ToBase64String(ASCIIEncoding.ASCII.GetBytes("username:password")));
    HttpWebResponse resp = (HttpWebResponse)request.GetResponse();
    string respHtml = new StreamReader(resp.GetResponseStream()).ReadToEnd();
    ParseHtml(respHtml);  // TODO:  implement your parser function here
    NameValueCollection form = ExtractHiddenFields(respHtml);

    // For paginated responses, get next page (TODO:  replace my "next page available" logic with your own)
    while (respHtml.IndexOf("<a id=\"ctl00_ContentPlaceHolder2_lbNext\" disabled=\"disabled\">Next</a></li>") < 0)
    {
        request = CreateRequest("https://url-here");
        request.Method = "POST";
        request.Headers.Add("X-MicrosoftAjax", "Delta=true");
        request.ContentType = "application/x-www-form-urlencoded";
        using (StreamWriter w = new StreamWriter(request.GetRequestStream()))
        {
            form["__EVENTTARGET"] = "ctl00$ContentPlaceHolder2$lbNext";  // TODO:  replace my event target with yours
            form["ctl00$ScriptManager1"] = "ctl00$ContentPlaceHolder2$UpdatePanel1|ctl00$ContentPlaceHolder2$lbNext"; // TODO:  same here
            w.Write(GetPayloadString(form));
            w.Flush();
        }
        resp = (HttpWebResponse)request.GetResponse();
        respHtml = new StreamReader(resp.GetResponseStream()).ReadToEnd();
        ParseHtml(respHtml);
        ExtractHiddenFieldsFromAjax(respHtml, form);
    }
}

public static NameValueCollection ExtractHiddenFields(string html)
{
    NameValueCollection form = new NameValueCollection();
    Regex hiddenPattern = new Regex("<input type=\"hidden\" name=\"([^\"]*)\" id=\"[^\"]*\" value=\"([^\"]*)\" />");
    MatchCollection matches = hiddenPattern.Matches(html);
    foreach (Match match in matches)
    {
        form.Add(match.Groups[1].Value, match.Groups[2].Value);
    }
    return form;
}

public static NameValueCollection ExtractHiddenFieldsFromAjax(string html, NameValueCollection form)
{
    form = form ?? new NameValueCollection();
    while (html.Length > 0)
    {
        string[] parts = html.Split(new[] { '|' }, 4);
        int valueLength = int.Parse(parts[0]);
        if (parts[1] == "hiddenField")
        {
            form[parts[2]] = parts[3].Substring(0, valueLength);
        }
        html = html.Substring(parts[0].Length + parts[1].Length + parts[2].Length + 4 + valueLength);
    }
    return form;
}

public static HttpWebRequest CreateRequest(string uri)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
    request.CookieContainer = s_cc; // reuse cookie container across requests
    request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.13) Gecko/20100914 Firefox/3.5.13 (.NET CLR 3.5.30729)";
    return request;
}

public static string GetPayloadString(NameValueCollection form)
{
    if (form == null)
    {
        return String.Empty;
    }

    StringBuilder buff = new StringBuilder();
    foreach (string key in form.Keys)
    {
        string[] values = form.GetValues(key);
        if (values != null)
        {
            foreach (string val in values)
            {
                if (buff.Length > 0)
                {
                    buff.Append("&");
                }
                buff.Append(HttpUtility.UrlEncode(key));
                buff.Append("=");
                buff.Append(HttpUtility.UrlEncode(val));
            }
        }
    }
    return buff.ToString();
}

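If you would rather not reference System.Web just for HttpUtility, one option is System.Net.WebUtility.UrlEncode, which I believe is available from .NET 4.5 onward. The sketch below is a self-contained check of the hidden-field extraction and payload encoding against a made-up HTML fragment. The class name HiddenFieldDemo and the sample field values are assumptions for illustration, and the regex is the same order-sensitive pattern used in ExtractHiddenFields.

```csharp
using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class HiddenFieldDemo
{
    // Same pattern as ExtractHiddenFields: expects type, name, id, value in that order.
    private static readonly Regex HiddenPattern = new Regex(
        "<input type=\"hidden\" name=\"([^\"]*)\" id=\"[^\"]*\" value=\"([^\"]*)\" />");

    public static NameValueCollection Extract(string html)
    {
        var form = new NameValueCollection();
        foreach (Match m in HiddenPattern.Matches(html))
        {
            form.Add(m.Groups[1].Value, m.Groups[2].Value);
        }
        return form;
    }

    // Like GetPayloadString, but uses WebUtility instead of HttpUtility.
    public static string Encode(NameValueCollection form)
    {
        var buff = new StringBuilder();
        foreach (string key in form.AllKeys)
        {
            foreach (string val in form.GetValues(key) ?? new string[0])
            {
                if (buff.Length > 0) buff.Append('&');
                buff.Append(WebUtility.UrlEncode(key));
                buff.Append('=');
                buff.Append(WebUtility.UrlEncode(val));
            }
        }
        return buff.ToString();
    }

    public static void Main()
    {
        // Made-up page fragment with two hidden fields.
        string html =
            "<input type=\"hidden\" name=\"__VIEWSTATE\" id=\"__VIEWSTATE\" value=\"/wEPDwUKMTIz+A==\" />\n" +
            "<input type=\"hidden\" name=\"__EVENTVALIDATION\" id=\"__EVENTVALIDATION\" value=\"/wEWAgKx\" />";

        NameValueCollection form = Extract(html);
        form["__EVENTTARGET"] = "ctl00$ContentPlaceHolder2$lbNext";
        Console.WriteLine(Encode(form));
    }
}
```

Characters such as +, / and = in the view state are exactly the ones that get mangled if the value is posted without URL encoding, so it is worth eyeballing the printed payload before pointing the scraper at a real site.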
Tags:

ASP.NET | .NET Framework

Comments (4)

1/31/2011 10:49:46 PM

Tony Visconti

I am trying to do this with php and curl on the following website.  Do all the steps you mentioned above apply? https://www.fellowshipone.com/

2/2/2011 12:07:15 AM

timvasil

Tony, yes, that site looks to be an ASP.NET-driven site, complete with __VIEWSTATE and other hidden form fields.  I think the approach I outlined above should work for you.

Good luck!
Tim

9/7/2011 2:52:06 PM

John Rincone

The code seems to be incomplete. For example, HttpUtility and s_cc are not recognized. It would be great if you could provide those as well, or indicate how to implement them.

3/6/2012 10:47:52 AM

Tim Vasil

John, you could define s_cc like this:

private static readonly CookieContainer s_cc = new CookieContainer();
