c# – Trouble Scraping .HTM File – Education Career Blog

I have just begun scraping basic text off web pages, and am currently using the HtmlAgilityPack C# library. I had some success with boxscores off rivals.yahoo.com (sports is my thing, so why not scrape something interesting?) but I am stuck on the NHL’s game summary pages. I think this is kind of an interesting problem, so I thought I would post it here.

The page I am testing is:
http://www.nhl.com/scores/htmlreports/20102011/GS020079.HTM

At first glance, it looks like basic text with no AJAX or anything else that would trip up a basic scraper. Then I realized I couldn’t right-click because of some JavaScript, so I worked around that. Using XPather in Firefox, I got this XPath for the home team:

/html/body/table[@id='MainTable']/tbody/tr[1]/td/table[@id='StdHeader']/tbody/tr/td/table/tbody/tr/td[3]/table[@id='Home']/tbody/tr[3]/td

When I try to grab that node or its inner text, HtmlAgilityPack can’t find it. Does anyone see anything strange in the page’s source code that might be stopping me?

I am new to this and still learning how sites might stop me from scraping, so any tips or tricks are gladly appreciated!

p.s. I observe all site rules regarding bots, etc, but I noticed this strange behavior and saw it as a challenge.

,

OK, so it appears that my XPaths have tbody steps in them. When I remove these tbody steps manually from the XPath, HtmlAgilityPack can handle it fine.

I’d still like to know why I am getting invalid xpaths, but for now I have answered my question.
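On the "why": browsers add an implied tbody element to every table in their DOM even when the markup omits it, and tools such as XPather read the XPath from that DOM, not from the raw source. HtmlAgilityPack parses the raw source, so those steps match nothing. Removing the steps by hand can also be sketched in code. The class and method names below (XPathCleaner, StripTbody) are made up for illustration and are not part of HtmlAgilityPack; this is a rough sketch, assuming the page's own markup really contains no tbody.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class XPathCleaner
{
    // Matches a "/tbody" location step, with an optional positional
    // predicate such as "[1]", only when it is a whole step (followed
    // by "/" or the end of the expression).
    private static readonly Regex TbodyStep =
        new Regex(@"/tbody(\[\d+\])?(?=/|$)", RegexOptions.IgnoreCase);

    // Returns the XPath with every tbody step removed.
    public static string StripTbody(string xpath)
    {
        return TbodyStep.Replace(xpath, string.Empty);
    }
}
```

Calling XPathCleaner.StripTbody on a browser-generated path leaves the other steps and predicates alone. If a page does have tbody in its source, stripping the steps would break the path, so it is worth checking the saved HTML first.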

,

Unless my XPath knowledge is heaps flawed (probably), I think the problem is with the /tbody node in your XPath expression.

When I do

using System.IO;

// Load a local copy of the game summary page.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
using (StreamReader sr = new StreamReader(@"C:\gs.htm"))
{
    doc.Load(sr);
}
// No tbody step: the saved HTML does not contain one.
string xpath = "//table[@id='Home']/tr[3]/td";
string test = doc.DocumentNode.SelectSingleNode(xpath).InnerText;

That works fine. It returns
“COLUMBUS BLUE JACKETSGame 5 Home Game 3”
which I hope is the string you wanted.

Examining the HTML, I couldn’t find a tbody element.
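To see the effect in isolation, here is a small sketch using a made-up table (not the real NHL markup) that, like the saved page, has no tbody. The TbodyDemo class and its methods are hypothetical names for this example. It relies on SelectSingleNode returning null when nothing matches, so the helper checks for null instead of calling InnerText directly.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical demo class; the markup below is invented for illustration.
public static class TbodyDemo
{
    // Returns the trimmed inner text of the first node matching xpath,
    // or null when the XPath matches nothing.
    public static string TextOrNull(HtmlDocument doc, string xpath)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText.Trim();
    }

    // Builds a document from a three-row table with no tbody element.
    public static HtmlDocument LoadSample()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table id='Home'>" +
            "<tr><td>logo</td></tr>" +
            "<tr><td>score</td></tr>" +
            "<tr><td>COLUMBUS BLUE JACKETS</td></tr>" +
            "</table>");
        return doc;
    }
}
```

With this sample, the path without tbody finds the third row, while the same path with a tbody step comes back as null, which is the kind of miss described in the question.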
