web crawler – Algorithm: Determining type of homepage? – Education Career Blog

I’ve been thinking about this for a while now, so I thought I would ask for suggestions:

I have a crawler that enters the root of a site (it could be anything from www.StackOverFlow.com or www.SomeDudesPersonalSite.se to www.Facebook.com). I then need to determine what “kind of homepage” I’m visiting. Different types could, for instance, be:

  • Forum
  • Blog
  • Link catalog
  • Social media site
  • News site
  • “One man site”

I’ve been brainstorming for a while, and the best solution seems to be a heuristic with a point system. By this I mean that different trends give points to the different types, and the program then makes a guess based on the totals.

But this is where I get stuck. How do you detect trends?

  • Catalogs could be easy: if the number of outgoing links is very high relative to the number of pages indexed, catalogs should get several points.
  • News sites/blogs could be easy: if a large share of the indexed pages carry a date/time stamp, those types should get several points.
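
As a rough illustration, the two trends above could feed a point system along these lines. This is only a sketch: the `CrawlStats` fields, the thresholds (20 links per page, 50% dated pages) and the point values are all made-up assumptions that would need tuning against real sites, and the first rule assumes "outgoing links per indexed page" is the intended ratio.

```python
from dataclasses import dataclass

# The site types listed in the question.
SITE_TYPES = ["forum", "blog", "link catalog", "social media", "news", "one man site"]


@dataclass
class CrawlStats:
    """Hypothetical per-site counters the crawler is assumed to collect."""
    pages_indexed: int
    outgoing_links: int
    pages_with_datetime: int


def score_site(stats, link_ratio_threshold=20.0, dated_share_threshold=0.5):
    """Give points to site types based on the two trends from the question.

    Thresholds and point values are illustrative guesses, not tuned numbers.
    """
    scores = {site_type: 0 for site_type in SITE_TYPES}
    if stats.pages_indexed == 0:
        return scores  # nothing crawled, so no evidence either way

    # Trend 1: many outgoing links per indexed page suggests a link catalog.
    links_per_page = stats.outgoing_links / stats.pages_indexed
    if links_per_page >= link_ratio_threshold:
        scores["link catalog"] += 3

    # Trend 2: a large share of dated pages suggests a news site or a blog.
    dated_share = stats.pages_with_datetime / stats.pages_indexed
    if dated_share >= dated_share_threshold:
        scores["news"] += 2
        scores["blog"] += 2

    return scores


def guess_types(scores):
    """Return every type tied for the highest score, or [] if nothing scored."""
    best = max(scores.values())
    if best == 0:
        return []
    return [site_type for site_type, score in scores.items() if score == best]
```

With only two trends the guess often ends in a tie (news and blog score identically here), which is why more trends are needed to separate the types.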

BUT I can’t really find too many trends.

SO: My question is:
Any ideas on how to do this?

Thanks so much.

,

I believe you are attempting document classification, which is a well-researched topic.

http://en.wikipedia.org/wiki/Document_classification

You will see a considerable list of many different methods. But to suggest any one of those (or neural networks or the like) prior to determining the “trends” as you call them is to suggest it prematurely. I would recommend looking into “web document classification” or the like. It is evidently a considerable subset of document classification, and if you have access to academic journals there are plenty of incomprehensible articles for your enjoyment.

I also found your idea set as a homework assignment; if you are particularly audacious, you could contact the professor.
http://uhaweb.hartford.edu/compsci/ccli/wdc.htm

Lastly, I believe that this is an accessible (if strangely formatted) website that has a general and perhaps outdated discussion:
http://www.webology.ir/2008/v5n1/a52.html

I’m afraid I don’t have much personal knowledge of the topic, so the most I could do was give you the keyword “document classification” and do some quick googling. However, if I wanted to play around with this concept, I think simply looking at the rate of certain keywords is a decent starting “trend.” (“Sale,” “purchase,” and “customers” are trends for shopping sites; “my,” “opinion,” and “comment” for blogs; and so on.)
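
To make the keyword-rate idea concrete, here is a minimal sketch. The keyword sets are just the examples from this answer, and the function name `keyword_rates` and the crude tokenizer are assumptions for illustration; real pages would need the HTML stripped first and much larger word lists.

```python
import re
from collections import Counter

# Example keyword sets taken from the answer above; far from complete.
KEYWORDS = {
    "shopping": {"sale", "purchase", "customers"},
    "blog": {"my", "opinion", "comment"},
}


def keyword_rates(text):
    """Return, per site type, the fraction of words that are type keywords."""
    # Crude tokenizer: lowercase runs of ASCII letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {label: 0.0 for label in KEYWORDS}

    counts = Counter(words)
    return {
        label: sum(counts[word] for word in keywords) / len(words)
        for label, keywords in KEYWORDS.items()
    }


rates = keyword_rates("In my opinion this is great. Leave a comment!")
print(rates)
```

Each rate could then be turned into points for the matching type, for example by awarding points when a rate passes some threshold.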

,

You could train a neural network to recognise them. Give it the number and types of links as input, and maybe the types of HTML tags as well.
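
A sketch of what those inputs might look like, using only Python's standard library. It covers the feature-extraction step only, not the network itself; the choice of tags to count (`a`, `form`, `img`, `time`) and the internal/external split of links are assumptions, and a real feature set would likely be larger.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class FeatureParser(HTMLParser):
    """Counts start tags and splits <a href> links into internal/external."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.base_host = urlparse(base_url).netloc.lower()
        self.tag_counts = Counter()
        self.internal_links = 0
        self.external_links = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL before comparing hosts.
        target = urlparse(urljoin(self.base_url, href))
        if target.scheme not in ("http", "https"):
            return  # skip mailto:, javascript:, and similar
        if target.netloc.lower() == self.base_host:
            self.internal_links += 1
        else:
            self.external_links += 1


def extract_features(html, base_url, tags=("a", "form", "img", "time")):
    """Return a fixed-length numeric vector a classifier could consume.

    Layout: [internal links, external links, count of each tag in `tags`].
    """
    parser = FeatureParser(base_url)
    parser.feed(html)
    parser.close()
    return [parser.internal_links, parser.external_links] + [
        parser.tag_counts[tag] for tag in tags
    ]
```

Vectors like this, paired with hand-labelled site types, would be the training data; any classifier could be tried on them, not just a neural network.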

I think otherwise you’re just going to be second-guessing what makes a site what it is.
