
Friday, December 08, 2006

Random Thought: How to create Technorati

I woke up this morning with a question suddenly popping into my mind: "How does Technorati work?" Yup, I'm a late dreamer. Anyway, I dug around a little and found some clues from David Sifry, the creator of Technorati himself. He describes the process flow as follows:

  1. We spider blogs, and match up their links to your blog - to anywhere on your blog
  2. In the inbound blog list, we use the outbound links from the blog homepage, not from the archives
  3. We do process RSS feeds and other metadata, but that doesn't affect your inbound blog stats.
  4. Nightly, we go through the database and re-calculate the number of inbound blogs and links, which helps us double-check our work and also allows us to create the interesting newcomers list, the interesting recent blogs list, etc.
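
To make steps 1, 2 and 4 a bit more concrete, here is a rough Python sketch of how inbound blogs and inbound links could be counted from a set of blog homepages. This is only my guess at the idea, not Technorati's actual code: the names `LinkExtractor`, `outbound_links` and `count_inbound` are made up for illustration, and I'm assuming that one hostname equals one blog, which is a simplification.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outbound_links(base_url, html):
    """Return absolute URLs for all links found in one homepage."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def count_inbound(homepages):
    """homepages: dict mapping a blog homepage URL to its HTML.

    Returns {target_host: (inbound_blogs, inbound_links)}.
    Assumption: one hostname == one blog.
    """
    link_counts = defaultdict(int)
    source_blogs = defaultdict(set)
    for source_url, html in homepages.items():
        source_host = urlparse(source_url).netloc
        for link in outbound_links(source_url, html):
            target_host = urlparse(link).netloc
            # Skip non-web links (mailto: etc.) and links to the blog itself.
            if not target_host or target_host == source_host:
                continue
            link_counts[target_host] += 1
            source_blogs[target_host].add(source_host)
    return {host: (len(source_blogs[host]), link_counts[host])
            for host in link_counts}
```

Running `count_inbound` over freshly fetched homepages once a night would roughly correspond to the nightly recalculation in step 4. Only homepages are fed in, following step 2.
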
So my next question would be, "How do you create a blog spider?" Well, I figure a blog spider must be something like a web crawler: it traverses the Internet gathering, filtering, and potentially aggregating information for a user. Well said, but now it sounds more like "How to create a Googlebot" kind of stuff.
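
Just to get a feel for it, here is a minimal sketch of such a crawler in Python, using only the standard library. It is a plain breadth-first traversal, not how Technorati or Google actually do it. The `crawl` function and its `fetch` parameter are names I made up: `fetch` is any function that takes a URL and returns the page's HTML as a string, or `None` if the page can't be retrieved.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(seed, fetch, max_pages=50):
    """Breadth-first crawl starting at `seed`.

    fetch(url) must return HTML text, or None on failure (hypothetical
    callback, so the crawler can be tried without touching the network).
    Returns the URLs that were fetched, in visiting order.
    """
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

For a quick try without the network, `fetch` can simply be the `get` method of a dict mapping URLs to HTML strings. A real spider would fetch over HTTP (for example with `urllib.request.urlopen`), and should also honour robots.txt and pause between requests, which this sketch leaves out.
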

I saw that some spiders were written in Ruby, Python, and Perl. Take a look at this article, "A web crawler in Perl"; maybe it's a good lead.
