I hadn't checked in with Kuro5hin in a while because it sort of seemed like the place was disintegrating, with the guy responsible for the site and Scoop, the CMS that runs it, counted as missing. People were understandably upset by this since Kuro5hin implemented a premium membership thing after the aforementioned admin just couldn't afford to put the necessary effort into maintaining it any more. Anyway, slight digression there, but there is an interesting article about building your own search engine.

In this case, it's Mozdex. They're using Nutch as the chassis of the thing and a pretty impressive and growing array of machines to run it on. It isn't anywhere near as powerful or resource rich as Google and won't (nor does it really seem intended to) topple Google from the top of the search pile, but it does offer an impressive contrast to the uber-secret search algorithms that Google uses: this one is going to be wide open and ostensibly influenced by users in a direct way. A couple of comments attached to the article are pretty pessimistic and seem to think that search spammers are going to overrun the thing before it even has a chance. This, of course, assumes that the Formula X method of obscurity that Google uses is completely effective, which it obviously is not; it needs continual revision to keep some relevance in what it returns on a query. I'm more apt to think of it as open research and development in preparation for Google's post-IPO beginning-to-suck stage, if that indeed happens. It's good to see the process made public and accessible to everyone even if that initially makes it more prone to exploitation by scum.

Either way, I played around with it a little bit for the sake of comparison to the big boys, and it's surprisingly usable even for a beta whose spider was seeded with links gathered from the Dmoz directory. The results weren't ordered the way that I'd expect them to be, but they did seem to be largely relevant to the search terms I used. The quantity returned is still a factor this early in development, but then again they're trying to spider at a pretty intense pace, so that will no doubt improve over time. I put this site in my .plan file so I'll be sure to check in on it more frequently than other sites that I make the mistake of saying that I'm going to track. I should probably just set up an aggregator to watch those sites and actually keep my promises to myself. I'll put that on the to-do list as well...

