To Crawl or not to Crawl? The Alts Speak Out!

This post requires a little bit of back-tracking.  Well, a lot, actually.

In blogging terms, it began eons ago, in the early part of this month. Greg Lindahl was surprised that very few of the 100+ Alternative Search Engines seemed to be crawling his site.  Rich Skrenta picked up on this in his post of Aug. 5, “The 11 startups actually crawling the web.”  He referenced a post by Don Dodge on the Top 100 (which later pointed to my post on AltSearchEngines). The next day Richard MacManus over at Read/WriteWeb pursued the subject in a post entitled, “Why aren’t Alt Search Engines crawling websites?”  That same day Gary Stewart joined the discussion with his post, “Crawling the web.”  Suddenly, everyone was talking about crawling alts…

That’s a lot of blogging, but that’s what keeps the blogosphere turning on its axis, I guess.

So why am I picking up on it an eternity later - 20 days?  Because I was straightening up my office, that’s why.  I came across the individual answers from the Alts themselves, and before I, um, “file” them away, I think it’s important that they get in the last word (or is it?):

Besides, the question of crawling the web keeps resurfacing, as it did in Nitin’s post about “What is a search engine?”.  Do you have to crawl the entire web to be a “true” search engine?  







It’s time to shed a little more light upon the subject!



“Although some people think of us mainly as a meta-search search engine, we do crawl the web, but our focus is the databases and not the web pages.” -Rafael Costa

“From what we can tell in Europe, the only classifieds/real estate search players with proprietary crawlers are us and Extate….Of course, the information that we have is based on publicly available data, so it’s possible that some of our other competitors have proprietary crawlers, but I doubt it.” -Victor Aloi

“At PodNova we currently only crawl the feeds which we have initially spidered, also people can add feeds manually via /add.  We crawl more often on feeds with a lot of subscribers, channels with popular episodes, and some other variables.” -Robin Jans

“We do not crawl on a regular basis, but only in some particular zone and particular data….Of course the fastest and cheapest way is to use the index of a Google or Yahoo!.  I think most of the startup players use a licensed index.” -Sergey Moskalew

“Quintura.com is currently powered by…(Yahoo! for web and image search, blinkx for video search, and Amazon for product search), Quintura.ru…from Yandex.ru, and Quintura Kids uses its own index. Quintura recently started crawling web sites to test its own web index.  We are using an open source crawler.  The major issue here is not which crawler to use - either your own or an open source one - it is mostly about how to index the crawled pages and update the index (say, via RSS feeds…)” -Yakov Sadchikov

“FAROO uses a special kind of distributed crawler…when a user opens a page with his browser, it will be automatically inserted into the distributed index of the peer-to-peer network.” -Wolf Garbe

  “They (the larger search engines) do the heavy lifting of going out there and gathering and indexing the data, and all the other (alternative) search engines out there re-work, re-purpose, and re-present the data in different ways.” -Mike Richwalsky

  “…they (Alts) are obviously not going to have the resources to crawl or even process the number of documents that Google, etc. can.” -Marc

  “…Lexxe will start crawling 1 billion web pages from the middle of August.  But we will only crawl the ‘most important’ web pages first.” -Hong Liang Qiao

  “The fact is that alternative search engines do not crawl the same content as generic ones.  For instance, Blogdimension…crawls blogs, RSS feeds, images, multimedia content, directories, etc.  Only the big ones can afford to crawl ‘everything.’” -Henrick Kac

“…(many blog search engines) focus crawling on the RSS feed…using the feed and some smarts we can keep up with a blog’s content without overextending ourselves and without using up the site’s bandwidth.” -Greg Gershman

“…building a high-quality index requires a lot of resources and time.” - Andes Lindman

  “Omgili doesn’t crawl regular sites, it crawls discussion based sites, where ‘many to many’ discussions take place.” -Ran Geva

“For those of us in the human-powered space, it might be because we actually visit sites ourselves instead of sending a spider out to do the dirty work.” -Adam Jusko
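
For the curious, here is a rough sketch of the kind of prioritization Robin Jans describes above, where feeds with more subscribers get crawled more often. To be clear, this is not PodNova's code: the interval formula, the function names, and the example.com feed URLs are all my own assumptions, and the other variables he mentions (popular episodes and so on) are left out.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ScheduledFeed:
    # Only next_crawl takes part in ordering, so the heap always
    # yields the feed that is due soonest.
    next_crawl: float                      # hours since the simulation started
    url: str = field(compare=False)
    subscribers: int = field(compare=False)


def crawl_interval_hours(subscribers, base=24.0, minimum=1.0):
    """Hypothetical rule: more subscribers means a shorter interval.

    A feed with no subscribers is crawled every `base` hours; the
    interval shrinks as subscribers grow, but never below `minimum`.
    """
    return max(minimum, base / (1 + subscribers / 100))


def simulate(feeds, horizon_hours):
    """Return the (time, url) crawl order over the given horizon.

    `feeds` maps a feed URL to its subscriber count. No network
    requests are made; this only works out the schedule.
    """
    queue = [ScheduledFeed(0.0, url, subs) for url, subs in feeds.items()]
    heapq.heapify(queue)
    crawled = []
    while queue and queue[0].next_crawl <= horizon_hours:
        feed = heapq.heappop(queue)
        crawled.append((feed.next_crawl, feed.url))
        # Reschedule the feed according to its popularity.
        feed.next_crawl += crawl_interval_hours(feed.subscribers)
        heapq.heappush(queue, feed)
    return crawled


if __name__ == "__main__":
    # Made-up feeds: one popular, one with no subscribers at all.
    feeds = {
        "http://example.com/popular.xml": 2300,
        "http://example.com/quiet.xml": 0,
    }
    for when, url in simulate(feeds, horizon_hours=6):
        print(f"{when:5.1f}h  {url}")
```

The priority queue is the point here: a small engine does not have to revisit everything on the same cycle, so the crawling budget goes to the feeds that people are actually following.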

Conclusion

What other conclusion can there be?  Crawling and indexing are hugely expensive, a hurdle that I can’t wait to see Nitin “jump” when he tries to extol the virtues of algorithms next week!  ;-) Most (not all) Alternative Search Engines that we cover seem to be working on their particular Vertical, or their User Interface, even if they have to utilize a third-party index for a while.  Many will accept manually submitted information.  Of course, if they were to pool their resources…well, that’s another post entirely!


7 Responses to “To Crawl or not to Crawl? The Alts Speak Out!”

  1. University Update - Yahoo - To Crawl or not to Crawl? The Alts Speak Out! says:

    [...] To Crawl or not to Crawl? The Alts Speak Out! » This Summary is from an article posted at Alt Search Engines on Sunday, August 26, 2007 This [...]

  2. University Update - Open Source - To Crawl or not to Crawl? The Alts Speak Out! says:

    [...] To Crawl or not to Crawl? The Alts Speak Out! » This Summary is from an article posted at Alt Search Engines on Sunday, August 26, 2007 This [...]

  3. Yakov says:

    To crawl or not to crawl is more a business question than anything else. For those search engines that want to create the most long-term value for their shareholders, it is a ‘must have’ to be independent from the others, i.e. to crawl and index the Web.

  4. Tommy Chieng says:

    Thanks for visiting my blog >> http://www.crispnetworks.com

    Your blog looks interesting. Definitely a source for me to know more about alternate search engines

    Cheers.

  5. Oli says:

    If they’d all crawl, some small sites may get the majority of their traffic from spiders ;-)

    Anyway, IMO it makes more sense for a small SE to just try to directly work on Alexa’s indices/data repositories than crawling _yet again_ and collecting all the data all others are collecting as well (what a waste of bandwidth). With more and more SEs coming out, there must be a market for crawled data, particularly pre-extracted data (e.g. extracted text from pdf, OCRed text, speech-to-text on audio files, other “strange” document formats, etc. which is harder to collect than plain vanilla HTML).

  6. Tom Dibaja says:

    Surprising words from Migoa…. surely they ought to know that we - Properazzi - have a proprietary crawler. Not least because we’re the world’s largest real estate search engine and we’re both based in Barcelona! :)

    Plus, there’s at least half-a-dozen other real estate sites in Europe with crawlers…

  7. Alt Search Engines » Blog Archive » Guest Author: Crawling and Indexing the Web says:

    [...] Today we are very fortunate to have another blogger’s perspective on the post that we did on alternative search engines, asking which of the ”Alts” crawled the web and why:“To crawl or not to crawl? The Alts speak out!”  [...]

 

© 2008 altsearchengines.com