Google, Kosmix, and The Deep Web – A Love Triangle

Alon Halevy of Google Labs and Anand Rajaraman of Kosmix went after the Deep Web in their own separate ways last night, at the SDForum Search SIG in Palo Alto.

Alon and Anand are long-time collaborators in solving the Deep Web problem, and their joint presentation last night had all the easy familiarity and good-natured competition you find with friends who go way back. Years ago, Anand’s VC firm, Cambrian Ventures, funded a company that Alon founded called Transformic Inc. Transformic, which built technology to crawl HTML forms, was later acquired by Google. Alon joined Google Labs, and Anand went on to found Kosmix with his business partner, Venky Harinarayan.

The Deep Web is simply the Web behind HTML forms. If you want to buy a car, for example, you might visit Cars.com and search for a used Toyota Prius, priced at less than $15,000 and located near Palo Alto, California. Cars.com will turn your query into an HTML page to present the results to you. A search engine won’t be able to see the page, however, because it was created just for you from a series of databases. The page becomes “lost” in the Deep Web. Tim Berners-Lee also explains in this TED video how leveraging such hidden data will drive the next innovation on the web.

According to one study, the Deep Web is estimated to be 500 times larger than the surface Web. As the number of dynamic websites and applications increases, this figure will only go up. Imagine: all that data is not available to search engines!

Google’s Approach to the Deep Web

Google’s approach to the Deep Web is to find HTML forms, send input to these forms, and index the resulting HTML pages. Simple? Not quite. How do you discover these forms? Which forms do you pick? What inputs do you send to these forms? How do you parse the structured data in the result pages?

Google takes the “Less is More” approach. They drop forms used for transactions such as credit-card purchases, which typically submit their data with the HTTP “POST” method. To send inputs to a form, Google first tries well-defined lists such as zip codes, if the form has a matching field. Otherwise, they compile inputs using iterative probing to discover what to send to a form. In Alon’s experience, only a small percentage of the Deep Web qualifies for indexing. This slice, however, is hugely valuable, as it is helping to answer 1,000 queries a second! Google’s approach to the Deep Web is language independent, is fully automated to scale easily, answers body and tail searches, and fits nicely with the crawl infrastructure. For further insights, read Alon’s VLDB paper published in 2008.
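To make the idea of iterative probing concrete, here is a minimal sketch in Python. It is not Google’s implementation: the toy database, the `submit_form` stand-in, and every function name are assumptions made up for illustration. The sketch skips forms that do not use GET, then grows a keyword list by mining words from the result pages of earlier probes.

```python
import re
from collections import Counter

# Toy stand-in for the database behind a search form (an assumption for this sketch).
TOY_DATABASE = [
    "used toyota prius hybrid palo alto",
    "used honda civic sedan san jose",
    "new toyota camry sedan palo alto",
    "used ford focus hatchback san jose",
]


def is_indexable(form_method):
    """Keep only forms submitted with GET; transactional POST forms are dropped."""
    return form_method.strip().lower() == "get"


def submit_form(keyword):
    """Hypothetical form submission: return the records that contain the keyword.

    A real crawler would issue an HTTP GET request to the form's URL instead.
    """
    return [record for record in TOY_DATABASE if keyword in record.split()]


def iterative_probe(seed_keywords, max_rounds=3, per_round=2):
    """Probe the form with seed keywords, then with words found in the results."""
    tried = set()
    seen_records = set()
    frontier = list(seed_keywords)
    for _ in range(max_rounds):
        counts = Counter()
        for keyword in frontier:
            if keyword in tried:
                continue
            tried.add(keyword)
            for record in submit_form(keyword):
                seen_records.add(record)
                counts.update(re.findall(r"[a-z]+", record))
        # Next round: the most frequent result words that were not tried yet.
        frontier = [w for w, _ in counts.most_common() if w not in tried][:per_round]
        if not frontier:
            break
    return tried, seen_records


if __name__ == "__main__":
    if is_indexable("GET"):
        keywords, records = iterative_probe(["prius"])
        print(sorted(keywords))
        print(len(records), "records surfaced")
```

Starting from a single seed, each round uncovers new keywords that surface more of the hidden records. A production system would also need to cap the number of submissions per site and decide which of the resulting pages are worth indexing.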

Kosmix’s Approach to the Deep Web

After Alon shared Google’s perspective, Anand explained that Kosmix has taken a very different approach to the Deep Web: the federated way.

Unlike Google, Kosmix does not crawl HTML forms. Instead, for any given search query, Kosmix taps into these forms in real-time through API calls, evaluates the results and organizes them into a topic page. If you wanted to look up “Pumpkin Pie” on Kosmix, for example, the system would bring you fresh content from recipe sites like the Food Network, “How To” baking videos, real-time tweets about pumpkin pie from Twitter, and information about the caloric profile of pumpkin pie from diet sites like FatSecret. A query for “AdMob,” on the other hand, will call services like CrunchBase for a company profile and Fool.com for up-to-date investor information. To provide the most relevant topic page and also avoid overwhelming these different services with too many API calls, the Kosmix system is smart enough to know which type of services to call for which query. Thus, the query for “Pumpkin Pie” would never be routed to CrunchBase. This selective routing is an important enabling factor for the federated approach.

So how does Kosmix decide which Web service to route a query to? The answer lies with Kosmix’s categorization technology. Over the past three years, Kosmix has created a taxonomy of several million nodes, which we organized into a graph, using a combination of humans and algorithms. Editors discover, integrate, and tag Web services to taxonomy nodes in a semi-automated fashion. Algorithms route the user’s query through the set of taxonomy nodes, which enables the engine to decide which Web service to call.
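The routing idea can be illustrated with a small Python sketch. This is not Kosmix’s actual system: the tiny taxonomy, the service labels, and the `services_for` function are all hypothetical, and the taxonomy is simplified to a tree rather than a graph. A query is matched to a node, and the services tagged on that node and its ancestors are the only ones selected.

```python
# Hypothetical taxonomy edges: child node -> parent node.
PARENT = {
    "pumpkin pie": "desserts",
    "desserts": "food",
    "admob": "startups",
    "startups": "companies",
}

# Hypothetical service tags attached to taxonomy nodes by editors.
SERVICES = {
    "food": ["recipe_site", "diet_site"],
    "desserts": ["baking_videos"],
    "companies": ["company_profiles", "investor_news"],
}


def services_for(query):
    """Return the services tagged on the query's node and on all its ancestors."""
    node = query.strip().lower()
    if node not in PARENT and node not in SERVICES:
        return []  # Unknown topic: call no service at all.
    selected = []
    while node is not None:
        selected.extend(SERVICES.get(node, []))
        node = PARENT.get(node)  # Walk up; None once the root is passed.
    return selected


if __name__ == "__main__":
    print(services_for("Pumpkin Pie"))
    print(services_for("AdMob"))
```

In this sketch, a query for “Pumpkin Pie” only ever reaches the food-related services and a query for “AdMob” only the company-related ones, which keeps the number of API calls per query small.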

After outlining the benefits of this approach, Anand dived deeper into the need to select the right sources, and touched on the challenges of discovering and integrating data sources, layout, and ranking; details can be found in this year’s VLDB paper. Anand also explained how the federated approach is keeping pace with emerging Web trends such as real-time content, the explosion of Web APIs, and different content types like videos and maps.

Digging Even Deeper
Last night’s audience, about 50 specialists in the search space from some of the Valley’s leading companies and startups, was one of the most engaged groups I have ever seen. Questions ranged from business models to how to do a multi-way join between HTML tables. Some people were even contributing ideas. If the Deep Web is important to you, then this was the place to be.

Both Google and Kosmix have compelling yet contrasting approaches to the Deep Web. It will be interesting to see if there is a winner or simply a combination of the two.

Posted by Abhishek Gattani here
