
Guest Author Kathleen Dahlgren, PhD
The concept of Web 3.0 (the Semantic Web) includes a vision for the next evolution of on-line Search. It would improve Search results for users by understanding the meaning (semantics) of their queries and the results. For example, a user could ask about a “cheap SUV” and the computer would look for automobiles in the “SUV” category with low prices. In order for this to work, the software would need to understand the concept of “cheap” with respect to SUVs, the concept of “SUV”, and the brands of cars that are in the SUV category. The ideal software for the Semantic Web would be a full Artificial Intelligence system with natural language processing that computes the meaning of words and phrases. To get to such an AI system, which does not currently exist, we must take incremental forward-moving steps to make the computer behave more intelligently without really being intelligent.
In “bottom-up” semantics, individual pages are hand-tagged with semantic categories. In “top-down” semantics, a developer models the semantics of a vertical domain (such as farm equipment or movie cameras) and the kinds of things users would want to do in that domain, and then links Web pages into a meaningful series of information views and user actions.
As an example of “bottom-up” semantics as used on the Web, pages about “SUVs” would be tagged with the categories “SUV” and “automobile”. Search software could then reason with an ontology relating various Web pages. For example, a page about “automobiles” would be linked by semantic tagging to pages about “SUV”, “coupe”, “sports-car”, “Porsche”, etc.
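The kind of reasoning described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ontology, the page URLs, and the function names `ancestors` and `pages_about` are hypothetical and do not come from any real tagging system.

```python
# Hypothetical ontology: each category maps to its direct parent category.
PARENT = {
    "SUV": "automobile",
    "coupe": "automobile",
    "sports-car": "automobile",
    "Porsche": "sports-car",
}

# Hypothetical hand-tagged pages: URL -> set of categories.
PAGE_TAGS = {
    "example.com/suv-review": {"SUV"},
    "example.com/porsche-911": {"Porsche"},
    "example.com/tractors": {"farm-equipment"},
}


def ancestors(category):
    """Return the category together with every category above it."""
    chain = [category]
    while category in PARENT:
        category = PARENT[category]
        chain.append(category)
    return chain


def pages_about(category):
    """Find pages whose tags fall under `category` in the ontology."""
    return sorted(
        url
        for url, tags in PAGE_TAGS.items()
        if any(category in ancestors(tag) for tag in tags)
    )


# A query for "automobile" reaches pages tagged only "SUV" or "Porsche".
print(pages_about("automobile"))
```

The point of the sketch is that a page never has to mention the word “automobile” to be found; the link comes from the tags and the ontology, which is also why someone has to supply both.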
Special tagging languages, such as RDF and OWL, are typically employed for hand-tagging. Some say these languages are daunting, but whether they are or not, most Websites have not tagged their pages because of the massive amount of labor involved, and that labor remains a significant barrier to adoption. To overcome this obstacle, automated tagging has been explored, but an automatic tagging algorithm would, by its very nature, be the very kind of intelligent (semantic) system desired, which would obviate the need for tagging altogether.
Since “bottom-up” tagging has not been widely adopted, a “top-down” approach has been tried. In this scenario, a semantically coherent domain of sites is targeted, and developers model the typical users’ interactions in that domain. This is an example of “broad semantics” (i.e., an interpretation), in that the user’s knowledge of the domain, needs in the domain and actions in the domain are modeled. Typical actions desired in that domain are hand-categorized and hand-encoded. For example, the vertical Search site Retrevo handles information about electronic devices employing the semantics of that space. Semantics of the space means the way the space, or topic of interest, is categorized cognitively by anyone interacting with it. For example, in interacting with a Website about electronic devices, some people want to find out details about particular equipment, while others are interested in reactions to the equipment, pricing, etc. Using “top-down” semantics, the Website is organized around these user needs into meaningful categories, such as Most Popular Results, Manufacturer Info, Reviews and Articles, Forums and Blogs, Daily Deals, and Stores. Thus, the user doesn’t have to browse all the results for a given query, but can browse just the information in the desired category.
The drawbacks of this “top-down” approach are that 1) someone has to manually code the relationships and actions that are relevant to each domain, and 2) the understanding of these relationships and actions is limited to whatever the coder decides to represent (i.e., it is subjective). For more on this topic, see Alex Iskold’s article at http://www.readwriteweb.com/archives/the_top-down_semantic_web.php.
An alternative method to “bottom-up” and “top-down” semantics is Semantic Natural Language Processing, such as that employed by Cognition Technologies, in which the computer has been taught the meanings and relationships of all of the words and phrases, and also recovers the meanings of the words and phrases in the searched document set. With this approach, no hand-tagging or automatic tagging is required, and the computer appears to understand what the user and the documents mean. (Visit Cognition for a demonstration of this technology or their demo video here.)
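To make the “top-down” idea concrete, here is a minimal Python sketch of grouping search results into hand-encoded views. It is not Retrevo's implementation: the rule table `CATEGORY_BY_SOURCE`, the source labels, the fallback category “Other”, and the function `group_results` are all assumptions made for illustration; only the category names are taken from the text above.

```python
from collections import defaultdict

# Hand-encoded, hypothetical rules: kind of source -> view category.
CATEGORY_BY_SOURCE = {
    "manufacturer": "Manufacturer Info",
    "review": "Reviews and Articles",
    "forum": "Forums and Blogs",
    "store": "Stores",
}


def group_results(results):
    """Group (title, source_kind) pairs into the hand-defined categories."""
    grouped = defaultdict(list)
    for title, source_kind in results:
        # "Other" is a hypothetical fallback for sources the coder did not model.
        category = CATEGORY_BY_SOURCE.get(source_kind, "Other")
        grouped[category].append(title)
    return dict(grouped)


results = [
    ("Camera X spec sheet", "manufacturer"),
    ("Camera X hands-on review", "review"),
    ("Camera X owners thread", "forum"),
    ("Camera X press photo", "image"),
]
for category, titles in group_results(results).items():
    print(category, "->", titles)
```

The fallback bucket shows the limitation in miniature: anything the developer did not anticipate ends up outside the meaningful categories.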
Dr. Kathleen Dahlgren is the Founder and Chief Technology Officer of Cognition Technologies. Currently, she is also an adjunct professor of Linguistics at the University of California, Los Angeles.
Vipin Jain comments:
We have seen researchers positioned along the entire spectrum of AI and the semantic web. The holy grail is a completely automated bottom-up approach that requires no structured feed to the learning system; the system is supposed to learn the underlying structure on its own. This becomes extremely challenging when the learning system doesn’t have an identity, intelligence, or motivation of its own. Actually, it is extremely challenging even when the learning system does have an identity, intelligence or motivation: just look at how good we are at training and retraining our children, or other people’s.
Anyway, back to computers: all learning systems have to be guided, and they differ only in the amount and means of guidance provided by the “algorithms specialist” or the “programmer”. Practical and successful approaches constrain the learning system and structure the problem such that simple mathematical rules can learn the classification boundaries from available data. Even then you can’t satisfy every user request in a given vertical. You optimize for the most common tasks that make a solution commercially viable from a cost and ROI perspective. Our researchers at Retrevo have employed practical approaches of constraining problems by using intelligent crawling, statistics-based feature extraction and selection, and Bayesian learning of classification boundaries. We don’t intend to satisfy every corner case in our vertical equally well, but we satisfy common tasks extremely well. Just 2 cents from a company that has applied research to solve practical problems with commercial viability. And we know we still have a ways to go!
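As a rough illustration of what “Bayesian learning of classification boundaries” can look like in its simplest form, the Python sketch below trains a multinomial naive Bayes classifier with add-one smoothing on a handful of labelled snippets. It is a generic textbook technique, not a description of Retrevo's system; the training snippets, the labels “review” and “deal”, and the function names `train` and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, hand-labelled training snippets.
model = train([
    ("great picture quality and battery life", "review"),
    ("love this camera highly recommend", "review"),
    ("buy now free shipping discount price", "deal"),
    ("sale price today only discount", "deal"),
])
print(predict(model, "discount price on camera"))
```

Even in this toy form the guidance is visible: a person chose the labels, the training examples, and the features (individual words), and the classifier only learns the boundary between them.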

Despite protests from users who loudly demanded Virgilio's return (including through blogs and online petitions), Telecom Italia waited until 2007 to bring back the brand, which still enjoys a high level of recall among users today. The Alice portal is now called “Virgilio: Powered by Alice”. From a technological standpoint, Virgilio-Alice is now essentially just a simple showcase for Telecom Italia and its services, and the display is supplemented with results that come directly from queries run on Google in the background and customized ad hoc. The mix serves to deliberately highlight certain results in searches… Virgilio's results are therefore less “natural” than Google's.
The technology was developed together with the University of Pavia and Olivetti Telemedia, with a large investment. The algorithm Arianna used for search was fairly effective and very fast for its time. In 1998 Infostrada acquired the Arianna brand, which changed hands again in 2002 when it became part of the assets of Wind Telecomunicazioni. The company, by then “Wind Infostrada”, used the Arianna logo to promote its new portal “libero.it”, dedicated to promoting Wind's services on one side and Infostrada's on the other. In 2003 the Arianna project was definitively abandoned, and today the URL “arianna.libero.it” is answered by an engine that merely reflects Google's results, with minimal customization, for the purposes of marketing, selling ad space, and presenting services.
Virgilio started in 1996 as a search engine and directory, and in its first incarnation was owned by the “Seat pagine gialle” company. At that time Virgilio (“the best of internet”) was really the first and most used search engine. Unfortunately, the Virgilio engine later passed into the hands of two different companies, first TIN in 2001 and then Telecom Italia in 2004. In 2005 Telecom Italia killed the “Virgilio” brand in favor of the “Alice” brand (a new brand of ADSL connection kit). The strange thing is that a lot of users complained about that, even using a dedicated blog and an online petition. In 2007 Telecom Italia agreed that the Virgilio brand was really still alive and strong, and rebranded the portal again as “Virgilio powered by Alice.”