Top-Down and Bottom-Up Semantics


Guest Author Kathleen Dahlgren, PhD

The concept of Web 3.0 (the Semantic Web) includes a vision for the next evolution of online Search. It would improve Search results for users by understanding the meaning (semantics) of both their queries and the results. For example, a user could ask about a “cheap SUV” and the computer would look for automobiles in the “SUV” category with low prices. For this to work, the software would need to understand the concept of “cheap” with respect to SUVs, the concept of “SUV”, and the brands of cars that are in the SUV category. The ideal software for the Semantic Web would be a full Artificial Intelligence system with natural language processing that computes the meaning of words and phrases. Since no such AI system currently exists, we must take incremental steps that make the computer behave more intelligently without really being intelligent.
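To make the “cheap SUV” example concrete, here is a minimal sketch in Python. It assumes that “cheap” can be approximated as “priced below the median for that category”; that rule, the listing data, and the function name are all invented for illustration and are not how any particular search engine works.

```python
from statistics import median

# Hypothetical listings; models and prices are invented for illustration.
LISTINGS = [
    {"model": "Alpha", "category": "SUV", "price": 24000},
    {"model": "Bravo", "category": "SUV", "price": 38000},
    {"model": "Charlie", "category": "SUV", "price": 52000},
    {"model": "Delta", "category": "coupe", "price": 19000},
]


def cheap_in_category(listings, category):
    """Return listings in `category` priced below that category's median.

    Assumption: "cheap" is relative to the category, so the threshold is
    computed only from items in the same category.
    """
    in_category = [item for item in listings if item["category"] == category]
    if not in_category:
        return []
    threshold = median(item["price"] for item in in_category)
    return [item for item in in_category if item["price"] < threshold]


cheap_suvs = cheap_in_category(LISTINGS, "SUV")
print([item["model"] for item in cheap_suvs])  # prints ['Alpha']
```

Note that the coupe is the least expensive car in the list but is not returned: the query is interpreted as “cheap with respect to SUVs”, not “cheap overall”.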

In “bottom-up” semantics, individual pages are hand-tagged with semantic categories. In “top-down” semantics, a developer models the semantics of a vertical domain (such as farm equipment or movie cameras) and the kinds of things users would want to do in that domain, and then links Web pages into a meaningful series of information views and user actions.

As an example of “bottom-up semantics” as used on the Web, pages about “SUVs” would be tagged with the categories “SUV” and “automobile”. Search software could then reason with an ontology relating various Web pages. For example, a page about “automobiles” would be linked by semantic tagging to pages about “SUV”, “coupe”, “sports-car”, “Porsche”, etc.
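The idea of reasoning over tagged pages with an ontology can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ontology, the page URLs, and the function names are hypothetical, and plain dictionaries stand in for a real tagging language. For brevity each page carries only its most specific tag, and the broader category is inferred from the ontology.

```python
# Hypothetical ontology: each category maps to its narrower categories.
ONTOLOGY = {
    "automobile": ["SUV", "coupe", "sports-car"],
    "sports-car": ["Porsche"],
}

# Hypothetical hand-applied tags: page URL -> set of categories.
PAGE_TAGS = {
    "example.com/suv-guide": {"SUV"},
    "example.com/porsche-911": {"Porsche"},
    "example.com/tractors": {"farm-equipment"},
}


def narrower_categories(category, ontology):
    """Return `category` plus every category reachable below it."""
    found = set()
    stack = [category]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(ontology.get(current, []))
    return found


def pages_about(category, page_tags, ontology):
    """Return sorted URLs tagged with `category` or any narrower category."""
    wanted = narrower_categories(category, ontology)
    return sorted(url for url, tags in page_tags.items() if tags & wanted)


print(pages_about("automobile", PAGE_TAGS, ONTOLOGY))
# prints ['example.com/porsche-911', 'example.com/suv-guide']
```

A search for “automobile” finds the Porsche page even though nobody tagged it “automobile”, because the ontology says a Porsche is a sports car and a sports car is an automobile.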

Special tagging languages, such as RDF and OWL, are typically employed for hand-tagging. Some say these languages are daunting, but whether they are or not, most Websites have not tagged their pages because of the massive amount of labor involved. To overcome this obstacle, automated tagging has been explored, but an automatic tagging algorithm would, by its very nature, be the very kind of intelligent (semantic) system desired, which would obviate the need for tagging altogether.

Since “bottom-up” tagging has not been widely adopted, a “top-down” approach has been tried. In this scenario, a semantically coherent domain of sites is targeted, and developers model the typical users’ interactions in that domain. This is an example of “broad semantics” (i.e. an interpretation), in that the user’s knowledge of the domain, needs in the domain and actions in the domain are modeled. Typical actions desired in that domain are hand-categorized and hand-encoded. For example, the vertical Search site Retrevo handles information about electronic devices employing the semantics of that space. Semantics of the space means the way the space, or topic of interest, is categorized cognitively by anyone interacting with it. For example, in interacting with a Website about electronic devices, some people want to find out details about particular equipment, while others are interested in reactions to the equipment, pricing, etc. Using “top-down” semantics, the Website is organized around these user needs into meaningful categories, such as Most Popular Results, Manufacturer Info, Reviews and Articles, Forums and Blogs, Daily Deals, and Stores. Thus, the user doesn’t have to browse all the results for a given query, but can browse just the information in the desired category.
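A rough sketch of what “hand-encoded” categories might look like is given below in Python. The view names come from the list above, but the keyword rules, the sample result titles, the “Uncategorized” fallback, and the function names are assumptions made up for this illustration; this is not a description of how Retrevo actually works.

```python
from collections import defaultdict

# Hand-coded rules: (view name, keywords that place a result in that view).
# The keywords are assumptions chosen for illustration only.
VIEW_RULES = [
    ("Manufacturer Info", ("manual", "specifications")),
    ("Reviews and Articles", ("review", "article")),
    ("Forums and Blogs", ("forum", "blog")),
    ("Daily Deals", ("deal", "coupon")),
    ("Stores", ("store", "buy")),
]

FALLBACK_VIEW = "Uncategorized"  # hypothetical catch-all, not from the article


def assign_view(title):
    """Return the first view whose keywords appear in the result title."""
    lowered = title.lower()
    for view, keywords in VIEW_RULES:
        if any(keyword in lowered for keyword in keywords):
            return view
    return FALLBACK_VIEW


def group_results(titles):
    """Group result titles by view so a user can browse one view at a time."""
    grouped = defaultdict(list)
    for title in titles:
        grouped[assign_view(title)].append(title)
    return dict(grouped)


results = [
    "Camcorder X review roundup",
    "Camcorder X user forum thread",
    "Camcorder X owner's manual",
    "Camcorder X press photo",
]
for view, titles in group_results(results).items():
    print(view, "->", titles)
```

The rule table is the whole “semantic model” here: anything the developer did not anticipate falls through to the catch-all view.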

The drawbacks of this “top-down” approach are that 1) someone has to manually code the relationships and actions that are relevant to each domain, and 2) the understanding of these relationships and actions is limited to whatever the coder decides to represent (i.e. it is subjective). For more on this topic, see Alex Iskold’s article at http://www.readwriteweb.com/archives/the_top-down_semantic_web.php.

An alternative method to “bottom-up” and “top-down” semantics is Semantic Natural Language Processing, such as that employed by Cognition Technologies, in which the computer has been taught the meanings and relationships of all of the words and phrases, and also recovers the meanings of the words and phrases in the searched document set. With this approach, no hand- or automatic-tagging is required, and the computer appears to understand what the user and documents mean. (Visit Cognition for a demonstration of this technology or their demo video here.)

Dr. Kathleen Dahlgren is the Founder and Chief Technology Officer of Cognition Technologies.  Currently, she is also an adjunct professor of Linguistics at the University of California, Los Angeles.


6 Responses to “Top-Down and Bottom-Up Semantics”

  1. Andreas Harth Says:

    I don’t completely agree with the notion of top-down vs. bottom-up semantics. Even tagging individual pages with topics requires a formal specification of the topics, even if the formal spec is very basic, e.g. in the case of taxonomies such as DMOZ. The article also fails to explain that a shared use of URIs forms the basis for deriving meaning on the Semantic Web.

    I’m skeptical of the promise that NLP methods will automatically derive meaning from text. While information extraction with NLP may work in narrowly specified domains (which, however, involves a great deal of manual labour), automatically extracting entities, attributes and relationships from the open-domain, multi-lingual Web using NLP techniques has failed, despite millions of dollars of investment (Powerset). Rather, starting with structured datasets and extending and linking them iteratively seems to work (DBpedia, Linked Data).

  2. Hope Leman Says:

    This is a really fascinating article. I was interested that the emphasis was on coders as taggers. But isn’t a huge amount of tagging done by average people with no coding know-how at all? For instance, I help run a site that requires me to determine what categories each grant goes into. Those are sort of quasi-tags that don’t require any coding know-how on my part and some of the grant listings do seem to get into Google (not as much as I would like, of course!).

    And Blogger has the feature, “Labels for this post,” which is a sort of tagging system (I guess!). And then there is the whole world of del.icio.us and social bookmarking. None of that requires knowledge of tagging languages. And do websites necessarily need to tag their content if enthusiastic laypeople are doing it for them a la Digg?

    Anyway, this was one of the best, clearest overviews of the very abstruse topic of the semantic web I have seen. Thank you. I am in library school and need to grasp all this stuff.

  3. D Ashcart Says:

    “An alternative method to “bottom-up” and “top-down” semantics is Semantic Natural Language Processing, such as that employed by Cognition Technologies, in which the computer has been taught the meanings and relationships of all of the words and phrases, and also recovers the meanings of the words and phrases in searched document set.”

    I’m curious how this is an alternative to top-down. If top-down tagging is susceptible to the subjectivity and knowledge of the tagger, isn’t SNLP susceptible to the subjectivity of your programmers or librarians or ontologists?

    In other words, who “taught the computer”?

  4. Vipin Jain Says:

    We have seen researchers positioned along the entire spectrum of AI and semantic web. The holy grail is a completely automated bottom-up approach that requires no structured feed to the learning system – the system is supposed to learn the underlying structure on its own. This becomes extremely challenging when the learning system doesn’t have an identity, intelligence, or motivation of its own. Actually, it is extremely challenging when the learning system does have an identity, intelligence or motivation – just look at how good we are at training and retraining our children or other people’s :-) .

    Anyway, back to computers: all learning systems have to be guided and differ only in the amount and means of guidance provided by the “algorithms specialist” or the “programmer”. Practical and successful approaches constrain the learning system and structure the problem such that simple mathematical rules can learn the classification boundaries from available data. Even then you can’t satisfy every user request in a given vertical. You optimize for the most common tasks that make a solution commercially viable from a cost and ROI perspective. Our researchers at Retrevo have employed practical approaches of constraining problems by using intelligent crawling, statistics-based feature extraction and selection, and Bayesian learning of classification boundaries. We don’t intend to satisfy every corner case in our vertical equally well but we satisfy common tasks extremely well. Just 2 cents from a company that has applied research to solve practical problems with commercial viability. And we know we still have a ways to go!

  5. D Ashcart Says:

    @Vipin – what you are saying makes sense. If you dig deeply enough into the origins of any learning system, you will hit the subjectivity barrier. Which is OK, since all forms of expression are ultimately subjective. This is the reason that the Holy Grail you describe is unreachable – you will need an infinite bitspace. You know you are headed into semantic swampland when you start seeing terms like ‘learning’, ‘domain expertise’, ‘heuristic’, ‘practical solution’, ‘intelligent’, etc. creep into the conversation. But there’s no denying that these solutions often offer utility.

    My initial reaction to the article was to question the implied leap of faith from the untrustworthiness of a tagger to the trustworthiness of a corporation.

    Nothing specific against Cognition – the semantic/NLP space is rife with marketing-speak-absolutism that tramples over nuance and accuracy, which occasionally tweaks my purist heart. This is my personal curse.

  6. Vipin Jain Says:

    D,

    No doubt. I have yet to see a commercial AI, NLP, or semantic system that works across a seemingly infinite number of contexts and semantics from both the users’ and the information perspective. Let alone one that is not subjective :-) .

    Anyways, it was a fun discussion. And it was good to chat!
