Organizing the Web around Concepts

September 30th, 2009 by Guest Author
Posted in Guest Authors | No Comments »

During the initial days of the web, directories like Yahoo manually organized the web to help people find relevant information. As the web grew in size and search engine technology evolved, search engines like Google became the main way to query the web. Today, the next wave is making web navigation easier by reorganizing the Internet by topic or concept. An increasingly meaningful web (which may lead to the Semantic Web) is being built around concepts by efforts such as Freebase, Google Squared, DBLife, and Kosmix topic pages.  At Kosmix, we’re often asked about the technical philosophy driving this change.  Here is a brief overview for the geeks among us.

To start with, what do we mean by concepts? A concept is loosely defined as a set of keywords of interest, for example, the name of a restaurant, a cuisine, an event, the name of a movie, etc. Various websites are tailored to a particular kind of concept, such as Yelp for restaurants (e.g., Amarin Thai), IMDB for movies (e.g., The Shawshank Redemption), LinkedIn for professionals, Last.fm for music (e.g., U2), etc.

Why should one care about organizing the web around concepts? There are three main kinds of web pages: search pages, topic/concept pages, and articles. Organizing the web around concepts can benefit each one of them.

Search pages. A search results page for a given query consists of various relevant links with snippets, for example, Google search results pages on “Erykah Badu”. Web data around concepts can improve search results in two ways. First, a search page can show a bunch of concepts related to the query, and their relationships to the query. This will help in further refining the query, and enable exploration of concepts related to the query.  Second, a search page can promote the concept page result for a concept closely matching the query.

Concept/Topic pages. A topic page or concept page organizes information around a concept; for example, consider this music artist page on “Erykah Badu”. Such pages can use the attributes of a concept to show content related to the concept and its attributes, such as albums, music videos, song listings, album reviews, concerts, etc.

Articles. Articles can put semantic links to the concepts present in the article, and promote exploration of concepts present in the article, for example, this page on oil prices.

Given so many benefits of arranging the web around concepts, how can we achieve that? Some of the ways to arrange the web around concepts are as follows.

1. Editorial: An editor can pick a set of concepts of interest, create attributes for the concepts, and organize the data around them. Many sites like IMDB (for movies) have taken this approach. This approach gives high quality content, but it does not scale in terms of the number of concepts.

2. Community: Many sites such as Wikipedia and Yelp have taken this approach, in which a community of users picks concepts, creates the attributes of the concepts, and organizes the data around the concepts. This process scales as the user community grows, but it is hard to build such a community, the approach is susceptible to spam, and its scale is still limited. For example, Wikipedia has grown to millions of concepts thanks to its large user base, but its size is still far from the scale of the web.

3. Algorithmic approach:  One way to organize the web around concepts is to mine the web for concepts and their attributes, and link data with concepts. This approach is the most promising in terms of scaling to the size of the web. Various steps in this approach are (a) Concept Extraction, (b) Relationship mining, and (c) Linking data with concepts.

(a) Concept Extraction. There are two main methods for concept extraction from web pages, site-specific and category-specific.

In the site-specific method, the structure or semantics of a site is used to extract concepts. Many web sites generate HTML pages from databases through a program, and such pages have a similar structure. One can write site-specific rules or wrappers to extract interesting data from such web pages, but writing such wrappers is a labor-intensive task. Kushmerick et al. [1] proposed the wrapper induction technique to automatically learn wrapper procedures from samples of such web pages. Recent work by Dalvi et al. [2] extends the wrapper induction technique to dynamic web pages. Another site-specific method is to use natural language processing to understand the semantics of web pages and to mine concepts from them.
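To make the idea of a hand-written wrapper concrete, here is a minimal sketch in Python. It assumes a generated page in which each attribute sits between a fixed left and right delimiter; the function name `extract_records`, the delimiters, and the sample page (including the made-up "Example Diner" entry) are illustrative assumptions, not taken from any real site. It shows only the kind of rule that wrapper induction tries to learn, not the induction algorithm itself.

```python
def extract_records(html, delimiters):
    """Apply a hand-written, delimiter-based wrapper to a generated page.

    delimiters is a list of (attribute, left, right) tuples, applied in
    order for each record. Scanning stops when a delimiter is not found.
    """
    records = []
    pos = 0
    while True:
        record = {}
        for attribute, left, right in delimiters:
            start = html.find(left, pos)
            if start == -1:
                return records
            start += len(left)
            end = html.find(right, start)
            if end == -1:
                return records
            record[attribute] = html[start:end].strip()
            pos = end + len(right)
        records.append(record)


# Hypothetical page produced from a restaurant database by a template.
page = (
    "<li><b>Amarin Thai</b> <i>Thai</i></li>"
    "<li><b>Example Diner</b> <i>American</i></li>"
)
rules = [("name", "<b>", "</b>"), ("cuisine", "<i>", "</i>")]
restaurants = extract_records(page, rules)
print(restaurants)
```

The rules are tied to one site's template, which is why writing them by hand for every site is labor intensive.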

In the category-specific method, web pages are classified into categories such as restaurants, shopping, movies, etc., and category-specific extraction rules are applied. For example, extract menu, reviews, cuisine, and location for restaurants; extract price, reviews, and ratings for the shopping category; and extract actors, director, and ratings for movies. This method is more scalable in terms of the number of web pages than the site-specific method, but slightly more error prone, since classification and extraction errors accumulate.
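The two stages of the category-specific method can be sketched as follows. This is only an illustration under strong assumptions: the keyword sets used for classification and the "attribute: value" page layout are invented for the example, and a real system would use a trained classifier and far richer extraction rules. The attribute lists per category are the ones named above.

```python
import re

# Assumed keyword cues per category (hypothetical, for illustration only).
CATEGORY_KEYWORDS = {
    "restaurants": {"menu", "cuisine", "reservation"},
    "shopping": {"price", "cart", "shipping"},
    "movies": {"director", "actors", "cast"},
}

# Attributes to extract for each category, as listed in the text.
CATEGORY_ATTRIBUTES = {
    "restaurants": ["menu", "reviews", "cuisine", "location"],
    "shopping": ["price", "reviews", "ratings"],
    "movies": ["actors", "director", "ratings"],
}


def classify(text):
    """Pick the category whose keywords overlap most with the page text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def extract(text):
    """Classify the page, then apply that category's extraction rules."""
    category = classify(text)
    if category is None:
        return None, {}
    found = {}
    for attr in CATEGORY_ATTRIBUTES[category]:
        # Assumed layout: "Attribute: value", ended by ';' or a newline.
        match = re.search(rf"{re.escape(attr)}\s*:\s*([^\n;]+)", text, re.IGNORECASE)
        if match:
            found[attr] = match.group(1).strip()
    return category, found


sample = "Director: Frank Darabont; Actors: Tim Robbins, Morgan Freeman"
category, attributes = extract(sample)
print(category, attributes)
```

Note how a wrong answer from `classify` would send the page to the wrong rule set, which is the accumulation of classification and extraction errors mentioned above.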

(b) Relationship mining. After extracting interesting concepts, one needs to match them with concepts already in the database in order to create attributes, to grow concepts, and to find relationships between concepts. Some web databases like Freebase provide a substantial number of relationships between Wikipedia concepts.
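As a rough illustration of the matching step, the sketch below merges an extracted concept into an in-memory store by comparing normalized names. The store layout and the helper names `normalize` and `merge_concept` are assumptions made for this example; real matching must also cope with ambiguous names, aliases, and conflicting attribute values, which this sketch ignores.

```python
import re


def normalize(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", name.lower()).split())


def merge_concept(db, name, attributes):
    """Match an extracted concept against db and grow its attributes.

    If no concept with the same normalized name exists, a new one is
    created. Existing attribute values are kept, new ones are added.
    """
    key = normalize(name)
    entry = db.setdefault(key, {"name": name, "attributes": {}})
    for attr, value in attributes.items():
        entry["attributes"].setdefault(attr, value)
    return entry


concepts = {}
merge_concept(concepts, "The Shawshank Redemption", {"category": "movie"})
merge_concept(concepts, "the shawshank  redemption!", {"director": "Frank Darabont"})
print(concepts)
```

Both extractions land on the same concept, so the second one grows the existing entry instead of creating a duplicate.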

(c) Linking data with concepts. As mentioned earlier, organizing the web around concepts can improve the experience of search pages, topic pages, and article pages by linking them with concepts.

The algorithmic approach to organizing the web around concepts is somewhat error prone, though it improves as the algorithms for each step improve. However, it is the most promising approach in terms of scaling to the enormous web that exists.

In short, organizing the web around concepts is a promising area and a stepping stone toward bringing meaning to web data.

References

[1] Nicholas Kushmerick, Daniel S. Weld, Robert B. Doorenbos: Wrapper Induction for Information Extraction. IJCAI (1) 1997.

[2] Nilesh N. Dalvi, Philip Bohannon, Fei Sha: Robust web extraction: an approach based on a probabilistic tree-edit model. SIGMOD Conference 2009.

Posted by Mitul Tiwari in the Kosmix blog here.

You’re invited to try Wowd – The Web You Want.

September 30th, 2009 by Guest Author
Posted in Guest Authors, Newcomers, Realtime | No Comments »

MORE INVITATION GOODNESS – NEW REAL-TIME SEARCH AND DISCOVERY COMPANY, WOWD.COM, MAKES INVITATION KEYS AVAILABLE TO ALTSEARCHENGINES READERS TODAY

Silicon Valley start-up Wowd.com has just released its private beta and is making a limited number of invitations available to AltSearchEngines readers. The company, founded in 2007, has been quietly building a slick, sophisticated and user-friendly real-time search (RTS) tool that promises to be different from a general-purpose search engine and more robust than other real-time publishing platforms.

What is Wowd, and how will it “wow” me?

Wowd is a real-time search engine for discovering popular content on the web right now. What’s novel about Wowd is that it uses the ‘attention frontier’ of its user community to build a real-time index of results from the entire web. Meaning, the surfing behavior of any given Wowd ‘node’ is fed into the cloud (anonymously) so that others may discover interesting topics that are both popular and fresh.

For AltSearchEngines Readers Only – to try Wowd, click here:

Then download the browser app.

Wowd Hot List

Unlike others in the business of RTS, Wowd knows which sites, news items, or media are the “hottest”. It delivers results that might otherwise have remained buried in a sea of spam, because real people visit pages they find interesting, not bots. In this way, the Wowd index is continuously being updated.

Anyone visiting the Wowd.com site can use Wowd, view the “Hot List” and enter a query. But, to get the most from Wowd, we suggest downloading the free browser application that works on the Mac, PC and Linux operating systems (use invite key above).

Wowd’s founder, Boris Agapiev, has just written a new white paper “The Distributed Cloud, a Foundation for Planetary-scale Computing.” This paper nicely illustrates how Wowd plays in the real-time search market. Agapiev observes “The enormous size of the Web, together with ever-more demanding requirements such as freshness (results in seconds, not days or weeks) means that massive resources are required to handle enormous datasets in a timely fashion.” The paper explains how the Wowd architecture addresses this issue without requiring huge server farms and data centers. (So Wowd is “green” too!)

To learn more, and read the white paper, go to: http://blog.wowd.com

-LaurieAnne “LA” Lassek

The Real-Time People Web

September 30th, 2009 by Guest Author
Posted in Guest Authors, Realtime | No Comments »

By Damon, Lion Tamer

From the Aardvark Blog here.

In the past several months, there has been a surge of interest in the explosion of real-time information now available online.  With Facebook status updates and Tweets now ubiquitous, we have entered the era of the “real-time web”.  Everywhere from Mountain View to Paris, technologists and pundits are gathering (offline!) to ponder and promote it.  And even mainstream media, ranging from Wired to the New York Times, have noted that a fundamental shift in our information-seeking paradigm is taking place.

Simply put:  It is now possible to see what lots of people are talking about in real-time on the web.

But there’s another side to the real-time web phenomenon that we here at Aardvark think is even more powerful than this, a change of paradigm that is even more fundamental.

What really matters is the increased accessibility of people online, not just information online.

Why is this important?
To understand this, consider the difference between Web Search and Social Search.

With Web Search, it’s possible to find countless long-hidden facts and figures which [a small minority of] internet users have at some point published on the web:  just type in a few keywords, and a Web Search engine will return the top results from among the billions of pages that constitute the web.  That’s great for queries about objective information that isn’t particularly timely, and which doesn’t need to be personally or contextually relevant.

With Social Search, it’s possible to find a person who has the information you are looking for: just type in your question in natural language, and a Social Search engine will connect you to someone with the right knowledge and experience to answer your question.  You get an answer in a few minutes, and can have a quick back-and-forth conversation with the answerer if there’s something you’d like to follow up on.  That’s just what is needed for queries that have a subjective element, or when you want information first-hand from someone you can trust.

The key point here is that often what you’re after isn’t static content; rather, it’s an interaction with someone who can help.  With the Social Search paradigm, online content is used as just an index of what its author knows about; the engine uses this index to find the person you should connect with for your question.  This becomes pretty compelling if we remember that the amount of information in people’s heads positively dwarfs the amount of authored information online:  just think what a small fraction of everything you know you have published on the web.  Yet everyone (not just those who blog a lot) has knowledge and experience that is valuable to share.  By using the web as an index of people, Social Search lets you tap into absolutely anything that anyone knows… in the theoretical limit :-)

How does this relate to the Real-Time Web?
There are three touch points for Social Search in the new real-time information landscape:

  1. The index of people is always up-to-date

    What makes Social Search possible is the vast amount of profile and social graph data that people have online (since this is what the engine uses to figure out who would be a good match to answer a question).  With the real-time web, this information can stay up-to-date automatically, so that you can connect with other people to talk about your current experiences.

    For instance, at Aardvark, users can opt-in to having their profiles updated based upon their real-time activity.  (More on the various information extraction and machine learning algorithms we use to accomplish these feats in a future post.)

  2. More people are online more often

    In the real-time web era, people are increasingly available online:  they are on IM, they are on Facebook, they are Twittering, they are using their iPhones.  This means that a Social Search service like Aardvark can easily see who might be available to answer a question in the moment and reach out to them… on any of these platforms.

    If you want to tap into your extended social network (tens of thousands of friends-of-friends, school and work connections, and such), we’ve found that it’s useful to have a service play the role of social intermediary here.  (Read about our user experience research work here.)

  3. Filtered channels are high-value

    The flood of online and real-time data has quickly become overwhelming to most people.  If you broadcast a question out to your entire network, that’s a lot of spam you’re creating as you add to the din; and over time, as we waste people’s attention, they become less attentive to these noisy broadcast channels.

    The alternative is to submit your question to a Social Search engine:  it will choose the few people who are most likely to answer, and contact them directly; in essence, it provides a kind of filter for your network.  It’s clear in the feedback we’ve gotten from Aardvark users that people are grateful to have this more personalized filtered channel, and as a consequence, they are much more responsive and thoughtful when they do choose to answer a question.
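To illustrate the three points above, here is a toy routing sketch in Python. It is not Aardvark's algorithm: the profile fields (`topics`, `online`), the word-overlap score, and the sample users are all assumptions chosen to show the shape of the idea, namely using profile data as an index of people, considering only people who are currently available, and contacting just a few of them.

```python
import re


def route_question(question, profiles, k=3):
    """Return up to k available people whose profile topics match the question."""
    question_words = set(re.findall(r"[a-z]+", question.lower()))
    scored = []
    for person in profiles:
        if not person["online"]:
            continue  # only reach out to people who are available right now
        overlap = len(question_words & set(person["topics"]))
        if overlap:
            scored.append((overlap, person["name"]))
    # Highest overlap first; ties broken by name for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:k]]


# Hypothetical profiles, kept up to date from each user's online activity.
profiles = [
    {"name": "ana", "topics": ["thai", "restaurants"], "online": True},
    {"name": "ben", "topics": ["thai"], "online": False},
    {"name": "cy", "topics": ["movies"], "online": True},
    {"name": "dee", "topics": ["restaurants"], "online": True},
]
chosen = route_question("Any good Thai restaurants nearby?", profiles, k=2)
print(chosen)
```

Limiting the result to `k` people is what makes the channel filtered rather than a broadcast to the whole network.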

Put all of this together, and the result is a completely different kind of experience than anything available before.  Real-time information hookups!  With people you trust! Satisfying for the asker, gratifying for the answerer!

And this isn’t some futuristic dream — it’s happening right now.  In the time it took you to read this piece, a huge variety of questions were answered on Aardvark, based on connections made from profile data.  People really like helping other people!

In sum, the Real-Time People Web is the way that the Real-Time Web becomes personal:  Because often you don’t just want to hear what people are saying — you want to hear what someone is saying to you.

Ugandan search engine Mountbatten

September 30th, 2009 by Charles S. Knight
Posted in Global | No Comments »

Mountbatten indexes the Ugandan part of the Internet. There are quite a few webservers inside Uganda that contain information that is very relevant for people in Uganda and that perform significantly faster for visitors that connect through one of the local ISPs.

Mountbatten offers this search service both for Ugandan web surfers and for Web site owners with a Ugandan website.

To search, just type in a few words.

• Results only include pages that contain all of the query words.

• Use quotes around words that must occur adjacently, as a phrase, e.g., “Apple”.

• Punctuation between words also triggers phrase matching. So searching for http://www.data.co.ug/ is the same as searching for “http www data co ug”.

• Searches are not case-sensitive, so searching for QuEEN is the same as searching for qUeen.

• You can prohibit a term from resulting pages by putting a minus before it, e.g., searching for soccer -ball will find pages that discuss soccer, but don’t use the word “ball”.

That’s it!
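The query rules listed above can be sketched in a few lines of Python. This is a hedged reading of the rules, not Mountbatten's actual implementation: the function names `parse_query` and `matches` are made up for the example, words are assumed to be runs of ASCII letters and digits, straight double quotes are assumed for phrases, and the sample pages are invented.

```python
import re

WORD = r"[a-z0-9]+"


def parse_query(query):
    """Split a query into required and prohibited phrases (lists of words).

    Each token is an optional minus followed by a quoted phrase or a run
    of non-space characters. Punctuation inside a token turns it into a
    phrase, and everything is lowercased (searches are not case-sensitive).
    """
    required, prohibited = [], []
    pattern = r'(-?)(?:"([^"]*)"|(\S+))'
    for minus, quoted, bare in re.findall(pattern, query.lower()):
        words = re.findall(WORD, quoted or bare)
        if words:
            (prohibited if minus else required).append(words)
    return required, prohibited


def matches(page_text, query):
    """True if the page has every required phrase and no prohibited one."""
    page_words = re.findall(WORD, page_text.lower())

    def contains(phrase):
        n = len(phrase)
        return any(page_words[i:i + n] == phrase
                   for i in range(len(page_words) - n + 1))

    required, prohibited = parse_query(query)
    return (all(contains(p) for p in required)
            and not any(contains(p) for p in prohibited))


print(parse_query("http://www.data.co.ug/"))
print(matches("Soccer news from Kampala", "soccer -ball"))
print(matches("Kick the soccer ball", "soccer -ball"))
```

Treating a single word as a one-word phrase keeps the matching code uniform for plain words, quoted phrases, and punctuation-joined terms.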

You can add your website on a special page on our Local Hosting Website. If you register there, you can maintain multiple websites and update them.

What is this fuss about Local Hosting?

Good question! We have built a complete website around local hosting, where you can find all your questions answered, as well as tools and downloads regarding local hosting.

Source: local.mountbatten.net

Society of the Query conference

September 30th, 2009 by Charles S. Knight
Posted in Uncategorized | No Comments »

Society of the Query conference: 13 – 14 November, Trouw Amsterdam in Amsterdam

With the Society of the Query conference -stop searching, start questioning-, the Institute of Network Cultures aims to critically reflect on the information society and the dominant role of the search engine in our culture. What does the dependency on the engine to manage the complex system of knowledge on the Internet mean? What alternatives exist? How can the increasingly centralized web be regulated? What is the future of interface design? By bringing together researchers, theorists and artists, the conference will examine the key issues that are emerging around web search, and contextualize developments within the fields of knowledge organization and information design.

Introduction to the Society of the Query conference:
Search is the way we now live. At present, the reality of the information society is one in which we are increasingly confined to the use of information retrieval tools to create order and value in the vast amount of online data. Web search has taken over from (directory based) browsing and surfing as the dominant activity on the web. With this development, the search engine has become the main point of reference, one whose emphasis on efficiency and service tends to cloud the nature of both the underlying technology and (corporate) ideologies.

In what might be dubbed the ‘society of the query’, this conference asks what this dependency on tools to manage the complex system of knowledge on the Internet means for our culture. As the idea of a semantic web unfolds, the human versus artificial intelligence controversy is regarded with renewed urgency. The increasingly centralized computing grid invites critical questions about power distribution, governance, and diversity and accessibility of web content, while on the other hand promising alternatives to the dominant paradigm arise in P2P and open source initiatives. With large investments in media literacy, what role might politics and education play in establishing an informed and technologically literate user base?

This two-day Query conference aims to examine the key issues that are emerging around web search, and to contextualize developments within the fields of knowledge organization and information design. The Institute of Network Cultures aims to do so specifically by bringing together researchers, theorists and artists, creating room for speculation and open questions, as well as concrete projects and research.

The questions this conference raises are:

* How does the idea of machine understanding influence the fields of knowledge organization and information retrieval?
* How is the legal framework surrounding search engines changing shape?
* Is Google’s increased ubiquity affecting the production and dissemination of art and cultural practice?
* What influence does the existing hegemony of a few large search engines exert on the traditional flow of knowledge and the diversity and accessibility of web content, and in what way might regulation be possible?
* Considering developments in the fields of art and information architecture, how can we get to more sophisticated ways of interface design and the presentation of search results?
* What alternative ways of search are visible on the software level, the network level and the user level that challenge the engine as the major search paradigm?

Conference themes

* Society of the Query
* Digital Civil Rights and Media Literacy
* Alternative Search (1 and 2)
* Googlization of Everything
* Art and the Engine

Alternative Search 1
In response to a growing interest in alternative methods to search the web, this session will focus on three ‘genres’ of alternatives on the level of the user, the software and the network. The first genre that is attended to will include the upcoming ‘general purpose’ search engine, a search engine designed specifically with large audiences and competition with Google in mind. The second genre will focus on search methods that disregard the ‘engine’ as dominant paradigm. How promising are, for example, peer-to-peer and open source technologies with regard to the current search conditions, and which alternatives to commercial and centralizing methods have already emerged? The third and final genre consists of specialized search engines, mostly targeting specific content. What can we learn, for instance, from search methods within certain web spheres, such as the blogosphere, or the flourishing area of mobile search? And how is the field of visual search developing, looking beyond the tag as systematizing principle?

Moderator: Eric Sieverts
Speakers:

Matthew Fuller (UK)
Dissonance, Double-Accuracy and Parallel Worlds
This presentation will provide a partial survey of tendencies in artists’ interactions with search engines and in the development of variant conceptions of search.

Cees Snoek (NL)
Concept-Based Video Search
Despite the rise of commercial video search engines like YouTube, Truveo, and Blinkx, searching for relevant fragments in video collections is by no means a solved problem. Present day commercial systems are mainly based on textual analysis of speech transcripts or closed captions. Unfortunately this approach is futile when the visual content is not mentioned or unrelated to the words spoken. In this presentation, we discuss a novel means to search in video content using concept detectors. We highlight the academic challenges, problems, and solutions of concept-based video search. We introduce the MediaMill semantic video search engine and discuss its performance in international video retrieval competitions.

Ingmar Weber (NL/FR)
“It’s Hard to Rank without being Evil”
Google and similar web search engines are known for collecting detailed logs about all incoming requests and for mining this data on a large scale. In this talk I’ll discuss if good ranking is possible without such an approach at all and if peer-to-peer web search engines are not always doomed to present mediocre results. First, I’ll discuss scenarios where ranking is not required at all. Then I’ll give an overview of the sources of information used for ranking by current web search engines. Finally, I’ll try to point out the relative importance of each information source and how easily accessible it is.

Alternative Search 2

Semantic layers are added to the principal architecture of the Web. In this second Alternative Search session, some of the latest technological developments in semantic search functionality, as well as their implementation by W3C and European cultural heritage project Europeana, will be presented and discussed. In addition to being understood as enrichments of existing knowledge structures, these developments need to be critically addressed on both the cultural and the software level. Which ideologies make up the foundations for the concept of ‘ontology’? And what role will human expertise play in the era of ‘machine understanding’?

Moderator: Richard Rogers
Speakers:

Florian Cramer (NL)
Why semantic search is flawed
The “Semantic Web” and “semantic search” are frequently misunderstood concepts because they are described with words like “ontology” whose meanings in computer science diverge from colloquial and humanities understanding. In reality, they simply boil down to structured keyword tagging of information, which for many reasons does not scale beyond very limited collections of information and application scenarios, and which reveals a sometimes astounding naiveté about issues of culture and ontology in the original sense of the word. Finally, the false hopes for semantic search result from frustrations with design flaws of the World Wide Web that prevent more diverse search methods and technologies.

Antoine Isaac
Semantic Search for Europeana
Europeana is a pan-European initiative to make Europe’s cultural heritage accessible. It aims to aggregate millions of digital items, as provided by museums, libraries, etc. Allowing users to search among such a wide and heterogeneous range of cultural resources raises huge challenges; it also brings a unique opportunity to exploit the large body of knowledge that relates to these resources. I will present some of the latest technological developments that are being tested to provide Europeana users with semantic search functionality, using examples from the Europeana Thought Lab (http://www.europeana.eu/portal/thought-lab.html). In particular, I will sketch how re-using and enriching existing knowledge structures provides new query and exploration possibilities, beyond simple document search.

Steven Pemberton (NL)
Disintermediation through Aggregation: Making your Data your Own
The Sapir-Whorf Hypothesis postulates a link between thought and language: if you haven’t got a word for a concept, you can’t think about it; if you don’t think about it, you won’t invent a word for it. The term “Web 2.0” is a case in point: it conceptualizes the idea of Web sites that gain value by their users adding data to them. There are inherent dangers in using Web 2.0: it partitions the Web into a number of topical sub-Webs, and locks you in, thereby reducing the value of the network as a whole. It also puts your data, and its ownership, at risk. So does this mean that user contributed content is a Bad Thing? Not at all, it is the method of delivery and storage that is wrong. The future lies in better aggregators.

Project Showcase
This segment of the conference will consist of the exhibition of specific projects addressing the theme of the search engine, and will be divided into two parts. During the conference, a display of computers and screens will be available on which the latest generation of search engines is installed. The Institute of Network Cultures seeks to give visitors the opportunity to discover search engines such as Wolfram Alpha, Quaero, Theseus and Autonomy. This will provide them with hands-on experience of the range of search methods discussed in the conference sessions. Furthermore, the Institute of Network Cultures plans to organize a concluding evening program to do justice to the diversity of artistic and activist projects that examine the role of the search engine in contemporary society. The works presented in the evening program will vary from browser extensions, alternative search engines and net art projects to videos and VJ performances. The hope is that artists and developers will be present during this showcase to discuss and elaborate on their work with the audience.

Ticket information here: