
Dmitri Soubbotin, CEO of Semantic Engines, the maker of SenseBot
Part II
Semantic approach to search
Semantic search engines use various NLP and text mining methods to “understand” what a Web page is about and extract the meaning from it. A few of the most popular ones are described in this article. The idea is to give a satisfactory answer to the user’s query, bypassing the need to dig into often unrelated sources. Most of these engines try to give a direct and accurate answer to a question raised by a user, e.g. “Who built the Empire State Building?”
There is also an interesting article by Amit Singhal on the Google Blog. It clearly shows that Google uses linguistic technologies and is making headway in understanding user intent. We believe, however, that Google still uses semantic technology only marginally, whereas semantic engines use it as the foundation of their search.
The niche that SenseBot occupies within the semantic search engine family is in attempting to provide an overview and facilitate overall understanding of the subject, as opposed to finding hard facts and answering direct questions.
With the advent of the Semantic Web, we expect a boost for semantic search engines. Semantic tagging will help search engines decrease ambiguity and identify what a given page is about, allowing for more relevant results. There is also a possibility that semantic search engines may help the Semantic Web materialize, by analyzing existing Web pages and identifying representative tags for them, which can then be transformed into RDF or other formats. This would fall within the bottom-up approach to building the Semantic Web (see Alex Iskold’s article on this subject). This is one type of application that SenseBot Web services can be used for.
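To make the tags-to-RDF idea concrete, here is a minimal sketch of how a list of tags identified for a page could be written out as RDF in the N-Triples format. It is an illustration only, not how SenseBot Web services work: the function names, the example URL and the choice of the Dublin Core "subject" property as the predicate are all assumptions, and the tags are assumed to come from some upstream analysis step that is not shown.

```python
# Hypothetical sketch: turn tags extracted from a page into RDF (N-Triples).
# Assumption: each tag is recorded with the Dublin Core "subject" property.
# The tag list is assumed to come from an analysis step not shown here.

DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def tags_to_ntriples(page_url: str, tags: list[str]) -> str:
    """Emit one triple per tag: <page> <dc:subject> "tag" ."""
    lines = [
        f'<{page_url}> <{DC_SUBJECT}> "{escape_literal(tag)}" .'
        for tag in tags
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Example page and tags are made up for illustration.
    print(tags_to_ntriples("http://example.com/spiders", ["arachnid", "silk"]))
```

Plain string literals are used for the tags here to keep the sketch short; a fuller version would more likely link each tag to a URI from a shared vocabulary, which is what makes the markup useful to other applications.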
A challenge that semantic search engines typically face is the ambiguity of generic content, e.g. Web pages found by Google. Human languages are inherently ambiguous, so “spider” can be construed as a biological species, a movie title, or a data processing system. Sources covering all of these meanings still share the first page of results. Not surprisingly, the long-awaited demo of Powerset turned out to be based on Wikipedia, a highly controlled and homogeneously written body of content, which simplifies the task of semantic analysis.
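The “spider” example can be illustrated with a toy disambiguation routine that picks the sense whose signature words overlap most with the surrounding text, in the spirit of a simplified Lesk algorithm. This is a sketch under stated assumptions rather than the method of any particular engine: the sense inventory and its signature words are invented for the example.

```python
import re

# Toy sense inventory for "spider"; the signature words are made up
# for illustration and would come from a dictionary or corpus in practice.
SENSES = {
    "animal": {"arachnid", "legs", "venom", "species", "silk", "prey"},
    "movie": {"film", "director", "cast", "plot", "cinema"},
    "crawler": {"crawler", "index", "pages", "search", "bot", "links"},
}


def disambiguate(context: str, senses: dict[str, set[str]] = SENSES) -> str:
    """Return the sense whose signature shares the most words with the context.

    Ties (including zero overlap everywhere) go to the sense listed first.
    """
    words = set(re.findall(r"[a-z]+", context.lower()))
    return max(senses, key=lambda name: len(senses[name] & words))


if __name__ == "__main__":
    print(disambiguate("The spider follows links to index pages for search."))
    print(disambiguate("The film's director and cast discuss the plot."))
```

On a narrow content vertical the inventory shrinks to the senses that actually occur there, which is one way to see why controlled content is easier to analyze than generic Web pages.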
In general, much higher quality can be achieved in content verticals and within an enterprise, where the content is narrowed down, and the taxonomy is defined or at least implicitly present. For example, check out the artery surgery summary. It is quite informative even though based on a generic Web search; yet within a medical portal it could have been of even higher quality.
The glamorous first page and beyond.
It feels intuitive that the first page of search results grabs the bulk of user attention. According to a recent study by iProspect, 68% of search engine users click a search result within the first page of results. One reason for this may be the good ranking of sources performed by search engines. However, another plausible reason is the narrowing attention span of users: exploring another page of results can be seen as a disproportionate burden. If the answer is not found on the first page, users may believe that the chances of finding it beyond that page are slim and not worth the time. So the search is abandoned, or the user settles for whatever answer was found on the first page, regardless of its quality.
A summary of the first page of results could be very useful for users, especially for informational queries (see examples in Part I). A quick, at-a-glance introduction to the topic through the summary may save time and also expose better, content-richer sources.
It is not surprising that businesses spare no expense on SEO to get their sites onto the first page, among those coveted 10 listings. For queries where the knowledge area is relatively narrow and structured, and the number of authoritative sources is small, this may be acceptable. But for less structured or highly dynamic content areas, it means that a valuable source may become effectively invisible to users just because it is ranked #11.
This is where semantic search engines can really shine, by scouting the back pages and extracting valuable items from them. SenseBot’s In-depth Search can go up to 10 pages of results (100 Web sources) deep and produce a summary of the most relevant sources. It is sometimes eye-opening to see a little gem of content from the 4th or 5th page of results take its proud place in a summary.
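The idea of scouting the back pages can be sketched in a few lines: take a deep list of results, score each source against the query by its content rather than by its original position, and keep the best ones. The sketch below is not SenseBot’s actual algorithm, which is not described here. The Result type, the term-overlap score and the sample data are assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Result:
    rank: int   # 1-based position in the original search engine listing
    url: str
    text: str   # page text, assumed to be already fetched

    @property
    def page(self) -> int:
        """Results page this source appeared on, assuming 10 listings per page."""
        return (self.rank - 1) // 10 + 1


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(query: str, text: str) -> float:
    """Fraction of distinct query terms that occur in the text (a crude score)."""
    terms = tokenize(query)
    if not terms:
        return 0.0
    return len(terms & tokenize(text)) / len(terms)


def pick_sources(query: str, results: list[Result], top_k: int = 3) -> list[Result]:
    """Keep the top_k sources by content score; original rank only breaks ties."""
    ordered = sorted(results, key=lambda r: (-relevance(query, r.text), r.rank))
    return ordered[:top_k]


if __name__ == "__main__":
    # Made-up sample data: the best match sits on the 5th page of results.
    results = [
        Result(1, "http://example.com/a", "General health portal"),
        Result(2, "http://example.com/b", "Surgery clinic opening hours"),
        Result(45, "http://example.com/c", "Artery surgery: recovery and risks"),
    ]
    for r in pick_sources("artery surgery", results, top_k=2):
        print(r.page, r.rank, r.url)
```

A real engine would replace the term-overlap score with a semantic one, but the design point stays the same: once sources are judged by content, a page ranked #45 competes on equal terms with one ranked #1.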
















