

Welcome to the offices of Dr. Hong Liang Qiao, Founder and CEO of Lexxe, Australia. Dr. Qiao has graciously offered to help us understand semantic search engines.
How can we understand and evaluate a semantic search engine?
Semantic search engines are gaining more attention now than ever. Dr. Riza Berkan, founder and CEO of Hakia, recently gave a very in-depth explanation of semantic search. Natural Language search technology (or Natural Language Processing; NLP) could see a huge breakthrough if adequate resources are available. In my view, the advancement of Natural Language search technology depends on a rich set of linguistic data constructed on a massive scale. It would be a finite set of linguistic data that could be completed within a few years, much like the decade-long Human Genome Project.
Intelligent Search Engines?
If this is eventually done, it would enable search engines to become intelligent enough to understand most of the questions people ask, retrieve more accurate answers, and return more relevant web results than today. A more humanised search interface, more accurate results and more efficient communication between humans and machines will give users a whole new experience, quite unlike today's. When people test semantic search engines these days, they tend to try them out superficially, often with just 3 or 4 queries. There is nothing wrong with that. However, some then draw quite serious conclusions immediately.
Where are we now?
Although everyone is free to do so, it is a bit unfair to semantic search engines, particularly since many are still in the Alpha or Beta stages. Given the current level of linguistic data support and the innovative performance of semantic search engines, one might start to re-think the way he/she sees new technology, hopefully with a more futuristic perspective. Semantic search may not satisfy everyone’s needs, but there is simply no doubt that this approach is the future of search technology, because queries and the information returned by search engines are mostly made up of language. Even video, image, sound, and many other materials are mostly searched via language, although I am aware that some image searches can be done through images alone.
First, let’s take a look at Key Word Search
For Key Word Search, one may perhaps consider the following issues:
1a) Are snippets helpful enough? Can you find what you want without opening new web pages?
1b) In the top ten results, how many snippets are useful and provide the information you want?
1c) If you do need to open the web results (because the snippets couldn’t help you), how many web result pages contain the information you want and in what position of the top ten?
1d) If the semantic search engine provides clusters and if they offer useful information that answers your query and saves you from extra work (e.g. from clicking open the web result links), semantic search engines should score some points here.
1e) Generally speaking, the less effort and time you spend finding the information, and the fewer clicks and the less reading you have to do, the better that search engine is.
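As a rough illustration of points 1b and 1c, the judgements for one query could be recorded and tallied as below. This is only a sketch: the `Result` record, its field names, and the two helper functions are assumptions made for the example, and the labels are expected to come from a human tester.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    """One hand-judged entry in a top-ten result list (hypothetical record)."""
    snippet_useful: bool  # the snippet alone provided the wanted information
    page_useful: bool     # the linked page contains the wanted information


def useful_snippet_count(results: List[Result]) -> int:
    """Point 1b: how many of the top ten snippets were useful on their own."""
    return sum(r.snippet_useful for r in results[:10])


def first_useful_page_rank(results: List[Result]) -> Optional[int]:
    """Point 1c: 1-based position of the first useful page, or None."""
    for rank, r in enumerate(results[:10], start=1):
        if r.page_useful:
            return rank
    return None


# Made-up judgements for a single query, for illustration only.
judged = [
    Result(snippet_useful=False, page_useful=False),
    Result(snippet_useful=False, page_useful=True),
    Result(snippet_useful=True, page_useful=True),
]
print(useful_snippet_count(judged), first_useful_page_rank(judged))
```

A higher snippet count and a lower rank for the first useful page both mean less clicking and reading for the user, which is the point of 1e.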
Next: Question Answering
Question Answering offers a more humanised way of getting information. It is more like human to human communication. Naturally it has greater potential to become users’ favorite way of information retrieval. What’s the difference between the Question Answering Systems of one search engine and another? How can we judge them?
2a) Short Answers vs. Long Answers:
Short answers aim at finding the exact piece of information a user wants, for example,
Q: When was Queen Elizabeth II born? A: April 21, 1926.
Not one word more, not one word less. “April 21, 1926” is the exact short answer to this question. Surprisingly, it is harder to retrieve than a long answer such as:
A: Queen Elizabeth II was born April 21, 1926 in London, England. Her father was King George VI and her mother was Queen Elizabeth.
The Question Answering System should first know how time is represented in the text and capture it with enough confidence. The concept of time is not hard for a computer system to master. Unrestricted identity, by contrast, is a big problem. For example,
Q: Who is Barney Pell? A: Founder and CEO of Powerset.
It is hard for a computer system to decide from which word to which word a correct answer can be identified in many sentences. The closer it can get, the more precise it can get, the better the system is. For systems that can extract short answers, finding longer answers at the sentence level is much easier. But it is not so the other way around. Finding short answers actually represents a higher level of technology in Question Answering Systems.
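To make the easier case concrete, here is a minimal sketch of pulling a short date answer out of a long answer sentence. It assumes dates written as "Month D, YYYY" only; the pattern and the function name `short_date_answer` are illustrative choices, not how any particular search engine works.

```python
import re
from typing import Optional

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Matches dates such as "April 21, 1926" (assumed format: Month D, YYYY).
DATE_PATTERN = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")


def short_date_answer(sentence: str) -> Optional[str]:
    """Return the first date found in the sentence, or None if there is none."""
    match = DATE_PATTERN.search(sentence)
    return match.group(0) if match else None


long_answer = "Queen Elizabeth II was born April 21, 1926 in London, England."
print(short_date_answer(long_answer))
```

No comparably simple pattern exists for an open-ended answer such as "Founder and CEO of Powerset", which is why unrestricted identity is the harder problem.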
2b) On-the-Fly from Web Pages vs. from Pre-extracted Databases
Some Question Answering Systems are supported by a very large pre-extracted database, and they are limited by the question and answer pairs stored in it. Others can extract answers from web pages on-the-fly. Those supported by databases tend to be more accurate if an answer is found in the database, but they are mostly restricted to definition-style questions. They may also need human editing and updating from time to time. Those without database support are less accurate, but cover a far wider range of questions. Suppose each sentence in a web page can generate, on average, two different question/answer pairs, and each web page has 50 sentences on average. Then for the 10 billion web pages indexed by most major search engines in the world, there should be 1 trillion question and answer pairs available to users. Even if current Question Answering Systems can only get 5% of the questions right, that is still a huge number: 50 billion questions/answers. Take Lexxe as an example: it asks users to pose “short” questions, for which Lexxe is more likely to return an acceptable answer. The total number of answerable questions will go far beyond the database approach. For example, it will answer questions such as:
Q: How long is a piece of string? A: 3 Feet.
Q: When did Colonel Sanders die? A: Dec. 16, 1980.
Technically, it is a lot harder to develop an on-the-fly Question Answering System that answers from web pages, but such systems will be more useful and do not need any human intervention or updating. They keep pace with changes to the web pages.
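The back-of-the-envelope estimate in 2b (two pairs per sentence, 50 sentences per page, 10 billion pages, 5% answered correctly) can be checked in a few lines. The constant names are chosen for this example, and the figures are the rough averages assumed in the text, not measurements.

```python
# Rough averages taken from the estimate in the text (assumptions, not data).
PAIRS_PER_SENTENCE = 2
SENTENCES_PER_PAGE = 50
INDEXED_PAGES = 10_000_000_000  # 10 billion web pages
ACCURACY_PERCENT = 5            # share of questions answered correctly

# Integer arithmetic is used throughout to avoid floating-point rounding.
total_pairs = PAIRS_PER_SENTENCE * SENTENCES_PER_PAGE * INDEXED_PAGES
answerable = total_pairs * ACCURACY_PERCENT // 100

print(f"{total_pairs:,} question/answer pairs in total")
print(f"{answerable:,} answerable at {ACCURACY_PERCENT}% accuracy")
```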
Search engines with Question Answering Systems that search web pages on-the-fly for answers are in the Alpha or Beta stages. Their technology is not yet mature, but it offers great potential in the near future. Powerset is evidently another good example, judging from the demo questions discussed in media reports. What such companies are building are search engines with a language-understanding brain, or human-like machines. Furthermore, such a system takes the entire Internet as its memory, which is far larger than any individual’s. Once the system is taught how to find answers to “What did someone do?” type questions, for example:
Q: What did JK Rowling write? A: Harry Potter.
The system, if the data source supply is adequate, should be able to answer all such kinds of questions, like:
Q: What did William Shakespeare write?
Q: What did Mark Twain write?
Likewise, one can change the question’s pronoun, the person, or the verb, to freely form new questions, such as:
Q: Who did Bill Clinton marry?
Q: Who did John Wilkes Booth assassinate?
2c) Syntactic Parsing
Some Question Answering Systems in search engines are still hindered by syntactic parsing, an automatic process that analyses the grammatical structure of queries and sentences in web pages. For example, quite a few search engines can answer:
Q: Who acquired IBM’s PC business? A: Lenovo.
The question does not need any syntactic transformation. But when it comes to:
Q: Who did Bill Gates marry?
The question needs to be transformed to “Bill Gates married”, in which the auxiliary verb (past tense) “did” is removed and “marry” is changed to “married”, before further search can be carried out with this pattern. Many search engines do not conduct such grammatical transformation, because they do not have a parser. Therefore, it is hard for them to get a correct answer.
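The transformation described above can be sketched for the narrow "Who/What did X verb?" case. This is a toy illustration under loud assumptions: it uses a regular expression rather than a real parser, and the irregular-verb table and the helper names `past_tense` and `to_search_pattern` are invented for the example.

```python
import re
from typing import Optional

# Hypothetical, deliberately tiny table of irregular past-tense forms.
IRREGULAR_PAST = {"write": "wrote", "do": "did", "go": "went", "make": "made"}


def past_tense(verb: str) -> str:
    """Naive past-tense rule: table lookup first, then simple suffix rules."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):
        return verb + "d"
    if verb.endswith("y") and len(verb) > 1 and verb[-2] not in "aeiou":
        return verb[:-1] + "ied"
    return verb + "ed"


# Only handles questions of the form "Who did <subject> <verb>?".
QUESTION = re.compile(r"^(?:Who|What) did (?P<subject>.+) (?P<verb>\w+)\?$")


def to_search_pattern(question: str) -> Optional[str]:
    """Drop the auxiliary "did" and put the main verb into the past tense."""
    match = QUESTION.match(question.strip())
    if match is None:
        return None
    return f"{match.group('subject')} {past_tense(match.group('verb'))}"


print(to_search_pattern("Who did Bill Gates marry?"))
```

A question such as "Who acquired IBM’s PC business?" does not match the pattern and returns `None`, reflecting that it needs no such transformation.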
2d) Question Types
Generally speaking, there are three types of questions in our communication. They are:
Interrogative Questions (those that begin with Who, What, Which, When, Where, Why and How);
Affirmative Questions (or Yes/No questions expecting a Yes or No answer); e.g. “Is Sydney the capital of Australia?”
Alternative Questions (with answers usually inside a question), e.g. “Is Sydney or Canberra the capital of Australia?”
There are certainly other sub-types of questions, but we’ll omit them here today. Most Question Answering Systems have been focusing on Interrogative Questions, but Lexxe has also touched on Affirmative ones. That work is still experimental and not yet public. I haven’t seen any system that deals with Alternative Questions so far.
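A crude way to tell the three question types apart is to look at the first word and check for “or”. The sketch below is only a surface heuristic with assumed word lists, not a description of how Lexxe or any other engine classifies questions.

```python
# Assumed word lists for the heuristic; a real system would need far more.
WH_WORDS = {"who", "what", "which", "when", "where", "why", "how"}
AUXILIARIES = {"is", "are", "was", "were", "do", "does", "did",
               "can", "could", "will", "would", "has", "have", "had"}


def question_type(question: str) -> str:
    """Classify a question by its first word and the presence of 'or'."""
    words = question.lower().rstrip("?").split()
    if not words:
        return "unknown"
    if words[0] in WH_WORDS:
        return "interrogative"
    if words[0] in AUXILIARIES:
        # An "or" between candidate answers signals an alternative question.
        return "alternative" if "or" in words else "affirmative"
    return "unknown"


for q in ("Who did Bill Clinton marry?",
          "Is Sydney the capital of Australia?",
          "Is Sydney or Canberra the capital of Australia?"):
    print(q, "->", question_type(q))
```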
2e) Evaluation of Question Answering
Basically, evaluating Question Answering should be quite similar to the way we evaluate Key Word Search results, apart, of course, from also checking whether the answers to the questions are correct.
Interestingly, from what I have observed, the web results from Question Answering are often more relevant than Key Word search.
Other Evaluation Issues
3a) Try to test a semantic search engine with both key words and questions;
3b) Try at least 20 sets of key words and 20 questions;
3c) Diversify the test with different kinds of key words and questions;
3d) Compare the same test across several search engines;
3e) Test the search engines every 2-3 months with the same set of questions to see if there is any improvement.
3f) Make sure you are fair in passing judgement, because many people tend to cast doubt over new technologies even before they test them. Or if you are an enthusiast of Semantic search, please don’t get over-excited before you analyze the results.
By and large, Semantic search or Natural Language search is a new technology, which will offer a better solution to search. We hope users and readers will give us their support and advice, which will help promote search technology innovation in the future.
Do you have a question or comment for Dr. Qiao? Leave a comment now!
















