Debate: Arabic / English Search



Every Tuesday night on AltSearchEngines, we invite two vertical search engines to discuss the similarities and differences between their projects. Tonight’s debate is one of the most interesting we have ever had: the exotic (to me) world of Arabic / English search, featuring two very respected search engines, Onkosh and Tayait.


1) For all of our readers who know English but not Arabic, please summarize the challenges that you have faced in building an Arabic / English search engine and how you solved them.

In order to build a search engine, you must first give your system the ability to understand the relationships between various factors, specifically the relationships between individual words and the overall structure of the language.

A computer’s ability to understand and process a language is based on Natural Language Processing (NLP). While many languages are now handled comprehensively by NLP technology, Arabic is one of the few major languages in which major headway is still being made today.

It is the nature and complexity of the language itself that has caused this delay. There are two major hurdles to overcome when applying NLP technology to Arabic:

Ambiguity: Arabic is what we call a highly inflected language, i.e. a single word can appear in a large number of inflected and derived forms, and many words have an incredible number of synonyms. Hence, during search, a single keyword may have many different meanings and interpretations.

Absence of Vowels: Unlike English, where vowels are letters in their own right, Arabic short vowels are marks placed above and below the letters (known as diacritical marks). This in and of itself is not a problem, but in written Arabic these marks are almost never used, and certainly not on the Internet. The pronunciation of a word is gleaned from the structure of the sentence and the inferred meaning of the word, which native speakers learn from childhood. Obviously, this lack of vowels in written Arabic makes the problem of ambiguity even worse.

Both of these realities create a level of intricacy that we, as humans comfortable with the language, have no problem understanding. When the language must be reduced to a rule-based system, however, they create a level of complexity that is very difficult to process.
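As a rough illustration of the missing-vowel problem, the Python sketch below removes Unicode combining marks (which include the Arabic short-vowel signs) and shows two differently vocalized words collapsing into the same written form. It is only a minimal sketch, not either engine's method; the helper name `strip_diacritics` and the two sample words are assumptions chosen for the example.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining marks (category Mn), which covers Arabic short-vowel signs."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# Two different words when the vowel marks are written out:
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "kataba", he wrote
kutub = "\u0643\u064F\u062A\u064F\u0628"         # "kutub", books

# Without the marks, both reduce to the same three letters (kaf, taa, baa).
print(strip_diacritics(kataba) == strip_diacritics(kutub))
```

A search engine that sees only the unmarked three-letter form has to decide from context which of these readings, among others, the user meant.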

In order to solve these problems, we partnered with an international company that has dedicated part of its R&D to Arabic language processing for the last 15 years, and brought in several of our own NLP specialists to develop our own NLP module, tama (taya Arabic Morphological Analyzer). Using tama, we can choose to search all the inflections, derivatives, and possible synonyms of a word. We can also search for a term in English and return all the Arabic results related to that term.
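The general idea of expanding a query into related forms, and of accepting an English term for an Arabic search, can be sketched with a toy lookup table. Everything here is hypothetical: the two dictionaries, their contents, and the function name `expand_query` are made up for illustration and say nothing about how tama actually works. A real morphological analyzer derives these forms by rule rather than listing them.

```python
# Toy lexicon: hypothetical data for illustration only, not taken from tama.
# Each Arabic base word maps to a few related surface forms.
RELATED_FORMS = {
    "كتاب": ["كتاب", "كتب", "الكتاب", "كتابي"],  # book, books, the book, my book
}

# Hypothetical English-to-Arabic glossary for cross-language queries.
EN_TO_AR = {
    "book": ["كتاب"],
}

def expand_query(term: str) -> list[str]:
    """Return the Arabic forms to search for, given an Arabic or English term."""
    # An English term is first mapped to Arabic base words; anything else is kept as is.
    bases = EN_TO_AR.get(term.lower(), [term])
    expanded: list[str] = []
    for base in bases:
        for form in RELATED_FORMS.get(base, [base]):
            if form not in expanded:
                expanded.append(form)
    return expanded

print(expand_query("book"))
print(expand_query("كتاب"))
```

Both calls return the same four Arabic forms, so an English query and its Arabic equivalent would retrieve the same set of documents in this simplified picture.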

The main challenge lies in the fact that the Arabic language is much more complex than languages written in the Roman alphabet.

The Arabic language has a complex script and complex rules. In addition, there are marks that are usually not written (called diacritics or ‘tashkeel’), which can alter the way a word is pronounced and give it a completely different meaning and root. English, by comparison, is quite a ‘straightforward’ language.

This makes it hard to judge the relevancy of Arabic search results. At Onkosh, these issues are handled in a smart way that maintains accuracy and, most importantly, keeps search time well under a second for 92% of queries, even under a high load of concurrent queries.

It is worth mentioning that the ‘giant’ search engines mostly treat Arabic blindly (i.e. by exact match). This gives the user seeking relevant Arabic information only exact-match results, rather than relevant ones.
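To see why exact matching falls short, consider the hedged Python sketch below. It applies a few very light normalization rules before comparing a query with a word in a document. The rules and the function name `normalize` are assumptions made for this example; they are not Onkosh's approach, and production systems use far more careful analysis.

```python
import re

def normalize(word: str) -> str:
    """Very light Arabic normalization (illustrative rules only)."""
    word = re.sub("[\u064B-\u0652]", "", word)              # drop short-vowel marks
    word = word.replace("\u0640", "")                       # drop tatweel (elongation)
    word = re.sub("[\u0622\u0623\u0625]", "\u0627", word)   # alef variants -> bare alef
    if word.startswith("\u0627\u0644") and len(word) > 4:
        word = word[2:]                                     # strip the definite article "al-"
    return word

query = "\u0643\u062A\u0627\u0628"                          # "kitab", book
document_word = "\u0627\u0644\u0643\u0650\u062A\u0627\u0628"  # "al-kitab", the book, with one vowel mark

print(query == document_word)                        # exact match
print(normalize(query) == normalize(document_word))  # normalized match
```

The exact comparison fails because the document word carries a definite article and a vowel mark, while the normalized comparison succeeds. The article-stripping rule is deliberately crude and would mis-handle words that merely begin with those two letters.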

Another main challenge is auto-identifying the Arabic-related portion of the web (index coverage). Onkosh tries as much as possible to include not only the Arabic-language pages, but also the ‘Arabic-related’ content in other languages (especially English and French). Building a smart and Arabic-oriented crawler was in itself very challenging.
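One simple signal a crawler could use for the Arabic-language part of this task is the share of letters on a page that belong to the Arabic Unicode block. The sketch below is an assumption-laden illustration: the function names and the 0.5 threshold are hypothetical, and it is not a description of the Onkosh crawler.

```python
def arabic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the basic Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = [ch for ch in letters if "\u0600" <= ch <= "\u06FF"]
    return len(arabic) / len(letters)

def looks_arabic(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical rule: call a page Arabic if most of its letters are Arabic."""
    return arabic_ratio(text) >= threshold

print(arabic_ratio("Cairo news"))
print(looks_arabic("\u0623\u062E\u0628\u0627\u0631 \u0627\u0644\u0642\u0627\u0647\u0631\u0629"))  # "Cairo news" in Arabic
```

A script-based test like this only finds pages written in Arabic. Recognizing ‘Arabic-related’ pages in English or French needs other signals, such as topic or links, which is the harder part of the problem described above.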

2) We (in the U.S.) hear a lot about the U.S. and Chinese markets (Baidu, etc.) Where would you place the Arabic market for Search in a global context? How large is it (how many users)? Is it growing, at what rate, etc.?

Well, the easiest way to look at the Arabic market and compare it to other language-specific markets around the world is to look at the total number of Internet users currently online and use historical growth rates to extrapolate how this number will continue to grow.

Historically, the Chinese market has been considered one of the largest developing markets in the world. Within the last seven years (2000 – 2007), it has seen an increase of approximately 620% in its total number of Internet users, going up from 22.5 to 162 million users. Japan, another market considered to have massive potential, has had an increase of about 86%, going up from 47 to 87.5 million users online.
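For readers who want to check figures like these, the arithmetic can be sketched in a few lines of Python. Percentage increase is (end − start) / start × 100, which is exactly 100 points lower than the end value expressed as a percentage of the start value, so it helps to be clear which one is being quoted. The function names below are assumptions for this example, and the extrapolation assumes a constant annual rate, which real markets rarely keep.

```python
def percent_increase(start: float, end: float) -> float:
    """Growth relative to the starting value, in percent."""
    return (end - start) / start * 100

def annual_growth_rate(start: float, end: float, years: int) -> float:
    """Compound annual growth rate, in percent."""
    return ((end / start) ** (1 / years) - 1) * 100

def extrapolate(value: float, annual_rate_percent: float, years: int) -> float:
    """Project a value forward, assuming the annual rate stays constant."""
    return value * (1 + annual_rate_percent / 100) ** years

# User counts in millions for 2000 and 2007, as quoted in the answer above.
china = percent_increase(22.5, 162)
japan = percent_increase(47, 87.5)
print(round(china), round(japan))

# Average yearly growth implied by the Chinese figures over seven years.
print(round(annual_growth_rate(22.5, 162, 7), 1))
```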

The interesting thing about the Arabic market is that it cannot be tied to a single country, as Arabic is spoken to varying degrees throughout an entire region of countries (including not only the Middle East but North Africa as well). The North African region (specifically Morocco, Algeria, Egypt, and Libya) has seen a growth rate of almost 2500% in the total number of Internet users in the last seven years.

The Middle East (including Jordan, Kuwait, Lebanon, Saudi Arabia, Syria, and the UAE) has seen another 700% increase. This comes out to an estimated 22 million users online today. Though this total may not currently compare to that of China or Japan, this is a market with incredible potential, and we only expect the adoption and penetration rates of Internet technology to increase in the future.

[Same question]  A lake, small by nature, can be nothing within a huge ocean. This is the case for the Arabic web as part of the global Internet.

We can safely offer a few estimates based on semi-scientific observation and analysis:

First: the Arabic web is currently only somewhere around 200-300 million pages in size.

Second: the growth rate is very aggressive.

Third: a big portion of the old Arabic web is not SEO-friendly. This is getting fixed as sites are added or revamped, since SEM has now started to get webmasters’ attention in the Middle East and North Africa (MENA) arena, where our target audience primarily resides. Of course, we also target all Arabs and Arabic-speaking users around the globe, most of whom live in the U.S., Great Britain, and Canada.

Referring to the Internet statistics website at the time of writing (Nov. 18, 2007), we have the following growth rates in Internet penetration. The analysis below consistently indicates parallel growth in the ‘related’ content over the years 2000-2007 (please compare with the total growth of the world, which is 244.7%):

2647% growth in North Africa (stats calculated based on the top six Arab African countries: Egypt, Algeria, Morocco, Sudan, Tunisia, and Libya – Excel sheet attached).

Internet Usage Statistics for Africa

920% growth in the Middle East: although the region has only about 2.9% of the world’s population, its Internet penetration is growing much faster than that of the rest of the world. The Middle East is not limited to the Arab countries, but the general indicator does the job.

3) You both have the keyboard icon near your search box. Could you explain how it is used, and also, for someone like me, can you explain the term “Bel-3araby” to a non-Arabic user?

Here in Egypt, as in the rest of the Arabic-speaking world, when you buy a keyboard, laptop, cell phone, etc., the Arabic alphabet is printed alongside the normal QWERTY layout. This is the extent to which the Arabic language is ingrained in the community: the majority of SMS messages are even sent in Arabic, on specially manufactured phones with Arabic letters printed on their keypads.

That being so, Tayait is not designed to be used only by users living within the Arab world. We realize that there are a huge number of Arabic speakers around the globe who would love to have access to Internet content in their native language, yet who may not have an Arabic keyboard. We therefore offer our online Arabic keyboard as a means through which users can quickly and effectively input search terms of their choice, whether they have an Arabic keyboard or not.

We realize, though, that using an online keyboard can sometimes be cumbersome for entering search terms, especially for power searchers who run many different queries to try to get the best result. This is why we offer our Cross Language Functionality: users can type the desired search term in plain English on a normal English QWERTY keyboard, and tama takes care of returning all the relevant Arabic search results for that English query.

[Same question] The keyboard usage is quite simple. Once you click the icon, your Arabic queries are a few clicks away, even if your computer does not support Arabic at all. This also helps users who find typing in Arabic difficult or slow. Additionally, it is ideal for Arabs living in the U.S. and other places where Arabic keyboards are very rare.

As for the feature “Bel-3araby”: this is, first of all, an Arabic word pronounced ‘bel-a’araby’, meaning ‘in Arabic’. The feature is patent-pending, and the term “Bel-3araby” itself is copyrighted to Onkosh. On that note, I am very proud to say that I am one of the five co-inventors of the patent.

This feature enables you to use your Latin-based keyboard to write Arabic words the way they are pronounced, using the Roman characters and numerals that have become very popular in people’s chats and mobile messaging. In short, Bel-3araby is a transliteration service from Romanized alphanumeric input to Arabic output. It is an intelligent service employing lots of careful heuristics and AI techniques tailored from the ground up to understand Arabic users’ needs. (You may refer to this previous post.)
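To give non-Arabic readers a feel for this kind of input, here is a deliberately naive Python sketch of chat-alphabet transliteration. It is not the Bel-3araby algorithm, whose details are not described here. The mapping table, the rule of dropping short vowels, and the function name `naive_transliterate` are all simplifying assumptions made for this example.

```python
# Common chat-alphabet conventions; the table and the vowel rule below are
# simplifying assumptions made for this sketch.
CHAT_TO_ARABIC = {
    "2": "\u0621",  # hamza
    "3": "\u0639",  # ain
    "5": "\u062E",  # khaa
    "7": "\u062D",  # haa
    "b": "\u0628", "t": "\u062A", "d": "\u062F", "r": "\u0631",
    "s": "\u0633", "k": "\u0643", "l": "\u0644", "m": "\u0645",
    "n": "\u0646", "y": "\u064A", "i": "\u064A",
}
SHORT_VOWELS = set("aeou")  # usually unwritten in Arabic, so dropped here

def naive_transliterate(word: str) -> str:
    """Map each Latin letter or digit to one Arabic letter, skipping short vowels."""
    out = []
    for ch in word.lower():
        if ch in SHORT_VOWELS:
            continue
        out.append(CHAT_TO_ARABIC.get(ch, ch))
    return "".join(out)

# "3araby" becomes the four letters ain, raa, baa, yaa: the word for "Arabic".
print(naive_transliterate("3araby"))
```

A one-letter-at-a-time rule like this gets many words wrong, for example where a vowel should be written as a long vowel or where two Latin letters stand for one Arabic sound. Handling those cases is presumably where the heuristics mentioned above come in.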

Many attempts are now under way to imitate “Bel-3araby”, after its importance was recognized at Onkosh.

4) Are there positive things that you see in the other search engine in this debate? Are you competing for exactly the same users, or are there some differences in either a) your objectives or b) your approaches that you could share, as far as you understand your debate partner?

The fact that there are other Arabic search engines out there is in and of itself a positive thing. We believe that Arabic speakers have been held back long enough from being able to effortlessly search the Internet without having to learn a second language. In that pursuit, it’s incredibly beneficial for all parties to have a variety of companies and people trying out different things in order to provide the best quality search results for Arabic Internet users all around the world. Though the specifics of Onkosh’s objectives are known only to them, I think we would both agree that the Arabic Internet has incredible potential, if given the infrastructure to thrive. Offering search in Arabic is one of the most critical parts of that infrastructure, but there is much more to come.

We differentiate ourselves and our search results not only by using excellent search technology, but also by having a team of individuals who identify the best quality Arabic websites for the most common search terms to be crawled, in the hope of providing our users with the highest quality content on the Arabic Internet today. This means that we actively ensure that the most active blogs, news sites, Wikipedia, and a whole host of other Arabic websites are always crawled and indexed, and we continue to add to and develop our database daily using the best content we can find.

Yes, definitely. They are doing an excellent job of integrating with Exalead, which I find a very good engine indeed. Tayait also has Arabic synonym search, which is not yet public in our release. Their product Tama has been around for a long time, which indicates good performance and reliability. Our audiences overlap for sure, though I am not sure about their preferred segment. Since the Internet has no boundaries, Onkosh defines its audience as all those who use, understand, and/or are interested in Arabic-related content around the globe.

5) I find both of your projects very impressive; are you the two dominant Arabic / English search engines, or are there others, perhaps less well known ones, that we should also know about that are also good?

Tayait and Onkosh have been in the news quite recently because we both came out in full force at around the same time. But we haven’t come from a completely nonexistent industry: there are other companies that have been offering Arabic search for a while now. These firms include, but are not limited to:

Araby.com, Arabo.com, Ayna.com, Ajeeb.com, and 4arabs.com

Once again – we think it’s great that there are other like-minded individuals working to provide better services to the online Arabic speaking community.  

Where do you hope to be two years from now?

In the next two years, we obviously plan to devote considerable effort to constantly improving our results, as search is our number one priority. But it doesn’t stop there: we want to provide Arabic Internet users with the same range and quality of services available on the English Internet and in the rest of the world. We know it’s a considerable goal, but it’s something we believe in and are willing to work towards.

There have been attempts to build an Arabic search engine. Two have succeeded in building a good audience: araby.com and ayna.com. We are aware of some other projects that have been announced but not yet released, like Sawafi, whose latest news said it will be renamed and launched ‘soon.’

Expectation for 2010: Two years seems a long time span in the fast-moving SE industry and science. What I can safely promise your respected readers is that Onkosh will be aggressively enhancing its services over the coming few months. We aim to be the number one local search engine, and to be the most reliable engine for the Arabic-interested user in general. We are very optimistic, and we hope we can help build a better Arabic web.

Onkosh has a mission to help Arabic-speaking users start sharing and using the Arabic language in search. We are comprehensive in our services and do not depend only on the basic form of web search. Onkosh offers other distinctive search flavors including, but not limited to: news, blogs, forums, and files. Not to forget, Onkosh also has a family filter for safe search, in addition to Onkosh.mobi, which brings the Arabic web to your handheld device. At Onkosh, we are sincerely happy to see others recognize Onkosh as a ‘role model’ in Arabic search. We believe we did a good job, and we have received a lot of positive feedback that keeps us motivated to challenge ourselves even further and work around the clock on more quality features, and we will continue to raise the bar!

AltSearchEngines: We owe a great debt of thanks to Hany at Onkosh and Noha at Tayait for all of the time and effort that they donated to this debate. If you found it useful, I hope you will print it or email it to anyone else who you think might benefit from this detailed discussion of Arabic search engines. [but please link back to this post]

Of course, I also encourage you to try both of them, Onkosh and Tayait, today, to see their great features for yourselves. If you have a question or comment for Onkosh or Tayait, please leave a comment and we will ask them to check back and respond as they have time.

9 Responses to “Debate: Arabic / English Search”

  1. ahmed Says:

    I think Onkosh has several features which are not included even in Google or Yahoo.
    Onkosh is the first one to talk about building a query by directory, and the bel3arby section is a great one.

  2. Liz Says:

    Excellent post… Very niche… I love these debates.

  3. Nazmi Says:

    I have been through both and I am actually impressed by the features offered by tayait which is a first in Arabic Search Engines. I wish both a lot of success and we will keep an eye on both.

  4. Bassem khairy Says:

    Actually Onkosh provides a very smart tool for understanding Arabic written in English letters, which is now widely used in everyday Arabic Internet usage between people,

    and boooom, you can also search using this slang smoothly.

    Another thing: it facilitates searching for Arabic words for people from other cultures who don’t speak Arabic or don’t even have it on their OSes. The only thing you have to know is how to spell it and write it in English.

    I totally love it so much. It does save a lot of effort for me.

  5. Amir Boles Says:

    Frankly, I see that tayait.com has wonderful functionality and more relevant results than Onkosh, especially as it is better at understanding and dealing with Arabic content.
    One of the features I like most on tayait.com is the refine search. It really narrows our search, and this feature is not included in other Arabic search engines, or even Google.
    Another one is the great way it merges directory search with text search; for example, it’s possible to search for websites that talk about the economy within Middle East websites ;-)
    Its interface is better than Onkosh’s for me; being able to see result thumbnails saves a lot of time for me.
    Tayait.com has a very good advanced search that can narrow your search dramatically with little customization.
    Really, I love it…

  6. prege Says:

    nice

  7. Aziz Says:

    The problem is not really with the “Arabic” search engines per se, it’s more the “Arabic” content on the Internet. I’ve seen all Arabic search engines trying to tackle the lack of content (which sums up pretty much to forums, blogs, and bootleg DVDs/CDs/MP3s).

    I love the way Araby.com tried to “cover” this short by adding Video search and try to include , and love specially what Ayna.com did, adding maps in Arabic (not just interface but actual street level names in Arabic) offers something “new”

    As for the search engines per se, Araby and Ayna both use Fast ESP platforms/engines, and tayait.com uses Exalead, which is a real shame; we have not yet seen a full-on Arabic search engine with its own crawlers.

    Araby.com being a $$$ driven company with a huge capital, they have the financial capital to hire/invest/research their truly own search engine/crawler, but again, Ayna.com was officially the first search engine in arabic (not a directory search like most arabic websites “SE”)

    features wise, Ayna.com beats the hell out of them, but bug levels and speed, they all beat Ayna.com.

    Interesting market, interesting “war”, but it will all go to waste unless we start having “quality” Arabic content and not just forums and blogs with little to no real informative value!

    Game on!

  8. Qais Says:

    Just to clarify: these (Arabo.com, Ajeeb.com, and 4arabs.com) are not search engines, they are web directories only.

  9. Usama Nada Says:

    Will it be Taya IT, Onkosh, Arabi, Ayna, Yamli or whatever other search engine that is the winner?
    Let the market and final users decide.

    Actually I am biased to taya it engine, (obviously cause i am part of the search line team at taya it)
    Taya IT is working on enhancements, Developments and investing more in the search line.

    ISA We will be the winners :)
