Federated search facilitates research by helping users find high-quality documents in more specialized or remote corners of the Internet. Federated search applications excel at finding scientific, technical, and legal documents whether they live in free public sites or in subscription sites. This makes federated search a vital technology for students and professional researchers. For this reason, many libraries and corporate research departments provide federated search applications to their students and staff.
To really understand what federated search is and how it works, we should first provide some background.
Crawling the Web: How Typical Web Search Engines Work
There are two basic approaches to finding content on the Web. The approach that Google and all major search engines use is to “crawl” the Web. Google, over many years, has amassed a list of billions of Web sites. In the early days, it’s likely that Google learned about many Web sites when owners registered their sites with them. Today, Google can find new Web sites through links from sites it already knows about. Google periodically visits the sites (and the sites’ pages) on its list and identifies the links at that site. It then follows each link it finds to arrive at other pages where it starts the process over to find more links. In doing this, Google discovers sites it didn’t know about during previous visits. This process of going from one page to another and then to another is referred to as “crawling,” just like a spider crawls from one thread to another in its web. In fact, Web “spiders” are commonly referred to as “Web crawlers.” When you create a new site, just create a link to it from another site, or get someone to do it for you, and Google’s crawler will discover it.
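To make the link-following idea concrete, here is a minimal sketch of a crawler. It is not how Google's crawler is actually built; it assumes a tiny, made-up link graph held in memory (the `LINKS` dictionary and every `*.example` address are hypothetical) so that it runs without touching the network.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
    "orphan.example/": ["a.example/"],  # no page links TO this one
}


def crawl(seeds, links):
    """Breadth-first crawl: visit known pages, follow their links, repeat."""
    discovered = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in discovered:
                discovered.add(target)
                queue.append(target)
    return discovered


found = crawl(["a.example/"], LINKS)
print(sorted(found))
```

Starting from a single known site, the crawl reaches every page that some chain of links leads to. The page `orphan.example/` is never discovered, because nothing links to it.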
The trouble with crawling is that this search technique doesn’t find everything. One might believe that through sufficient crawling, one could find all Web pages. In fact, only a small percentage of the Web’s content is accessible to Google. The term “deep Web” refers to the vast portion of the Web that is beyond the reach of the typical “surface Web” crawlers. Surface Web search engines like Google can’t easily fathom the deep Web because most deep Web content has no links to it. How can that be? Consider this example: Let’s say that you are researching the effects of some chemical or hazardous substance on humans. You would be well advised to search the National Library of Medicine’s Toxicology Data Network. Most of the information you would find there you would not find via Google. Why? Because to find the research articles, you would type one or more words into a search box and click the “search” button. Few, if any, of the articles you would find have links to them from any Web site. Google can’t find those articles because Google isn’t designed to fill out search forms and click “submit” the way humans do. In particular, Google wouldn’t know what search words to put into the form. Additionally, even if Google did know what to enter into search forms and how to submit them, Google wouldn’t be able to retrieve all of the documents from the source. This would leave Google with very incomplete content from deep Web sources.
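The incompleteness problem can be illustrated with a toy model. The sketch below is an illustration under stated assumptions, not a description of any real source: `FormOnlySource`, its documents, and the cap on results per query are all invented. It shows a program guessing search words and still missing part of the collection.

```python
class FormOnlySource:
    """Toy deep-Web source: documents are reachable only through search()."""

    def __init__(self, documents, max_results=2):
        self._documents = documents
        self._max_results = max_results  # assumed per-query cap

    def search(self, word):
        hits = [d for d in self._documents if word.lower() in d.lower().split()]
        return hits[: self._max_results]


documents = [
    "benzene exposure in factory workers",
    "benzene and leukemia risk",
    "benzene metabolism in humans",
    "toluene inhalation case study",
    "asbestos fibers and lung tissue",
]
source = FormOnlySource(documents)

# A crawler would have to guess which words to type into the form.
guessed_words = ["benzene", "toluene"]
retrieved = set()
for word in guessed_words:
    retrieved.update(source.search(word))

coverage = len(retrieved) / len(documents)
print(f"retrieved {len(retrieved)} of {len(documents)} documents")
```

Two things go wrong in this toy case: the asbestos article is never found because nobody guessed the right word, and one benzene article is cut off by the result cap.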
What Makes Federated Searches Different? It’s About the Search Forms
While in most cases, Google doesn’t fill out search forms, this is exactly what federated search applications (also known as federated search engines) do. Why doesn’t Google fill out forms? It turns out that filling out forms is a difficult problem. Federated search engine builders have to customize their search software for each Web form they encounter. While Google has a general approach to crawling links from any Web site, federated search engines are programmed with intimate knowledge of each search form. The specialized software must know not only how to fill out the form and how to simulate the pressing of the “search” button, but also how to read the results that the Toxicology Data Network (as in the example above), or any other source, provides. Both submitting the form and reading the results are difficult to do well.
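Here is a rough sketch of what that per-source knowledge might look like in code. It is only an illustration: the `Connector` class, the `https://source.example/search` address, the `q` field name, and the `class="result"` markup are all assumptions, and a real source would need its own values for each. The sketch builds the request that a form submission would send and reads results out of a canned results page rather than fetching anything.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class ResultLinkParser(HTMLParser):
    """Collects (title, href) pairs from <a class="result"> elements."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "result":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.results.append(("".join(self._text).strip(), self._href))
            self._href = None


class Connector:
    """Per-source knowledge: where the form submits and what its field is called."""

    def __init__(self, action_url, query_field):
        self.action_url = action_url
        self.query_field = query_field

    def build_request_url(self, query):
        # Equivalent to typing the query into the form and pressing "search".
        return self.action_url + "?" + urlencode({self.query_field: query})

    def parse_results(self, html):
        parser = ResultLinkParser()
        parser.feed(html)
        parser.close()
        return parser.results


connector = Connector("https://source.example/search", "q")  # hypothetical source
url = connector.build_request_url("benzene exposure")

# Canned results page standing in for what the source would send back.
sample_page = (
    '<ul><li><a class="result" href="/doc/1">Benzene and health</a></li>'
    '<li><a href="/help">Help</a></li></ul>'
)
results = connector.parse_results(sample_page)
print(url)
print(results)
```

Every source lays out its form and its results page differently, so each one needs its own connector settings and often its own parsing rules. That is where the customization effort goes.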
The Benefits of Federated Search
The essential benefits of federated search to its users include efficiency, quality of search results, and current, relevant content.
Efficiency, Time Savings
Using a federated search engine can be a huge time saver for researchers. Instead of needing to search many sources, one at a time, the federated search engine performs the many searches on the user’s behalf. While federated search engines specialize in finding content that requires form submissions to retrieve, that isn’t the only criterion for being a federated search engine. A federated search engine also brings together content from different sources: it uses just one search form to cover numerous sources, and combines the results into a single results page.
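The one-form, many-sources idea can be sketched in a few lines. This is a simplified illustration, not a real product's design: the two sources are in-memory stand-ins with invented names and titles, and dropping duplicate titles during the merge is an assumption about how results might be combined.

```python
from concurrent.futures import ThreadPoolExecutor


def make_source(name, documents):
    """Build a stand-in search function for one hypothetical source."""
    def search(query):
        return [
            {"title": title, "source": name}
            for title in documents
            if query.lower() in title.lower()
        ]
    return search


SOURCES = [
    make_source("library-catalog", ["Solar cell efficiency", "Wind turbine design"]),
    make_source("patent-db", ["Solar cell efficiency", "Solar panel mounting bracket"]),
]


def federated_search(query, sources):
    """Send one query to every source at once, then merge into a single list."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        result_lists = list(pool.map(lambda search: search(query), sources))

    merged, seen = [], set()
    for results in result_lists:
        for item in results:
            key = item["title"].lower()
            if key not in seen:  # keep the first copy of a duplicate title
                seen.add(key)
                merged.append(item)
    return merged


hits = federated_search("solar", SOURCES)
for hit in hits:
    print(hit["source"], "|", hit["title"])
```

The user types the query once. The searches run side by side, and the title that both sources return appears only once in the combined page.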
Quality of Results
Federated search engines show their value best in environments in which the quality of results matters, such as libraries, corporate research environments, and the federal government. In the case of the federal government, the constituents of the government benefit greatly from such applications. A major difference between a federated search engine and a standard search engine like Google is that the client who contracts for the federated search service selects the sources to search. In almost every case, the sources will be authoritative. Google, on the other hand, has minimal criteria for source selection. If a Web page doesn’t look like outright junk (i.e., spam), Google will present it among the search results. Thus, the federated search engine acts as a helpful librarian does, directing users to high-quality sources.
Most Current Content
In addition to filling out forms and combining documents from multiple sources, another important benefit of federated search engines is that they search content in real time. Real-time data is crucial for researchers who are searching for up-to-the-minute content or for content that changes frequently. As soon as the content owner updates their source, the information is available to the searcher on the very next query.
By contrast, with standard search engines such as Google, the results are only as current as the last time that Google crawled sites with content that matches your search words. Content you find via Google might be days or weeks old, which can be fine depending on your situation, but can be problematic if you want the most current information.
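The difference in freshness comes down to when the content is read. The toy model below is a sketch with invented classes and titles (`LiveSource`, `CrawledIndex`, and the flu reports are all hypothetical). It contrasts querying a source directly with querying a copy made during an earlier crawl.

```python
class LiveSource:
    """Stand-in for a content owner's searchable collection."""

    def __init__(self):
        self.documents = []

    def publish(self, title):
        self.documents.append(title)

    def search(self, word):
        return [t for t in self.documents if word.lower() in t.lower()]


class CrawledIndex:
    """Stand-in for a search engine that answers from its last crawl."""

    def __init__(self, source):
        self._source = source
        self._snapshot = []

    def recrawl(self):
        self._snapshot = list(self._source.documents)

    def search(self, word):
        return [t for t in self._snapshot if word.lower() in t.lower()]


source = LiveSource()
source.publish("Flu outbreak report, week 1")

index = CrawledIndex(source)
index.recrawl()  # the crawler visits once

source.publish("Flu outbreak report, week 2")  # owner updates after the crawl

live_hits = source.search("flu")    # what a real-time query sees
indexed_hits = index.search("flu")  # what the crawled copy sees
print(len(live_hits), len(indexed_hits))
```

The week 2 report shows up in the live query immediately. The crawled copy does not include it until `recrawl()` runs again.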
Continued in Part II tomorrow.
Questions? Leave a comment for Sol.
Sol Lederman is the primary author of the Federated Search Blog, a blog sponsored by Deep Web Technologies and dedicated solely to the federated search industry. He also writes for the U.S. Department of Energy’s Office of Scientific and Technical Information (OSTI) Blog, primarily covering OSTI’s accomplishments and technologies. Sol’s first love is mathematics; he enjoys giving away prizes to people who can solve math problems that he presents through his personal blog, Wild About Math!.
You can read his series on Federated Search on AltSearchEngines here for Part I and here for Part II.

January 11th, 2009 at 10:44 am
Very edifying. This essay should be required reading in all library science programs (like the one I am enrolled in at the University of Pittsburgh). This is an outstanding primer on a subject that should be better understood by anyone interested in search.
January 11th, 2009 at 1:11 pm
Very interesting article. I’m looking forward to the next article. As a university student in computer science and information systems, this subject is both endlessly fascinating and endlessly frustrating to me. I have used several of the search engines available through the University of Wisconsin system. They can indeed find information which is difficult to obtain through other means; however, the user experience is uniformly terrible. This is a field which would greatly benefit from more attention to the user interface, and the presentation of results.
January 12th, 2009 at 7:02 pm
toxnet.nlm.nih.gov has a robots.txt which bans searching.
January 14th, 2009 at 6:06 am
Let me put a more philosophical perspective on the issue. What would be an ideal “search engine”? I’d say the one that finds my answers. Either a simple phone number (at one end of the spectrum) or the meaning of life (on the other EXTREME end). Questions like “what are the investments in biotechnology in Europe in 2008” fall between the extremes. My ideal search engine is an expert. An expert has the necessary knowledge and can communicate with me.
Of course nobody wants to fill out forms. We want answers! Scientifically correct, comprehensible, complete, fast and with a probability of correctness (physicists call it error bars). I think it is just a matter of time until the deep web will also be “crawled”. Filling out forms by using APIs (application program interfaces) should not be so difficult. And what keywords to search for is also clear: the keywords given to e.g. Google. I don’t see this as a major problem. The major problem is the lack of semantics behind “keywords”.
Keyword search is not enough. Searching for “heart diseases” leads (in PubMed.org) to about 50,000 results. Considering “all known” concepts (to an expert – or GoPubMed.com using MeSH = Medical Subject Headings) one will find 850,000 results. Ok, no one can/will screen 50,000 or even 850,000 results. It again needs expert knowledge to drill down to let’s say “Heart Diseases” and “Early Growth Response Protein 1” (known to experts). The remaining 23 articles (found with GoPubMed) are highly relevant, scientifically correct, complete (as the background knowledge is) and the results are delivered with 6 mouse clicks. Fast. Try to find these articles with one of your favorite search engines.