Factery Labs Launches New Real-Time Fact Engine

January 26th, 2010 by Charles S. Knight
Posted in News, Realtime | No Comments »

Factery Labs, a search technology company, just announced the launch of Factery Labs’ Real Time Fact (RTF) engine. Delivering facts instead of links, Factery Labs provides users with a simple way to find out what’s going on and what people are talking about on the web. Users can personalize their view to get real-time information that is of particular interest to them.

“This is what real-time search is really about. We developed the Real Time Fact Engine because we wanted a window into what’s happening all around the web, and there wasn’t a central place that captured everything,” said Paul Pedersen, founder and president of Factery Labs. “Other services will give you a list of high volume queries but they don’t tell you what it means or why it’s interesting or important. What people are talking about is constantly changing and Factery Labs lets you figure out at a glance what is happening now.”

Factery Labs works in two different ways:
Users can choose from pre-defined categories like “sports,” “politics,” “world,” “entertainment,” or set up their own topic of interest by typing any term into the search box. Each search will continuously refresh with the best and latest facts, allowing users to stay completely up to date on what’s happening on the web. Users can then easily share facts via Facebook, Twitter or email.

Factery Labs also provides a “Trends” category for trending topics defined algorithmically by checking Twitter and Google. Factery Labs supplies the meaning behind these trends by exposing the facts found on pages recommended by Twitter users on these topics.

Users can explore facts using a search panel on the left hand side of the page. Any search can be ‘favorited’ and then becomes a tracking query – basically a personal trend that users can monitor each time they return to Factery Labs. Trends appear on the right hand side in a collection of continuously updating trend result panels.

For example, a user particularly interested in the Winter Olympics could type in searches for “Whistler Village”, “ice skating”, and “curling”. The engine tracks these personalized trends and supplies current facts extracted from Twitter and Yahoo! recommended URLs.

Source: Factery Labs

Three Reasons Why Twitter’s New Streaming API Rocks

January 26th, 2010 by Guest Author
Posted in Guest Authors, Realtime | No Comments »

The links that people share on Twitter are important signals for OneRiot’s realtime search engine. Broadly speaking, the more people share a link to a specific piece of content, and the faster the rate of sharing right now, the higher that content will appear in our search results. (You can read more detail about our ranking algorithm in this white paper.)

After almost a year of working with the team at Twitter and integrating their Search (aka REST) API, we recently started using the Twitter Streaming API and wanted to share with our developer friends and the greater tech community why we’re pumped about it:

1 – Data volume is fantastic

With Twitter’s Streaming API we are seeing almost twice as much data as before. The stream design paradigm is smart: Twitter now pushes data in realtime instead of waiting for 3rd party developers to ask for it. Twitter’s old REST API could only be maximized by using multiple threads, which caused duplicate tweets and missing data. Our team had to be very diligent about de-duping tweets and back-tracking to reduce the number of missed tweets, not to mention the complexity of multithreaded programming logic. The new Streaming API follows a good design pattern, allowing the data to flow in realtime without requiring a second thread. This means less complex programming logic, no more duplicate tweets, and a fully maximized data volume, which is a huge improvement!
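To make the de-duplication burden concrete, here is a minimal sketch in Python (not OneRiot's actual code, which is not shown in this post). It assumes each tweet is a dictionary with a unique "id" field and that two hypothetical polling threads returned overlapping batches; the function name and sample data are invented for illustration.

```python
def merge_batches(batches):
    """Merge overlapping batches of tweets, keeping the first copy of each ID."""
    seen_ids = set()
    merged = []
    for batch in batches:
        for tweet in batch:
            if tweet["id"] not in seen_ids:
                seen_ids.add(tweet["id"])
                merged.append(tweet)
    return merged


# Two hypothetical polling threads returned overlapping results.
thread_a = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
thread_b = [{"id": 2, "text": "b"}, {"id": 3, "text": "c"}]

unique = merge_batches([thread_a, thread_b])
print([tweet["id"] for tweet in unique])
```

With a single pushed stream, this bookkeeping (and the shared state it requires between threads) is no longer needed.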

2 – No more pesky HTTP 503 errors from the search API

The new Streaming API means far fewer interruptions to our data feeds from HTTP 503 errors (“Service Unavailable”). The old API required us to build a special catch-up thread to make sure we didn’t miss any data during outages. This was a time-consuming and expensive problem. Since implementing the Streaming API we haven’t experienced any service availability issues and have eliminated our “catch-up” process.

3 – It’s easy to integrate

The Streaming API is simply easier to write code for. It took us less than two days to fully integrate the new API with a very small learning curve and a barebones system. I should also point out that the Twitter Streaming API is extremely well documented. (To be fair, so was the last one, but it should be noted that they did a great job with this documentation too!)

No matter what programming language you use (Ruby, Perl or Java), the integration is seamless. Here’s how we integrate the Twitter Streaming API at OneRiot:

Java is our programming language of choice because it’s fast to develop in while delivering high performance. We also use the HTTPClient library to connect to the Twitter stream. The tweets are returned in JSON, which we parse as it comes in on the stream. (Side note about JSON: we are pleased that Twitter supports JSON since it’s a lightweight format that’s quick to download, easy to read and not as bulky as XML. Oh, and it also has less overhead with clearly structured data.) Lastly, we publish tweets using a traditional publisher-subscriber model. Since Twitter doesn’t require a server, we have found that the traditional publisher-subscriber model is less daunting than PubSubHubbub, which is more complex and has server requirements.
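The flow described above (read the stream, parse each JSON message as it arrives, hand it to subscribers) can be sketched roughly as follows. This is an illustration in Python rather than the Java and HTTPClient setup mentioned in the post. It assumes the stream delivers one JSON object per line, and it uses an in-memory stream as a stand-in for the HTTP connection; the Publisher class and consume function are hypothetical names.

```python
import io
import json


class Publisher:
    """A bare-bones in-process publisher-subscriber hub."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        for callback in self._subscribers:
            callback(message)


def consume(stream, publisher):
    """Parse each non-empty line as JSON and publish it as soon as it arrives."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines between messages
        publisher.publish(json.loads(line))


received = []
publisher = Publisher()
publisher.subscribe(received.append)

# Stand-in for the network stream: two messages separated by a blank line.
fake_stream = io.StringIO(
    '{"id": 1, "text": "hello"}\n\n{"id": 2, "text": "world"}\n'
)
consume(fake_stream, publisher)
print([tweet["text"] for tweet in received])
```

Because the consumer only iterates over lines, the same loop would work over a long-lived HTTP response body, and no server needs to be exposed to receive callbacks.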

As you can tell, we are big fans of the Twitter Streaming API and would highly recommend that any 3rd party developers who have not already converted do so.

Posted by Nathaniel Fisher here

The six degrees of distribution in search.

January 19th, 2010 by Guest Author
Posted in CEO Views, Guest Authors, Realtime | 1 Comment »

Or, everything you ever wanted to know about Peer-to-Peer (p2p) distributed / decentralized search, but were afraid to ask.

I’ve known Wolf and Gosia Garbe from FAROO for a long time, so I asked Wolf if he would write a guest post explaining the differences between the single-index, Google-type search architecture and the decentralized / distributed peer-to-peer model that FAROO is based on.

What follows is a pretty long essay, but it is meant to be. If you want to be up on what challenges Google is going to face in the coming decade, you really must allocate some time to read this article.

A guest post by Wolf Garbe, FAROO

Crisis reveals character, and this is especially true for distributed systems. Everything beyond the standard case may lead to a crisis if not considered beforehand.

The network-wide scale adds a new dimension to everything, completely changing the perspective and putting many centralized approaches into question.

Joining peers, updates and recovery look different from a bird’s eye view than from the ground:

*When the network size grows, the bootstrapping algorithm needs to scale.

*Even if the whole system fails and all peers want to reconnect at the same time, the system should be able to recover gracefully.

*Every system needs to evolve over time, hence software distribution is required to work at large scale, perhaps frequently or immediately.

That’s why it is important to look at the scaling of all operational aspects, not only at the main search functionality. The weakest element defines the overall scalability and reliability of a system.

The benefits of a distributed architecture (such as low cost, high availability and autonomy) can be fully realized only if the operational side is also fully distributed.

There should be no centralized element anywhere that can fail, be attacked or be blocked as a single point of failure, or that simply does not scale. Not for crawling, not for indexing & search, not for ranking and discovery, not for bootstrap, and not for update.

Let’s have a closer look at those six degrees of distribution:

Distributed Crawling

Sometimes only the crawler is distributed, while the index and the search engine are still centralized. An example is the Grub distributed crawler, once used by Wikia Search, the search project of Wikipedia founder Jimmy Wales.

A distributed crawler itself provides only limited benefit. Transferring the crawled pages back to a central server doesn’t save much of the bandwidth, compared to what the server would need to download the pages itself. Additionally there is overhead for reliably distributing the workload across the unreliable crawler elements.

The benefits of such a hybrid approach lie rather in applications beyond a search engine: only selected information is transferred back (as when scraping email addresses), and the spider is harder for the webmaster to detect and block, as the load comes from different IPs.

Distributed crawling will live up to its promise only as part of a fully distributed search engine architecture, where the crawlers are not controlled by a single instance but crawl autonomously, led solely by the wisdom of the crowd of its users. Huge network-wide effects can be achieved by utilizing the geographic or contextual proximity between distributed index and crawler parts.

With FAROO’s user-powered crawling, pages that change often (e.g. news) are also re-indexed more often. So the FAROO users implicitly control the distributed crawler in such a way that frequently changing pages are kept fresh in the distributed index, while unnecessary traffic on rather static pages is avoided.

Distributed Discovery

Even for big incumbents in the search engine market, it is impossible to crawl the whole web (100 billion pages?) within minutes in order to discover new content in a timely manner (a billion pages per day). Only if the crawler is selectively directed to newly created pages does web-scale real time search become feasible and efficient, instead of looking for the needle in the haystack.

By aggregating and analyzing all visited web pages of our users for discovery, we utilize the “wisdom of crowds”. Our users are our scouts. They bring in their collective intelligence and direct the crawler to where new pages emerge. In addition to instantly indexing all visited web pages, our active, community-directed crawler also derives its crawl start points from discovered pages.

Beyond real time search, this is also important for discovering and crawling blind spots in the web. Those blind spots are formed by web pages that are not connected to the rest of the web, so they can’t be found just by traversing links.

Distributed discovery also helps index the deep web (sometimes also referred to as the hidden web). It consists of web pages that are created solely on demand from a database when a user searches for a specific product or service. Because there are no incoming links from the web, those pages can’t be discovered and crawled by normal search engines, although these are starting to work on alternative ways to index the hidden web, which is much bigger than the visible web.

Distributed Index & Search

Storing web-scale information is not so much of a problem. What is expensive are the huge data centers required for answering millions of queries in parallel. The resulting costs of billions of dollars can be avoided only with a fully decentralized search engine like FAROO.

Incumbents already envision 10 million servers. A distributed index scales naturally, as additional users also provide the additional infrastructure required for their queries. It also benefits from the increase in hardware resources, doubling every two years according to Moore’s Law.

Recycling unused computer resources is also much more sustainable than building new giant data centers, which consume more energy than a whole city.

The indexes of all big search engines are distributed across hundreds of thousands of computers within huge data centers. But by distributing the search index to the edge of the network, where both user and content already reside, the data no longer has to travel back and forth to a central search instance, which is consequently eliminated. This not only prevents a single point of failure, but also combines index distribution across multiple computers with the geographic proximity normally achieved by spreading multiple data centers across the globe.

Last but not least, a distributed index is the only architecture where privacy is inherent in the system, as opposed to the policy-based approaches of centralized search engines, where the privacy policy might be subject to change.

Zooming in from the macroscopic view, every distributed layer has its own challenges. For the index, for example, peers usually do not behave as they should: they are overloaded, there is user activity, the resource quota is exhausted, they are behind a NAT, their dynamic IP has changed or they just quit.

Those challenges have been perfectly summarized in “The Eight Fallacies of Distributed Computing”. Yet going into all the details and our solutions would certainly go beyond the scope of this post.

Distributed Ranking

An additional benefit is distributed attention-based ranking, utilizing the wisdom of crowds. Monitoring the browsing habits of the users and aggregating those “implicit” votes across the whole web promises a more democratic and timely ranking (important for real time search).

While most real time search engines use explicit voting, we showed in our blog post “The limits of tweet based web search” that implicit voting by analyzing visited web pages is much more effective (by two orders of magnitude!).

This also eliminates the shortcomings of a Wikipedia-like approach, where content is contributed in a highly distributed way but the audit is still centralized. Implicit voting automatically involves everybody in a truly democratic ranking. The groups of adjudicators and users become identical and therefore pull together for optimum results.

Distributed Bootstrap

The first time a new peer wants to connect to the p2p network, it has to contact known peers (super peers, root peers, bootstrap peers, rendezvous peers) to learn the addresses of other peers. This is called the bootstrap process.

The addresses of the known peers are either shipped in a list together with the client software or they are loaded dynamically from web caches.

The new peers then store the addresses of the peers they learned from the super peer. The second time the new peers can directly connect to those addresses, without contacting the super peer first.

But if a peer has been offline for some time, most of the addresses it stored become invalid because they are dynamic IP addresses. If the peer fails to connect to the p2p network using the stored addresses, it starts the bootstrap process again using the super peers.
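The conventional super-peer bootstrap described above can be sketched as a simple fallback. This is a minimal Python illustration of the generic process, not FAROO's distributed algorithm; the function name, the addresses and the try_connect callback are all hypothetical.

```python
def join_network(stored_addresses, super_peers, try_connect):
    """Return (address, source) for the first reachable peer, or (None, None).

    Stored addresses are tried first; super peers are only a fallback.
    """
    for address in stored_addresses:
        if try_connect(address):
            return address, "stored"
    for address in super_peers:
        if try_connect(address):
            return address, "bootstrap"
    return None, None


# Hypothetical addresses: the stored dynamic IPs have gone stale,
# so only the super peer is still reachable.
reachable = {"superpeer.example:4000"}
result = join_network(
    stored_addresses=["10.0.0.5:4000", "10.0.0.9:4000"],
    super_peers=["superpeer.example:4000"],
    try_connect=lambda address: address in reachable,
)
print(result)
```

Every peer whose stored list has gone stale ends up in the second loop, which is why the small set of super peers carries the whole load in the scenarios that follow.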

Scaling
During strong network growth many peers access the super peers in order to connect to the p2p network. The super peers then become the bottleneck and a single point of failure in an otherwise fully decentralized system. If super peers become overloaded, no new peers can join the system, which prevents further network growth.

Recovery
If the whole p2p network breaks down due to a web-wide incident and all peers try to reconnect at the same time, this leads to an extreme load on the super peers.
This would prevent a fast recovery, as peers would fail to connect but keep trying, causing additional load.
Those problems have been experienced in practice in the Skype network.

Security
Another issue is that the super peers make the whole p2p network vulnerable because of their centralized nature. Both blocking and observing of the whole p2p network become possible just by blocking/observing the few super peer nodes.

FAROO is using a fully distributed bootstrap algorithm, which

*eliminates the super peers as the last centralized element, as bottleneck and single point of failure in an otherwise distributed system.

*provides organic scaling for the bootstrap procedure as well.

*ensures a fast recovery in case of a system wide incident.

*makes the p2p network immune to the blocking or monitoring of super peers.

Distributed Update

The distributed system becomes automatically smarter just by the increasing relevance of the collected attention data.
But you may want to refine the underlying algorithms, to improve the efficiency of the p2p overlay, to extend the data model, or to add new functions. And the example of Windows shows that it might be necessary to apply security patches, network wide, frequently and immediately. Updating p2p clients requires a very efficient software update distribution.

With 10 million peers and a client software size of 5 Mbyte, a full network update would require distributing 50 terabytes. Even over a 100 Mbit/s network connection a central update would take around 50 days, provided you manage to distribute the updates evenly over time.
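The arithmetic behind these figures can be checked with a few lines of Python. The sketch assumes decimal units (1 Mbyte = 10^6 bytes, 1 Mbit/s = 10^6 bits per second) and a link that runs at full rate with no protocol overhead, which the text does not specify.

```python
peers = 10_000_000
client_size_bytes = 5 * 10**6        # 5 Mbyte, decimal units assumed
link_bits_per_second = 100 * 10**6   # 100 Mbit/s, no overhead assumed

total_bytes = peers * client_size_bytes
seconds = total_bytes * 8 / link_bits_per_second
days = seconds / 86_400

print(total_bytes / 10**12, "terabytes")
print(round(days, 1), "days")
```

Under these idealized assumptions the transfer takes a little over 46 days, which is in line with the rounded figure of 50 days once any real-world overhead is added.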

FAROO instead uses a distributed, cell-division-like update, in which all peers pass on the DNA of a new version to each other within minutes. Of course there is some signature stuff to ensure the integrity of the network.
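Why a cell-division-like scheme can finish so quickly is easy to see with a toy model. The sketch below assumes that in every round each already-updated peer passes the new version to exactly one peer that does not have it yet, so the updated population doubles per round. This is an idealization for illustration only; FAROO's actual protocol is not described in this post, and the function name is hypothetical.

```python
def rounds_to_reach(total_peers, seeded_peers=1):
    """Count rounds until every peer is updated, doubling the updated set each round."""
    updated = seeded_peers
    rounds = 0
    while updated < total_peers:
        updated = min(total_peers, updated * 2)
        rounds += 1
    return rounds


print(rounds_to_reach(10_000_000))
```

In this model 10 million peers are covered after 24 rounds, and each peer uploads the 5 Mbyte client only a handful of times, instead of one central server uploading it 10 million times.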

Divide and Conquer

By consistently distributing every function we ensured true scalability of the whole system, while eliminating every single point of failure. Our peers are not outposts of a centralized system, but rather part of a distributed cyborg (combining the power of users, algorithms & resources) living in the net.

This is a system which works on a quiet sunny day, but also on a stormy one. It would even be suitable for extreme mobile scenarios, where peers are scattered across a battlefield or carried by a rescue team.

The system recovers autonomously from a disaster. Even if there is no working central instance left, the surviving peers find each other, forming a powerful working distributed system again once they awake. If you have seen the Terminator reassembling after being run over by a truck, you get the idea.

In biology, organisms naturally deal with the rise and fall of their cells, which as simple elements form superior systems. We believe that evolution works in search too, and that the future belongs to multicellulars.

Watch the Golden Globes Live With Collecta

January 15th, 2010 by Charles S. Knight
Posted in Realtime, Updates | No Comments »

Don’t want to miss a second of action from the 67th Golden Globe Awards? Collecta puts you in the front row, so you can follow everything that happens at the Globes live, from the red carpet to the awards ceremony and parties. The streaming real-time search platform provides a uniquely comprehensive and instantaneous glimpse into every comment, reaction, trend, and image from across the web. Collecta’s new ‘Hot Now’ feature will portray a rich snapshot of the Golden Globes, displaying the most representative sampling of stories, photos, comments, and tweets. Additionally, by leaving a streaming search running throughout the Globes, people can watch the stories as they unfold.

Collecta scours both social media, such as Twitter and WordPress, and established news sources in real time. Collecta’s broad content network also includes entertainment focused sites — including TMZ, Perez Hilton, TheInsider, Extra, E! Online, LA Times, and Examiner.com — to pull together all the best Golden Globes gossip and news the second it has been published.

Film and TV fans can participate with a global audience, by seeing comments and reactions in literal real time. Whether fans are shocked by an award or acceptance speech, a dress on the red carpet or a joke from Ricky Gervais, Collecta lets them see others’ reactions in real time.

As Gerry Campbell, CEO of Collecta, explained, “The streaming, real time nature of the results pulls people right into the action. They are suddenly part of a larger community who are happy, surprised, and disappointed by the same events.”

Pulseni mixes Web Search with Twitter Results

January 4th, 2010 by Steffen Schilke
Posted in Realtime | No Comments »

Hello and welcome to Pulseni!

I’m sure you’re wondering what Pulseni is. After all, it’s not like the name of the site really gives much away, eh? Well, don’t worry, let me explain.* On Twitter people share millions of links, and most of the time these links are given at the end of a tweet with little actual explanation of what is being linked to. Pulseni takes these links and then brings them into context with a title and description of the page.

We looked at Twitter and saw what no doubt many see: a lot of very handy and up-to-date links without meaning. Type a search into the search box and you will get standard web search results, except these are results that are being shared right now on Twitter. Fantastic for news events or anything that is time sensitive!

*Explain the name, please! -editor

Source: Pulseni.com