Jiksearch
"Some of the people want to query about democracy, but most of them just want to know about their pop stars." --Google [1]
Jiksearch is a distributed peer-to-peer (P2P) web searching platform. At least in theory.
Why? All search engines are subject to government and/or corporate control, and it is unlikely that this control will not be abused on an even larger scale in the future than it is today.
The project was directly inspired by Google's decision to kowtow to the Chinese government's demands that they censor politically unacceptable pages (see also [2]). Other search engines have been effectively helping to interfere with human rights in China for some time, but Google's decision came as a shock because, until very recently, they were perceived to be an exciting company with people who care about things other than money. Inevitably, of course, this had to change.
It's easy to dismiss China as a special case of totalitarianism and take refuge in the fact that one lives in free country X. But recent trends all point towards more government control of the internet in virtually all countries with functional governments that are large enough to care. The US, UK, Germany and France are of course high on the list. In the not too distant future, government control of the internet may become the norm rather than the exception. Even without this, the control that a company such as Google has over the internet is a cause for concern, as their initial high-minded ideals are likely to be more and more superseded by the basic profit motive and the control-conserving mentality that seems inevitable.
While competitors may challenge Google in the long run, the basic problem is not with Google but with the need for a vast centralised database containing, in essence, the whole internet. Large centralised databases are irresistible to those who wish to control other people's lives.
Hence, the only real solution is a decentralised web search platform. These pages contain some initial ideas towards this ideal.
First, some history of an earlier idea for decentralised web searching. The only existing reference to an actual implementation of a distributed web searching application that I could find (not that I did a very exhaustive search; help me if I missed something) was to something called InfraSearch in the year 2000. According to c|net, this started out as a group of open-source programmers of Gnutella fame who saw the obvious potential of P2P file-sharing technology for web searches. Brilliant idea, but what happened? Very little. The project seems to have paid off for the few programmers involved, but nothing actually developed out of it, since Sun Microsystems bought the project, stripped the useful technology for their JXTA project and forgot all about the original idea of decentralised web searching by "generalising" the technology. (Is this at all accurate?)
Next, some ideas on how a decentralised web searching platform would work. We are thinking in terms of an application that would be installed on clients' computers, possibly in the form of browser extensions. The browser then exposes web search functionality while simultaneously contributing resources (CPU, memory, hard disk space and internet bandwidth) to the decentralised search system. As an added bonus, sites that are actually visited by the user of the browser can automatically be included in the system. However, some crawling will be required, especially at first and to ensure that the system stays up to date. Dedicated servers should also be available for "serious" contributors, but the idea of supernodes goes against the principle of decentralisation, so they should essentially have the same status as any other node.
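As a very rough illustration of the "index what you visit" idea, here is a minimal Python sketch. The function names, the stop-word list and the local index structure are all made up for illustration; a real browser extension would hook into the browser's own APIs and would also announce the new keys to the network, which is elided here.

    import re

    # Illustrative stop words; a real system would use a proper list.
    STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}

    # url -> set of keys, kept locally on this servent.
    local_index = {}

    def extract_keys(page_text):
        """Turn a visited page into the keys it will be searchable under."""
        words = re.findall(r"[a-z]+", page_text.lower())
        return {w for w in words if w not in STOPWORDS}

    def on_page_visited(url, page_text):
        """Hypothetical hook called by the browser whenever a page is loaded."""
        local_index[url] = extract_keys(page_text)
        # ...the servent would also announce these keys to the network here.

    on_page_visited("example.com/b", "Microsoft Windows and the history of windows")
    print(local_index["example.com/b"])   # {'microsoft', 'windows', 'history'}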
Decentralised searching is not a trivial task. Users construct queries consisting of multiple keywords (keys) that must be found in combination on a given web page (page). Say a user constructs a query consisting of "Microsoft Windows", not as a phrase but as separate search terms. Ideally, the user wants to see a list of pages containing both words, out of all possible pages on the entire internet, ranked according to relevance. There are too many pages containing "windows" to store a database of them on a single servent (P2P lingo for a node acting as both client and server simultaneously). The same holds for "Microsoft". Even if a list of pages could be generated for each of those, how are the two lists to be combined? Sending all of those pages to the client for combination is impossible, as is sending all of the pages from one servent to the other for combination.
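To make the combination problem concrete, here is a minimal Python sketch of what a centralised engine does trivially: intersecting two posting lists held in one place. The data and names are purely illustrative; the point is that in a decentralised system no single servent can hold either list in full, let alone both.

    # Hypothetical posting lists: term -> set of page URLs containing it.
    postings = {
        "microsoft": {"example.com/a", "example.com/b", "example.org/c"},
        "windows":   {"example.com/b", "example.org/c", "example.net/d"},
    }

    def search(terms):
        """Naive AND query: pages containing every term."""
        result = None
        for term in terms:
            pages = postings.get(term, set())
            result = pages if result is None else result & pages
        return result or set()

    print(search(["microsoft", "windows"]))
    # {'example.com/b', 'example.org/c'}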
The Kademlia protocol is designed to overcome the problem of decentralised searching in an imperfect network, e.g. one where hosts become unreachable at random times (see also [3] and [4]). It works by assigning each servent a random code, which is then fixed. Together with a hash of each key, this code is used to determine the distance between the host and any key, and between the host and other hosts. In this way, hosts can form connections based on nearness, and hosts can index information based on their nearness to it. A client wishing to search simply locates hosts that are near the results (since the client can easily calculate a hash of the search terms) and asks at the most appropriate place. (This description is probably inaccurate and incomplete --MvS).
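The core of the nearness idea can be sketched as follows. This is only an illustration of the XOR distance metric, not the actual Kademlia routing or wire protocol; the node names and ID size are arbitrary.

    import hashlib
    import random

    def key_id(text):
        """Map a key (or any string) into the same 160-bit ID space as the nodes."""
        return int.from_bytes(hashlib.sha1(text.encode()).digest(), "big")

    def distance(a, b):
        """Kademlia-style XOR metric: a smaller value means 'nearer' in ID space."""
        return a ^ b

    # Hypothetical servents, each with a fixed, randomly chosen ID.
    nodes = {"node-%d" % i: random.getrandbits(160) for i in range(8)}

    # A servent looking for the key "democracy" asks the node whose ID is
    # nearest (by XOR) to the key's hash -- that node is responsible for it.
    key = key_id("democracy")
    nearest = min(nodes, key=lambda name: distance(nodes[name], key))
    print("ask", nearest, "about 'democracy'")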
Kademlia is designed for locating a particular file on a P2P filesharing network, and as such the abovementioned problem of combining multiple terms is much less apparent.
One possible solution to the problem of combination is to have two separate layers of data, with the separation being done differently in each: in the first layer, keys are distributed, while in the second, whole pages are distributed. The key-separation layer is able to direct queries to hosts that are likely to be able to provide matches by searching full pages (i.e. the second layer). The first layer can only search on a single key at a time. Key searches and page searches should probably be done in the same way. Each host therefore keeps some web pages (in full) as well as some keys, and a search will either generate a list of hash values that are near pages containing the key (layer 1) or produce actual results (layer 2). In the first case, the query is passed on to another host (based on the returned hash value); in the second case, the results are simply passed to the user. Ultimately the client aggregates results as they come in and presents them to the user in real time.

Part of a client's database may be dedicated to searches commonly requested of it (e.g. directly by the user in front of it), but then the problem is how to make use of this information, since it is unlikely to be "near" the servent's randomly generated hash value. One way is to announce that this servent may be used as a layer 2 host: notifications are sent to the appropriate layer 1 hosts stating that this host holds information on each of the keys. The receiving (layer 1) host associates the hash value of the notifier with the likely keys in its own database. Only hash values are sent in notifications. This may be inefficient, however, as it breaks the principle of hash-value nearness: if the host goes down or purges the page, then hundreds of notifications have been sent in vain. Unless, that is, the most important pages are also passed on (sparingly) to hosts that are near, so that self-organising clusters eventually form that are strongly mapped (in layer 1 space) to the relevant keys. This also ensures that frequently used searches are duplicated widely around the world.
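Below is a minimal sketch of the two-layer idea, with everything collapsed into in-memory dictionaries: the routing step (finding the layer 1 host nearest to a key's hash) and the notification protocol are elided, and intersecting the per-key candidate host sets is itself an assumption. All names and data are illustrative.

    import hashlib

    def h(text):
        """Shared hash into the ID space used for 'nearness' (illustrative)."""
        return int.from_bytes(hashlib.sha1(text.encode()).digest(), "big")

    # Layer 1: key -> hash values of layer 2 hosts believed to hold matching pages.
    layer1 = {}

    def notify(key, host_id):
        """A layer 2 host announces (hash value only) that it has pages for this key."""
        layer1.setdefault(key, set()).add(host_id)

    # Layer 2: each host keeps a few complete pages, here host ID -> {url: keys}.
    layer2 = {}

    def store_page(host_id, url, keys):
        layer2.setdefault(host_id, {})[url] = set(keys)
        for key in keys:
            notify(key, host_id)

    def client_query(keys):
        """Ask layer 1 where to look, then ask those layer 2 hosts for results."""
        candidates = set.intersection(*(layer1.get(k, set()) for k in keys))
        results = []
        for host_id in candidates:
            for url, page_keys in layer2[host_id].items():
                if set(keys) <= page_keys:
                    results.append(url)
        return results

    # Illustrative pages on two hypothetical hosts.
    store_page(h("host-a"), "example.com/b", ["microsoft", "windows"])
    store_page(h("host-b"), "example.org/c", ["windows", "linux"])
    print(client_query(["microsoft", "windows"]))   # ['example.com/b']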
Of course, "relevance thresholds" and TTL schemes should be used to prevent terabytes of search results from being sent to the user. Having keys and pages that normally go together on the same host would be a bonus, hence the hash algorithm may need to be adjusted dynamically to optimise according to local conditions. Some type of artificial intelligence, e.g. genetic algorithms, could be useful. In general, bandwidth will need to be very carefully managed.
Alternatively, entire multiple-word queries can simply be assigned a single hash value and then passed to a single host. This means that a single host must somehow be able to return results for complex queries based on a small dataset (each host only has access to a small dataset as a rule).
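A sketch of the whole-query hashing variant: normalise the query so that word order does not matter, hash it, and use the result to pick the responsible host in the same ID space as above. Purely illustrative.

    import hashlib

    def query_id(terms):
        """Hash an entire multi-word query to a single ID, ignoring word order."""
        canonical = " ".join(sorted(t.lower() for t in terms))
        return int.from_bytes(hashlib.sha1(canonical.encode()).digest(), "big")

    # "Microsoft Windows" and "windows Microsoft" land on the same host, which
    # must then answer the combined query from its own small local dataset.
    print(query_id(["Microsoft", "Windows"]) == query_id(["windows", "Microsoft"]))  # True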
Blah, lots of work obviously.
Where does the project stand currently? Effectively this is an initial feasibility appraisal. We need to know about the technologies involved (a list should be compiled), theoretical material should be studied (some links below) and similar efforts should be investigated.
Clearly, distributed searching is an old topic, and clearly it is a difficult one.
Links:
A company that makes offline web search software: [6]. This may be relevant simply because it may contain information on a method for selecting which information is important enough to include in an index. Obviously any approach where a single company decides on the contents of the index is an abomination of the highest degree.