PhD Thesis Proposal Defence

"Meta-search and Distributed Search Systems"

By Mr. Yipeng Shen

Abstract

The Web, which contains billions of pages, has become one of the major information resources. Because the Web is huge and dynamic, it is difficult for a single search engine to index all web pages and still keep its index database up to date. Meta-search and distributed search systems that incorporate many individual search engines can alleviate the problems associated with a single search engine.

A meta-search system is a middleware layer over a large number of underlying search engines. It receives queries from users and redirects them to one or more of the participating search engines for processing. The ability of the meta-search engine to select the most relevant search engines determines the quality of the final result. To facilitate the selection process, the document space covered by each search engine must be described both concisely and precisely. We propose to partition a search engine's document space into clusters and keep a descriptor for each cluster. The cluster descriptors provide a finer and more accurate representation of the document space, and hence enable the meta-search engine to improve its selection of relevant search engines.

Furthermore, we propose a metric space model to evaluate the relevance of the underlying search engines in the meta-search system when the similarity measurement used in the system satisfies the distance condition. The top search engines, those containing the most similar documents, can be identified effectively by estimating the shortest distance between the query and the estimated most similar documents in the clusters of each search engine's index database.

We also propose to develop a peer-to-peer distributed search system in which a large number of autonomous search engines are logically connected in a flat, non-hierarchical architecture (network).
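
As a rough illustration of the metric-space selection idea described in the abstract, the sketch below ranks search engines by a lower bound on the distance from a query to any document in their clusters. It is not the method proposed in the thesis. It assumes that the "distance condition" refers to the triangle inequality, that each cluster descriptor is a (centroid, radius) pair, and that Euclidean distance is used. The names euclidean, cluster_lower_bound and rank_engines, and the sample data, are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance; it satisfies the triangle inequality."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_lower_bound(query, centroid, radius, dist=euclidean):
    """Lower bound on the distance from the query to any document in a cluster.

    For a document d within `radius` of centroid c, the triangle inequality
    gives dist(q, d) >= dist(q, c) - dist(c, d) >= dist(q, c) - radius.
    """
    return max(0.0, dist(query, centroid) - radius)


def rank_engines(query, engines, dist=euclidean):
    """Order engines by the smallest lower bound over their cluster descriptors."""
    scored = []
    for name, clusters in engines.items():
        best = min(cluster_lower_bound(query, c, r, dist) for c, r in clusters)
        scored.append((best, name))
    scored.sort()
    return [name for _, name in scored]


# Hypothetical descriptors: engine name -> list of (centroid, radius) pairs.
engines = {
    "engine_a": [((0.0, 0.0), 1.0), ((10.0, 10.0), 2.0)],
    "engine_b": [((5.0, 5.0), 0.5)],
    "engine_c": [((3.0, 4.0), 3.0)],
}
query = (3.0, 3.0)
ranking = rank_engines(query, engines)
print(ranking)
```

In this toy data the query falls inside the only cluster of engine_c, so its bound is zero and it is ranked first. A bound of this kind can only say how close a document could be, not how close one actually is, which is why the abstract speaks of estimating the shortest distance.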
We model the distributed search process as Markov Decision Processes (MDPs). The estimated relevance of a server to a query is regarded as the reward in the MDP model. Once the MDP policies representing the global knowledge have been obtained at each server through asynchronous value iteration, the servers most relevant to a given query can be identified efficiently despite the lack of centralized control and of global knowledge at each autonomous server.

Date: Thursday, 15 November 2001
Time: 3:00 p.m. - 5:00 p.m.
Venue: Room 4480 Lifts 25-26

Committee Members:
Prof. Dik-Lun Lee (Supervisor)
Dr. Dimitris Papadias (Chairman)
Prof. Frederick Lochovsky
Prof. Hongjun Lu
Dr. Wilfred Ng

**** ALL are Welcome ****
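
As a rough illustration of the MDP formulation mentioned in the abstract, the sketch below runs in-place (asynchronous) value iteration on a small peer network. It is not the model proposed in the thesis. It assumes a deterministic MDP in which the state is the server holding the query, an action forwards the query to a neighbour, and the reward is the estimated relevance of the receiving server. The discount factor gamma, the network, the relevance values and the function names are all hypothetical, and for simplicity a server's reward is counted again each time it is revisited.

```python
def async_value_iteration(neighbors, relevance, gamma=0.5, sweeps=50):
    """In-place value iteration: each update uses the latest neighbour values."""
    values = {s: 0.0 for s in neighbors}
    for _ in range(sweeps):
        for s in neighbors:
            if neighbors[s]:
                # A server only needs values held by its direct neighbours,
                # so no global synchronisation step is required.
                values[s] = max(
                    relevance[n] + gamma * values[n] for n in neighbors[s]
                )
    return values


def greedy_policy(neighbors, relevance, values, gamma=0.5):
    """For each server, the neighbour with the best reward plus discounted value."""
    return {
        s: max(ns, key=lambda n: relevance[n] + gamma * values[n])
        for s, ns in neighbors.items()
        if ns
    }


# Hypothetical peer network: a chain A - B - C - D, plus a side link A - E.
neighbors = {
    "A": ["B", "E"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": ["A"],
}
# Hypothetical estimated relevance of each server to one query.
relevance = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.9, "E": 0.4}

values = async_value_iteration(neighbors, relevance)
policy = greedy_policy(neighbors, relevance, values)
print(policy)
```

In this toy network, server A forwards the query to B even though its other neighbour E has a higher immediate relevance, because the values learned at B reflect the highly relevant server D further along the chain. This is the sense in which locally stored policies can stand in for global knowledge.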