PhD Thesis Proposal Defence

"Meta-search and Distributed Search Systems"

By

Mr. Yipeng Shen

Abstract

The Web, which contains billions of pages, has become a major
information resource. Because the Web is huge and dynamic, it is
difficult for a single search engine to index every web page and still
keep its index database up to date. Meta-search and distributed search
systems that combine many individual search engines can alleviate the
problems associated with a single search engine.

A meta-search system is middleware that sits in front of a large number
of underlying search engines. It receives queries from users and forwards
each query to one or more of the participating search engines for
processing. The ability of the meta-search engine to select the most
relevant search engines determines the quality of the final result. To
facilitate the selection process, the document space covered by each
search engine must be described both concisely and precisely. We propose
to partition a search engine's document space into clusters and keep a
descriptor for each cluster. The cluster descriptors provide a finer and
more accurate representation of the document space, and hence enable the
meta-search engine to improve the selection of relevant search engines.
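
As a rough illustration of selection with per-cluster descriptors, the
sketch below scores each engine by its best-matching cluster. The
descriptor form (a sparse term-weight vector per cluster), the cosine
measure, the max-over-clusters rule, and all names and data are
assumptions made for this example, not details taken from the proposal.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_engines(query, descriptors):
    """Rank engines by the similarity of their best-matching cluster.

    descriptors maps a (hypothetical) engine name to a list of cluster
    descriptors, each a sparse term-weight dict.
    """
    scores = {
        name: max(cosine(query, cluster) for cluster in clusters)
        for name, clusters in descriptors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: engine_a keeps two focused clusters, engine_b one mixed cluster.
descriptors = {
    "engine_a": [{"python": 1.0, "snake": 1.0}, {"java": 1.0, "coffee": 1.0}],
    "engine_b": [{"python": 1.0, "java": 1.0, "code": 2.0}],
}
query = {"java": 1.0, "coffee": 1.0}
ranking = rank_engines(query, descriptors)
print(ranking)
```

In this toy case the focused "java/coffee" cluster of engine_a matches
the query exactly, whereas the single mixed descriptor of engine_b
dilutes the match, so engine_a is ranked first.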

Furthermore, we propose a metric space model to evaluate the relevance of
the underlying search engines in the meta-search system when the
similarity measure used in the system satisfies the distance
condition. The search engines containing the most similar documents
can be identified effectively by estimating the shortest distance
between the query and the documents estimated to be most similar to it
within the clusters of each search engine's index database.
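
One way such a distance estimate could be obtained, when the measure
obeys the triangle inequality, is to bound the distance from the query
to any document in a cluster using the cluster's centroid and radius:
d(q, x) >= max(0, d(q, c) - r). The sketch below ranks engines by the
smallest such bound. The centroid-plus-radius descriptor, the Euclidean
distance, and all names and data are assumptions for illustration; the
proposal does not specify its estimator.

```python
import math

def euclidean(p, q):
    """Euclidean distance, a metric that obeys the triangle inequality."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def lower_bound(query, centroid, radius):
    """Lower bound on the distance from query to any point in the cluster.

    Every document x in the cluster satisfies d(centroid, x) <= radius,
    so the triangle inequality gives d(query, x) >= d(query, centroid) - radius.
    """
    return max(0.0, euclidean(query, centroid) - radius)

def rank_engines_by_distance(query, engines):
    """Rank engines by the smallest lower bound over their clusters."""
    bounds = {
        name: min(lower_bound(query, c, r) for c, r in clusters)
        for name, clusters in engines.items()
    }
    return sorted(bounds, key=bounds.get), bounds

# Each (hypothetical) engine is described by (centroid, radius) pairs.
engines = {
    "engine_a": [((0.0, 0.0), 1.0), ((10.0, 0.0), 2.0)],
    "engine_b": [((4.0, 3.0), 1.0)],
}
query = (4.0, 0.0)
ranking, bounds = rank_engines_by_distance(query, engines)
print(ranking, bounds)
```

The bound is optimistic: it says how close a document could be, not how
close one actually is, so a real system would likely combine it with
further statistics about each cluster.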

We also propose to develop a peer-to-peer distributed search system in
which a large number of autonomous search engines are logically connected
in a flat, non-hierarchical network. We model the distributed
search process as Markov Decision Processes (MDPs). The estimated
relevance of a server to a query is regarded as the reward in the MDP
model. Once the MDP policies representing the global knowledge are
obtained at each server through asynchronous value iteration, the
servers most relevant to a given query can be identified efficiently
despite the lack of centralized control and global knowledge at each
autonomous server.
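
To make the MDP idea concrete, the sketch below runs value iteration
with in-place updates on a four-server toy network. The modelling
choices here are assumptions for illustration only: states are servers,
an action forwards the query to a neighbour, the reward is the estimated
relevance of the server reached, and the discount factor is 0.5. The
single-process in-place sweep merely stands in for truly asynchronous
updates, in which each server would refresh its own value on its own
schedule using whatever neighbour values it last received.

```python
def async_value_iteration(neighbors, reward, gamma=0.5, tol=1e-9,
                          max_sweeps=1000):
    """Value iteration with in-place updates over a peer graph."""
    value = {s: 0.0 for s in neighbors}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in neighbors:
            # Use the latest neighbour values, including ones updated
            # earlier in this same sweep.
            new = max(reward[n] + gamma * value[n] for n in neighbors[s])
            delta = max(delta, abs(new - value[s]))
            value[s] = new
        if delta < tol:
            break
    return value

def greedy_policy(neighbors, reward, value, gamma=0.5):
    """For each server, the neighbour to which a query is forwarded."""
    return {
        s: max(neighbors[s], key=lambda n: reward[n] + gamma * value[n])
        for s in neighbors
    }

# Hypothetical topology and per-server relevance estimates for one query.
neighbors = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}
reward = {"A": 0.1, "B": 0.2, "C": 0.5, "D": 0.9}

value = async_value_iteration(neighbors, reward)
policy = greedy_policy(neighbors, reward, value)
print(policy)
```

Each server needs only its neighbours' rewards and current values to
update itself, which is what lets a policy reflecting network-wide
relevance emerge without a central coordinator.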

Date:				Thursday, 15 November 2001

Time:				3:00 p.m. - 5:00 p.m.

Venue:				Room 4480
				Lifts 25-26

Committee Members:		Prof. Dik-Lun Lee (Supervisor)
				Dr. Dimitris Papadias (Chairman)
				Prof. Frederick Lochovsky
				Prof. Hongjun Lu
				Dr. Wilfred Ng


**** ALL are Welcome ****