Efficient Keyword Search in Archival Collections
Speaker: Dr. Torsten SUEL
         Principal Research Scientist at Yahoo and
         Associate Professor
         Department of Computer and Information Science
         Polytechnic University, Brooklyn, NY

Title:   "Efficient Keyword Search in Archival Collections"

Date:    Monday, 28 April 2008

Time:    4:00pm - 5:00pm

Venue:   Lecture Theatre F (Leung Yat Sing Lecture Theatre, near lift nos. 25/26)
         HKUST

Abstract:

Current web search engines focus on searching only the most recent snapshot of the web. In many cases, however, it would be desirable to search over collections that include many different crawls and thus many different versions of each document. Important examples are the Internet Archive, which has collected multiple snapshots of the web since 1995, Wikipedia, which keeps track of all versions of each article, or versioning file systems and revision control systems. Since the sizes of such archival collections are often much larger than the latest snapshot, this presents us with significant performance challenges. Current search engines use many techniques for index compression and optimized query execution, but these techniques do not exploit the significant similarities between different versions of a document, or between related documents.

In this talk, we discuss challenges and research issues in searching and mining archival text collections. We then propose a framework for indexing and query processing in archival collections and, more generally, any collections with a sufficient amount of similarity between documents or versions. This approach results in significant reductions in index size and query processing costs on such collections, and it is orthogonal to and can be combined with existing techniques. It also supports highly efficient updates, both locally and over a network. We present experimental results based on general web crawls and Wikipedia data.
[This is joint work with Jiangong Zhang]

************************

Biography:

Torsten SUEL is a Principal Research Scientist at Yahoo! Research, and an Associate Professor in the Department of Computer and Information Science at Polytechnic University in Brooklyn, NY. He received a Diplom degree from the Technical University of Braunschweig (Germany), and a Ph.D. from the University of Texas at Austin. After postdoctoral research at the NEC Research Institute, UC Berkeley, and Bell Labs, he joined Polytechnic University in the Fall of 1998. His main research interests are in the areas of web search engines and web data mining, algorithms, databases, and distributed systems.