More about HKUST
Theory and Practice: Similarity Measurements on Large-Scale Graphs
PhD Thesis Proposal Defence Title: "Theory and Practice: Similarity Measurements on Large-Scale Graphs" by Miss Dandan LIN Abstract: Measuring node-to-node similarity in a graph is a fundamental tool for various graph-based mining applications such as image/information retrieval, friend/item recommendation, link prediction, text mining and community detection. Thus, studying similarity measures has been recognized as an important research problem in the data mining community. In this thesis proposal, we focus on two representative similarity measurements in the literature: Random Walk with Restart (RWR) and Manifold Ranking (MR), both of which provide good metrics for measuring the proximity of two nodes in the graph by considering the local structure and the global structure in the graph. Specifically, given a graph G and a pair of nodes s, t in G, RWR (resp. MR) computes a "similarity score" of t with respect to (w.r.t) s, which is called the RWR (resp. MR) score. The larger the RWR (resp. MR) score, the more similar to s node t is. However, for both measurements, existing best-known algorithms for computing RWR/MR scores have one of the following issues: (1) some of them require to build a bulky index, (2) some of them do not have any theoretical bound on the output, and (3) some of them are very time-consuming, making both of them difficult to be used in real world applications where the graphs are always in a large scale with millions of nodes and billions of edges. In this thesis proposal, we firstly propose an index-free algorithm called Residue-Accumulated approach (ResAcc) which returns RWR scores with a theoretical guarantee efficiently. Our experimental evaluations on large-scale real graphs show that ResAcc is up to 4 times faster than the best-known previous algorithm, guaranteeing the same accuracy. Under typical settings, the bestknown algorithm ran around 1000 seconds on a large dataset containing 41.7 million nodes, which is too time-consuming, while ResAcc finished in 275 seconds with the same accuracy. Moreover, ResAcc is up to 6 orders of magnitude more accurate than the best-known algorithm in practice with the same execution time, which is considered as a substantial improvement. Secondly, we study the top-k image retrieval (which returns the k images from the image datasets which are the most similar to a given query image) with manifold ranking (MR), which could give more accurate results than existing deep learning-based approaches. More importantly, compared with RWR, MR achieves higher precision on the top-k results since MR treats each edge in the graph with different weights while RWR treats equally. Besides, we propose 2 algorithms, namely Monte Carlo-based MR (MCMR) and MCMR+, for top-k image retrieval. We are the first one to propose an index-free manifold ranking image retrieval with the output theoretical bound. More importantly, our algorithms give the first best-known time complexity result of O(n log n) where n is the total number of images in the database compared with the existing best-known result of O(n^2) in the literature of computing the exact top-k results with quality guarantee. Lastly, our experimental results show that MCMR+ outperforms existing algorithms by up to 4 orders of magnitude in terms of query time. Date: Tuesday, 26 May 2020 Time: 3:00pm - 5:00pm Zoom Meeting: https://hkust.zoom.us/j/99399188287 Committee Members: Dr. Raymond Wong (Supervisor) Dr. Wilfred Ng (Chairperson) Prof. Dik-Lun Lee Prof. Dit-Yan Yeung **** ALL are Welcome ****