Theory and Practice: Similarity Measurements on Large-Scale Graphs

PhD Thesis Proposal Defence


Title: "Theory and Practice: Similarity Measurements on Large-Scale Graphs"

by

Miss Dandan LIN


Abstract:

Measuring node-to-node similarity in a graph is a fundamental tool for 
various graph-based mining applications such as image/information 
retrieval, friend/item recommendation, link prediction, text mining and 
community detection. Thus, studying similarity measures has been 
recognized as an important research problem in the data mining community. 
In this thesis proposal, we focus on two representative similarity 
measurements in the literature: Random Walk with Restart (RWR) and 
Manifold Ranking (MR), both of which provide good metrics for measuring 
the proximity of two nodes in the graph by considering the local structure 
and the global structure in the graph. Specifically, given a graph G and a 
pair of nodes s, t in G, RWR (resp. MR) computes a "similarity score" of 
t with respect to (w.r.t) s, which is called the RWR (resp. MR) score. The 
larger the RWR (resp. MR) score, the more similar to s node t is. However, 
for both measurements, existing best-known algorithms for computing RWR/MR 
scores have one of the following issues: (1) some of them require to build 
a bulky index, (2) some of them do not have any theoretical bound on the 
output, and (3) some of them are very time-consuming, making both of them 
difficult to be used in real world applications where the graphs are 
always in a large scale with millions of nodes and billions of edges.

In this thesis proposal, we firstly propose an index-free algorithm called 
Residue-Accumulated approach (ResAcc) which returns RWR scores with a 
theoretical guarantee efficiently. Our experimental evaluations on 
large-scale real graphs show that ResAcc is up to 4 times faster than the 
best-known previous algorithm, guaranteeing the same accuracy. Under 
typical settings, the bestknown algorithm ran around 1000 seconds on a 
large dataset containing 41.7 million nodes, which is too time-consuming, 
while ResAcc finished in 275 seconds with the same accuracy. Moreover, 
ResAcc is up to 6 orders of magnitude more accurate than the best-known 
algorithm in practice with the same execution time, which is considered as 
a substantial improvement.

Secondly, we study the top-k image retrieval (which returns the k images 
from the image datasets which are the most similar to a given query image) 
with manifold ranking (MR), which could give more accurate results than 
existing deep learning-based approaches. More importantly, compared with 
RWR, MR achieves higher precision on the top-k results since MR treats 
each edge in the graph with different weights while RWR treats equally. 
Besides, we propose 2 algorithms, namely Monte Carlo-based MR (MCMR) and 
MCMR+, for top-k image retrieval. We are the first one to propose an 
index-free manifold ranking image retrieval with the output theoretical 
bound. More importantly, our algorithms give the first best-known time 
complexity result of O(n log n) where n is the total number of images in 
the database compared with the existing best-known result of O(n^2) in the 
literature of computing the exact top-k results with quality guarantee. 
Lastly, our experimental results show that MCMR+ outperforms existing 
algorithms by up to 4 orders of magnitude in terms of query time.


Date:                   Tuesday, 26 May 2020

Time:                   3:00pm - 5:00pm

Zoom Meeting:           https://hkust.zoom.us/j/99399188287

Committee Members:      Dr. Raymond Wong (Supervisor)
                        Dr. Wilfred Ng (Chairperson)
                        Prof. Dik-Lun Lee
                        Prof. Dit-Yan Yeung


**** ALL are Welcome ****