SOME RESEARCH ISSUES IN HASH FUNCTION LEARNING

PhD Thesis Proposal Defence


Title: "SOME RESEARCH ISSUES IN HASH FUNCTION LEARNING"

by

Mr. Yi ZHEN


ABSTRACT:

Over the past decade, hashing-based methods for large-scale similarity
search have sparked considerable research interest in the database,
data mining and information retrieval communities. These methods
achieve very fast search speed by indexing data with binary codes.
Although many hash functions for various similarity metrics have
been proposed, they are argued to require very long codes because
they are data-independent. In recent years, machine learning
techniques have been applied to learn hash functions from data,
forming a new research topic called hash function learning. In this
proposal, we study two important issues in hash function learning. On
one hand, existing supervised or semi-supervised hash function
learning methods, which learn hash functions from labeled data, can be
regarded as passive because they assume that the labeled data are
provided in advance. Given that data labeling can be very costly in
practice and that different labeled points can contribute quite
differently to hash function learning, it may be more cost-effective
for a hash function learning method to select the data to be labeled
and learned from. To this end, we propose a novel framework,
termed active hashing, to actively select the most informative data to
label for hash function learning. Under this framework, we develop a
simple method that queries the labels of the data points about which
the current hash functions are most uncertain. Experiments conducted
on two real data sets show a clear improvement of our active hashing
algorithm over previous passive hashing methods. On the other hand,
most existing hash function learning methods work only on unimodal
data, whereas the data in many applications, e.g., multimedia
retrieval and cross-lingual document analysis, are multimodal. To
apply hash function learning to multimodal data, we develop three
methods under
the framework of multimodal hashing, which hashes data points of
multiple modalities into one common Hamming space. For paired data,
the first method is based on spectral analysis of multimodal data
correlations. For general data, we propose one non-probabilistic
model which uses normalized Hamming distance to approximate the
distance in original input space, and one probabilistic model that can
generate intra-modal and inter-modal similarities based on hash codes.
The effectiveness of our models is validated through a preliminary
comparative study. The proposal will also discuss some ongoing
research issues currently under investigation and set up a timetable
for the thesis.


Date:                   Friday, 16 December 2011

Time:                   2:00pm - 4:00pm

Venue:                  Room 3304
                         lifts 17/18

Committee Members:      Prof. Dit-Yan Yeung (Supervisor)
                         Prof. Qiang Yang (Chairperson)
                         Prof. James Kwok
                         Prof. Nevin Zhang


**** ALL are Welcome ****